Open-Source AI Agent Regression Gateway
The header announces a bold mission: EvalView, the open-source regression gate for AI agents. A distinctive logo (assets/logo.png) anchors the page beside the declaration "The open-source regression gate for AI agents," and the tagline nudges the imagination: think Playwright, but for tool-calling and multi-turn AI agents. A row of badges signals the project's status and community: the PyPI release badge, its download count, a GitHub stars badge, CI status, the Apache 2.0 license, and a contributors note. Together, these elements present a mature project that invites collaboration and practical use.
The core proposition of EvalView is both simple and transformative. An agent can appear to be functioning correctly, returning a 200 status, while silently drifting into behavior that is undesirable, wrong, or unsafe. A model update, a provider change, a shift in tool availability, or a refinement of a policy may alter tool choice or output quality without breaking health checks. EvalView's promise is to catch those silent regressions before users notice them. Traditional health checks confirm that a system is running; EvalView confirms that the system continues to behave correctly under evolving conditions. It is designed to track drift across multiple dimensions: outputs, tools, model identifiers, and runtime fingerprints. The aim is clear: distinguish a provider change from a genuine systemic regression, enabling teams to respond with confidence rather than guesswork.
A quick tour of the key features reveals the practical and user-centric design of EvalView. The system monitors drift not only in what is output, but in the identity of the models themselves, and in the fingerprints that emerge from runtime traces. When a drift is detected, EvalView surfaces a run-level signal that classifies the drift as either declared or suspected, assigns a confidence level, and presents evidence drawn from fingerprints, retries, and the set of affected tests. If the new behavior is deemed correct, users can rebaseline by re-running a snapshot to accept the updated baseline. The emphasis is not merely on detection but on actionable remediation and ongoing stabilization.
A 30-second live demo is highlighted to illustrate the workflow in action. The demo video, referenced as demo.mp4, offers a concise, real-time glimpse into how the system identifies drift and reasons about regressions. If the video cannot render in a GitHub app, EvalView provides a GIF preview, demo.gif, that conveys the same motion and outcomes. Together, the video and the GIF offer a tangible sense of the speed and clarity with which EvalView operates, transforming abstract regression concepts into observable evidence.
What makes EvalView stand out is its holistic approach to regression testing for AI agents. It doesn’t stop at detection; it provides a framework to classify, inspect, and heal regressions. It helps you distinguish provider or model drift from genuine system regressions and introduces automation to repair flaky failures. With features like retries, review gates, and audit logs, the platform supports a robust pipeline for stable, repeatable testing across evolving AI environments.
To get started, EvalView emphasizes a concise command sequence that lets any AI developer implement regression testing quickly and deterministically. The Quick Start begins with installation: pip install evalview. Three commands then do the work. First, evalview init automatically detects the agent type (chat, tool use, multi-step reasoning, Retrieval-Augmented Generation (RAG), or coding) and configures the appropriate evaluators, thresholds, and assertions. Next, evalview snapshot saves the current behavior as the baseline, and finally evalview check performs regression testing against that baseline. This compact flow of install, detect, snapshot, and check offers a practical path to regression-gate coverage that scales with the complexity of modern AI agents.
EvalView provides several additional installation options for varied workflows. A curl-based script can install EvalView in environments where a standard package manager is unavailable, while evalview demo makes it possible to see regression detection live without requiring API keys. For teams integrating EvalView into real-world pipelines, there is guidance to clone and run a template project, illustrating how EvalView can become part of a broader testing and automation ecosystem.
The project positions itself in a competitive landscape by contrasting itself with observability and scoring tools. A comparative guide presents LangSmith, Braintrust, and Promptfoo alongside EvalView, outlining distinct purposes: LangSmith emphasizes observability and tracing, Braintrust focuses on scoring, and Promptfoo helps with prompt comparison. EvalView stands out by offering regression detection, automatic drift classification, and auto-healing capabilities—features designed to catch silent regressions and to stabilize behavior across tool calls, sequences, and model changes. The narrative implies a practical, developer-focused stance: you can rely on EvalView to monitor, diagnose, and correct behavior as your AI stack evolves, without sacrificing autonomy or requiring constant manual intervention.
EvalView’s scope unfolds through a set of concrete categories that describe what it catches, how it behaves, and what it costs to operate. At the “What It Catches” level, the system distinguishes:
- PASSED: Behavior matches the baseline, enabling shipping with confidence.
- TOOLS_CHANGED: Different tools are invoked, signaling a potential reconfiguration or policy change that warrants review.
- OUTPUT_CHANGED: The same tools are used, but the output has shifted, calling for a diff exploration.
- REGRESSION: The overall score drops significantly, prompting a fix before shipping.
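The four statuses above can be illustrated with a small sketch. The function below is a conceptual model of the decision, not EvalView's actual implementation; the check ordering and the threshold value are assumptions.

```python
def classify_run(baseline_tools, current_tools, baseline_output,
                 current_output, score, regression_threshold=70):
    """Classify a test run against its baseline (illustrative only).

    Assumed precedence: a significant score drop outranks tool or
    output drift; the threshold of 70 is a placeholder.
    """
    if score < regression_threshold:
        return "REGRESSION"        # overall score dropped significantly
    if baseline_tools != current_tools:
        return "TOOLS_CHANGED"     # different tools (or order) invoked
    if baseline_output != current_output:
        return "OUTPUT_CHANGED"    # same tools, but the output shifted
    return "PASSED"                # behavior matches the baseline
```

Even in this reduced form, the ordering matters: a run can exhibit both tool drift and a score drop, and the gate has to pick the more actionable label.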
The platform enriches these signals with a nuanced understanding of model and runtime changes. It distinguishes between a declared model change (an adapter-reported model shift) and a runtime fingerprint change (observed labels or traces changing even if the named model remains the same). It also detects coordinated drift, where multiple tests shift together—an indicator of a provider rollout or a runtime change that affects the entire system. When such a drift is detected, EvalView surfaces a detailed classification, confidence, and evidence, enabling teams to decide whether to retry, rebaseline, or escalate for human review.
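A hedged sketch of how such a drift classification could work in principle; the field names, confidence labels, and the coordination threshold below are assumptions, not EvalView's internals.

```python
def classify_drift(declared_model_changed, fingerprint_changed,
                   drifted_tests, total_tests, coordination_threshold=0.5):
    """Illustrative drift classification.

    'declared' = the adapter reports a model change;
    'suspected' = only runtime fingerprints/traces shifted.
    Drift is 'coordinated' when many tests shift together, hinting
    at a provider rollout rather than isolated flakiness.
    """
    if declared_model_changed:
        kind, confidence = "declared", "high"
    elif fingerprint_changed:
        kind, confidence = "suspected", "medium"
    else:
        return None  # no drift signal at all
    coordinated = (drifted_tests / total_tests) >= coordination_threshold
    return {"kind": kind, "confidence": confidence,
            "coordinated": coordinated, "affected_tests": drifted_tests}
```

The key design idea is that the same evidence (fingerprints, retries, affected-test counts) feeds both the classification and the confidence label, so a reviewer can audit why the gate reached its verdict.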
To measure change comprehensively, EvalView uses four scoring layers, with the first two being free and offline:
- Tool calls and sequence: Checks the exact tool names, the order, and the parameters.
- Code-based checks: Regex, JSON schema validation, and contains/not_contains checks.
- Semantic similarity: Interprets the intended meaning of outputs using embeddings.
- LLM-as-judge: Uses large language models to assess output quality, kinds of errors, and overall fit.
A sample scoring breakdown illustrates how metrics translate into actionable insights. For example, tools might score 100/100 with a 30% weight (contributing 30 points), outputs 42/100 with a 50% weight (21 points), and a minor sequence issue might account for the remaining few points, yielding an overall score of 54/100. This transparent scoring model empowers teams to interpret results and focus on the most impactful regressions.
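The arithmetic behind such a breakdown can be sketched as a weighted sum. The sequence score of 15 below is an assumed value chosen so the total matches the reported 54/100; the layer names and weights mirror the sample, not a documented formula.

```python
def overall_score(layer_scores, weights):
    """Combine per-layer scores (0-100) using weights that sum to 1.0."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return round(sum(layer_scores[name] * w for name, w in weights.items()))

# Reproducing the sample: tools 100 at 30% weight (30 pts), output 42 at
# 50% weight (21 pts), plus an assumed sequence score of 15 at 20% weight
# (3 pts), for an overall 54/100.
score = overall_score(
    {"tools": 100, "output": 42, "sequence": 15},
    {"tools": 0.30, "output": 0.50, "sequence": 0.20},
)
```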
EvalView supports mature CI/CD integration, including a ready-to-use GitHub Actions block that runs in pull requests and push events. A compact YAML snippet demonstrates how to wire an action that checks a running agent against a configured OpenAI API key, producing PR-level artifacts and a succinct status summary. The idea is to bring regression gating into the code review flow so that regressions are surfaced early, with artifacts and job summaries automatically attached to the PR.
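A hedged reconstruction of what such a workflow could look like, using only the CLI commands the project documents (pip install evalview, evalview check). The job layout, action versions, and secret name are assumptions, not the project's published snippet.

```yaml
# Illustrative GitHub Actions workflow for an EvalView regression gate.
name: evalview-gate
on: [pull_request, push]

jobs:
  regression-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalview
      - run: evalview check          # compare the agent against its baseline
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Because the gate runs as an ordinary job step, a failing check blocks the pull request the same way a failing unit test would, which is exactly the point of moving regression gating into code review.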
Practical messaging around outcomes appears in the interface as well. A sample report shows “EvalView: PASSED” with a table of metrics such as Tests and unchanged counts, accompanied by a note that the results were generated by EvalView. When regressions occur, the system surfaces a structured set of alerts and changes, including cost spikes, model changes, and tool modifications. The narrative suggests that the system can present a structured “Changes from Baseline” section that highlights which tests changed, how scores shifted, and which tools were altered. The intent is to provide precise, codified guidance for remediation and rebaseline.
The platform also emphasizes operational convenience through “Watch Mode,” which keeps a regression gate active as developers work. With watch mode, a file save triggers a regression check, offering options for quick checks that bypass the heavy LLM judge for sub-second feedback. The interface presents a synthesized dashboard that aggregates the current scorecard, tool changes, and regressions, enabling engineers to stay in flow while monitoring health in real time.
Multi-turn testing is a highlighted strength. EvalView recognizes that many AI agents must navigate a conversation with clarifications, follow-up questions, and tool use across turns. A YAML example illustrates a multi-turn test scenario that asks for a refund, checks for an order number, and validates a path through a sequence of tools such as lookuporder and checkpolicy, ensuring that a desired end state—refund processing—occurs without including forbidden steps like deleting an order. Each turn can carry its own expected outputs, required tools, forbidden tools, and per-turn thresholds, so that the evaluation is not a single monolithic outcome but a per-turn scoreboard that respects context and sequence.
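The scenario described above can be sketched in YAML roughly as follows. Only the tool names and the refund scenario come from the source; the field names and the sample order number are illustrative assumptions, not EvalView's documented schema.

```yaml
# Illustrative multi-turn test sketch (schema assumed).
name: refund-flow
turns:
  - user: "I want a refund for my last purchase"
    expect_output_contains: "order number"   # agent should ask for it
  - user: "It's order 12345"
    required_tools: [lookuporder, checkpolicy]
    forbidden_tools: [deleteorder]           # destructive step must not run
    expect_output_contains: "refund"
    min_score: 80                            # per-turn threshold
```

Each turn carries its own expectations, so a failure pinpoints the exact step where the conversation went off the rails rather than just flagging the final answer.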
Assertion Wizard and Auto-Variant Discovery address real-world determinism challenges. The Assertion Wizard analyzes real traffic captures to suggest pre-configured tests, eliminating the need for YAML-centric test authoring. It can propose assertions such as locking a particular tool sequence or requiring specific tools, while also recommending latency bounds and minimum quality scores. Auto-Variant Discovery tackles non-determinism by running tests multiple times, clustering the resulting paths, and saving valid variants as golden baselines. When a non-deterministic path yields acceptable scores, EvalView can save it as a variant to stabilize the test suite and reduce flakiness.
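The clustering idea behind Auto-Variant Discovery can be sketched as grouping repeated runs by their exact tool path and keeping the paths that score acceptably. This is a conceptual model, not EvalView's algorithm; the acceptance rule and threshold are assumptions.

```python
from collections import defaultdict

def discover_variants(runs, min_score=70):
    """Cluster runs by tool path; paths whose best score clears the bar
    become accepted 'golden' variants (illustrative only)."""
    clusters = defaultdict(list)
    for run in runs:
        clusters[tuple(run["tools"])].append(run["score"])
    # A path is saved as a variant when its runs score acceptably.
    return {path: max(scores) for path, scores in clusters.items()
            if max(scores) >= min_score}

runs = [
    {"tools": ["search", "answer"], "score": 92},
    {"tools": ["search", "rerank", "answer"], "score": 88},
    {"tools": ["search", "answer"], "score": 90},
    {"tools": ["guess"], "score": 40},
]
variants = discover_variants(runs)
```

The effect is that a non-deterministic agent with two legitimate paths stops flaking: both paths are baselined, while the low-scoring stray path is still rejected.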
Another powerful capability is Auto-Heal. When model updates or drift cause harmless output changes, EvalView can retry failed paths, propose variants, and escalate more significant structural changes when required. The system documents how retries were triggered by model or runtime updates and maintains an audit trail that helps teams understand why a fix was accepted. A Budget Circuit Breaker prevents runaway costs: a mid-execution budget is monitored, and when limits are breached the system can skip tests to stay under budget. Smart eval profiles further tailor evaluators to the agent type, with five profiles (chat, tool-use, multi-step, rag, and coding) so the gating logic matches the actual workload. Override capabilities allow teams to fine-tune sensitivity when needed.
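The circuit-breaker concept can be sketched as a running cost tally that trips mid-execution. This is a conceptual model; the cost field, the skip behavior, and the threshold semantics are assumptions.

```python
class BudgetCircuitBreaker:
    """Illustrative mid-run budget guard: once cumulative cost crosses
    the limit, remaining tests are skipped rather than executed."""

    def __init__(self, budget_usd):
        self.budget_usd = budget_usd
        self.spent = 0.0

    def run(self, tests):
        results = []
        for test in tests:
            if self.spent >= self.budget_usd:
                results.append((test["name"], "SKIPPED"))  # breaker tripped
                continue
            self.spent += test["est_cost"]   # cost tracked mid-execution
            results.append((test["name"], "RAN"))
        return results

breaker = BudgetCircuitBreaker(budget_usd=0.05)
outcome = breaker.run([
    {"name": "t1", "est_cost": 0.02},
    {"name": "t2", "est_cost": 0.04},
    {"name": "t3", "est_cost": 0.02},
])
```

Note that the breaker checks the budget before each test rather than after, so one test may overshoot the limit slightly, but the overrun is bounded by a single test's cost.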
EvalView’s supported frameworks span a wide range, including LangGraph, CrewAI, OpenAI, Claude, Mistral, HuggingFace, Ollama, MCP, and any HTTP API. This broad compatibility signals its adaptability to diverse AI stacks, whether running locally, in the cloud, or across hybrid environments. The documentation points to a rich ecosystem of starter templates, framework references, and integration guides to accelerate adoption.
The “How It Works” section provides an easy-to-visualize flow: a simple loop where test cases (written in YAML or generated from traffic) feed into EvalView, which replays the tests against your agent running either locally or in the cloud, and then compares results to baselines to generate an HTML report. The snapshot workflow—creating baselines, revalidating on changes, and replaying tests—forms the backbone of stability in evolving AI systems. The approach is designed to keep data local by default, with cloud sync as an opt-in feature, preserving privacy and control while enabling collaboration.
Production monitoring extends EvalView beyond the local development cycle. The monitoring mode runs at regular intervals, offering a live terminal dashboard and Slack alerts when new regressions arise or recoveries occur. JSONL history supports dashboards and post-mortems, transforming regression testing into a continuous, observable process rather than a one-off check.
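The JSONL history format lends itself to a simple append-and-replay pattern: one JSON object per line, appended per run, trivially scanned for dashboards or post-mortems. The record shape below (timestamp, status, score) is illustrative, not EvalView's documented schema.

```python
import io
import json

def append_run(stream, record):
    """Append one run record as a single JSON line."""
    stream.write(json.dumps(record) + "\n")

def load_history(stream):
    """Replay the full history, one parsed record per non-empty line."""
    stream.seek(0)
    return [json.loads(line) for line in stream if line.strip()]

# Using an in-memory stream here; in practice this would be a file on disk.
history = io.StringIO()
append_run(history, {"ts": "2025-01-01T00:00:00Z", "status": "PASSED", "score": 91})
append_run(history, {"ts": "2025-01-01T00:05:00Z", "status": "REGRESSION", "score": 54})
runs = load_history(history)
```

Append-only JSONL is attractive for monitoring precisely because writes never rewrite earlier records, so a crashed run can corrupt at most its own last line.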
Key features are enumerated with a clear focus on usefulness and integration ease. Multi-turn capture enables the system to record conversations as tests, and semantic similarity adds a layer of meaning-based comparison to traditional string matching. Production monitoring, cost tracking, and a Pytest plugin for standard test harnesses demonstrate an emphasis on integrating EvalView into existing workflows. A terminal dashboard provides a quick, at-a-glance view of scorecard status, trend lines, and confidence indicators, enabling rapid triage during development cycles.
For developers looking to embed EvalView inside code, the Python API offers gate and gate_async functions that return a structured result object with a status and a summary of diffs. Quick mode provides deterministic checks without the heavy judge, suitable for ultra-fast feedback in high-velocity environments. An OpenClaw integration enables autonomous loops and self-healing behaviors within an agent's lifecycle: the OpenClaw workflow can be invoked with simple commands to install, check, and autonomously revert if regressions are detected, demonstrating a hands-off approach to maintaining stability in production agents.
The Pytest plugin and Claude Code (MCP) pathways illustrate the project’s breadth of integration. A Pytest plugin enables regression checks alongside existing tests, with a straightforward workflow for test authors. Claude Code integration enables a conversational, proactive regression-checking workflow within Claude’s ecosystem, making it possible for language model agents to ask whether a refactor affected anything and to receive inline checks.
Beyond tooling, EvalView emphasizes ergonomics for engineers and collaboration for teams. Agent-friendly docs provide a broad collection of resources designed to accelerate adoption and collaboration across roles. Architecture maps, contracts, invariants, and verification commands live in agent-oriented documents; task-specific playbooks help extend the evaluation framework to new adapters and evaluators; and explicit guides show how to extend HTML reports or integrate with Ollama. The inclusion of agent instructions, recipes for common extensions, and debugging guidance makes the platform approachable to practitioners who must ship reliable AI features.
Documentation is comprehensive and organized to cover Getting Started, Core Features, Integrations, and more. The site links to Getting Started, Golden Traces, CI/CD, CLI references, Evaluation Metrics, MCP Contracts, Agent Instructions, Agent Recipes, and various other documentation bundles. The ecosystem invites contributions, with channels for bug reporting, feature requests, discussions, setup help, and ongoing support. The Apache 2.0 license underlines the project’s open-source ethos and commitment to broad usage and collaboration.
The project’s narrative also hints at a thriving community with visibility into development momentum. A star history diagram (as indicated by the inclusion of a Star History chart) signals ongoing interest and adoption. The visual and textual assets together portray EvalView as a mature, multi-faceted platform that integrates testing, observability, and automation into a cohesive regression-gating workflow for AI agents.
From a user experience perspective, EvalView presents a coherent story: you run a tool that can automatically detect the kind of agent you’re testing, configure a tailored set of evaluators and thresholds, and then begin an ongoing cycle of baseline creation, regression checking, and rapid healing when things drift. The system uses a per-turn evaluation approach for multi-turn agents, meaning you don’t just verify a final answer; you verify the chain of reasoning, tool use, and intermediate steps that lead to that answer. This per-turn scrutiny is crucial for ensuring that complex AI workflows remain robust under changes in tools, models, and runtime environments.
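Per-turn gating can be sketched as scoring each turn against its own threshold and forbidden-tool list, with the conversation passing only if every turn clears its bar. The field names and default threshold here are assumptions for illustration.

```python
def evaluate_turns(turns):
    """Illustrative per-turn gate: each turn is judged on its own score
    threshold and forbidden-tool constraint; the conversation passes
    only if every turn passes."""
    board = []
    for turn in turns:
        used_forbidden = set(turn["tools"]) & set(turn.get("forbidden", []))
        passed = (turn["score"] >= turn.get("threshold", 70)
                  and not used_forbidden)
        board.append({"turn": turn["id"], "passed": passed})
    return board, all(entry["passed"] for entry in board)

board, ok = evaluate_turns([
    {"id": 1, "score": 95, "tools": ["lookuporder"]},
    {"id": 2, "score": 82, "tools": ["checkpolicy"],
     "forbidden": ["deleteorder"]},
])
```

The payoff of the per-turn scoreboard is diagnostic precision: a failure at turn 2 with turn 1 passing tells you the clarification step still works and the regression lies in the later tool use.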
In practice, EvalView is designed to fit into real product development environments. It can be invoked as a CLI, used within Python workflows, or integrated through a variety of frameworks. It supports multi-turn capture to create representative test suites that mirror actual user flows, with automatic seed generation from real traffic, enabling a regression gate that grows with the product rather than becoming a bottleneck. The tool’s emphasis on auto-heal, auto-variant discovery, and budget-aware testing reflects a pragmatic stance toward reliability: you can continue delivering features while maintaining discipline and traceability around regressions.
From the perspective of visual assets provided in the input, the project’s branding and media give more than cosmetic value. The logo and the demo media (demo.mp4 and demo.gif) serve as tangible touchpoints for users who want to understand the product at a glance. The video and GIF offer a compact, intuitive demonstration of what regression gating looks like in action, complementing the textual descriptions with dynamic examples of checks, alerts, and the kinds of diffs that appear in a real regression run.
In sum, EvalView presents a robust, end-to-end approach to regression testing for AI agents. It acknowledges that the AI landscape is dynamic, with provider shifts, model evolutions, and tool changes that can alter behavior in subtle ways. Rather than rely on static tests alone, EvalView empowers teams to observe, classify, and heal drift in a structured, auditable, and automated manner. It fuses test execution, baseline management, similarity analysis, and multi-turn evaluation into a single framework that is designed to grow with the agents it protects. It is a community-driven, open-source solution that offers practical tooling—an assertion wizard, auto-variant discovery, auto-heal capabilities, a budget-conscious workflow, and smart-eval profiles—while remaining adaptable to a broad spectrum of frameworks and deployment models.
If you are developing AI agents that must interact with tools, run multi-turn conversations, or perform complex reasoning with external services, EvalView positions itself as a capable partner in your engineering stack. It promises not only to detect regressions but to illuminate their nature, quantify their impact, and guide you toward fast, responsible recovery. The combination of live demonstrations, detailed feature sets, cross-framework compatibility, and developer-oriented documentation makes EvalView a compelling candidate for teams seeking to tame the complexity of evolving AI systems while delivering reliable, maintainable products.
The imagery in the input—the EvalView logo, the demo video, and the supporting GIF—serves to anchor this narrative in concrete visuals. They invite you to see a project that is not merely theoretical but actively used by developers who want to ship better AI agents with fewer surprises. The ethos is one of openness, rigor, and practical utility: a regression gate that respects the realities of modern AI development and provides the tools needed to keep behavior correct as the ecosystem continues to evolve.
Repository: https://github.com/hidai25/eval-view