Your AI Agent Passed Tests. It Failed in Production.


July 3, 2026 • 5 min read • By Daily Miranda Pardo

Your CI pipeline is green. The deploy goes out. Four days later you get a message from a client: "the agent stopped extracting VAT correctly."

No visible error. No exception in the logs. The model just started behaving differently in production after you updated to a newer version — or changed three lines of your system prompt.

That is the problem evals solve. And most teams building AI agents in 2026 still don't have them.

Why Normal Tests Don't Work for LLMs

With deterministic code, the contract is clear: parseDate("2026-07-03") always returns the same structure. You write the assertion, it passes or it doesn't.

With an LLM the contract is probabilistic: the same input yields outputs sampled from a distribution, not one fixed value. Change the model, the system prompt, or the available tools, and that distribution shifts. The shift rarely shows up in the happy-path case you always test; it shows up in the edge cases your client hits every week.

A classic test verifies that code does what you say. An eval verifies that the agent's behavior satisfies the criteria you define. They operate at two different levels.
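To make the difference concrete, here is a minimal sketch that scores one input over several sampled runs instead of asserting on a single output. The `passRate` helper, the sample outputs, and the VAT criterion are illustrative assumptions, not part of any library; collecting the outputs from your agent is left to the caller.

```typescript
// Hypothetical helper: fraction of sampled outputs that satisfy a criterion.
// `outputs` would come from calling the agent N times with the same input.
function passRate(
  outputs: string[],
  criterion: (output: string) => boolean
): number {
  if (outputs.length === 0) {
    throw new Error("passRate needs at least one sampled output");
  }
  const passed = outputs.filter(criterion).length;
  return passed / outputs.length;
}

// Illustrative samples for one input (made-up data, not real agent output).
const sampledOutputs: string[] = [
  '{"amount": 500, "vatRate": 21}',
  '{"amount": 500, "vatRate": 21}',
  '{"amount": 500, "vatRate": 0.21}', // same intent, different format
  '{"amount": 500, "vatRate": 21}',
  '{"amount": 500, "vatRate": 21}',
];

// The criterion checks behavior (the VAT rate is 21), not exact text.
const vatIs21 = (output: string): boolean => {
  try {
    return JSON.parse(output).vatRate === 21;
  } catch {
    return false; // unparseable output counts as a failure
  }
};

const rate = passRate(sampledOutputs, vatIs21);
console.log(`pass rate: ${rate}`);
```

A single assertion would either miss the third sample or fail on it with no context. The pass rate shows how often the behavior drifts, which is the number you compare before and after a change.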

The Four Most Common Silent Regression Scenarios

Across the AI agent integration projects we build at DAILYMP, these are the cases that come up repeatedly:

Model upgrade: you switch from claude-sonnet-4-5 to claude-sonnet-4-6 and the agent now formats dates differently, or prefers calling a secondary tool instead of the primary one.
System prompt edit: you update an instruction to improve extraction of one field and inadvertently change how the model prioritizes another.
New tool added: the agent starts preferring it for cases where it previously used the correct tool.
Temperature or parameter change: a sampling or latency tweak (for example temperature or max_tokens) that shifts the output distribution.

None of these changes throw an exception. The code compiles. Integration tests pass. The damage is invisible until the client notices.
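Because none of these changes fail loudly, it can help to make them explicit. The sketch below assumes you keep a snapshot of the agent's configuration and diff it against the previous one; the `AgentConfig` shape and `detectBehaviorChanges` are hypothetical names chosen for illustration.

```typescript
// Assumed snapshot of everything that can shift agent behavior.
interface AgentConfig {
  model: string;
  systemPrompt: string;
  tools: string[];
  temperature: number;
}

type ChangeKind = "model" | "systemPrompt" | "tools" | "parameters";

// Returns which of the four regression scenarios apply between two snapshots.
function detectBehaviorChanges(
  prev: AgentConfig,
  next: AgentConfig
): ChangeKind[] {
  const changes: ChangeKind[] = [];
  if (prev.model !== next.model) changes.push("model");
  if (prev.systemPrompt !== next.systemPrompt) changes.push("systemPrompt");

  // Compare tool sets independent of declaration order.
  const prevTools = JSON.stringify([...prev.tools].sort());
  const nextTools = JSON.stringify([...next.tools].sort());
  if (prevTools !== nextTools) changes.push("tools");

  if (prev.temperature !== next.temperature) changes.push("parameters");
  return changes;
}

// Illustrative snapshots: a model upgrade plus one new tool.
const before: AgentConfig = {
  model: "claude-sonnet-4-5",
  systemPrompt: "You create invoices.",
  tools: ["create_invoice", "search_customer"],
  temperature: 0,
};
const after: AgentConfig = {
  ...before,
  model: "claude-sonnet-4-6",
  tools: ["search_customer", "create_invoice", "send_email"],
};

const detected = detectBehaviorChanges(before, after);
console.log(detected.join(", "));
```

Any non-empty result is a signal that the eval suite should run before the change ships, even if the diff is only a few lines.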

Three Eval Patterns for Production Agents

1. Golden Dataset with Binary Criteria

Start with 20-50 real cases pulled from production, each with success criteria rather than exact expected output:

interface EvalCase {
  id: string;
  input: string;
  criteria: {
    toolsCalled: string[];        // tools the agent must invoke
    fieldsExtracted: string[];    // fields that must appear in the output
    mustNotCall?: string[];       // tools it must not call
  };
}

const goldenDataset: EvalCase[] = [
  {
    id: "invoice-vat-extraction",
    input: "Create an invoice for €500 with 21% VAT for ACME Corp",
    criteria: {
      toolsCalled: ["create_invoice"],
      fieldsExtracted: ["amount", "vatRate", "customerId"],
      mustNotCall: ["send_email"],  // must not send email without confirmation
    }
  },
  {
    id: "ambiguous-customer-lookup",
    input: "Invoice for García",
    criteria: {
      toolsCalled: ["search_customer"],   // must search before assuming
      fieldsExtracted: ["customerId"],
      mustNotCall: ["create_invoice"],    // must not create without confirming customer
    }
  }
];

This kind of eval doesn't test the response text. It tests the agent's decisions: which tools it called, which fields it extracted, what it avoided doing. That is what actually matters in a business agent.
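A case like these can be scored with a small checker. The sketch below is one possible implementation under stated assumptions: the agent run is available as a list of tool calls (the `ToolCall` shape is hypothetical), and a field counts as extracted when it appears with a value in any tool call's arguments.

```typescript
interface EvalCase {
  id: string;
  input: string;
  criteria: {
    toolsCalled: string[];
    fieldsExtracted: string[];
    mustNotCall?: string[];
  };
}

// Assumed trace shape: one entry per tool call the agent made.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface CaseResult {
  id: string;
  pass: boolean;
  failures: string[];
}

function checkCase(evalCase: EvalCase, toolCalls: ToolCall[]): CaseResult {
  const called = new Set<string>();
  const extracted = new Set<string>();
  for (const call of toolCalls) {
    called.add(call.name);
    for (const field of Object.keys(call.args)) {
      const value = call.args[field];
      // A field counts as extracted only if it carries a value.
      if (value !== undefined && value !== null) extracted.add(field);
    }
  }

  const failures: string[] = [];
  for (const tool of evalCase.criteria.toolsCalled) {
    if (!called.has(tool)) failures.push(`missing tool call: ${tool}`);
  }
  for (const field of evalCase.criteria.fieldsExtracted) {
    if (!extracted.has(field)) failures.push(`missing field: ${field}`);
  }
  for (const tool of evalCase.criteria.mustNotCall ?? []) {
    if (called.has(tool)) failures.push(`forbidden tool call: ${tool}`);
  }
  return { id: evalCase.id, pass: failures.length === 0, failures };
}

// Illustrative run: the agent skipped customerId and sent an email.
const invoiceCase: EvalCase = {
  id: "invoice-vat-extraction",
  input: "Create an invoice for €500 with 21% VAT for ACME Corp",
  criteria: {
    toolsCalled: ["create_invoice"],
    fieldsExtracted: ["amount", "vatRate", "customerId"],
    mustNotCall: ["send_email"],
  },
};
const trace: ToolCall[] = [
  { name: "create_invoice", args: { amount: 500, vatRate: 21 } },
  { name: "send_email", args: { to: "billing@acme.example" } },
];

const result = checkCase(invoiceCase, trace);
console.log(result.pass, result.failures);
```

Returning the list of failures rather than a bare boolean is a deliberate choice: it is what lets a reviewer see why a case regressed.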

2. LLM-as-Judge for Hard-to-Quantify Outputs

When the output isn't structured — a customer support response, a summary, an explanation — you need another model to evaluate it against concrete criteria:

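import Anthropic from '@anthropic-ai/sdk';

// The client reads ANTHROPIC_API_KEY from the environment by default.
const anthropic = new Anthropic();

interface JudgeResult {
  criterion: string;
  pass: boolean;
  reason: string;
}

// Minimal assumed implementation of the extractJSON helper: takes the outermost
// {...} span, in case the judge wraps its JSON in prose or a Markdown fence.
function extractJSON(text: string): string {
  const start = text.indexOf('{');
  const end = text.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('Judge response contains no JSON object');
  }
  return text.slice(start, end + 1);
}

async function judgeAgentResponse(
  scenario: string,
  agentResponse: string,
  criteria: string[]
): Promise<{ score: number; failures: string[] }> {
  const prompt = `Evaluate the following AI agent response.

Scenario: ${scenario}
Agent response: ${agentResponse}

Criteria to evaluate:
${criteria.map((c, i) => `${i + 1}. ${c}`).join('\n')}

For each criterion, indicate whether it passes (true/false) and why.
Respond in JSON: { "results": [{ "criterion": string, "pass": boolean, "reason": string }] }`;

  const response = await anthropic.messages.create({
    model: 'claude-haiku-4-5-20251001',  // fast and cheap model for evals
    max_tokens: 512,
    messages: [{ role: 'user', content: prompt }],
  });

  // content is a union of block types; only text blocks carry `.text`.
  const block = response.content[0];
  if (!block || block.type !== 'text') {
    throw new Error('Judge returned no text block');
  }

  const parsed: { results: JudgeResult[] } = JSON.parse(extractJSON(block.text));
  const failures = parsed.results
    .filter((r) => !r.pass)
    .map((r) => r.criterion);

  return {
    score: parsed.results.filter((r) => r.pass).length / criteria.length,
    failures,
  };
}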

Estimated cost: €0.001–0.003 per judge call using Haiku. Running 50 cases costs roughly €0.05–0.15.

3. Regression CI with Quality Threshold

Integrate evals into your CI pipeline and block merges when the score drops:

# .github/workflows/agent-evals.yml
name: Agent Evals

on:
  pull_request:
    paths:
      - 'src/agents/**'
      - 'src/prompts/**'
      - 'src/tools/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Run eval suite
        run: npm run evals -- --threshold 0.90
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

The eval script exits with code 1 if the score drops below the threshold, which fails the job. If the job is configured as a required status check on the target branch, the merge is blocked. The PR author sees exactly which cases failed and why.
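The gating logic behind such a script can be very small. The following is a sketch under assumptions, not the actual script: `parseThreshold`, `evaluateGate`, and the `CaseResult` shape are hypothetical names, and the per-case results are assumed to come from your own eval runner.

```typescript
interface CaseResult {
  id: string;
  pass: boolean;
  failures: string[];
}

interface GateReport {
  score: number;
  passed: boolean;
  failedCases: CaseResult[];
}

// Reads `--threshold <value>` from an argv-style array.
function parseThreshold(argv: string[], fallback = 0.9): number {
  const index = argv.indexOf("--threshold");
  if (index === -1) return fallback;
  const value = Number(argv[index + 1]);
  if (!Number.isFinite(value) || value < 0 || value > 1) {
    throw new Error("--threshold expects a number between 0 and 1");
  }
  return value;
}

// Score is the fraction of passing cases; an empty suite never passes.
function evaluateGate(results: CaseResult[], threshold: number): GateReport {
  const failedCases = results.filter((r) => !r.pass);
  const score =
    results.length === 0
      ? 0
      : (results.length - failedCases.length) / results.length;
  return {
    score,
    passed: results.length > 0 && score >= threshold,
    failedCases,
  };
}

// Illustrative results: 9 of 10 cases pass.
const sampleResults: CaseResult[] = [];
for (let i = 0; i < 10; i++) {
  const pass = i !== 3;
  sampleResults.push({
    id: `case-${i}`,
    pass,
    failures: pass ? [] : ["missing field: customerId"],
  });
}

const threshold = parseThreshold(["--threshold", "0.90"]);
const report = evaluateGate(sampleResults, threshold);
for (const failed of report.failedCases) {
  console.log(`FAIL ${failed.id}: ${failed.failures.join("; ")}`);
}
console.log(`score ${report.score} (threshold ${threshold})`);
// In a Node entry point you would then set:
//   process.exitCode = report.passed ? 0 : 1;
```

Printing each failed case with its reasons is what gives the PR author something actionable; the exit code alone only says that something regressed.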

What Separates a Prototype from a Maintainable System

An agent without evals is a system that only works as long as nobody touches it. The moment you add a tool, adjust a prompt, or change the model, you're flying blind.

With a dataset of 30-50 well-chosen cases and a CI job that runs in about 3 minutes, you have a real safety net. It is not 100% coverage, because LLMs don't work that way, but it is enough to catch most regressions before they reach production.

In the AI integration projects we build at DAILYMP, the eval suite is designed in week one, not after the first production incident. Adding it as a patch after the first failure is more expensive, harder to calibrate, and always comes too late.

If you have an agent in production without evals — or with integration tests but without behavior evaluations — there's a high probability you already have regressions you're not seeing:

Let's audit your agent together →