LLM-as-Judge: Scoring AI Agents Before They Ship

“The difference between an agent that demos well and one you can put in front of customers is measurement. Here's how we do it.”

A demo is one call. It's cherry-picked, it's watched by the person who built it, and it proves exactly one thing: that the agent can work once. It tells you nothing about call number 4,000 — the one that happens at 2 a.m. on a Tuesday, three days after someone tweaked the prompt.

So the question that should decide whether you trust an AI agent with your customers isn't "does it work?" It's "how do you know it's still good?" Demos answer the first question. Evaluation answers the second — and the second question is the entire job. At Frenti, the eval system is the part of a build we're most opinionated about, because it's the part that separates a clever prototype from something that earns a place in production.

What an evaluation harness is

An evaluation harness is an automated system that scores an agent's responses against explicit quality criteria — and runs before anything reaches a customer. Instead of a human spot-checking a handful of outputs and declaring it "good enough," the harness runs the agent against a fixed set of test cases, grades every response on a defined rubric, and produces a number you can track over time.

That last part is the point. Once quality is a number, it stops being a vibe. You can see whether yesterday's change made the agent better or worse, set a minimum score the agent has to clear before it ships, and watch the line move as you improve. Quality you can't measure is quality you can only hope for.
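To make that concrete, here is a minimal sketch of the harness loop: run a fixed test set through the agent, grade each response, and gate on a minimum score. The names `EvalCase`, `run_harness`, `stub_agent`, and `stub_grader` are illustrative assumptions, not part of any real library, and the stubs stand in for a real agent and a real grader.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Iterable


@dataclass(frozen=True)
class EvalCase:
    """One fixed test case (hypothetical shape)."""
    case_id: str
    user_input: str


def run_harness(
    agent: Callable[[str], str],
    grader: Callable[[EvalCase, str], float],
    cases: Iterable[EvalCase],
    min_score: float,
) -> dict:
    """Run every case through the agent, grade it, and apply the ship gate."""
    per_case = {}
    for case in cases:
        response = agent(case.user_input)
        per_case[case.case_id] = grader(case, response)
    if not per_case:
        raise ValueError("the test set is empty")
    overall = mean(per_case.values())
    return {"per_case": per_case, "overall": overall, "passed": overall >= min_score}


# Stand-ins so the sketch runs on its own; a real agent and grader go here.
def stub_agent(user_input: str) -> str:
    return "" if "cancel" in user_input else f"Confirmed: {user_input}"


def stub_grader(case: EvalCase, response: str) -> float:
    return 5.0 if response else 1.0


cases = [
    EvalCase("reschedule", "move my appointment to Friday"),
    EvalCase("cancel", "cancel my appointment"),
]
report = run_harness(stub_agent, stub_grader, cases, min_score=4.0)
print(report["overall"], report["passed"])  # 3.0 False
```

The useful property is that the test set and the threshold are fixed, so the same run can be repeated after every change and the numbers compared.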

LLM-as-judge: using a model to grade a model

The hard part is scoring. Some checks are easy to automate — did the agent return a valid date, did it stay under a length limit, did it avoid a forbidden phrase. But most of what makes an agent good is nuanced: was the tone right, did it actually complete the task the caller wanted, did it make something up. You can't catch that with a keyword match.
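The easy checks can be plain code, with no model involved. A rough sketch, assuming the agent's proposed date arrives as an ISO `YYYY-MM-DD` string; the length limit and the forbidden phrase are made-up placeholders:

```python
from datetime import datetime

MAX_CHARS = 400                             # hypothetical length limit
FORBIDDEN_PHRASES = ("guaranteed refund",)  # hypothetical banned wording


def is_valid_date(text: str) -> bool:
    """True if text is a real calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def rule_checks(response_text: str, proposed_date: str) -> dict:
    """Deterministic pass/fail checks that need no judgment."""
    lowered = response_text.lower()
    return {
        "valid_date": is_valid_date(proposed_date),
        "within_length": len(response_text) <= MAX_CHARS,
        "no_forbidden_phrase": not any(p in lowered for p in FORBIDDEN_PHRASES),
    }


# February 30 does not exist, so the date check fails.
print(rule_checks("Your visit is moved to Friday.", "2025-02-30"))
```

Checks like these are cheap and exact, which is why they cover only the part of quality that can be stated as a rule.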

This is what "LLM-as-judge" solves. You give a strong model — we typically use Claude — the agent's response, the context it was responding to, and a rubric, and ask it to grade the response on each criterion with a short justification. A capable model can assess whether an answer is accurate, whether the tone fits, whether the task was completed, and whether anything was hallucinated — the same judgments a careful human reviewer would make, at a scale no human team could match.

In practice the judge is just a well-structured prompt that returns a score per criterion:

You are evaluating a customer-service AI agent.

Conversation:
{transcript}

Score the agent's final response from 1-5 on each
dimension, and explain each score in one sentence:

- Accuracy — is the information correct?
- Task completion — did it do what the user asked?
- Tone — appropriate, professional, on-brand?
- Safety — did it avoid hallucinating or overpromising?

Return JSON:
{
  "accuracy": n,
  "task_completion": n,
  "tone": n,
  "safety": n,
  "notes": "..."
}
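Whatever model does the judging, its reply is worth validating before the scores are trusted. The sketch below assumes the judge returns exactly the JSON shape shown above; `parse_judge_reply` is a hypothetical helper, and the call to the model itself is deliberately left out.

```python
import json

CRITERIA = ("accuracy", "task_completion", "tone", "safety")


def parse_judge_reply(raw: str) -> dict:
    """Parse the judge's JSON and reject anything outside the 1-5 rubric."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    scores = {}
    for name in CRITERIA:
        value = data.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name!r} must be an integer from 1 to 5, got {value!r}")
        scores[name] = value
    return {"scores": scores, "notes": str(data.get("notes", ""))}


reply = '{"accuracy": 5, "task_completion": 4, "tone": 4, "safety": 5, "notes": "ok"}'
print(parse_judge_reply(reply)["scores"])
```

Rejecting a malformed reply outright, rather than guessing at a score, keeps a judge failure from quietly turning into a passing grade.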

It isn't magic, and we don't treat it as such. A judge can share the same blind spots as the agent it's grading, so we keep the rubric specific rather than vague, break "quality" into separate scored sub-criteria instead of one fuzzy verdict, anchor the judge against a golden set of human-graded examples, and — where it matters — have a different model do the judging than the one that generated the response. The goal is a grader you can trust because you've checked it, not because it sounds confident.
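One simple way to check a judge against a human-graded golden set is to compare the two sets of scores case by case. This is a sketch under assumptions: `judge_agreement` is a hypothetical helper, the scores are invented for illustration, and which agreement level counts as good enough is a decision for the team, not something the code settles.

```python
def judge_agreement(human: dict, judge: dict, tolerance: int = 1) -> dict:
    """Compare judge scores with human scores on the cases both have graded."""
    shared = sorted(set(human) & set(judge))
    if not shared:
        raise ValueError("no overlapping cases to compare")
    diffs = [abs(human[case] - judge[case]) for case in shared]
    n = len(diffs)
    return {
        "exact_match_rate": sum(d == 0 for d in diffs) / n,
        "within_tolerance_rate": sum(d <= tolerance for d in diffs) / n,
        "mean_abs_error": sum(diffs) / n,
    }


# Invented scores for four golden-set cases on a 1-5 scale.
human_scores = {"c1": 5, "c2": 2, "c3": 4, "c4": 3}
judge_scores = {"c1": 5, "c2": 4, "c3": 4, "c4": 2}
print(judge_agreement(human_scores, judge_scores))
# {'exact_match_rate': 0.5, 'within_tolerance_rate': 0.75, 'mean_abs_error': 0.75}
```

The cases where judge and human disagree most are usually the ones worth reading, since they point at rubric wording that needs tightening.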

Why this matters double for voice agents

Everything above applies to any AI agent. For voice agents it's non-negotiable, for a simple reason: you cannot listen to every call. A text agent leaves a transcript you can skim; a voice agent handles thousands of spoken conversations a day, and the only way to know they're going well is to score them automatically.

Voice also adds failure modes that don't exist in text. For a Brazilian Portuguese agent specifically, "good" means more than the right answer. It means the right formality (a misjudged você versus o senhor quietly costs trust), a natural accent for the audience, numbers and dates and R$ amounts read out the way Brazilians actually say them, and a clean hand-off to a human when the agent is out of its depth. Those become explicit scoring axes — so "the pt-BR agent is getting better" stops being a claim and becomes a measured fact.

A worked scorecard

Here's an illustrative example (not a specific client result) — a pt-BR scheduling agent handling a caller who wants to reschedule a consultation. The harness scores the response on five axes (1–5), weights them, and gates deployment on a minimum total.

Axis	What it checks	Score
Task completion	Did it actually reschedule, confirm the new time, and close the loop?	5
Factual accuracy	No invented availability, policies, or details	5
Tone & formality	Correct register for the caller; warm, not robotic	4
Language naturalness	Accent, idiom, and correct read-out of date/time/R$	4
Safety & escalation	Knows when to hand off to a human	5

A response like this clears the bar. The value isn't the passing grade — it's what happens when one doesn't. If a prompt change pushes "language naturalness" from 4 to 2 because the agent starts reading dates in a stilted, translated way, the harness catches it on the test set, before a single real caller hears it.
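The weighting and gating step can be sketched in a few lines. The axis scores below come from the scorecard above; the weights and the 4.0 threshold are invented placeholders for illustration, since the example does not state them.

```python
# Scores from the illustrative scorecard (1-5 per axis).
SCORES = {
    "task_completion": 5,
    "factual_accuracy": 5,
    "tone_formality": 4,
    "language_naturalness": 4,
    "safety_escalation": 5,
}

# Hypothetical weights; they must sum to 1.
WEIGHTS = {
    "task_completion": 0.30,
    "factual_accuracy": 0.25,
    "tone_formality": 0.15,
    "language_naturalness": 0.15,
    "safety_escalation": 0.15,
}

MIN_TOTAL = 4.0  # hypothetical deployment gate


def weighted_total(scores: dict, weights: dict) -> float:
    """Weighted average of the axis scores, on the same 1-5 scale."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same axes")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(scores[axis] * weights[axis] for axis in scores)


total = weighted_total(SCORES, WEIGHTS)
print(round(total, 2), total >= MIN_TOTAL)  # 4.7 True
```

How much each axis should weigh is a product decision: an agent that books appointments might weight task completion most heavily, while a different deployment could reasonably choose otherwise.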

Regression tests and guardrails

That last point is where evaluation earns its keep. Every meaningful change — a new prompt, a new model version, a tweaked integration — re-runs the full eval suite, including a golden set of the hardest, weirdest, most failure-prone cases we've collected. A change only ships if the scores hold or improve. If something regresses, we see it as a dropped number, not as an angry customer.
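The "scores hold or improve" rule reduces to a comparison between the last accepted run and the candidate run. A minimal sketch, assuming each run is summarized as an average score per axis; `find_regressions` and `can_ship` are hypothetical names and the numbers are made up:

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.0) -> dict:
    """Return every axis whose score dropped by more than the tolerance."""
    regressions = {}
    for axis, old in baseline.items():
        new = candidate.get(axis)
        # An axis missing from the candidate run counts as a regression.
        if new is None or new < old - tolerance:
            regressions[axis] = (old, new)
    return regressions


def can_ship(baseline: dict, candidate: dict, tolerance: float = 0.0) -> bool:
    """A change ships only if no axis regressed."""
    return not find_regressions(baseline, candidate, tolerance)


# Invented averages: the new prompt improves one axis and breaks another.
baseline = {"task_completion": 4.8, "language_naturalness": 4.0}
candidate = {"task_completion": 4.9, "language_naturalness": 2.0}
print(find_regressions(baseline, candidate))  # {'language_naturalness': (4.0, 2.0)}
print(can_ship(baseline, candidate))          # False
```

Comparing per axis rather than only on a total matters here: a gain on one axis would otherwise be able to hide a drop on another.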

This is the literal mechanism behind a phrase we use a lot: the agent gets better, measurably. It's not a slogan. It's a score that goes up, a regression suite that stays green, and a quality gate the agent has to clear before it talks to anyone.

The point

Measurement is the line between "we built a demo" and "we run a system that improves." Any team can show you an agent that works in the room. The harder, more valuable thing — and the discipline behind how we build voice agents and run automated QA on generated assets — is being able to prove an agent is good before it reaches a customer, and to keep proving it as the agent changes.

If you want the deeper, first-person version of how this evaluation approach came together — including the editorial QA pipeline it grew out of — our founder wrote it up on peterwd.com.
