#2: Why a good answer doesn’t mean a good agent
Ammar Mohanna argues that most teams are measuring outcomes when they should be measuring behavior across the entire agent workflow
During a time when AI conversations are often louder than they are useful, Ammar Mohanna, PhD, brings a refreshing perspective.
His career has moved fluidly between academia and industry, from teaching advanced AI courses at the American University of Beirut to advising teams on turning machine learning ideas into systems that can be trusted.
He is also known for his candid take on the current AI landscape, especially the gap between meaningful engineering and what he often calls AI slop.
In this conversation, Ammar challenges one of the most common assumptions in agent development: that a correct answer is evidence of a successful agent. He explains why reliability lies in the path an agent takes, not just in the result it produces, and why evaluation must evolve from output scoring to a discipline that measures behaviour and trustworthiness in production.
Read through to the end. We’ve left something extra for Agentic Engineering readers.
Most teams think they’re evaluating agents, but they’re actually not. Where do you see the biggest illusion of evaluation today?
The biggest illusion is that teams think they are evaluating an agent when they are only evaluating the final answer.
That works well for a chatbot. But an agent is different. It plans, chooses tools, passes arguments, reads observations, retries, stops, and sometimes takes action. A final-answer score hides most of the actual failure surface.
An agent can produce a good-looking answer after calling the wrong tool, wasting ten steps, misreading a tool result, or ignoring a failed call. From the outside, the answer may look acceptable. From a reliability perspective, the run is not acceptable.
So the illusion is: “the answer looked right, therefore the agent worked.” However, what you need to know is whether the path was valid, efficient, grounded, and safe.
You break evaluation into component, trajectory, outcome, and adversarial layers. Where do most teams underinvest, and what failures does that lead to?
Most teams underinvest in trajectory evaluation and adversarial evaluation.
Outcome evaluation is the easiest layer to reach for because the final answer is visible. Component evaluation is also fairly intuitive once tools are involved: did it choose the right tool, did it pass valid arguments, did the plan make sense?
Trajectory evaluation is harder because you need structured traces and assertions over the run itself. But this is where many production failures live: loops, duplicate calls, silent retries, no recovery after a tool failure, unnecessary detours, high latency, high token cost. Two agents can produce the same answer, but one gets there in four clean steps, and the other gets there through an expensive, brittle path. Output scoring treats them as equal. Production does not.
Adversarial evaluation is also underbuilt. Teams may try a few prompt-injection examples manually, but they rarely turn those attacks into a versioned regression suite. That leads to a false sense of safety. A guardrail can look very strong against the examples it was designed for, while still being fragile against slightly different payloads.
Agent failures only show up after deployment. What’s the hardest failure mode to catch early, even with a good evaluation setup?
The hardest failures are the ones that look like successful runs.
A tool returns something plausible but incomplete. The agent takes a reasonable-looking path. The final answer is fluent. No exception is thrown. But the answer is weakly grounded, the evidence is stale, or the agent skipped a recovery step after a bad observation.
These failures are hard because they do not announce themselves as failures. They show up later as retries, edits after the answer, escalations, user abandonment, or quiet loss of trust.
The other hard category is drift. A hosted model changes, a tool schema changes, retrieval content shifts, or user traffic moves into a different distribution. Nothing “breaks” in the traditional software sense, but the agent becomes less reliable. This is why offline evals need to connect to production monitoring. A test suite is necessary, but it is not the whole system.
LLM-as-a-judge is becoming a default pattern. Where does it actually work well, and where does it quietly break?
LLM-as-a-judge works well when the task is bounded, the rubric is explicit, and the judge has the evidence they need. It is useful for rubric-based scoring, regression checks, pairwise comparisons, and multi-dimensional outcome evaluation, especially when you calibrate it against human labels.
The important part is that the judge itself has to be evaluated. I would not trust a judge just because it is an LLM. I would look at correlation with human labels, agreement rates, mean absolute error, and performance by rubric dimension.
Where it quietly breaks is when teams use it as an uncalibrated oracle. Judges often reward verbosity, prefer answers in a certain style, miss missing citations, or give a strong score to an answer that is polished but not grounded. Overall scores can also hide weak dimensions. For example, a judge may be decent on safety or format, but poor on groundedness, which is often the dimension that matters most for a research or retrieval-heavy agent.
So I see LLM judges as useful evaluators, not authorities. They need rubrics, evidence, calibration, and periodic human audit.
If you had to audit an agent system in production with very limited time, what signals or metrics would you look at first to decide if it’s reliable?
Aggregate success rate is often the last metric I look at. I would start with the traces behind the failures and the production signals that users generate when the agent is not working.
The first signals I would inspect are abandonment rate, retry rate, escalation rate, clarification rate, thumbs down, and edit-after-answer rate. Those are often more honest than a dashboard success metric.
Then I would look at trace-level reliability: number of tool calls, duplicate calls, loop-like behaviour, failed tool calls, recovery after failure, latency, and token cost. A reliable agent should not only get the answer right; it should get there through a path that is stable and explainable.
I would also check whether offline evals are tied to production: are failed production examples converted into regression tests? Are adversarial cases versioned? Are judge scores calibrated against human labels? Are there no-go gates for safety, groundedness, cost, latency, and step count?
With limited time, I am looking for one thing: whether the team has connected offline evaluation, online monitoring, and regression gates. If those are disconnected, reliability is usually more assumed than measured.
As organizations continue to explore what AI can and should do, voices like Ammar’s help bring the discussion back to the questions that matter: What problem are we really solving? Can the system be trusted? And are we building something meaningful, or simply adding more noise to an already crowded field?
If you’d like to continue exploring these ideas, Ammar will be speaking at Agent Evals Bootcamp on June 27th, where he will turn these ideas into a hands-on framework for evaluating agents across tool use, planning, trajectories, outcomes, regressions, and adversarial failure modes before deployment.




