AlanMishler 1 days ago [-]
Author here — thanks for the comments!
By "evaluator" (aka "eval"), we did indeed mean frameworks for evaluating agent outputs broadly. The article and experiments center on LLM-as-a-judge, where an LLM is the grader, but the argument is ultimately statistical, so it holds regardless of whether the grader is an LLM, a simple supervised model, a set of regex checks, etc.
We were banking on readers being familiar with evals and left out definitions for conciseness, but as Gregaros points out, we could have been more explicit about what we meant.
SmithersBot 1 days ago [-]
As long as OpenAI and Anthropic keep subsidizing dirt-cheap Codex or Claude Code usage, I'll just keep using them as evaluators. The trick is to have a fresh instance doing the reviewing, not the one that did the work.
CharlesW 1 days ago [-]
> The trick is to have a fresh instance doing the reviewing, not the one that did the work.
In my experience that's not necessary (some people even claim that you must use models from different vendors), and it's expensive, since a fresh instance needs to rebuild all the context needed to review properly and thoroughly. LLMs have no problem throwing "them 5 minutes ago" under the bus when asked to review something "skeptically" and "with fresh eyes".
kangalioo 1 days ago [-]
Is there research about this?
That sounds like it would make productive AI usage much easier, but it also sounds very brittle.
ai_slop_hater 1 days ago [-]
What is an LLM evaluator?
Gregaros 1 days ago [-]
They should define this, but after having read the entire article I think it’s clear they mean “frameworks for evaluating the output of an agent” rather than what first might come to mind as “LLM evals”.
Their thesis is that even when the eval is too unreliable to judge the correctness of a single agentic action in production, it still lets you choose between two agents by comparing them across a large, aggregated collection of tasks. Effectively: you can tune your agentic parameters.
There’s nothing new in the idea that averaging over many samples can work when a single datapoint doesn’t. Presumably this is part of a conversation in which we’re lacking context.
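A rough sketch of that averaging argument, not taken from the article: the success rates, the 60% judge accuracy, and the names noisy_judge and mean_judge_score are all made up for illustration. The judge here is wrong 40% of the time on any single output, so one verdict tells you little, yet its average pass rate over many tasks should still rank the better agent higher (analytically about 0.54 vs 0.50 under these assumed numbers).

```python
import random


def noisy_judge(truly_correct: bool, accuracy: float, rng: random.Random) -> bool:
    """Hypothetical judge: agrees with the ground truth with probability `accuracy`."""
    return truly_correct if rng.random() < accuracy else not truly_correct


def mean_judge_score(success_rate: float, judge_accuracy: float,
                     n_tasks: int, rng: random.Random) -> float:
    """Fraction of tasks the noisy judge marks as passing for one simulated agent."""
    passes = 0
    for _ in range(n_tasks):
        # Assumed setup: each task succeeds independently with `success_rate`.
        truly_correct = rng.random() < success_rate
        passes += noisy_judge(truly_correct, judge_accuracy, rng)
    return passes / n_tasks


rng = random.Random(0)  # fixed seed so the run is repeatable
# Agent A is truly better (70% vs 50% success); the judge is only 60% accurate.
score_a = mean_judge_score(0.70, 0.60, 20_000, rng)
score_b = mean_judge_score(0.50, 0.60, 20_000, rng)
print(f"judge pass rate: A={score_a:.3f}  B={score_b:.3f}")
```

The gap between the two pass rates is much smaller than the true gap (the noise compresses it), which is why this only works with a large number of tasks.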
ai_slop_hater 1 days ago [-]
Are “frameworks for evaluating the output of an agent” and "LLM evals" different? :) If yes, how?
brianwmunz 1 days ago [-]
"LLM evals" is maybe an overused term because it can mean a bunch of things. This article talks about LLM-as-a-judge where an LLM scores another system's outputs.
GabrielBianconi 15 hours ago [-]
Any function that can score (i.e. "evaluate") your LLM system (e.g. your agent).
For example:
- You write a heuristic (regex, code, etc.) that assigns a score to an output
- You make another LLM score the output from your system (aka "LLM-as-a-judge")
- You have an automated system that can verify the generated outputs (e.g. does generated code compile or pass tests?)
People often talk about "LLM evals (evaluations)", which will include a set of evaluators, i.e. scoring functions.
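To make the three kinds concrete, here is a minimal sketch where every evaluator is just a function from an output string to a score. All names (regex_evaluator, compiles_evaluator, make_llm_judge, run_evals) are hypothetical, and ask_model stands in for whatever LLM client you use; no real vendor API is assumed.

```python
import re
from typing import Callable, Dict

# An evaluator is any function that maps a system output to a score.
Evaluator = Callable[[str], float]


def regex_evaluator(output: str) -> float:
    """Heuristic: 1.0 if the output contains an ISO-style date, else 0.0."""
    return 1.0 if re.search(r"\b\d{4}-\d{2}-\d{2}\b", output) else 0.0


def compiles_evaluator(output: str) -> float:
    """Automated verification: 1.0 if the output parses as Python source."""
    try:
        compile(output, "<generated>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0


def make_llm_judge(ask_model: Callable[[str], str]) -> Evaluator:
    """LLM-as-a-judge: `ask_model` is an assumed prompt -> reply callable."""
    def judge(output: str) -> float:
        reply = ask_model(f"Answer PASS or FAIL. Is this output acceptable?\n{output}")
        return 1.0 if reply.strip().upper().startswith("PASS") else 0.0
    return judge


def run_evals(output: str, evaluators: Dict[str, Evaluator]) -> Dict[str, float]:
    """An 'eval' in the broad sense: apply a set of evaluators to one output."""
    return {name: fn(output) for name, fn in evaluators.items()}


# Usage with a stubbed model so the sketch runs offline.
fake_model = lambda prompt: "PASS"
scores = run_evals("x = 1", {
    "has_date": regex_evaluator,
    "compiles": compiles_evaluator,
    "judge": make_llm_judge(fake_model),
})
print(scores)
```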
We'll make this clearer next time!