How We Fairly Evaluate Agent Tools and Harnesses

Whenever we learn of a new agent architecture or type of search tool, we immediately throw it at the hardest retrieval problems we can find. Internally, we’ve developed a set of evaluation practices that we believe are worth sharing.

Automatic prompt optimization should be a standard part of benchmarking for AI agents. In practice, we’ve repeatedly observed that the ranking of agents, tools, and models can change dramatically once prompts are optimized.


A motivating anecdote

Recently, we ran experiments with agents operating in sandboxes with file systems. These environments allow agents to write code, interact with a file system, and manage their own “knowledge base” of memories, notes, and skills. These agents commonly use simple search tools like grep and find to explore their environments. We wanted to test how much more efficient and performant an agent becomes at deep research, web browsing, and text-to-SQL tasks when given semantic search over its own file system.
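To make the comparison concrete, here is a minimal sketch of what we mean by semantic search over an agent's file system, assuming an embedding model from sentence-transformers. The model name, file filters, and truncation are illustrative choices, not the implementation from our experiments.

```python
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative embedding model; any text-embedding model would work here.
model = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(root: str) -> tuple[list[Path], np.ndarray]:
    """Embed every (truncated) text file under the agent's workspace once."""
    paths = [p for p in Path(root).rglob("*") if p.suffix in {".md", ".txt", ".py", ".sql"}]
    texts = [p.read_text(errors="ignore")[:2000] for p in paths]
    vectors = model.encode(texts, normalize_embeddings=True)
    return paths, np.asarray(vectors)

def semantic_search(query: str, paths: list[Path], vectors: np.ndarray, k: int = 5) -> list[str]:
    """Return the k files most similar to the query, by cosine similarity."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q  # cosine similarity, since embeddings are normalized
    return [str(paths[i]) for i in np.argsort(-scores)[:k]]
```

Exposed as a tool, something like semantic_search lets the agent surface conceptually related notes or skills even when it cannot guess the exact keyword a grep query would need.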

However, we made these changes naively, and the agent with semantic search initially performed worse. After optimizing the prompts for each variant of the agent with automated prompt optimization in DSPy, the agent with semantic search greatly outperformed the original grep-based agent.
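For reference, prompt optimization in DSPy looks roughly like the sketch below. The signature, metric, model name, and training example are placeholders for illustration; the real benchmark programs and metrics were task-specific.

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model choice

# A stand-in for one step of the agent; the real programs were full agent loops.
agent = dspy.ChainOfThought("question, retrieved_context -> answer")

def metric(example, prediction, trace=None):
    # Placeholder metric; each benchmark used its own scoring.
    return example.answer.lower() in prediction.answer.lower()

trainset = [
    dspy.Example(
        question="Where do we document the deploy process?",
        retrieved_context="notes/deploy.md: steps for shipping a release ...",
        answer="notes/deploy.md",
    ).with_inputs("question", "retrieved_context"),
]

# MIPROv2 proposes and evaluates instruction/demo candidates under a fixed budget.
optimizer = dspy.MIPROv2(metric=metric, auto="light")
optimized_agent = optimizer.compile(agent, trainset=trainset)
```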

The lesson here is that agent performance is dominated by instructions. Giving an agent a powerful search tool without prompting it to reason efficiently about when and how to use that tool will not set it up for success.

Why

Many evaluation practices in AI borrow their intuition from classical lab experiments. When conducting a lab experiment in school, you manipulate a single well-defined variable while holding everything else fixed. This works because the system under study is assumed to be a stable mechanism.

Agents do not behave this way — they have free internal degrees of freedom, interacting components, and new behavior that emerges from optimization. LLM and agent evals need to catch up to the standards of more mature systems fields.

In database benchmarks, knob tuning is part of the system under evaluation. Results can hinge on tuning choices, caching, warm-up, hardware, and parameter selection. If you aren't an expert in a system, you can make it look bad without lying, just by choosing defaults or a plausible-but-suboptimal setup. The same is true in reverse.

In a good distributed systems paper, external constraints (machines, network conditions, workloads) are held fixed while each system is allowed to choose its internal strategy. Retry policies, batching, and consistency mechanisms are tuned because they define how the system behaves.

In ML research, hyperparameter tuning is an essential part of benchmarking. Researchers care about how well a model performs under a given optimization budget, often measured in hyperparameter search trials or compute. Varying that budget can change model rankings entirely, which is why serious benchmarks report it explicitly.1 ML benchmarks are an especially useful analogy because optimization budgets there are well studied.
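As an illustration of the budget effect (not a result from the cited paper), the scikit-learn sketch below gives two classifiers the same number of random-search trials and reports the best score each finds; rerunning with a larger budget can reorder the two models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Two candidate models with small, illustrative hyperparameter spaces.
candidates = {
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [50, 100, 200, 400], "max_depth": [None, 4, 8, 16]}),
    "svm": (SVC(), {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1, 1]}),
}

for budget in (2, 12):  # optimization budget, in search trials per model
    scores = {}
    for name, (model, space) in candidates.items():
        search = RandomizedSearchCV(model, space, n_iter=budget, cv=3, random_state=0)
        search.fit(X, y)
        scores[name] = round(search.best_score_, 3)
    print(f"budget={budget}: {scores}")  # rankings can flip as the budget grows
```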

Agent evaluation today lags behind all three of these fields.

Agent benchmarks already involve prompt optimization, in the form of human prompt engineering. But humans are not neutral optimizers. Researchers typically do a small amount of manual prompt tuning until the system reaches the point where "it works", and they are strongly motivated to allocate a much larger optimization budget to the tool, harness, or model they are most excited about. This introduces a lot of bias into the comparison. A fairly distributed optimization budget is best enforced with automated optimization, as in the sketch below.
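Concretely, and reusing the metric and trainset placeholders from the DSPy sketch above, enforcing an even budget might look like the following; the two variant programs and the held-out split are also placeholders.

```python
import dspy

# Placeholders: two agent variants (dspy.Module programs) and a held-out split
# that the optimizer never sees. `metric` and `trainset` are as in the earlier sketch.
variants = {"grep_agent": grep_agent, "semantic_search_agent": semantic_search_agent}
evaluate = dspy.Evaluate(devset=heldout_set, metric=metric, num_threads=8)

results = {}
for name, program in variants.items():
    optimizer = dspy.MIPROv2(metric=metric, auto="light")  # identical budget for every variant
    results[name] = evaluate(optimizer.compile(program, trainset=trainset))
print(results)
```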

Closing thought

Agents are not static mechanisms. They are configurable systems whose behavior emerges from optimization.

If your question is "how well does this work out of the box?", then a stable prompt is fine. If the question is "what is this capable of?" (and we posit this is the right question to ask), then automatic optimization is necessary.

Footnotes

  1. Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith, Show Your Work: Improved Reporting of Experimental Results, Proceedings of EMNLP-IJCNLP 2019 (2019). https://aclanthology.org/D19-1224/