Agent as a Judge
Introduction
Most popular benchmarks, such as SWE-Bench, score agentic systems solely on the final resolve rate of automated repair tasks. They do not account for the intermediate steps the system took to reach that result. Agentic systems should therefore be evaluated the way a human's work would be: by examining the agent's thoughts and its trajectory, not only the final outcome.
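To make the contrast concrete, here is a minimal sketch of an outcome-only metric next to a trajectory-level one. Everything in it is an assumption for illustration: the `Step` and `Run` types, the `ok` flag (standing in for a judge's verdict on a single step), and the toy data are hypothetical and are not taken from SWE-Bench or any specific judging framework.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One step in a hypothetical agent trajectory."""
    thought: str
    action: str
    ok: bool  # assumed: a judge's verdict on whether this step was sound


@dataclass
class Run:
    """One task attempt: the final outcome plus the steps that led to it."""
    task_id: str
    resolved: bool
    steps: List[Step] = field(default_factory=list)


def resolve_rate(runs: List[Run]) -> float:
    """Outcome-only metric: fraction of tasks whose final result is resolved."""
    return sum(r.resolved for r in runs) / len(runs)


def step_soundness(run: Run) -> float:
    """Trajectory-level metric: fraction of steps the judge marked as sound."""
    if not run.steps:
        return 0.0
    return sum(s.ok for s in run.steps) / len(run.steps)


# Toy data: both tasks end up resolved, but by very different routes.
runs = [
    Run("task-a", True, [
        Step("locate the failing test", "search", True),
        Step("patch the root cause", "edit", True),
    ]),
    Run("task-b", True, [
        Step("guess at a fix", "edit", False),
        Step("guess again", "edit", False),
        Step("stumble on a passing patch", "edit", True),
    ]),
]

print(resolve_rate(runs))                                 # 1.0
print([round(step_soundness(r), 2) for r in runs])        # [1.0, 0.33]
```

Both runs count equally toward the resolve rate, yet the per-step view separates a deliberate fix from a lucky one. That gap is the information a final-outcome metric discards.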