4 Comments
User's avatar
Gabriel Alzate's avatar

This post reminds me when I used to learn math, for the teacher the answer not was the most important, the main goal was the path to reach the answer.

It's important remember that have benchmarks as we do actually, motivates the model to bring an answer although don't known this. Only for the chance to answer them correctly by lucky.

This make my thing that we could make our benchmarks based in problems with high development ontologies. When we have a good ontology have a complete framework about a topic, its parameters and a very narrow metrics, where the problems are correctly defined.

Casey Hemingway's avatar

What a great analogy Gabriel, you’re so right!

Camilo Giraldo Jaramillo's avatar

This is a fantastic and highly informative post. Thank you for clarifying the evolution of agent evaluation benchmarks and how they've adapted to overcome legacy issues like saturation and reward hacking as LLMs have advanced.

You've given me a lot to think about, and it's led me to a question regarding the scope of these benchmarks.

I've identified that many of the benchmarks discussed seem geared towards general, "world-based" tasks—activities that are performed similarly across the global industry (like programming, for example, where the "rules" are consistent).

With that in mind, I'd like to discuss and hear your thoughts on benchmarks for more country-based or localized topics.

A key example that comes to mind is the legal field. How can we apply or create a general benchmark for legal tasks when laws are fundamentally different in each country? The legal paths (penal, commerce, civil, etc.) are unique to each jurisdiction. A "correct" answer for an agent in one country could be completely incorrect in another.

Furthermore, how does language impact these benchmarks? This is something I find difficult to understand, as the main benchmarks discussed are often based in a "same" language (like English-dominated programming languages and documentation). The nuances of legal, cultural, or commercial texts in different languages would surely complicate a general evaluation.

In those scenarios, I'll find it really difficult to see how we can create effective general benchmarks across these highly localized industries.

What do you think? How do you see the future of benchmarking for these highly specific, non-uniform, and language-dependent domains?

Sadid Romero's avatar

It is clear that current benchmarks have lost their ability to reflect true progress in language models. Through reward hacking, many systems manage to climb the rankings without actually improving their understanding or reasoning.

This leaves me with a question:

if models are becoming increasingly skilled at finding “shortcuts” within benchmarks,

how could we design evaluations that are more resistant to reward hacking?

Is it possible to create an evaluation system truly “immune to gaming,” or will adversarial learning always remain an inevitable part of AI development? What metrics should be considered to address this issue?