The AI industry is obsessed with Chatbot Arena, but it might not be the best benchmark

Over the past few months, tech execs like Elon Musk have touted the performance of their company's AI models on a particular benchmark: Chatbot Arena.

Maintained by a nonprofit known as LMSYS, Chatbot Arena has become something of an industry obsession. Posts about updates to its model leaderboards garner hundreds of views and reshares across Reddit and X, and the official LMSYS X account has over 54,000 followers. Millions of people have visited the organization's website in the last year alone.

Still, there are some lingering questions about Chatbot Arena's ability to tell us how "good" these models really are.

In search of a new benchmark

Before we dive in, let's take a moment to understand what LMSYS is exactly, and how it became so popular.

The nonprofit launched only last April as a project spearheaded by students and faculty at Carnegie Mellon, UC Berkeley's SkyLab and UC San Diego. Some of the founding members now work at Google DeepMind, Musk's xAI and Nvidia; today, LMSYS is primarily run by SkyLab-affiliated researchers.

LMSYS didn't set out to create a viral model leaderboard. The group's founding mission was making models (specifically generative models à la OpenAI's ChatGPT) more accessible by co-developing and open sourcing them. But shortly after LMSYS' founding, its researchers, dissatisfied with the state of AI benchmarking, saw value in creating a testing tool of their own.

"Current benchmarks fail to adequately address the needs of state-of-the-art [models], particularly in evaluating user preferences," the researchers wrote in a technical paper published in March. "Thus, there is an urgent necessity for an open, live evaluation platform based on human preference that can more accurately mirror real-world usage."

Indeed, as we’ve written before, the most commonly used benchmarks today do a poor job of capturing how the average person interacts with models. Many of the skills the benchmarks probe for — solving PhD-level math problems, for example — will rarely be relevant to the majority of people using, say, Claude.

LMSYS' creators felt similarly, and so they devised an alternative: Chatbot Arena, a crowdsourced benchmark designed to capture the "nuanced" aspects of models and their performance on open-ended, real-world tasks.

[Image: The Chatbot Arena rankings as of early September 2024. Image credits: LMSYS]

Chatbot Arena lets anyone on the web pose a question (or questions) to two randomly selected, anonymous models. After agreeing to the ToS, which allows their data to be used for LMSYS' future research, models and related projects, a person can vote for their preferred answer from the two dueling models (or declare a tie, or say "both are bad"); only then are the models' identities revealed.
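Pairwise votes like these are typically aggregated into a leaderboard with a rating system. As a rough illustration of the idea (this is a hypothetical sketch of an Elo-style update, a common approach for pairwise leaderboards; LMSYS' actual ranking method may differ), each vote nudges the winner's rating up and the loser's down, in proportion to how surprising the result was:

```python
# Hypothetical sketch: Elo-style rating update from one pairwise vote.
# This illustrates the general idea behind pairwise leaderboards,
# not LMSYS' actual methodology.

def expected_score(r_a: float, r_b: float) -> float:
    """Predicted probability that model A beats model B, given ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, outcome: float, k: float = 32.0):
    """Apply one vote. outcome: 1.0 = A wins, 0.0 = B wins, 0.5 = tie."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (outcome - e_a)            # winner gains
    new_b = r_b + k * ((1.0 - outcome) - (1.0 - e_a))  # loser pays
    return new_a, new_b

# Two models start level; one vote for model A moves both ratings.
ratings = update(1000.0, 1000.0, 1.0)  # → (1016.0, 984.0)
```

An upset (a low-rated model beating a high-rated one) moves ratings more than an expected result, which is how a stream of individual votes converges toward a stable ranking.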