Ranking with Agents

Wed Aug 06 2025

One of the latest Kaggle-style competitions I’ve been participating in got me thinking about how hard it is to collect accurate, relevant preferences from humans and aggregate them into somewhat consistent rankings or weight distributions.

I did some research around this general issue and, at the same time, worked on a small tool to explore a potential approach for the competition.

Can a bunch of LLM Agents be used to rank an arbitrary set of items in a consistent way?

A couple of weeks later, I had the chance to attend the Impact Evaluator Research Retreat and, in the first few days, realized the idea was a perfect residency project.

I got to explore the idea further during the retreat, and this post covers the main learnings!

Ranking participants in large hackathons is no joke!

Problem

This is the general version of the problem.

Given a set of evaluation criteria and an arbitrary set of items, how can we produce the highest-quality judging results?

It’s a very common problem as you can imagine. You’ll encounter it when jurors have to evaluate submissions in large hackathons or humans have to rank LLM responses based on “usefulness”.

The naive (and unfortunately the most common) approach is to ask humans to rate each item on a Likert scale or similar. This has several issues: every rater interprets the scale differently, scores drift as judges work through more items, and absolute ratings are hard to calibrate and aggregate across jurors.

As many people have realized long before me, there are better ways to rank items (e.g. chocolate). Let’s look at one interesting approach.

Simplify Decisions with Pairwise Comparisons

This is probably one of the simplest solutions. Evaluate or rank the items by having jurors do pairwise comparisons between random items. This helps in several ways: deciding between two options is cognitively much easier than assigning an absolute score, there is no scale that needs calibrating across jurors, and the comparisons can be aggregated into a ranking with well-studied models (Bradley-Terry, Elo, etc.).

Quite cool! Pairwise preferences are something that has been explored in the literature, but they still feel underused in practice.
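To make the idea concrete, here is a minimal sketch (not Arbitron’s code) of ranking by pairwise comparisons: sample random pairs, ask a judge, which could be a human interface or an LLM call, to pick a winner, and count wins. The judge callable and the round count are placeholders.

import random
from collections import Counter
from itertools import combinations

def rank_by_pairwise(items, judge, n_rounds=100):
    """Rank items by sampling random pairs and counting wins.

    `judge(a, b)` is any callable (a human UI, an LLM call, ...)
    that returns whichever of the two items it prefers.
    """
    wins = Counter({item: 0 for item in items})
    pairs = list(combinations(items, 2))
    for _ in range(n_rounds):
        a, b = random.choice(pairs)
        wins[judge(a, b)] += 1
    # Most-preferred items first.
    return [item for item, _ in wins.most_common()]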

In 2025 you have to use LLMs

So, LLMs can surely help with the judging/ranking process, right? Well, they can, but there are some challenges if you approach this naively (using a single LLM to rank all the items in one long prompt).

Another approach is to have multiple agents collaborate and talk to each other. I don’t have any data to back this up, but I suspect they would lose track of the context and it would be more expensive. The first opinion voiced would also be very influential!

We’ve learned a better way though! What if we could have multiple LLMs (agents) that are specialized in evaluating items based on different criteria? Each agent could focus on a specific aspect of the items, and then we could aggregate their results.

That is basically what I set out to explore with Arbitron after realizing that using standalone LLMs with long context wasn’t ideal for the competition I was working on.

Arbitron

Arbitron is “a multi-agent consensus ranking system to derive optimal weights through pairwise comparisons”. Sounds more complex than it is. Think of it as a framework to define agents (LLMs in a loop with tools) that evaluate items based on different criteria. The results then get aggregated to produce a final ranking or weight distribution.

import arbitron

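# The items to rank; each one just needs an identifier here.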
movies = [
    arbitron.Item(id="arrival"),
    arbitron.Item(id="blade_runner"),
    arbitron.Item(id="interstellar"),
    arbitron.Item(id="inception"),
    arbitron.Item(id="the_dark_knight"),
    arbitron.Item(id="dune"),
    arbitron.Item(id="the_matrix"),
    arbitron.Item(id="2001_space_odyssey"),
    arbitron.Item(id="the_fifth_element"),
    arbitron.Item(id="the_martian"),
]

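# The agents: each judges pairwise matchups from a different angle, with its own model.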
agents = [
    arbitron.Agent(
        id="SciFi Purist",
        prompt="Compare based on scientific accuracy and hard sci-fi concepts.",
        model="google-gla:gemini-2.5-flash",
    ),
    arbitron.Agent(
        id="Nolan Fan",
        prompt="Compare based on complex narratives and emotional depth.",
        model="groq:qwen/qwen3-32b",
    ),
    arbitron.Agent(
        id="Critics Choice",
        prompt="Compare based on artistic merit and cinematic excellence.",
        model="openai:gpt-4.1-nano",
    ),
]

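# The overall question every comparison should answer.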
description = "Rank the movies based on their soundtrack quality."

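# Run the pairwise comparisons across all agents, then aggregate them into a ranking.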
comparisons = arbitron.run(description, agents, movies)
ranking = arbitron.rank(comparisons)

The previous code will give you a ranking of the movies based on the criteria defined in the description! Under the hood it uses PydanticAI for the LLM calls and choix for the ranking algorithms, and it has a few features I’m particularly fond of.
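Since choix handles the aggregation, here is a rough sketch of how pairwise outcomes can be turned into scores with it. The item names, comparisons, and index mapping are made up for illustration; only the choix.ilsr_pairwise call reflects the actual library.

import choix

# Hypothetical pairwise outcomes expressed as (winner, loser) pairs.
items = ["arrival", "blade_runner", "interstellar"]
index = {name: i for i, name in enumerate(items)}
comparisons = [
    ("arrival", "blade_runner"),
    ("interstellar", "blade_runner"),
    ("arrival", "interstellar"),
]

# Bradley-Terry style strengths via iterative Luce spectral ranking.
data = [(index[winner], index[loser]) for winner, loser in comparisons]
params = choix.ilsr_pairwise(len(items), data, alpha=0.01)

# Higher strength means more preferred overall.
ranking = sorted(items, key=lambda name: params[index[name]], reverse=True)
print(ranking)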

Evaluations

I’ve done a couple of local experiments to evaluate the performance of Arbitron. I compared the ranking accuracy (Kendall Tau) of different systems: standalone LLMs versus a few Arbitron configurations.

The first eval makes the agents choose which of two movies was released earlier. Aggregating those comparisons should produce a list sorted by release year.

In this simple example, most of the models got things right! The only ones that didn’t were ChatGPT, GPT 4.1, and the smaller models. This kind of knowledge is readily available inside the LLMs, so they had little trouble retrieving it.

The second eval is trickier. The goal is to rank Wikipedia articles by popularity (cumulative page views since 2007). Now things get interesting, as this data is unlikely to be well represented in the models’ training corpus.
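For reference, the Kendall Tau score here is just a rank correlation between the predicted order and the ground-truth order; a quick sketch with scipy (the two orderings below are made up):

from scipy.stats import kendalltau

# Ground truth (oldest first) and a prediction with one pair swapped.
ground_truth = ["2001_space_odyssey", "blade_runner", "the_matrix", "arrival"]
predicted = ["2001_space_odyssey", "the_matrix", "blade_runner", "arrival"]

# Convert each ordering into per-item ranks, then correlate them.
truth_rank = {name: i for i, name in enumerate(ground_truth)}
pred_rank = {name: i for i, name in enumerate(predicted)}
tau, _ = kendalltau(
    [truth_rank[name] for name in ground_truth],
    [pred_rank[name] for name in ground_truth],
)
print(tau)  # 1.0 is a perfect match, -1.0 a fully reversed order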

🎮 Check how you score in the same benchmark

Here are the scores of the latest run. The higher the Kendall Tau score, the better the ranking.

Model               Kendall Tau Score
Arbitrinity                      0.2
Arbiten                          0.15
Gemini 2.5 Pro                  -0.24
Gemini 2.5 Flash                -0.33
Opus                             0.28
GPT 4.1                         -0.06
GPT o3                           0.33
Arbitrinity Max                  0.16
Arbiten Max                      0.12

Even in this simple example with only 10 items (which doesn’t take up much context), Arbitron usually outperformed single models, except for Opus and o3. Interestingly, having 10 agents didn’t seem to improve the results!

Anecdotally, the scores from Arbitron also seemed more consistent across runs (others have noticed this previously). More research is definitely needed!

A few interesting questions are still worth exploring.

Learnings

The biggest learning for me has been an improved intuition for why pairwise comparisons work so well in this context. I’ve also learned a lot about the state of the art in using pairwise comparisons to train and evaluate LLMs (a common approach since 2017). There is a lot of literature on aligning LLMs with human judgement via pairwise comparisons that I wasn’t aware of before!

Lots of these ideas are the basis of how RLHF works these days. Modern RLHF practices (e.g. pairwise rerankers) use preference data rather than absolute scores, precisely because of the advantages of pairwise comparisons shared earlier. Chatbot Arena (which ranks all major LLMs) is entirely based on pairwise comparisons. The people building LLMs are relying on this.
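As a concrete anchor, the reward models at the heart of RLHF are typically trained with a Bradley-Terry style pairwise loss: the model only has to score the preferred response higher than the rejected one. A tiny sketch with made-up reward scores:

import math

def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Logistic (Bradley-Terry) loss used for reward models:
    it pushes the chosen response's score above the rejected one's."""
    return -math.log(1 / (1 + math.exp(-(r_chosen - r_rejected))))

print(pairwise_preference_loss(2.3, 1.1))  # small loss: preference respected
print(pairwise_preference_loss(0.2, 1.9))  # large loss: preference violated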

Another important realization is how much interfaces and UX matter, both for humans and for LLMs. While running the evaluations, I could feel how much the prompt design affected results! E.g. there is a strong verbosity bias where longer responses often win.

Finally, this approach is not a silver bullet. Every context has different requirements, and while Arbitron-style systems typically get the direction right, the magnitudes/weights they come up with can be noisy.

Uses

This approach shines where there is some subjectivity that is hard to measure (when an objective answer exists, the agents could just use a tool and get it). Areas that come to mind range from judging hackathon submissions to ranking LLM responses.

Future

Of course, I have a long list of things I’d like to continue exploring. There are many obvious improvements to the current tool, like making it a web app, but also more interesting research questions.

Overall, it was a very fun project and I’m very happy with the results (not so much with the costs [2]).


Before wrapping up, I wanted to leave with a meta reflection. The name Arbitron was decided by the tool itself after I asked it to rank a bunch of candidate names. I later realized I don’t like the name [1]. The meta-lesson here is that sometimes the more important thing is not a better mechanism for producing the final ranking, but better mechanisms for discussing and coordinating what to propose in the first place.

Acknowledgements

Footnotes

  1. Naming is not my strong suit.

  2. Always set some threshold on your LLM providers!
