Weight Allocation Mechanism Evals

Wed Oct 15 2025

There are many folks working on mechanisms that can be boiled down to “assign weights to items (projects, grant proposals, movies, etc.) indicating their relative importance in a credibly neutral manner”. Since there is no ground truth, these mechanisms often involve human judgment (like in participatory budgeting) and can become quite elaborate. They end up having jurors, rounds, multiple parties, algorithms to aggregate scores, and so on. I’m talking about mechanisms like Quadratic Funding, Deep Funding, Filecoin ProPGF/RetroPGF, and Gitcoin Grants. As someone interested in these mechanisms (and a participant in some), I’ve been wondering about one simple but important question.

Are these mechanisms assigning weights better than simpler alternatives? How do we know if their weight distribution is good or bad? Are the final weight distributions “better” than, say, a field expert assigning weights on a Sunday morning, in less than 10 minutes, with a simple heuristic? How much better than random? Well, we don’t really know. It is even hard to say what “better” means.

In this post, I explore the idea of having two different mechanisms:

  1. A process to assign relative weights to items.
  2. A process for choosing the best weight distribution mechanism.

For number one, interesting projects like Deep Funding are already working on it. For number two, though, we need more research and development!

TLDR

We need a meta-mechanism to evaluate and compare mechanisms and not only focus on the weight-setting mechanisms themselves. A simple approach is to use pairwise comparisons from jurors to evaluate how well a weight distribution aligns with their preferences. This framework can be used to measure how close the output of a mechanism is to the preferences of the jurors, optimize and tune the mechanisms themselves, and compare different mechanisms against each other.

Problem Definition

There are a few components to this problem: a set of items to weigh (i of them), a set of jurors providing judgment (j of them), and the mechanism that turns juror input into a weight distribution over the items.

The simple, ideal setups for this problem are things like a local referendum (high j, low i), where hundreds of voters assign a collective weight to a tiny option set (Yes/No), or the Hugo Awards, where ~5 finalists (low i) receive thousands of ranked ballots (high j). This, sadly, is not viable if you want to do things like fund the open source repositories (many i) of an entire ecosystem with only a few “experts” (few j). In these settings, we need ways to scale human judgment.

|  | Low i (few items) | High i (many items) |
| --- | --- | --- |
| Low j (few jurors) | Steward heuristic session, kitchen-table consensus, juror-written memo. | Funding open source repos with a tiny panel, expert triage queue, Delphi refinement. |
| High j (many jurors) | Local referendum (Yes/No), Hugo Awards ballots, ranked-choice sprint. | Deep Funding, Quadratic Funding. |

The Meta-Mechanism

We should have a meta-mechanism to evaluate and compare mechanisms, not only focus on the weight-setting mechanism itself. We need a way to measure how “fit” a weight distribution is to the sparse preferences of the jurors. With this, we can do a few things:

  1. Measure how close the output of a mechanism is to the preferences of the jurors.
  2. Optimize and tune the mechanisms themselves.
  3. Compare different mechanisms against each other.

Pairwise Comparison Meta-Mechanism

Why not use pairwise comparisons from jurors to evaluate how well a weight distribution aligns with their preferences? Have your jurors do pairwise comparisons between random items and then see which weight distribution agrees most with their choices (something similar to the RLHF approach used in LLMs, but applied to these weight distributions).

In experiments like Deep Funding, we already have jurors making pairwise comparisons (the data is being used to train models). Instead of asking jurors which weight distribution they prefer, we can use all their pairwise comparisons to evaluate how well a weight distribution aligns with their preferences directly. Comparing weight distributions is not something humans are good at. It’s easier to say “I prefer A over B” than “I think A should get 30% and B 20%”. Imagine doing that for thousands of items! We need local preferences, not global ones.

Choose your favorite formula for measuring the distance/agreement between the weight vector and the pairwise data. Then pick the closest vector to all jurors’ pairwise choices. This way, you use only the observed pairs (so sparsity is fine) while the formula aligns with the question “which vector would the jury choose if they had to choose one?”
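
To make that concrete, here is a minimal sketch of the selection loop in Python. Everything in it is illustrative: the item names, the candidate mechanisms, and the agreement-rate formula (fraction of observed juror choices where the vector ranks the preferred item higher) are assumptions for the example, not a prescribed design.

```python
# Each juror choice is recorded as (preferred_item, other_item).
comparisons = [
    ("repo_a", "repo_b"),
    ("repo_a", "repo_c"),
    ("repo_c", "repo_b"),
]

# Hypothetical outputs of three different weight-setting mechanisms.
candidates = {
    "deep_funding_style": {"repo_a": 0.5, "repo_b": 0.2, "repo_c": 0.3},
    "expert_heuristic":   {"repo_a": 0.4, "repo_b": 0.4, "repo_c": 0.2},
    "uniform_baseline":   {"repo_a": 1/3, "repo_b": 1/3, "repo_c": 1/3},
}

def agreement_rate(weights: dict, comparisons: list) -> float:
    """Fraction of observed pairs where the weight vector ranks the
    juror-preferred item strictly above the other item."""
    agreed = sum(
        1 for winner, loser in comparisons
        if weights.get(winner, 0.0) > weights.get(loser, 0.0)
    )
    return agreed / len(comparisons)

# Score every candidate on the same observed pairs and pick the winner.
scores = {name: agreement_rate(w, comparisons) for name, w in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

Only the observed pairs are touched, so it does not matter that jurors never saw most item combinations; a mechanism wins simply by agreeing with the jury more often than its rivals.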

There are many valid formulas to choose from. Personally, I think the formula you choose should:

  1. Use only the observed pairs, so sparse juror data is not a problem.
  2. Reward weight vectors that agree with the jurors’ choices, so it answers “which vector would the jury choose if they had to choose one?”

I’m sure there are a few options that fit! My hunch is that something simple like the Brier score or log-loss on the pairwise comparisons could work well as a starting point. Needs more thought and experiments!
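
For instance, here is a rough sketch of those two scores. It assumes (my assumption for the example, not something fixed by any of these mechanisms) a Bradley-Terry-style link where the probability a juror prefers A over B is w_A / (w_A + w_B):

```python
import math

def pairwise_log_loss(weights: dict, comparisons: list, eps: float = 1e-9) -> float:
    """Average negative log-likelihood of the jurors' observed choices,
    assuming p(winner over loser) = w_winner / (w_winner + w_loser)."""
    total = 0.0
    for winner, loser in comparisons:
        w_win = max(weights.get(winner, 0.0), eps)
        w_los = max(weights.get(loser, 0.0), eps)
        p = w_win / (w_win + w_los)   # predicted chance the juror picks `winner`
        total += -math.log(max(p, eps))
    return total / len(comparisons)

def pairwise_brier(weights: dict, comparisons: list, eps: float = 1e-9) -> float:
    """Mean squared error between the predicted probability and the observed
    outcome, which is always 1 for the item the juror actually chose."""
    total = 0.0
    for winner, loser in comparisons:
        w_win = max(weights.get(winner, 0.0), eps)
        w_los = max(weights.get(loser, 0.0), eps)
        p = w_win / (w_win + w_los)
        total += (1.0 - p) ** 2
    return total / len(comparisons)
```

Either function can be dropped into the selection loop above in place of the raw agreement rate (lower is better here). The difference is that these reward confident, well-calibrated gaps between weights instead of only getting the ordering right.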

Conclusion

The key insights are that (1) we don’t just need better jury data or weight-setting algorithms; we also need something to choose the most effective weight-setting mechanism! And (2) we should use data to choose the best mechanism, not human judgment of the mechanisms themselves. Without this, we might just be optimizing blindly.

With diverse mechanisms and a way to score them, we unlock combinations and evolutionary approaches: better-scoring mechanisms can survive and be refined over time. This could lead to a more robust and adaptable system that better handles the complexities of real-world decision-making.

Finally, I think these problems map onto RLHF in ways worth exploring further. Similar to how LLM systems rely on pairwise/Elo-style rankings (for preference training and for reranking), there is a body of work around “models ranking things” that probably has some interesting insights for this problem.

Thanks to Devansh Mehta for the feedback and suggestions on this post!
