Credible Neutral AI Competitions

Tue Jun 03 2025

After writing Steering AIs with Transparent and Effective ML Competitions, I’ve been thinking about ways to make running similar competitions more credibly neutral and community-friendly. Similar to the Datadex project, I wanted to explore how small-to-medium-scale (community-level) competitions can be run on public (exit-friendly) infrastructure and open source tooling.

Credible Community Competitions (C³)

There are a few properties we want from a platform that runs this sort of competition.

I previously shared some ideas to improve the current competition experience; now I want to explore something different altogether: building a competition on top of git, public-key cryptography, and a bunch of open source ad-hoc scripts. To do that, let’s remove all the “bells and whistles” until only the core of a competition remains: a private test set and an evaluation script.

That’s it! The setup is nothing revolutionary, but it’s simple, open, and flexible.
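To make that concrete, here is a rough sketch of what the scoring core could look like. The file names (private/test_labels.csv, a CSV submission with id,label columns) and the accuracy metric are placeholders I picked for illustration, not a prescribed layout:

```python
# score.py -- minimal sketch of the "core" of a competition: score one
# submission against the private test set. File names and the accuracy
# metric are illustrative assumptions, not a prescribed layout.
import csv
import sys


def load_labels(path):
    """Read a CSV with `id,label` columns into a dict."""
    with open(path, newline="") as f:
        return {row["id"]: row["label"] for row in csv.DictReader(f)}


def score(submission_path, test_labels_path="private/test_labels.csv"):
    """Fraction of test ids the submission labels correctly."""
    truth = load_labels(test_labels_path)
    preds = load_labels(submission_path)
    correct = sum(1 for k, v in truth.items() if preds.get(k) == v)
    return correct / len(truth)


if __name__ == "__main__":
    print(f"accuracy: {score(sys.argv[1]):.4f}")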

As there is a lot of hand-waving in there, let me go a bit deeper into different things and expand with more ideas.

Identity

Solving identity is hard, so the goal is not to focus on that. In this case, it’s simpler for us to push it to the community hosting the competition: they know best what “identity” means for them and might want nuances that a central platform cannot provide. Leaving identity to the community adds an extra layer of complexity, but there are a couple of sane defaults that can be used.
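One such default could be asking participants to sign their submissions with keys the community already knows and having the hosts verify the signature in a small glue script. A sketch of that check, assuming detached GPG signatures and keys already imported into the hosts’ keyring:

```python
# verify_identity.py -- sketch: check that a submission carries a valid
# detached GPG signature from a key the hosts have already imported into
# their local keyring. File paths are illustrative assumptions.
import subprocess
import sys


def is_signed_by_known_key(submission: str, signature: str) -> bool:
    """Return True if `signature` is a valid detached signature for
    `submission` made by a key present in the local GPG keyring."""
    result = subprocess.run(
        ["gpg", "--verify", signature, submission],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    ok = is_signed_by_known_key(sys.argv[1], sys.argv[2])
    print("signature OK" if ok else "signature check FAILED")
    sys.exit(0 if ok else 1)
```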

Submissions

In the simplest setup, participants can push their own encrypted submissions to the git repository hosting the competition. That might not work for everyone, though, as there might be times when participants don’t want to be linked to the competition via their git (realistically, GitHub) user.

In those cases, the hosts can receive the files by any other means while the competition is running (e.g., a website form plus GitHub Actions), as long as the file is signed and the hosts can run their identity verification scripts against it. In theory, participants could even send only hashes of the files (content addressing rules!) and host the files somewhere else (IPFS, HuggingFace, …)!
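For the content-addressing variant, the only thing that needs to land in the repository is a hash of the submission file; the file itself can live on IPFS, HuggingFace, or wherever the participant prefers. A minimal sketch (SHA-256 here is just one reasonable choice):

```python
# submission_hash.py -- sketch: compute a content hash of a submission so
# only the hash needs to be committed to the competition repo, while the
# file itself is hosted elsewhere (IPFS, HuggingFace, ...).
import hashlib
import sys


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    print(sha256_of_file(sys.argv[1]))
```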

Custom Rules

A cool side effect of running things on git is that we can set up automations! These automations can take care of things like limiting the number of submissions per participant¹, or generating a partial leaderboard every two days.

The most straightforward and open way to do this would be using GitHub Actions or similar! As soon as a participant opens a PR, an action could run all the needed data and rule-conformance checks to either accept or reject the submission. Scheduled actions can generate a leaderboard, update datasets by revealing more data, …

If you can codify a rule, you can automate it! GitHub Actions logs are public, so anyone can check them and verify the code has been running as you’d expect.
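As an illustration of codifying a rule, here is a sketch of a check a CI job could run on every PR: count how many submissions a participant already has in the repo and fail if they are over the limit. The submissions/<participant>/ layout and the limit of one are assumptions of this sketch, not a standard:

```python
# check_rules.py -- sketch of an automatable rule: fail the check if the
# participant already has more than MAX_SUBMISSIONS files committed.
# The submissions/<participant>/ layout is an assumption of this sketch.
import sys
from pathlib import Path

MAX_SUBMISSIONS = 1  # see footnote 1: one final submission per participant


def count_submissions(participant: str, root: str = "submissions") -> int:
    """Count the files already committed under submissions/<participant>/."""
    participant_dir = Path(root) / participant
    if not participant_dir.exists():
        return 0
    return sum(1 for p in participant_dir.iterdir() if p.is_file())


if __name__ == "__main__":
    participant = sys.argv[1]
    n = count_submissions(participant)
    if n > MAX_SUBMISSIONS:
        print(f"{participant} has {n} submissions (limit {MAX_SUBMISSIONS})")
        sys.exit(1)
    print("rule check passed")
```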

Evaluation

Similar to the above, the easiest way to evaluate is to run a script (python leaderboard.py) at the end of the competition that generates a simple leaderboard with the participants’ scores. This is something anyone will be able to replicate locally if needed.
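A sketch of what that leaderboard.py could look like, reusing the scoring sketch from earlier; the one-file-per-participant directory layout is, again, just an assumption:

```python
# leaderboard.py -- sketch: score every submission in submissions/ against
# the (now decrypted) private test set and print a sorted leaderboard.
# Assumes one CSV per participant under submissions/<participant>/.
from pathlib import Path

from score import score  # the scoring sketch shown earlier


def build_leaderboard(root: str = "submissions"):
    """Return (participant, score) pairs sorted from best to worst."""
    rows = []
    for participant_dir in sorted(Path(root).iterdir()):
        if not participant_dir.is_dir():
            continue
        for submission in participant_dir.glob("*.csv"):
            rows.append((participant_dir.name, score(str(submission))))
    return sorted(rows, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    print("| rank | participant | score |")
    print("|------|-------------|-------|")
    for rank, (name, value) in enumerate(build_leaderboard(), start=1):
        print(f"| {rank} | {name} | {value:.4f} |")
```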

Discussions and Other Goodies

As a side effect of using a git platform like GitHub or its alternatives, we get issues and discussions where participants can raise concerns or problems with the competition, or simply share their approaches and questions while the competition is running.

If the hosts want it, participants could push (or link with a hash to) their models’ code or any other artifacts. Anyone (or GitHub Actions) could then replicate their results! This setup makes a great framework for encouraging open models and open code.

Finally, since there is a public record of changes to the code, participants can be rewarded for improving the competition infrastructure. This could be ad hoc, retroactive, or even bounty style (e.g., $500 USDC if you make the leaderboard a website, …)!

Conclusion

This is mostly a thought experiment, but something that could probably work for smaller competitions without much extra effort. Right now, I’d say all the competitions on Pond have been “small” (fewer than 20 serious participants) and would be a good fit for this setup.

Oh! One last benefit of using a git platform is that it will probably outlast any external platform (e.g., Kaggle, Pond, …), and the community will have more control over it!

I’m sure there are many optimizations to this process and I’d love to hear them, so please reach out!

Footnotes

  1. An idea I am converging on is that participants should only be allowed to send one final submission and be evaluated on that one. The first reason is that this gives the competition better sybil resistance: sybils get only one submission per account they control, instead of the 3 * $DURATION_DAYS * $N_ACCOUNTS probes of the hidden test set they get in some of the current competitions (e.g., 3 submissions a day over a 30-day competition with 10 accounts would be 900 probes), which might even make it worth running a Bayesian Optimization on the public scores to approximate the shape of the scoring function. The second reason is that real life doesn’t have a leaderboard. When you are building a model, you don’t have a “hidden test set” to get scored on. You have to build your own validation mechanism and think about how to avoid overfitting. Then you deploy or apply the model.
