Credible Neutral AI Competitions
Tue Jun 03 2025

After writing Steering AIs with Transparent and Effective ML Competitions, I’ve been thinking about ways to make running these kinds of competitions more credibly neutral and community friendly. In the same spirit as the Datadex project, I wanted to explore how small-to-medium-scale (community level) competitions can be run on public (exit friendly) infrastructure and open source tooling.
Credible Community Competitions (C³)
There are a few properties we want in a platform that runs these sorts of competitions.
- It should be familiar to the participants.
- It has a place to raise issues and have discussions.
- It’s reasonably resistant to collusion.
- It doesn’t lock anyone out.
- It’s transparent, so everyone can see (and verify!) that nothing odd is going on.
I previously shared some ideas to improve the current competitions experience; now I want to explore something different altogether: building a competition on top of git, public-key cryptography, and a bunch of open source ad-hoc scripts. To do that, let’s strip away all the “bells and whistles” until only the core of a competition remains: a private test set and an evaluation script. It looks like this:
- Hosts set up a git repository.
- The public PGP key of the competition (e.g: `competition_public.asc`), generated by the hosts, is published.
- Hosts upload the encrypted test set (`test.csv.asc`) with the private data on which participants will be evaluated.
- They add the evaluation script that will be run on each participant’s `submissions.csv` against `test.csv`.
- Hopefully, they add a `README.md` with all the competition details.
- Also, they can add any other scripts to take care of anything else they might want to control (submission schemas, identity, custom rules like no more than 3 submissions per participant).
- If there is any, they can also upload training data.
- Participants send their submissions in any way the hosts allow. The idea is to get some number of encrypted `submission.csv.asc` files per participant on the repo (the exact number being a competition parameter).
- The hosts might or might not run the evaluation script on the decrypted submissions and publish a temporary leaderboard based on a percentage of the test set.
- Once the competition ends, a final leaderboard is computed and `competition_private.asc` is revealed, allowing anyone to recreate the leaderboard and explore other participants’ submissions.
That’s it! The setup is nothing revolutionary but it’s simple, open, and flexible.
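To make the key handling concrete, here is a minimal host-side sketch. It assumes the `python-gnupg` wrapper and uses placeholder names and passphrases; plain `gpg` commands on the CLI work just as well.

```python
# setup_competition.py -- minimal host-side sketch (assumes the python-gnupg
# package; identities, passphrases, and file names are placeholders).
import gnupg

gpg = gnupg.GPG(gnupghome=".competition-gnupg")

# 1. Generate the competition key pair.
key = gpg.gen_key(gpg.gen_key_input(
    name_email="hosts@example.org",       # placeholder host identity
    passphrase="competition-passphrase",  # placeholder passphrase
    key_type="RSA",
    key_length=3072,
))

# 2. Publish the public key in the repo as competition_public.asc.
with open("competition_public.asc", "w") as f:
    f.write(gpg.export_keys(key.fingerprint))

# 3. Encrypt the private test set; only test.csv.asc gets committed.
with open("test.csv", "rb") as f:
    gpg.encrypt_file(f, recipients=[key.fingerprint],
                     armor=True, always_trust=True, output="test.csv.asc")

# 4. Once the competition ends, reveal the private key so anyone can decrypt
#    the submissions and recompute the leaderboard.
with open("competition_private.asc", "w") as f:
    f.write(gpg.export_keys(key.fingerprint, secret=True,
                            passphrase="competition-passphrase"))
```

Participants would do the mirror image: import `competition_public.asc`, encrypt their `submission.csv` against it, and sign it with their own key.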
As there is a lot of hand-waving in there, let me go a bit deeper into different things and expand with more ideas.
Identity
Solving identity is hard, so the goal is not to focus on that. It’s simpler to push it to the community hosting the competition: they know best what “identity” means for them and might want nuances that a central platform cannot provide. Leaving identity to the community adds an extra layer of complexity, but there are a couple of sane defaults that can be used.
- Rely on GitHub / GitLab / … accounts.
- Have an upload portal that takes signed submissions and relies on other media (email, pseudonym, a message on chain) to prove the participant owns the private key (see the sketch after this list).
- For example, signed submissions linked to Gitcoin Passport wallets. The hosts can now use the Passport API to check identity at upload/PR time.
- Hosts can request participants to link to multiple external identities (ENS, domain TXT records, hidden comment on GH, …)
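As a sketch of what the host-side check could look like, assuming `python-gnupg` and a hypothetical `keys/<username>.asc` layout where each participant has registered a public key:

```python
# verify_identity.py -- hypothetical identity check (assumes python-gnupg and
# that participants registered their public keys under keys/<username>.asc).
from pathlib import Path
import gnupg

gpg = gnupg.GPG(gnupghome=".competition-gnupg")

# Import every registered participant key into the local keyring.
known_fingerprints = set()
for key_file in Path("keys").glob("*.asc"):
    result = gpg.import_keys(key_file.read_text())
    known_fingerprints.update(result.fingerprints)

def signer_of(signed_path: str) -> str:
    """Return the fingerprint of a valid, registered signer, or raise."""
    with open(signed_path, "rb") as f:
        verified = gpg.verify_file(f)
    if not verified.valid or verified.fingerprint not in known_fingerprints:
        raise ValueError(f"{signed_path}: missing, invalid, or unknown signature")
    return verified.fingerprint
```

The extra identity checks (Gitcoin Passport, ENS, TXT records, …) would then hang off the verified fingerprint or the registered username.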
Submissions
In the simplest setup, participants can push their own encrypted submissions to the git repository hosting the competition. That might not work for everyone, as there might be times when participants don’t want to be linked to the competition via their git (realistically GitHub) user.
In those cases, the hosts can receive the files by any other means while the competition is running (e.g: a website form and GH actions), as long as the file is signed and hosts can run their identity verification scripts against it. In theory, participants could even send only hashes of the files (content addressing rules!) and host the files somewhere else (IPFS, HuggingFace, …)!
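A sketch of the content-addressing option (the digest choice and where it gets published are up to the hosts):

```python
# hash_submission.py -- illustrative only; participants publish the digest
# (in a PR, on IPFS, on HuggingFace, ...) instead of the file itself.
import hashlib
import sys

def sha256_digest(path: str) -> str:
    """Stream a file and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    # e.g: python hash_submission.py submission.csv.asc
    print(sha256_digest(sys.argv[1]))
```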
Custom Rules
A cool side effect of running things on git is that we can set up automations! These automations can take care of things like limiting the number of submissions per participant (see footnote 1), or generating a partial leaderboard every two days.
The most straightforward and open way to do this would be using GitHub Actions or similar! As soon as a participant sends a PR, an action could run all the needed data checks and rule conformance checks to either accept or reject the submission. Scheduled actions can generate a leaderboard, update datasets by revealing more data, …
If you can codify a rule, you can automate it! GitHub Actions logs are public, so you can check what is going on and verify the code has been running as you’d expect.
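For instance, here is a sketch of a check an action could run on every PR, assuming a hypothetical `submissions/<participant>/*.csv.asc` layout and the 3-submission limit mentioned above:

```python
# check_rules.py -- hypothetical CI check; the directory layout and the
# 3-submission limit are assumptions used for illustration.
from collections import Counter
from pathlib import Path
import sys

MAX_SUBMISSIONS = 3

def count_submissions(root: str = "submissions") -> Counter:
    """Count encrypted submissions per participant directory."""
    counts: Counter = Counter()
    for path in Path(root).glob("*/*.csv.asc"):
        counts[path.parent.name] += 1
    return counts

if __name__ == "__main__":
    offenders = {u: n for u, n in count_submissions().items() if n > MAX_SUBMISSIONS}
    if offenders:
        print(f"Over the {MAX_SUBMISSIONS}-submission limit: {offenders}")
        sys.exit(1)  # a non-zero exit makes the workflow (and the PR check) fail
    print("All participants are within the submission limit.")
```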
Evaluation
Similar to the above, the easiest way to evaluate is to run a script (`python leaderboard.py`) at the end of the competition that generates a simple leaderboard with the participants’ scores. This is something anyone will be able to replicate locally if needed.
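A minimal sketch of such a script, assuming decrypted submissions live under `decrypted/<participant>.csv`, that both `test.csv` and the submissions share an `id,label` schema, and that accuracy is the metric (all of these are placeholder assumptions):

```python
# leaderboard.py -- illustrative sketch; file layout, schema, and metric are
# placeholder assumptions, not part of any specific competition.
import csv
from pathlib import Path

def load_labels(path) -> dict:
    """Read an id -> label mapping from a CSV with 'id' and 'label' columns."""
    with open(path, newline="") as f:
        return {row["id"]: row["label"] for row in csv.DictReader(f)}

def accuracy(truth: dict, preds: dict) -> float:
    """Fraction of test ids whose predicted label matches the ground truth."""
    return sum(1 for k, v in truth.items() if preds.get(k) == v) / len(truth)

if __name__ == "__main__":
    truth = load_labels("test.csv")
    scores = {sub.stem: accuracy(truth, load_labels(sub))
              for sub in Path("decrypted").glob("*.csv")}
    for participant, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{participant}\t{score:.4f}")
```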
Discussions and Other Goodies
As a side effect of using a git platform like GitHub or alternatives, we get issues and discussions where participants can raise concerns, report problems with the competition, or simply share their approaches/questions while the competition is running.
If the hosts want it, participants could push (or link with a hash to) their models’ code or any other artifacts. Anyone (or a GitHub Action) could then replicate their results! This setup makes a great framework for encouraging the use of open models and open code.
Finally, since there is a public record of changes to the code, participants can be rewarded for improving the competition infrastructure. This could be ad-hoc, retroactive, or even bounty style (e.g: $500 USDC if you make the leaderboard a website, …)!
Conclusion
This is mostly a thought experiment, but something that could probably work for smaller competitions without much extra effort. Right now, I’d say all the competitions on Pond have been “small” (fewer than 20 serious participants) and would be a good fit for this setup.
Oh! One last benefit of using a git platform is that it will probably outlast any external platform (e.g: Kaggle, Pond, …) and the community will have more control over it!
I’m sure there are many optimizations to this process and I’d love to hear them, so please reach out!
Footnotes
1. An idea I am converging on is that participants should only be allowed to send one final submission and be evaluated on that one. The first reason is that this gives better sybil resistance to the competition (sybils get only one submission per account they control instead of the `3 * $DURATION_DAYS * $N_ACCOUNTS` they get now; in some of the current competitions, that is enough that it might be worth running a Bayesian Optimization on the public scores to approximate the shape of the scoring function). The second reason is that real life doesn’t have a leaderboard. When you are building a model, you don’t have a “hidden test set” to get scored on. You have to build your own validation mechanism and think about how to avoid overfitting. Then you deploy or apply the model.