Eliciting useful datasets

Wed Feb 11 2026

Open datasets are everywhere. Maintained datasets are rare.

I keep seeing the same pattern in open data ecosystems. A few folks do the expensive curation work, the rest of us free-ride, and eventually the dataset goes stale, because data wrangling is time-consuming, tedious, and technically demanding. Spending time curating and maintaining datasets for other people to use doesn’t make economic sense unless you can somehow profit from it.

This post is about a simple question, and a potential solution. The question is: Can we design a credibly neutral way to incentivize and elicit useful datasets for tasks with benchmarks? The solution I came up with is a mechanism I call “Tributary”. Let’s dive in.

Mechanism

Tributary is a proof-of-concept mechanism that works like a flipped, open-source, Kaggle-ish competition: the model and the scoring criteria are fixed, and contributors compete by submitting datasets instead of models.

So instead of “who trained the best model?”, the question is “whose data most improved the task you care about?”.

Having a fixed model and scoring criteria makes the evaluation objective and reproducible. It works like a benchmark for the most useful data.
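
To make that concrete, here is a minimal sketch of what such a frozen evaluation harness could look like. The model family, hyperparameters, and metric below are stand-ins I picked for illustration, not what tributary actually pins down; the point is only that every submission is scored by the same deterministic code path.

```python
# Minimal sketch of a frozen, reproducible evaluation harness.
# The model family, hyperparameters, and metric are illustrative stand-ins.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

SEED = 0  # fixed seed so every re-run produces the same score


def score_submission(train_X, train_y, test_X, test_y) -> float:
    """Train the frozen model spec on submitted data, score it on the hidden test set."""
    model = LogisticRegression(max_iter=1000, random_state=SEED)
    model.fit(train_X, train_y)
    return accuracy_score(test_y, model.predict(test_X))
```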

This framing is close to DataPerf Training Set Acquisition, where the core problem is deciding what data to buy under constraints and then scoring quality on a held-out evaluation set.

Credible Neutrality

In this mechanism, credible neutrality maps to:

  1. No hand-picked winners in the rules. If the data improves the score, it gets rewarded.
  2. Open code and verifiable execution. Anyone can check the rules and verify the results.
  3. Simple mechanism before fancy economics.
  4. Slow-changing rules so people can trust the game.

In practice, this pushes the design toward plain and simple git repositories, public scripts, deterministic evaluation, and auditable artifacts.
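
As a sketch of what “auditable artifacts” could mean in practice, one option is to publish content hashes of the frozen scoring code, each submission as received, and the resulting leaderboard, so anyone re-running the public scripts can check that the bytes match. The file names below are hypothetical.

```python
# Hash the evaluation inputs and outputs so results can be independently verified.
import hashlib
import json
from pathlib import Path


def sha256_of(path: str) -> str:
    """Content hash of a file, suitable for committing next to the leaderboard."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def audit_record(eval_script: str, submission: str, leaderboard: str) -> str:
    """Bundle the hashes into a small JSON artifact that can live in the repo."""
    return json.dumps(
        {
            "eval_script": sha256_of(eval_script),
            "submission": sha256_of(submission),
            "leaderboard": sha256_of(leaderboard),
        },
        indent=2,
    )


# e.g. print(audit_record("evaluate.py", "my_dataset.csv.enc", "leaderboard.json"))
```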

Tributary

I built a minimal prototype, tributary, that implements the above mechanism. It has a public git repository with the evaluation scripts and a submissions.yaml index, a public key for encrypting submissions, a fixed model and training pipeline, a hidden test set used for scoring, and a leaderboard that ranks submissions by their marginal contribution.

Workflow

Say you want to contribute a dataset. You play the game by:

  1. Encrypting the dataset you want the model to be trained on with the repo’s public key (a minimal sketch of this step follows the list).
  2. Opening a PR that adds the dataset’s URL to submissions.yaml. Ideally the URL points to a CID (a content-addressed, immutable hash of the data).
  3. Waiting for the PR to get merged. tributary then downloads and decrypts the dataset, trains the fixed model, computes a score on the hidden test set, and updates the leaderboard based on the marginal contribution of your dataset.
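
Here is a rough sketch of the contributor side of step 1, assuming sealed-box encryption with a published NaCl public key. The key handling, file names, and submissions.yaml fields are my own illustrative assumptions, so check the repo for the actual format.

```python
# Contributor-side sketch: encrypt a dataset so only the evaluator can read it.
from pathlib import Path

from nacl.public import PrivateKey, PublicKey, SealedBox

# In practice you would load the public key published in the tributary repo;
# a throwaway keypair is generated here so the sketch runs end to end.
repo_private_key = PrivateKey.generate()
repo_public_key: PublicKey = repo_private_key.public_key


def encrypt_dataset(dataset_path: str, out_path: str, public_key: PublicKey) -> None:
    """Seal the dataset with the repo's public key before sharing it anywhere."""
    ciphertext = SealedBox(public_key).encrypt(Path(dataset_path).read_bytes())
    Path(out_path).write_bytes(ciphertext)


Path("my_dataset.csv").write_text("feature,label\n0.1,0\n0.9,1\n")  # toy data
encrypt_dataset("my_dataset.csv", "my_dataset.csv.enc", repo_public_key)

# You would then pin my_dataset.csv.enc somewhere immutable (e.g. IPFS, which
# gives you a CID) and open a PR adding an entry like this to submissions.yaml:
#
#   - url: ipfs://<CID of my_dataset.csv.enc>
#     contributor: your-handle
```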

Since we cannot directly verify every row, the mechanism pays for information that improves predictive power or agreement structure. This idea appears throughout the peer prediction and information elicitation literature.
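
In that spirit, the scoring step can be as simple as a before/after delta: train the frozen model on the current data pool, train it again with the new submission included, and pay out according to the difference on the hidden test set. The sketch below reuses the hypothetical score_submission harness from earlier; the actual attribution rule in tributary may be more involved.

```python
# Sketch of marginal-contribution scoring against the existing data pool.
import numpy as np


def marginal_contribution(pool_X, pool_y, new_X, new_y, test_X, test_y) -> float:
    """How much does adding the new dataset move the hidden-test-set score?"""
    baseline = score_submission(pool_X, pool_y, test_X, test_y)
    with_new = score_submission(
        np.concatenate([pool_X, new_X]),
        np.concatenate([pool_y, new_y]),
        test_X,
        test_y,
    )
    return with_new - baseline  # positive delta -> the submission earned a reward
```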

Conclusion

I don’t think this mechanism is perfect, and it definitely needs more work, but I do think it is a credibly neutral way to test a narrower claim: given a benchmarked task, can we reward the creation of useful data directly, in public, with rules everyone can audit?

Hopefully, something like this can eventually be used to reward the hard work of curating open datasets, and to start building a culture of dataset maintenance and stewardship.
