Deep Funding
The goal of Deep Funding is to develop a system that can allocate resources to public goods with a level of accuracy, fairness, and open access that rivals how private goods are funded by markets, ensuring that high-quality open-source projects can be sustained. Traditional price signals don’t exist, so we need “artificial markets” that can simulate the information aggregation properties of real markets while being resistant to the unique failure modes of public goods funding.
In Deep Funding, multiple mechanisms work together:
- A mechanism that generates an up-to-date and comprehensive DAG of relevant dependencies given a source node
- A mechanism that fills the graph with relevant weights. These weights represent the latent item utilities. There can be many ways of getting to them!
  - Aggregating human preferences (polls, pairwise comparisons, …)
  - Using prediction markets
  - Getting weights from an AI model
  - Collaborative filtering
  - Having experts fill weights manually
- A mechanism that takes that weight vector as input and distributes money to the projects
This problem touches data, mechanism design, and open source! Also, each layer can be optimized and iterated independently.
In its current shape, the graph’s vertices are projects and the edges carry the relative impact of each project on its parent. The same approach could be used for anything that fits this graph shape (e.g: science research).
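For concreteness, here is a minimal sketch of how such a graph could be represented in code. The project names and the `normalize_children` helper are illustrative, not part of any Deep Funding spec: each parent maps to its dependencies, and each outgoing edge carries the share of the parent’s impact attributed to that child, normalized to sum to 1.

```python
# Minimal sketch of a dependency DAG with relative-impact edge weights.
# Project names and the helper below are illustrative only.

def normalize_children(graph: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Rescale each parent's outgoing edge weights so they sum to 1."""
    normalized = {}
    for parent, children in graph.items():
        total = sum(children.values())
        normalized[parent] = {child: w / total for child, w in children.items()}
    return normalized

# Raw (unnormalized) judgments: "how much of the parent's impact does each child deserve?"
raw = {
    "ethereum-ecosystem": {"go-ethereum": 3.0, "solidity": 2.0, "ethers.js": 1.0},
    "go-ethereum": {"golang/crypto": 1.0, "goleveldb": 0.5},
}

weights = normalize_children(raw)
print(weights["ethereum-ecosystem"])  # {'go-ethereum': 0.5, 'solidity': 0.33..., 'ethers.js': 0.16...}
```

One natural way to use this structure is to let money flow down the DAG proportionally to the edge weights, which is why getting the per-parent weights right is the core problem.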
Desired Properties
- Credible Neutrality Through Transparent and Simple Mechanisms
- Comparative Truth-Seeking Over Absolute Metrics
- Plurality-Aware Preference Aggregation
- Collusion-Resistant Architecture
- Practical Scalability
- Credible Exit and Forkability
- Works on Public Existing Infrastructure
- Decentralized and Market-Like Mechanisms to Incentivize Useful Curation
- Dependencies reveal themselves through market mechanisms rather than being declared
- Skin in the Game. Participants have something to lose from bad assessments
- Project Independence (no need to participate in the process to get funded)
Current Approach
So far, Deep Funding has been implemented like this:
- A list of projects is chosen. This is usually provided by an external entity or process (e.g: the best model from the ML competition chooses the next 100 projects). So far a DAG/graph structure has not been needed since all projects have been compared for their impact on the “Ethereum Ecosystem”.
- Jurors do pairwise comparisons between projects. An aggregation method is chosen (Huber loss, L2 norm in log space, …) to derive the “ground truth” relative project weights (a sketch of this aggregation step follows this list).
- An ML competition and a Prediction Market are kicked off. Modelers and traders are evaluated against a holdout set of pairwise comparisons.
- Participants are rewarded based on how close they get to the “jurors’ ground truth”.
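As a rough sketch of that aggregation step, assuming juror answers come in as ratios of the form “A deserves r times the credit of B” and picking Huber loss on log-ratios as one of the choices listed above (the function names and the toy data are made up):

```python
import numpy as np
from scipy.optimize import minimize

def huber(x, delta=1.0):
    """Standard Huber penalty: quadratic near zero, linear in the tails (robust to messy jurors)."""
    return np.where(np.abs(x) <= delta, 0.5 * x**2, delta * (np.abs(x) - 0.5 * delta))

def fit_weights(n_projects, comparisons, delta=1.0):
    """comparisons: list of (a, b, ratio) meaning "project a deserves `ratio` times b's credit".
    Fits log-scores s so that s[a] - s[b] ~ log(ratio), then normalizes exp(s) into weights."""
    def objective(s):
        residuals = np.array([s[a] - s[b] - np.log(r) for a, b, r in comparisons])
        # Tiny ridge term pins the free additive offset of the log-scores.
        return huber(residuals, delta).sum() + 1e-6 * np.dot(s, s)

    result = minimize(objective, np.zeros(n_projects), method="L-BFGS-B")
    w = np.exp(result.x)
    return w / w.sum()

# Toy example: 3 projects, a handful of (noisy, slightly contradictory) juror ratios.
juror_ratios = [(0, 1, 2.0), (1, 2, 1.5), (0, 2, 2.5), (0, 1, 3.0)]
print(fit_weights(3, juror_ratios))
```

Swapping `huber` for plain squared residuals gives the “L2 norm in log space” variant; as noted below, the choice between them moves project rewards around quite a bit.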
Open Problems
After participating in the ML competition and Prediction Market, and doing a few deep dives into the data and methodology, I think these are the main open problems.
- Juror Reliability
  - So far, expert jurors’ pairwise comparisons have been inconsistent, noisy, and low in statistical power
  - Getting comparisons has been quite expensive in time and resources
  - Asking jurors “how much better” introduces order-dependence and scale mismatch
  - Messy jurors have a disproportionate impact on the weights
  - Weights are not consistent due to the limited amount of data collected and the variance in it
  - Large graphs (hundreds of projects) make getting accurate weights from pairwise evaluation infeasible
    - E.g. the GG24 round has ~100 projects and would need more than 3000 “actively sampled” comparisons to get to a relative error of 10%
    - This paradigm requires more training examples than jurors can produce in a reasonable span of time
- Mechanism Settings
  - Some parameters have a large effect and haven’t been adjusted
  - The aggregation formula (Huber, log loss, Bradley-Terry, …) has a very large impact on both modelers/traders and project rewards
  - There needs to be more process around who chooses the aggregation formula and why it is chosen
  - In the pilot (Huber loss), some projects got weights on a scale jurors didn’t feel was reasonable (e.g: the EIPs repo got 30%)
  - The prediction market might discourage good modelers from participating, since time of entry matters more than having a good model
- Weights Evaluation
  - How do we measure success? If the goal of pattern recognition is to classify objects in a scene, it makes sense to score an algorithm by how often it succeeds. What is the equivalent for Deep Funding? What metric are we optimizing?
  - Once the weights are set, there isn’t a process to evaluate how “fit” they are
    - E.g: the current idea is to gather a connected graph of pairwise comparisons; why not use that to reward projects directly and skip the Prediction Market?
  - We need a falsifiable hypothesis to validate that Deep Funding is “better”
- Graph Maintenance
  - If the process takes a few weeks, the weights might change significantly (e.g: a project releases a major version)
  - Jurors are also affected by temporal drift and their preferences evolve over time
Ideas
Alternative Approach
Given the current open problems, here is an interesting alternative way (inspired by RLHF) of running a Deep Funding “round”. The gist of the idea is to use only a few significant data points to choose and reward the final models, instead of deriving weights for the entire set of children/dependencies of a project. Resolve the market with only a few, well-tested pairs!
Like in the current setup, a DAG of projects is needed. The organizers publish it, along with an encoded list of the projects that will be evaluated by jurors. Participants can only see the DAG; the “evaluated projects” will be revealed at the end.
Once participants have worked on their models and sent/traded their predictions, the “evaluated projects” list is revealed and only those projects are used to evaluate the predicted weights. The best strategy is to price all items truthfully. The question here is: how can we evaluate only a few projects without jurors producing a graph that connects them to all the other projects?
Since we don’t have a global view (no interconnected graph), we need to use comparative and scale-free metrics. Metrics like the Brier score or methods like Bradley-Terry can be used to evaluate any model’s or trader’s weights (in that case you’re fitting just a single global scale or temperature parameter to minimize negative log-likelihood)!
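One possible implementation of that scoring step, as a sketch rather than the official mechanism: treat each model’s published weights as fixed scores and fit only a single temperature `a` per model, so that P(i beats j) = sigmoid(a * (log w_i - log w_j)); the model with the lowest negative log-likelihood on the revealed juror pairs wins. All names below are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def subset_nll(weights, juror_pairs):
    """Negative log-likelihood of juror pairwise outcomes under a Bradley-Terry model
    whose scores are the model's log-weights times a single fitted temperature `a`.
    juror_pairs: list of (winner_index, loser_index) revealed at the end of the round."""
    s = np.log(np.asarray(weights, dtype=float))

    def nll(a):
        margins = np.array([a * (s[w] - s[l]) for w, l in juror_pairs])
        return np.sum(np.log1p(np.exp(-margins)))  # -log sigmoid(margin), summed over pairs

    best = minimize_scalar(nll, bounds=(1e-3, 50.0), method="bounded")
    return best.fun, best.x  # (lowest NLL, fitted temperature a)

# Two candidate weight vectors over 4 projects, scored against 5 revealed juror pairs.
juror_pairs = [(0, 1), (0, 2), (1, 3), (2, 3), (0, 3)]
model_a = np.array([0.40, 0.30, 0.20, 0.10])
model_b = np.array([0.10, 0.20, 0.30, 0.40])
print(subset_nll(model_a, juror_pairs))  # lower NLL: closer to how the jurors actually judged
print(subset_nll(model_b, juror_pairs))
```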
Once the best model is chosen (the one whose predictions are closest to the chosen subset of pairwise comparisons), the same pairwise comparisons can be used to adjust the scale of the weight distribution. That means the market resolution uses only the subset (for payouts to traders), but the funding distribution uses the model’s global ranking with its probabilities calibrated to the subset via a single scalar a that pins the entire slate to the scale that was verified by real judgments. The jurors’ pairwise comparisons can even be “merged” with the best model to incorporate all the data.
Basically, there are two steps: first, select the best model; then, rescale its weights using the jury’s pairwise comparisons. With far fewer comparisons, we can get to a better final weight distribution, since we have a more significant graph (relative weights) and we also use the golden juror pairs to adjust the scale.
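Continuing the sketch above (still an assumption about how this could be implemented, not a defined mechanism): the rescaling step can be as small as raising the winning model’s weights to the fitted exponent `a` and renormalizing, so the spread of the final distribution matches what the juror pairs actually support.

```python
import numpy as np

def rescale(weights, a):
    """Pin the winning model's weight distribution to the juror-verified scale:
    w_i -> w_i ** a, renormalized. a < 1 flattens the slate, a > 1 sharpens it."""
    w = np.asarray(weights, dtype=float) ** a
    return w / w.sum()

model_a = np.array([0.40, 0.30, 0.20, 0.10])
print(rescale(model_a, 0.5))  # flatter: jurors saw smaller gaps than the model claimed
print(rescale(model_a, 2.0))  # sharper: jurors saw bigger gaps
```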
The task of the organizers is to gather enough pairwise comparisons to make this subset significant, which is much simpler and more feasible than doing so for all the dependencies of a node (which can number 128). For example, we can estimate that to get a 10% relative error on the weights, we would need ~600 efficiently sampled pairs (or approximate rankings). Compare that with the 2000 needed to get a 20% relative error on 128 items.
Once the competition ends, extra comparisons could be gathered for projects that have high variance, or via some other trigger mechanism.
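One such trigger could be plain uncertainty-driven sampling. The sketch below assumes a Bradley-Terry-style view of the current weights and simply asks jurors next about the pair whose outcome is closest to 50/50 and has been compared the least; the `pick_next_pair` helper and the uncertainty proxy are made up for illustration.

```python
import numpy as np
from itertools import combinations

def pick_next_pair(log_scores, counts):
    """Uncertainty-driven pair selection: prefer pairs whose predicted outcome is
    close to 50/50 and that have few comparisons so far.
    log_scores: current estimated log-weights; counts[i, j]: comparisons already collected."""
    best_pair, best_score = None, -np.inf
    for i, j in combinations(range(len(log_scores)), 2):
        p = 1.0 / (1.0 + np.exp(-(log_scores[i] - log_scores[j])))  # predicted P(i beats j)
        uncertainty = p * (1.0 - p)             # maximal when the model can't tell them apart
        novelty = 1.0 / (1.0 + counts[i, j])    # damp pairs that were already asked about a lot
        if uncertainty * novelty > best_score:
            best_pair, best_score = (i, j), uncertainty * novelty
    return best_pair

log_scores = np.log([0.40, 0.35, 0.15, 0.10])
counts = np.zeros((4, 4))
counts[0, 1] = counts[1, 0] = 5  # projects 0 and 1 were already compared five times
print(pick_next_pair(log_scores, counts))  # asks about a still-ambiguous, under-sampled pair
```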
More Ideas
- Set a consensus over which meta-mechanism is used to evaluate weights (e.g: Brier score). Judge/rank mechanisms/models solely on their performance against a rigorous pre-built eval set. No subjective opinions. Just a leaderboard of the most effective weight distributions.
- No intensity, just more good ol’ pairwise comparisons!
  - Intensity requires global knowledge, suffers from interpersonal scale differences, and humans are incoherent when assigning intensities (even within the same order of magnitude).
- Make it easy and smooth for people to make their comparisons. Use LLM suggestions, good UX with details, remove any friction, and get as many comparisons as possible. Filter after the fact using heuristics or something simpler like a whitelist. If there is a test set (labels from people the org trusts), evaluate against it to choose the best labelers.
- Fields that use pairwise comparisons:
  - Psychology (psychometrics), trying to predict the latent utilities of items
  - Consumer science, doing preference testing
  - Decision analysis
  - Marketing (also with top-k or best-worst methods)
  - Recommendation systems
  - Sports (Elo)
  - RLHF
- We should test the assumption that expert jurors give good results. Jurors are messy and not well calibrated. Collecting more information from “expert” jurors will probably add more noise. We should instead assume noisy jurors and use techniques to deal with that.
  - There are better and more modern methods to derive weights from noisy pairwise comparisons (from multiple annotators)
  - Detect and correct for evaluators’ bias in the task of ranking items from pairwise comparisons
  - Use active ranking or dueling bandits to speed up the data gathering process
  - Stop with a “budget stability” rule (expected absolute dollar change from one more batch is less than a threshold)
- Do some post-processing on the weights:
  - Report accuracy/Brier and use a paired bootstrap to see if the gap is statistically meaningful (a small sketch of this follows the list)
  - If gaps are not statistically meaningful, bucket rewards (using Zipf’s law) so it feels fair
- If anyone can rate (or jury selection is more relaxed), you can remove low-quality raters with heuristics or keep only the best N raters (crowd BT)
  - Crowdsourced annotators are often unreliable; effectively integrating multiple noisy labels to produce accurate annotations is arguably the most important consideration when designing and implementing a crowdsourcing system.
- To gather more comparisons, a top-k method could be used instead of pairwise. Show 6 projects. Ask for the top 3 (no need to order them). A tiny converter from top-k picks to implied pairwise wins is sketched after this list.
- What would things look like with Bayesian Bradley-Terry instead of classic Bradley-Terry? Since comparisons are noisy and we have unreliable jurors, can we compute distributions instead of point “skills”?
- Let the dependent project set its weight percentages if it’s around
- Instead of one canonical graph, allow different stakeholder groups (developers, funders, users) to maintain their own weight overlays on the same edge structure. Aggregate these views using quadratic or other mechanisms
- If there is a plurality of these “dependency graphs” (or just different sets of weights), the funding organization can choose which one to use! The curators gain a % of the money for their service. This creates a market-like mechanism that incentivizes useful curation.
- Have hypercerts or similar. The price of these (total value) sets the weights across dependencies (numpy’s certificates trade at 3x the price of a utility library, and the edge weight reflects this)
- If there are reviewers/validators/jurors, they need to be public so they can build some sort of reputation
- Reputation system / Elo for jurors whose scores are closer to the final one. This biases towards averages
- Account for jurors’ biases with hierarchical Bradley-Terry or similar
- Allow anyone to be a juror; select jurors based on their skills
- Stake-based flow:
  - Anyone can propose a new edge, and anyone can stake money on it. If the edge gets funding, the stakers get rewarded. Could also be quadratic-voting style, where you vouch for something.
  - Should the edge weights/stakes decay over time unless refreshed by new attestations?
  - Quadratic funding or similar for the stake weighting, to avoid plutocracy
  - Anyone can challenge an edge by staking against it
  - Human attestations from project maintainers or a committee
- Doing something similar to Ecosyste.ms might be a better way
  - A curated set of repositories. You fund that dependency graph + weights.
  - Could be done by looking at funding or licenses (if there is a license that declares your deps).
- Run the mechanism on “eras” / batches so the graph changes and the weights evolve.
- How to expand to a graph of dependencies that are not only code?
  - Academic papers and research that influenced design decisions
  - Cross-language inspiration (e.g., Ruby on Rails influencing web frameworks in other languages)
  - Standards and specifications that enable interoperability
- Allow projects to “insure” against their critical dependencies disappearing or becoming unmaintained. This creates a market signal for dependency risk and could fund maintenance of critical-but-boring infrastructure
- Composable Evaluation Criteria
  - Rather than a single weighting mechanism, allow different communities to define their own evaluation functions (security-focused, innovation-focused, stability-focused) that can be composed. This enables plurality while maintaining comparability
- Create a bounty system where anyone can claim rewards for discovering hidden dependencies (similar to bug bounties)
  - This crowdsources the graph discovery problem and incentivizes thorough documentation.
- Projects can opt out of the default distribution and declare a custom one. Organizers can allow or ignore that
  - Self-declaration needs a “contest process” to resolve issues/abuse.
  - A Harberger tax on self-declarations? Bayesian Truth Serum for weight elicitation?
- Projects continuously auction off “maintenance contracts” where funders bid on keeping projects maintained. The auction mechanism reveals willingness-to-pay for continued operation. Dependencies naturally emerge as projects that lose maintenance see their dependents bid up their contracts
- Explore Rank Centrality. Theoretical and empirical results show that, with a comparison graph that has a decent spectral gap, O(n log n) pair samples suffice for accurate scores and rankings (a small sketch follows this list).
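A minimal Rank Centrality sketch for the last idea, kept deliberately toy-sized: build a Markov chain where each comparison pushes probability mass toward the winner and read the scores off its stationary distribution. It follows the standard construction in spirit; the win counts and helper name are made up.

```python
import numpy as np

def rank_centrality(wins):
    """wins[i, j] = number of times project j beat project i.
    Builds the Rank Centrality Markov chain (transitions flow toward winners)
    and returns its stationary distribution as the score/weight vector."""
    totals = wins + wins.T                                          # comparisons per pair
    frac = np.divide(wins, totals, out=np.zeros_like(wins), where=totals > 0)
    d_max = max(1, int((totals > 0).sum(axis=1).max()))             # max comparison degree
    P = frac / d_max
    np.fill_diagonal(P, 0.0)
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))                        # self-loops so each row sums to 1
    eigvals, eigvecs = np.linalg.eig(P.T)                           # stationary dist = eigenvector for eigenvalue 1
    pi = np.abs(np.real(eigvecs[:, np.argmax(np.real(eigvals))]))
    return pi / pi.sum()

# Toy win counts among 3 projects (entry [i, j]: times j beat i).
wins = np.array([[0.0, 1.0, 0.0],
                 [3.0, 0.0, 1.0],
                 [4.0, 3.0, 0.0]])
print(rank_centrality(wins))  # project 0 should get the largest share
```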
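For the “show 6 projects, ask for the top 3” idea above, each answer already implies a batch of pairwise comparisons that can feed the same Bradley-Terry or Rank Centrality machinery. A tiny converter, with hypothetical project names:

```python
from itertools import product

def topk_to_pairs(shown, chosen):
    """Convert a "pick your top k out of m" answer into implied pairwise wins:
    every chosen project beats every non-chosen project shown on the same screen.
    No ordering is implied among the chosen, nor among the rejected."""
    rejected = [p for p in shown if p not in set(chosen)]
    return list(product(chosen, rejected))  # (winner, loser) pairs

shown = ["geth", "solidity", "ethers.js", "hardhat", "remix", "web3.py"]
chosen = ["geth", "solidity", "hardhat"]
print(topk_to_pairs(shown, chosen))  # 3 x 3 = 9 implied comparisons from a single screen
```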
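And for the post-processing idea of checking whether the gap between two models (or two weight distributions) is statistically meaningful before paying them very differently, a paired-bootstrap sketch over per-pair losses on the same held-out comparisons; the loss values are invented for the example.

```python
import numpy as np

def paired_bootstrap_gap(losses_a, losses_b, n_boot=10_000, seed=0):
    """losses_a / losses_b: per-pair losses (e.g. Brier scores) of two models on the
    same held-out juror pairs. Returns the observed mean gap (A minus B) and the
    fraction of bootstrap resamples where A is *not* better, a rough p-value."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(losses_a, dtype=float) - np.asarray(losses_b, dtype=float)
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))  # resample pairs with replacement
    boot_gaps = diffs[idx].mean(axis=1)
    return diffs.mean(), float((boot_gaps >= 0).mean())

# Toy per-pair Brier scores for two models on 8 held-out juror pairs.
model_a = [0.10, 0.20, 0.05, 0.30, 0.15, 0.25, 0.10, 0.20]
model_b = [0.15, 0.22, 0.20, 0.28, 0.18, 0.30, 0.12, 0.35]
gap, p_not_better = paired_bootstrap_gap(model_a, model_b)
print(gap, p_not_better)  # negative gap = A has lower loss; if p is high, bucket the rewards instead
```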