Impact Evaluators

Impact Evaluators are frameworks for coordinating work and aligning Incentives in complex Systems. They provide mechanisms for retrospectively evaluating and rewarding contributions based on impact, helping solve coordination problems in Public Goods Funding.

It’s hard to do Public Goods Funding, open-source software, research, etc. that don’t have a clear, immediate financial return, especially high-risk/high-reward projects. Traditional funding often fails here. Instead of just giving money upfront (prospectively), Impact Evaluators create systems that look back at what work was actually done and what impact it actually had (retrospectively). It’s much easier to judge the impact in a retrospective way!

The extent to which an intervention is causally responsible for an specific outcome (intended or unintended) is a hard thing to figure out. There are many classic approaches; Theory of Change, Data Analysis, ML, …
The goal is to create a system with strong Incentives for people/teams to work on valuable, uncertain things by distributing a reward according to the demonstrable impact.
Impact Evaluators work well on concrete areas where you can turn into easily measurable metrics. Impact Evaluators are powerful and will overfit. When the goal is not well aligned, they can be harmful. E.g: Bitcoin increasing the energy consumption of the planet. Impact Evaluators can become Externalities Maximizers.
Start local and iterate.
- Begin with small communities with their own Metrics and evaluation criteria.
- Use rapid Feedback Loops to learn what works.
- Each community understands its context better than outsiders (seeing like a state blinds you to local realities).
- Multiple local experiments surface patterns for higher-level abstractions.
- Impact evaluation should be done by the community at the local level.
  - E.g: “Developers” in OSO filter for GitHub accounts with more than 5 commits. Communities might or might not align with that metric.
- Focus on positive sum games and mechanisms.
- Small groups enable iterated games that reward trust and penalize defection. Reduced size reduces friction.
- Have a deadline or something like that so it fades away if it’s not working or actively used.
- The McNamara Fallacy. Never choose metrics on the basis of what is easily measurable over what is meaningful. Data is inherently objectifying and naturally reduces complex conceptions and process into coarse representations. There’s a certain fetish for data that can be quantified.
- Cultivate a culture which welcomes experimentation.
- Ostrom’s Law. “A resource arrangement that works in practice can work in theory”
- Even in environments with clear and easy to get metrics, someone has to make a decision of why that metric results in a better allocation instead of another similar or a combination. It’s all humand judgment / governance!
  - There is no way around defining a metric/loss function to evaluate allocations! Discussions should be focused on this instead of the specific of an allocation mechanism.
Community Feedback Mechanism.
- Implement robust feedback systems that allow participants to report and address concerns about the integrity of the metrics or behaviors in the community.
- Use the feedback to refine and improve the system.
- Prioritize consent and community feedback.
- Community should steer the ship.
- You want a reactive and self-balancing system. Loops where one part reacts to the other parts.
  - Feedback loop with the errors of the previous round.
- Design a democratic control that reacts to feedback.
- Allow people to express themselves as much as they want.
  - E.g: an expert can give very precise feedback/knowledge/weights to a set of projects, while a community member can give a more general feedback.
- Which algorithm is the best at assigning weights is not the best question.
  - What would you change about the algorithm?
  - What would you change about the process?
Communities usually lack important information to fund public goods
- Communities and institutions want to see a better, more responsive and dynamic provision of public goods within them but usually lack information about which goods have the greatest value and know quite a bit about social structure internally which would allow them to police the way GitCoin has in the domains it knows.
- Impact Evaluators act as a framework for information gathering and can help communities make better decisions.
- Open Data Platforms for the community to gather better data and make better decisions.
Simplicity as a principle.
- The simpler a mechanism, the less space for hidden privilege.
- Fewer parameters mean more resistance to corruption and overfit and more people engaging.
- Fix rules to keep things simple and easy to play. Opinionated framework with sane defaults!
- Demonstrably fair and impartial to all participants (open source and publicly verifiable execution), with no hidden biases or privileged interests.
- Don’t write specific people or outcomes into the mechanism (e.g: using multiple accounts)
Build anti-Goodhart resilience.
- Any metric used for decisions becomes subject to gaming pressures.
- Design for evolution:
  - Run multiple evaluation algorithms in parallel and let humans choose.
  - Use exploration/exploitation trade-offs (like multi-armed bandits) to test new metrics.
  - Make the meta-layer for evaluating evaluators explicit.
- For areas/ecosystems with a continuous and evaluable output (e.g: “better path finding algorithm”, “ROC AUC of X”, …), follow Bittensor model.
- The easier to verify the solution is (e.g: verify a program passes the test vs verify the experiment replicates), the less human judgment is needed, the less Goodhart’s Law applies.
- If the domain of the IE is sortable and differentiable, it can be seen as pure optimization and doesn’t require humans subjective evaluation.
Collusion resistance.
- Any mechanism helping under-coordinated parties will also help over-coordinated parties extract value. Countermeasures include:
  - Identity-free incentives (like proof-of-work).
  - Fork-and-exit rights for minorities.
  - Privacy pools that exclude provably malicious actors.
  - Multiple independent “dashboard organizations” preventing capture.
- They should be flexible as it’s hard to predict ways the evaluation metrics will be gamed.
- Campbell’s Law. The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.
Separate data from judgment. Impact Evaluators work like data-driven organizations:
- Gather objective attestations about work (commits, usage stats, dependencies).
- Apply multiple “evaluation lenses” to interpret the data.
- Let funders choose which lenses align with their values.
- Prefer Pairwise Comparisons for human input over absolute scoring. Standard methods turn pairs into scores/ranks and handle noisy raters.
  - Pairwise comparisons are useful to abstract “impact” and other complex and subjective metrics. Makes the decision more local and easier. More of a hunch based decision.
- Data is good at providing comprehensive coverage of things that are countable. Data is bad at dealing with nuances and qualitative concepts that experts intuitively understand.
Design for composability. Define clear data structures (graphs, weight vectors) as APIs between layers.
- Multiple communities could share measurement infrastructure.
- Different evaluation methods can operate on the same data.
- Evolution through recombination rather than redesign.
- Goal is evolutionary impact evaluation so people that are good, thrive.
- To create a permissionless way for projects to participate, staking is a solution.
- Fix a Data Structure (API) for each layer so they can compose (graph, weight vector).
  - E.g: Deep Funding problem data structure is a graph. Weights are a vector/dict, …
Embrace plurality over perfection.
- No single mechanism can satisfy all desirable properties (efficiency, fairness, incentive compatibility, budget balance). Different contexts need different trade-offs.
- There will be no “stable state”. Whenever you fix an evaluation, some group has an incentive to abuse or break it again and feast on the wreckage.
- This is the formal impossibility theorem that no mechanism can simultaneously achieve four desirable criteria:
  - Pareto Efficiency: The outcome achieved by the mechanism maximizes the overall welfare or some other desirable objective function.
  - Incentive Compatibility: Designing mechanisms so that participants are motivated to act truthfully, without gaining by misrepresenting their preferences.
  - Individual Rationality: Ensuring that every participant has a non-negative utility (or at least no worse off) by participating in the mechanism.
  - Budget Balance: The mechanism generates sufficient revenue to cover its costs or payouts, without running a net deficit.
- If you do something with a large “impact” and I do something with less “impact”. It’s clear you deserve more. How much more, is debatible. Depends on the goals of the organizers!
- In most of the mechanisms working nowadays (e.g: Deep Funding), there are arbitrary decissions that affect the allocation.
  - Small rules might have a disproportionate impact.
Legible Impact Attribution. Make contributions and their value visible.
- Transform vague notions of “alignment” into measurable criteria that projects can compete on.
- Designing Impact Evaluators has the side effect of making impact more legible, decomposed into specific properties, which can be represented by specific metrics.
- Do more to make different aspects of alignment legible, while not centralizing in one single “watcher” (e.g: l2beats, …).
- Let projects compete on measurable criteria rather than connections.
- Create separation of evaluations through multiple independent “dashboard organizations”.
- Take into account that projects have a marginal utility function.
Incomplete contracts problem. It’s expensive to measure what really matters, so we optimize proxies that drift from true goals.
- Current markets optimize clicks and engagement over human flourishing.
- The more powerful the optimization, the more dangerous the misalignment.
- Four interconnected issues:
  - Incomplete contracts - It’s too expensive to measure what really matters (human flourishing), so we contract on proxies (hours worked, subscriptions).
  - Power asymmetries - Large suppliers face millions of individual consumers with take-it-or-leave-it contracts.
  - Externalities - Individual flourishing depends on community wellbeing, but contracts remain individualized.
  - Information asymmetries - Suppliers control the metrics and optimize for growth rather than user outcomes.
Information elicitation without verification.
- Getting truthful data from subjective evaluation when you can’t verify it requires clever Mechanism Design:
  - Peer prediction mechanisms that reward agreement with hidden samples
  - Bayesian Truth Serum that uses both answers and predictions.
  - Coordination games where truth serves as a Schelling point.
- Tradeoffs when jurors vote on public goods funding allocation:
  - Voting directly on projects: halo effect, peanut butter distributions, heavy operational workload
  - Voting on models: feels too abstract for voters and doesn’t leverage their specific project expertise
  - Voting on metrics: judges just play with numbers until they get their favored allocation
An allocation mechanism can be seen as a measurement process, with the goal being the reduction of uncertainty concerning present beliefs about the future. An effective process will gather and leverage as much information as possible while maximizing the signal-to-noise ratio of that information — aims which are often at odds.
In the digital world, we can apply several techniques to the same input and evaluate the potential impacts. E.g: Simulate different voting systems and see which one fits the best with the current views. This is a case for the system to have a meta-evaluation mechanism that acts as a layer for human to express preferences.
Make evaluation infrastructure permissionless. Just as anyone can fork code, anyone should be able to fork evaluation criteria. This prevents capture and enables innovation.
- Anyone should be able to fork the evaluation system with their own criteria, preventing capture and enabling experimentation.
IEs are the scientific method in disguise, like AI evals.
There are two areas of Impmact Evaluators where coordination is needed: allocation rules and mechanism selection.
Focus on error analysis. Like in LLM evaluations, understanding failure modes matters more than optimizing metrics. Study what breaks and why.
- IEs will have to do some sort of “error analysis”. Is the most important activity in LLM evals. Error analysis helps you decide what evals to write in the first place. It allows you to identify failure modes unique to your application and data.
Reduce cognitive load for humans. Let algorithms handle scale while humans set direction and audit results.
- Use humans for sensing qualitative properties, machines for bookkeeping and preserve legitimacy by letting people choose/vote on the prefered evaluation mechanism.
Making it so people don’t have to do something is cool. Making it so people can’t do that thing is bad. E.g: time saving tools like AI is great but humans should be able to jump in if they want!
- If people don’t want to have their “time saved” have the freedom to express themselves. E.g: offer pairwise comparisons by default but let people expand on feedback or send large project reviews.
- Information gathering is messy and noisy. It’s hard to get a clear picture of what people think. Let people express themselves as much as they want.
The more humans get involved, the messier (papers, … academia). You cannot get away from humans in most problems.
Verify the evaluation is actually better than the baseline.
- Run multiple “aggregations” algorithms and have humans blindly select which one they prefer (blind test).
- The meta-layer can help compose and evaluate mechanisms. How do we know mechanism B is better than A? Or even better than A + B, how do we evolve things?
- Is the evaluation/reward better than a centralized/simpler alternative?
  - E.g: on tabular clinical prediction datasets, standard logistic regression was found to be on par with deep recurrent models.
- People only reveal their true opinions after seeing the result (you need to show people something and iterate based on their reactions in order to build something they actually want).
Exploration vs Exploitation. IEs are optimization processes with tend to exploit (more impact, more reward). This ends up with a monopoly (100% exploit). You probably want to always have some exploration.
IEs need to show how the solution is produced by the interactions of people each of whom possesses only partial knowledge.
Set a consensus over which meta-mechanism is used to evaluate weights (e.g: Brier Score). Judged/rank mechanism/models solely on their performance against the rigorous pre-built eval set. No subjective opinions. Just a leaderboard of the most aligned weight distributions.

Principles

Retrospective Reward for Verifiable Impact
Credible Neutrality Through Transparent and Simple Mechanisms
Comparative Truth-Seeking Over Absolute Metrics
Anti-Goodhart Resilience
Permissionless Scalability
Plurality-Aware Preference Aggregation
Collusion-Resistant Architecture
Credible Exit and Forkability
Composable and Interoperable Design
Works on Public Existing Infrastructure
Market-Based Discovery Mechanisms and Incentive Alignment

Reinforcement Learning
Cybernetics
Game Design
Social Choice Theory
Mechanism Design
Computational Social Choice
Machine Learning
Voting Theory
Process Control Theory
Large Language Models Evaluation

Impact Evaluators

Principles

Related Fields

Resources