Pairwise Comparisons
Pairwise comparison is any process of comparing entities in pairs to judge which entity in each pair is preferred, which has more of some quantitative property, or whether the two entities are identical. It is useful when aggregating human preferences.
Why Pairwise
- Humans are better at relative judgments than absolute scoring. Scale-free comparisons reduce calibration headaches (better judging systems).
- Simple binary choices reduce cognitive load and make participation easier than intensity sliders.
- Works across domains (psychometrics, recsys, sports, RLHF) where latent utility is hard to measure directly (pairwise calibrated rewards).
- Disagreement is signal. Diversity in raters surfaces semantics a single expert might miss (CrowdTruth).
Collecting Good Data
- Keep the UX fast and low-friction. Suggest options, keep context in the UI, and let people expand only if they want.
- Avoid intensity questions (how much better is A than B?). Answers are order-dependent and require raters to have global knowledge of the whole item set.
- Use active sampling/dueling bandits to focus on informative pairs. Stop when marginal value drops.
- With efficiently sampled pairs (or approximate rankings), far fewer comparisons are needed than when comparing every pair exhaustively.
- Top-k tasks (e.g. pick the best 3 of 6) can scale collection while remaining convertible to pairwise data.
- Expect noisy raters. Filter or reweight after the fact using heuristics or gold questions instead of overfitting to the biases of “experts”.
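As a rough illustration of how a top-k response can be turned into pairwise data, here is a minimal Python sketch. The function name `topk_to_pairs` and the item labels are hypothetical, and it assumes the chosen items are unordered among themselves, so each chosen item is only recorded as beating each unchosen item.

```python
from itertools import product


def topk_to_pairs(shown, chosen):
    """Expand one 'pick the best k of n' response into (winner, loser) pairs.

    Assumption: the rater gave no order among the chosen items, so only
    chosen-vs-unchosen pairs are emitted.
    """
    chosen_set = set(chosen)
    rejected = [item for item in shown if item not in chosen_set]
    return list(product(chosen, rejected))


# One "pick best 3 of 6" response (hypothetical item labels).
pairs = topk_to_pairs(["a", "b", "c", "d", "e", "f"], ["a", "c", "e"])
print(len(pairs))
```

Under this assumption a single best-3-of-6 response yields 3 x 3 = 9 comparisons, which is where the scaling benefit comes from. If raters also rank their chosen items, pairs among the chosen items could be added as well.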
Aggregation and Evaluation
- There are many aggregation and evaluation rules: Bradley-Terry, Huber in log-space, Brier, …
- Converting pairs into scores or rankings is standard; start with Elo/Bradley-Terry (or crowd-aware variants) before custom models.
- Use robust methods (crowd BT, hierarchical BT, Bayesian variants) to correct annotator bias and uncertainty.
- Expert jurors can be inconsistent, biased, and expensive. Large graphs of comparisons are needed to tame variance. You can estimate how many pairwise comparisons are needed for a ranking to be statistically significant.
- Report accuracy or Brier score with bootstrap confidence intervals.
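To make the Bradley-Terry and bootstrap points concrete, here is a small, dependency-free Python sketch. It is one possible baseline, not a reference implementation: the function names, the toy data, and the `prior` regularizer (a virtual opponent of strength 1 that each item both beats and loses to) are all assumptions made for illustration. The bootstrap resamples only the evaluation pairs and keeps the fitted strengths fixed.

```python
import random
from collections import defaultdict


def fit_bradley_terry(pairs, n_iter=500, prior=0.5):
    """Fit Bradley-Terry strengths from (winner, loser) pairs with MM updates."""
    items = sorted({x for pair in pairs for x in pair})
    wins = {i: 0.0 for i in items}
    games = {i: defaultdict(float) for i in items}  # games[i][j] = comparisons of i vs j
    for winner, loser in pairs:
        wins[winner] += 1
        games[winner][loser] += 1
        games[loser][winner] += 1
    strength = {i: 1.0 for i in items}
    for _ in range(n_iter):
        new = {}
        for i in items:
            denom = sum(n / (strength[i] + strength[j]) for j, n in games[i].items())
            # Assumed regularizer: `prior` wins and `prior` losses against a
            # virtual opponent with strength 1, which also fixes the scale.
            denom += 2 * prior / (strength[i] + 1.0)
            new[i] = (wins[i] + prior) / denom
        strength = new
    return strength


def win_prob(strength, a, b):
    """Bradley-Terry probability that a beats b."""
    return strength[a] / (strength[a] + strength[b])


def brier(strength, pairs):
    """Mean squared error of the predicted win probability (outcome is always 1)."""
    return sum((1.0 - win_prob(strength, w, l)) ** 2 for w, l in pairs) / len(pairs)


def bootstrap_brier(strength, pairs, n_boot=1000, seed=0):
    """Approximate 95% percentile interval for the Brier score over resampled pairs."""
    rng = random.Random(seed)
    scores = sorted(
        brier(strength, rng.choices(pairs, k=len(pairs))) for _ in range(n_boot)
    )
    return scores[int(0.025 * n_boot)], scores[int(0.975 * n_boot) - 1]


# Toy data (hypothetical): a usually beats b, b usually beats c.
train = ([("a", "b")] * 8 + [("b", "a")] * 2 + [("b", "c")] * 7
         + [("c", "b")] * 3 + [("a", "c")] * 9 + [("c", "a")] * 1)
held_out = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "b"), ("a", "b")]

strength = fit_bradley_terry(train)
score = brier(strength, held_out)
low, high = bootstrap_brier(strength, held_out)
print(sorted(strength, key=strength.get, reverse=True), score, (low, high))
```

The crowd-aware and hierarchical variants mentioned above would replace `fit_bradley_terry` with a model that also has per-annotator parameters. The evaluation helpers could stay the same as long as the model exposes a win probability.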