Pairwise Comparisons

Pairwise comparison is any process of comparing entities in pairs to judge which of the two is preferred, which has a greater amount of some quantitative property, or whether the two are identical. Pairwise data is especially useful when aggregating human preferences.

Why Pairwise

  • Humans are better at relative judgments than absolute scoring. Scale-free comparisons avoid calibration headaches across raters (better judging systems).
  • Simple binary choices carry less cognitive load than intensity sliders, making participation easier.
  • Works across domains (psychometrics, recsys, sports, RLHF) where latent utility is hard to measure directly (pairwise calibrated rewards).
  • Disagreement is signal. Diversity in raters surfaces semantics a single expert might miss (CrowdTruth).

Collecting Good Data

Aggregation and Evaluation

  • There are many aggregation and evaluation rules: Bradley-Terry, Huber loss in log-space, Brier score, …
  • Converting pairs into scores or rankings is standard; start with Elo/Bradley-Terry (or crowd-aware variants) before custom models.
  • Use robust methods (crowd BT, hierarchical BT, Bayesian variants) to correct annotator bias and uncertainty.
  • Expert jurors can be inconsistent, biased, and expensive; large graphs of comparisons are needed to tame variance. You can estimate how many pairwise comparisons are needed before a ranking becomes statistically significant.
  • Report accuracy/Brier with bootstrap confidence intervals rather than point estimates.
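The Bradley-Terry starting point and the bootstrap reporting above can be sketched together. This is a minimal illustration, not a reference implementation: the MM (minorize-maximize) update is the classic fitting procedure for Bradley-Terry, and all function names and the simulated data are made up for the example.

```python
import random

def fit_bradley_terry(pairs, n_items, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) index pairs
    via the classic MM update: p_i = W_i / sum_j 1/(p_i + p_j)."""
    wins = [0.0] * n_items                      # total wins per item
    for w, _ in pairs:
        wins[w] += 1
    p = [1.0] * n_items                         # strengths, uniform init
    for _ in range(iters):
        denom = [0.0] * n_items
        for w, l in pairs:                      # each comparison contributes
            d = 1.0 / (p[w] + p[l])             # to both items' denominators
            denom[w] += d
            denom[l] += d
        p = [wins[i] / denom[i] if denom[i] else p[i] for i in range(n_items)]
        s = sum(p)
        p = [x * n_items / s for x in p]        # normalize (identifiability)
    return p

def bootstrap_accuracy(pairs, strengths, n_boot=1000, seed=0):
    """95% bootstrap interval for 'higher-strength item wins' accuracy."""
    rng = random.Random(seed)
    accs = []
    for _ in range(n_boot):
        sample = [pairs[rng.randrange(len(pairs))] for _ in range(len(pairs))]
        correct = sum(strengths[w] > strengths[l] for w, l in sample)
        accs.append(correct / len(sample))
    accs.sort()
    return accs[int(0.025 * n_boot)], accs[int(0.975 * n_boot)]

# Demo on simulated data: 4 items with known true strengths.
rng = random.Random(0)
true = [4.0, 2.0, 1.0, 0.5]
pairs = []
for _ in range(2000):
    i, j = rng.sample(range(4), 2)
    if rng.random() < true[i] / (true[i] + true[j]):
        pairs.append((i, j))                    # i beat j
    else:
        pairs.append((j, i))

scores = fit_bradley_terry(pairs, 4)
lo95, hi95 = bootstrap_accuracy(pairs, scores)
```

With well-separated true strengths and a few thousand comparisons, the fitted ordering matches the true one; with noisier or sparser data, the bootstrap interval widens, which is exactly the uncertainty the bullets above say to report.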

Resources