Pairwise Comparisons

Pairwise comparison is any process of comparing entities in pairs to judge which entity in each pair is preferred, which has more of some quantitative property, or whether the two entities are identical. It is especially useful for aggregating human preferences.

Why Pairwise

Humans are better at relative judgments than absolute scoring. Scale-free comparisons reduce calibration headaches (better judging systems).
Simple binary choices reduce cognitive load and make participation easier than intensity sliders.
The approach works across domains (psychometrics, recsys, sports, RLHF) where latent utility is hard to measure directly (pairwise calibrated rewards).
Disagreement is signal. Diversity in raters surfaces semantics a single expert might miss (CrowdTruth).

Collecting Good Data

Keep the UX fast and low-friction. Suggest options, keep context in the UI, and let people expand only if they want.
Avoid intensity questions. They are order-dependent and require global knowledge.
Use active sampling/dueling bandits to focus on informative pairs. Stop when marginal value drops.
With efficiently sampled pairs (or approximate rankings), far fewer comparisons are needed than the n(n-1)/2 of an exhaustive design.
Top-k tasks can scale collection (e.g., pick the best 3 of 6) while remaining convertible to pairwise data.
Expect noisy raters. Filter or reweight after the fact using heuristics or gold questions instead of overfitting to "experts'" biases.
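
The top-k conversion mentioned above can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name topk_to_pairs is hypothetical, and it assumes that every chosen item beats every shown-but-unchosen item, with no order implied among the chosen items or among the rest.

```python
from itertools import product


def topk_to_pairs(shown, chosen):
    """Convert one 'pick the best k of n' response into (winner, loser) pairs.

    Assumption: each chosen item is preferred over each shown item that was
    not chosen. Nothing is inferred about the order within either group.
    """
    chosen_set = set(chosen)
    if not chosen_set <= set(shown):
        raise ValueError("chosen items must be among the shown items")
    rest = [item for item in shown if item not in chosen_set]
    return list(product(chosen, rest))


# One rater picks the best 3 of 6 options.
pairs = topk_to_pairs(["a", "b", "c", "d", "e", "f"], ["b", "d", "e"])
print(len(pairs), pairs[:3])
```

A single best-3-of-6 response yields 3 x 3 implied comparisons under this assumption, which is where the scaling benefit comes from. The implied pairs from one response are not independent, so a downstream model may want to weight them accordingly.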

Aggregation and Evaluation

There are many aggregation/eval rules: Bradley-Terry, Huber in log-space, Brier, …
Converting pairs into scores or rankings is standard; start with Elo/Bradley-Terry (or crowd-aware variants) before custom models.
Use robust methods (crowd BT, hierarchical BT, Bayesian variants) to correct annotator bias and uncertainty.
Expert jurors can be inconsistent, biased, and expensive. Large graphs of comparisons are needed to tame variance. You can estimate how many pairwise comparisons are needed to make the ranking statistically significant.
You can report accuracy or Brier score with uncertainty estimated by bootstrap resampling.
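
As a rough sketch of the baseline suggested above, the following fits plain Bradley-Terry strengths with the standard minorization-maximization (MM) update and reports a Brier score with a percentile bootstrap interval. Several pieces are assumptions for illustration: the function names, the toy data, and the regularization (each item gets a small number of virtual wins and losses, `prior`, against a fixed opponent of strength 1, which keeps strengths finite for items that never win). It is not a crowd-aware or hierarchical variant.

```python
import random
from collections import defaultdict


def fit_bradley_terry(pairs, iters=200, prior=0.1):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via MM updates."""
    items = sorted({x for pair in pairs for x in pair})
    wins = defaultdict(float)
    games = defaultdict(float)   # games[(i, j)]: comparisons between i and j
    opponents = defaultdict(set)
    for w, l in pairs:
        wins[w] += 1.0
        games[(w, l)] += 1.0
        games[(l, w)] += 1.0
        opponents[w].add(l)
        opponents[l].add(w)
    p = {i: 1.0 for i in items}
    for _ in range(iters):
        new_p = {}
        for i in items:
            # Virtual opponent of strength 1: `prior` wins and `prior` losses.
            denom = 2.0 * prior / (p[i] + 1.0)
            for j in opponents[i]:
                denom += games[(i, j)] / (p[i] + p[j])
            new_p[i] = (wins[i] + prior) / denom
        p = new_p
    return p


def win_prob(p, a, b):
    """Model probability that a beats b."""
    return p[a] / (p[a] + p[b])


def brier(p, pairs):
    """Mean squared error of the predicted win probability for the actual winner."""
    return sum((1.0 - win_prob(p, w, l)) ** 2 for w, l in pairs) / len(pairs)


def bootstrap_brier_ci(p, pairs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Brier score on an evaluation set."""
    rng = random.Random(seed)
    scores = sorted(
        brier(p, rng.choices(pairs, k=len(pairs))) for _ in range(n_boot)
    )
    lo = scores[int(alpha / 2 * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy data (made up): a usually beats b, b usually beats c.
data = (
    [("a", "b")] * 8 + [("b", "a")] * 2
    + [("b", "c")] * 7 + [("c", "b")] * 3
    + [("a", "c")] * 9 + [("c", "a")] * 1
)
strengths = fit_bradley_terry(data)
ranking = sorted(strengths, key=strengths.get, reverse=True)
lo, hi = bootstrap_brier_ci(strengths, data)
print(ranking, round(brier(strengths, data), 3), (round(lo, 3), round(hi, 3)))
```

For brevity the toy example scores the model on the same pairs it was fitted on. In practice the Brier score would be computed on held-out comparisons, and every item in the held-out set must also appear in the training pairs for this sketch to work.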
