One of the major applications of AI/ML is ranking: the process of ordering items by their predicted relevance or importance to a particular query or context. Ranking powers recommendation systems such as social media feeds and search engines like Google. It is also central to compound AI systems built on retrieval-augmented generation (RAG), such as OpenAI’s Deep Research. These systems order items by relevance, and evaluating how accurately they do so is vital. In addition, ranking plays a crucial role in reward modeling for training large reasoning models.

With Large Language Models (LLMs) now acting as judges (LLM-as-a-Judge), we need robust ways to measure how well machine rankings align with human or expert rankings. In this post, we will explore two statistical measurements of ranked agreements: Kendall’s Tau and Spearman’s Rank Correlation.

Ranking Evaluation Overview

Both Kendall’s Tau and Spearman’s Rank Correlation measure how similar two rankings are, with some nuances:

  • Spearman’s Rank Correlation calculates the correlation between the ranks, considering the size of rank differences, making it sensitive to position changes.

  • Kendall’s Tau looks at pairs of items and counts how often they’re ordered the same way in both rankings, focusing on the overall order.

These tools are essential for technical folks working on AI/ML, especially in evaluating systems like retrieval-augmented generation (RAG) where relevance ranking is key.
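To make the nuance concrete, here is a small sketch comparing the two metrics on the same pair of rankings (the example rankings are illustrative). A single adjacent swap is a small, local disagreement; moving one item from the top to the bottom is a large displacement, and Spearman’s $\rho$ penalizes the latter more heavily than Kendall’s $\tau$ does:

```python
from scipy.stats import spearmanr, kendalltau

gold = [1, 2, 3, 4, 5]

# One adjacent swap: a small, local disagreement
local = [2, 1, 3, 4, 5]
# First and last items exchanged: one large displacement
extreme = [5, 2, 3, 4, 1]

rho_local, _ = spearmanr(gold, local)      # 0.9
tau_local, _ = kendalltau(gold, local)     # 0.8

rho_extreme, _ = spearmanr(gold, extreme)  # -0.6
tau_extreme, _ = kendalltau(gold, extreme) # -0.4

print(f"local swap:    rho={rho_local:.1f}, tau={tau_local:.1f}")
print(f"big move:      rho={rho_extreme:.1f}, tau={tau_extreme:.1f}")
```

Note that the large displacement drags $\rho$ all the way to $-0.6$ while $\tau$ only drops to $-0.4$: $\rho$ weights *how far* items move, while $\tau$ only counts *how many* pairs flip.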

Spearman’s Rank Correlation and Implementation

The Spearman Rank Correlation, or Spearman’s rho ($\rho$), is the Pearson (linear) correlation computed on the ranks of the items. For rankings without ties, it has the closed form:

$\rho=1-\frac{6\sum_{i=1}^n(d_i)^2}{n(n^2-1)}$

where $d_i$ is the difference between the ranks of the $i$-th item in the two rankings, and $n$ is the number of items.

Interpretation:
  • $\rho=1$ indicates perfect positive correlation.
  • $\rho=-1$ indicates perfect negative correlation.
  • $\rho=0$ indicates no correlation.

Spearman’s $\rho$ is sensitive to the magnitude of rank differences, making it suitable when the specific positions in the ranking matter. It’s often used when the exact ordering, including the degree of difference, is important.
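As a quick sanity check of the closed form, we can compute $\rho$ by hand from the rank differences $d_i$ and compare it against scipy.stats.spearmanr (the example rankings here are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

gold_rank = np.array([1, 2, 3, 4, 5])
model_rank = np.array([1, 3, 2, 5, 4])

# Closed-form Spearman's rho (valid when there are no ties)
d = gold_rank - model_rank  # rank differences d_i: [0, -1, 1, -1, 1]
n = len(gold_rank)
rho_manual = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

rho_scipy, _ = spearmanr(gold_rank, model_rank)
print(rho_manual, rho_scipy)  # both 0.8
```

Here $\sum d_i^2 = 4$, so $\rho = 1 - \frac{6 \cdot 4}{5 \cdot 24} = 0.8$, matching SciPy.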

Implementation

Spearman’s Rank Correlation is implemented in scipy.stats.spearmanr. As a best practice, we should report a bootstrap confidence interval alongside the point estimate.

import numpy as np
from scipy.stats import spearmanr

# Example rankings (gold standard vs. model output)
gold_rank = [1, 2, 3, 4, 5]
model_rank = [1, 3, 2, 5, 4]

n_bootstraps = 1000

rho, p_value = spearmanr(gold_rank, model_rank)

# Bootstrap a 95% confidence interval by resampling item indices
spearman_samples = []

for _ in range(n_bootstraps):
    # sample items with replacement, keeping the gold/model pairing
    indices = np.random.choice(len(gold_rank), size=len(gold_rank), replace=True)
    resampled_gold = [gold_rank[i] for i in indices]
    resampled_model = [model_rank[i] for i in indices]
    sample_rho, _ = spearmanr(resampled_gold, resampled_model)
    spearman_samples.append(sample_rho)

lower = np.percentile(spearman_samples, 2.5)
upper = np.percentile(spearman_samples, 97.5)

print(f"Spearman’s ρ: {rho:.3f}, p-value: {p_value:.3f}")
print(f"95% CI for Spearman’s ρ: [{lower:.3f}, {upper:.3f}]")

Kendall’s $\tau$ and Implementation

Kendall’s $\tau$ measures agreement between two rankings through pairwise comparisons; equivalently, the Kendall tau distance between two lists is proportional to the number of adjacent pairwise swaps needed to convert one ranking into the other. It is commonly used to determine which method is better relative to a “gold standard”: the higher the correlation between a method’s output ranking and the gold standard, the better the method is concluded to be. Pairs of rankings whose Kendall’s $\tau$ values are at or above 0.9 are often considered “effectively equivalent”.

It is defined as:

\[\tau = \frac{\text{number of concordant pairs} - \text{number of discordant pairs}}{n(n-1)/2}\]
  • Concordant pairs are pairs ordered the same way in both rankings.
  • Discordant pairs are pairs ordered differently.
Interpretation:
  • $\tau = 1$ indicates perfect agreement.
  • $\tau = -1$ indicates perfect disagreement (one is the reverse of the other).
  • $\tau = 0$ indicates no agreement.
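The pair-counting definition can be verified directly with a brute-force count over all item pairs (a sketch on the same illustrative rankings used throughout; the loop below is not from SciPy):

```python
from itertools import combinations
from scipy.stats import kendalltau

gold_rank = [1, 2, 3, 4, 5]
model_rank = [1, 3, 2, 5, 4]

concordant = discordant = 0
for i, j in combinations(range(len(gold_rank)), 2):
    # a pair (i, j) is concordant if both rankings order items i and j
    # the same way, i.e. the rank differences have the same sign
    sign = (gold_rank[i] - gold_rank[j]) * (model_rank[i] - model_rank[j])
    if sign > 0:
        concordant += 1
    elif sign < 0:
        discordant += 1

n = len(gold_rank)
tau_manual = (concordant - discordant) / (n * (n - 1) / 2)

tau_scipy, _ = kendalltau(gold_rank, model_rank)
print(concordant, discordant, tau_manual, tau_scipy)  # 8 2 0.6 0.6
```

Of the $\binom{5}{2} = 10$ pairs, 8 are concordant and 2 are discordant, giving $\tau = (8 - 2)/10 = 0.6$, matching SciPy.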

Implementation

Kendall’s $\tau$ can be computed using scipy.stats.kendalltau. Again, as a best practice, we should report a bootstrap confidence interval alongside the point estimate.

import numpy as np
from scipy.stats import kendalltau

# Example rankings (gold standard vs. model output)
gold_rank = [1, 2, 3, 4, 5]
model_rank = [1, 3, 2, 5, 4]

n_bootstraps = 1000

tau, p_value = kendalltau(gold_rank, model_rank)

# Bootstrap a 95% confidence interval by resampling item indices
tau_samples = []

for _ in range(n_bootstraps):
    # sample items with replacement, keeping the gold/model pairing
    indices = np.random.choice(len(gold_rank), size=len(gold_rank), replace=True)
    resampled_gold = [gold_rank[i] for i in indices]
    resampled_model = [model_rank[i] for i in indices]
    sample_tau, _ = kendalltau(resampled_gold, resampled_model)
    tau_samples.append(sample_tau)

lower = np.percentile(tau_samples, 2.5)
upper = np.percentile(tau_samples, 97.5)

print(f"Kendall’s τ: {tau:.3f}, p-value: {p_value:.3f}")
print(f"95% CI for Kendall’s τ: [{lower:.3f}, {upper:.3f}]")

Conclusion

Kendall’s $\tau$ and Spearman’s Rank Correlation are foundational tools for evaluating agreement between ranked lists in AI/ML systems. While Kendall’s $\tau$ emphasizes pairwise concordance, Spearman’s Rank Correlation $\rho$ measures rank-order alignment. Both metrics are critical for assessing systems like search engines, recommendation engines, RAG pipelines, and LLM-as-a-Judge frameworks.

Key takeaways:

  1. Use Kendall’s $\tau$ when pairwise disagreements matter most, and Spearman’s $\rho$ for global rank alignment.
  2. Values of $\tau \ge 0.9$ often indicate near-equivalent rankings.

By using these statistical tools, Data Scientists and Machine Learning Engineers can rigorously evaluate ranking systems and drive improvements in generative AI and retrieval-augmented applications.

References

@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {Statistics for AI/ML, Part 2: Measuring Ranked Agreements Between Agents},
    year = {2025},
    month = {03},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2025/03/01/spearmans-and-kendalls/}
}