Following our last post on measuring ranking agreements, we introduce evaluation metrics for classification agreements.
Classification is the task of assigning a category, or class, to an input. In AI/ML, classification underpins tasks like email spam detection, image recognition, sentiment analysis, and decision-making. Measuring agreement is essential for verifying that models and agents align with expert judgments or with other models, especially in high-stakes domains.

Cohen’s Kappa for Classification Agreements

Cohen’s Kappa is a statistical metric that measures how well two raters, such as AI agents or human annotators, agree when classifying data into categories, for example, labeling customer reviews as positive, neutral, or negative. Unlike simple percentage agreement, it adjusts for agreement that could happen by chance, making it more reliable for AI/ML evaluations, where systems are sensitive to input variations and can be stochastic.

NOTE: this measure is intended to compare labels assigned by different annotators, not a classifier’s predictions against a ground truth.

Cohen’s Kappa, introduced by Jacob Cohen in 1960, is defined as:

\[\kappa = \frac{p_o - p_e}{1 - p_e}\]

where

  • $p_o$ is the observed agreement, calculated as the proportion of cases where both raters assign the same category.
  • $p_e$ is the expected agreement by chance, derived from the marginal probabilities of each category for both raters. For example, if rater 1 assigns category A to 60% of cases and rater 2 assigns it to 55%, the chance agreement on A is $0.60 \times 0.55$; $p_e$ is the sum of such products over all categories.

This adjustment for chance is crucial, as simple percentage agreement can be misleading, especially with imbalanced data.
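To make the chance correction concrete, here is the two-category example above worked through end to end; the 70% observed agreement is an assumed figure for illustration.

\[p_e = 0.60 \times 0.55 + 0.40 \times 0.45 = 0.33 + 0.18 = 0.51\]

\[\kappa = \frac{0.70 - 0.51}{1 - 0.51} = \frac{0.19}{0.49} \approx 0.39\]

In other words, raters who agree on 70% of cases reach a Kappa of only about 0.39 once chance agreement is discounted.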

Interpretation

The Kappa value ranges from -1 to 1:

  • $\kappa \gt 0.8$: Almost perfect agreement
  • $0.6 \lt \kappa \le 0.8$: Substantial agreement
  • $0.4 \lt \kappa \le 0.6$: Moderate agreement
  • $0.2 \lt \kappa \le 0.4$: Fair agreement
  • $0 \lt \kappa \le 0.2$: Slight agreement
  • $\kappa \le 0$: No agreement beyond chance; negative values indicate systematic disagreement.

Implementing Cohen’s Kappa

We can use the cohen_kappa_score function from sklearn.metrics to calculate Cohen’s Kappa in Python. The function accepts any hashable labels, not just integers. As a best practice, we should also report a confidence interval alongside the point estimate, which we can obtain by bootstrapping.

import numpy as np
from sklearn.metrics import cohen_kappa_score

rater1 = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
rater2 = [1, 1, 2, 2, 1, 1, 2, 2, 1, 2]

n_bootstraps = 1000
rng = np.random.default_rng(42)  # fixed seed for reproducibility

# Point estimate of Cohen's Kappa on the full sample.
kappa = cohen_kappa_score(rater1, rater2)

# Bootstrap: resample rating pairs with replacement and recompute Kappa each time.
kappa_samples = []
for _ in range(n_bootstraps):
    indices = rng.choice(len(rater1), size=len(rater1), replace=True)
    resampled_rater1 = [rater1[i] for i in indices]
    resampled_rater2 = [rater2[i] for i in indices]
    kappa_samples.append(cohen_kappa_score(resampled_rater1, resampled_rater2))

# 95% percentile confidence interval from the bootstrap distribution.
lower = np.percentile(kappa_samples, 2.5)
upper = np.percentile(kappa_samples, 97.5)

print(f"Cohen's Kappa: {kappa:.3f}")
print(f"95% CI for Cohen's Kappa: [{lower:.3f}, {upper:.3f}]")

Conclusion

Cohen’s Kappa is a vital statistic for measuring classification agreement between agents. It accounts for chance and provides a more nuanced view than raw percentage agreement. It is especially valuable on imbalanced datasets, with applications in healthcare, finance, and content moderation.

Key takeaways:

  • Use Cohen’s Kappa to measure agreement beyond chance, with values above 0.8 indicating almost perfect agreement.
  • Implement it using sklearn.metrics.cohen_kappa_score in Python, and use bootstrapping to report confidence intervals.
