A common question in clustering is: once we have clustered documents (e.g., articles, images, etc.) together, how do we determine how good the clustering results are, given ground truth clusters?

We can interpret clustering as a series of pairwise decisions, one for each pair of documents in the collection.

Definition

Suppose we are to cluster \(n\) documents. There will be \(\binom{n}{2}\) decisions (\(n\) choose \(2\)), or \(n(n-1)/2\) pairs of documents in the collection. Given a resultant cluster indicator matrix \(C\) of size \(n \times k\), where \(n\) is the number of data points and \(k\) is the number of clusters, and the ground truth cluster assignments \(C_{true}\), we can evaluate the clustering results using the F-measure, defined as

\[F_\beta = (1 + \beta^2) \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}\]
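
As a quick sketch, this formula translates directly into Python; the function name `f_beta` below is just for illustration:

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-measure: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * (precision * recall) / (beta**2 * precision + recall)
```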

For clustering decisions, we want to assign two documents to the same cluster if and only if they are similar. Following the definitions in [1], we can define the components of the confusion matrix as follows (a small counting sketch appears after the lists below):

  • True Positive (TP): decision assigns two similar documents to the same cluster
  • True Negative (TN): decision assigns two dissimilar documents to different clusters

There are two types of errors we can commit:

  • False Positive (FP): decision assigns two dissimilar documents to the same cluster.
  • False Negative (FN): decision assigns two similar documents to different clusters.
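
As a minimal sketch of these pairwise decisions, assuming both the predicted clustering and the ground truth are given as flat lists of labels (one per document), we can tally the four outcomes by looping over all pairs:

```python
from itertools import combinations

def pairwise_confusion(pred, true):
    """Count pairwise TP, FP, FN, TN given predicted cluster labels
    and ground truth class labels, one label per document."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_cluster = pred[i] == pred[j]   # decision: same cluster?
        same_class = true[i] == true[j]     # truth: same class?
        if same_cluster and same_class:
            tp += 1
        elif same_cluster and not same_class:
            fp += 1
        elif not same_cluster and same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn
```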

Example

We will use this example from the textbook [1] to illustrate in detail how to manually calculate the F-measure. In this example, we have three clusters \(C\), where each cluster contains a number of documents. Each document is marked with an X, O, or Diamond, representing its ground truth cluster \(C_{true}\).

[Figure: the three example clusters, with each document labeled X, O, or Diamond according to its ground truth class]

There are a total of \(n = 17\) documents. Thus there are \(\binom{17}{2} = 136\) pairs of decisions.
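
A quick sanity check of this count using Python's standard library:

```python
from math import comb

total_pairs = comb(17, 2)   # 17 * 16 / 2
print(total_pairs)          # 136
```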

Using the definitions of TP and FP above, we can calculate All Positives as the decisions that assign two documents to the same cluster. There are 6 documents in cluster 1, 6 documents in cluster 2, and 5 documents in cluster 3. Thus there are \(\binom{6}{2}\) pairs of documents in cluster 1, \(\binom{6}{2}\) in cluster 2, and \(\binom{5}{2}\) in cluster 3. Or,

\[\text{All Positive} = TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 15 + 15 + 10 = 40\]
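
The same count in Python, using the per-cluster sizes read off the figure:

```python
from math import comb

cluster_sizes = [6, 6, 5]   # documents per predicted cluster
all_positive = sum(comb(size, 2) for size in cluster_sizes)
print(all_positive)         # 40
```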

True Positives are pairs of documents of the same type assigned to the same cluster. There are 5 X's and 1 O in cluster 1, or \(\binom{5}{2}\) pairs for X and 0 pairs for O. There are 4 O's, 1 X, and 1 Diamond in cluster 2, or \(\binom{4}{2}\) pairs for O and 0 pairs for X and Diamond. There are 3 Diamonds and 2 X's in cluster 3, or \(\binom{3}{2}\) pairs for Diamond, \(\binom{2}{2}\) pairs for X, and 0 pairs for O. Thus,

\[TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 10 + 6 + 3 + 1 = 20\]
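
The same count in code, assuming the cluster-by-class counts from the figure are stored as a small contingency table (the representation below is just for illustration):

```python
from math import comb

# contingency[cluster] = {class label: count}, read off the example figure
contingency = [
    {"X": 5, "O": 1},                 # cluster 1
    {"O": 4, "X": 1, "Diamond": 1},   # cluster 2
    {"Diamond": 3, "X": 2},           # cluster 3
]
tp = sum(comb(count, 2) for cluster in contingency for count in cluster.values())
print(tp)                             # 20
```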

With All Positive and True Positive, we can derive False Positive as:

\[FP = \text{All Positive} - \text{True Positive} = 40 - 20 = 20\]

We can also derive All Negatives from the total number of decision pairs as:

\[\text{All Negative} = \text{Total Pairs of Decisions} - \text{All Positive} = 136 - 40 = 96\]
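
Carrying these two derivations through in code with the counts from above:

```python
total_pairs, all_positive, tp = 136, 40, 20

fp = all_positive - tp                        # 20
all_negative = total_pairs - all_positive     # 96
print(fp, all_negative)
```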

False Negatives are decisions that assign two similar documents to different clusters. So we will go through each ground truth label and sum up the number of pairs that are mis-classified (i.e., split across clusters).

For X, there are 5 in cluster 1, 1 in cluster 2, and 2 in cluster 3. So the number of mis-classified X pairs is \(5 \times (2 + 1) = 15\) for pairs split between cluster 1 and clusters 2/3, plus \(2 \times 1 = 2\) for pairs split between cluster 2 and cluster 3.

For O, there are 4 in cluster 2 and 1 in cluster 1. Thus the total number of mis-classified O pairs is \(4 \times 1 = 4\).

For Diamond, there are 3 in cluster 3 and 1 in cluster 2. Thus the total number of mis-classified Diamond pairs is \(3 \times 1 = 3\). Therefore,

\[FN = 5 \times 3 + 2 \times 1 + 4 \times 1 + 3 \times 1 = 15 + 2 + 4 + 3 = 24\]
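
The same count in code, grouping each class's per-cluster counts (again read off the figure; the `class_counts` layout is just for illustration):

```python
from itertools import combinations

# class_counts[class label] = count of that class in clusters 1, 2, 3
class_counts = {
    "X": [5, 1, 2],
    "O": [1, 4, 0],
    "Diamond": [0, 1, 3],
}
fn = sum(
    a * b
    for counts in class_counts.values()
    for a, b in combinations(counts, 2)   # pairs split across two different clusters
)
print(fn)                                  # 15 + 2 + 4 + 3 = 24
```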

We can then derive True Negative as:

\[TN = \text{All Negative} - FN = 96 - 24 = 72\]

Thus, the confusion table for the clustering results is as follows:

|                 | Same Cluster | Different Cluster |
|-----------------|--------------|-------------------|
| Same Class      | TP = 20      | FN = 24           |
| Different Class | FP = 20      | TN = 72           |

Then we can calculate Precision and Recall as

\[\text{Precision} = \frac{TP}{TP + FP} = \frac{20}{20 + 20} = 0.50\] \[\text{Recall} = \frac{TP}{TP + FN} = \frac{20}{20 + 24} \approx 0.4545\]
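
In code, with the counts derived above:

```python
tp, fp, fn = 20, 20, 24

precision = tp / (tp + fp)   # 20 / 40 = 0.5
recall = tp / (tp + fn)      # 20 / 44 ≈ 0.4545
print(precision, recall)
```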

Using \(\beta = 1\), i.e., equal balance between precision and recall, we can then get the F-measure (F1 score) as

\[F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} = 2 \times \frac{0.5 \times 0.4545}{0.5 + 0.4545} \approx 0.4762\]
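
Plugging these values into the `f_beta` sketch from the Definition section:

```python
print(f_beta(precision=0.5, recall=20 / 44, beta=1.0))   # ≈ 0.4762
```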

Implementation

in process…
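
In the meantime, here is a minimal end-to-end sketch of the pairwise approach described above; the function name `pairwise_f_beta` and the flat label-list inputs are illustrative assumptions, not a reference implementation:

```python
from itertools import combinations

def pairwise_f_beta(pred, true, beta=1.0):
    """Pairwise F-measure for a clustering, given predicted cluster labels
    and ground truth class labels (one label per document)."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_cluster = pred[i] == pred[j]
        same_class = true[i] == true[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# The textbook example: clusters 1/2/3 with ground truth labels X/O/Diamond
pred = [1] * 6 + [2] * 6 + [3] * 5
true = (["X"] * 5 + ["O"]                      # cluster 1
        + ["O"] * 4 + ["X"] + ["Diamond"]      # cluster 2
        + ["Diamond"] * 3 + ["X"] * 2)         # cluster 3
print(round(pairwise_f_beta(pred, true), 4))   # 0.4762
```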

References:

1. Introduction to Information Retrieval

2. Characterization and evaluation of similarity measures for pairs of clusterings


To cite this content, please use:

@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {Document Clustering Evaluation},
    year = {2020},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/2020-11-17-Document-Clustering-Evaluation/}
}