
[In my upcoming book—AI Wisdom: Meta-Principles of Thinking and Learning (March 2025)— I describe the durable AI meta-principles that can be taught in any subject. This is an excerpt from Chapter 9 (“Erring”) describing a broadly useful form of AI (or human) decision error analysis. Previously, I described mathematical bias in AI as distinct from ethical bias.]
Measuring the performance level of AI on complex tasks is a hard problem, one that increasingly relies on hand-crafted challenges designed to be difficult.
AI performance analysis has dramatically transformed as AI has gotten more sophisticated.[i] What began as relatively straightforward evaluations of accuracy and fluency has evolved into assessment methodologies that must account for nuanced capabilities like reasoning, creativity, and ethical judgment. Traditional metrics have been supplemented by more sophisticated evaluation frameworks that examine models' abilities to maintain consistent reasoning across long contexts, detect and avoid harmful outputs, and demonstrate understanding rather than shallow pattern matching. This evolution reflects the fundamental challenge that as AI systems become more capable, our methods for measuring and understanding their performance must become correspondingly more sophisticated.
Some of the modern AI evaluations are complicated and specialized, but there’s one error analysis for classification (by AI or people) that’s a great introduction to the subject and super useful. Many tasks that we give AI are classification tasks, where it is trained to decide which of several mutually exclusive categories something belongs to. A table called a confusion matrix is used to analyze classification errors.
Let’s start with a binary classification. Imagine a medical device is being tested on its ability to correctly diagnose cancer. There are four possibilities: a correct positive diagnosis, a correct negative one, a false positive, or a false negative. The number of diagnoses in each of those categories can be shown in a confusion matrix as in Table 1.
A different confusion matrix exists for each point on the ROC curve described in Chapter 2. [ROC curves show AI detection and false alarm performance for binary classification challenges as a function of the detection threshold.]
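To make the connection to ROC curves concrete, here is a minimal sketch of how each detection threshold produces its own confusion matrix, and how the resulting (false-alarm rate, detection rate) pairs are the points of the ROC curve. The scores and labels are made-up illustrative data, not from any real classifier.

```python
# Sketch: each detection threshold yields its own confusion matrix, and the
# (false-positive rate, true-positive rate) pairs trace out the ROC curve.
# The scores and labels below are hypothetical.

def confusion_at_threshold(scores, labels, threshold):
    """Count TP, FP, TN, FN when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        pred = s >= threshold
        if pred and y:
            tp += 1
        elif pred and not y:
            fp += 1
        elif not pred and not y:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]  # 1 = actually positive

for t in (0.25, 0.50, 0.75):
    tp, fp, tn, fn = confusion_at_threshold(scores, labels, t)
    tpr = tp / (tp + fn)   # detection rate (y-axis of the ROC curve)
    fpr = fp / (fp + tn)   # false-alarm rate (x-axis of the ROC curve)
    print(f"threshold={t:.2f}  TP={tp} FP={fp} TN={tn} FN={fn}  "
          f"TPR={tpr:.2f} FPR={fpr:.2f}")
```

Lowering the threshold catches more true positives at the cost of more false alarms, which is exactly the movement along the ROC curve described in Chapter 2.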
Table 1. Confusion matrix for a binary classification challenge. Each quadrant contains a count of the number of test examples fitting that cell.
From this simple table, a variety of metrics can be generated. A partial list is:
- Accuracy: the fraction of all cases classified correctly, (TP + TN) / total.
- Precision: of the cases the classifier called positive, the fraction that truly are, TP / (TP + FP).
- Recall (also called sensitivity): of the truly positive cases, the fraction the classifier caught, TP / (TP + FN).
- Specificity: of the truly negative cases, the fraction correctly rejected, TN / (TN + FP).
- F1 score: the harmonic mean of precision and recall.
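As a sketch, the standard metrics can be computed directly from the four cells of the binary confusion matrix. The counts below are hypothetical.

```python
# Sketch: common metrics derived from the four cells of a binary
# confusion matrix. The counts passed in below are hypothetical.

def metrics(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    return {
        "accuracy":    (tp + tn) / total,
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),      # a.k.a. sensitivity, detection rate
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),  # harmonic mean of precision and recall
    }

# Hypothetical cancer-screening results: 100 true cancers, 900 healthy cases.
m = metrics(tp=85, fp=10, tn=890, fn=15)
```

Note that accuracy here is 97.5% even though 15 of 100 cancers were missed, which previews the base-rate point made later in this excerpt.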
The importance of the metrics is application dependent. In medical diagnostics, a false negative (missing a disease) might have very different consequences than a false positive (diagnosing a disease that isn't present). Incorrectly marking important emails as spam (false positive) is generally worse than letting a few spam emails through (false negative). The implications of falsely convicting an innocent person versus letting a guilty person go free are profoundly different.
In many cases, the acceptable error rate or the preferred characteristics of the classification are based on the consequences of the judgment. If there’s an additional human review after the AI makes a classification, then a higher false positive rate might be tolerable, but a low false negative rate is desired. If the AI is doing the initial screening, the human reviewer wants assurance that few cancers were missed. That doesn’t mean the human review makes the judgment better. In a 2024 study of medical diagnosis, AI performed better on its own than either solo doctors or doctors with the AI.[ii] The humans overruled the more accurate AI.
Confusion matrices can also be used for many-class challenges. A hypothetical one is in Table 2. Each row represents the actual category, and each column represents the AI-decided category. The diagonal represents correct categorizations, while off-diagonal elements show various types of misclassifications.
Table 2. Hypothetical confusion matrix for a four-category challenge
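A many-class confusion matrix can be tallied directly from (actual, decided) label pairs. The sketch below uses made-up labels for four hypothetical categories A–D, following the row/column convention in Table 2.

```python
# Sketch: building a many-class confusion matrix from (actual, predicted)
# label pairs. Rows = actual category, columns = AI-decided category.
# The label pairs below are made-up.

from collections import Counter

CATEGORIES = ["A", "B", "C", "D"]

def confusion_matrix(pairs):
    counts = Counter(pairs)  # keys are (actual, predicted) tuples
    return [[counts[(a, p)] for p in CATEGORIES] for a in CATEGORIES]

pairs = [("A", "A"), ("A", "A"), ("A", "B"), ("B", "B"),
         ("B", "C"), ("C", "B"), ("C", "C"), ("D", "D")]
cm = confusion_matrix(pairs)
# Diagonal cells cm[i][i] count correct classifications; off-diagonal cells,
# e.g. cm[1][2] (actual B, decided C), count specific misclassifications.
```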
For AI classification systems, confusion matrices are invaluable for fine-tuning performance. They help developers understand not just the overall accuracy, but the specific types of errors being made.
Confusion matrices can be used to illustrate several key ideas:
Bias: In Table 2, categories B and C are more often confused with one another than other categories (i.e. the off-diagonal B/C comparisons have larger numbers). When the AI says it’s category B (second column), its errors are skewed toward category C. That might be because those categories are more difficult to distinguish, but if that categorization bias is ethically troublesome, then confusion matrices are the first step in determining whether there’s a problem.
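One way to check this kind of skew is to normalize a single column of the matrix: when the AI decides on a category, where do its errors come from? The matrix below is hypothetical, constructed only to mirror the B/C confusion described in the text.

```python
# Sketch: inspecting one column of a (hypothetical) confusion matrix to see
# whether errors for that decision are skewed toward a particular true category.

cm = [  # rows = actual A, B, C, D; columns = AI-decided A, B, C, D (made-up)
    [90,  3,  2,  1],
    [ 2, 70, 20,  1],
    [ 1, 22, 68,  2],
    [ 1,  1,  1, 10],
]

col = 1  # decided category B (second column)
column = [row[col] for row in cm]
errors = [c for i, c in enumerate(column) if i != col]  # off-diagonal entries
# Of the mistakes in the decided-B column, the share coming from actual C:
share_from_c = cm[2][col] / sum(errors)
```

Here roughly 85% of the errors in the decided-B column are actually category C, which is the skew the text describes; whether that skew is a problem is an ethical and application question, not a statistical one.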
Cost-Sensitive Learning: Not all errors are equal. Just as humans weigh the consequences of different types of mistakes, AI systems can be tuned to prioritize avoiding certain types of errors. If one category is dark-skinned human faces, and another is gorilla faces, then misclassifications of human faces as gorillas should have a much higher cost than other errors. While the cost of errors is a human question, the confusion matrix provides the raw material for such valuations. [Note that a historical example of exactly this misclassification is used earlier in the chapter.]
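The valuation step can be sketched by pairing the confusion matrix with a cost matrix and summing the element-wise products. Both matrices below are hypothetical; choosing the cost values is the human judgment the text refers to.

```python
# Sketch: weighting a confusion matrix by a cost matrix. All numbers are
# hypothetical; assigning costs is a human judgment, as the text notes.

cm = [           # rows = actual, columns = decided (made-up counts)
    [50,  5],
    [ 2, 43],
]
cost = [         # cost[actual][decided]; correct answers cost 0, and one
    [0, 1],      # direction of error is weighted 10x more heavily than
    [10, 0],     # the other
]

total_cost = sum(cm[i][j] * cost[i][j]
                 for i in range(len(cm)) for j in range(len(cm[i])))
```

Tuning a classifier to minimize this weighted sum, rather than the raw error count, is the essence of cost-sensitive learning.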
Prevalence and Base Rates: A high accuracy can be misleading if one class is much more common than others. This relates to the human cognitive bias of neglecting base rates in probability judgments. In the table above, true category D occurs much less often than the other categories, so there should be less trust that the statistics related to that category are accurate.
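The base-rate trap can be shown in a few lines. With a hypothetical 99-to-1 class imbalance, a classifier that never predicts the rare class still looks excellent on accuracy alone.

```python
# Sketch: high accuracy can hide useless behavior on a rare class.
# With a hypothetical 99:1 class imbalance, a classifier that always
# predicts the majority class still scores 99% accuracy.

actual = ["common"] * 99 + ["rare"] * 1
predicted = ["common"] * 100          # never predicts the rare class

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
rare_recall = sum(a == p == "rare" for a, p in zip(actual, predicted)) / 1
# accuracy is 0.99, yet recall on the rare class is 0.0
```

This is why per-class statistics from the confusion matrix matter more than a single accuracy number when classes are imbalanced.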
In education, confusion matrices can be applied to a wide range of subjects. Any time you’re discussing a concept, there is a binary categorization to potentially discuss—what is “in” the concept, and what is “out.” In a literature class, students might use one to analyze their ability to identify different types of figurative language. In a history class, it could be used to evaluate the accuracy of predictions about historical events.
I don’t have to tell educators that assessment can be a complex topic. Confusion matrices apply only to the niche of classification decisions, but that niche relates to much of the statistics people encounter in life.
[i] Pillay, T. "AI Models Are Getting Smarter. New Tests Are Racing to Catch Up." Time, December 24, 2024. https://time.com/7203729/ai-evaluations-safety/.
[ii] Goh, E., R. Gallo, J. Hom, E. Strong, Y. Weng, et al. "Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial." JAMA Network Open 7, no. 10 (2024): e2440969. doi.org/10.1001/jamanetworkopen.2024.40969.