Usefulness of Confusion Matrices

A confusion matrix is a table that summarizes the performance of a classification algorithm. Each column of the matrix represents predicted classifications and each row represents actual (known) classifications. This layout is a useful way to evaluate a classifier model. A well-behaved model should produce a balanced matrix and have consistent values for accuracy, recall, precision and the F measure. If it does not, there is cause to examine further the data used to build the model and the data used to test it. If you are building your own classification models, this is a helpful way to evaluate them. If you are buying a product that uses a classification model, for something like sentiment analysis, you should ask for the data behind its confusion matrix to help evaluate the tool.

For example, if a classification algorithm distinguishes between positive, negative and neutral text statements, a confusion matrix summarizing its results might look like the following.

Actual \ Predicted    Pos    Neg    Neutral
Pos                    15     10        100
Neg                    10     15         10
Neutral                10    100       1000

  • In this matrix, the total number of statements that actually belong to each category is the sum of its row: 125 Positive, 35 Negative and 1110 Neutral.
  • Columns represent predictions made by the algorithm. In the first column, 15 Positive statements were correctly classified as Positive, while 10 Negative statements and 10 Neutral statements were incorrectly classified as Positive. These 20 incorrectly classified statements are false positives for the Positive category.
  • Reading left to right across the top row, of the 125 Positive statements, 15 were classified as Positive, 10 as Negative and 100 as Neutral. The 110 Positive statements that were missed are false negatives for the Positive category.
  • Values on the diagonal (15, 15 and 1000) are correctly classified. All other cells are incorrect classifications.
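The row and column readings above can be sketched in a few lines of Python. This is a minimal illustration rather than code from any particular library; the names `LABELS`, `MATRIX` and `per_class_counts` are assumptions made for the example.

```python
# Rows are actual categories, columns are predicted categories,
# in the same order as the example table.
LABELS = ["Pos", "Neg", "Neutral"]
MATRIX = [
    [15, 10, 100],
    [10, 15, 10],
    [10, 100, 1000],
]


def per_class_counts(matrix, labels):
    """Return {label: (true_pos, false_pos, false_neg)} for a square matrix."""
    counts = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        # Everything else in the actual row was missed.
        false_neg = sum(matrix[i]) - true_pos
        # Everything else in the predicted column was wrongly assigned.
        false_pos = sum(row[i] for row in matrix) - true_pos
        counts[label] = (true_pos, false_pos, false_neg)
    return counts


counts = per_class_counts(MATRIX, LABELS)
for label, (tp, fp, fn) in counts.items():
    print(f"{label}: TP={tp} FP={fp} FN={fn}")
```

For the Positive category this reproduces the 15 correct classifications, 20 false positives and 110 false negatives described in the bullets.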


The simplest and most intuitive assessment is Accuracy: the number of correct classifications divided by the number of all classifications. In this example, divide the sum of the diagonal values (15 + 15 + 1000) by the sum of the values in all cells: 1030 / 1270 = 0.811.
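As a small sketch of that calculation, assuming the matrix is stored as a list of rows (the `accuracy` helper name is chosen for this example):

```python
MATRIX = [
    [15, 10, 100],
    [10, 15, 10],
    [10, 100, 1000],
]


def accuracy(matrix):
    """Correct classifications (the diagonal) divided by all classifications."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


print(round(accuracy(MATRIX), 3))
```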

How to Cheat:

If the categories are highly imbalanced, high accuracy can be obtained by always guessing the category with the largest number of elements. Accuracy can be checked by randomly classifying the elements and comparing the percent guessed correctly with the accuracy of the algorithm. Accuracy should also be validated with Precision and Recall to detect whether the data has been set up to cheat.
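Both checks can be sketched roughly as follows. The helper names are hypothetical, and the random baseline assumes a uniform guess over the categories with a fixed seed so the run is repeatable; other ways of guessing at random would be equally reasonable.

```python
import random

MATRIX = [
    [15, 10, 100],
    [10, 15, 10],
    [10, 100, 1000],
]


def majority_baseline_accuracy(matrix):
    """Accuracy of always guessing the category with the most actual items."""
    row_totals = [sum(row) for row in matrix]
    return max(row_totals) / sum(row_totals)


def random_baseline_accuracy(matrix, seed=0):
    """Accuracy of guessing a category uniformly at random for every item."""
    rng = random.Random(seed)
    n_classes = len(matrix)
    correct = 0
    total = 0
    for actual, row in enumerate(matrix):
        for _ in range(sum(row)):
            if rng.randrange(n_classes) == actual:
                correct += 1
            total += 1
    return correct / total


print("majority baseline:", round(majority_baseline_accuracy(MATRIX), 3))
print("random baseline:  ", round(random_baseline_accuracy(MATRIX), 3))
```

For the example matrix, always guessing Neutral would be right 1110 times out of 1270, about 0.87, which is higher than the 0.811 the classifier achieves.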

Precision is the number of correct classifications penalized by the number of incorrect classifications: true positives / (true positives + false positives). In this example, the precision is:

(reading down each predicted column)

positive statements: 15 / (15 + 20) = 0.43

negative statements: 15 / (15 + 110) = 0.12

neutral statements: 1000 / (1000 + 110) = 0.90
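The column-wise calculation can be sketched in Python as follows. The function name is an assumption for this example, and returning 0.0 for a category that is never predicted is one possible convention, not the only one.

```python
LABELS = ["Pos", "Neg", "Neutral"]
MATRIX = [
    [15, 10, 100],
    [10, 15, 10],
    [10, 100, 1000],
]


def precision_per_class(matrix, labels):
    """Diagonal value divided by its column total, for each category."""
    result = {}
    for j, label in enumerate(labels):
        predicted = sum(row[j] for row in matrix)  # TP + FP
        # Assumption: report 0.0 when nothing was predicted for this category.
        result[label] = matrix[j][j] / predicted if predicted else 0.0
    return result


precision = precision_per_class(MATRIX, LABELS)
for label, value in precision.items():
    print(f"{label}: {value:.2f}")
```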

How to Cheat:

You can get high precision for a category by rarely assigning statements to it, but this ruins Recall.


Recall is the number of correct classifications penalized by the number of missed items: true positives / (true positives + false negatives). In this example, the recall is:

(reading across each Actual row)

positive statements: 15 / (15 + 110) = 0.12

negative statements: 15 / (15 + 20) = 0.43

neutral statements: 1000 / (1000 + 110) = 0.90
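The row-wise calculation looks much the same in code. As before, this is only a sketch; the function name and the 0.0 fallback for an empty row are assumptions made for the example.

```python
LABELS = ["Pos", "Neg", "Neutral"]
MATRIX = [
    [15, 10, 100],
    [10, 15, 10],
    [10, 100, 1000],
]


def recall_per_class(matrix, labels):
    """Diagonal value divided by its row total, for each category."""
    result = {}
    for i, label in enumerate(labels):
        actual = sum(matrix[i])  # TP + FN
        # Assumption: report 0.0 when the test data has no items in this category.
        result[label] = matrix[i][i] / actual if actual else 0.0
    return result


recall = recall_per_class(MATRIX, LABELS)
for label, value in recall.items():
    print(f"{label}: {value:.2f}")
```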

F Measure (F1)

The F1 measure is a derived effectiveness measurement. It is the harmonic mean of precision and recall, so it is high only when both are high. The best value is 1 and the worst is 0.

F1 = 2 * (precision * recall) / (precision + recall)

positive statements: 2 * (0.43 * 0.12) / (0.43 + 0.12) = 0.19

negative statements: 2 * (0.12 * 0.43) / (0.12 + 0.43) = 0.19

neutral statements: 2 * (0.90 * 0.90) / (0.90 + 0.90) = 0.90
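A minimal sketch of the formula, using the unrounded precision and recall fractions from the example matrix. The `f1` helper name and the choice to return 0.0 when both inputs are zero are assumptions made here.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumption: define F1 as 0 when both inputs are 0
    return 2 * (precision * recall) / (precision + recall)


# (precision, recall) pairs taken from the example matrix, unrounded.
scores = {
    "Pos": f1(15 / 35, 15 / 125),
    "Neg": f1(15 / 125, 15 / 35),
    "Neutral": f1(1000 / 1110, 1000 / 1110),
}

for label, value in scores.items():
    print(f"{label}: {value:.4f}")
```

Because the formula is symmetric in precision and recall, the Positive and Negative categories get the same score even though their precision and recall are swapped.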


In this example, although the Accuracy of 81% seems good, the Precision and Recall values show that it is skewed by the imbalance in the number of statements tested in each category. This is clearly visible in the F measure numbers. It could be that the model in this example was built with many more neutral statements than positive and negative ones, and that the test data contains mostly neutral statements. Accuracy alone can be misleading.

Ideally you want to see a matrix more like the following:

Actual \ Predicted    Pos    Neg    Neutral
Pos                   105     10         10
Neg                     6     90          8
Neutral                12     15        150

Accuracy = 0.85

Precision

Positive = 0.85

Negative = 0.78

Neutral = 0.89

Recall

Positive = 0.84

Negative = 0.87

Neutral = 0.85

F Measure

Positive = 0.85

Negative = 0.82

Neutral = 0.87

All the numbers are in the same ballpark and are reasonably balanced across the matrix with the bulk of the numbers on the diagonal showing a majority of correct classifications. This also shows a greater likelihood of the model being constructed with a balanced data set and that the test data evaluated by the model is also balanced.
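One rough way to put a number on "in the same ballpark" is to compare the gap between the best and worst per-category F1 score for the two matrices. The `f1_spread` helper below is a hypothetical illustration, not a standard metric.

```python
SKEWED = [
    [15, 10, 100],
    [10, 15, 10],
    [10, 100, 1000],
]
BALANCED = [
    [105, 10, 10],
    [6, 90, 8],
    [12, 15, 150],
]


def f1_per_class(matrix):
    """Per-category F1, computed directly from row and column totals."""
    scores = []
    for i in range(len(matrix)):
        true_pos = matrix[i][i]
        actual = sum(matrix[i])                    # TP + FN
        predicted = sum(row[i] for row in matrix)  # TP + FP
        # 2*TP / (actual + predicted) is algebraically equal to 2PR / (P + R).
        denom = actual + predicted
        scores.append(2 * true_pos / denom if denom else 0.0)
    return scores


def f1_spread(matrix):
    """Gap between the best and worst per-category F1 (hypothetical helper)."""
    scores = f1_per_class(matrix)
    return max(scores) - min(scores)


print("skewed spread:  ", round(f1_spread(SKEWED), 2))
print("balanced spread:", round(f1_spread(BALANCED), 2))
```

The first matrix gives a spread of roughly 0.71, while the second gives roughly 0.05, which matches the visual impression that its scores are evenly balanced.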

User Feedback in Design

The difference between an application that is useful and easy to use and one that you’re forced to struggle with usually correlates directly with whether the application was designed and built with user input during the course of development. This applies both to user interface design and to application functions. Original ideas come from project stakeholders, and they drive the project. Users evaluate features and ease of use in the context of what they wish to accomplish, and their feedback drives the implementation. With as many as 75% of Facebook users unhappy with the changes introduced 21 Sept 2011, it’s safe to say that the user base did not have much input during the development of the latest release. Facebook now has an earful of feedback. Let this be a prominent example showing the importance of:

  • knowing who your users are,
  • knowing what your users wish to accomplish,
  • understanding what makes your users’ tasks simple and efficient to accomplish,
  • listening to your users for product success!