"To minimize the mistakes your AI makes, you should use the most accurate machine learning model." Sounds straightforward, right? However, making the fewest mistakes should not always be your goal, since different types of mistakes can have very different impacts. ML models will make mistakes, so it is crucial to decide which mistakes you can better live with.
To choose the right ML model and make informed decisions based on its predictions, it is important to understand different measures of relevance.
Why you shouldn't blindly use your most accurate ML model
First, let's start by defining accuracy:
The accuracy of an ML model describes the share of data points that were classified correctly.
To use a practical example, let's look at an image classification problem in which the AI is tasked with labeling an image dataset containing images of 500 cats and 500 dogs. The model correctly labels 500 dogs and 499 cats; mistakenly, it labels one cat as "dog". With 999 of 1,000 images labeled correctly, the accuracy is 99.9%. Here, accuracy is a good assessment of model quality.
For comparison, let's look at a second, less balanced example: A hospital looks for cancer in 1,000 images. In reality, two of those pictures contain evidence of cancer, but the model only detects one of them. Since the model makes only one mistake by labeling one cancerous image as "healthy", its accuracy is also 99.9%.
In this case, the 99.9% accuracy gives a wrong impression, as the model actually missed 50% of the relevant items. Without doubt, it would be preferable to reduce the accuracy to 99.2% and mistakenly detect 8 healthy images as "cancerous" if, in return, the second cancerous image could be detected – the trade-off of manually checking 10 images to discover two relevant elements is hardly worth debating.
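As a quick sketch, here is how accuracy falls out of the raw counts in the two examples above (the helper function is illustrative, not a library API):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Cat/dog example: 500 dogs found, 499 cats correct, 1 cat mislabeled as "dog".
print(accuracy(tp=500, tn=499, fp=1, fn=0))  # 0.999

# Cancer example: 1 of 2 cancerous images found, all 998 healthy images correct.
# Same accuracy, despite missing half of the relevant items.
print(accuracy(tp=1, tn=998, fp=0, fn=1))    # 0.999
```

Both models score 99.9%, which is exactly why accuracy alone cannot distinguish them.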
But how can you formalize this when choosing your ML model? Let's dive deeper.
Measuring relevance: Dealing with high-priority classes
If your dataset is not well balanced or mistakes have varying impact, your model's accuracy is not a good measure of performance.
Whenever you are looking for specific information, the main task is often to differentiate between the relevant data you are looking for and the irrelevant information that clouds your view. It is therefore more important to analyze model performance with respect to the relevant elements, not the overall dataset.
Let's look at our first example. If the objective is to detect dogs, all dogs are relevant elements whereas cats are irrelevant elements.
In this task, the AI can make two types of mistakes:
- It can miss a detection of a dog (false negative) or
- It can wrongly identify a cat as a dog (false positive).
For a detailed description of the different mistakes, their possible implications, and how you can systematically control them, head over to our article on how to control AI-enabled workflow automation.
Ideally, the AI should detect all dogs without a miss and make no mistake by labeling a cat as a dog. Hence, there are two main dimensions according to which the correctness of machine learning models can be compared.
The precision of a model describes how many detected items are truly relevant. It is calculated by dividing the number of true positives by the number of all positive predictions (true positives plus false positives).
In our first example, it compares the number of dogs that were correctly detected to the total number of animals that were detected as "dog" – true dogs plus mislabeled cats. Since missed detections of dogs are not considered in the calculation, precision can be increased by setting a higher threshold on when an animal should be detected as a dog.
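The threshold effect can be sketched with hypothetical confidence scores (the scores and labels below are made up for illustration):

```python
# Hypothetical "dog" confidence scores for five animals (1.0 = certainly a dog);
# the last two animals are actually cats.
scores = [0.95, 0.90, 0.80, 0.60, 0.55]
labels = [1, 1, 1, 0, 0]  # 1 = dog, 0 = cat

def precision_at(threshold):
    """Precision when only detections at or above the threshold count."""
    detected = [label for score, label in zip(scores, labels) if score >= threshold]
    return sum(detected) / len(detected) if detected else 1.0

print(precision_at(0.5))  # 0.6 -- a low threshold admits both cats
print(precision_at(0.7))  # 1.0 -- a stricter threshold keeps only dogs
```

Note that the stricter threshold buys precision at the cost of potentially missing borderline dogs, which is exactly the trade-off discussed below.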
In the cat/dog example, the precision is 99.8%, since out of the 501 animals that were detected as dogs, only one was a cat. If we look at the cancer example, we get a perfect score of 100%, since the model detected no healthy image as cancerous.
Besides being a measure of model performance, precision can also be seen as the probability that a randomly selected item which is labeled as "relevant" is a true positive. In the cancer example, the precision percentage can be translated as the probability that an image which the model detected as cancerous actually shows cancer.
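Plugging the counts from both examples into the definition (a minimal sketch, not a library function):

```python
def precision(tp, fp):
    """Share of detected items that are truly relevant."""
    return tp / (tp + fp)

# Cat/dog example: 501 animals detected as "dog", one of them a cat.
print(precision(tp=500, fp=1))  # ~0.998 (500/501)

# Cancer example: the single detection was correct.
print(precision(tp=1, fp=0))    # 1.0
```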
Recall is a measure of how many relevant elements were detected. It is calculated by dividing the number of true positives by the number of relevant elements (true positives plus false negatives).
In our cat/dog example, it compares the number of dogs that were detected to the overall number of dogs in the dataset. Since all 500 dogs were found, the recall of the model is a perfect 100%.
In contrast, the cancer-detection model has a terrible recall. Since only one of two examples of cancer was detected, the recall is 50%. While accuracy and precision suggested that the model is suitable to detect cancer, calculating recall reveals its weakness.
As with precision, analyzing recall alone can also give a wrong impression of model performance. A model labeling all animals in the dataset as "dog" would have a recall of 100%, since it would detect all dogs without a miss. The 500 wrongly labeled cats would have no impact on recall.
For the individual element, the recall percentage gives the probability that a randomly selected relevant item from the dataset will be detected.
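The same counts, run through the definition of recall (again a minimal sketch):

```python
def recall(tp, fn):
    """Share of relevant items that were detected."""
    return tp / (tp + fn)

# Cat/dog example: all 500 dogs detected, none missed.
print(recall(tp=500, fn=0))  # 1.0

# Cancer example: only one of two cancerous images found.
print(recall(tp=1, fn=1))    # 0.5
```

This is the calculation that exposes the cancer model's weakness, which accuracy and precision both hid.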
Going back to the question of how to select the right model, there is a trade-off between trying to detect all relevant items and avoiding wrong detections. In the end, the decision depends on your use case.
Put differently, you will need to consider these questions: How crucial is it that you detect every relevant element? Are you willing to manually sort out irrelevant elements in return for optimal recall?
In the cancer diagnosis example, false negatives should be avoided at all costs, since they can have lethal consequences. Here, recall is a better measure than precision.
If you were to optimize recommendations on YouTube, false negatives would matter less, since only a small subset of candidate videos is shown anyway. What should be avoided above all are false positives (bad recommendations). Hence, the model should be optimized for precision.
Combining precision and recall: The F-measure
There is also a way to combine the two, and it can sometimes make sense to calculate what's called the F-measure, which balances precision and recall in a single number. However, unless you are currently preparing for a statistics exam, the above might already be a stretch and, to be honest: we struggle with it, too. When working with our software, all you really need to worry about are these two measures.
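For the curious, the most common variant, the F1 score, is simply the harmonic mean of precision and recall (sketched below with the cancer example's values):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Cancer example: precision 1.0, recall 0.5.
# The harmonic mean punishes the low recall harder than a plain average would.
print(f1(1.0, 0.5))  # ~0.667
```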
If you are still hungry for more, here's the Wikipedia article for it 🤓