Understand These 5 Key Deep Learning Classification Metrics for Better Application Success
Product quality is the lifeblood of most companies. Getting it right time and again leads to customer trust, positive word of mouth, fewer costly recalls, and ultimately better business outcomes. In a factory or on a production line, machine vision systems at every step of production are one of the best investments for delivering quality products. Specifically, deep learning tools, such as classifiers, help manufacturers identify potential quality control issues on the production line and limit flaws in finished products.
The classifier is an important inspection tool because it is not enough for the production line simply to identify defective or damaged parts and pull them out of production. Those defects must also be classified so the inspection system can identify patterns and determine, for example, whether one defect is a scratch and another a dent. Correct classification of these production flaws keeps bad products off the market, while wrong predictions keep good products off the shelves, bogging down production and adding to costs.
In the world of Industry 4.0, where big data is crucial to process and quality control, having the right metrics from this data allows organizations to understand whether their deep learning classification inspections are performing optimally. Classification applications rely on four main outcomes to generate this data:
- True positive: The ground truth is positive and the predicted class is also positive
- False positive: The ground truth is negative and the predicted class is positive
- True negative: The ground truth is negative and the predicted class is negative
- False negative: The ground truth is positive and the predicted class is negative
The ground truth is the actual inspection outcome such as identifying a dent on an automobile bumper. Developers and engineers want to hone their deep learning applications to correctly predict and classify defects, for example, to match the ground truth defect found on the actual part.
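These four outcomes can be tallied directly from paired ground-truth and predicted labels. Here is a minimal sketch in Python; the function name and the "defect"/"good" labels are illustrative, not from any particular library:

```python
# Count the four classification outcomes from paired labels.
# "defect" is treated as the positive class (an illustrative choice).
def confusion_counts(ground_truth, predicted, positive="defect"):
    tp = fp = tn = fn = 0
    for truth, pred in zip(ground_truth, predicted):
        if truth == positive and pred == positive:
            tp += 1  # true positive: real defect, predicted defect
        elif truth != positive and pred == positive:
            fp += 1  # false positive: good part, predicted defect
        elif truth != positive and pred != positive:
            tn += 1  # true negative: good part, predicted good
        else:
            fn += 1  # false negative: real defect, predicted good
    return tp, fp, tn, fn
```

Every metric discussed below can be computed from these four counts.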
There are numerous metrics organizations can use to measure the success of their classification application, but here is a look at five of them.
Accuracy and error rates
The most commonly used metric in manufacturing deep learning applications is classification accuracy, thanks to its simplicity and its effectiveness at conveying the underlying message in a single number. The error rate is a worthy complement to accuracy.
These are the most fundamental metrics because they identify the essential effectiveness of a deep learning application.
Measuring accuracy is relatively straightforward: divide the number of correct predictions by the total number of predictions made. The error rate is the number of incorrect predictions divided by the number of total predictions.
It is worth noting that, for classification applications, correct predictions include all true positive and true negative results.
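In terms of the four outcome counts, the two formulas can be sketched as follows (the function names are hypothetical, chosen for illustration):

```python
def accuracy(tp, fp, tn, fn):
    # Correct predictions (true positives + true negatives)
    # divided by the total number of predictions.
    total = tp + fp + tn + fn
    return (tp + tn) / total

def error_rate(tp, fp, tn, fn):
    # Incorrect predictions (false positives + false negatives)
    # divided by the total number of predictions.
    total = tp + fp + tn + fn
    return (fp + fn) / total
```

Note that accuracy and error rate always sum to 1, so reporting either one implies the other.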
Escape rate
A classification application that incorrectly predicts a defective part as good produces what is known as an escape. Allowing damaged or flawed products to “escape” into the marketplace undetected risks a company’s reputation for quality products, and recalls of these escaped products can potentially cost millions of dollars.
The Escape rate is measured by dividing the number of false negatives by the total number of predictions.
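As a sketch, escapes are the false negatives, so the rate follows directly from the counts above (function name is illustrative):

```python
def escape_rate(fn, total_predictions):
    # Escapes are false negatives: defective parts predicted as good
    # and allowed to continue down the line.
    return fn / total_predictions
```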
Overkill rate
A classification application that produces false positive predictions generates overkill: good parts without defects are mistakenly removed from the production line. Non-defective parts removed from the line can end up as scrap or be manually re-worked. Either outcome costs the manufacturer additional money in parts and labor.
The Overkill rate is measured by dividing the number of false positives by the total number of predictions.
Precision
Precision answers the question: what proportion of positive predictions was correct? In other words, is the classification application predicting the right class without generating false positives?
A value of 1 indicates the classification model is very good at predicting the right class while achieving 0% overkill. A value of 0 indicates the model is not capable of doing what it should.
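Precision is the number of true positives divided by all positive predictions, true and false alike. A minimal sketch (function name is illustrative):

```python
def precision(tp, fp):
    # Of all parts the model flagged as defective, what fraction
    # was actually defective? Guard against division by zero when
    # the model made no positive predictions at all.
    if tp + fp == 0:
        return 0.0
    return tp / (tp + fp)
```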
F1-Score
The F1-Score is defined as the harmonic mean of precision and recall, and it is a measure of a test’s accuracy. The highest possible value is 1, indicating perfect precision and recall.
As previously mentioned, precision is the number of correctly identified positive results divided by the number of all positive results, including those not identified correctly. Recall is the number of correctly identified positive results divided by the number of all samples that should have been identified as positive.
The F1-Score, then, condenses precision and recall into a single number that reflects how well the classification application balances missed defects against overkill.
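Putting the definitions together, recall and the F1-Score can be sketched from the same outcome counts (function names are illustrative):

```python
def recall(tp, fn):
    # Of all actual defects, what fraction did the model catch?
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0  # precision
    r = tp / (tp + fn) if tp + fn else 0.0  # recall
    # Harmonic mean of precision and recall; zero if both are zero.
    return 2 * p * r / (p + r) if p + r else 0.0
```

Because the harmonic mean punishes imbalance, F1 stays low unless precision and recall are both high, which is why it is often preferred over plain accuracy when defect and good classes are unevenly represented.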
Measuring what matters
These examples have been kept rudimentary for simplicity’s sake. A real-world deep-learning algorithm might have a half-dozen classifications or more. That would make for a much more sophisticated confusion matrix. There are also more complex formulas for assessing the recall and accuracy of learning algorithms, for instance.
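For illustration, a multi-class confusion matrix can be tallied the same way as the two-class counts, with one cell per (actual, predicted) class pair; the defect class names below are made up:

```python
from collections import defaultdict

def confusion_matrix(ground_truth, predicted):
    # matrix[actual_class][predicted_class] -> count.
    # Off-diagonal cells show which classes get confused for which.
    matrix = defaultdict(lambda: defaultdict(int))
    for truth, pred in zip(ground_truth, predicted):
        matrix[truth][pred] += 1
    return matrix
```

With a half-dozen classes, the diagonal of this matrix holds the correct predictions, and per-class precision and recall fall out of its rows and columns.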
Ultimately, these classification metrics allow companies to create a baseline of success and apply scoring mechanisms, much like teachers grading their students. Over time, deep-learning developers can use these metrics to help fine-tune their applications and produce much more accurate assessments of what works and what does not.
When it comes to industrial automation, manufacturers need a better understanding of what is working and not working in the applications they have deployed. Choosing which metrics to focus on depends on each organization’s unique production line, the problems they are trying to solve, and the business outcomes that matter most.