The life cycle of a machine learning (ML) model can be as familiar as a software development or engineering project. At a macro level, they all start with a measurable objective and some idea of how to achieve it, followed by a prototyping process, a development process, an acceptance process to gain permission to roll out, and a deployment process.
Deciding whether to accept the results and send the 'model to market', where it will affect the things you hold most precious, is a binary decision. The inputs to that decision, however, are not binary: machine learning is probability-based, and every model comes with some degree of wrongness.
The Role Of Governance
Having the governance in place to correctly evaluate the benefits of machine learning is an essential part of digital business mastery, and accepting the appropriate level of wrongness in automated decisions gives you a competitive advantage. This is especially important in machine-to-machine situations, where the model may be the final decider because human review would create an unfavorable delay.
The first, though certainly not the last, piece of information available for making the roll-out decision is produced by the Evaluate Model feature of Azure Machine Learning Studio, which compares one dataset with another. The datasets can be your training results compared to the test results, or the test results from one model versus another. A single Evaluate Model node can only compare two datasets at a time.
The evaluation scores both regression and classification models. Classification models are scored with accuracy, precision and AUC (area under the curve), while regression models are scored with mean absolute error, root mean squared error, relative absolute error, and relative squared error metrics.
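As a quick sketch of how those four regression error metrics are computed, here is a hand calculation on a tiny hypothetical set of predictions (the values are illustrative only, not output from any real model). The two relative metrics normalize against a naive predictor that always guesses the mean of the actuals:

```python
import math

# Hypothetical actuals and predictions, for illustration only
actual    = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

n = len(actual)
mean_actual = sum(actual) / n
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
sq_errors  = [(a - p) ** 2 for a, p in zip(actual, predicted)]

mae  = sum(abs_errors) / n            # mean absolute error
rmse = math.sqrt(sum(sq_errors) / n)  # root mean squared error
# Relative errors: model error divided by the naive mean-predictor's error
rae  = sum(abs_errors) / sum(abs(a - mean_actual) for a in actual)
rse  = sum(sq_errors) / sum((a - mean_actual) ** 2 for a in actual)
```

A relative error below 1.0 means the model beats the naive mean predictor; above 1.0, you would be better off just guessing the average.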
Our bike buyer classification model has these scores:
The true positive (TP), false negative (FN), false positive (FP), and true negative (TN) counts are often presented in a grid, aptly called a 'confusion matrix'.
The supplementary scores are calculated from the confusion matrix like so:
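The formulas behind these scores can be sketched directly from the four confusion-matrix cells. The counts below are hypothetical, purely for illustration, not the bike buyer model's actual results:

```python
# Hypothetical confusion-matrix counts, for illustration only
TP, FN, FP, TN = 90, 10, 20, 880

accuracy    = (TP + TN) / (TP + FN + FP + TN)  # correct predictions over all predictions
recall      = TP / (TP + FN)                   # actual positives found (true positive rate)
precision   = TP / (TP + FP)                   # predicted positives that were right
f1_score    = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
specificity = TN / (TN + FP)                   # actual negatives found (true negative rate)
```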
The results can be interpreted as follows:
Accuracy = how many were correctly classified as a ratio of all the results
“The test is correct 9 times out of 10”
A high ratio is good and a low ratio is bad. Accuracy weighs a correct positive and a correct negative the same, regardless of how many of each there are. If it is acceptable to miss some positives or some negatives, and the costs of each are identical, then this score is the deciding factor. If there are consequences for missing positives or negatives, then you must also consider the net benefit.
Recall = how many of the positives did the model classify as a ratio of all the actual positives
“With this test we should find nearly 19% of them, but we need something else to find the other 247 we think are out there”
A high ratio is good, a low ratio is bad. If you want to locate as many positives as possible and don't care about anything else, then this is the deciding factor. But if there are consequences to false negatives or false positives, then you must also consider the accuracy and calculate the net benefit.
Precision = how often a positive test result was actually positive
“You have tested positive, but 8 times out of 25 the test is wrong”
A high ratio is good, a low ratio is bad. If there are consequences for a false positive but not for a false negative, then this is the deciding factor. If there are also consequences for false negatives, then you must consider the accuracy and calculate the net benefit.
F1 Score = gives equal weighting to precision and recall and so ignores true negative performance
A high ratio is good, a low ratio is bad.
Specificity (how many of the actual negatives the model classified as negative) isn't calculated by Azure Machine Learning, but it is important if false negatives have more consequences than false positives. A high ratio is good, a low ratio is bad.
“You have tested negative, but there is a 1 in 100 chance you actually have it”
Calculating Net Benefit
When there are consequences for both false positives and false negatives, or the test is only marginally better than your current performance, the deciding factor is the net benefit. Calculating the net benefit when you have non-linear cost and benefit curves may require its own model, but the process uses the extended value of the confusion matrix.
Using the same results but monetizing the benefits and consequences can lead to entirely different conclusions. For example, the accuracy of the model is good enough to make money in the marketing campaign, because avoiding the cost of reaching customers who were never going to buy drives a net benefit. You even benefit from a false negative, because those customers buy your product even though your campaign deliberately shunned them.
**Marketing Campaign** - avoid wasting money connecting with people who are never going to buy

| $0.30 paid to reach a potential customer and a net $2 profit for every sale | Counts | Per-instance benefit / (consequence) | Extended result |
|---|---|---|---|
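The extended-value calculation can be sketched as follows. The counts are hypothetical; the per-instance values follow the campaign's assumptions of a $0.30 contact cost and a net $2 profit per sale, with a false negative still yielding profit because the customer buys without ever being contacted:

```python
# Hypothetical confusion-matrix counts for the campaign
counts = {"TP": 450, "FP": 120, "FN": 80, "TN": 1350}

# Per-instance benefit / (consequence), in dollars (assumed values):
#   TP: contacted and buys       -> $2 profit minus $0.30 contact cost
#   FP: contacted, never buys    -> $0.30 contact cost wasted
#   FN: skipped but buys anyway  -> $2 profit, no contact cost
#   TN: correctly skipped        -> $0 (contact cost avoided)
value = {"TP": 2.00 - 0.30, "FP": -0.30, "FN": 2.00, "TN": 0.00}

net_benefit = sum(counts[cell] * value[cell] for cell in counts)
print(f"Net benefit: ${net_benefit:,.2f}")
```

Shrinking the profit margin or feeding in a less accurate model's counts can flip the net benefit negative, which is exactly when the model becomes more expensive than the current process.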
But a lower accuracy and a smaller benefit can make the model more expensive to implement than the current process.
When the consequences are two-sided and the score is not perfect, a valid net benefit calculation is the deciding factor for a classification model. The creation of the net benefit calculator, and the review and approval process around it, are the responsibility of a digital-business governance program.
Watch for Part 2 where I’ll discuss interpreting the scores of a regression model and the implication in using them. Feel free to leave a question or comment below!