The previous post concentrated on deciding whether a classification (categorical) Machine Learning model should be released to production. This post concentrates on interpreting the scores of a regression model and the implications of using it in a decision management feature.
Decision management solutions take business rules written by humans and automatically apply them to the cases they are presented with. Digital masters are more likely than those just starting their digital journey to include the output of a machine learning prediction in their decision management systems.
For example, before machine learning was included, the rules might be:
- If the last inspection was more than 30 days ago then put the battery on the “inspection due” list.
- If there are more than 20 items on the “inspection due list” at a site then dispatch a crew to that site.
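Rules like these translate almost directly into code. The sketch below is only an illustration of the idea, not a real rules engine: the record fields, thresholds and function names are all assumptions.

```python
from datetime import date

# Hypothetical battery records; field names are illustrative assumptions.
batteries = [
    {"id": "B-001", "site": "site-A", "last_inspection": date(2015, 1, 2)},
    {"id": "B-002", "site": "site-A", "last_inspection": date(2015, 2, 20)},
]

def inspection_due(battery, today, max_age_days=30):
    """Rule 1: flag a battery whose last inspection is more than 30 days old."""
    return (today - battery["last_inspection"]).days > max_age_days

def sites_to_dispatch(batteries, today, threshold=20):
    """Rule 2: dispatch a crew to any site with more than 20 items due."""
    due_counts = {}
    for b in batteries:
        if inspection_due(b, today):
            due_counts[b["site"]] = due_counts.get(b["site"], 0) + 1
    return [site for site, count in due_counts.items() if count > threshold]
```

In a product such as SMARTS these rules would be authored and maintained by business users rather than hard-coded, which is the point of a decision management system.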
These rules are often developed from expert opinion not inductive reasoning and are implemented in a rules engine, e.g. SMARTS.
Adding Machine Learning
But when you combine machine learning outputs with expert opinion, the rules can be improved:
- If there is a >70% probability of failure before the next inspection then put the battery on the “inspection due list.”
- If there are more than 20 batteries on the “inspection due list” at a site then dispatch a crew to that site.
- If there is more than 90% probability of rain on the route then add 20% to the travel time.
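The ML-augmented rules swap a simple date test for a predicted probability. Again a minimal sketch; the function names and thresholds are assumptions for illustration.

```python
def inspection_due_ml(p_failure, threshold=0.70):
    """ML rule: flag a battery when the model's predicted probability of
    failure before the next inspection exceeds 70%."""
    return p_failure > threshold

def travel_time(base_minutes, p_rain, rain_threshold=0.90, uplift=0.20):
    """Weather rule: add 20% to travel time when the probability of rain
    on the route exceeds 90%."""
    return base_minutes * (1 + uplift) if p_rain > rain_threshold else base_minutes
```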
Using more data than before to improve automatic decision making and increase performance is a typical digital business tactic.
But machine learning results always have a degree of wrongness, so you have to choose the level of wrongness you can tolerate. In this example, you need to be more than 70% confident before you want to commit to dispatching a crew.
The 70% threshold need not be an arbitrary choice; it can be derived from the score of the model and the level of activity generated by each change in confidence level. Azure Machine Learning (ML) scores regression models with mean absolute error, root mean squared error, relative absolute error, relative squared error and the coefficient of determination (aka R Squared, R2 or r2).
Mean Absolute Error
A high number is bad and a low number is good. It is an absolute value in the units of the vertical axis, so only use it to compare models with a common vertical axis unit of measure and similar ranges. As with all these scores, it measures the aggregate vertical differences, aka residuals, between the actual cases and the best-fit regression line. These scores use absolute or squared values so that over-predicting and under-predicting are treated equally.
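Azure ML computes this score for you, but as a sketch of what the number means, MAE is just the average of the absolute residuals:

```python
def mean_absolute_error(actual, predicted):
    """MAE: average of |actual - predicted|, expressed in the units of the
    target variable, so it is only comparable across models that share them."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```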
Root Mean Squared Error
This score squares the differences rather than using absolute differences, averages them, and then takes the square root, so it is also expressed in the units of the vertical axis. A high number is bad and a low number is good. Only compare models with a common vertical axis unit of measure and similar ranges.
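A minimal sketch of the same calculation, shown only to make the square-then-root sequence concrete:

```python
import math

def root_mean_squared_error(actual, predicted):
    """RMSE: square the residuals, average them, then take the square root.
    Squaring penalizes large residuals more heavily than MAE does."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)
```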
Relative Absolute Error
The ‘relative’ in this context means relative to the error you would make by always predicting the mean of the actual ‘Y’s (the variable you are trying to predict). Because it is a ratio you can compare it to any other model. A high number is bad and a low number is good.
Relative Squared Error
The ‘relative’ again means relative to the error of always predicting the mean of the actual ‘Y’s, but this score squares the residuals instead of taking absolute differences. Because it is a ratio you can compare it to any other model: 0 is a perfect model, and 1 means the model is no better than predicting the mean. A high number is bad and a low number is good.
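Both relative scores compare the model’s total error to that of a naive predictor that always outputs the mean; a minimal sketch to make the ratio explicit:

```python
def relative_absolute_error(actual, predicted):
    """RAE: total absolute error relative to the absolute error of always
    predicting the mean of the actual values."""
    mean_y = sum(actual) / len(actual)
    model_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    naive_error = sum(abs(a - mean_y) for a in actual)
    return model_error / naive_error

def relative_squared_error(actual, predicted):
    """RSE: the same ratio, but with squared residuals instead of
    absolute differences."""
    mean_y = sum(actual) / len(actual)
    model_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    naive_error = sum((a - mean_y) ** 2 for a in actual)
    return model_error / naive_error
```

A model that simply predicts the mean scores exactly 1.0 on both, which is why these ratios transfer across models with different units.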
Coefficient of Determination
The closer to 1, the better the model is at forecasting; 0 means it isn’t any good at all and you might as well use the historic mean to predict the next outcome.
If you have a single variable X (Azure ML allows you to have multiple variables, too) that contains information to predict Y, then the coefficient of determination explains how much better using your model with X as an input performs compared to just taking the mean of Y to forecast any other Y. This score is only valid when the differences between the predicted values and the actual values have a normal distribution and the relationship is linear. A coefficient of determination of .7 means 70% of the variance in Y is explained by using X to forecast Y, and the other 30% cannot be explained by this model.
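In code, the coefficient of determination is one minus the ratio of the model’s squared error to the squared error of the mean predictor, so it is the mirror image of the relative squared error. A minimal sketch:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares /
    total sum of squares). 1 is a perfect fit; 0 is no better than
    always predicting the mean."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```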
The coefficient of determination is sensitive to outliers. Outlier treatment is a subject in itself but as with all machine learning, take a good look at the variables going into the model to see if there are outliers.
Anscombe’s quartet provides a warning to those who want to use a single score to decide whether to take a regression model to market. Compare these Evaluate Model outputs with their respective scatter plots.
Each of these regression lines has an identical slope, intercept and coefficient of determination, and, with the exception of the Mean Absolute Error (MAE) and Relative Absolute Error (RAE), similar scores for all the other tests too. Even the variance in MAE and RAE between the charts does not reveal the shape of the population.
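You can verify the point yourself. The quartet’s data are published in Anscombe’s 1973 paper; the sketch below fits an ordinary least squares line to each dataset and shows that the slope, intercept and coefficient of determination are practically identical across all four, despite the wildly different scatter plots.

```python
# Anscombe's quartet (Anscombe, 1973). Datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def fit_line(x, y):
    """Ordinary least squares fit: returns (slope, intercept, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    pred = [intercept + slope * xi for xi in x]
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

for name, (x, y) in quartet.items():
    slope, intercept, r2 = fit_line(x, y)
    print(f"{name}: y = {intercept:.2f} + {slope:.3f}x, R2 = {r2:.3f}")
```

All four fits come out at roughly y = 3.00 + 0.500x with R2 ≈ 0.67, which is exactly why no single score, or even this whole set of scores, can substitute for looking at the data.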
When Anscombe presented the quartet, he observed:
“The user is not showered with graphical displays, he can get them only with trouble, cunning and a fighting spirit.”
F. J. Anscombe, “Graphs in Statistical Analysis”, The American Statistician, Vol. 27, No. 1 (Feb. 1973), pp. 17–21.
Sampling and Experimentation
42 years later you don’t need ‘cunning and a fighting spirit’ to visualize your data; you just need a sampling technique so the data fits in Power BI or Excel, plus the visualization features of Azure ML. But you still need those qualities to protect yourself from a poor model by implementing fail-safes within your decision management system. Returning to the decision automation example from earlier, you could add another rule:
- If 80% or more of the items on the “inspection due list” are impending-failure recommendations then execute the rogue model protocol.
Because when good models go bad, you need the ability to fight back. You also have to be patient: as with all machine learning, it is much easier to prove yourself over a series of Ys than to predict any given Y, so don’t abandon the model just because of a tragic debut.
Failure is a by-product of experimentation but experimentation is an essential attitude for good digital business. Governance processes that apply fail-safes to your digital processes allow for experimentation without jeopardizing valuable assets and relationships. Decision Engines such as SMARTS simplify experimentation by allowing you to do ‘champion/challenger’ testing.
Want to talk a bit more about Azure Machine Learning in your organization? Let me know…