“…learn statistical hygiene to avoid disasters.”
Attributed to John Platt, Distinguished Scientist, Microsoft Research, during the Practice of Machine Learning Conference at Microsoft, as posted on TechNet
Some would say statistical hygiene includes cleaning and conforming the data you will use before you start learning from it. Failing to do so could invalidate your machine learning project and undermine the benefits of your digital business program.
In no particular order, here are some examples of hygiene issues:
Sampling bias – Non/missing response
You want to learn why a valve keeps cycling unnecessarily, but the readings from the 30 seconds before and the 60 seconds after the episode are usually missing.
Sampling bias – convenient population
You want to learn why the software renewal rate is lower than hoped for, but you only sample users who submitted a help request because you have additional demographic information on them. That ignores everyone who never used the software at all or abandoned it before asking for help.
Sampling bias – self-selection
You solicit responses to learn which feature is the most popular candidate for your new product release, but a small number of activists use social media to create a "vote for <insert pet feature here>" campaign and push down the features your full-price paying customers want.
Sampling bias – wrong time
You want to learn how temperature affects the evaporation of an electrolyte in a battery, but you only use this summer’s readings from sites in Arizona, when over half of your business is on the East Coast.
Sampling bias – wrong population
You want to learn how temperature affects the evaporation of an electrolyte in a battery, but you only use the last two years of readings from sites that operate 24x365, which make up only 20% of your population.
Measurement errors – leading questions
For example, “On a scale of 1-5 how much do you hate your boss?”
Measurement errors – vanity
For example, “In the past month, how many hours have you spent updating social media and personal shopping during working hours?”
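To make the sampling-bias examples above concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the `simulate_sites` helper, the evaporation rates, and the assumption that continuously operated sites run hotter are hypothetical, not measurements. It only shows how an estimate drawn from a convenient 20% of sites can drift away from the figure for the whole population, while a random sample of the same population stays close.

```python
import random
import statistics


def simulate_sites(n_sites=1000, share_continuous=0.2, seed=0):
    """Build a hypothetical population of sites with an evaporation rate each."""
    rng = random.Random(seed)
    sites = []
    for _ in range(n_sites):
        continuous = rng.random() < share_continuous
        # Assumption for illustration only: sites that run around the
        # clock evaporate electrolyte faster than the others.
        mean_rate = 5.0 if continuous else 3.0
        sites.append({"continuous": continuous,
                      "rate": rng.gauss(mean_rate, 0.5)})
    return sites


sites = simulate_sites()

# The value we actually want to learn: the mean over every site.
population_mean = statistics.mean(s["rate"] for s in sites)

# The convenient sample: only the sites with continuous operations.
convenience_mean = statistics.mean(
    s["rate"] for s in sites if s["continuous"]
)

# A simple random sample of 100 sites drawn from the whole population.
random_sample = random.Random(1).sample(sites, 100)
random_sample_mean = statistics.mean(s["rate"] for s in random_sample)

print(f"whole population:   {population_mean:.2f}")
print(f"continuous sites:   {convenience_mean:.2f}")
print(f"random sample (100): {random_sample_mean:.2f}")
```

The convenient sample is larger than the random one, yet it is the one that misleads. More rows do not repair a sample drawn from the wrong place.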
A population can be contaminated if it contains two or more populations but we only want to learn about one of them. For example, we might want to learn the mean time between failures of a 30hp 3600 RPM electric motor. However, we did not track cycle times or environmental variables, even though some motors run intermittently in an exposed environment while others run continuously in a conditioned environment. We might spot our mistake if the differences were large enough to create a bi-modal distribution, but not if the two distributions overlapped generously.
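The motor example can be sketched with simulated data. The failure times below are invented (normally distributed hours to failure, with group means of 8,000 and 10,000 and a spread of 1,500), so treat this as an illustration under those assumptions rather than real motor behavior. With that much overlap the pooled histogram tends to show a single hump, and the pooled mean describes neither group.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical hours to failure for two kinds of duty that were
# never recorded separately in the original data set.
intermittent_exposed = [rng.gauss(8000, 1500) for _ in range(500)]
continuous_conditioned = [rng.gauss(10000, 1500) for _ in range(500)]
pooled = intermittent_exposed + continuous_conditioned

mean_exposed = statistics.mean(intermittent_exposed)
mean_conditioned = statistics.mean(continuous_conditioned)
mean_pooled = statistics.mean(pooled)


def histogram(values, low, high, bins):
    """Count values into equal-width bins; values outside [low, high) are skipped."""
    counts = [0] * bins
    width = (high - low) / bins
    for value in values:
        if low <= value < high:
            counts[int((value - low) / width)] += 1
    return counts


# Crude text histogram of the pooled data, one row per 1,000 hours.
for i, count in enumerate(histogram(pooled, 2000, 16000, 14)):
    print(f"{2000 + i * 1000:>6} h | {'#' * (count // 5)}")

print(f"intermittent, exposed:   {mean_exposed:.0f} h")
print(f"continuous, conditioned: {mean_conditioned:.0f} h")
print(f"pooled (what we measured): {mean_pooled:.0f} h")
```

Once the duty cycle and environment are recorded alongside each failure, splitting the data is trivial. Without those columns, nothing in the pooled numbers tells you a split is needed.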
Good Statistical Hygiene
There is more to good statistical hygiene than good data. It also requires discipline in how you conduct tests, understand confidence measures, and interpret model evaluation scores. Without this rigor your organization may learn the wrong thing, jeopardizing not only your machine learning (ML) project but perhaps your digital business program too. In this respect, ML projects are no different from regular IT and regular business intelligence work: garbage in often results in garbage out.
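As one small illustration of reading an evaluation score with its confidence in mind, the sketch below puts an approximate 95% interval around a model's accuracy. It uses the simple normal approximation for a proportion, which is known to be rough for small test sets or scores near 0% or 100%. The `accuracy_interval` helper and the test-set counts are hypothetical.

```python
import math


def accuracy_interval(correct, total, z=1.96):
    """Approximate interval for accuracy (normal approximation; z=1.96 is about 95%)."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clamp to the valid range for a proportion.
    return max(0.0, p - half_width), min(1.0, p + half_width)


# The same 90% accuracy, measured on two hypothetical test sets.
small = accuracy_interval(45, 50)
large = accuracy_interval(4500, 5000)

print(f"50 test cases:   {small[0]:.3f} to {small[1]:.3f}")
print(f"5000 test cases: {large[0]:.3f} to {large[1]:.3f}")
```

Both models "score 90%", but the first interval is about ten times wider than the second, so the small test set supports a much weaker claim.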