Statistical Hygiene

“…learn statistical hygiene to avoid disasters.”

Attributed to John Platt, Distinguished Scientist, Microsoft Research, during the Practice of Machine Learning Conference at Microsoft, as posted on TechNet

Some would say statistical hygiene includes cleaning and conforming the data you will use before you start learning from it. Failing to do so could invalidate your machine learning project and undermine the benefits of your digital business program.

In no particular order, here are some examples of hygiene issues:

Sampling bias – Non/missing response

You want to learn why a valve keeps cycling unnecessarily, but most of the time the readings from the 30 seconds before and the 60 seconds after each episode are missing.
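One way to spot this kind of gap is to compare how complete the data is overall with how complete it is around each episode. The sketch below is only illustrative: it assumes one reading per second, and the function name window_coverage, the timestamps, and the dropout are all made up rather than taken from a real valve.

```python
def window_coverage(reading_times, episode_time, before=30, after=60):
    """Share of expected readings present from `before` s before to `after` s after an episode."""
    start, end = episode_time - before, episode_time + after
    expected = end - start + 1  # assumes one reading per whole second, inclusive
    present = {t for t in reading_times if start <= t <= end}
    return len(present) / expected


# Hypothetical log: 2000 seconds of operation with a sensor dropout
# that happens to straddle the episode at t = 1000.
readings = list(range(0, 975)) + list(range(1050, 2000))
overall = len(set(readings)) / 2000
around_episode = window_coverage(readings, episode_time=1000)

print(f"overall completeness:      {overall:.1%}")
print(f"completeness near episode: {around_episode:.1%}")
```

A log that looks nearly complete overall can still be mostly empty in exactly the window you need, which suggests the readings are not missing at random.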

Sampling bias – convenient population

You want to learn why the software renewal rate is lower than hoped for, but you sample only those who submitted a help request because you have additional demographic information on those users. In doing so, you ignore those who never used the software at all or abandoned it before asking for help.
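A small worked example shows how far a convenient sample can drift from the population it is supposed to describe. The segment names follow the scenario above, but every count is hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical counts per segment: (users, renewed, submitted_help_request)
segments = {
    "never used the software": (400, 20, False),
    "abandoned before asking for help": (300, 45, False),
    "submitted a help request": (300, 210, True),
}


def renewal_rate(rows):
    """Renewals divided by users across the given segment rows."""
    rows = list(rows)
    users = sum(u for u, _, _ in rows)
    renewed = sum(r for _, r, _ in rows)
    return renewed / users


population_rate = renewal_rate(segments.values())
# The convenient sample keeps only users who submitted a help request.
sample_rate = renewal_rate(row for row in segments.values() if row[2])

print(f"renewal rate, all users:          {population_rate:.1%}")
print(f"renewal rate, help-request users: {sample_rate:.1%}")
```

With these made-up numbers the help-request group renews far more often than the population as a whole, so a model trained only on that group would say little about the users who are actually driving the low renewal rate.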

Sampling bias – self-selection

You solicit responses to learn which feature is the most popular to add to your new product release, but a small number of activists use social media to create a “vote for <insert pet feature here>” campaign and push down the features your full-price paying customers want.

Sampling bias – wrong time

You want to learn how temperature affects the evaporation of an electrolyte in a battery, but you use only this summer’s readings from sites in Arizona, even though over half of your business is on the East Coast.

Sampling bias – wrong population

You want to learn how temperature affects the evaporation of an electrolyte in a battery, but you use only the last two years of readings from sites that operate 24x365, which account for only 20% of your population.
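A simple representativeness check compares each group's share of the sample with its known share of the population. The sketch below is one possible way to do that: the 20% figure comes from the example above, while the function name representation_gaps, the 10-point tolerance, and the sample size are assumptions made for illustration.

```python
def representation_gaps(population_share, sample_counts, tolerance=0.10):
    """Return strata whose sample share differs from the population share by more than `tolerance`."""
    total = sum(sample_counts.values())
    if total == 0:
        raise ValueError("sample is empty")
    gaps = {}
    for stratum, pop_share in population_share.items():
        sample_share = sample_counts.get(stratum, 0) / total
        if abs(sample_share - pop_share) > tolerance:
            gaps[stratum] = (pop_share, sample_share)
    return gaps


# Population shares from the scenario; the sample size of 1200 is hypothetical.
population_share = {"24x365 sites": 0.20, "other sites": 0.80}
sample_counts = {"24x365 sites": 1200}

gaps = representation_gaps(population_share, sample_counts)
for stratum, (pop, sample) in gaps.items():
    print(f"{stratum}: {pop:.0%} of population, {sample:.0%} of sample")
```

Running a check like this before training makes the mismatch visible: the sample is entirely round-the-clock sites, while the sites that make up most of the population are absent.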

Measurement errors – leading questions

For example, “On a scale of 1-5 how much do you hate your boss?”

Measurement errors – vanity

For example, “In the past month, how many hours have you spent updating social media and personal shopping during working hours?”

Contaminated Population

A population can be contaminated if it contains two or more populations but we want to learn about only one of them. For example, we might want to learn the mean time between failures of a 30 hp, 3600 RPM electric motor. However, we did not track cycle times and environmental variables, even though some motors run intermittently in an exposed environment while others run continuously in a conditioned environment. We might see our mistake if the differences were large enough to create a bimodal distribution, but not if there was a generous overlap.

[Figure: Bimodal Distribution]

[Figure: Generous Overlap]
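The two cases can be sketched numerically. The example below is a deliberately simplified model, not a description of real motor data: it assumes the two groups of motors are equally common, that each group's time between failures is normally distributed with the same spread, and that the means (in hours) are made-up values. It counts the peaks in the pooled distribution on a grid.

```python
import math


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def mixture_pdf(x, mean_a, mean_b, sd):
    """Density of an equal-weight mix of two normal populations (an assumed model)."""
    return 0.5 * normal_pdf(x, mean_a, sd) + 0.5 * normal_pdf(x, mean_b, sd)


def count_modes(mean_a, mean_b, sd, steps=2000):
    """Count strict local maxima of the pooled density on an evenly spaced grid."""
    lo = min(mean_a, mean_b) - 4 * sd
    hi = max(mean_a, mean_b) + 4 * sd
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [mixture_pdf(x, mean_a, mean_b, sd) for x in xs]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])


# Hypothetical mean hours between failures: (exposed/intermittent, conditioned/continuous)
well_separated = count_modes(2000, 8000, sd=1000)
overlapping = count_modes(4500, 5500, sd=1000)

print("peaks when the groups are far apart:", well_separated)
print("peaks when the groups overlap:      ", overlapping)
```

Under these assumptions, two peaks appear only when the group means are more than two standard deviations apart. In the overlapping case the pooled data shows a single peak, so nothing in the histogram warns you that two different populations are mixed together.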


Good Statistical Hygiene

There is more to good statistical hygiene than good data. It also requires discipline in how you conduct tests, understand confidence measures, and interpret model evaluation scores. Without this rigor your organization may learn the wrong thing, jeopardizing not only your machine learning (ML) project but perhaps your digital business program too. In this respect, ML projects are no different from regular IT and business intelligence work: garbage in often results in garbage out.
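As one illustration of reading an evaluation score with its uncertainty, the sketch below puts a Wilson score interval around a model's accuracy. The interval formula is standard, but the scenario is assumed: a model that scores 90% on a test set, compared against a hypothetical 85% baseline, at two made-up test-set sizes.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


baseline = 0.85  # hypothetical accuracy of the existing approach

# Same 90% accuracy, measured on 50 and on 5000 test examples (made-up sizes).
small_low, small_high = wilson_interval(45, 50)
large_low, large_high = wilson_interval(4500, 5000)

print(f"n=50:   {small_low:.3f} to {small_high:.3f}")
print(f"n=5000: {large_low:.3f} to {large_high:.3f}")
```

On the small test set the interval still contains the baseline, so the apparent improvement may be noise. On the large one the whole interval sits above it. The headline score is the same in both cases; only the confidence differs.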

Peter Darragh

Vice President of Delivery at Mariner
In his business development capacity Peter helps executives evaluate the impact digital investments can have on their business models and operations. In his delivery role, he manages the teams that apply their data integration, analytics, process automation and machine learning expertise to make our customers digital masters.

Mariner © 2019. All Rights Reserved.