# Machine Learning – How big should your data be?

The above notable quotes are two of the most effective statements that you’ll see defending the power of data.

Accepting that more data is powerful, how do you decide whether you need more data for the statistical model you are currently working on?

Here’s one way to look at it:

Almost every statistical model is prone to some level of error. Some of this error is reducible, and some of it isn’t.

• Irreducible Error – The inherent uncertainty that comes from the natural variability in a system.
• Reducible Error – An error that can and should be minimized to improve accuracy.

Reducible error is further divided into Bias and Variance.
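
For squared-error loss, this split is usually written as the decomposition below. The notation is an assumption introduced here for illustration: $f$ is the true underlying function, $\hat{f}$ is the trained model, $y = f(x) + \varepsilon$ is an observation with noise variance $\sigma^2$, and the expectations are taken over training samples and noise.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Irreducible error}}
$$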

### Bias

Bias is the difference between a model’s expected results (predictions) and the true values of the data underlying the model. Put loosely, bias measures how far off the model’s predictions are from the actual values. The average residual is a useful measure of bias, where a residual is the difference between the prediction and the true value for a specific data point. If a perfect (completely unbiased) predictive model existed, the average of its residuals would be zero.

Bias measures inaccuracy of the model – the inability to make correct predictions.
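
As a rough illustration of the average-residual idea, here is a minimal sketch. The function name `mean_residual` and the toy numbers are made up for this example and are not taken from any real model.

```python
from statistics import mean

def mean_residual(predictions, actuals):
    """Average of (prediction - true value) over all data points."""
    return mean(p - a for p, a in zip(predictions, actuals))

# Hypothetical toy values, chosen only to show the two cases.
actuals = [10.0, 12.0, 14.0, 16.0]
balanced = [11.0, 11.0, 15.0, 15.0]  # misses in both directions, errors cancel
shifted = [12.0, 14.0, 16.0, 18.0]   # every prediction is 2 too high

print(mean_residual(balanced, actuals))
print(mean_residual(shifted, actuals))
```

Note that an average residual of zero only says the errors cancel out on average. The individual predictions in `balanced` are still each off by one.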

When bias is high, the expected result (prediction) is far from the true value. This usually happens when the model is oversimplified (it has fewer parameters than required), or under-fit. Here, the model does not have enough parameters to predict the outcome accurately. Simply increasing the number of observations (the training sample size) will not decrease the bias. To reduce bias, you have to include more features in the model, which increases the complexity of your data and your model.
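
The sketch below tries to make this concrete under some loudly stated assumptions: the data is synthetic (`y = 2x + 1` plus noise), the under-fit model is a constant that ignores `x` entirely, and the helper names (`make_data`, `fit_constant`, `fit_line`, `mse`) are hypothetical. It is an illustration, not a recipe.

```python
import random
from statistics import mean

def make_data(n, rng):
    """Synthetic data (an assumption for this sketch): y = 2x + 1 + noise."""
    xs = [rng.uniform(0, 1) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 0.1) for x in xs]
    return xs, ys

def fit_constant(xs, ys):
    """Under-fit model: ignores x and always predicts the mean of y."""
    c = mean(ys)
    return lambda x: c

def fit_line(xs, ys):
    """One-feature least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def mse(model, xs, ys):
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

rng = random.Random(0)
test_x, test_y = make_data(2000, rng)

results = {}
for n in (50, 5000):
    xs, ys = make_data(n, rng)
    results[n] = (
        mse(fit_constant(xs, ys), test_x, test_y),
        mse(fit_line(xs, ys), test_x, test_y),
    )
    print(n, results[n])
```

In this setup the constant model's test error stays large even with 100 times more training data, while adding the single feature `x` brings the error down to roughly the noise level.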

### Variance

Variance is how much the expected result (prediction) for a given data point varies around the average prediction for that data point, over multiple training runs of the model. Variance measures the instability of the model – the inability to make consistent predictions.

When variance increases, the consistency of the model drops. Unlike bias, variance increases when the model is too complex (it has more parameters than required), or over-fit. Here, the model has so many parameters that they distort its predictions. To reduce variance in such situations, you have to reduce the number of features that go into the model, which reduces the complexity of your data and your model.

In other situations, the variance of a model is also affected by the size of the training sample. The variance of a model is high when there aren’t enough observations for the set of selected parameters in the underlying data. As seen in the above example, when a test observation is introduced to a model with fewer training observations, the model’s new prediction varies noticeably from the previous prediction in an attempt to accommodate the test observation. However, when the same test observation is introduced to a model with more training observations, the model’s new prediction adjusts to the test observation with much lower variance.
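
One way to see the sample-size effect is to retrain the same model many times on fresh samples and measure how much its prediction at one fixed point moves around. The sketch below does this for a least-squares line. The synthetic data, the choice of `x0 = 0.5`, and the helper names `fit_line` and `prediction_variance` are all assumptions made for illustration.

```python
import random
from statistics import mean, pvariance

def fit_line(xs, ys):
    """Least-squares line; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def prediction_variance(n_train, x0=0.5, runs=200, seed=0):
    """Variance of the prediction at x0 across many independent training runs."""
    rng = random.Random(seed)
    preds = []
    for _ in range(runs):
        # Fresh synthetic training sample for every run: y = 2x + 1 + noise.
        xs = [rng.uniform(0, 1) for _ in range(n_train)]
        ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]
        slope, intercept = fit_line(xs, ys)
        preds.append(slope * x0 + intercept)
    return pvariance(preds)

small = prediction_variance(10)
large = prediction_variance(1000)
print(f"n=10:   variance of prediction = {small:.5f}")
print(f"n=1000: variance of prediction = {large:.5f}")
```

The model and its number of parameters are identical in both cases. Only the number of training observations changes, and the spread of the predictions shrinks sharply with the larger sample.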

Below are some commonly used graphs to explain the influence of data on a model’s bias and variance.

The first graph demonstrates that a model starts out with a very small bias and a very high variance when the training sample is tiny. However, as the sample size increases, the model bias increases but the variance decreases. After a certain point, both bias and variance stabilize and no longer change with the sample size.

The second graph demonstrates that a model starts out with a very high bias and a very low variance when the training sample has too few features. As the model complexity increases, the model bias keeps decreasing but the variance keeps increasing. However, there is an optimally complex model that minimizes the total error in prediction.

With the primary goal of minimizing the total model error, we have to find that optimally complex model and make sure that its complexity matches the training sample size (to keep bias in check). This is commonly called the bias-variance trade-off.
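
A small experiment can show what an "optimally complex" model looks like. The sketch below uses a piecewise-constant model as a stand-in, because its number of parameters is simply its number of bins. This model, the synthetic sine-shaped data, and the helper names (`make_data`, `fit_bins`, `mse`) are assumptions chosen for illustration, not something prescribed by the discussion above.

```python
import math
import random
from statistics import mean

def make_data(n, rng):
    """Synthetic data (an assumption): y = sin(2*pi*x) + noise, x in [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

def fit_bins(xs, ys, n_bins):
    """Piecewise-constant model: one parameter (the mean of y) per bin."""
    overall = mean(ys)
    buckets = [[] for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        buckets[min(int(x * n_bins), n_bins - 1)].append(y)
    # Bins with no training points fall back to the overall mean.
    levels = [mean(b) if b else overall for b in buckets]
    return lambda x: levels[min(int(x * n_bins), n_bins - 1)]

def mse(model, xs, ys):
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

rng = random.Random(0)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(2000, rng)

# 1 bin is too simple, 200 bins is too complex for 200 training points.
errors = {
    b: mse(fit_bins(train_x, train_y, b), test_x, test_y)
    for b in (1, 10, 200)
}
for b, err in errors.items():
    print(f"{b:>3} bins: test MSE = {err:.3f}")
```

With 200 training points, the middle setting gives the lowest test error here: one bin cannot follow the curve at all, and 200 bins leave roughly one observation per parameter. Which setting is best depends on the training sample size, so the sweep would need to be repeated as the sample grows.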

To conclude,

1. More sample data is always powerful, but is not always the right answer.
2. As you increase your sample data, make sure to adjust your parameter space to minimize error.

#### Mariner