Machine Learning (ML) can often seem like this:
[Image: xkcd comic. Source: xkcd]
We get some data on the left and keep modifying our models until we get reasonable answers on the right. Sounds simple, doesn’t it?
And yet, there are many complications. Crucially, teams working on ML projects are always choosing where to focus their scarce time. They are constantly deciding: Should we get more data? Should we invest in higher-quality data? Should we train a larger or more complex model? Should we try a different architecture?
The answers really matter here. They make the difference between making progress and never seeing value from ML projects. Teams can waste months solving the wrong problem!
Now, the bad news is that every ML project is different and there are no “one-size-fits-all” answers. The good news is that there are principles that can help us decide the best direction to pursue:
1) Give the team a compass, not a map
At the start, it is more helpful to clearly define success, rather than the exact way in which this success will be achieved. More concretely:
• Don’t overplan: Build a first system quickly and iterate. The initial system gives us clues about which directions are the most promising for the team to work on.
• Avoid multiple metrics: Choose a single-number evaluation metric early on that the team can optimize. If we are interested in several metrics, we can still create a composite single metric, or apply the metrics in sequence to get to a definite result. Regardless, there should be no ambiguity about whether the ML model is improving.
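Both options for collapsing several metrics into one decision can be sketched in a few lines. The example below is only an illustration: the `Candidate` class, the model names, the accuracy and latency values, and the 100 ms limit are all hypothetical. `pick_best` applies two metrics in sequence (filter on latency, then rank by accuracy), while `f1_score` is one common composite of precision and recall.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float     # metric we want to maximize
    latency_ms: float   # metric that only needs to be "good enough"


def pick_best(candidates, max_latency_ms):
    """Apply metrics in sequence: filter on latency, then rank by accuracy."""
    eligible = [c for c in candidates if c.latency_ms <= max_latency_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.accuracy)


def f1_score(precision, recall):
    """Composite metric: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical numbers, for illustration only.
candidates = [
    Candidate("model_a", accuracy=0.90, latency_ms=80),
    Candidate("model_b", accuracy=0.92, latency_ms=95),
    Candidate("model_c", accuracy=0.95, latency_ms=1500),
]
best = pick_best(candidates, max_latency_ms=100)
print(best.name)  # model_b
```

Either way, every experiment ends with one unambiguous answer to the question "did the model get better?".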
2) Split your data carefully or get lost in the woods
ML models tend to overfit the data; this is their nature. It is easy to fool ourselves into thinking that we have a great model if we are not rigorous with data. To avoid this, we need to create three distinct datasets:
• Test set: The ultimate arbiter of truth, measuring the final model’s performance. We want to keep this data invisible to the models when training.
• Validation set: Our main guide for measuring performance while the model is being tuned.
• Training set: The data we train our model on (this is the bulk of the data).
The purpose of the Test and Validation sets is to direct the team towards the main changes needed. For this reason, they should reflect the “real-world” environment. For instance, if we expect an image classifier to be used in bad lighting conditions, then our Test and Validation data should consist of pictures with bad lighting.
On the other hand, we can be more liberal with the Training set, using data from adjacent data distributions that might contain useful information (e.g., training data with better quality pictures).
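A minimal sketch of such a split, assuming two hypothetical lists of examples: `target_examples`, drawn from the real-world distribution, and `adjacent_examples`, drawn from a related one. The function name, the fractions, and the fixed seed are illustrative choices rather than recommendations.

```python
import random


def split_datasets(target_examples, adjacent_examples,
                   val_fraction=0.25, test_fraction=0.25, seed=0):
    """Validation and Test come only from the target distribution;
    Training gets the remaining target data plus all adjacent data."""
    rng = random.Random(seed)
    shuffled = list(target_examples)
    rng.shuffle(shuffled)

    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)

    validation = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    training = shuffled[n_val + n_test:] + list(adjacent_examples)
    rng.shuffle(training)
    return training, validation, test


# Hypothetical data: strings stand in for images.
target = [f"dark_photo_{i}" for i in range(8)]
adjacent = [f"bright_photo_{i}" for i in range(20)]
train, val, test = split_datasets(target, adjacent)
print(len(train), len(val), len(test))  # 24 2 2
```

The three sets never share an example, and the Validation and Test sets contain only pictures from the distribution we actually care about.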
3) Big Data fixation can lead us astray
“More data = Better” is a dangerous proxy for ML decisions. Teams might end up spending months collecting data they don’t need. This is what is actually needed:
• Test set: This set just needs to be large enough to give a confident estimate of the overall performance of the system.
• Validation set: The size of this set depends on the smallest accuracy difference we need to detect. It should be large enough to distinguish between candidate models, but no larger.
• Training set: Adding more Training data is only useful when we have a large Variance (see last section below for an explanation).
Most teams follow a variant of the “70/30” rule, reserving 30% of data for the Validation & Test sets and using only 70% for Training. This is likely overkill. We are better off with smaller but better curated Validation and Test sets reflecting the “real-world” distribution we want to predict.
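To get a rough feel for what “large enough” means, we can use the normal approximation to the binomial distribution for a measured accuracy. This is a back-of-the-envelope sketch that assumes independent examples; the function names and the 90% accuracy are illustrative assumptions, not values from a real project.

```python
import math


def accuracy_margin(accuracy, n_examples, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n examples."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n_examples)


def examples_needed(accuracy, margin, z=1.96):
    """Smallest n whose approximate margin of error is at most `margin`."""
    return math.ceil(z ** 2 * accuracy * (1 - accuracy) / margin ** 2)


# How precisely can we measure a model that is about 90% accurate?
for n in (100, 1_000, 10_000):
    print(n, round(accuracy_margin(0.90, n), 4))

# How many examples to pin accuracy down to within one percentage point?
print(examples_needed(0.90, margin=0.01))
```

Under these assumptions, pinning accuracy down to within about one percentage point takes a few thousand examples, which can be far less than 30% of a large dataset.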
4) Use Prediction Errors as a guide
Progress is error correction. Errors are the breadcrumbs to guide our way into better predictions. There are two important steps to follow:
• Perform Error analysis: Manually examine misclassified Validation set examples and count the major categories of errors. We can also manually examine the examples with the greatest loss (size of error), which give hints as to where the model is getting particularly confused.
• Prioritize the Errors worth fixing: Use the analysis above to prioritize what errors are worth chasing, instead of going down rabbit holes. Many errors only happen sporadically and are not worth the hassle.
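The two steps above can be sketched as a simple tally. The records below are hypothetical: in practice, each one would be a misclassified Validation example that someone has inspected and tagged by hand, and the category names and loss values here are made up.

```python
from collections import Counter


def summarize_errors(reviewed_errors):
    """Return (category, count, share of all errors), most frequent first."""
    counts = Counter(e["category"] for e in reviewed_errors)
    total = sum(counts.values())
    return [(cat, n, n / total) for cat, n in counts.most_common()]


def worst_examples(reviewed_errors, k=3):
    """Return the k examples with the greatest loss."""
    return sorted(reviewed_errors, key=lambda e: e["loss"], reverse=True)[:k]


# Hypothetical, hand-labelled misclassifications from the Validation set.
reviewed_errors = [
    {"id": 1, "category": "blurry", "loss": 2.1},
    {"id": 2, "category": "bad lighting", "loss": 0.9},
    {"id": 3, "category": "blurry", "loss": 3.4},
    {"id": 4, "category": "mislabeled", "loss": 1.2},
    {"id": 5, "category": "blurry", "loss": 0.7},
]

for category, count, share in summarize_errors(reviewed_errors):
    print(f"{category}: {count} ({share:.0%})")
print([e["id"] for e in worst_examples(reviewed_errors, k=2)])  # [3, 1]
```

The share of each category is an upper bound on how much fixing it could reduce the error rate, which makes the prioritization concrete.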
5) Not all Errors are created equal
Errors can be classified as errors due to Bias or errors due to Variance. This distinction matters more than anything in this blog post.
• Distinguish Bias and Variance: Bias is the error of our model on the Training set. If the error is large, we say the model has high Bias (it is too simple and underfits the data). On the other hand, Variance is the gap between the model's error on the Training set and its error on the Validation set. If this gap is large, we say that the model has high Variance (it is too complex and overfits the data).
• Use this simple Heuristic: If we have high Bias, we need to increase model complexity first (for example, adding more neurons or layers). If instead we have high Variance, we should add training data or make the model less complex. Crucially, adding more data does not help in reducing Bias. This realization alone can save your team months of model development and testing time!
• Choose techniques based on errors: A more nuanced version of this heuristic is that several techniques are helpful in reducing Bias, such as increasing the model complexity, developing new features, and trying different model architectures. Other techniques are helpful in reducing high Variance, such as adding Training data, regularization, dropout, early stopping, and others. Choose wisely.
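The heuristic above can be written down as a small diagnostic. This is a sketch under stated assumptions: the `tolerance` threshold and the `target_error` parameter are hypothetical additions (with the default `target_error=0.0`, Bias is simply the Training error, as defined above), and the example error rates are made up.

```python
def diagnose(train_error, val_error, target_error=0.0, tolerance=0.02):
    """Return which kinds of error look high.

    Bias is measured as Training error above `target_error`; Variance as
    the gap between Validation error and Training error.
    """
    bias = train_error - target_error
    variance = val_error - train_error
    findings = []
    if bias > tolerance:
        findings.append("high bias")
    if variance > tolerance:
        findings.append("high variance")
    return findings or ["ok"]


# Suggested techniques, taken from the list above.
SUGGESTIONS = {
    "high bias": "increase model complexity, add features, try another architecture",
    "high variance": "add Training data, regularize, use dropout or early stopping",
    "ok": "no obvious Bias or Variance problem at this tolerance",
}

# Hypothetical error rates: 15% on Training, 16% on Validation.
for finding in diagnose(train_error=0.15, val_error=0.16):
    print(finding, "->", SUGGESTIONS[finding])
```

In this example the Training error is already high while the gap to Validation is small, so collecting more data would not address the main problem.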
This is by no means a comprehensive list of ML best practices. However, these five principles point to a high-level practical approach for ML projects: We first define success and split our data carefully to measure progress. We then create a first version of the ML system with the bare minimum data required and test it to find prediction errors. And finally, we analyze those errors and apply different techniques depending on their nature (Bias vs. Variance) to make progress.
Overall, we start from an assumption of ignorance (we don’t know what’s going to work) and optimize for the most precious resource of all: Our time.