Avoiding the Blind Spots of Missing Data with Machine Learning
Written by Sergey Obukhov
Categories: For Devs
3 minute read time
Learning by example

Let's pull in another example. Here's a probability tree from a simple GMAT test question that assumes a sample size of 100 students in a college class:

- If you're male, the probability that you're single is 50 / 70 = 71%.
- If you're female, the probability of being single is 20 / 30 = 67%.
- If the gender is unknown, the probability of being single is (0.7 × 71%) + (0.3 × 67%) = 70%.

The weight is 0.7 because 70 students out of 100 are male, and 0.3 because 30 students out of 100 are female. By breaking the overall probability down into branch probabilities with weights, we consider all possibilities. In general, I think this is a much better way to overcome missing data and teach our model to generalize to future values. Unfortunately, libraries that implement these algorithms rarely support missing values. For example, the scikit-learn library – the de facto machine learning library for Python – requires all values to be numeric. But there are still good libraries, such as Orange, that do support missing values. And as it turns out, the limitation can be overcome.
The power of data imputation

At first, this lack of support for missing values made me both angry and amused. I mean, seriously, why can't the very algorithm whose advantage is built-in support for missing values be used without data imputation?! Come on! But then I realized that the data can be imputed in a way equivalent to the algorithm described:
```python
import random

def impute_gender():
    # Sample from the observed distribution: 70 males, 30 females
    return random.choice(["Male"] * 70 + ["Female"] * 30)
```

And the beauty of data imputation is that it can be applied to any machine learning algorithm, not just decision trees.
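For illustration, here's how such an imputation function might be applied to a dataset before training. The `students` records below are invented for this sketch; `None` marks a missing value:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def impute_gender():
    # Sample from the observed distribution: 70 males, 30 females
    return random.choice(["Male"] * 70 + ["Female"] * 30)

# Hypothetical records; None marks a missing gender
students = [
    {"gender": "Male", "single": True},
    {"gender": None, "single": True},
    {"gender": "Female", "single": False},
]

# Replace every missing gender with a draw from the observed distribution
for student in students:
    if student["gender"] is None:
        student["gender"] = impute_gender()
```

After this pass the dataset contains no missing values, so it can be handed to any library that lacks native support for them, scikit-learn included.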
Lessons Learned

No matter what field you're working in or how good you are at collecting data, missing values are going to come up. Maybe you're working on a credit scoring application. Or maybe you're trying to predict when email recipients are most likely to open their messages, so you can schedule sends accordingly. Real tasks tend to have gaps. There are many different ways to think about a problem like missing values, and depending on your case, the answers will differ. But at the heart of a complex solution often lies a simple idea.
- Wiki article on data imputation
- Some common ways to impute missing values
- Quora article on how decision trees handle missing values
- Orange – commonly used machine learning library in Python that supports missing values
- Testing your model to withstand real-world tasks
Modified on: January 22, 2019