When I started my journey as a data scientist, I focused mostly on learning algorithms, and that was my first mistake. While it's essential to know algorithms, they are not the most important part of a machine learning project; data preparation is. One of the vital parts of data preparation is treating missing values with appropriate methods. Since imputation tries to mimic the data generation process, understanding each technique and how it works is of paramount importance.

I learned this after spending endless hours on projects, but I'm glad I learned it this way, because mistakes are a great way to learn. So this post is dedicated to the data imputation methods one can use for different types of variables.

Below is a quick recap of the types of variables we deal with in a machine learning project, because knowing the variable types will help us understand each method.

Fig. 1 Types of Variables

**1) Complete Case Analysis (CCA):**

This technique assumes that data points are missing completely at random, and it is suitable for both categorical and numerical variables. In this technique, we drop the rows that contain missing values while making sure that the distribution of the affected column remains the same. If the distribution changes after excluding the data, that is a sign we shouldn't be doing it.

**Advantages:**

- This technique is easy to implement and quite fast.
- This technique doesn't require any data manipulation.
- One of the best things about this technique is that it maintains the distribution of the variables, thus, keeping the statistical properties intact to a certain extent.

**Disadvantages:**

- If the data contains too many missing instances, this technique cannot be applied, because dropping them leads to a substantial loss of information.
- This analysis can create a biased dataset, since dropping rows may change the relative frequencies of the remaining categories in categorical data.

**Example:**

Here we have two graphs, one before CCA and the other after CCA. We see that the distribution looks the same, hence we can consider this variable for CCA. Here is the jupyter notebook implementation of this technique.
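A minimal sketch of CCA with pandas; the data frame, column names, and values below are illustrative, not from the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values in both a numerical
# and a categorical column.
df = pd.DataFrame({
    "age":  [25, 30, np.nan, 40, 35, np.nan, 50, 45, 30, 28],
    "city": ["NY", "LA", "NY", None, "SF", "LA", "NY", "SF", None, "LA"],
})

# Complete Case Analysis: keep only the rows with no missing values.
df_cca = df.dropna()

# Compare the distribution before and after dropping, to check
# whether CCA is safe for this variable.
print(df["age"].describe())
print(df_cca["age"].describe())
```

In practice you would also compare the two distributions visually (e.g. with overlaid density plots) before committing to CCA.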

Fig. 3 CCA on a Categorical Variable.

**2) Mean, Median, and Mode Imputation (3M):**

In this technique, we replace missing values with the mean or median of the variable (for numerical data) or with the mode, i.e., the most frequent category (for categorical data).

**Advantages:**

- This technique is straightforward to apply and easy to implement during the model deployment process.
- This technique doesn't require dropping any rows, so the shape of the data frame is preserved.

**Disadvantages:**

- This technique changes the distribution, which in turn changes statistical properties like the covariance and standard deviation.
- The higher the proportion of missing values, the higher the distortion, so this technique is not suitable when the percentage of missing values is high.

**Example:**

Fig. 4 3M Imputation

Above we can see the change in statistical properties like the standard deviation, median, and 75th percentile. Here is the jupyter notebook implementation of this technique.
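As a sketch, 3M imputation with pandas might look like this; the data frame and its values are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [3000.0, 4500.0, np.nan, 5200.0, np.nan, 3800.0],
    "grade":  ["A", "B", None, "A", "A", None],
})

# Numerical variable: fill with the median (more robust to
# outliers than the mean).
df["income"] = df["income"].fillna(df["income"].median())

# Categorical variable: fill with the mode (most frequent category).
df["grade"] = df["grade"].fillna(df["grade"].mode()[0])

print(df)
```

Note that the fill value should be computed on the training set only and reused at prediction time, otherwise information leaks from the test set.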

Fig. 5 Statistical Properties.

**3) End Tail Imputation:**

In this technique, we replace missing values with a value at the far end of the variable's distribution. For a normally distributed variable, a common choice is the mean plus (or minus) three standard deviations; for a skewed variable, the inter-quartile-range proximity rule (e.g., Q3 + 3 × IQR) is used instead.

Fig. 6 Normal Distribution.

**Formula Used.**

Fig. 7 Formula.

Fig. 8 Skewed Distribution.

**Formula Used.**

**Advantages:**

- This technique is easy to implement and integrates well into production pipelines.
- This technique doesn't introduce any outliers in the dataset.
- This technique is quite robust for tree-based algorithms.

**Disadvantages:**

- This technique distorts the original distribution, thereby changing the statistical relationships between variables.
- This technique may change the normality of the distribution. Thus, we avoid using it with linear models.

**Example:**

Fig. 10 End Tail Imputation.

Fig. 11 Statistical Properties
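A sketch of end tail imputation for both cases, on a synthetically generated numerical pandas Series:

```python
import numpy as np
import pandas as pd

# Synthetic roughly-normal variable with some missing values.
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(loc=50, scale=10, size=1000))
s.iloc[::50] = np.nan  # punch 20 holes into the series

# Normal distribution: impute at mean + 3 * std (the far right tail).
end_value = s.mean() + 3 * s.std()
s_normal = s.fillna(end_value)

# Skewed distribution: impute at Q3 + 3 * IQR instead.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
end_value_skewed = q3 + 3 * (q3 - q1)
s_skewed = s.fillna(end_value_skewed)
```

Both statistics (mean/std and the quartiles) ignore NaN by default in pandas, so they can be computed directly on the incomplete series.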

**4) Missing Category Imputation:**

**Advantages:**

- It is easy to implement and easily deployable in production.
- It captures the missingness of the data.
- It is quite robust when the proportion of missing data is high.

**Disadvantages:**

- This technique fails when the number of missing values is small, because then we are adding a rare label that may affect the algorithm's performance.
- This technique increases the cardinality of the variable.

**Example**:

Fig. 12 Missing Tag
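A minimal sketch of missing category imputation with pandas; the column and the "Missing" label are illustrative choices:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", None, "blue", None, "green"]})

# Treat missingness as its own category: replace missing entries
# with an explicit "Missing" label.
df["color"] = df["color"].fillna("Missing")

print(df["color"].value_counts())
```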

**5) Random Sample Imputation:**

In this technique, we replace each missing value with a value drawn at random from the observed values of the same variable, which helps preserve the variable's distribution.

**Advantages:**

- It is easy to implement, and there is no data loss.
- It is deployable in the machine learning pipeline during model deployment.
- It is robust in the case of linear models.

**Disadvantages:**

- If there are lots of missing values, this technique may impact the statistical relationship between the target and the variable.
- It consumes memory during model deployment, since the training data must be stored so that random values can be drawn from it.

**Example:**

Fig. 13 Random Sample Imputation.

Fig. 14 Random Sample Imputation.
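A sketch of random sample imputation with pandas, on a hypothetical column; the seed is fixed only to make the example reproducible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0, np.nan, 29.0, np.nan, 33.0],
})

# Draw as many random observed values as there are missing slots.
missing_mask = df["age"].isna()
sample = (
    df["age"]
    .dropna()
    .sample(n=missing_mask.sum(), replace=True, random_state=42)
)

# Align the sampled values with the positions of the missing entries,
# then fill them in.
sample.index = df.index[missing_mask]
df.loc[missing_mask, "age"] = sample
```

In deployment, the sample would be drawn from the stored training data rather than from the incoming frame, which is exactly the memory cost mentioned above.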

*"Thanks for reading! If you found this helpful, please share it with your friends."*