How to Prepare Data | Machine Learning | Data Science
- October 06, 2020
- By Saurav prasad
You must often have heard the saying: "80% of the time goes into data cleaning and 20% into model building." It is quite right, because algorithms follow one principle: garbage in, garbage out. What does that mean?
It means that if you feed low-quality data to a machine learning algorithm, it will perform poorly. In a nutshell, the performance of your machine learning algorithms depends on the quality of the data you provide. So your path in Data Science should be data-oriented rather than algorithm-oriented. Now you must be wondering: what are the steps to prepare the data for machine learning algorithms? Don't worry!
This post will help you understand the different data-processing techniques that you can implement in your data science projects. Here are the things we are going to cover.
STEP 1: Missing Values Imputation
As the name suggests, missing values are values that are absent from the data, appearing as "NaN" in place of a numerical or string value. The problem is that our machine learning algorithms are not smart enough to process information with missing values, so if you run an algorithm without treating them, you might end up getting errors. It is therefore crucial to treat them with suitable techniques. Here I have talked in depth about these techniques.
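As a minimal sketch, here is how one might impute missing values with pandas (the column names and values are hypothetical; the median and the mode are just two common strategies):

```python
import numpy as np
import pandas as pd

# A toy DataFrame with missing values (hypothetical data)
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["Delhi", "Mumbai", np.nan, "Delhi"],
})

# Numerical column: impute with the median
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: impute with the most frequent value (the mode)
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```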
STEP 2: Categorical Encoding
So what is Categorical Encoding?
In simple terms, it is a technique by which we convert string data into a numerical format. There are various ways to do this, and we will talk about them one by one.
So why is Categorical Encoding necessary?
It is necessary because machine learning algorithms are not good at handling string data. They need data in a numerical format, so our aim should always be to convert the data while retaining the information it carries. So let's talk about the techniques.
1) One Hot Encoding
In this technique, we convert each categorical observation into boolean values, 0 or 1. We introduce a new column for each unique value in the original column, and a 1 in a newly created column indicates that the row held that value in the original column.
So the number of boolean columns is equal to x − 1, where x is the number of values the categorical column can take. Suppose we have a variable that can take three values; then we create two boolean columns, as in the sketch after this explanation.
So why do we leave one value out?
We do this to make sure that there is no linear combination among the newly created columns, because that could lead to multicollinearity.
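Here is a minimal sketch of one-hot encoding with pandas (the column and its values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"height": ["Tall", "Medium", "Short", "Tall"]})

# drop_first=True keeps x - 1 boolean columns to avoid multicollinearity;
# leave it False (the default) to keep all x columns, e.g. for tree-based models
encoded = pd.get_dummies(df, columns=["height"], drop_first=True)
print(encoded)
```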
Advantages:
- This technique is quite robust for both tree-based models and linear models. In the case of tree-based methods, instead of x − 1 columns, we go with all x columns.
- It keeps all the information intact since there is no data loss.
- It is easy to implement using pandas.
Disadvantages:
- The curse of dimensionality: if the cardinality of a column is high, this technique ends up adding many more dimensions to the data.
- This technique doesn't consider ordinal relationships. For example, if the heights are Tall, Medium, and Short, this technique considers them all equal, and we lose valuable information.
2) Ordinal Encoding
In this technique, we replace each categorical value with an integer chosen so that the natural order of the values is preserved.
Advantages:
- It is easy to implement, and there is no data loss.
- It does not increase the dimensionality of the data.
- It modifies the data without losing information. In the above example, we know that Tall is greater than Medium, which in turn is greater than Short. So we can assign values in such an order that this information is retained even after the transformation.
Disadvantages:
- While it retains the information after the transformation, it doesn't add any new information to the data.
- We have to be cautious while using this technique. Suppose we use it on a gender column: it might end up adding unnecessary information, like male being greater than female or vice versa.
- It can't handle a new column value. Suppose the gender column can take three values, namely male, female, and others. If we split the data and no row containing "others" as the gender value lands in the train set, we end up encoding based on male and female only. The model will then not be able to make a prediction for any data point containing "others" as the gender value.
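A minimal sketch with pandas, reusing the height example above:

```python
import pandas as pd

df = pd.DataFrame({"height": ["Tall", "Medium", "Short", "Medium"]})

# Assign integers so that the order Short < Medium < Tall is preserved
height_order = {"Short": 0, "Medium": 1, "Tall": 2}
df["height_encoded"] = df["height"].map(height_order)

# Note: a value missing from the mapping (a "new" category) becomes NaN here,
# which illustrates the last disadvantage above
print(df)
```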
3) Count/Frequency Encoding
In this technique, we replace each categorical value with the number of times it appears in the column, or with its relative frequency.
Advantages:
- This technique is easy to implement and quite robust with tree-based machine learning algorithms.
- This technique doesn't add any dimension to the data, so we don't have to worry about the increase in dimensionality.
Disadvantages:
- If two different categories appear the same number of times, our machine learning algorithms might end up interpreting them as identical, and we lose some information. For example, if in the height column the number of instances of Tall and Short is the same, this technique will assign them the same number, losing the information that Tall and Short are different.
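A minimal sketch with pandas (hypothetical data; note how Tall and Short collide):

```python
import pandas as pd

df = pd.DataFrame({"height": ["Tall", "Short", "Tall", "Medium", "Short"]})

# Replace each category with the number of times it appears...
counts = df["height"].value_counts()
df["height_count"] = df["height"].map(counts)

# ...or with its relative frequency
df["height_freq"] = df["height"].map(counts / len(df))

# Tall and Short both occur twice, so they receive the same value
print(df)
```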
4) Target Mean Encoding
In this technique, we replace each categorical value with the mean of the target variable over the rows that contain that value.
Advantages:
- This technique is straightforward and easy to implement.
- This technique doesn't increase the dimensionality of the data, so one doesn't have to worry about that.
- This technique creates a monotonic relationship between the target and the categorical values.
Disadvantages:
- It introduces the same problem as count/frequency encoding, that is, losing information when two categorical values are assigned the same value.
- Sometimes this technique may lead to over-fitting.
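A minimal sketch with pandas (the column names and target are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai", "Delhi"],
    "target": [1, 0, 1, 0, 1, 0],
})

# Replace each category with the mean of the target for that category.
# In practice, compute the means on the train set only, to limit over-fitting.
city_means = df.groupby("city")["target"].mean()
df["city_encoded"] = df["city"].map(city_means)
print(df)
```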
5) Weight of Evidence (WoE) Encoding
In this technique, we replace each categorical value with its weight of evidence with respect to a binary target: WoE = ln(% of good / % of bad). Take a loan dataset as an example:
- Good customer: a customer who paid back his loan.
- Bad customer: a customer who didn't pay back his loan.
Here:
- If the WoE is positive, that category contains proportionally more examples of good than bad.
- If the WoE is negative, that category contains proportionally more examples of bad than good.
Advantages:
- This technique doesn't increase the dimensionality of the data.
- Since the categorical variables are on the same scale, it is easy to identify which one is more predictive.
- This technique introduces a monotonic relationship between the categorical values and the target.
- When a categorical column has too many values, we can bin them and compute the WoE per bin.
Disadvantages:
- Binning leads to a loss of information, so we may end up discarding some valuable detail.
- This technique may sometimes lead to overfitting, since one can manipulate the binning.
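A minimal sketch with pandas, assuming a hypothetical loan-grade column (1 = good customer, 0 = bad customer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "grade":  ["A", "A", "A", "B", "B", "B", "C", "C"],
    "target": [1, 1, 0, 1, 0, 0, 1, 0],  # 1 = good, 0 = bad
})

# Share of all good (and all bad) customers that fall in each category
good = df[df["target"] == 1].groupby("grade").size() / (df["target"] == 1).sum()
bad  = df[df["target"] == 0].groupby("grade").size() / (df["target"] == 0).sum()

# WoE = ln(% of good / % of bad): positive = more good, negative = more bad
woe = np.log(good / bad)
df["grade_woe"] = df["grade"].map(woe)
print(df)
```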
STEP 3: Variable Transformation
So what happens when we ignore the assumption of normality?
Many machine learning models, especially linear ones, assume that the variables are roughly normally distributed; when a variable is heavily skewed, these models can perform worse than they otherwise would.
So how do we avoid this?
We transform the variables so that their distribution becomes closer to normal. Common transformations include the following.
1) Log transformation
2) Exponential/Power transformation
3) Box-Cox transformation
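Here is a minimal sketch of all three transformations on a hypothetical right-skewed variable:

```python
import numpy as np
from scipy import stats

# A right-skewed variable (hypothetical data); values must be strictly positive
x = np.array([1.0, 2.0, 2.5, 3.0, 50.0, 120.0])

# 1) Log transformation: compresses large values and reduces right skew
x_log = np.log(x)

# 2) Exponential/power transformation: e.g. the square root (power 0.5)
x_sqrt = np.power(x, 0.5)

# 3) Box-Cox transformation: finds the power (lambda) that makes the data
#    most normal-like; it requires strictly positive values
x_boxcox, fitted_lambda = stats.boxcox(x)

print(x_log, x_sqrt, x_boxcox, fitted_lambda)
```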
STEP 4: Outlier Treatment
So why do we care about them?
Because many machine learning models, such as linear regression, are sensitive to extreme values; a handful of outliers can pull the fitted model away from the bulk of the data.
How do we detect outliers?
A common rule is the IQR rule: flag values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 − Q1. Once detected, there are a few ways to treat them.
1) Missing data
Treat the outliers as if they were missing values, and impute them with the techniques from STEP 1.
2) Trimming
Trimming means removing the rows that contain outliers from the dataset. Here is a minimal Python sketch of this technique, using the IQR rule described above (the column name and values are hypothetical):
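```python
import pandas as pd

df = pd.DataFrame({"income": [35, 40, 42, 45, 48, 50, 300]})

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["income"].quantile(0.25), df["income"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Trimming: drop the rows that contain outliers (here, 300 is dropped)
trimmed = df[(df["income"] >= lower) & (df["income"] <= upper)]
print(trimmed)
```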
3) Capping
Capping (also called winsorization) keeps the row but replaces the outlier with the boundary value.
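A minimal sketch, reusing the same hypothetical IQR bounds:

```python
import pandas as pd

df = pd.DataFrame({"income": [35, 40, 42, 45, 48, 50, 300]})

q1, q3 = df["income"].quantile(0.25), df["income"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Capping: replace outliers with the boundary values instead of dropping rows
df["income"] = df["income"].clip(lower=lower, upper=upper)
print(df)
```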
STEP 5: Feature Scaling
What is Feature Scaling?
Feature scaling is the process of bringing all numerical variables onto a comparable scale.
Why do we scale the features?
You must have heard the phrase "comparing apples with oranges." That is what our machine learning algorithms often have to do. For example, suppose you have a height column and a weight column; height and weight are measured in different units. To discover a precise relationship between these two variables, we have to bring them down to the same scale.
Another reason is that some machine learning algorithms are sensitive to the scale of the variables, because high-magnitude variables dominate low-magnitude ones. This results in the algorithm paying more attention to the higher values. Algorithms that compute distances (such as k-nearest neighbors and k-means) or rely on gradient descent are among those influenced by the scale of the variables.
So let's talk about each technique.
1) Standardization
- This technique shifts the mean of the variable, whatever it was, to zero.
- This technique scales the variance of the variable, whatever it was, to one.
- This technique preserves the shape of the distribution and the outliers in the data.
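A minimal sketch of standardization (the z-score) with pandas, on a hypothetical column:

```python
import pandas as pd

df = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0, 190.0]})

# Standardization (z-score): subtract the mean, divide by the standard deviation
df["height_scaled"] = (df["height"] - df["height"].mean()) / df["height"].std()

# The scaled column now has mean ~0 and standard deviation ~1
print(df["height_scaled"].mean(), df["height_scaled"].std())
```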