Handling missing data with scikit-learn's SimpleImputer

When working on data science projects, you will very likely encounter missing data in your columns. Dropping every row that contains a missing value is rarely ideal: the other columns in those rows can carry information that is critical at the data preparation stage. It is usually wiser to infer or otherwise fill in the missing values so that the rest of the row stays usable.

There are many ways to fill in the ‘null’, ‘NaN’ or ‘NA’ entries in a dataset. scikit-learn offers one simple solution with SimpleImputer (formerly Imputer, which was deprecated in version 0.20 and removed in version 0.22 of scikit-learn).
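To give a feel for those options, here is a minimal sketch comparing the `strategy` values that SimpleImputer accepts. The single-column array `col` is made up for illustration and is not data from this post.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data; the fourth entry is missing.
col = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Row 3 held the NaN, so this is the value the imputer chose for it.
    filled[strategy] = float(imputer.fit_transform(col)[3, 0])

# strategy="constant" fills with whatever is passed as fill_value.
constant = SimpleImputer(strategy="constant", fill_value=0.0)
filled["constant"] = float(constant.fit_transform(col)[3, 0])

print(filled)
# {'mean': 3.75, 'median': 2.0, 'most_frequent': 2.0, 'constant': 0.0}
```

The mean is pulled upward by the outlier 10, while the median and most frequent value are not, which is one reason to consider a different strategy when a column is skewed.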

Let’s walk through an example.

Consider an array that we named X.

Original array X
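import numpy as np
from sklearn.impute import SimpleImputer

# Replace every np.nan with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Learn the means of columns 1 and 2, then fill the gaps in place
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])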

Here is the array X after replacing each missing value with the mean of the non-missing values in the same column.

New array X with replaced ‘nan’ values
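For readers who want to reproduce the before and after states themselves, the following is a self-contained sketch of the same steps. The contents of `X` here are invented stand-in values (an ID column followed by two numeric columns with gaps), not the array used in this post, and `fit_transform` is used as a shorthand for calling `fit` and then `transform`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for X: column 0 is an ID,
# columns 1 and 2 are numeric features with one gap each.
X = np.array([
    [1.0, 44.0, 72000.0],
    [2.0, 27.0, 48000.0],
    [3.0, np.nan, 54000.0],
    [4.0, 38.0, np.nan],
    [5.0, 35.0, 58000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Fit on columns 1 and 2 only, and write the filled values back in place.
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])

# statistics_ holds the per-column fill values learned during fit.
print(imputer.statistics_)
print(X)
```

With these stand-in values the learned means are 36 for column 1 and 58000 for column 2, so those are the numbers that end up in rows 3 and 4. Column 0 is left untouched because it was never passed to the imputer.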
