Handling Categorical Features with scikit-learn

After dealing with missing data in your dataset, you will most likely face categorical features. In most cases these features are non-numerical and need to be converted before a machine learning algorithm can process them.

I’ll cut to the chase here and explain a quick way to do this with the latest version of scikit-learn.

We have the following array X:

Original Array X

The column Role contains categorical values that we need to convert to floats so our model can process them afterwards.
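The original screenshot of X isn’t reproduced here, so as a stand-in, assume a small array like this (the column names and values are made up; the point is that Role sits at index 1):

```python
import numpy as np

# Hypothetical X: Age, Role, Salary -- the categorical Role column is at index 1.
X = np.array([
    [25, "Analyst", 60000],
    [32, "Designer", 65000],
    [41, "Engineer", 70000],
    [38, "Manager", 90000],
], dtype=object)
```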

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

# Select the column to encode by its index (Role is at index 1)
preprocessor = make_column_transformer((OneHotEncoder(), [1]), remainder="passthrough")
X = preprocessor.fit_transform(X)

Here’s the new array X with the categorical feature ‘Role’ converted into 4 columns, one per category. Each column contains 1 where the original Role column had that category and 0 elsewhere.

New Array with Categorical Feature Role Converted into multiple binary columns
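To check which category each new column corresponds to, you can inspect the fitted encoder after the transform. A minimal sketch, assuming a small hypothetical X (the screenshot isn’t reproduced here) with Role at index 1:

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for X: Age, Role, Salary (Role at index 1).
X = np.array([
    [25, "Analyst", 60000],
    [32, "Designer", 65000],
    [41, "Engineer", 70000],
    [38, "Manager", 90000],
], dtype=object)

preprocessor = make_column_transformer((OneHotEncoder(), [1]),
                                       remainder="passthrough")
X_new = preprocessor.fit_transform(X)

# Depending on sparse_threshold, the result may be a sparse matrix; densify it.
if hasattr(X_new, "toarray"):
    X_new = X_new.toarray()

# The fitted encoder records the categories, in the order of the new columns.
roles = preprocessor.named_transformers_["onehotencoder"].categories_[0]
print(roles)        # categories in alphabetical order
print(X_new.shape)  # 4 one-hot columns + 2 passthrough columns
```

The one-hot columns come first, in the alphabetical order of the Role values, followed by the passthrough columns.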

In the case of a single-column array with a binary outcome, such as ‘yes’ or ‘no’ values, the LabelEncoder is sufficient.

This is ‘y’, a single-column dataset:

Single column Categorical array ‘y’ containing ‘yes’ or ‘no’ values
from sklearn.preprocessing import LabelEncoder

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

Here’s the result after applying the LabelEncoder:

‘y’ array having 1 for ‘yes’ and 0 for ‘no’ now
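As a runnable sketch (the original ‘y’ isn’t shown here, so the values below are made up):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical single-column target with binary categories.
y = np.array(["yes", "no", "yes", "no", "yes"])

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

# LabelEncoder sorts the classes alphabetically, so 'no' -> 0 and 'yes' -> 1.
print(labelencoder_y.classes_)  # ['no' 'yes']
print(y)                        # [1 0 1 0 1]
```

If you need the original labels back later, `labelencoder_y.inverse_transform(y)` recovers them.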
