After dealing with missing data, the next hurdle in many datasets is categorical features. In most cases these features are non-numerical, so they need to be converted before machine learning algorithms can process them.
I’ll cut to the chase and show a quick way to do this with a recent version of scikit-learn.
We have the following array X:
The column Role contains categorical values that we need to convert to floats so our model can process them afterwards.
```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

# We select the column that we want to process by its index (1)
preprocessor = make_column_transformer(
    (OneHotEncoder(), [1]), remainder="passthrough")
X = preprocessor.fit_transform(X)
```
Here’s the new array X with the categorical feature ‘Role’ expanded into four columns. Each column corresponds to one Role category and contains 1 where the original Role column had that category and 0 elsewhere.
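Since the original array isn’t reproduced here, a minimal self-contained sketch may help. The data below is hypothetical (age, Role, salary), with Role at index 1 as in the snippet above:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

# Hypothetical data: age in column 0, Role in column 1, salary in column 2
X = np.array([
    [25, "Engineer", 50000],
    [32, "Manager",  70000],
    [41, "Analyst",  60000],
    [29, "Intern",   30000],
], dtype=object)

preprocessor = make_column_transformer(
    (OneHotEncoder(), [1]), remainder="passthrough")
X_encoded = preprocessor.fit_transform(X)

# Four distinct Role categories become four one-hot columns,
# followed by the passed-through columns (age, salary): shape (4, 6)
print(X_encoded.shape)
```

Note that `remainder="passthrough"` keeps the untransformed columns; by default they would be dropped.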
In the case of a single-column array with a binary situation, such as ‘yes’ or ‘no’ values, using the LabelEncoder is adequate and sufficient.
This is ‘y’, a single-column dataset:
```python
from sklearn.preprocessing import LabelEncoder

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
```
Here’s the result after applying the LabelEncoder:
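Again, the original ‘y’ isn’t shown here, so the values below are a hypothetical stand-in to illustrate the behavior:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary target
y = ["yes", "no", "no", "yes", "yes"]

labelencoder_y = LabelEncoder()
y_encoded = labelencoder_y.fit_transform(y)

# Categories are sorted alphabetically: 'no' -> 0, 'yes' -> 1
print(list(y_encoded))               # [1, 0, 0, 1, 1]
print(list(labelencoder_y.classes_)) # ['no', 'yes']
```

One caveat: LabelEncoder is meant for target labels (`y`), not input features; for binary input columns, OneHotEncoder or OrdinalEncoder is the intended tool.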