Handling Categorical Variables

How do you train ML or AI algorithms when the dataset contains categorical variables? Algorithms only understand numerical values.

That's where techniques like Label Encoding and One-Hot Encoding come in. I find the concept cool.

Here is some dummy data for illustration.

data = ['Red', 'Blue', 'Yellow']

These values need to be converted into a format that algorithms can understand. Note that there are three categories.

One-Hot Encoding

One-hot encoding is used to represent categorical variables as binary vectors.

It creates a binary column for each category and indicates the presence of that category with a 1 or 0.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = np.array(data).reshape(-1, 1) # OneHotEncoder expects 2D array

one_hot = OneHotEncoder(sparse_output=False) # sparse_output (formerly sparse) returns a dense array
one_hot.fit(data)

output = one_hot.transform(data) 
print(output) # [[0. 1. 0.]
              #  [1. 0. 0.]
              #  [0. 0. 1.]]

In the above output, it can be seen that there are three columns and three rows.

The three columns represent the three categories ('Blue', 'Red', 'Yellow', sorted alphabetically). The three rows represent the three data elements, i.e. the length of the data.
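If you work with DataFrames, pandas offers a shortcut for the same idea. This is a small sketch assuming pandas is available; the column name 'color' is just made up for the example.

```python
import pandas as pd

# Same three colours as before, in a DataFrame
df = pd.DataFrame({'color': ['Red', 'Blue', 'Yellow']})

# get_dummies builds one binary column per category
encoded = pd.get_dummies(df, columns=['color'], dtype=int)
print(encoded)
#    color_Blue  color_Red  color_Yellow
# 0           0          1             0
# 1           1          0             0
# 2           0          0             1
```

As with OneHotEncoder, the categories end up in alphabetical order.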

Label Encoding

Label encoding is used to convert categorical labels into numerical format.

In this process, each unique category is assigned an integer value.

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(data.ravel()) # LabelEncoder expects a 1D array, so flatten the reshaped data

print(label_encoder.classes_) # ['Blue', 'Red', 'Yellow']

outputs = label_encoder.transform(data.ravel())
print(outputs) # [1 0 2]

Now the categorical data has been converted into numerical values that the algorithm can understand.
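Both encoders also work in reverse, which is handy for turning a model's integer predictions back into labels. A minimal sketch with LabelEncoder:

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(['Red', 'Blue', 'Yellow'])

# Map integer predictions back to the original category names
decoded = label_encoder.inverse_transform([1, 0, 2])
print(decoded) # ['Red' 'Blue' 'Yellow']
```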

Question for you: which loss function would you use with One-Hot Encoding, and which with Label Encoding, when training a neural network?