Working with real-world data is not easy: it comes in diverse types, with widely varying values and a lot of untidy entries. Many algorithms accept only certain types of data, so if the data is not in the required form, we first have to convert it before feeding it to a machine learning algorithm.
The problem becomes more critical when the target class is not defined in the data and we have to cluster the data based on the distance between data points, because a distance can only be computed on numbers. It is therefore essential to encode categorical features as numerical values.
Methods to encode categorical data:
- Label Encoder: This is the simplest technique. Each categorical value is mapped to a numerical value. For example, if Gender has two values, male and female, then male is given the value '0' and female the value '1'.
- Another example is an age column given as ranges rather than single numbers. We can convert this column into a numeric or decimal format using a mathematical expression. In this case, each age range is replaced with the mean of its bounds.
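- One Hot Encoding: In this technique, a categorical input variable is transformed into a set of binary indicator variables. One new dummy column is created for each level of the categorical variable. If the level is present in a row, the value is 1, otherwise 0. For example, a color column with the levels red, green, and blue becomes three dummy columns, and a row whose color is red has 1 in the red column and 0 in the other two.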
- Binary Encoding: This is a hybrid of one hot encoding and hashing. It also creates dummy columns to store values for the levels of a categorical variable, but it needs fewer columns than one hot encoding. For example, if the data set has a color column with 4 levels, then rather than creating 4 dummy columns we need only 2, because the four levels can be written as the two-digit binary codes 00, 01, 10, and 11.
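The four techniques above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the sample values, and the `low-high` text format assumed for age ranges are all hypothetical choices made for this example. Library encoders may assign codes in a different order or use a different number of columns.

```python
def label_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return [mapping[value] for value in values], mapping


def range_to_mean(age_range):
    """Replace a range written as 'low-high' (assumed format) with the mean of its bounds."""
    low, high = (float(part) for part in age_range.split("-"))
    return (low + high) / 2


def one_hot_encode(values):
    """Create one 0/1 dummy column per level; levels are sorted for a stable column order."""
    levels = sorted(set(values))
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return rows, levels


def binary_encode(values):
    """Label-encode each level, then spread the binary digits of the label over columns."""
    codes, mapping = label_encode(values)
    # Number of bits needed to write the largest label (at least one column).
    n_bits = max(1, (len(mapping) - 1).bit_length())
    rows = [[(code >> bit) & 1 for bit in reversed(range(n_bits))] for code in codes]
    return rows, mapping


# Hypothetical sample data, chosen to mirror the examples in the text.
genders = ["male", "female", "female", "male"]
gender_codes, gender_mapping = label_encode(genders)

age_ranges = ["20-30", "30-40"]
age_means = [range_to_mean(age_range) for age_range in age_ranges]

colors = ["red", "green", "blue", "yellow"]
one_hot_rows, one_hot_levels = one_hot_encode(colors)
binary_rows, color_mapping = binary_encode(colors)

print(gender_codes)      # labels for the gender column
print(age_means)         # mean of each age range
print(one_hot_levels)    # dummy column names
print(one_hot_rows)      # one row of 0/1 flags per input value
print(binary_rows)       # two binary columns for four color levels
```

With four color levels, `one_hot_encode` produces four columns per row while `binary_encode` produces two, which is the saving the text describes. In practice a library encoder would normally be used instead of hand-written functions.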