Why it is better than a single Decision Tree, and why you should use it
Often we want to differentiate between a number of items: vegetables and fruits, malignant or benign cancer, fake or real news, different classes of images, etc. These are all classification problems that the Random Forest algorithm can be applied to.
Classification algorithms are simply used to categorize data into different classes based on certain inputs or variables. There are many classification algorithms, but our focus today is the Random Forest algorithm.
The name of the algorithm is a metaphor for a forest. Just as a forest consists of trees, the Random Forest algorithm also contains trees, specifically decision trees.
I think the difference is that an actual forest can't make decisions; it's just green.
What Really Is the RF Algorithm?
Random Forest is an ensemble learning method that combines the output of multiple decision trees to make a final prediction. Combining the outputs of many trees gives it robustness to noise, high accuracy, and the ability to handle large datasets with high dimensionality (high-dimensionality datasets have a large number of features compared to the number of observations). In other words, they have more columns than rows.
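Before going deeper, here is a minimal sketch of what training a Random Forest looks like with scikit-learn. The dataset and the hyperparameter values are illustrative choices, not recommendations.

```python
# A minimal sketch of training a Random Forest classifier with scikit-learn.
# The dataset and hyperparameter values here are illustrative choices only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # 569 rows, 30 features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 decision trees; each one votes on the final class
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
```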
What Is a Decision Tree?
It is simply a model for representing a series of decisions and their possible consequences. A decision tree takes an input and passes it through a condition to evaluate whether it passes or fails. If it passes, the input moves down to the next condition, if there is one, and so on until there are no more conditions to check. The result at the end is the output.
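To make the idea of a chain of conditions concrete, here is a small sketch that fits a single decision tree with scikit-learn and prints the conditions it learned. The iris dataset and the depth of 2 are assumptions made purely for illustration.

```python
# A small sketch of a single decision tree learning a series of conditions.
# The iris dataset and the depth of 2 are illustrative choices only.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each learned split checks one feature against a threshold (pass or fail)
print(export_text(tree, feature_names=iris.feature_names))
```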
You should also note that each tree in a random forest is built independently of the others, which means each decision tree in a random forest can have a different set of conditions for evaluation.
Scenario
Say we want to create a male basketball team from a group of students, made up of Africans who are at least 6'1 or any other person with a height >= 6'5. Using the tree, we select the players who qualify as basketball players based on these conditions.
The decision tree outputs whether a player qualifies to be part of our team or not.
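Written out as plain code, that decision tree might look like the sketch below. The field names, the inch thresholds (interpreting 6'1 as 73 inches and 6'5 as 77 inches), and the example record (a player we'll call Bobby, who reappears shortly) are hypothetical illustrations of the scenario.

```python
# A hand-written sketch of the scenario's decision tree. Field names, the Bobby
# record, and the inch thresholds (6'1 = 73, 6'5 = 77) are hypothetical.
def qualifies_for_team(player: dict) -> str:
    if player["gender"] != "male":          # condition 1: male-only team
        return "Not Qualify"
    if player["nationality"] == "African" and player["height_in"] >= 73:
        return "Qualify"                    # condition 2: African and at least 6'1
    if player["height_in"] >= 77:           # condition 3: anyone else at least 6'5
        return "Qualify"
    return "Not Qualify"

bobby = {"gender": "male", "nationality": "African", "height_in": 74}
print(qualifies_for_team(bobby))  # Qualify
```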
Now let’s build a random forest classifier that uses the above decision tree to select players.
Bobby is a player (data point) in the dataset. This data point is evaluated by a random forest with 3 instances of a decision tree.
Since the majority vote among the decision trees determines the final prediction, Bobby qualifies as a basketball player for our team, with 2/3 Qualify and 1/3 Not Qualify.
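The vote-counting step itself is simple. A minimal sketch, with the three tree outputs for Bobby hard-coded:

```python
# A sketch of the majority vote, with three hypothetical tree outputs for Bobby
from collections import Counter

tree_votes = ["Qualify", "Qualify", "Not Qualify"]   # 2/3 vs 1/3
final_prediction, votes = Counter(tree_votes).most_common(1)[0]
print(f"Final prediction: {final_prediction} ({votes}/{len(tree_votes)} votes)")
```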
Why Is Random Forest More Effective than Decision Trees?
A single decision tree is prone to Bias
A single decision tree may not be able to learn the complexities of the data, causing it to skew its predictions in a particular direction.
Robustness to outliers and missing data
Decision trees are very sensitive to outliers and missing data, leading to poor predictions. Averaging the votes of many trees dampens the influence of any single problematic observation.
Decreased Overfitting
Unlike Random Forest, DTs are prone to overfitting, which means they perform very well on the training data but poorly on the test data. Random Forest takes care of that by aggregating the predictions of multiple decision trees, reducing the variance of the model.
The variance of a model is the change in the model's predictions if it were trained on a different dataset. It is simply a measure of how sensitive the model is to the randomness in the training data.
A high variance implies that the model is very sensitive to the specific data used for training, and hence may not generalize well to unseen data. This happens when the model fails to learn the underlying patterns in the data and instead fits the noise in the training data.
On the other hand, a low variance means that the model is less sensitive to the specific data used for training and is more likely to generalize well to new data. This is because the model has learned the underlying patterns in the data rather than the noise.
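One way to see this in practice is to compare train and test accuracy for a single unpruned decision tree against a Random Forest. A rough sketch, assuming the scikit-learn breast cancer dataset as an arbitrary example (exact scores will vary):

```python
# A sketch comparing a single unpruned decision tree with a Random Forest.
# The dataset is an arbitrary choice and the exact scores will vary.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("Decision tree", DecisionTreeClassifier(random_state=0)),
                    ("Random forest", RandomForestClassifier(random_state=0))]:
    model.fit(X_train, y_train)
    print(name,
          "train:", round(model.score(X_train, y_train), 3),
          "test:", round(model.score(X_test, y_test), 3))
```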
Handles categorical features better than DTs
Random Forest can handle categorical features directly without the need for one-hot encoding, which is typically required for DTs. One-hot encoding may not be very practical for large datasets, as it can lead to a big increase in the number of features in the dataset.
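To illustrate the column blow-up that one-hot encoding can cause, here is a small sketch with a hypothetical categorical column:

```python
# A sketch of how one-hot encoding inflates the feature count.
# The "city" column and its values are hypothetical.
import pandas as pd

df = pd.DataFrame({"city": ["Accra", "Lagos", "Nairobi", "Accra"],
                   "height_in": [73, 77, 70, 75]})
encoded = pd.get_dummies(df, columns=["city"])
print(df.shape[1], "columns before,", encoded.shape[1], "columns after")
# A column with hundreds of unique categories would add hundreds of new columns.
```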
What Is the Randomness in Random Forest?
1. Each tree gets a random sample of observations (row values) with replacement
This means each tree is trained on a different sample of the data. The number of observations in each random sample is the same as the number of observations in the original dataset.
Because the sampling is done with replacement, some observations are likely to appear more than once in a sample. Drawing these samples and combining the trees trained on them is called Bootstrap Aggregation, or Bagging.
Because each bootstrap sample is slightly different from the others, each decision tree will be slightly different.
Example
Assume you have a dataset with 500 observations (data points). Each tree will then be trained on a random sample of 500 observations drawn with replacement (so duplicates can occur), and every observation has an equal chance of being selected at each draw.
So if there is an observation in the dataset called Jacob, the random sampling means there is a real chance that some samples will not contain Jacob at all.
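In fact, the chance that a specific observation like Jacob is left out of a single bootstrap sample is roughly (1 - 1/500)^500, which is about 37%. A small simulation sketch (the row index used for Jacob is a made-up assumption):

```python
# A small simulation: how often is a specific observation ("Jacob", row 0 here,
# a hypothetical index) missing from a bootstrap sample of 500 rows?
import numpy as np

rng = np.random.default_rng(42)
n, trials, missing = 500, 10_000, 0
for _ in range(trials):
    sample = rng.integers(0, n, size=n)   # 500 row indices drawn with replacement
    if 0 not in sample:                   # Jacob did not make it into this sample
        missing += 1

print("Fraction of samples missing Jacob:", missing / trials)
print("Theoretical value:", (1 - 1 / n) ** n)   # roughly 0.37
```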
2. Each tree gets all the features (column values), but at each node only a subset of features is available
This was not always the case. As originally proposed by Tin Kam Ho, each tree received only a small random subset of dimensions from the feature space. That has changed with time, and Leo Breiman's formulation is now the industry standard: each tree gets the full set of features, but at each node only a random subset of features is considered, which helps prevent overfitting and improves the generalizability of the model.
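In scikit-learn, both sources of randomness show up as parameters of RandomForestClassifier, with the node-level feature subset controlled by max_features. A quick sketch with illustrative values:

```python
# A sketch of where both sources of randomness appear in scikit-learn's
# RandomForestClassifier; parameter values are illustrative, not recommendations.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    bootstrap=True,       # each tree trains on a bootstrap sample of the rows
    max_features="sqrt",  # each split considers only a random subset of the features
    random_state=42,
)
```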
Bootstrap Aggregation helps reduce overfitting and improves the stability and accuracy of the Random Forest model. By building each tree on a slightly different sample of the dataset, the RF is able to learn more of the patterns and relationships in the data, while also reducing the effect of noisy and irrelevant features.
CONCLUSION
The Random Forest classifier is a very robust supervised classification algorithm with considerably good accuracy, thanks to its ability to handle high-dimensional data and noisy datasets, its resistance to overfitting, and its ability to capture complex interactions between features.
It is important to note that the Random Forest Classifier is different from the Random Forest Regressor: the Classifier predicts discrete classes, while the Regressor predicts continuous values. Additionally, the interpretability of Random Forests (for example, through feature importances) has made them a popular choice in many industries where understanding how the model arrives at its predictions is crucial.
In our next algorithm review, we will look at how the Random Forest algorithm is implemented using a library like scikit-learn. Random Forest pretty much outperforms most standard ensemble learning algorithms in terms of accuracy. Well, not XGBoost.
There Are Beautiful Things In The Forest!!!
I hope you have become more knowledgeable than you were before reading this.