A decision tree is a supervised learning technique that can be used for both regression and classification problems. It is a flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after evaluating the attributes along the path). The paths from root to leaf represent classification rules.
A decision tree mimics the way humans reason when making a decision. To predict the class of a given record, the algorithm starts from the root node of the tree, compares the value of the root attribute with the record's attribute value and, based on the comparison, follows the corresponding branch and jumps to the next node. This process repeats until a leaf node is reached.
Advantages
Simple to understand and interpret, since the tree mirrors human decision making.
Requires little data preparation and handles both numerical and categorical attributes.
Disadvantages
Prone to overfitting, especially when the tree is allowed to grow deep.
Small changes in the data can produce a very different tree.
Attribute Selection Criterion
In order to implement a decision tree, we have to repeatedly select the attribute that will act as our decision node. The two popular techniques, defined just below, are:
Gini index
Information gain
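For reference, these are the standard definitions of the two criteria. The notation here is mine: p_i is the proportion of class i at node S, and S_v is the subset of S where attribute A takes value v.

```latex
\text{Gini}(S) = 1 - \sum_{i=1}^{C} p_i^{2}
\qquad
\text{IG}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v),
\quad \text{Entropy}(S) = -\sum_{i=1}^{C} p_i \log_2 p_i
```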
We will explore the IRIS dataset, perhaps the best-known dataset in the pattern recognition literature. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are not linearly separable from each other.
In this article, we will implement a decision tree using the Gini index as our attribute selection criterion. Rather than using a greedy algorithm over all possible splits, we will visualize density plots of the features and pick the split points from them, as sketched below.
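A minimal sketch of this setup, assuming the dataset is loaded through scikit-learn and that pandas and matplotlib are available. The column names, the 80/20 split and the random_state are my own choices, not fixed by the article.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the IRIS dataset into a DataFrame with short column names.
iris = load_iris()
df = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width",
                                      "petal_length", "petal_width"])
df["target"] = iris.target

# Hold out a test set; the split size and random_state are assumptions.
train, test = train_test_split(df, test_size=0.2, random_state=42)

# Density plot of each feature, coloured by class, to eyeball split points.
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, col in zip(axes.ravel(), ["sepal_length", "sepal_width",
                                  "petal_length", "petal_width"]):
    for cls in sorted(train["target"].unique()):
        train.loc[train["target"] == cls, col].plot.kde(ax=ax, label=f"class {cls}")
    ax.set_title(col)
    ax.legend()
plt.tight_layout()
plt.show()
```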
Here we will find the Gini index for the root node using the training data. The training data contains three unique classes, 0, 1 and 2, with counts of 40, 41 and 39 respectively (120 samples in total).
The Gini index for a node is given by Gini = 1 − Σᵢ pᵢ², where pᵢ is the proportion of class i at that node. For the root node this is 1 − [(40/120)² + (41/120)² + (39/120)²] ≈ 0.66652.
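A small sketch of that calculation in Python (the gini_index helper is my own name, not from the article):

```python
def gini_index(counts):
    """Gini impurity from a list of class counts: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Class counts observed in the training data (classes 0, 1, 2).
root_counts = [40, 41, 39]
print(gini_index(root_counts))  # ~0.6665
```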
Observing the density plots above, petal_length and petal_width give us a good separation between the classes.
Here we will compute the Gini index for the candidate split points on petal_length and petal_width, each splitting the data into a True and a False branch.
If we observe the Gini index values for both attributes, both give us the same value. So we will go ahead with the split point petal_length <= 2.45, as sketched below.
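One way to compare candidate splits is to weight the Gini index of each branch by its size. A sketch of that comparison follows; the gini and split_gini helpers are my own, and the petal_width threshold of 0.8 is an assumption (the article only states that both attributes give the same value).

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def gini(labels):
    """Gini impurity of a Series of class labels."""
    p = labels.value_counts(normalize=True)
    return 1.0 - (p ** 2).sum()

def split_gini(df, column, threshold):
    """Size-weighted Gini impurity of splitting df on column <= threshold."""
    left = df[df[column] <= threshold]
    right = df[df[column] > threshold]
    n = len(df)
    return (len(left) / n) * gini(left["target"]) + (len(right) / n) * gini(right["target"])

iris = load_iris()
df = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width",
                                      "petal_length", "petal_width"])
df["target"] = iris.target
train, _ = train_test_split(df, test_size=0.2, random_state=42)

# Compare the two candidate splits suggested by the density plots.
print(split_gini(train, "petal_length", 2.45))
print(split_gini(train, "petal_width", 0.8))  # assumed petal_width threshold
```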
Observing the findings from the petal_length attribute, the branch with petal_length <= 2.45 contains only class 0, so it becomes a pure leaf node.
So, we will only continue working with the data satisfying the condition petal_length > 2.45.
We then perform the same procedure again to find the next split point using the Gini index, this time on the remaining subset; a sketch of that scan follows.
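This sketch re-uses the same helpers as above and scans candidate petal_width thresholds on the petal_length > 2.45 subset; the choice of midpoints between sorted unique values as candidates is my own, and the best threshold found may differ slightly from the article's 1.6 depending on the random split.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def gini(labels):
    p = labels.value_counts(normalize=True)
    return 1.0 - (p ** 2).sum()

def split_gini(df, column, threshold):
    left, right = df[df[column] <= threshold], df[df[column] > threshold]
    n = len(df)
    return (len(left) / n) * gini(left["target"]) + (len(right) / n) * gini(right["target"])

iris = load_iris()
df = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width",
                                      "petal_length", "petal_width"])
df["target"] = iris.target
train, _ = train_test_split(df, test_size=0.2, random_state=42)

# Keep only the impure branch from the first split.
subset = train[train["petal_length"] > 2.45]

# Candidate thresholds: midpoints between sorted unique petal_width values.
values = np.sort(subset["petal_width"].unique())
candidates = (values[:-1] + values[1:]) / 2
best = min(candidates, key=lambda t: split_gini(subset, "petal_width", t))
print(best, split_gini(subset, "petal_width", best))
```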
The observations obtained are: for petal_width <= 1.6, the probability of class 1 is higher, so for this condition we predict class 1; for petal_width > 1.6, the probability of class 2 is higher, so for this condition we predict class 2.
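Putting the two splits together gives the full hand-built tree. A minimal sketch of it as a plain Python function (the name predict_iris is my own):

```python
def predict_iris(petal_length, petal_width):
    """Hand-built decision tree from the splits found above."""
    if petal_length <= 2.45:
        return 0   # pure branch: class 0
    if petal_width <= 1.6:
        return 1   # mostly class 1
    return 2       # mostly class 2
```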
We will evaluate the accuracy of our tree on the held-out test data.
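A sketch of that evaluation, applying the same two rules to the test set. The split size and random_state are assumptions, so the exact accuracy may differ slightly from the article's 94%.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
df = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width",
                                      "petal_length", "petal_width"])
df["target"] = iris.target
_, test = train_test_split(df, test_size=0.2, random_state=42)

# Apply the two hand-picked rules to every test row.
pred = np.where(test["petal_length"] <= 2.45, 0,
                np.where(test["petal_width"] <= 1.6, 1, 2))
accuracy = (pred == test["target"]).mean()
print(f"Accuracy: {accuracy:.2%}")
```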
Our tree is 94% accurate. We can also experiment with other tree structures and find the one that gives a more accurate result.