Classification and Regression Tree using Python

A Decision Tree is a supervised learning technique that can be used for both regression and classification problems. It is a flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after evaluating the attributes along the path). The paths from root to leaf represent classification rules.


A decision tree mimics human decision-making. To predict the class of a given sample, the algorithm starts from the root node of the tree, compares the value of the root attribute with the corresponding attribute of the sample and, based on the comparison, follows the matching branch and jumps to the next node. This is repeated until a leaf node is reached.

Advantages

  • Mimics the human thinking ability while making a decision.
  • No need to scale the data.

Disadvantages

  • The deeper the tree, the more complex its structure becomes.
  • Decision trees are prone to overfitting, which can be mitigated by using Random Forests.

Attribute Selection Criterion

In order to build a decision tree, we have to repeatedly select the attribute that acts as the decision node at each level. The two popular techniques are:

  • Gini index
  • Information gain

Gini index

  • The Gini index or Gini coefficient is a statistical measure of the inequality in a distribution.
  • It is used as a measure of impurity while building a decision tree.
  • An attribute with a low Gini index is preferred over one with a high value.
  • Gini index = 1 - Σ (Pi)², where Pi is the probability of occurrence of class i within the node.

Information gain

  • It measures the reduction in entropy after the dataset is split on an attribute.
  • The entropy of the parent node and of the child nodes is compared to evaluate a split.
  • The attribute giving the highest information gain is chosen for the split.
  • Information gain = entropy of parent node - [weighted entropy of child node (left) + weighted entropy of child node (right)], where each child's entropy is weighted by the fraction of parent samples it receives.
  • Entropy is given by,
    Entropy = -Σ Pi log2(Pi), where Pi is the probability of occurrence of class i within the node (a small code sketch of both measures follows after this list).
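
Neither measure is hard to compute by hand; a minimal sketch of both (the helper names gini and entropy are ours, and the gini helper is reused later in the walkthrough):

    import numpy as np

    def gini(counts):
        """Gini index of a node, given the class counts inside that node."""
        p = np.asarray(counts, dtype=float)
        p = p / p.sum()
        return 1.0 - np.sum(p ** 2)

    def entropy(counts):
        """Entropy of a node, given the class counts inside that node."""
        p = np.asarray(counts, dtype=float)
        p = p / p.sum()
        p = p[p > 0]                    # skip empty classes to avoid log2(0)
        return -np.sum(p * np.log2(p))

    # A pure node has Gini 0; an evenly mixed node has the maximum impurity
    print(gini([50, 0, 0]), gini([50, 50, 50]))   # 0.0 and ~0.667
    print(entropy([50, 50, 50]))                  # ~1.585, i.e. log2(3)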

We will explore the IRIS dataset, which is the best known database to be found in the pattern recognition literature. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

In this article, we will implement a decision tree using the Gini index as our attribute selection criterion. Rather than using the standard greedy algorithm, we will visualize density plots and try to find the split points ourselves.

Load Libraries and IRIS dataset

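The original code cells were embedded as images and are not preserved; a minimal reconstruction of the setup, assuming the seaborn copy of the Iris dataset (its petal_length/petal_width column names match those used later) with the species encoded as the integer labels 0, 1, 2:

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Iris dataset bundled with seaborn: sepal_length, sepal_width,
    # petal_length, petal_width and the species name
    iris = sns.load_dataset('iris')

    # Encode the species as integer class labels:
    # 0 = setosa, 1 = versicolor, 2 = virginica
    iris['species'] = iris['species'].astype('category').cat.codes
    print(iris.head())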

Train/Test split

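A sketch of an 80/20 split; the class counts of 40/41/39 quoted below come from one such random split, and the random_state is our assumption:

    from sklearn.model_selection import train_test_split

    # 120 training rows and 30 test rows out of the 150 samples
    train, test = train_test_split(iris, test_size=0.2, random_state=1)
    print(train.shape, test.shape)   # (120, 5) (30, 5)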

Root Node

Here we will find the Gini index for the root node using the training data. First, let's look at what the training data consists of.

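A quick way to inspect the class distribution of the training set (the counts shown as comments are the ones reported in the article):

    # Number of training samples in each class
    print(train['species'].value_counts())
    # 1    41
    # 0    40
    # 2    39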

The training data has three unique classes, i.e. 0, 1 and 2, and their counts are 40, 41 and 39 respectively.

The Gini index for the root node is given by,

Gini(root) = 1 - [(40/120)² + (41/120)² + (39/120)²] ≈ 0.6665

The Gini index for the root node is 0.66652.
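
With the gini helper sketched earlier, this value can be checked directly:

    # Gini index of the root node from the training-set class counts
    print(round(gini([40, 41, 39]), 4))   # ~0.6665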

Visualize Training Data

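The plots from the original post are not reproduced here; a sketch of how per-class density plots can be drawn for each feature, continuing from the setup above:

    # One density plot per feature, one curve per class, to eyeball which
    # feature separates the three classes best
    features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    for ax, col in zip(axes.ravel(), features):
        for cls in sorted(train['species'].unique()):
            sns.kdeplot(train.loc[train['species'] == cls, col],
                        ax=ax, fill=True, label=f'class {cls}')
        ax.set_title(col)
        ax.legend()
    plt.tight_layout()
    plt.show()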

If we observe the plots above, petal_length and petal_width give us a good separation between the classes.

Split Point for Root Node

Here we will:

  • Group the training data and find the minimum and maximum value of each attribute for each class.
  • Find the best split point from the petal_length and petal_width attributes.
  • Compare the Gini index for both attributes.

Check split using petal length

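The exact code is not preserved; a sketch of how a candidate split can be scored with the gini helper from earlier (split_gini is our own name, not the article's):

    def split_gini(df, column, threshold):
        """Score the split column <= threshold by the Gini of its two children."""
        left = df[df[column] <= threshold]
        right = df[df[column] > threshold]
        g_left = gini(left['species'].value_counts())
        g_right = gini(right['species'].value_counts())
        print(f'{column} <= {threshold}: left Gini {g_left:.5f} ({len(left)} rows), '
              f'right Gini {g_right:.5f} ({len(right)} rows)')
        # Weighted average of the two child impurities
        return (len(left) * g_left + len(right) * g_right) / len(df)

    # Candidate split point read off the per-class minima/maxima
    split_gini(train, 'petal_length', 2.45)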

Check split using petal width

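The same helper works for petal_width. The article's exact threshold is not shown in the text; the value 0.8 below is our guess at a point between the setosa maximum and the versicolor minimum petal width:

    # Candidate split for petal_width at the root node (threshold assumed)
    split_gini(train, 'petal_width', 0.8)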

If we observe the Gini index values for both attributes when splitting the data into True and False branches, both give us the same values. So, we will go ahead with the split point petal_length <= 2.45.

Petal length with split point = 2.45

Observing the findings for the petal_length attribute:

  • Petal_length <= 2.45: Gini index = 0.0 [40, 0, 0], i.e. only one class exists ==> Class 0 (setosa)
  • Petal_length > 2.45: Gini index = 0.49968 [0, 41, 39], i.e. two classes exist ==> Class 1 (versicolor) & Class 2 (virginica)

So, we will continue working only with the data satisfying petal_length > 2.45.


We will again perform the same procedure to find the split point, this time on the petal_width attribute, using the Gini index.
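
Reusing the split_gini helper sketched earlier on the data that reaches the right branch:

    # Samples that reach the right child of the root split
    right_node = train[train['petal_length'] > 2.45]

    # Score the petal_width split point examined below
    split_gini(right_node, 'petal_width', 1.6)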

The observations obtained are:

  • Petal_width <= 1.6: Gini index = 0.16873 [0, 39, 4] ==> Class 1 (39 samples) & Class 2 (4 samples)
  • Petal_width > 1.6: Gini index = 0.10226 [0, 2, 35] ==> Class 1 (2 samples) & Class 2 (35 samples)

For petal_width <= 1.6 the probability of class 1 is higher, so this leaf predicts class 1. For petal_width > 1.6 the probability of class 2 is higher, so this leaf predicts class 2.

Decision Tree

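The tree diagram from the original post is not reproduced here, but the fitted tree boils down to three rules, which we can write out as a small classifier (the function name predict is ours):

    def predict(row):
        """Classify one Iris sample with the hand-built tree."""
        if row['petal_length'] <= 2.45:
            return 0        # setosa
        elif row['petal_width'] <= 1.6:
            return 1        # mostly versicolor in this branch
        else:
            return 2        # mostly virginica in this branch

    # Predictions for the held-out test set
    predictions = test.apply(predict, axis=1)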

Evaluate the Tree

We will evaluate the accuracy of our tree using the test data. To do so, we will:

  • Cross-validate by shuffling the dataframe several times.
  • Compute the accuracy percentage for each shuffled dataframe (a sketch of this loop follows below).
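A sketch of that evaluation loop, shuffling the full dataframe a few times, holding out 30 rows each time and measuring the accuracy of the hand-built rules (the number of shuffles and the fold size are our assumptions):

    from sklearn.utils import shuffle

    accuracies = []
    for seed in range(5):
        shuffled = shuffle(iris, random_state=seed)
        test_fold = shuffled.iloc[:30]              # hold out 30 rows
        preds = test_fold.apply(predict, axis=1)
        acc = (preds == test_fold['species']).mean() * 100
        accuracies.append(acc)
        print(f'shuffle {seed}: accuracy = {acc:.1f}%')

    print(f'mean accuracy = {np.mean(accuracies):.1f}%')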

Our tree is about 94% accurate. We can also experiment with other tree structures and find the one that gives a more accurate result.