K-Nearest Neighbors in Python

K-Nearest Neighbors (K-NN) is one of the simplest machine learning algorithms, based on the supervised learning technique. It can be used for both regression and classification, but it is mostly used for classification problems. The properties of K-NN are:

  • Lazy learning algorithm: it does not learn a model from the training dataset; instead, it stores the dataset and does the work at classification time. This is also called instance-based learning.
  • Non-parametric algorithm: it makes no assumptions about the underlying data distribution.

The K-NN algorithm measures the similarity between the new data and the available training data and puts the new data into the category it is most similar to. Predictions for new data are made by searching the entire training set for the K nearest neighbors and taking the majority class among those K neighbors as the output.

To determine which K instances in the training dataset are most similar to a new input, a distance measure is used. Several distance measures are possible, the most common being the Euclidean distance.

We will explore the IRIS dataset, which is perhaps the best-known database in the pattern recognition literature. The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are NOT linearly separable from each other.

Working of K-NN

  1. Load libraries and IRIS dataset
  2. Train/Test split with Cross-Validation
  3. Compute Euclidean distance
  4. Compute Accuracy Percentage

Load Libraries and IRIS dataset

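The original loading code is not reproduced here, so below is a minimal sketch. Loading the IRIS data through scikit-learn's load_iris is an assumption; reading the raw CSV with pandas would work just as well.

```python
import numpy as np
from sklearn.datasets import load_iris   # assumed source of the IRIS data

iris = load_iris()
X = iris.data      # 150 samples x 4 features (sepal and petal measurements)
y = iris.target    # 150 integer class labels (0, 1, 2 for the three species)

print(X.shape, y.shape)   # (150, 4) (150,)
```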

Train/Test split with Cross-Validation

We will evaluate the algorithm with a simple form of cross-validation: repeated random train/test splits, using 5 repetitions.

  • We shuffle the dataset 5 times.
  • For each shuffle, an 80/20 train/test split is made.
  • For each shuffle, the K-NN algorithm is applied.
  • Finally, we average the accuracy percentages across the shuffles.

We create a function that shuffles the dataset once per call, using the permutation method from the NumPy library, as shown in the sketch below.

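A sketch of the shuffling/splitting step; the function name shuffle_split and its exact signature are assumptions, since the text only describes the behaviour (one NumPy permutation, then an 80/20 split).

```python
import numpy as np

def shuffle_split(X, y, split_ratio=0.8):
    """Shuffle the dataset once and return an 80/20 train/test split."""
    perm = np.random.permutation(len(X))   # one random permutation of the row indices
    X, y = X[perm], y[perm]                # reorder features and labels identically
    cut = int(split_ratio * len(X))        # 80% of 150 rows -> 120 training rows
    return X[:cut], y[:cut], X[cut:], y[cut:]

X_train, y_train, X_test, y_test = shuffle_split(X, y)
```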

The function returns the train/test data (i.e. X_train, y_train, X_test, y_test) with a split ratio of 80/20.

Compute Euclidean distance

Whenever a prediction/classification is made for new data, the entire training dataset is searched for the K nearest neighbors. For this, the distance between the new data point and every instance in the training data is measured, and the K nearest neighbors are used to assign the new data point to a class.

Euclidean distance

In Cartesian coordinates, if p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) are two points in Euclidean n-space, then the distance d from p to q is given by the Pythagorean formula:

$$d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

With Euclidean distance, the smaller the value, the more similar two records will be.
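As an illustration, take the first two records of the IRIS dataset, p = (5.1, 3.5, 1.4, 0.2) and q = (4.9, 3.0, 1.4, 0.2):

$$d(p, q) = \sqrt{(0.2)^2 + (0.5)^2 + 0^2 + 0^2} = \sqrt{0.29} \approx 0.54$$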

We create a function to compute the Euclidean distance and classify a new point:

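Below is a sketch that matches the step-by-step description that follows; the function name predict is an assumption, while the list names "distance" and "neighbours" follow the text.

```python
from collections import Counter
import numpy as np

def predict(X_train, y_train, new_point, k):
    """Classify one new point by majority vote among its K nearest neighbours."""
    distance = []
    for xi, yi in zip(X_train, y_train):
        d = np.sqrt(np.sum((xi - new_point) ** 2))   # Euclidean distance
        distance.append((d, yi))                      # (distance, class) tuple
    distance.sort(key=lambda t: t[0])                 # sort by distance, nearest first
    neighbours = [cls for _, cls in distance[:k]]     # classes of the K nearest
    return Counter(neighbours).most_common(1)[0][0]   # most frequent class wins
```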

In the above function:

  • A list "distance" is defined, which stores the distance and the corresponding class as a tuple.
  • The list "distance" is sorted by distance.
  • Only the K nearest neighbours are kept, and their class values are stored in the list "neighbours".
  • The number of occurrences of each class among the neighbours is counted; the class with the maximum occurrences is the predicted class for the new data.

Compute Accuracy Percentage

The actual code for the K-NN algorithm starts here:

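A sketch of the main loop, reconstructed from the description below; the helper accuracy and the reuse of shuffle_split and predict from the earlier sketches are assumptions.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Accuracy percentage: the share of correct predictions."""
    return 100.0 * np.mean(np.asarray(y_true) == np.asarray(y_pred))

total_accuracy_rate = []
for _ in range(5):                                    # 5 shuffles of the dataset
    X_train, y_train, X_test, y_test = shuffle_split(X, y)
    accuracy_rate = []
    for k in range(1, 41):                            # K = 1 .. 40
        y_pred = [predict(X_train, y_train, x, k) for x in X_test]
        accuracy_rate.append(accuracy(y_test, y_pred))
    total_accuracy_rate.append(accuracy_rate)         # 5 lists of 40 accuracies

mean_accuracy = np.mean(total_accuracy_rate, axis=0)  # mean per K across shuffles
```
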
  • The functions defined earlier are called to shuffle the dataset and to compute the Euclidean distance.
  • The accuracy rate (accuracy_rate) for K = 1 to 40 is computed for a single shuffle of the complete dataset, giving 40 values.
  • Each list "accuracy_rate" is stored in another list, "total_accuracy_rate", so the length of "total_accuracy_rate" is 5 (one entry per shuffle).
  • We then average the accuracy percentage over the 5 shuffles, i.e. take the mean of "total_accuracy_rate" for each value of K.

Accuracy Percentage Plot

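The original plotting code is not shown, so the following matplotlib sketch is an assumption about how the plot of mean accuracy versus K was produced, continuing from the loop above.

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(range(1, 41), mean_accuracy, marker='o')   # one point per value of K
plt.xlabel('K')
plt.ylabel('Mean accuracy percentage')
plt.title('K-NN accuracy on IRIS, averaged over 5 shuffles')
plt.show()
```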

Observing the plot above, the accuracy percentage is stable at about 95% for K = 23 to 29. So a K value in the range 23-29 is a good choice for this dataset.