## K-Nearest Neighbors in Python

K-Nearest Neighbors (K-NN) is one of the simplest machine learning algorithms, based on the supervised learning technique. It can be used for both regression and classification, but it is mostly preferred for classification problems. The main properties of K-NN are:

• Lazy learning algorithm: it does not learn anything from the training dataset; instead, it stores the dataset and performs its work at classification time. This is also called instance-based learning.
• Non-parametric algorithm: it makes no assumptions about the distribution of the training data.

The K-NN algorithm measures the similarity between new data and the available training data and assigns the new data to the category it is most similar to. Predictions for a new data point are made by searching the entire training set for the K nearest neighbors and taking the majority output value among those K neighbors.

To determine which K neighbors in the training dataset are most similar to a new input, a distance measure is used. Common distance measures include Euclidean, Manhattan, and Minkowski distance, the most commonly used being Euclidean distance.

We will explore the IRIS dataset, which is perhaps the best-known dataset in the pattern recognition literature. The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

#### Working of K-NN

1. Load libraries and IRIS dataset
2. Train/Test split with Cross-Validation
3. Compute Euclidean distance
4. Compute Accuracy Percentage

#### Load Libraries and IRIS dataset

#### Train/Test split with Cross-Validation

We will evaluate the algorithm with a form of cross-validation: 5 repetitions of a shuffled train/test split.

• Shuffle the dataset 5 times.
• For each shuffle, perform a train/test split.
• For each shuffle, apply the K-NN algorithm.
• Finally, average the accuracy percentages over the 5 shuffles.

We create a function that shuffles the dataset once per call, using the permutation method from the NumPy library. The function returns the train/test data (i.e. X_train, y_train, X_test, y_test) with a split ratio of 80/20.
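The original code is not shown here; a minimal sketch of such a shuffle-and-split helper, assuming the dataset is loaded with scikit-learn's `load_iris` (the function name `shuffle_split` is illustrative, not the author's original):

```python
import numpy as np
from sklearn.datasets import load_iris

def shuffle_split(X, y, split_ratio=0.8, rng=None):
    """Shuffle the dataset once and return an 80/20 train/test split."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X))      # random permutation of row indices
    cut = int(len(X) * split_ratio)    # index separating train from test
    train, test = idx[:cut], idx[cut:]
    return X[train], y[train], X[test], y[test]

iris = load_iris()
X_train, y_train, X_test, y_test = shuffle_split(iris.data, iris.target, rng=0)
```

With 150 rows in IRIS, an 80/20 split yields 120 training rows and 30 test rows per shuffle.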

#### Compute Euclidean distance

Whenever a prediction/classification is made for new data, the entire training dataset is searched for the K nearest neighbors. To do this, the distance between the new data point and every training point is measured, and the K nearest neighbors are used to assign the new data to a class.

**Euclidean distance**

In Cartesian coordinates, if p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) are two points in Euclidean n-space, then the distance (d) from p to q is given by the Pythagorean formula:

d(p, q) = √((q1 − p1)² + (q2 − p2)² + ... + (qn − pn)²)

With Euclidean distance, the smaller the value, the more similar two records will be.
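As a quick sketch of this formula in Python (NumPy assumed; the sample values are illustrative iris-like rows, not from the actual dataset):

```python
import numpy as np

def euclidean_distance(p, q):
    """Straight-line distance between two points in n-dimensional space."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((q - p) ** 2))

# Two 4-feature rows: the distance is the square root of the
# summed squared differences across all features.
d = euclidean_distance([5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2])
```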

We create a function that computes the Euclidean distances and selects the K nearest neighbours. In this function:

• A list "distance" is defined, which stores the distance and the corresponding class as a tuple.
• The list "distance" is sorted by distance.
• Only the K nearest neighbors are kept, and their class values are stored in the list "neighbours".
• The occurrences of each class are counted; the class with the maximum count is the predicted class for the new data.
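The steps above might be sketched as follows (the function name `predict` and its signature are assumptions for illustration, not the author's original code):

```python
import numpy as np
from collections import Counter

def predict(X_train, y_train, new_point, k):
    """Classify new_point by majority vote among its k nearest training rows."""
    new_point = np.asarray(new_point, dtype=float)
    # Store (distance, class) tuples for every training row.
    distance = [(np.sqrt(np.sum((row - new_point) ** 2)), label)
                for row, label in zip(np.asarray(X_train, dtype=float), y_train)]
    distance.sort(key=lambda t: t[0])                  # nearest first
    neighbours = [label for _, label in distance[:k]]  # classes of K nearest
    # The most frequent class among the neighbours is the prediction.
    return Counter(neighbours).most_common(1)[0][0]
```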

#### Compute Accuracy Percentage

The actual code for the K-NN algorithm starts here:

• The functions are called to shuffle the dataset and compute the Euclidean distances.
• The accuracy rate ("accuracy_rate") for K = 1 to 40 is computed for a single shuffle of the complete dataset, giving 40 values.
• The list "accuracy_rate" is appended to another list, "total_accuracy_rate", so the length of "total_accuracy_rate" is 5 (one entry per shuffle).
• The accuracy percentages are then averaged over the 5 shuffles, i.e. the mean of "total_accuracy_rate" for each K value.
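A self-contained sketch of this evaluation loop (helper names are illustrative; scikit-learn is assumed only for loading IRIS, and the neighbour search is vectorized with NumPy rather than using an explicit tuple list):

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris

def shuffle_split(X, y, split_ratio=0.8, rng=None):
    """Shuffle once and return an 80/20 train/test split."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X))
    cut = int(len(X) * split_ratio)
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]

def predict(X_train, y_train, new_point, k):
    """Majority vote among the k nearest training rows."""
    dists = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))
    neighbours = y_train[np.argsort(dists)[:k]]
    return Counter(neighbours.tolist()).most_common(1)[0][0]

iris = load_iris()
total_accuracy_rate = []
for shuffle in range(5):                       # 5 independent shuffles
    X_train, y_train, X_test, y_test = shuffle_split(
        iris.data, iris.target, rng=shuffle)
    accuracy_rate = []
    for k in range(1, 41):                     # K = 1 .. 40
        correct = sum(predict(X_train, y_train, x, k) == y
                      for x, y in zip(X_test, y_test))
        accuracy_rate.append(100.0 * correct / len(y_test))
    total_accuracy_rate.append(accuracy_rate)

# Average accuracy for each K across the 5 shuffles.
mean_accuracy = np.mean(total_accuracy_rate, axis=0)
```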

#### Accuracy Percentage Plot

Observing the plot, for K = 23 to 29 the accuracy percentage is stable at about 95%. So a K value in the range 23-29 is a good choice for this dataset.
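Such a plot could be produced with matplotlib along these lines (the `mean_accuracy` values below are random placeholders standing in for the averaged accuracies described above):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, render straight to file
import matplotlib.pyplot as plt

# Placeholder data for illustration only: 40 values, one per K.
mean_accuracy = 90 + 5 * np.random.default_rng(0).random(40)

plt.plot(range(1, 41), mean_accuracy, marker="o")
plt.xlabel("K (number of neighbours)")
plt.ylabel("Mean accuracy (%)")
plt.title("K-NN accuracy on IRIS, averaged over 5 shuffles")
plt.savefig("knn_accuracy.png")
```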