Linear Discriminant Analysis with Python using Bayes' Theorem

Linear discriminant analysis is a linear classification technique used for qualitative analysis. This method is used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. This model is used for dimensionality reduction before the classification is done.

In this tutorial, we won't be dealing with dimensionality reduction, rather, we would use Bayes' Theorem to estimate the probabilities. I have taken some of the ideas from machinelearningmastery, which is a great portal for data science beginner's.

We will explore the IRIS dataset, which is the best known database to be found in the pattern recognition literature. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

Attribute information
  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm
  5. class(species name):
    • Iris Setosa
    • Iris Versicolor
    • Iris Virginica

The Linear discriminant analysis makes some assumptions over the data,

  • The attributes have gaussian distribution.
  • The attributes have the same variance about their mean values.
  1. Load Libraries and prepare DataFrame

    Here, we will first load our necessary libraries and then we would prepare our dataframe.

    Chania
    Chania
  2. Visualize the data

    The Linear discriminant analysis assumes that the attributes needs to have gaussian distribution. So we will plot the Kernel density estimation(KDE) plot using matplotlib and seaborn library.

    Chania
    Chania

    The plots are gaussian. The density plot shows that, the petal_length(cm) and petal_width(cm) creates a clear discrimination between the classes.

  3. Train/Test split the data

    We will train/test split the data, with 80% train and 20% test data.

    Chania
  4. LDA model

    We will train our model using X_train and y_train dataset. The steps to be followed are:

    • Compute the total mean for all attributes using train dataset(X_train) and store it in matrix form.
    • Compute the mean vector for each class using groupby method of pandas library.
    • Compute the Covariance matrix for each class using the total mean value.
    • Compute total Covariance matrix
    • Compute discriminant function(F) for each class
    1. Compute the total mean
      Chania
    2. Compute the mean vector for each class using groupby method
      Chania
      Chania
    3. Compute Covariance matrix for each class using total mean value

      In order to calculate Covariance matrix for each class, we need

      1. Compute the difference of class values and the total mean value i.e d = (class value matrix) - (total mean)
      2. Covariance matrix(each class) = (1/number of samples of the respected class) * (d) * (transpose(d))

      First group the values of the classes and store in different variables

      Chania

      Compute the difference, d for each class

      Chania

      Find the number of samples for each class

      Chania
      Chania

      Compute the Covariance matrix

      Chania
    4. Compute total Covariance matrix(C)
      Chania
    5. Compute discriminant function(F) for each class

      The discriminant function for class,i is given by,
      Fi = (Mi)*(C-1)*(xT) - (0.5)*(Mi)*(C-1)*(MiT) + ln(Pi)

      where,
      Mi = mean of class,i
      C-1 = inverse of covariance matrix, C
      xT = transpose of input(row) attributes
      Pi = probability of class,i observed in training data

      Probabilities for each class,

      Chania

      Inverse of covariance matrix,C

      Chania

      The input attributes will be taken from test dataset i.e X_test

      Chania

      Now, we will compute the discriminant function(F) for our 3 classes, i.e, setosa, versicolor and virginica & supply with the test input from X_test. We would obtain 3 set of values of F. The value with higher F represents the class, the current attribute represent. We will store the predictions in a list.

      Chania