Multiple Linear Regression from Scratch in Python

Regression analysis is a statistical method to model the relationship between a dependent (target) and independent (predictor) variables with one or more independent variables. Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict the continuous output variable based on the one or more predictor variables.

Multiple Linear Regression is a regression algorithm which models the linear relationship between a single dependent continuous variable and more than one independent variable. The equation for multiple linear regression is given by,

Chania

Ŷ: Output variable
x1,x2,...,xn: Input variable with n features
β01,...,βn: Coefficients

In matrix form,

Chania

where,

Chania

Cost Function

In order to test our model, we have to define a cost function which gives us the error in our model. Ŷ in above equation is our hypothesis function. The cost function is given by,

Chania

where,

  • hβ(x) = Ŷ
  • m = number of samples
  • yi = actual output values

Gradient Descent

Gradient Descent is an optimization algorithm, which is used to minimize the cost function and obtain an optimized value of coefficients i.e β. The algorithm converges the value of cost function to zero.

  • Initialize β with value = 0
  • Learning rate(α):- determines how big the step would be taken on each iteration. Most commonly used rates are, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3.
  • Iteratively update β, the new value of β is given by,
    Chania

Dataset

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. Features consist of hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net hourly electrical energy output (EP) of the plant.

Load Libraries and Dataset

Chania

We would rename our columns

Chania

Now, we will split the data into input and output attributes i.e X,y

Chania

Standardize/Scale the features

Here, we will scale the input attributes and the resulting data obtained is a numpy array. We would convert it into pandas dataframe. The reason behind this conversion is that, we have to add one more column to the X data for coefficient β0.

Chania
Chania
Chania

Train/Test Split

Chania

Cost Function

Chania

Initial Coefficient and Learning Rate Values

Chania

Iteration

Before iterating through the data, make sure that the matrix compatibility among the data arrays is satisfied, for performing matrix operations.

Chania
Chania

The value of coefficients are,

Chania

Predict and Check Accuracy

We will make the prediction for our test data.

Chania

In statistics, the coefficient of determination, denoted R2 or r2 and pronounced "R squared" is used for analysing the performance of a regression model. It is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). So, if the R2 of a model is 0.50, then approximately half of the observed variation can be explained by the model's inputs.

Chania

The accuracy of our model is 92%.