## Multiple Linear Regression from Scratch in Python

Regression analysis is a statistical method to model the relationship between a dependent (target) and independent (predictor) variables with one or more independent variables. Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict the continuous output variable based on the one or more predictor variables.

Multiple Linear Regression is a regression algorithm which models the linear relationship between a single dependent continuous variable and more than one independent variable. The equation for multiple linear regression is given by, Ŷ: Output variable
x1,x2,...,xn: Input variable with n features
β01,...,βn: Coefficients

In matrix form, where, Cost Function

In order to test our model, we have to define a cost function which gives us the error in our model. Ŷ in above equation is our hypothesis function. The cost function is given by, where,

• hβ(x) = Ŷ
• m = number of samples
• yi = actual output values

Gradient Descent is an optimization algorithm, which is used to minimize the cost function and obtain an optimized value of coefficients i.e β. The algorithm converges the value of cost function to zero.

• Initialize β with value = 0
• Learning rate(α):- determines how big the step would be taken on each iteration. Most commonly used rates are, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3.
• Iteratively update β, the new value of β is given by, Dataset

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. Features consist of hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net hourly electrical energy output (EP) of the plant. We would rename our columns Now, we will split the data into input and output attributes i.e X,y #### Standardize/Scale the features

Here, we will scale the input attributes and the resulting data obtained is a numpy array. We would convert it into pandas dataframe. The reason behind this conversion is that, we have to add one more column to the X data for coefficient β0.   #### Train/Test Split #### Cost Function #### Initial Coefficient and Learning Rate Values #### Iteration

Before iterating through the data, make sure that the matrix compatibility among the data arrays is satisfied, for performing matrix operations.  The value of coefficients are, #### Predict and Check Accuracy

We will make the prediction for our test data. In statistics, the coefficient of determination, denoted R2 or r2 and pronounced "R squared" is used for analysing the performance of a regression model. It is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). So, if the R2 of a model is 0.50, then approximately half of the observed variation can be explained by the model's inputs. The accuracy of our model is 92%.