Regression analysis is a statistical method for modelling the relationship between a dependent (target) variable and one or more independent (predictor) variables. Regression is a supervised learning technique that helps find the correlation between variables and enables us to predict a continuous output variable from one or more predictor variables.
Multiple Linear Regression is a regression algorithm that models the linear relationship between a single continuous dependent variable and more than one independent variable. The equation for multiple linear regression is given by,

Ŷ = β0 + β1x1 + β2x2 + ... + βnxn

where,
Ŷ: output variable
x1, x2, ..., xn: input variables for n features
β0, β1, ..., βn: model coefficients (β0 is the intercept)
In matrix form,

Ŷ = Xβ

where X is the m×(n+1) design matrix (with a leading column of ones for β0) and β is the (n+1)-dimensional vector of coefficients.
In order to evaluate our model, we have to define a cost function that measures the error in our model. Ŷ in the above equation is our hypothesis function. The cost function (the mean squared error, halved for convenience) is given by,

J(β) = (1/2m) Σ (Ŷi − yi)²

where m is the number of training samples and the sum runs over all samples i.
Gradient Descent is an optimization algorithm used to minimize the cost function and obtain optimized values of the coefficients β. Each iteration applies the update β := β − α∇J(β), where α is the learning rate, moving β in the direction of steepest descent until the cost function converges to a minimum (not necessarily zero).
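The update rule above can be sketched as follows, using a tiny synthetic dataset in place of the real one (all names and values here are illustrative):

```python
import numpy as np

# Synthetic data: m samples, n features, plus a leading column of ones for beta_0.
rng = np.random.default_rng(0)
m, n = 100, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_beta = np.array([5.0, 2.0, -3.0])
y = X @ true_beta + rng.normal(scale=0.1, size=m)

def cost(beta):
    # J(beta) = (1/2m) * sum of squared residuals
    residuals = X @ beta - y
    return (residuals @ residuals) / (2 * m)

beta = np.zeros(n + 1)
alpha = 0.1  # learning rate
for _ in range(1000):
    # Gradient of J with respect to beta; update moves beta downhill.
    gradient = X.T @ (X @ beta - y) / m
    beta -= alpha * gradient
```

After the loop, `beta` should be close to `true_beta`, and the cost should be far below its starting value.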
The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the plant was operating at full load. Features consist of hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V), used to predict the net hourly electrical energy output (EP) of the plant.
First, we rename the columns to descriptive names.
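A sketch of the rename step; the UCI release of this dataset uses the abbreviated headers AT, V, AP, RH and PE, and the descriptive names chosen below are illustrative (a small hand-made frame stands in for the real file):

```python
import pandas as pd

# Stand-in for the loaded dataset, with the UCI CCPP column names.
df = pd.DataFrame({
    "AT": [14.96, 25.18], "V": [41.76, 62.96],
    "AP": [1024.07, 1020.04], "RH": [73.17, 59.08],
    "PE": [463.26, 444.37],
})

# Map the abbreviated headers to readable names.
df = df.rename(columns={
    "AT": "Temperature", "V": "Exhaust_Vacuum",
    "AP": "Ambient_Pressure", "RH": "Relative_Humidity",
    "PE": "Energy_Output",
})
```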
Now, we will split the data into input and output attributes, i.e. X and y.
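One way to sketch this split, assuming the energy output is the last column of the dataframe:

```python
import pandas as pd

# Small stand-in for the renamed dataset.
df = pd.DataFrame({
    "Temperature": [14.96, 25.18], "Exhaust_Vacuum": [41.76, 62.96],
    "Ambient_Pressure": [1024.07, 1020.04], "Relative_Humidity": [73.17, 59.08],
    "Energy_Output": [463.26, 444.37],
})

# Every column except the last is an input; the last column is the target.
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
```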
Here, we scale the input attributes; the result of scaling is a numpy array, which we convert back into a pandas dataframe. The reason for this conversion is that we have to add one more column to the X data for the coefficient β0: a column of ones, so that β0 acts as the intercept.
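A minimal sketch of this step; plain pandas/numpy standardization is used here to keep the example self-contained (a library scaler such as scikit-learn's `StandardScaler` would return a numpy array and produce the same values):

```python
import numpy as np
import pandas as pd

# Illustrative input frame.
X = pd.DataFrame({
    "Temperature": [14.96, 25.18, 5.11],
    "Exhaust_Vacuum": [41.76, 62.96, 39.40],
    "Ambient_Pressure": [1024.07, 1020.04, 1012.16],
    "Relative_Humidity": [73.17, 59.08, 92.14],
})

# Standardize each feature to zero mean and unit variance (population std, ddof=0).
X_scaled = (X - X.mean()) / X.std(ddof=0)

# Insert a column of ones at position 0 so that beta_0 acts as the intercept.
X_scaled.insert(0, "beta_0", 1.0)
```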
Before iterating through the data, make sure the arrays are dimensionally compatible for the matrix operations that follow.
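The compatibility requirement can be checked explicitly; with m samples and n features plus the intercept column, the shapes must line up as sketched below (shapes are illustrative):

```python
import numpy as np

m, n = 6, 4
X = np.ones((m, n + 1))   # design matrix, beta_0 column included
beta = np.zeros(n + 1)    # one coefficient per column of X
y = np.zeros(m)           # one target value per sample

# X @ beta is only defined if X has as many columns as beta has entries,
# and it must yield one prediction per sample, matching y's shape.
assert X.shape[1] == beta.shape[0]
y_hat = X @ beta
assert y_hat.shape == y.shape
```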
The values of the coefficients are:
We will now make predictions for our test data.
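Prediction is just the matrix product of the test design matrix with the fitted coefficients. A sketch, with hypothetical coefficient values in place of the fitted ones:

```python
import numpy as np

# Hypothetical fitted coefficients [beta_0, beta_1, ..., beta_4] (illustrative only).
beta = np.array([454.0, -14.0, -3.0, 0.4, -2.3])

# A small test design matrix: leading column of ones, then 4 scaled features.
rng = np.random.default_rng(1)
X_test = np.hstack([np.ones((3, 1)), rng.normal(size=(3, 4))])

# One prediction per test sample.
y_pred = X_test @ beta
```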
In statistics, the coefficient of determination, denoted R² or r² and pronounced "R squared", is used to analyse the performance of a regression model. It is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). So, if the R² of a model is 0.50, roughly half of the observed variation can be explained by the model's inputs.
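The definition above, R² = 1 − SS_res/SS_tot, can be computed directly (the small arrays are illustrative):

```python
import numpy as np

def r2_score(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot: fraction of variance explained by the model.
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
```

A perfect prediction gives R² = 1; predicting the mean of y for every sample gives R² = 0.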
The R² score of our model is approximately 0.92, i.e. the model explains about 92% of the variance in the energy output.