Feature Importance using Python

Feature importance refers to a family of techniques that give insight into which input features are relevant for predicting a target variable. Each feature is assigned a score, and how that score is computed depends on the model or method used.

Feature Relevance

Feature importance (or relevance) scoring supports feature selection, the process of automatically or manually selecting the features that contribute most to predicting the output/target variable. Irrelevant features can reduce the accuracy of a model. The benefits of feature selection are:

  • Reduces over-fitting
  • Improves accuracy

Dataset

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, the task with this dataset is to predict the final price of each home.

Pre-processing

If we load the dataset and perform exploratory analysis, we find that several attributes have missing values. These need to be handled before computing feature importance. Once pre-processing is done, we move directly to finding the importance of the features. After pre-processing, the dataset is assigned to a dataframe named "numeric_data".
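
The original pre-processing code is not shown, so the following is only a minimal sketch of one way to arrive at a "numeric_data" dataframe. The tiny stand-in table, the choice to drop non-numeric columns, and the median fill are all assumptions made for illustration; the original may well have encoded categorical columns instead of dropping them.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data. In practice the data would be
# loaded from a CSV file; the values below are made up for illustration.
raw = pd.DataFrame({
    "OverallQual": [7, 6, 8, 5, 9, 6],
    "GrLivArea": [1710, 1262, np.nan, 1100, 2198, 1362],
    "LotFrontage": [65.0, 80.0, 68.0, np.nan, 84.0, 85.0],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge", "Mitchel"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

# Count missing values per column to see what needs attention.
missing_counts = raw.isnull().sum()
print(missing_counts)

# Assumption: keep numeric columns only and fill gaps with the column median.
numeric_data = raw.select_dtypes(include="number")
numeric_data = numeric_data.fillna(numeric_data.median())
```

Median imputation is just one simple option; which strategy is appropriate depends on why the values are missing.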

Use Correlation Technique

Correlation is useful in data analysis and modeling for understanding the relationships between variables. A correlation can be positive, meaning both variables move in the same direction, or negative, meaning the variables move in opposite directions. Correlation can also be neutral or zero, meaning that the variables have no linear relationship.

After computing the correlation matrix, the matrix is sorted along its rows and columns in descending order. The heatmap shows that "OverallQual" (0.79) is the best predictor of "SalePrice".
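
The correlation step can be sketched roughly as follows. This is an illustration on synthetic data, not the original code: the generated columns, the "Noise" feature, and the choice to order rows and columns by their correlation with "SalePrice" are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for numeric_data (values are made up for illustration).
rng = np.random.default_rng(0)
n = 200
overall_qual = rng.integers(1, 11, size=n).astype(float)
gr_liv_area = rng.normal(1500, 400, size=n)
numeric_data = pd.DataFrame({
    "OverallQual": overall_qual,
    "GrLivArea": gr_liv_area,
    "Noise": rng.normal(0, 1, size=n),
    "SalePrice": 20000 * overall_qual + 50 * gr_liv_area
                 + rng.normal(0, 20000, size=n),
})

# Pearson correlation between every pair of columns.
corr = numeric_data.corr()

# Assumption: order rows and columns by correlation with the target.
order = corr["SalePrice"].sort_values(ascending=False).index
sorted_corr = corr.loc[order, order]

# The target's own column now lists features from most to least correlated.
print(sorted_corr["SalePrice"])
```

The sorted matrix could then be passed to a plotting function such as seaborn's heatmap to produce a figure like the one described above.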

Use Linear Regression Model

Linear machine learning algorithms fit a model in which the prediction is a weighted sum of the input attributes. We can fit a LinearRegression model on the regression dataset and retrieve the "coef_" property, which contains the coefficient found for each input variable. These coefficient values can be read as feature importance scores, provided the inputs are on comparable scales, since the magnitude of a coefficient depends on the units of its feature.
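
A minimal sketch of this approach with scikit-learn is shown below. The synthetic data and the standardization step are assumptions added for illustration; the original may have fit the model on unscaled features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and target (values are made up).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=n).astype(float),
    "GrLivArea": rng.normal(1500, 400, size=n),
    "Noise": rng.normal(0, 1, size=n),
})
y = 20000 * X["OverallQual"] + 50 * X["GrLivArea"] + rng.normal(0, 20000, size=n)

# Assumption: standardize the features so coefficient sizes are comparable.
X_scaled = StandardScaler().fit_transform(X)

model = LinearRegression().fit(X_scaled, y)

# One coefficient per input feature, ranked by absolute size.
coefficients = pd.Series(model.coef_, index=X.columns)
ranking = coefficients.abs().sort_values(ascending=False).index
importance = coefficients.reindex(ranking)
print(importance)
```

The sign of each coefficient is kept in the output, since it shows whether the feature pushes the prediction up or down.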

As you can see, the "PoolQC" attribute has the largest coefficient of all the attributes, which marks it as the most important for predicting "SalePrice" under this method.

Use CART Model

Classification and Regression Tree (CART) models offer importance scores based on the reduction in the criterion used to select split points, such as the Gini index or entropy for classification, or mean squared error for regression. After being fit, the model provides a "feature_importances_" property that can be accessed to retrieve the relative importance score of each input feature.
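
Since "SalePrice" is a continuous target, a regression tree is the natural fit here. The sketch below uses scikit-learn's DecisionTreeRegressor on synthetic data; the data and the default hyperparameters are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the features and target (values are made up).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=n).astype(float),
    "GrLivArea": rng.normal(1500, 400, size=n),
    "Noise": rng.normal(0, 1, size=n),
})
y = 20000 * X["OverallQual"] + 50 * X["GrLivArea"] + rng.normal(0, 20000, size=n)

# Fit a single regression tree; random_state makes the result repeatable.
model = DecisionTreeRegressor(random_state=0).fit(X, y)

# Importances are normalized to sum to 1 across the input features.
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```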

Here, the "OverallQual" predictor has the highest importance for predicting "SalePrice".

Use Random Forest Model

Here, we use a Random Forest model, which also provides a "feature_importances_" property that can be accessed to retrieve the relative importance score of each input feature.
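
A comparable sketch for the forest uses scikit-learn's RandomForestRegressor. Again, the synthetic data and the hyperparameters (100 trees, a fixed seed) are assumptions for illustration rather than the original settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the features and target (values are made up).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=n).astype(float),
    "GrLivArea": rng.normal(1500, 400, size=n),
    "Noise": rng.normal(0, 1, size=n),
})
y = 20000 * X["OverallQual"] + 50 * X["GrLivArea"] + rng.normal(0, 20000, size=n)

# Fit an ensemble of trees; importances are averaged over all trees.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

Because the scores are averaged over many trees, they tend to be more stable than those of a single decision tree.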

Here too, the "OverallQual" predictor has the highest importance for predicting "SalePrice".