MLmuse: Correlation and Collinearity — How they can make or break a model


We know that the purpose of any supervised machine learning model is to establish a function of the predictors that best explains the response variable. For this function to be stable and a reliable estimate of the target variable, it is important that the predictors are not correlated with each other. The first step to ensure this is correlation analysis.

Correlation analysis is one of those important checks that needs to be performed at various stages of a project: during data analysis, before and after feature transformations, and during feature engineering and feature selection.

Understanding Correlation

Correlation is a statistical measure that indicates the extent to which two or more variables move together¹. A positive correlation indicates that the variables increase or decrease together. A negative correlation indicates that if one variable increases, the other decreases, and vice versa².

Covariance is another measure that describes the degree to which two variables tend to deviate from their means in similar ways. But covariance is not unit-less, which makes it hard to interpret the strength of the relationship between the variables. Hence, it is normalized by the standard deviations of the two variables to obtain a dimensionless, unit-less measure called the correlation coefficient.

The correlation coefficient indicates the strength of the linear relationship that might be existing between two variables.

[Formula: the correlation coefficient r = cov(X, Y) / (σX · σY), a unit-less value that always lies between −1 and +1]
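
To make this concrete, here is a minimal sketch (using NumPy, with made-up numbers) of how the covariance is rescaled by the two standard deviations to give the correlation coefficient:

```python
# Minimal sketch: covariance rescaled by the two standard deviations gives the
# correlation coefficient. The numbers are made up purely for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_xy = np.cov(x, y)[0, 1]                            # has units of x * y
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))   # unit-less, in [-1, 1]

print(round(r, 3))                                     # ~1: strong positive linear relation
print(round(np.corrcoef(x, y)[0, 1], 3))               # same value via NumPy's built-in
```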

“Correlation doesn’t imply causation”

For example, the sales of ice creams and sunglasses both increase in summer. They tend to have a high positive correlation, but this doesn’t mean that buying ice cream makes people want to wear sunglasses. We have to read between the lines to see that there is likely another variable, such as ‘temperature’, that influences both of them in the same direction.

Also, when the correlation coefficient of the two variables is zero, it only indicates the absence of a ‘linear’ relationship between them and doesn’t imply that the variables are independent.
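
As a quick illustration (a toy example, not from the case study below), the following shows two perfectly dependent variables whose linear correlation is nonetheless essentially zero:

```python
# y is completely determined by x, yet the linear correlation is ~0
# because the relationship is quadratic, not linear.
import numpy as np

x = np.linspace(-3, 3, 101)
y = x ** 2

print(round(float(np.corrcoef(x, y)[0, 1]), 3))   # ~0.0, but x and y are clearly not independent
```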

How are correlation and collinearity different?

Collinearity is a linear association between two predictors. Multicollinearity is a situation where two or more predictors are highly linearly related. In general, an absolute correlation coefficient of >0.7 among two or more predictors indicates the presence of multicollinearity.

‘Predictors’ is the point of focus here. Correlation between a predictor and the response is a good sign of predictability. But correlation among the predictors is a problem that has to be rectified before we can come up with a reliable model.

Diagnostics of multicollinearity

1. Prominent changes in the estimated regression coefficients when a predictor is added or deleted

2. The variance inflation factor (VIF) provides a formal detection and tolerance check for multicollinearity. A VIF of 5 or 10 and above (the cut-off depends on the business problem) indicates a multicollinearity problem; a small code sketch of this check follows the list.

VIF = 1 / (1 − R²), where R² is the coefficient of determination obtained by regressing that predictor on all the other predictors.

If there is no collinearity, R² is 0 and the VIF equals 1. A VIF of 10 means that the variance of the predictor’s coefficient is 10 times larger than it would be if there were no collinearity.

3. The correlation matrix of the predictors, as mentioned above, may indicate the presence of multicollinearity. Though correlation describes a bivariate linear relationship whereas multicollinearity is multivariate, the correlation matrix is often a good first indicator of multicollinearity and of the need for further investigation.

4. If a multivariate regression finds an insignificant coefficient for a particular predictor, but a simple linear regression of the response variable on that predictor shows a coefficient significantly different from zero, this indicates the presence of multicollinearity.
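
The snippet below is a minimal sketch of the VIF check from item 2, assuming the numeric predictors sit in a pandas DataFrame. It uses statsmodels’ variance_inflation_factor; the helper name vif_table and the example column names are illustrative, not code from the original analysis.

```python
# Hedged sketch of a VIF table, assuming `X` is a pandas DataFrame that
# contains only the numeric predictors (no response column).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF of each predictor, computed against a model that includes an intercept."""
    Xc = sm.add_constant(X)
    vifs = pd.Series(
        [variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])],
        index=Xc.columns,
    )
    return vifs.drop("const")   # the intercept's VIF is not informative

# Usage (column names are illustrative):
# print(vif_table(df[["contract_age", "MonthlyCharges", "TotalCharges"]]))
```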

Problems due to multicollinearity

1. Redundancy: two predictors might be providing the same information about the response variable thereby leading to unreliable coefficients of the predictors (especially for linear models)

2. The estimate of a predictor on the response variable will tend to be less precise and less reliable

3. An important predictor can appear unimportant because it has a collinear relationship with other predictors

4. The standard errors of the coefficients of the affected predictors tend to be large. In that case, we may fail to reject the null hypothesis of linear regression that the coefficient is equal to zero. This leads to a “Type II error”: we are led to believe that a predictor has no significant impact on the response variable when in fact it does

5. Overfitting — the best models are those in which each predictor variable has a unique impact on the response variable. When redundant or correlated predictors are used to explain the response, the model tends to overfit. That is, the model does well on the train data but does a poor job on the test data, defeating the whole purpose of model building

What can be done?

1. Drop redundant variables or the ones with a high VIF — though this may again lead to a loss of information

2. Come up with interaction terms or polynomial terms and drop the redundant features

3. If the correlated predictors are different lagged values of the same underlying explanator, then a distributed lag technique can be used to impose a general structure on the relative values of the coefficients to be estimated³

4. Use principal component analysis (PCA, which is also a dimensionality reduction technique), a statistical procedure that converts a set of possibly correlated predictors into a set of linearly uncorrelated variables (a short sketch follows this list).
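
As a sketch of option 4 (not the author’s exact code), PCA applied to standardized predictors yields components that are pairwise uncorrelated; `X` is assumed to hold the numeric predictors.

```python
# Hedged sketch: convert possibly correlated predictors into linearly
# uncorrelated principal components. Standardizing first matters because
# PCA is sensitive to the scale of the inputs.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))  # keep 95% of the variance
# X_uncorrelated = pca_pipeline.fit_transform(X)   # components are pairwise uncorrelated
```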

Dealing with multicollinearity: a use case

This is a case of a Malaysian telecom operator that was interested in customer churn analysis. The table below shows a glimpse of a few of the important features of the dataset:

[Table: a glimpse of key features of the churn dataset, including contract_age, MonthlyCharges, and TotalCharges]

The correlation matrix below for the numeric features shows high correlations of 0.82 and 0.65 between (TotalCharges, contract_age) and (TotalCharges, MonthlyCharges) respectively. This indicates a possible problem of multicollinearity and the need for further investigation.

[Figure: correlation matrix of the numeric predictors]
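
A correlation matrix like the one above can be produced with pandas and seaborn; the data frame `df` and the column names below are assumptions based on the table shown earlier, not the author’s original code.

```python
# Hedged sketch: Pearson correlation matrix of the numeric churn features.
import seaborn as sns
import matplotlib.pyplot as plt

numeric_cols = ["contract_age", "MonthlyCharges", "TotalCharges"]   # illustrative names
corr = df[numeric_cols].corr()                                       # Pearson by default
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```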

Computing VIF

[Table: VIF values of contract_age, MonthlyCharges, and TotalCharges]

Clearly, the VIF of TotalCharges is higher than that of the other two features. If 5 is the cut-off point for identifying the variables causing multicollinearity, do we remove the contract_age and TotalCharges features both at once? No. We first remove the one with the highest VIF and then calculate the VIFs again.

[Table: VIF values recomputed after dropping TotalCharges]

Here we go! Just removing TotalCharges from the data frame brought the VIFs of the other variables down to the desired value of ~1.
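
The step-wise procedure just described can be sketched as a small loop that reuses the vif_table() helper defined earlier (again, an illustration rather than the original code):

```python
# Hedged sketch: repeatedly drop the predictor with the highest VIF until all
# remaining VIFs are below the chosen cut-off (5 here).
def drop_high_vif(X, threshold=5.0):
    X = X.copy()
    while X.shape[1] > 1:
        vifs = vif_table(X)
        worst = vifs.idxmax()
        if vifs[worst] < threshold:
            break
        X = X.drop(columns=worst)    # in this case, TotalCharges is dropped first
    return X

# X_reduced = drop_high_vif(df[["contract_age", "MonthlyCharges", "TotalCharges"]])
```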

It also makes intuitive sense: contract_age is measured in months and the charges are billed monthly, so TotalCharges is essentially MonthlyCharges accumulated over contract_age and carries little extra information. As expected, VIF did the job!

Effect on coefficients of predictors

[Table: model coefficients with and without TotalCharges]

When TotalCharges is included, the absolute coefficient of contract_age is inflated and that of MonthlyCharges is deflated. Though the coefficient of TotalCharges is close to 0, its influence on the other features is significant. So, it might be a good idea to drop TotalCharges before building a model.
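
A comparison like the one above could be reproduced roughly as follows; the data frame `df`, the target column name `churn`, and the use of logistic regression are assumptions, not the author’s exact setup.

```python
# Hedged sketch: compare standardized coefficients with and without TotalCharges.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

def fitted_coefs(feature_cols):
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(df[feature_cols], df["churn"])                  # `churn` is a hypothetical target name
    return pd.Series(model[-1].coef_[0], index=feature_cols)  # coefficients of the fitted classifier

print(fitted_coefs(["contract_age", "MonthlyCharges", "TotalCharges"]))
print(fitted_coefs(["contract_age", "MonthlyCharges"]))
```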

Effect on model performance

[Figures: validation AUC for the model with all features, the model without TotalCharges, and the model trained on PCA-transformed features]

We see that the AUC (Area Under the Curve) on the validation data is higher when TotalCharges is included. But this performance cannot be guaranteed on new test data unless we are sure that TotalCharges will continue to have a similar correlation with the other predictors.

Also, we find that the AUC on validation is the same when the model is fitted after removing the TotalCharges variable as when the model is trained on PCA-transformed train data. This makes it evident that identifying and removing a feature causing collinearity has given us results similar to what could have been achieved with principal component analysis.
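
The three variants compared above (all features, TotalCharges dropped, PCA-transformed features) could be evaluated along these lines; the data frame, target name, column names, and model choice are illustrative assumptions rather than the original code.

```python
# Hedged sketch: AUC on a held-out validation set for three model variants.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

features = ["contract_age", "MonthlyCharges", "TotalCharges"]
X_tr, X_val, y_tr, y_val = train_test_split(
    df[features], df["churn"], test_size=0.3, random_state=42, stratify=df["churn"]
)

variants = {
    "all features":      (features,
                          make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    "drop TotalCharges": (["contract_age", "MonthlyCharges"],
                          make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    "PCA":               (features,
                          make_pipeline(StandardScaler(), PCA(n_components=2),
                                        LogisticRegression(max_iter=1000))),
}
for name, (cols, model) in variants.items():
    model.fit(X_tr[cols], y_tr)
    auc = roc_auc_score(y_val, model.predict_proba(X_val[cols])[:, 1])
    print(f"{name}: validation AUC = {auc:.3f}")
```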

When the number of features is very high, we can first find and eliminate the predictors with a very high absolute correlation (say >0.8), then calculate the VIFs for further elimination, and then, if needed, apply PCA to make the remaining features linearly uncorrelated before heading into the model training stage.
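
A minimal sketch of the first of those steps, dropping one variable from every pair whose absolute correlation exceeds the threshold, might look like this:

```python
# Hedged sketch: drop one predictor from every pair with |correlation| > threshold,
# then follow up with VIF checks and, if needed, PCA.
import numpy as np
import pandas as pd

def drop_highly_correlated(X: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # upper triangle only
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)
```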

Conclusion

Though simple, correlation analysis can go a long way in developing a good model. Correlation between the predictors and the response is good! We don’t want to do much about it. What needs to be fixed is collinearity among the predictors.

References

¹ Ivy Wigmore (November 2016): Correlation. Retrieved from https://www.mckinsey.com/industries/high-tech/our-insights/an-executives-guide-to-machine-learning

² Elon Correa and Royston Goodacre (January 2011): A genetic algorithm-Bayesian network approach for the analysis of metabolomics and spectroscopic data: application to the rapid identification of Bacillus spores and classification of Bacillus species. Retrieved from https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-33

³ Multicollinearity. Retrieved from https://en.wikipedia.org/wiki/Multicollinearity

Clairvoyant Blog

Clairvoyant is a data and decision engineering company. We design, implement and operate data management platforms with the aim to deliver transformative business value to our customers.

Written by RekhaMolala
Data Scientist / http://linkedin.com/in/rekhas31
