Linear Regression in Python (Univariate) — diagnostic plots and much more

Jasmeet Gujral
10 min read · Jun 26, 2021


(ISLR Series in Python)

This article is part of our ISLR series, in which we cover a few examples from the exercises in the “Introduction to Statistical Learning” book and discuss them in depth. The ideal audience for these articles is someone starting out with regression or looking for a refresher.

This article mainly covers the usage of diagnostic plots in analyzing and improving linear regression models.

Assumption: The reader has a basic understanding of Python and statistics.

Suggestion: The tutorial is complete in itself, but it is best understood when read in conjunction with Chapter 3 of ISLR (http://faculty.marshall.usc.edu/gareth-james).

Link to jupyter notebook is provided at the end of the article.

The Auto dataset contains information about 392 vehicles. Our goal is to predict mpg from horsepower using linear regression and its variants.

The data file can be downloaded from Git here.

Description: Gas mileage, horsepower, and other information for 392 vehicles.

Usage: Auto — Auto Data Set

Format: A data frame with 392 observations on the following 9 variables.

  • mpg: miles per gallon
  • cylinders: Number of cylinders between 4 and 8
  • displacement: Engine displacement (cu. inches)
  • horsepower: Engine horsepower
  • weight: Vehicle weight (lbs.)
  • acceleration: Time to accelerate from 0 to 60 mph (sec.)
  • year: Model year (modulo 100)
  • origin: Origin of the car (1. American, 2. European, 3. Japanese)
  • name: Vehicle name

The original data contained 408 observations, but 16 observations with missing values were removed.

Source: This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.

Loading and Cleaning data

We start by loading the data into a pandas DataFrame and analysing it using the .head() and .info() methods.

We use .head() to view the first five rows of the dataset.

import pandas as pd

auto_dataset = pd.read_csv('Auto.csv')
auto_dataset.head()
auto_dataset.info()
Although the horsepower values in the first five rows appear to be integers, the column’s data type is object. We will therefore force-convert this column to numeric values, coercing any values that cannot be converted to NaN.

auto_dataset.horsepower = pd.to_numeric(auto_dataset.horsepower, errors='coerce')
auto_dataset.info()

We can now observe that the horsepower column is a float with five missing values that pandas could not convert.

We will set the name column as the index and slice the DataFrame down to the two columns of interest, mpg and horsepower, for ease of use.

auto_dataset = auto_dataset.set_index('name')   # use the vehicle name as the row index
required_df = auto_dataset[['mpg', 'horsepower']]
required_df.describe()

Exploratory Analysis

Before jumping directly into the linear regression model, we should first plot and observe the relationship between the two variables. We will use regplot from the seaborn library, which lets us draw the best-fit line over the scatter plot.

import seaborn as sns
import matplotlib.pyplot as plt

sns.regplot(x="horsepower", y="mpg", data=required_df, line_kws={'color': 'red'})
plt.show()

We can observe from the above plot that linear regression captures the essence of the negative relationship between mpg and horsepower. However, there is some convexity in the relationship that linear regression fails to capture. We will discuss this in detail while discussing diagnostics on linear regression assumptions.

Basics of Linear Regression

This section briefly and intuitively discusses Ordinary Least Squares (OLS) regression without going deep into the mathematics.

OLS aims to find the best-fit line for which the sum of the squared vertical distances of all points from the line is minimum.

The equation below shows what a linear relation looks like, along with the essential terminology.
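The original equation figure is not reproduced here; written out (a standard formulation, not copied from the article’s image), the true linear relation is

    y = \beta_0 + \beta_1 x + \varepsilon

where \beta_0 is the intercept, \beta_1 is the slope, and \varepsilon is the irreducible error term.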

Linear regression estimates the above relation as shown below. There is no error term in the fitted equation, because that term is non-modellable; linear regression aims to find unbiased estimates of the parameters.
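Again written out in place of the original figure, the fitted relation is

    \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x

where the hats denote the estimated parameters and the predicted value.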

Note: The OLS line always passes through the point given by the means of X and Y.

Linear Regression in Python

To run linear regression in Python, we use the statsmodels package. Once we have our data in a DataFrame, it takes only two lines of code to fit the model and get its summary.

import statsmodels.formula.api as smf
lin_model = smf.ols("mpg ~ horsepower", data=required_df).fit()
lin_model.summary()

The summary provides comprehensive information about the linear regression. The starting point is always to look at the estimated parameters and their p-values.

P-values allow us to test the significance of the individual estimated parameters. P-values close to zero for both parameters suggest that we can reject, with high confidence, the null hypothesis that either the intercept or the slope is zero. The Prob (F-statistic) tests the significance of the model as a whole; its near-zero value suggests that we can reject, with high confidence, the null hypothesis that the model captures no relationship between the dependent and independent variables.

The R-squared statistic tells us how much of the total variance in the data our model captures. The current value suggests we capture 60.5% of the total variance. The unexplained variance can have two sources: 1) the error term, the non-modellable part, and 2) bias in the model, i.e. the current model is too simple to capture the actual relationship. In this example the second source has a significant effect, as we are trying to model a convex relationship with a linear model.
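As a quick programmatic companion to the summary table (a minimal sketch; these are standard attributes of the statsmodels results object, not shown in the original article), the quantities discussed above can be pulled directly from the fitted model:

# Estimated parameters and the statistics discussed above
print(lin_model.params)     # intercept and slope estimates
print(lin_model.pvalues)    # p-values for the individual parameters
print(lin_model.f_pvalue)   # Prob (F-statistic) for the model as a whole
print(lin_model.rsquared)   # share of the total variance explained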

The last portion of the summary gives information about the error terms. The Durbin-Watson statistic checks for the presence of auto-correlation, and the Jarque-Bera test checks whether the error terms follow a normal distribution. We will discuss the importance of these in a later section.
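These two statistics can also be computed on the residuals directly; the sketch below uses the statsmodels helper functions (it is not part of the original article):

from statsmodels.stats.stattools import durbin_watson, jarque_bera

# Durbin-Watson statistic: values near 2 indicate little auto-correlation in the residuals
print(durbin_watson(lin_model.resid))

# Jarque-Bera test: returns the statistic, its p-value, and the skew and kurtosis of the residuals
print(jarque_bera(lin_model.resid))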

Diagnostic Plots

Using diagnostic plots, we test whether our model satisfies the various assumptions of linear regression. These checks assess the correctness of the model and tell us how to improve it so that it captures the relationship between the dependent and independent variables more accurately.

Below is the list of assumptions that we will be testing:

  • Linearity: The residuals should not contain any modellable information.
  • Endogeneity: The residuals should not be related to the independent variables.
  • Normality: The residuals should be normally distributed.
  • Homoskedasticity: The variance of the error terms should be constant.
  • Outliers and high leverage: There should be no high-influence points in the dataset that distort the model.

For a detailed treatment of the above assumptions, please refer to the Gauss-Markov theorem.

Linearity: The residuals should be independent of the fitted values. The Residuals vs. Fitted Values plot should not show any trend, and the points should be scattered randomly around the x-axis.

Any pattern in the plot implies some amount of modellable information not captured by the model.

The fitted linear model already has both the residuals and the fitted values stored on it.

temp_data = pd.DataFrame(dict(fitted_values=lin_model.fittedvalues, residual=lin_model.resid))
graph = sns.lmplot(x='fitted_values', y='residual', data=temp_data, lowess=True, line_kws={'color': 'red'})
plt.show()
Linearity: Fitted values vs. Residual plot

We can observe a convex pattern in residual, which our linear model does not capture. It is in line with our initial observation that the relationship between mpg and horsepower has some convexity.

Endogeneity: This check is similar to the linearity check, except that the residuals are plotted against the independent variable (instead of the fitted values); there should be no relationship between them.

temp_data = pd.DataFrame(dict(horsepower=required_df.horsepower, residual=lin_model.resid))
graph = sns.lmplot(x='horsepower', y='residual', data=temp_data, lowess=True, line_kws={'color': 'red'})
plt.show()
Endogeneity: Horsepower vs. residual

We can observe a violation similar to the one in the linearity check. This indicates that the estimated parameters may not be unbiased estimates of the actual parameters; hence our model is biased.

Normality: Violating the normality assumption does little harm to model accuracy, but it strongly affects hypothesis testing. P-values can be under- or over-estimated if the residuals deviate too much from a normal distribution. We can use a QQ-plot to check the normality of the residuals.

import statsmodels.api as sm
from scipy import stats
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
sm.qqplot(lin_model.resid, dist=stats.t, fit=True, line='45', ax=ax[0])
sns.distplot(lin_model.get_influence().resid_studentized_internal, ax=ax[1])
plt.show()
Normality: QQ-plot

The residuals are very close to normally distributed, with minor deviations in the tails: the left tail is thinner than that of a normal distribution, while the right tail is slightly thicker.

Homoskedasticity: If the residuals show heteroskedastic behavior, the linear model may effectively give more weight to some data points than to others while computing the best-fit line.

Heteroskedasticity is checked using a Scale-Location plot. For homoskedasticity, the trend line of this plot should be horizontal.

We can also observe heteroskedasticity in the residual plots above: the funnel-like distribution shows much more variance in the residuals towards the right of the plot than towards the left.

import numpy as np
sqrt_std_residual = np.sqrt(np.abs(lin_model.get_influence().resid_studentized_internal))
temp_data = pd.DataFrame(dict(fitted_values=lin_model.fittedvalues,
                              sqrt_std_residual=sqrt_std_residual))
graph = sns.lmplot(x='fitted_values', y='sqrt_std_residual', data=temp_data, lowess=True, line_kws={'color': 'red'})
plt.show()
Heteroskedasticity

The trend line is not flat; hence the residuals are heteroskedastic in nature.
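The article relies on this visual check; as an optional complement (not part of the original discussion), a formal Breusch-Pagan test could be run on the same residuals, roughly as follows:

from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan test: a small p-value suggests the residual variance depends on the regressors
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(lin_model.resid, lin_model.model.exog)
print(lm_pvalue, f_pvalue)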

Outliers and High Leverage: In linear regression our focus is often too much on outliers. In reality, high-leverage points that are also outliers are much more dangerous than plain outliers. High-leverage points are data points with extreme values of the independent variables (X).

We want to make sure that no single point in the dataset has an outsized impact on our model. Such points are a problem if they are exceptional cases or simply incorrect, as they pull the model away from the true relationship.

High-influence points can be found using the influence plot from the statsmodels.api module, or by looking directly at Cook’s distance.

fig, ax = plt.subplots(figsize=(10, 6))
sm.graphics.influence_plot(lin_model, alpha=0.05, ax=ax, criterion="cooks")
plt.show()
Influence plot: Outliers and High Leverage

In the above plot, the y-axis measures how much of an outlier each point is, in standard deviations of the studentized residuals; the general convention is to consider points more than three standard deviations above or below zero as outliers. The x-axis shows the leverage of each point, and the size of each bubble is the influence of that point on the model according to Cook’s distance.

The rule of thumb is that data points with a Cook’s distance greater than 4/(number of data points) are points of concern.

influence_summary = lin_model.get_influence().summary_frame()   # per-observation influence measures (defined in the notebook)

cooks_threshold = 4 / len(required_df.index)   # rule-of-thumb threshold: 4 / number of data points
df = influence_summary[influence_summary.cooks_d > cooks_threshold][['cooks_d']]
df.style.bar(subset=['cooks_d'], color='Red')
Cook’s distance for data points breaching threshold

We should look at each of these data points and decide whether they are correct and should remain part of our model. Another commonly used metric for finding influential points is DFFITS.
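For completeness, DFFITS values are available from the same influence summary frame used above; the sketch below, including the common 2*sqrt(k/n) rule-of-thumb threshold, is illustrative and not taken from the original notebook:

# DFFITS from the influence summary; flag points above the 2*sqrt(k/n) threshold
dffits = influence_summary['dffits']
n, k = int(lin_model.nobs), int(lin_model.df_model) + 1
print(dffits[np.abs(dffits) > 2 * np.sqrt(k / n)])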

With this, we have covered all the diagnostic plots and observed how our current model fails the assumptions of linear regression. In the next section, we will briefly cover how to fix these issues intuitively so as to better capture the relationship between mpg and horsepower.

Resolving issues from Diagnostic Plots

To capture the convex relationship that we observed in the residuals vs. fitted plot, we can add a polynomial term of degree two and check whether it captures the convexity.

quad_model = smf.ols("mpg ~ horsepower + np.square(horsepower)",
                     data=auto_dataset[['mpg', 'horsepower']].reset_index()).fit()
quad_model.summary()

From the summary statistics we can observe that the p-value for the squared term is close to zero, so the term is significant. We can also see that the accuracy of our model has increased: the R-squared is now ~68%, 8 percentage points higher than before.

Linearity: fitted vs. residual

We can observe that the trend line has become closer to horizontal. However, we still see a funnel-like structure in the residuals, with high dispersion towards the right end, indicating the presence of heteroskedasticity.

When the funnel opens towards the right, we can transform the dependent variable with a concave function. If the funnel opens towards the left, we try transforming the dependent variable with a convex function.

In the current scenario we need a concave transformation, and the best starting point is a log transformation.

log_quad_model = smf.ols("np.log(mpg) ~ horsepower + np.square(horsepower)",
                         data=auto_dataset[['mpg', 'horsepower']].reset_index()).fit()
log_quad_model.summary()
Regression summary

The above regression summary statistics indicate that transforming the dependent variable increases the model accuracy, as measured by R-squared, by roughly another 4.5 percentage points.

For simplicity of discussion, we simply remove all the influential points that breach the Cook’s distance threshold.

# influence_summary is assumed to be recomputed on the current log-quadratic model, and its index
# to align with auto_dataset, as in the accompanying notebook
influence_summary = log_quad_model.get_influence().summary_frame()
cooks_distance = influence_summary[influence_summary.cooks_d > 4 / len(influence_summary.index)][['cooks_d']]

auto_dataset_removed_outliers = auto_dataset.drop(cooks_distance.index)
log_quad_model = smf.ols("np.log(mpg) ~ horsepower + np.square(horsepower)",
                         data=auto_dataset_removed_outliers[['mpg', 'horsepower']].reset_index()).fit()
log_quad_model.summary()

After removing the influential points, the accuracy of the model increases by a further ~4 percentage points, as indicated by the R-squared.

Diagnostic plots on final model

Diagnostic Plots for Final Model
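The final-model diagnostic plots appear as a figure in the original article; as a minimal sketch, the fitted-vs-residual panel for log_quad_model can be regenerated as below (the other panels follow the same recipes used earlier):

temp_data = pd.DataFrame(dict(fitted_values=log_quad_model.fittedvalues,
                              residual=log_quad_model.resid))
sns.lmplot(x='fitted_values', y='residual', data=temp_data, lowess=True, line_kws={'color': 'red'})
plt.show()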

We can observe from the above diagnostic plots that our final model resolves all the issues with assumptions of linear regression. In the process of resolving these issues, we have also redefined the relationship between mpg and horsepower.

INITIAL PROPOSED MODEL: mpg = β0 + β1·horsepower + ε

FINAL DERIVED MODEL: log(mpg) = β0 + β1·horsepower + β2·horsepower² + ε

Final Note

In this article I have tried to scratch the surface of linear regression. Linear regression remains one of the most powerful tools in analytics, so before jumping to a fancier model, make sure you dive deep into linear regression. As shown above, following a structured approach while analysing the results of linear regression can help capture much more complex relationships.

Please comment or reach out to me with feedback, or with any particular topics you would like me to cover in upcoming articles.

Thanks for taking the time to read this article, and happy learning.

Special thanks to Nimeshana for helping me with various parts of this article.

Jupyter notebook link: https://github.com/quantsbin/ISLRpybin/tree/main/islrpybin/notebooks/chapter3
