Multiple Linear Regression with Backward Elimination Method

Yash gupta
7 min read · Jan 23, 2021

In this article, we will talk about multiple linear regression and the backward elimination method.

Backward elimination is a technique that helps us improve our multiple linear regression model.

As we all know, simple linear regression models the relationship between one independent variable and one dependent variable.

The simple linear regression equation is: y = b0 + b1x1, where b0 is the constant (intercept) and b1 is the coefficient of the single independent variable x1.

Multiple Linear Regression is a type of regression where the model depends on several independent variables (instead of only one, as in the case of simple linear regression).

The multiple linear regression equation is: y = b0 + b1x1 + b2x2 + ... + bnxn.

Multiple Linear Regression has several techniques to build an effective model namely:

  • All-in
  • Backward Elimination
  • Forward Selection
  • Bidirectional Elimination

Firstly, we will implement multiple linear regression without the backward elimination method.

Let’s take an example: consider four independent variables (R&D spend, Administration spend, Marketing spend, and State, a categorical variable) and one dependent variable (Profit).

The first step is to convert the ‘State’ column into dummy variables. We have 3 categories in the ‘State’ column; we convert it into 2 binary columns instead of 3 to avoid the dummy variable trap.

Encoding categorical data:

We have one categorical variable (State), which cannot be fed to the model directly, so we will encode it. To encode the categorical variable into numbers, we could use the LabelEncoder class. But that alone is not sufficient, because the resulting integers imply a relational order that may mislead the model. To remove this problem, we use OneHotEncoder, which creates the dummy variables.

Below is the code:
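The article's original code screenshot isn't reproduced here, so below is a minimal sketch. Since the dataset itself isn't included, the DataFrame is a small synthetic stand-in with the columns described above. Note that modern scikit-learn's OneHotEncoder handles string categories directly, so a separate LabelEncoder step is no longer required; `drop="first"` keeps 2 of the 3 dummy columns to avoid the dummy variable trap.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the dataset described in the article
rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 170_000, n),
    "Administration": rng.uniform(50_000, 180_000, n),
    "Marketing Spend": rng.uniform(0, 470_000, n),
    "State": rng.choice(["New York", "California", "Florida"], n),
})
df["Profit"] = 50_000 + 0.8 * df["R&D Spend"] + rng.normal(0, 5_000, n)

# One-hot encode 'State'; drop='first' avoids the dummy variable trap
ct = ColumnTransformer(
    [("state", OneHotEncoder(drop="first"), ["State"])],
    remainder="passthrough",
)
X = ct.fit_transform(df.drop(columns="Profit"))  # 2 dummies + 3 numeric
y = df["Profit"].values
```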

Now, splitting our dataset into training and testing set:-

Fitting our Multiple linear regression:

Calculating training and testing score:
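Putting the split/fit/score steps together, here is a hedged sketch, again on synthetic stand-in data (the exact scores reported in the article come from its own dataset and will not be reproduced by this toy data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: 2 state dummies + R&D, Administration, Marketing spend
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([
    rng.integers(0, 2, n), rng.integers(0, 2, n),   # dummy variables
    rng.uniform(0, 170_000, n),                     # R&D spend
    rng.uniform(50_000, 180_000, n),                # Administration spend
    rng.uniform(0, 470_000, n),                     # Marketing spend
]).astype(float)
y = 50_000 + 0.8 * X[:, 2] + rng.normal(0, 5_000, n)

# Split, fit, and score; for regression, score() returns R^2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)
print(regressor.score(X_train, y_train), regressor.score(X_test, y_test))
```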

Above are the train and test scores for multiple linear regression.

These scores tell us that our model achieves an R² of about 0.95 on the training dataset and about 0.93 on the test dataset (for regression, score() returns R², not classification accuracy).

The difference between the two scores is 0.0154.

Backward Elimination:

Now, we will implement multiple linear regression using the backward elimination technique.
Below are some main steps which are used to apply backward elimination process:

Step-1: Firstly, we need to select a significance level to stay in the model (e.g., SL = 0.05).

Step-2: Fit the complete model with all possible predictors/independent variables.

Step-3: Consider the predictor with the highest P-value:

  1. If P-value > SL, go to step 4.
  2. Otherwise, finish; our model is ready.

Step-4: Remove that predictor.

Step-5: Rebuild and fit the model with the remaining variables.

Take the above dataset of four independent variables (R&D spend, Administration spend, Marketing spend, and State, the categorical variable) and one dependent variable (Profit).

But that model is not optimal, as we have included all the independent variables and do not know which independent variable affects the prediction the most and which the least.

Unnecessary features increase the complexity of the model. Hence it is good to keep only the most significant features and keep our model simple to get a better result.

Step 1: Preparation for Backward Elimination:

Importing the library: Firstly, we need to import the statsmodels library (“statsmodels.api”), which is used for the estimation of various statistical models such as OLS (Ordinary Least Squares).

import statsmodels.api as sm

  • Adding a column to the matrix of features: As we can see in the MLR equation above, there is a constant term b0, but no corresponding column in our matrix of features, so we need to add one manually. We will add a column of values x0 = 1 associated with the constant term b0.

X_constant = sm.add_constant(X)

As we can see in the output, the first column has been added successfully; it corresponds to the constant term of the MLR equation.

Step 2:

  • Now, we will actually apply the backward elimination process. Firstly, we create a new feature matrix x_opt, which will end up containing only the set of independent features that significantly affect the dependent variable.
  • Next, as per the backward elimination process, we need to choose a significance level (SL = 0.05) and then fit the model with all possible predictors. For fitting the model, we create a regressor_OLS object of the OLS class from the statsmodels library, and then fit it using the fit() method.
  • Next, we need the p-values to compare against the SL value, so we use the summary() method to get a summary table of all the values.

Below is the code:

By executing the above lines of code, we will get a summary table of coefficients and their p-values.

In this summary, we can clearly see the p-values of all the variables. Here x1 and x2 are the dummy variables, x3 is R&D spend, x4 is Administration spend, and x5 is Marketing spend.

From the table, we pick the highest p-value, which is x1’s at 0.953. Since this p-value is greater than the SL value, we will remove the x1 variable (a dummy variable) from the model and refit it.

Below is the code for it:

As we can see in the output, five variables now remain. Among these, the highest p-value is 0.961, so we will remove that variable in the next iteration.

  • The highest p-value, 0.961, belongs to the x1 variable (the remaining dummy variable, renumbered after the first removal). So we will remove it and refit the model.

Below is the code for it:

In the above output, we can see the dummy variable (x2) has been removed. The next highest p-value is 0.602, which is still greater than 0.05, so we need to remove that variable too.

  • Now we will remove the Administration spend, which has a p-value of 0.602, and again refit the model.

As we can see in the output, the Administration spend variable has been removed. But one variable with a high p-value is still left: Marketing spend (0.60). So we need to remove it as well.

  • Finally, we will remove the Marketing spend variable, whose p-value of 0.60 is above the significance level.

Below is the code for it:

As we can see in the output, only two columns are left: the constant and R&D spend. So R&D spend is the only independent variable that is significant for the prediction, and we can now predict efficiently using it.

Now, we use only one independent variable and fit our regression model.

After applying the linear regression model with only the R&D independent variable:
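A sketch of this final model on synthetic stand-in data (the 94%/94% scores quoted below come from the article's own dataset and will not be reproduced here):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: R&D spend is the only predictor kept
rng = np.random.default_rng(0)
n = 50
rd = rng.uniform(0, 170_000, n)
y = 50_000 + 0.8 * rd + rng.normal(0, 5_000, n)

X = rd.reshape(-1, 1)  # single-feature matrix for sklearn
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)
print(regressor.score(X_train, y_train), regressor.score(X_test, y_test))
```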

As we can see, the training score is about 94% and the test score is also about 94% (R²). The difference between the two scores is 0.00149, close to (and even smaller than) the previous gap of 0.0154, where we had included all the variables.

We got this result using only one independent variable (R&D spend) instead of four. Hence, our model is now simpler and just as accurate.

That’s all on multiple regression and the backward elimination technique. I hope you like it :)


Yash gupta

A Machine Learning enthusiast who loves to work with data and has a great interest in statistics.