Multiple Linear Regression In Machine Learning

In this article, we will learn about Multiple Linear Regression in Machine Learning.

If you are new to Machine Learning, you can follow this blog for a basic understanding of the fundamentals.

What is Multiple Linear Regression?

->  A multiple linear regression model is able to analyze the relationship between several independent variables and a single dependent variable.

->  Multiple Linear Regression models the relationship between two or more features and a response by fitting a linear equation to the observed data. The steps to perform multiple linear regression are similar to those of simple linear regression.

->  Multiple regression has numerous real-world applications in three problem domains: examining relationships between variables, making numerical predictions, and time series forecasting.

Equation : Y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + … + bn * xn

Here, Y = Dependent variable, x1, x2, x3, …, xn = Independent variables, b0 = Constant (intercept), and b1, b2, …, bn = Coefficients of the independent variables
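
To make the equation concrete, here is a small, purely illustrative calculation (the coefficient values below are made up, not taken from any fitted model):

b0, b1, b2 = 50000, 0.8, 0.05   # illustrative intercept and coefficients
x1, x2 = 100000, 50000          # e.g. R&D spend and marketing spend
Y = b0 + b1 * x1 + b2 * x2      # predicted profit
print(Y)

# OUTPUT
132500.0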

Assumptions for Multiple Linear Regression

  • Linearity: The relationship between dependent and independent variables should be linear.
  • Homoscedasticity: Constant variance of the errors should be maintained.
  • Multivariate normality: the regression residuals must be normally distributed (a quick check for this and for homoscedasticity is sketched after this list).
  • Lack of multicollinearity: the independent variables should not be highly correlated with each other.
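
A minimal sketch of checking the residual assumptions, assuming a fitted model named regressor and test arrays X_test and y_test as created in the steps below:

from scipy.stats import shapiro
import matplotlib.pyplot as plt

residuals = y_test - regressor.predict(X_test)     # prediction errors on unseen data
print(shapiro(residuals))                          # p > 0.05 suggests normally distributed residuals
plt.scatter(regressor.predict(X_test), residuals)  # a shapeless, even cloud suggests homoscedasticity
plt.axhline(0, color='red')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.show()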

Dummy Variable:

In statistics and econometrics, particularly in regression analysis, a dummy variable is one that takes only the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.
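
For example, pandas can build dummy variables from a small, made-up State column with pd.get_dummies (in the walkthrough below we will use scikit-learn's OneHotEncoder instead, which does the same job inside a preprocessing pipeline):

import pandas as pd

df = pd.DataFrame({'State': ['New York', 'California', 'Florida']})
print(pd.get_dummies(df['State'], dtype=int))

# OUTPUT
   California  Florida  New York
0           0        0         1
1           1        0         0
2           0        1         0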

Let’s start to implement one example for Multiple Linear Regression.

We have a dataset of 50 start-up companies. This dataset contains five columns: R&D Spend, Administration Spend, Marketing Spend, State, and Profit for a financial year.

Our goal is to create a model that can predict a company's profit and tell us which factor affects the profit of a company the most.

Step 1)  Importing the libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Step 2 ) Importing the dataset

dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# OUTPUT
[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California'] ... ... ... ]

Step 3 ) Encoding categorical data

  • We have one categorical variable (State), which cannot be fed to the model directly, so we will encode it. To encode the categorical variable into numbers, we will use the OneHotEncoder class.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

# OUTPUT
[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53] ... ... ... ]
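
As a side note (a variation on the step above, not part of the original walkthrough): keeping all three dummy columns creates the dummy variable trap, because any one column is fully determined by the other two. OneHotEncoder can drop one column per category to avoid this, and in practice scikit-learn's LinearRegression still produces sensible predictions either way.

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(drop='first'), [3])], remainder='passthrough')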

Step 4 ) Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Step 5 ) Training the Multiple Linear Regression model on the Training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
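
Since our goal also includes finding the most influential factor, we can inspect the fitted parameters; coef_ and intercept_ are standard attributes of scikit-learn's LinearRegression, and the coefficient order matches the encoded feature columns from Step 3:

print(regressor.intercept_)   # the constant b0
print(regressor.coef_)        # one coefficient per feature (b1 ... bn)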

Step 6 ) Predicting the Test set results

y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

# OUTPUT
[[103015.2  103282.38]
 [132582.28 144259.4 ] ... ... ... ]
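
To put a number on how good these predictions are (an extra check, not part of the original steps), we can compute the R² score on the test set:

from sklearn.metrics import r2_score

print(r2_score(y_test, y_pred))   # 1.0 would be a perfect fit; values near 1 indicate a good model
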
  • Most of the predictions are very close to the real profits, and the rest are still reasonably accurate.
  • From this we can say that multiple linear regression is well adapted to this dataset, even though the dataset does not necessarily contain perfect linear correlations. The LinearRegression class was able to fit the right coefficients to these features to make these predictions.
  • And even if you tune your linear regression model by, for example, applying backward elimination to keep only the most statistically significant features, you will get similar results; a sketch of backward elimination follows this list.
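
A minimal sketch of backward elimination using statsmodels (the 0.05 significance threshold is a common but arbitrary choice; if you kept all three dummy columns in Step 3, drop one first to avoid perfect multicollinearity):

import numpy as np
import statsmodels.api as sm

X_opt = sm.add_constant(X.astype(float))   # prepend the intercept column for b0
cols = list(range(X_opt.shape[1]))
while True:
    model = sm.OLS(y, X_opt[:, cols]).fit()
    worst = int(np.argmax(model.pvalues))  # position of the least significant feature
    if model.pvalues[worst] <= 0.05:
        break                              # every remaining feature is significant
    cols.pop(worst)                        # eliminate it and refit
print(model.summary())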

I hope the predictions of this multiple linear regression model impressed you. We will continue coding in the next blog. If you have any questions about this blog, please let me know in the comments.
