Linear Regression with Multiple Variables in Python - Machine Learning Tutorial

Suryateja Pericherla. Categories: Machine Learning.

In this article, we will look at linear regression with multiple variables, also called multiple regression, along with Python code.

Let’s use another dataset which is a variation of the dataset used in our simple linear regression article. Download the modified house price dataset. The data is as shown below.

area	bedrooms	age	price
2600	3	20	550000
3000	4	15	565000
3200		18	610000
3600	3	30	595000
4000	5	8	760000

Our new dataset includes two new columns: bedrooms (number of bedrooms) and age (age of the house in years). As you may already know, the price of a house does not depend only on its area.

So, we will try to predict the price of a house (dependent variable) based on number of bedrooms, age of house and area of the house (independent variables). Since there are multiple independent variables, this type of linear regression is called multiple regression.

We will try to predict the price of houses with the following properties:

3000 sq. ft. area, 3 bedrooms, 40 years old
2500 sq. ft. area, 4 bedrooms, 5 years old

Before going into the implementation of multiple regression, let’s have a look at the dataset. The number of bedrooms in row 3 is empty, and we have to do something about that. Such empty data cells are called missing values. We will also assume that there is a roughly linear relationship between the dependent variable (price) and the independent variables or features (area, bedrooms and age).

So, our linear equation with the three independent variables becomes:

price = m1*area + m2*bedrooms + m3*age + b

In the above equation price is the dependent variable; area, bedrooms and age are independent variables or features; m1, m2 and m3 are coefficients; b is the intercept.

Contents

1 Handling Missing Values
2 Linear Regression with Multiple Variables in Python
3 Exercise on Multiple Regression

Handling Missing Values

Let’s first import some useful libraries and load the dataset.

import pandas as pd
import numpy as np
from sklearn import linear_model

df = pd.read_csv("house_prices_mv.csv")

When you look at the data in the dataframe df, you will see that the bedrooms value in the third row is NaN (Not a Number). That means we have a missing value. A common way to fill in missing values is to use the mean or median of the rest of the values in that column. So we will write:

import math
median_bedrooms = math.floor(df.bedrooms.median())

Don’t worry about the math.floor call. The median of the known values (3, 4, 3 and 5) is 3.5, and math.floor rounds it down so that we get a whole number of bedrooms. So we will have 3 in the median_bedrooms variable. Now we will store this value in our dataframe by writing:

df.bedrooms = df.bedrooms.fillna(median_bedrooms)

The fillna method fills any missing values with the provided value. What we have just done is called data cleaning, which is a part of data preprocessing. Now our dataset is ready for predicting the house prices.
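The cleaning step above can be tried without the CSV file. The following is a minimal sketch in which an inline DataFrame stands in for house_prices_mv.csv (that substitution is an assumption, the values are copied from the table at the top of the article):

```python
import math

import pandas as pd

# Inline copy of the house price table; None marks the missing bedrooms value.
# This stands in for pd.read_csv("house_prices_mv.csv").
df = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000],
    "bedrooms": [3, 4, None, 3, 5],
    "age": [20, 15, 18, 30, 8],
    "price": [550000, 565000, 610000, 595000, 760000],
})

# Count the missing values per column before cleaning.
missing_before = df.isna().sum()

# The median of the known values (3, 4, 3, 5) is 3.5; floor gives 3.
median_bedrooms = math.floor(df["bedrooms"].median())
df["bedrooms"] = df["bedrooms"].fillna(median_bedrooms)

print(missing_before["bedrooms"])  # 1
print(median_bedrooms)             # 3
print(df["bedrooms"].tolist())     # [3.0, 4.0, 3.0, 3.0, 5.0]
```

Checking df.isna().sum() before and after filling is a quick way to confirm that no missing values remain.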

Linear Regression with Multiple Variables in Python

Now we will create an object of the LinearRegression class and call the fit method on that object, passing the three independent variables and the dependent variable price.

reg = linear_model.LinearRegression()
reg.fit(df[['area', 'bedrooms', 'age']], df.price)

We are done with training our multiple regression model. We can see the values of the coefficients (m1, m2 and m3) by writing reg.coef_ and the value of the intercept (b) by writing reg.intercept_. The coefficients are 137.25, -26025 and -6825, and the intercept is 383724.9999999998 (that is, 383725 up to floating-point rounding).
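To see which coefficient belongs to which feature, the values in reg.coef_ can be paired with the column names. This is a self-contained sketch that assumes the cleaned table is typed in by hand instead of being loaded from the CSV file; the features list is a helper name introduced here for illustration.

```python
import pandas as pd
from sklearn import linear_model

# Inline copy of the cleaned table (the missing bedrooms value is already 3).
df = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000],
    "bedrooms": [3, 4, 3, 3, 5],
    "age": [20, 15, 18, 30, 8],
    "price": [550000, 565000, 610000, 595000, 760000],
})

features = ["area", "bedrooms", "age"]
reg = linear_model.LinearRegression()
reg.fit(df[features], df["price"])

# reg.coef_ holds one coefficient per feature, in the order of `features`.
for name, coef in zip(features, reg.coef_):
    print(f"{name}: {coef:.2f}")
print(f"intercept: {reg.intercept_:.2f}")
```

The coefficients come back in the same order as the columns passed to fit, so the first value is m1 (area), the second is m2 (bedrooms) and the third is m3 (age).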

Now let’s try to predict the price of a house with 3000 sq. ft. area, 3 bedrooms, and 40 years of age by writing:

reg.predict([[3000, 3, 40]])

The price will be given as 444400. Similarly let’s try to predict the price of house with 2500 sq. ft. area, 4 bedrooms and 5 years of age by writing:

reg.predict([[2500, 4, 5]])

The price will be given as 588625. Now let’s do an exercise on multiple regression.
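Both predictions can be checked by hand against the equation price = m1*area + m2*bedrooms + m3*age + b. The sketch below plugs in the coefficients reported above, with the intercept rounded to 383725; predict_price is a hypothetical helper name, not part of scikit-learn.

```python
# Coefficients and intercept reported by the trained model
# (intercept rounded from 383724.9999999998 to 383725).
m1, m2, m3 = 137.25, -26025.0, -6825.0
b = 383725.0


def predict_price(area, bedrooms, age):
    """Evaluate price = m1*area + m2*bedrooms + m3*age + b."""
    return m1 * area + m2 * bedrooms + m3 * age + b


print(predict_price(3000, 3, 40))  # 444400.0
print(predict_price(2500, 4, 5))   # 588625.0
```

This is all that reg.predict does for a linear model: a weighted sum of the features plus the intercept.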

Exercise on Multiple Regression

Let’s try a new dataset which contains the details of employees in a company. Download the hiring dataset. The dataset contains four columns. The original columns in the dataset have been renamed as given below.

experience of an employee – experience
test score (out of 10) – score
interview score (out of 10) – int_score
employee salary ($) – salary

The employee hiring dataset is as shown below.

experience	score	int_score	salary
	8	9	50000
	8	6	45000
five	6	7	60000
two	10	10	65000
seven	9	6	70000
three	7	10	62000
ten		7	72000
eleven	7	8	80000

The HR department of the company wants to predict how much salary (dependent variable) to offer to a person attending an interview, based on the years of experience, test score and interview score (independent variables).

As we can see from the above table, we have to do a bit of data preprocessing here. As linear regression works only with numbers, we need to convert the words in the experience column to numbers. For this we will use a Python library called word2number, which can be installed from PyPI with pip install word2number.

We will treat the missing values in experience column as zeros and then convert the words into numbers by writing the following code.

from word2number import w2n

df1 = pd.read_csv("hiring.csv")
df1.experience = df1.experience.fillna("zero")

Now let’s convert the strings in experience column to numbers using the following code.

exp = []
for i in range(0,len(df1.experience)):
    exp.append(w2n.word_to_num(df1.experience[i]))
df1.experience = exp
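If installing word2number is not an option, the same conversion can be sketched with a small hand-written lookup table. This is an alternative to w2n.word_to_num, not a description of how the library works. WORD_TO_NUMBER and words_to_years are hypothetical names, and the table only covers zero to twelve, which is enough for this dataset.

```python
import math

# Hypothetical lookup table; it covers only the words needed for this dataset.
WORD_TO_NUMBER = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
    "ten": 10, "eleven": 11, "twelve": 12,
}


def words_to_years(value):
    """Return the integer for a number word; treat a missing cell as 0."""
    # Empty CSV cells arrive as None or float NaN, not as strings.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0
    return WORD_TO_NUMBER[value.strip().lower()]


# The experience column of the hiring table, with None for the empty cells.
raw_experience = [None, None, "five", "two", "seven", "three", "ten", "eleven"]
experience = [words_to_years(value) for value in raw_experience]
print(experience)  # [0, 0, 5, 2, 7, 3, 10, 11]
```

A word that is not in the table raises a KeyError, which is usually preferable to silently turning an unknown value into a number.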

Now we will fill the missing value in the score column with the median of the values in that column, as we did for the bedrooms column above. The code for that is:

df1.score = df1.score.fillna(df1.score.median())

The median for the data in score column is 8. Now our dataset is ready. To train our model the code is as given below:

reg1 = linear_model.LinearRegression()
reg1.fit(df1[['experience', 'score', 'int_score']], df1.salary)

Now, let’s test our model with the following data:

2 yrs experience, 9 test score, 6 interview score
12 yrs experience, 10 test score, 10 interview score

The code for predicting the above two cases is as given below.

reg1.predict([[2,9,6]])

53205.96797671

reg1.predict([[12,10,10]])

92002.18340611
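The whole exercise can also be reproduced without the CSV file or word2number. The sketch below assumes the preprocessed table is typed in by hand (experience already numeric, the missing test score already replaced by 8). It passes the test cases as a DataFrame with the training column names, because recent scikit-learn versions may warn about missing feature names when a model fitted on a DataFrame receives a plain list. The names features and candidates are introduced here for illustration.

```python
import pandas as pd
from sklearn import linear_model

# Inline copy of the hiring table after preprocessing: experience converted
# to numbers (missing -> 0), missing test score filled with the median 8.
df1 = pd.DataFrame({
    "experience": [0, 0, 5, 2, 7, 3, 10, 11],
    "score": [8, 8, 6, 10, 9, 7, 8, 7],
    "int_score": [9, 6, 7, 10, 6, 10, 7, 8],
    "salary": [50000, 45000, 60000, 65000, 70000, 62000, 72000, 80000],
})

features = ["experience", "score", "int_score"]
reg1 = linear_model.LinearRegression()
reg1.fit(df1[features], df1["salary"])

# Both test cases in one DataFrame, with the same column names as in training.
candidates = pd.DataFrame([[2, 9, 6], [12, 10, 10]], columns=features)
predicted = reg1.predict(candidates)
print([round(float(salary), 2) for salary in predicted])  # [53205.97, 92002.18]
```

predict accepts several rows at once, so both candidates are scored in a single call.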

That’s it for this tutorial on linear regression with multiple variables (multiple regression). If you have any questions regarding the tutorial, please comment below.

For more information, visit the following links:

Machine Learning Tutorial Python – 3: Linear Regression Multiple Variables

Suryateja Pericherla

Suryateja Pericherla is at present a Research Scholar (full-time Ph.D.) in the Dept. of Computer Science & Systems Engineering at Andhra University, Visakhapatnam. He previously worked as an Associate Professor in the Dept. of CSE at Vishnu Institute of Technology, India.

He has 11+ years of teaching experience and is an individual researcher whose research interests are Cloud Computing, Internet of Things, Computer Security, Network Security and Blockchain.

He is a member of professional societies like IEEE, ACM, CSI and ISCA. He has published several research papers which are indexed by SCIE, WoS, Scopus, Springer and others.

2 Comments

Emmanuel

from word2number import w2n

exp = [ ]
for i in range(0, len(df.experience)):
    exp.append(w2n.word_to_num(df.experience[i]))
df.experience = exp

ValueError: Type of input is not string! Please enter a valid number word (eg. ‘two million twenty three thousand and forty nine’)

please help me on how to resolve this error.

Suryateja Pericherla

From what I can understand from your code, I think the problem is with the values in “experience” that you are passing to “word_to_num”. Check that.
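One likely cause, offered as an assumption since the commenter's data is not shown: the experience column still contains missing values when the loop runs. Empty cells are read as missing values (NaN) rather than as strings, and word_to_num rejects any input that is not a string. The sketch below uses pandas only and a hand-typed Series as a stand-in for the column; it shows how to find such entries and how filling them with the word "zero" first removes the problem.

```python
import pandas as pd

# Stand-in for the experience column before any fillna call;
# None marks the two empty cells.
experience = pd.Series(
    [None, None, "five", "two", "seven", "three", "ten", "eleven"]
)

# Flag every entry that is not a Python string.
is_not_string = experience.map(lambda value: not isinstance(value, str))
bad_rows = experience[is_not_string].index.tolist()
print(bad_rows)  # [0, 1]

# Filling the gaps with the word "zero" makes every entry a string.
cleaned = experience.fillna("zero")
all_strings = bool(cleaned.map(lambda value: isinstance(value, str)).all())
print(all_strings)  # True
```

If bad_rows is not empty, running the fillna("zero") step from the tutorial before the conversion loop should resolve the error.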