
Linear Regression Single Variable in Python

In this article, we will look at what simple linear regression is and the math behind it, and then implement simple linear regression hands-on in Python.

 

Let’s see how to predict home prices using a machine learning technique called simple linear regression. By simple, we mean single-variable regression. Regression is a supervised learning technique used for making predictions.

 

Here, single variable means there is only one independent variable (we will see what an independent variable is later). An example is predicting the price of a house based on its area. Given below is a sample of data showing the area of each house and its corresponding price, for houses in Monroe, New Jersey, USA.

 

Area (sq.ft.)    Price ($)
2600             550000
3000             565000
3200             610000
3600             680000
4000             725000

 

Introduction to Linear Regression

Using the data in the above table, we will build a model (a function) that can tell us the prices of homes whose area is 3300 sq.ft. and 5000 sq.ft. We can plot the data as a scatter plot (we will see the Python code later), as shown below. Download the dataset.

 

Area vs Prices of Houses Linear Regression

 

The blue points in the above figure represent the price of a house for a given area. The black line, which passes through or near the data points, is the one that best fits the given data. It is also called the regression line. We will see how to get that line using Python code later.

 

Once we get the line, we can predict the price of a new house with a given area. For example, a house whose area is 3300 sq.ft. is predicted to cost about $628,715.

 

Prediction using Regression

 

Now you might be wondering how we got the black line (the regression line). The line clearly does not pass through all the data points, and there are a number of other lines we could draw, such as the green and violet ones shown below.

 

Different Regression Lines

 

To find the best-fit line, we calculate delta, the error between each actual data point and the value predicted by the line's linear equation (as shown in the figure below). We square the individual errors, sum them up, and try to minimize that sum. We repeat this process for different lines and pick the line that has the smallest sum of squared errors.
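The comparison of candidate lines described above can be sketched in a few lines of NumPy. This is only an illustration: the "flatter" and "steeper" lines are hypothetical values picked for the example, while the "fitted" slope and intercept are the ones reported by the model later in this article.

```python
import numpy as np

# Data from the table in this article
area = np.array([2600, 3000, 3200, 3600, 4000], dtype=float)
price = np.array([550000, 565000, 610000, 680000, 725000], dtype=float)

def sum_squared_errors(m, b):
    """Sum of squared differences between actual prices and the line m*x + b."""
    predicted = m * area + b
    return float(np.sum((price - predicted) ** 2))

# (slope, intercept) pairs; the first two are hypothetical, for illustration only
candidates = {
    "flatter": (100.0, 290000.0),
    "steeper": (170.0, 70000.0),
    "fitted": (135.78767123, 180616.43835616),
}

for name, (m, b) in candidates.items():
    print(name, round(sum_squared_errors(m, b)))
```

The fitted line should show the smallest sum of the three, which is what "best fit" means here.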

 

Squared Errors Regression

 

From your high school math, you might remember the equation of a line, which is shown below.

 

Line Equation: y = mx + b, where m is the slope and b is the intercept

 

In our house prices example, the variable y in the line equation is the price of the house and the variable x is the area of the house. Area is called the independent variable (generally plotted on the x-axis) and price is called the dependent variable (generally plotted on the y-axis), because we are calculating the price based on the area.
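For a single independent variable, trying out lines one by one is not actually necessary: the slope and intercept that minimize the sum of squared errors have a well-known closed form. The sketch below assumes the line is written as y = m*x + b (m for slope, b for intercept, the names used later in this article) and applies that formula to the five houses from the table. It is not necessarily how scikit-learn computes the fit internally.

```python
from statistics import mean

# Data from the table in this article
area = [2600, 3000, 3200, 3600, 4000]              # x, independent variable
price = [550000, 565000, 610000, 680000, 725000]   # y, dependent variable

x_bar = mean(area)
y_bar = mean(price)

# Slope: sum of cross-deviations divided by sum of squared deviations of x
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(area, price))
s_xx = sum((x - x_bar) ** 2 for x in area)
m = s_xy / s_xx

# Intercept: the best-fit line passes through the point (x_bar, y_bar)
b = y_bar - m * x_bar

print(m, b)
```

The printed values should agree with the coefficient and intercept that the fitted model reports in the next section.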

 

Linear Regression Single Variable in Python

Now we will see linear regression in action using Python. The hands-on part uses Jupyter Notebook. I hope you already have some experience with it and with the Python programming language. First, we will import the libraries we need.

 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

 

The NumPy and Pandas libraries are used to load the data and perform operations on it. The Matplotlib library is used to plot graphs. Scikit-learn (imported as sklearn) is used here to build our model. First, we will load our dataset using Pandas.

 

df = pd.read_csv("house_prices.csv")

 

Here, read_csv is a Pandas function that loads our dataset from the file house_prices.csv. It returns a DataFrame, which is stored in the variable df. Now, we will draw a scatter plot to see the distribution of our house price data.
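If the downloaded file is not at hand, the five rows from the table at the top of this article can be typed in directly. The sketch below assumes the CSV has a header row with the column names Area and Price, which is what the df.Area and df.Price accesses in the rest of the code imply; the actual file may be laid out differently.

```python
import io

import pandas as pd

# The five rows from the article's table, written as CSV text.
# The header names Area and Price are an assumption based on df.Area / df.Price.
csv_text = """Area,Price
2600,550000
3000,565000
3200,610000
3600,680000
4000,725000
"""

# read_csv accepts a file path or any file-like object, such as StringIO
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(df.head())
```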

 

plt.xlabel("Area(sq.ft.)")
plt.ylabel("Price($)")
plt.scatter(df.Area, df.Price)

 

If you run the above lines, you should see a scatter plot like the one below.

 

House Prices Scatter Plot

 

From the above figure, we can see that the distribution is approximately linear, so we can use linear regression. First, we need to create a LinearRegression object. Then we fit our data using the fit method. Fitting the data means training the linear regression model on the data available in our dataset.

 

reg = linear_model.LinearRegression()
reg.fit(df[['Area']], df.Price)

 

The first argument to the fit method is the independent variable and the second argument is the dependent variable. Once we execute the above code, the linear regression model is fitted to our data and is ready to predict the price of a house for a given area.
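Note the double brackets in df[['Area']]. Scikit-learn estimators expect the independent variables as a two-dimensional structure (rows are samples, columns are features), even when there is only one feature. A short sketch of the difference, rebuilding the same five rows in memory:

```python
import pandas as pd

# Same five rows as the house price table in this article
df = pd.DataFrame({
    "Area": [2600, 3000, 3200, 3600, 4000],
    "Price": [550000, 565000, 610000, 680000, 725000],
})

X = df[["Area"]]   # double brackets: a DataFrame with one column
y = df["Price"]    # single brackets: a one-dimensional Series

print(type(X).__name__, X.shape)   # DataFrame (5, 1)
print(type(y).__name__, y.shape)   # Series (5,)
```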

 

Now, let’s try to predict the price of a house with an area of 3300 sq.ft. The code for that is:

 

reg.predict([[3300]])

 

The predicted price is:

 

628715.75342466
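Depending on the scikit-learn version, passing a plain nested list to a model that was fitted on a DataFrame may print a warning about missing feature names. One way to avoid it, sketched here with the five rows from the table rebuilt in memory, is to pass a DataFrame that has the same column name. The variable name new_houses is only illustrative.

```python
import pandas as pd
from sklearn import linear_model

# Same five rows as the house price table in this article
df = pd.DataFrame({
    "Area": [2600, 3000, 3200, 3600, 4000],
    "Price": [550000, 565000, 610000, 680000, 725000],
})

reg = linear_model.LinearRegression()
reg.fit(df[["Area"]], df.Price)

# The column name matches the one used during fitting
new_houses = pd.DataFrame({"Area": [3300]})
predicted = reg.predict(new_houses)

# predict returns an array with one value per input row
print(predicted[0])
```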

 

Now let’s look at how linear regression was able to predict the above value. While fitting the data, linear regression calculates the coefficient (the slope) and the intercept, i.e., m and b.

 

We can see the value of m, the coefficient, by writing reg.coef_, which gives 135.78767123. Similarly, we can see the value of b, the intercept, by writing reg.intercept_, which gives 180616.43835616432. You can verify that 135.78767123 * 3300 + 180616.43835616432 gives approximately 628715.75342466.

 

Let’s try to predict the house price for 5000 sq.ft. by writing reg.predict([[5000]]). We will get the predicted price as 859554.79452055.
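Both predictions can be reproduced with plain arithmetic from the coefficient and intercept reported above. This is a minimal sketch with the values copied by hand; predict_price is a hypothetical helper, not part of scikit-learn.

```python
# Values reported by reg.coef_ and reg.intercept_ in this article
m = 135.78767123
b = 180616.43835616432

def predict_price(area_sqft):
    """Apply the line equation y = m*x + b."""
    return m * area_sqft + b

for area_sqft in (3300, 5000):
    print(area_sqft, round(predict_price(area_sqft), 2))
```

Because m is copied here with only eight decimal places, the results can differ from the model's output in the last few digits.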

 

Now let’s see the linear regression line on our scatter plot. For that, we need to add one more line to the plotting code, as shown below.

 

plt.xlabel("Area(sq.ft.)")
plt.ylabel("Price($)")
plt.scatter(df.Area, df.Price)
plt.plot(df.Area, reg.predict(df[['Area']]), color='blue')

 

When we execute the above piece of code, we can see the regression line as shown below.

 

Linear Regression Single Variable in Python

 

Exercise on Linear Regression with Single Variable

So, let’s try out an exercise. First download Canada’s per capita income dataset. The data in the CSV file is as shown below.

 

year    pci
1970    3399.299037
1971    3768.297935
1972    4251.175484
1973    4804.463248
1974    5576.514583
1975    5998.144346
1976    7062.131392
1977    7100.12617
1978    7247.967035
1979    7602.912681
1980    8355.96812
1981    9434.390652
1982    9619.438377
1983    10416.53659
1984    10790.32872
1985    11018.95585
1986    11482.89153
1987    12974.80662
1988    15080.28345
1989    16426.72548
1990    16838.6732
1991    17266.09769
1992    16412.08309
1993    15875.58673
1994    15755.82027
1995    16369.31725
1996    16699.82668
1997    17310.75775
1998    16622.67187
1999    17581.02414
2000    18987.38241
2001    18601.39724
2002    19232.17556
2003    22739.42628
2004    25719.14715
2005    29198.05569
2006    32738.2629
2007    36144.48122
2008    37446.48609
2009    32755.17682
2010    38420.52289
2011    42334.71121
2012    42665.25597
2013    42676.46837
2014    41039.8936
2015    35175.18898
2016    34229.19363

 

In the above dataset, we have data up to the year 2016. So, let’s use it to predict the per capita income for the year 2021. The code and the scatter plot with the linear regression line are shown below.

 

df1 = pd.read_csv("canada_pci.csv")

plt.xlabel("Year")
plt.ylabel("Per Capita Income")
plt.scatter(df1.year, df1.pci)

reg1 = linear_model.LinearRegression()
reg1.fit(df1[['year']], df1.pci)

reg1.predict([[2021]])

plt.xlabel("Year")
plt.ylabel("Per Capita Income")
plt.scatter(df1.year, df1.pci)
plt.plot(df1.year, reg1.predict(df1[['year']]), color='red')

 

Canada's Per Capita Income Regression

 

The result of reg1.predict([[2021]]) is 42117.15916964.

 

That’s it for this tutorial on single-variable linear regression in Python. If you have any questions or corrections, please comment below.

 


Suryateja Pericherla

Suryateja Pericherla, at present is a Research Scholar (full-time Ph.D.) in the Dept. of Computer Science & Systems Engineering at Andhra University, Visakhapatnam. Previously worked as an Associate Professor in the Dept. of CSE at Vishnu Institute of Technology, India.

He has 11+ years of teaching experience and is an individual researcher whose research interests are Cloud Computing, Internet of Things, Computer Security, Network Security and Blockchain.

He is a member of professional societies like IEEE, ACM, CSI and ISCA. He published several research papers which are indexed by SCIE, WoS, Scopus, Springer and others.
