Simple Linear Regression in Python

What is Simple Linear Regression?

In statistics, simple linear regression is a linear regression model with a single explanatory variable. In simple linear regression, we predict scores on one variable based on the values of another. The criterion variable Y is the variable we are predicting, and the predictor variable X is the variable we use to make our predictions. The approach is called simple regression because there is only one predictor variable.

As a result, for two-dimensional sample points with one independent variable and one dependent variable, we find a linear function that predicts the values of the dependent variable as a function of the independent variable.

The graph below illustrates linear regression visually.

Equation: y = mx + c

This is the simple linear regression equation, where c is the constant (intercept) and m is the slope; it describes the relationship between x (the independent variable) and y (the dependent variable). The coefficient m can be positive or negative and is the amount of change in the dependent variable for every 1-unit change in the independent variable.

In statistical notation, the same equation is written as y = β0 + β1x, where β0 (the y-intercept) and β1 (the slope) are the coefficients; their values are estimated so that the predicted values are as close as possible to the actual values.
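To make the equation concrete, here is a minimal sketch that applies y = mx + c by hand; the slope and intercept below are made-up, illustrative numbers, not values estimated from the dataset used later.

# Illustrative only: hypothetical slope (m) and intercept (c)
m = 9000.0   # assumed salary increase per extra year of experience
c = 26000.0  # assumed salary at zero years of experience
x = 5.0      # years of experience
y = m * x + c
print(y)     # 71000.0, the predicted salary for 5 years of experience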

Implement Simple Linear Regression in Python

In this example, we will use salary data relating employees' salaries to their years of experience. The dataset has two columns: YearsExperience and Salary.

  • Step 1: Import the required python packages

We need Pandas for data manipulation, NumPy for numerical calculations, and Matplotlib and Seaborn for visualizations. scikit-learn (sklearn) is used for the machine learning operations.

# Import libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

  • Step 2: Load the dataset

Download the dataset from here, upload it to your notebook, and read it into a pandas DataFrame.

# Get dataset

df_sal = pd.read_csv('/content/Salary_Data.csv')

df_sal.head()
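Before analysing the data, it can help to confirm that both columns loaded correctly and contain no missing values. The quick check below is an optional sketch using standard pandas methods.

# Optional sanity check on the loaded data
df_sal.info()                  # column names, dtypes, and non-null counts
print(df_sal.isnull().sum())   # missing values per column
print(df_sal.shape)            # (rows, columns)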

  • Step 3: Data analysis

Now that our data is ready, let's analyze it and understand its trend in detail. To do that, we can first describe the data:

# Describe data
df_sal.describe()

Here, we can see that Salary ranges from 37731 to 122391, with a median of 65237.

We can also see how the data is distributed visually using Seaborn's distplot:

# Data distribution

plt.title('Salary Distribution Plot')

sns.distplot(df_sal['Salary'])

plt.show()

A distplot, or distribution plot, shows the variation in the data distribution. It represents the data by combining a density curve with a histogram.
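Note that distplot is deprecated in recent Seaborn releases (0.11 and later); if you are on a newer version, an equivalent plot can be drawn with histplot, as in this sketch.

# Equivalent distribution plot on newer Seaborn versions
plt.title('Salary Distribution Plot')
sns.histplot(df_sal['Salary'], kde=True)  # histogram with a density curve
plt.show()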

Then we check the relationship between Salary and Experience –

# Relationship between Salary and Experience

plt.scatter(df_sal['YearsExperience'], df_sal['Salary'], color = 'lightcoral')

plt.title('Salary vs Experience')

plt.xlabel('Years of Experience')

plt.ylabel('Salary')

plt.box(False)

plt.show()

It is now visible that our data varies linearly: an individual's Salary increases as they gain Experience.
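To back up the visual impression with a number, you can optionally compute the Pearson correlation between the two columns; a value close to 1 indicates a strong positive linear relationship. This is a small sketch using pandas' built-in corr method.

# Optional: quantify the linear relationship
correlation = df_sal['YearsExperience'].corr(df_sal['Salary'])
print(f'Correlation between experience and salary: {correlation:.2f}')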

  • Step 4: Split the dataset into dependent/independent variables

Experience (X) is the independent variable, and Salary (y) depends on Experience.

# Splitting variables

X = df_sal.iloc[:, :1] # independent

y = df_sal.iloc[:, 1:] # dependent
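The iloc-based split above selects columns by position. If you prefer selecting by name, an equivalent (and arguably more readable) sketch is:

# Equivalent split using column names
X = df_sal[['YearsExperience']]  # DataFrame (2D), as expected by scikit-learn
y = df_sal['Salary']             # Series (1D) also works as the target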

  • Step 5: Split data into Train/Test sets

Further, split your data into training (80%) and test (20%) sets using train_test_split

# Splitting dataset into test/train

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
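If you want to confirm the 80/20 split worked as expected, a quick optional check of the resulting shapes looks like this (for example, a 30-row dataset would give 24 training rows and 6 test rows):

# Optional: verify the split sizes
print(X_train.shape, X_test.shape)  # roughly 80% / 20% of the rows
print(y_train.shape, y_test.shape)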

  • Step 6: Train the regression model

Pass the X_train and y_train data to the regressor's fit method (regressor.fit) to train the model on our training data.

# Regressor model

regressor = LinearRegression()

regressor.fit(X_train, y_train)
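Once fitted, the model also exposes a score method that returns the R² coefficient of determination, a quick measure of how well the line fits. This optional check is a sketch, not part of the original walkthrough.

# Optional: R² of the fitted line on the training data
print(f'R² on training data: {regressor.score(X_train, y_train):.3f}')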

  • Step 7: Predict the result

Here comes the interesting part: we are all set to predict any value of y (Salary) based on X (Experience) with the trained model, using regressor.predict.

# Prediction result

y_pred_test = regressor.predict(X_test) # predicted value of y_test

y_pred_train = regressor.predict(X_train) # predicted value of y_train
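Beyond the train and test sets, the same model can predict the salary for any arbitrary experience value. The sketch below uses a made-up input of 5 years; note that predict expects a 2D, one-column input matching the training features.

# Predict salary for a single, arbitrary experience value
new_point = pd.DataFrame({'YearsExperience': [5.0]})  # one sample, one feature
print(regressor.predict(new_point))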

  • Step 8: Plot the training and test results

It is time to visualize our predicted results by plotting graphs.

  • Plot training set data vs predictions. First, we plot the training set (X_train, y_train) as a scatter, together with X_train and the predicted values of y_train (regressor.predict(X_train)) as a line:

# Prediction on training set
plt.scatter(X_train, y_train, color = 'lightcoral')
plt.plot(X_train, y_pred_train, color = 'firebrick')
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(['Training data', 'Regression line'], title = 'Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()
  • Plot test set data vs predictions. Next, we plot the test set (X_test, y_test) as a scatter, together with the same regression line fitted on the training data (X_train, regressor.predict(X_train)):

# Prediction on test set
plt.scatter(X_test, y_test, color = 'lightcoral')
plt.plot(X_train, y_pred_train, color = 'firebrick')
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(['Test data', 'Regression line'], title = 'Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()

We can see, in both plots, that the regression line fits the training and test data well.

You can also plot the results with the predicted values of y_test (regressor.predict(X_test)), but the regression line would remain the same, since it is generated from the same coefficients learned from the training data.
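For completeness, here is a sketch of that variant: plotting the test points against the predictions made for them. The fitted line is the same straight line, just evaluated at the test inputs.

# Variant: plot test data against predictions made on the test set
plt.scatter(X_test, y_test, color = 'lightcoral')
plt.plot(X_test, y_pred_test, color = 'firebrick')  # same line, evaluated at X_test
plt.title('Salary vs Experience (Test Set, test predictions)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.box(False)
plt.show()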

If you remember, at the beginning of this article we discussed the linear equation y = mx + c. We can also get c (the y-intercept) and m (the slope/coefficient) from the regressor model.

# Regressor coefficients and intercept

print(f'Coefficient: {regressor.coef_}')

print(f'Intercept: {regressor.intercept_}')
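As a sanity check that ties back to y = mx + c, you can reproduce a prediction by hand from these two attributes; the result should match regressor.predict for the same input. This is a small illustrative sketch with an arbitrary input of 5 years.

# Reproduce a prediction manually from the learned slope and intercept
m = np.ravel(regressor.coef_)[0]       # slope (works for 1D or 2D targets)
c = np.ravel(regressor.intercept_)[0]  # y-intercept
x_new = 5.0                            # arbitrary years of experience
print(m * x_new + c)                   # should match regressor.predict for the same input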

Conclusion

I hope this post served as a good introduction to simple linear regression and to why it is important to build and assess linear models. In simple terms, linear regression is a powerful supervised machine learning technique that lets us model linear relationships between two variables.
