In the previous blog, I covered the fundamentals of regression. You can check it out here.
Introduction
Simple linear regression is a supervised learning model used to find the relationship between one independent variable and one dependent variable, such as the relationship between experience and salary, or rainfall and grain yield.
It fits the best-fit line between the independent and dependent variables with minimum residual error.
The equation should look like this: y = mx + c
y – dependent variable
x – independent variable
m – slope
c – intercept
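The equation can be sketched directly in code. A minimal illustration (the slope and intercept values here are arbitrary, chosen only for demonstration):

```python
def predict(x, m, c):
    """Evaluate the simple linear regression line y = m*x + c."""
    return m * x + c

# With slope m = 2 and intercept c = 1, x = 3 gives y = 2*3 + 1 = 7
print(predict(3, m=2, c=1))  # 7
```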
Solve By Hand
For a more mathematical understanding, let us solve the following problem. Download the dataset from here: https://www.kaggle.com/datasets/shsrivas/salary-data This dataset contains experience and salary data for employees. Here YearsExperience is the independent column (X), and Salary is the dependent column (Y).
The equation will be: y = mx + c -> Salary = m(YearsExperience) + c
We need to calculate m (slope) and c (intercept). The equations are: m = r × (Sy / Sx) and c = ȳ − m·x̄, where r is the correlation coefficient and Sx, Sy are the standard deviations of x and y.
Do the following steps:
1. Calculate the x average, x̄
2. Calculate the y average, ȳ
3. Calculate x − x̄ and y − ȳ
4. Calculate (x − x̄) × (y − ȳ)
5. Calculate (x − x̄)² and (y − ȳ)²
6. Find the value of r: r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² × Σ(y − ȳ)²)
7. Calculate the slope m = r × (Sy / Sx). With standard deviations Sy = 27,414.429 and Sx = 2.837, this gives m = 9,450.58
8. Calculate the intercept: c = ȳ − m·x̄ = 76,003 − 9,450.58 × 5.313 = 25,792.07
9. Now we have values for both slope and intercept. Predict Y for a given X using y = mx + c.
Let us take X = 9:
Y = 9,450.58 × 9 + 25,792.07
The predicted outcome is Y ≈ 110,847. Residual errors are the differences between predicted values and the observed actual values. The goal is to draw the line with the lowest residual errors, which gives better accuracy.
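The hand calculation above can be sketched in NumPy. This is a minimal illustration of the formulas, using a small made-up dataset rather than the full Kaggle file:

```python
import numpy as np

def fit_line(x, y):
    """Fit y = m*x + c using the correlation-based formulas:
    m = r * (Sy / Sx), c = y_bar - m * x_bar."""
    x_bar, y_bar = x.mean(), y.mean()
    # Sample standard deviations (ddof=1), as in the hand calculation
    sx, sy = x.std(ddof=1), y.std(ddof=1)
    # Pearson correlation coefficient r
    r = np.sum((x - x_bar) * (y - y_bar)) / np.sqrt(
        np.sum((x - x_bar) ** 2) * np.sum((y - y_bar) ** 2)
    )
    m = r * (sy / sx)
    c = y_bar - m * x_bar
    return m, c

# Made-up example: points that lie exactly on y = 3x + 2,
# so the fit should recover m = 3 and c = 2
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3 * x + 2
m, c = fit_line(x, y)
print(m, c)  # 3.0 2.0 (up to floating-point rounding)
```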
Solve using the Python scikit-learn regression model
1. Import the required libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
2. Read the csv file into a pandas data frame. Here I am using the head() function to display only the first five rows.
df = pd.read_csv('Salary Data.csv')
df.head()
3. Visualize the data set using a matplotlib scatter chart to understand the relationship between experience and salary. The chart below clearly shows that as experience increases, so does salary. Therefore, salary clearly depends on experience.
plt.scatter(df['YearsExperience'], df['Salary'])
plt.xlabel("YearsExperience")
plt.ylabel("Salary")
4. Create two variables, X and y. Store YearsExperience in X and Salary in y. Change the format to a two-dimensional array using the NumPy reshape function.
X = np.array(df['YearsExperience']).reshape(-1, 1)
y = np.array(df['Salary']).reshape(-1, 1)
5. Now we have both the independent and dependent variables. Create a linear regression model and train it on X and y. The model.fit() function is used to train the model.
model = LinearRegression()
model.fit(X, y)
print('Intercept c = ', model.intercept_)
print('Coefficient (slope) m = ', model.coef_)
print('Model score: ', model.score(X, y))
Intercept c =  [25792.20019867]
Coefficient (slope) m =  [[9449.96232146]]
Model score:  0.9569566641435086
Here you can see that the intercept (c) and slope (m) results match the earlier Solve by Hand results.
6. Here I created a two-dimensional array variable called test. This variable contains the experience of three employees; let us predict their salaries with our trained model using the model.predict() function, then print the output.
test = [[2],[7],[9]]
prediction = model.predict(test)
for data in zip(test, prediction):
    print(f'Experience: {data[0][0]} Years and Predicted Salary: {data[1][0]}')
Experience: 2 Years and Predicted Salary: 44692.12484157886
Experience: 7 Years and Predicted Salary: 91941.93644885423
Experience: 9 Years and Predicted Salary: 110841.86109176437
7. Now let us draw the regression line using the predicted outcomes. The dots have minimal distance from the line, which shows that the residual errors are exceptionally low.
plt.scatter(df['YearsExperience'], df['Salary'])
plt.plot(test, prediction, color='k')
plt.xlabel("YearsExperience")
plt.ylabel("Salary")
Conclusion
Simple linear regression is one of the easiest supervised learning algorithms in machine learning. The main goal of this blog was to predict employee salary using a simple linear regression model. We covered both the mathematical hands-on approach and the Python approach.
I will cover multiple linear regression in the next blog. That model has multiple independent variables and one dependent variable.
Previous blog: Fundamentals of Regression – Athen