Fundamentals of Regression

What Is Regression in Machine Learning?

Regression analysis is a statistical technique used in machine learning to model the relationship between a dependent variable and one or more independent variables. The goal is to find the line or curve that best fits the data, typically by minimizing the residual errors, so that the dependent variable can be predicted from the values of the independent variables.
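
To make the idea of a best-fit line concrete, here is a minimal sketch of ordinary least squares for a single independent variable, using only the Python standard library. The function name `fit_line` and the house data are made up for illustration; they are assumptions, not part of any library or real dataset.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations (their ratio is the slope).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: house size in square metres and price in thousands.
sizes = [50, 70, 90, 110, 130]
prices = [150, 200, 250, 300, 350]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 100 + intercept  # prediction for a 100 square metre house
print(slope, intercept, predicted_price)  # 2.5 25.0 275.0
```

Because the made-up prices lie exactly on a straight line, the fit recovers that line with zero residual error. Real data will be noisier.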

Regression analysis is widely used in fields such as forecasting, finance, economics, engineering, and time series modeling. It can be used to make predictions, identify trends and patterns, and understand the relationships between variables.

Terminology Related to Regression Models

Independent Variable

In a regression model, independent variables are the input features or predictor variables used to predict the outcome or dependent variable. The values of the independent variables will directly impact the predicted outcomes.

For example, if we are building a model to predict the price of a house, features such as its size, number of bedrooms, and location are the independent variables.

Dependent Variable

In a regression model, the dependent variable is the outcome variable, the one we are trying to predict. It depends on the values of the independent variables.

For example, if we are building a model to predict the price of a house, the price field is the outcome or dependent variable. Other fields such as number of rooms, location, and land value are the independent variables used to predict house prices.

Outliers

Outliers are data points that differ significantly from the rest of the data in a dataset. They can be caused by errors while recording data, or they can be genuine but extreme values. In machine learning, outliers can significantly impact the model’s performance, especially in regression analysis.

There are several methods for detecting outliers in regression models, such as residual analysis, Z-scores, and box plots.
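
As a rough illustration of the Z-score method, the sketch below flags values that lie more than a chosen number of standard deviations from the mean. The function name `zscore_outliers`, the threshold of 2.0, and the price data are assumptions made for this example. A threshold of 3 is also common, but with the population standard deviation it cannot be reached in a sample of only eight values.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical house prices (in thousands) with one suspicious entry.
prices = [210, 215, 220, 225, 230, 235, 240, 900]
outliers = zscore_outliers(prices)
print(outliers)  # [900]
```

Whether a flagged value is a recording error or a genuine extreme still has to be judged case by case.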

Residuals

In regression analysis, residuals are used to check the accuracy of the regression model. A residual is the difference between an observed value of the dependent variable (y) and the corresponding predicted value (ŷ).

Residual = y - ŷ
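
The formula above can be sketched in a few lines of Python. The observed and predicted values are made up for illustration, and the mean squared error is included only as one common way to summarize residuals in a single number.

```python
# Hypothetical observed values (y) and model predictions (y_hat).
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 8.0]

# Residual = y - y_hat, computed point by point.
residuals = [y - y_hat for y, y_hat in zip(actual, predicted)]
print(residuals)  # [0.5, -0.5, 0.0, 1.0]

# Mean squared error: the average of the squared residuals.
mse = sum(r ** 2 for r in residuals) / len(residuals)
print(mse)  # 0.375
```

A positive residual means the model predicted too low for that point, and a negative residual means it predicted too high.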

Under-fitting and Over-fitting

Under-fitting and over-fitting are two common problems in regression models. Both occur when a model does not generalize well to new, unseen data.

Under-fitting occurs when the model is too simple to capture the pattern or relationship in the data. This results in high bias and poor performance on both the training and test data.

Over-fitting occurs when the model is too complex and overly sensitive to the training data, so it performs well on the training data but does not generalize well to new data. This results in high variance.

Regularization and early stopping are common ways to reduce over-fitting, and cross-validation helps detect it. Under-fitting is usually addressed by using a more flexible model or more informative features.
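
The difference can be seen by comparing training and test errors. The sketch below is a toy comparison under stated assumptions: the data is made up (y = 2x + 1 plus noise of +1 or -1), a mean-only predictor stands in for a model that is too simple, and a nearest-neighbour lookup that memorizes the training points stands in for a model that is too flexible. The names `mean_model`, `linear_model`, and `nearest_model` are illustrative only.

```python
def mse(model, data):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((y - model(x)) ** 2 for x, y in data) / len(data)


# Made-up data: y = 2x + 1 plus noise of +1 or -1.
train = [(0, 2), (1, 2), (2, 4), (3, 8), (4, 10), (5, 10), (6, 12), (7, 16)]
test = [(0.25, 0.5), (2.25, 6.5), (4.25, 8.5), (6.25, 14.5)]

# Too simple: always predict the mean of the training targets.
mean_y = sum(y for _, y in train) / len(train)

def mean_model(x):
    return mean_y

# Reasonable: a straight line fitted by least squares.
mean_x = sum(x for x, _ in train) / len(train)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

def linear_model(x):
    return slope * x + intercept

# Too flexible: return the target of the nearest training point (memorization).
def nearest_model(x):
    return min(train, key=lambda point: abs(point[0] - x))[1]

for name, model in [("mean", mean_model), ("linear", linear_model),
                    ("nearest", nearest_model)]:
    print(name, mse(model, train), mse(model, test))
```

On this toy data the mean-only model has a high error on both sets (22.0 and 25.25), the straight line has a similar error on both (1.0 and 1.0), and the memorizing model has zero training error but a test error of 4.25, which is the typical signature of over-fitting.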

What Is a Linear vs Non-Linear Relationship?

A linear relationship is a relationship between two variables that can be represented by a straight line on a graph. In other words, when one variable increases, the other increases or decreases at a constant rate. For example, as the load on a CPU increases, it emits more heat.

A non-linear relationship is a relationship between two variables that cannot be represented by a straight line on a graph. In other words, when one variable increases, the other changes at a rate that is not constant, forming a curved line. For example, if we plot the temperature for each day of the year against time, we will see a non-linear relationship.
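
One simple way to see the difference is the Pearson correlation coefficient, which measures only the strength of a straight-line relationship. In the sketch below, the `pearson` helper and both datasets are made up for illustration. The second dataset depends entirely on x, yet its correlation is zero because the dependence is curved.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [3 * x + 2 for x in xs]   # straight line
curved = [x ** 2 for x in xs]      # parabola

r_linear = pearson(xs, linear)
r_curved = pearson(xs, curved)
print(r_linear, r_curved)
```

Here `r_linear` comes out as 1 (a perfect straight line) and `r_curved` as 0, so a low correlation alone does not mean two variables are unrelated. Plotting the data is usually the quickest check.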

Categorical vs Continuous Variables

A categorical variable can take a limited number of predefined values, such as the color of a car, gender, or blood group. Categorical variables whose values have no natural order are also known as nominal variables.

A continuous variable can take any value within a certain range. For example, a person’s height or weight can be any value between some minimum and maximum.
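
Most regression models expect numeric inputs, so categorical variables are usually converted to numbers before fitting. One common approach is one-hot encoding, where each category becomes its own 0/1 column. The sketch below is a hand-rolled illustration with a made-up `one_hot` helper and sample colors; it is not the API of any particular library.

```python
def one_hot(values):
    """Return the sorted categories and a 0/1 row for each input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical categorical feature: car color.
colors = ["red", "blue", "green", "blue"]
categories, encoded = one_hot(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

Continuous variables such as height or weight can be used as they are, although they are often rescaled.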

Multicollinearity

Multicollinearity is a common issue in regression analysis. It occurs when two or more independent variables are highly correlated with each other. It can lead to unreliable and unstable regression models, because the model cannot separate the individual effect of each correlated variable.

To detect multicollinearity, we can use correlation matrices or variance inflation factors (VIF). If multicollinearity is detected, we can remove one or more of the correlated variables from the dataset, but it requires domain knowledge to decide which variable to drop.
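
As a small illustration, the sketch below computes the VIF for the special case of exactly two independent variables, where it reduces to 1 / (1 - r²) with r being their Pearson correlation. The helper names and the house data are assumptions made for this example. With more than two variables, r² is replaced by the R² obtained from regressing one variable on all the others.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(xs, ys):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2).

    Raises ZeroDivisionError if the two predictors are perfectly correlated.
    """
    r = pearson(xs, ys)
    return 1 / (1 - r ** 2)


# Hypothetical predictors: house size (square metres) and number of rooms.
sizes = [50, 60, 70, 80, 90]
rooms = [2, 2, 3, 3, 4]

r = pearson(sizes, rooms)
vif = vif_two_predictors(sizes, rooms)
print(round(r, 3), round(vif, 2))  # 0.945 9.33
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a warning sign, so on this made-up data size and rooms carry largely the same information.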

Autocorrelation

Autocorrelation is the correlation between the values of a variable and its own earlier values. It is important in time series analysis, as it can affect the accuracy of statistical models and predictions. If autocorrelation is present in a dataset, it may indicate that the values depend on previous observations. This can lead to biased or inefficient parameter estimates and inaccurate predictions.
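
A quick way to check for this is to compute the autocorrelation at a given lag. The sketch below uses the usual sample estimate (lagged covariance divided by the overall variance). The function name `autocorrelation` and both series are made up for illustration.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((value - mean) ** 2 for value in series)
    covariance_sum = sum((series[t] - mean) * (series[t - lag] - mean)
                         for t in range(lag, n))
    return covariance_sum / variance_sum


trending = [1, 2, 3, 4, 5, 6, 7, 8]            # each value follows the last
alternating = [1, -1, 1, -1, 1, -1, 1, -1]     # each value flips sign

ac_trending = autocorrelation(trending)
ac_alternating = autocorrelation(alternating)
print(ac_trending, ac_alternating)  # 0.625 -0.875
```

Values far from zero, positive or negative, suggest that observations are not independent of their predecessors.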

Conclusion

Regression is a fundamental machine learning technique that allows us to predict a continuous output variable based on one or more input variables. When using regression models, we face several challenges, such as outliers, under-fitting and over-fitting, handling categorical and continuous variables, multicollinearity, and autocorrelation.

Overall, a solid understanding of these issues is essential for building robust and accurate regression models. The next post will start with simple linear regression.
