Regression – Relating input variables and outcome


The term “regression” was coined by Francis Galton in the nineteenth century to describe a biological phenomenon: the heights of descendants of tall ancestors tend to regress down toward a normal average (a phenomenon also known as regression toward the mean). In its modern statistical sense, regression analysis helps one understand how the value of the dependent variable (also referred to as the outcome) changes when any one of the independent variables (also referred to as drivers) changes, while the other independent variables are held fixed. Regression analysis estimates the conditional expectation of the dependent variable given the independent variables, that is, the mean value of the dependent variable when the independent variables are held fixed. Some example questions are:

  • I want to predict the lifetime value (LTV) of this customer and understand what drives LTV. What drives the LTV higher or lower?
  • I want to predict the probability that this loan will default and understand what drives default.

Regression focuses on the relationship between the outputs and the inputs. It also provides a model that has some explanatory value, in addition to predicting outcomes. Social scientists use regression mainly for its explanatory value, and because it can also be a fairly good predictor, the method is popular among data scientists. The outcome can be continuous or discrete; when it is discrete, we are predicting the probability that the outcome will occur. Two types of regression methods will be discussed.
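Restating the conditional-expectation definition above in symbols (the notation Y for the outcome and X1, …, Xp for the drivers is a standard choice, not taken from the slides), regression estimates:

```latex
% Regression estimates the mean value of the outcome Y when the
% drivers X_1, ..., X_p are held at the fixed values x_1, ..., x_p.
\[
  \mathbb{E}\left[ Y \mid X_1 = x_1, \ldots, X_p = x_p \right]
\]
```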


We use linear regression to predict a continuous value as a linear, or additive, function of other variables. Some examples are:

  • Predicting income (outcome) as a function of number of years of education, age, and gender (drivers).
  • House price (outcome) as a function of median home price in the neighborhood, square footage, and number of rooms.
  • Neighborhood house sales in the past year based on economic indicators.

The input variables can be continuous or discrete, and the outputs are:

1) A set of coefficients that indicate the relative impact of each driver (and possibly how strongly the variables are correlated).

2) A linear expression predicting the outcome as a function of the drivers (a short fitting sketch follows this list).
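To make the two outputs concrete, here is a minimal sketch of fitting the house-price example with scikit-learn. The synthetic data, feature names, and coefficient values are purely illustrative assumptions, not course material:

```python
# Minimal sketch: fit a linear regression and read off its two outputs.
# All data below is synthetic; the drivers mirror the house-price example.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

median_price = rng.normal(300_000, 50_000, n)  # neighborhood median price
sqft = rng.normal(1_800, 400, n)               # square footage
rooms = rng.integers(2, 7, n)                  # number of rooms

# Simulated outcome: a known linear combination of the drivers plus noise
price = 0.5 * median_price + 120 * sqft + 8_000 * rooms + rng.normal(0, 10_000, n)

X = np.column_stack([median_price, sqft, rooms])
model = LinearRegression().fit(X, price)

# Output 1: coefficients indicating the relative impact of each driver
print(dict(zip(["median_price", "sqft", "rooms"], model.coef_)))
# Output 2: the linear expression, intercept + sum of coefficient * driver
print("intercept:", model.intercept_)
```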


Linear regression is the most frequently used technique for predicting a continuous outcome. It is simple and works well in most instances. A good practice is to try linear regression first and, only if the results turn out to be unreliable, move on to more complicated models such as kernelized ridge regression, local linear regression, regression trees, or neural nets (all of these models are out of scope for this course). Some of the use cases are listed on the slide; other examples include the following (a sketch of this try-simple-first workflow appears after the list):

  • Look at past years’ sales orders and advertising campaigns to decide where and how you will spend this year’s advertising budget
  • Identify the relationship between important variables that affect your business or organization
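A hedged sketch of that workflow: fit a plain linear regression, then check its performance on held-out data before reaching for anything more complex. The placeholder data and the use of R² as the reliability check are assumptions; substitute your own features and outcome:

```python
# Try linear regression first; only escalate to more complex models if the
# held-out fit is poor. Data below is a synthetic placeholder.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))                                  # placeholder drivers
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, 500)   # placeholder outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = LinearRegression().fit(X_train, y_train)

# If this held-out R^2 is poor, consider the more flexible models named
# above (kernelized ridge, regression trees, etc.); otherwise stop here.
print("held-out R^2:", r2_score(y_test, model.predict(X_test)))
```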


Linear regressions are of the form y = b0 + b1x1 + b2x2 + … + bnxn: a constant term plus a linear combination of all the variables, where each coefficient bi is multiplied by the value of the corresponding variable xi. The problem itself is solving for the bi. It is a matrix inversion problem, and the method is referred to as Ordinary Least Squares (OLS). The solution requires storage that grows as the square of the number of variables, and we need to invert a matrix, so the complexity of the solution (in both storage and computation) increases as the number of variables increases. When you have categorical variables, they are expanded into a set of indicator variables, one for each possible value. We will explain this in more detail with an example on the next slide. What we highlight here is that expanding a categorical variable with many levels (ZIP codes, for example) yields a large number of variables, and the complexity of the solution becomes extremely high.
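Both points can be seen in a few lines of NumPy and pandas. This is an illustrative sketch with synthetic data; the column names are assumptions, and lstsq is used in place of an explicit matrix inverse because it solves the same least-squares problem more stably:

```python
# Sketch of (1) solving OLS for the b_i and (2) expanding a categorical
# variable (ZIP code) into indicator columns. All data is synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 100
df = pd.DataFrame({
    "sqft": rng.normal(1_800, 400, n),
    "zip": rng.choice(["02139", "10001", "94103"], n),  # categorical driver
})
df["price"] = 120 * df["sqft"] + rng.normal(0, 10_000, n)

# One 0/1 indicator column per ZIP value (one level is dropped to avoid
# collinearity with the constant). With thousands of ZIP codes, this
# expansion is exactly what blows up the variable count.
X = pd.get_dummies(df[["sqft", "zip"]], columns=["zip"], drop_first=True, dtype=float)
X.insert(0, "const", 1.0)  # the constant term b_0

# OLS solves b = (X^T X)^{-1} X^T y; X^T X is p-by-p, so storage grows as
# the square of the number of variables p.
b, *_ = np.linalg.lstsq(X.to_numpy(), df["price"].to_numpy(), rcond=None)
print(dict(zip(X.columns, b)))
```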