Linear regression is the workhorse tool in econometrics.
Simplicity: straightforward to understand, implement, and visualize.
Interpretability: coefficients have clear interpretations.
Each coefficient represents the change in \(y\) for a one-unit change in \(x\), holding all other variables constant.
Versatility: simple linear regression or multiple linear regression.
Assumptions: linearity, independence of errors, homoscedasticity, and normality of errors.
Baseline Model: You can compare the performance of more advanced models to linear regression.
Estimation: provides coefficient estimates along with confidence intervals and hypothesis tests.
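To make these points concrete, here is a minimal R sketch with simulated data (the values 2 and 3 below are arbitrary "true" coefficients chosen only for illustration): fit a simple model, read the coefficient estimates, and get their confidence intervals.

```r
# Minimal illustration with simulated data (arbitrary "true" values):
# the slope estimate is the change in y for a one-unit change in x.
set.seed(123)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)   # true intercept = 2, true slope = 3
fit <- lm(y ~ x)
coef(fit)      # estimated intercept and slope
confint(fit)   # 95% confidence intervals for the coefficients
```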
Regression
In this setting, the variables \(y\) and \(x\) can have several names.
| Y | X |
|---|---|
| Dependent variable | Independent variable |
| Explained variable | Explanatory variable |
| Response variable | Control variable |
| Predicted variable | Predictor variable |
| Regressand | Regressor |
Regression
Broadly, we are interested in how \(y\) is explained by \(x\):
\(y_i = \alpha + \beta_1 x_i + \epsilon\)
Perhaps \(\epsilon\) is the most important part of a regression.
The interpretation is “everything that is not explained by X and that explains Y”.
A comment
Usually, the literature uses \(\epsilon\) for the “estimated” residual.
And \(\mu\) for the “true” residual, which necessarily implies that the assumptions hold.
At the end of the day, you don’t need to worry too much about the notation of this term, because we are almost always in the “estimated world” and almost never in the “true world”.
The “true world” implies that you are studying the population or that you have a true random sample of the population.
\(y_i = \alpha + \beta_1 x_i + \mu\)
Regression
Remember
\(y, x\), and \(\mu\) are random variables
\(y\) and \(x\) are observable
\(\mu\) and \(\beta\) are unobservable
\(\mu\) captures everything that determines y after accounting for x
Our goal is to estimate \(\beta\)
Regression
There are some assumptions/requirements about \(\mu\) in an OLS model.
First assumption
\(E(\mu) = 0\)
This is a simple assumption, not strong at all.
It simply assumes that the average of \(\mu\) is zero in the population.
Basically, any non-zero mean is absorbed by the intercept
Say that \(E(\mu) = k\).
We could rewrite \(\mu = k + w\), where \(E(w) = 0\).
Then, the model becomes \(y = (\alpha + k) + \beta x + w\).
The intercept is now just \(\alpha + k\), and the error, \(w\), has mean zero.
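A quick R sketch of this argument (the values of \(k\), \(\alpha\), and \(\beta\) below are arbitrary, chosen only to illustrate the point): the non-zero error mean shows up in the intercept, not in the slope.

```r
# Sketch: an error term with mean k = 3 (arbitrary choice) is absorbed by the
# intercept; the slope estimate is unaffected.
set.seed(1)
x <- rnorm(10000)
u <- rnorm(10000, mean = 3)   # E(u) = 3, not zero
y <- 1 + 5.5 * x + u          # "true" alpha = 1, "true" beta = 5.5
coef(lm(y ~ x))               # intercept close to 1 + 3 = 4, slope close to 5.5
```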
Regression
Second assumption
\(E(\mu|x) = E(\mu)\) for all values of \(x\)
It says that the average value of \(\mu\) does not depend on the value of x (i.e., the slice of the population you are looking at).
We say that \(\mu\) is mean-independent of x.
This is true if \(x\) and \(\mu\) are independent of each other.
It implies that \(x\) and \(\mu\) are uncorrelated.
This is the Conditional Mean Independence (CMI) assumption.
It is one of the key assumptions for causal inference.
library(tidyverse)
set.seed(1)
tb <- tibble(
  x = rnorm(10000),
  u = rnorm(10000),
  y = 5.5 * x + 12 * u  # notice that I am defining the beta1 here. The 5.5 is the "true" beta we want to estimate.
)
reg_tb <- lm(y ~ x, data = tb)
summary(reg_tb)
Call:
lm(formula = y ~ x, data = tb)
Residuals:
Min 1Q Median 3Q Max
-51.528 -8.152 -0.173 7.978 44.718
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.04991 0.11890 -0.42 0.675
x 5.55690 0.11745 47.31 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.89 on 9998 degrees of freedom
Multiple R-squared: 0.1829, Adjusted R-squared: 0.1828
F-statistic: 2238 on 1 and 9998 DF, p-value: < 2.2e-16
Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

np.random.seed(1)
obs = 10000
data = pd.DataFrame({
    'x': np.random.normal(size=obs),
    'u': np.random.normal(size=obs),
})
data['y'] = 5.5 * data['x'] + 12 * data['u']
X = sm.add_constant(data['x'])
model = sm.OLS(data['y'], X).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.183
Model: OLS Adj. R-squared: 0.183
Method: Least Squares F-statistic: 2237.
Date: Wed, 20 Sep 2023 Prob (F-statistic): 0.00
Time: 10:39:25 Log-Likelihood: -39049.
No. Observations: 10000 AIC: 7.810e+04
Df Residuals: 9998 BIC: 7.812e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.1114 0.120 0.927 0.354 -0.124 0.347
x 5.6887 0.120 47.293 0.000 5.453 5.924
==============================================================================
Omnibus: 0.640 Durbin-Watson: 2.050
Prob(Omnibus): 0.726 Jarque-Bera (JB): 0.672
Skew: -0.012 Prob(JB): 0.715
Kurtosis: 2.968 Cond. No. 1.01
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Stata
set seed 1
set obs 10000
gen x = rnormal()
gen u = rnormal()
gen y = 5.5*x + 12*u
reg y x
Number of observations (_N) was 0, now 10,000.
Source | SS df MS Number of obs = 10,000
-------------+---------------------------------- F(1, 9998) = 2206.80
Model | 327141.413 1 327141.413 Prob > F = 0.0000
Residual | 1482125.8 9,998 148.242229 R-squared = 0.1808
-------------+---------------------------------- Adj R-squared = 0.1807
Total | 1809267.22 9,999 180.944816 Root MSE = 12.175
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
x | 5.598296 .119172 46.98 0.000 5.364695 5.831897
_cons | -.0750109 .1217548 -0.62 0.538 -.3136748 .163653
------------------------------------------------------------------------------
library(tidyverse)
set.seed(1)
tb <- tibble(
  x = rnorm(10000),
  u = rnorm(10000),
  y = 5.5 * x + 12 * u  # notice that I am defining the beta1 here. The 5.5 is the "true" beta we want to estimate.
)
reg_tb <- lm(y ~ x, data = tb)
tb %>%
  lm(y ~ x, .) %>%
  ggplot(aes(x = x, y = y)) +
  ggtitle("OLS Regression Line") +
  geom_point(size = 0.05, color = "black", alpha = 0.5) +
  geom_smooth(method = lm, color = "black") +
  annotate("text", x = -1.5, y = 30, color = "red",
           label = paste("Intercept = ", reg_tb$coefficients[1])) +
  annotate("text", x = 1.5, y = -30, color = "blue",
           label = paste("Slope =", reg_tb$coefficients[2]))
Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt

# Create a scatterplot with the OLS regression line using Seaborn
sns.set(style='whitegrid')
plt.figure(figsize=(7, 5))
sns.scatterplot(x='x', y='y', data=data, color='black', alpha=0.5, s=5)
sns.regplot(x='x', y='y', data=data, color='black', scatter=False, line_kws={'color': 'black'})
plt.title('OLS Regression Line')
plt.annotate(f'Intercept = {model.params[0]:.2f}', xy=(-1.5, 30), color='red')
plt.annotate(f'Slope = {model.params[1]:.2f}', xy=(1.5, -30), color='blue')
plt.show()
Stata
set seed 1
set obs 10000
gen x = rnormal()
gen u = rnormal()
gen y = 5.5*x + 12*u
reg y x
predict yhat1
gen yhat2 = -0.0750109 + 5.598296*x
predict uhat1, residual
gen uhat2 = y - yhat2
qui sum uhat*
twoway (lfit y x, lcolor(black) lwidth(medium)) (scatter y x, mcolor(black) msize(tiny) msymbol(point)), title(OLS Regression Line)
qui graph export "files/graph3.svg", replace
set seed 1
clear
qui set obs 10000
gen x = rnormal()
gen u = rnormal()
gen y = 5.5*x + 12*u
qui reg y x
predict uhat1, residual
predict yhat1
gen yhat2 = -0.0750109 + 5.598296*x
gen uhat2 = y - yhat2
sum yhat* uhat*
library(tidyverse)
set.seed(1)
tb <- tibble(
  x = rnorm(10000),
  u = rnorm(10000),
  y = 5.5 * x + 12 * u  # notice that I am defining the beta1 here. The 5.5 is the "true" beta we want to estimate.
)
reg_tb <- lm(y ~ x, data = tb)
tb <- tb %>%
  mutate(
    yhat1 = predict(lm(y ~ x, .)),
    yhat2 = reg_tb$coefficients[1] + reg_tb$coefficients[2] * x,
    uhat1 = residuals(lm(y ~ x, .)),
    uhat2 = y - yhat2
  )
tb %>%
  lm(uhat1 ~ x, .) %>%
  ggplot(aes(x = x, y = uhat1)) +
  geom_point(size = 0.1, color = "black") +
  geom_smooth(method = lm, color = "black")
Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(1)
obs = 10000
x = np.random.normal(size=obs)
u = np.random.normal(size=obs)
y = 5.5 * x + 12 * u
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
tb = pd.DataFrame({'x': x, 'u': u, 'y': y})
tb['yhat1'] = model.predict(X)
tb['uhat1'] = y - tb['yhat1']
tb['yhat2'] = model.params[0] + model.params[1] * x
tb['uhat2'] = y - tb['yhat2']
model = sm.OLS(tb['uhat1'], sm.add_constant(tb['yhat1'])).fit()

# Create a scatter plot with a regression line
sns.set(style="whitegrid")
plt.figure(figsize=(7, 4.5))
sns.scatterplot(x='x', y='uhat1', data=tb, size=0.05, color='black', alpha=0.5)
sns.regplot(x='x', y='uhat1', data=tb, scatter=False, color='black')
plt.xlabel('x')
plt.ylabel('uhat1')
plt.title('Scatter Plot of uhat1 vs. x')
plt.show()
Stata
set seed 1
set obs 10000
gen x = rnormal()
gen u = rnormal()
gen y = 5.5*x + 12*u
qui reg y x
predict yhat1
predict uhat1, residual
twoway (lfit uhat1 x, lcolor(black) lwidth(large)) (scatter uhat1 x, mcolor(black) msymbol(point))
qui graph export "files/graph5.svg", replace
Properties of OLS
Let’s say you estimate a model and find the \(\hat{\mu}\).
If you calculate the correlation between \(x\) and \(\hat{\mu}\), you will find zero.
This is by construction! It is not evidence that CMI is not violated.
In fact, the OLS “assumes” and “forces” zero correlation.
It is intuitive: if you are “forcing” zero correlation when the true correlation is not zero, your coefficients will be biased.
The previous graphs do show zero correlation, but that is expected and tells us nothing about whether CMI holds.
At the end of the day, CMI is untestable and unverifiable.
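A short R sketch of this point (simulated data, with an error term deliberately built to depend on \(x\), so CMI is violated by construction): the residuals are still uncorrelated with \(x\), even though the slope estimate is biased.

```r
# Sketch: CMI is violated (u depends on x), yet cor(x, uhat) is still ~0,
# because OLS forces it to zero; the bias shows up in the slope instead.
set.seed(1)
x <- rnorm(10000)
u <- 2 * x + rnorm(10000)   # error depends on x: CMI fails
y <- 5.5 * x + u            # "true" beta = 5.5
fit <- lm(y ~ x)
cor(x, resid(fit))          # essentially zero, by construction
coef(fit)["x"]              # around 7.5, biased away from 5.5
```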
Goodness-of-fit
Understanding what SSR, SSE and SST mean
SSE = Sum of Squares Explained = \(\sum_{i=1}^n(\hat{y_i}-\bar{y})^2\)
SSR = Sum of Squares Residuals = \(\sum_{i=1}^n\hat{\mu}_i^2\)
SST = Sum of Squares Total = SSE + SSR = \(\sum_{i=1}^n(y_i-\bar{y})^2\)
R-squared is simply the ratio of portion explained over the total that could be explained.
\[R^2 = \frac{SSE}{SST} = 1-\frac{SSR}{SST}\]
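To see why SST = SSE + SSR, write \(y_i = \hat{y_i} + \hat{\mu}_i\) and expand the square; the cross term vanishes because the OLS residuals sum to zero and are uncorrelated with the fitted values:

\[
SST = \sum_{i=1}^n (y_i-\bar{y})^2 = \sum_{i=1}^n \left[(\hat{y_i}-\bar{y}) + \hat{\mu}_i\right]^2 = \underbrace{\sum_{i=1}^n (\hat{y_i}-\bar{y})^2}_{SSE} + 2\underbrace{\sum_{i=1}^n (\hat{y_i}-\bar{y})\hat{\mu}_i}_{=0} + \underbrace{\sum_{i=1}^n \hat{\mu}_i^2}_{SSR}
\]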
Goodness-of-fit
You can think of it this way:
If X does not explain Y, then the best predictor of Y is \(\bar{y}\). In that case, your model does not explain anything of Y, thus \(R^2\) is zero, and \(\hat{y_i}=\bar{y}\)
If X partially explains Y, then \(\hat{y_i} \neq \bar{y}\), meaning that the fitted line has some slope (like the figure next slide). This means that \(SSE>0\), so \(R^2>0\) but \(R^2<1\).
Whatever is not explained by \(\hat{y_i}\) is left in \(\sum_{i=1}^n\hat{\mu}_i^2\), meaning that SSR will be non-zero.
The ratio of the portion explained by \(\hat{y_i}\) over the total that could be explained, \(y_i-\bar{y}\), is the \(R^2\).
library(foreign)  # importing dataset from a stata dta file
data <- read.dta("files/CEOSAL1.dta")
attach(data)
# Statistics of salary
mean(salary)
[1] 1281.12
R
# OLS model
model <- lm(salary ~ roe)
salaryhat <- fitted(model)  # Predict values for dependent variable
uhat <- resid(model)        # Predict regression residuals
salarymean <- rep(mean(salary), length(salary))  # Generating the mean of salary
summary(model)
Call:
lm(formula = salary ~ roe)
Residuals:
Min 1Q Median 3Q Max
-1160.2 -526.0 -254.0 138.8 13499.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 963.19 213.24 4.517 1.05e-05 ***
roe 18.50 11.12 1.663 0.0978 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1367 on 207 degrees of freedom
Multiple R-squared: 0.01319, Adjusted R-squared: 0.008421
F-statistic: 2.767 on 1 and 207 DF, p-value: 0.09777
Python
import pandas as pd
import statsmodels.api as sm

data = pd.read_stata("files/CEOSAL1.dta")
print(data['salary'].mean())
# OLS model
1281.1196172248804
Python
X = data['roe']
X = sm.add_constant(X)
y = data['salary']
model = sm.OLS(y, X).fit()      # Fit the linear regression model
salaryhat = model.fittedvalues  # Predicted values for the dependent variable
uhat = model.resid              # Predict regression residuals
salarymean = pd.Series([y.mean()] * len(y))  # Generating the mean of salary
print(model.summary())
library(foreign)  # importing dataset from a stata dta file
mydata <- read.dta("files/CEOSAL1.dta")
attach(mydata)
model <- lm(salary ~ roe)
salaryhat <- fitted(model)  # Predict values for dependent variable
uhat <- resid(model)        # Predict regression residuals
salarymean <- rep(mean(salary), length(salary))  # Generating the mean of salary
# r-squared is simply the ratio of portion explained over total that could be explained - Understanding what SSR, SSE and SST mean
plot(salary ~ roe)
abline(lm(salary ~ roe), col = "blue")
abline(lm(salarymean ~ roe), col = "red")
Python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_stata("files/CEOSAL1.dta")
X = data['roe']          # use the Series so the element-wise products below align correctly
y = data['salary']
salarymean = np.repeat(y.mean(), len(y))

# OLS slope and intercept computed by hand
X_mean = X.mean()
y_mean = y.mean()
slope = np.sum((X - X_mean) * (y - y_mean)) / np.sum((X - X_mean) ** 2)
intercept = y_mean - slope * X_mean
salaryhat = slope * X + intercept

# Plotting the data and regression lines
plt.scatter(X, y, alpha=0.7)
plt.plot(X, salaryhat, color='blue', linewidth=2)
plt.plot(X, salarymean, color='red', linewidth=2)
plt.xlabel('roe')
plt.ylabel('salary')
plt.show()
library(foreign)  # importing dataset from a stata dta file
mydata <- read.dta("files/CEOSAL1.dta")
attach(mydata)
model <- lm(salary ~ roe)
salaryhat <- fitted(model)  # Predict values for dependent variable
uhat <- resid(model)        # Predict regression residuals
salarymean <- rep(mean(salary), length(salary))  # Generating the mean of salary
# r-squared is simply the ratio of portion explained over total that could be explained
ssr <- sum(uhat^2)
ssrB <- sum((salary - salaryhat)^2)
sst <- sum((salary - salarymean)^2)
sse <- sum((salaryhat - salarymean)^2)
sse / sst