Econometria Aplicada a Finanças

Part 4

Henrique C. Martins

More about Regressions

Regression

Linear regressions are the workhorse tool in econometrics

  • Simplicity: straightforward to understand, implement, and visualize.
  • Interpretability: coefficients have clear interpretations.
    • Each slope coefficient represents the change in Y for a one-unit change in the corresponding X, holding all other variables constant.
  • Versatility: simple linear regression or multiple linear regression.
  • Assumptions: linearity, independence of errors, homoscedasticity, and normality of errors.
  • Baseline Model: You can compare the performance of more advanced models to linear regression.
  • Estimation: provides clear estimates of the coefficients’ confidence intervals and hypothesis testing.

Regression

In this setting, the variables \(y\) and \(x\) can have several names.

Y | X
Dependent variable | Independent variable
Explained variable | Explanatory variable
Response variable | Control variable
Predicted variable | Predictor variable
Regressand | Regressor

Regression

Broadly, we are interested in how y is explained by x.

  • \(y_i = \alpha + \beta_1 x_i + \epsilon\)

Perhaps \(\epsilon\) is the most important part of a regression.

The interpretation is “everything that is not explained by X and that explains Y”.

A comment

  • Usually, the literature uses \(\epsilon\) for the “estimated” residual.

  • And \(\mu\) for the “true” residual, which necessarily implies that the assumptions hold.

  • At the end of the day, you don’t need to worry too much about the notation of this term, because we are almost always in the “estimated world” and almost never in the “true world”.

  • The “true world” implies that you are studying the population or that you have a true random sample of the population

    • \(y_i = \alpha + \beta_1 x_i + \mu\)

Regression

Remember

  • \(y, x\), and \(\mu\) are random variables
  • \(y\) and \(x\) are observable
  • \(\mu\) and \(\beta\) are unobservable
  • \(\mu\) captures everything that determines y after accounting for x

Our goal is to estimate β

Regression

There are some assumptions/requirements about \(\mu\) in an OLS model

First assumption

  • E(\(\mu\)) = 0

    • This is a simple assumption, not strong at all.
    • It simply assumes that the average of \(\mu\) is zero in the population.
    • Basically, any non-zero mean is absorbed by the intercept
      • Say that E(\(\mu\)) = k
      • We could rewrite \(\mu = k + w\), where E(w)=0
      • Then, the model becomes \(y = (\alpha + k) + \beta x + w\)
      • The intercept is now just \((\alpha + k)\), and the error, \(w\), has mean zero (see the sketch below)
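
A minimal numerical sketch of this point (the numbers are hypothetical: k = 3, a true intercept of 2, and a true slope of 1.5): the estimated intercept absorbs k, while the slope estimate is unaffected.

Python
import numpy as np
import statsmodels.api as sm

np.random.seed(1)
n = 10000
x = np.random.normal(size=n)
w = np.random.normal(size=n)            # mean-zero part of the error
u = 3 + w                               # error with non-zero mean, k = 3
y = 2 + 1.5 * x + u                     # true alpha = 2, true beta = 1.5

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.params)                     # intercept close to alpha + k = 5, slope close to 1.5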

Regression

Second assumption

  • E(\(\mu\)|x) = E(\(\mu\)) for all values of x
    • It says that the average value of \(\mu\) does not depend on the value of x (i.e., the slice of the population you are looking at).
    • We say that \(\mu\) is mean-independent of x.
    • This is true if X and \(\mu\) are independent of each other.
    • Implies that x and \(\mu\) are uncorrelated.
    • Conditional Mean Independence (CMI).
    • This is one of the key assumptions for causal inference.

Regression

Second assumption

Example

Let’s say the model is:

\[wage = \alpha + \beta Schooling_{years} + \epsilon\]

  • where \(\epsilon\) represents unobserved ability.

Does CMI hold?

That is, does E(ability|x=8)=E(ability|x=10)=E(ability|x=12) hold?

Probably not, because unobserved ability is likely correlated with the years of schooling.

The solution (not trivial) would be to include ability as a new X.
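
A hedged simulation of this omitted-variable problem (all variable names and magnitudes below are made up for illustration): with ability left in the error term, the schooling coefficient is biased upward; adding ability as a regressor recovers the true coefficient.

Python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(1)
n = 10000
ability = np.random.normal(size=n)
schooling = 12 + 2 * ability + np.random.normal(size=n)        # schooling rises with ability
wage = 100 + 50 * schooling + 200 * ability + np.random.normal(scale=100, size=n)
df = pd.DataFrame({'wage': wage, 'schooling': schooling, 'ability': ability})

short = smf.ols('wage ~ schooling', data=df).fit()             # ability omitted: CMI fails
long = smf.ols('wage ~ schooling + ability', data=df).fit()    # ability included
print(short.params['schooling'], long.params['schooling'])     # biased (~130) vs. close to the true 50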

Regression

Another example

Consider the following model (with only one X)

\[Leverage_i = \alpha + \beta_1 Profitability_i + \mu_i\]

  • CMI says that the expected value of \(\mu\) is the same for every firm i, even when firms have different profitability.

  • Can you think of examples where this assumption may not hold in this model?

  1. unprofitable firms have higher bankruptcy risk, which should make them have lower leverage (tradeoff theory).

  2. unprofitable firms have low cash, which should make them have more leverage (pecking order theory).

Regression

The discussion of whether the CMI holds is the origin of the “endogeneity” problem.

You will face reviewers arguing why CMI might not hold in your case.

  • Many will criticize a model by saying it has an “endogeneity problem”, without explaining further.

    • This is very generic. Don’t do that!

They should explain the source of the problem that makes the model violate CMI.

  • OVB, selection bias, reverse causality, simultaneity, etc?

Generally speaking, endogeneity refers to a violation of CMI, meaning that \(x\) and \(\mu\) are correlated.

This is always a plausible possibility, since \(\mu\) carries a lot of stuff (something must be correlated with X).

Regression

Third assumption

Combining 1 and 2 leads to

  • E(\(\mu\)|x) = 0
    • This is a very important assumption called zero conditional mean assumption.

Ordinary Least Squares

Our focus is to find estimates for \(\alpha\) and \(\beta\). If we had access to the population, it would be easy. We could write:

\[y_i= \alpha + \beta_1x_i + \mu\]

But remember that,

\[E(\mu) = 0\] \[E(\mu|x) = 0\]

The second condition implies that the correlation between x and \(\mu\) is zero (we can write that as \(E(x\mu) = 0\)).

Important

Remember: \(\alpha\) and \(\beta\) are parameters to be estimated (i.e., constants), while \(X\) and \(Y\) are variables

Regression

So we can write that (in the population)

\(E(y - \alpha - \beta_1x ) = 0\)

\(E(x[y - \alpha - \beta_1x ]) = 0\)

So we can write that (in the sample)

\(\frac{1}{n} \sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta_1} x_i ) = 0\) , We will use this to find \(\alpha\):

\(\frac{1}{n} \sum_{i=1}^n (x_i [y_i - \hat{\alpha} - \hat{\beta_1} x_i ]) = 0\) , We will use this to find \(\beta\):

Finding \(\alpha\)

From before:

\[\frac{1}{n} \sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta_1} x_i ) =0\]

Passing the sum operator through

\[\frac{1}{n}\sum_{i=1}^n(y_i) - \frac{1}{n}\sum_{i=1}^n(\hat{\alpha}) - \frac{1}{n} \sum_{i=1}^n(\hat{\beta_1} x_i )=0\]

Coefficients are constants, so we can get rid of the sum operator.

\[\frac{1}{n}\sum_{i=1}^n(y_i) - \hat{\alpha} - \hat{\beta_1} \frac{1}{n} \sum_{i=1}^n(\ x_i )=0\]

Finding \(\alpha\)

  • We know that \(\frac{1}{n}\sum_{i=1}^n(y_i)\) is \(\bar{y_i}\) (the mean)

\[\bar{y_i} - \hat{\alpha} - \hat{\beta_1} \bar{x_i}=0\]

So we write:

\[\hat{\alpha} = \bar{y_i} - \hat{\beta_1} \bar{x_i}\]

Easy Peasy 😀
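
A quick numerical check of this identity (a sketch reusing the simulated data that appears later in these slides): the OLS intercept equals \(\bar{y} - \hat{\beta_1}\bar{x}\).

Python
import numpy as np
import statsmodels.api as sm

np.random.seed(1)
x = np.random.normal(size=10000)
u = np.random.normal(size=10000)
y = 5.5 * x + 12 * u

model = sm.OLS(y, sm.add_constant(x)).fit()
alpha_hat, beta_hat = model.params                   # [intercept, slope]
print(alpha_hat, y.mean() - beta_hat * x.mean())     # the two numbers coincide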

Finding \(\beta\)

From before:

\[\frac{1}{n} \sum_{i=1}^n (x_i [y_i - \hat{\alpha} - \hat{\beta_1} x_i ]) = 0\]

But now we have:

\[\hat{\alpha} = \bar{y_i} - \hat{\beta_1} \bar{x_i}\]

Thus,

\[\frac{1}{n} \sum_{i=1}^n (x_i [y_i - (\bar{y_i} - \hat{\beta_1} \bar{x_i}) - \hat{\beta_1} x_i ]) = 0\]

Finding \(\beta\)

Turning into

\[\frac{1}{n} \sum_{i=1}^n (x_i [y_i - \bar{y_i} ]) - \frac{1}{n} \sum_{i=1}^n (x_i [\hat{\beta_1} x_i - \hat{\beta_1} \bar{x_i} ]) = 0\]

Or

\[\frac{1}{n} \sum_{i=1}^n (x_i [y_i - \bar{y_i} ]) = \hat{\beta_1} \frac{1}{n} \sum_{i=1}^n (x_i [ x_i - \bar{x_i} ]) \]

Last step (I am skipping some steps here)

\[\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x} )(y_i - \bar{y} ) = \hat{\beta_1} \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x} )^2 \]

Finding \(\beta\)

If the variance of x is not zero, we can write \(\beta\) as

\[\hat{\beta_1} = \frac{\sum_i^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_i^n (x_i - \bar{x})^2} = \frac{Covariance(x_i,y_i)}{Variance(x_i)}\]
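
The same formula checked numerically (a sketch, again with the simulated data used later in these slides): the OLS slope equals the sample covariance of x and y divided by the sample variance of x.

Python
import numpy as np
import statsmodels.api as sm

np.random.seed(1)
x = np.random.normal(size=10000)
u = np.random.normal(size=10000)
y = 5.5 * x + 12 * u

beta_cov = np.cov(x, y)[0, 1] / np.var(x, ddof=1)           # Cov(x, y) / Var(x)
beta_ols = sm.OLS(y, sm.add_constant(x)).fit().params[1]    # slope from the regression
print(beta_cov, beta_ols)                                   # identical up to floating point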

Finding \(\mu\)

Now that we have \(\hat{Y_i}=\hat{\alpha} + \hat{\beta_1} X_i\), we can estimate the residual \(\mu\)

\[\hat{\mu} = Y_i - \hat{Y_i}\]

Which is the same as:

\[\hat{\mu} = Y_i - \hat{\alpha} - \hat{\beta_1}x_i\]

Most residuals will not be 0, meaning that most data points do not lie exactly on the fitted line.

Finding \(\mu\)

The job of an OLS model is to find the parameters that minimize the sum of squared residuals (i.e., to find the best-fitting line).

\[\sum_{i=1}^n \hat{\mu}^2 = \sum_{i=1}^n(Y_i - \hat{Y_i})^2\]
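
A small sketch of the minimization idea (illustrative, reusing the same simulated data): nudging the slope away from the OLS estimate in either direction only increases the sum of squared residuals.

Python
import numpy as np
import statsmodels.api as sm

np.random.seed(1)
x = np.random.normal(size=10000)
u = np.random.normal(size=10000)
y = 5.5 * x + 12 * u

a, b = sm.OLS(y, sm.add_constant(x)).fit().params

def ssr(alpha, beta):
    # sum of squared residuals for a candidate (alpha, beta)
    return np.sum((y - alpha - beta * x) ** 2)

print(ssr(a, b))          # SSR at the OLS estimates
print(ssr(a, b + 0.5))    # larger
print(ssr(a, b - 0.5))    # larger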

Regression (Source)

Another example of a regression. The differences in the coefficients across R, Python, and Stata are due to differences in their random number generators (the same seed produces different draws in each software).

R
library(tidyverse)
set.seed(1)
tb <- tibble(
  x = rnorm(10000),
  u = rnorm(10000),
  y = 5.5*x + 12*u # notice that I am defining the beta1 here. The 5.5 is the "true" beta we want to estimate.
) 
reg_tb <- lm(y ~ x, data=tb) 
summary(reg_tb)

Call:
lm(formula = y ~ x, data = tb)

Residuals:
    Min      1Q  Median      3Q     Max 
-51.528  -8.152  -0.173   7.978  44.718 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.04991    0.11890   -0.42    0.675    
x            5.55690    0.11745   47.31   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.89 on 9998 degrees of freedom
Multiple R-squared:  0.1829,    Adjusted R-squared:  0.1828 
F-statistic:  2238 on 1 and 9998 DF,  p-value: < 2.2e-16
Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
np.random.seed(1)

obs = 10000
data = pd.DataFrame({
    'x': np.random.normal(size=obs),
    'u': np.random.normal(size=obs),
})
data['y'] = 5.5 * data['x'] + 12 * data['u']

X = sm.add_constant(data['x'])
model = sm.OLS(data['y'], X).fit()

print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.183
Model:                            OLS   Adj. R-squared:                  0.183
Method:                 Least Squares   F-statistic:                     2237.
Date:                qua, 20 set 2023   Prob (F-statistic):               0.00
Time:                        10:39:25   Log-Likelihood:                -39049.
No. Observations:               10000   AIC:                         7.810e+04
Df Residuals:                    9998   BIC:                         7.812e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.1114      0.120      0.927      0.354      -0.124       0.347
x              5.6887      0.120     47.293      0.000       5.453       5.924
==============================================================================
Omnibus:                        0.640   Durbin-Watson:                   2.050
Prob(Omnibus):                  0.726   Jarque-Bera (JB):                0.672
Skew:                          -0.012   Prob(JB):                        0.715
Kurtosis:                       2.968   Cond. No.                         1.01
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Stata
set seed 1 
set obs 10000 
gen x = rnormal() 
gen u  = rnormal() 
gen y  = 5.5*x + 12*u 
reg y x 
Number of observations (_N) was 0, now 10,000.

      Source |       SS           df       MS      Number of obs   =    10,000
-------------+----------------------------------   F(1, 9998)      =   2206.80
       Model |  327141.413         1  327141.413   Prob > F        =    0.0000
    Residual |   1482125.8     9,998  148.242229   R-squared       =    0.1808
-------------+----------------------------------   Adj R-squared   =    0.1807
       Total |  1809267.22     9,999  180.944816   Root MSE        =    12.175

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |   5.598296    .119172    46.98   0.000     5.364695    5.831897
       _cons |  -.0750109   .1217548    -0.62   0.538    -.3136748     .163653
------------------------------------------------------------------------------

Regression (Source)

Another example of a regression. The differences in the coefficients across R, Python, and Stata are due to differences in their random number generators (the same seed produces different draws in each software).

R
library(tidyverse)
set.seed(1)
tb <- tibble(
  x = rnorm(10000),
  u = rnorm(10000),
  y = 5.5*x + 12*u # notice that I am defining the beta1 here. The 5.5 is the "true" beta we want to estimate.
) 
reg_tb <- lm(y ~ x, data=tb) 
tb %>% 
  lm(y ~ x, .) %>% 
  ggplot(aes(x=x, y=y)) + 
  ggtitle("OLS Regression Line") +
  geom_point(size = 0.05, color = "black", alpha = 0.5) +
  geom_smooth(method = lm, color = "black") +
  annotate("text", x = -1.5, y = 30, color = "red", 
           label = paste("Intercept = ", reg_tb$coefficients[1])) +
  annotate("text", x = 1.5, y = -30, color = "blue", 
           label = paste("Slope =", reg_tb$coefficients[2]))

Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
# `data` and `model` are reused from the previous slide's Python code

# Create a scatterplot with the OLS regression line using Seaborn
sns.set(style='whitegrid')
plt.figure(figsize=(7, 5))
sns.scatterplot(x='x', y='y', data=data, color='black', alpha=0.5, s=5)
sns.regplot(x='x', y='y', data=data, color='black', scatter=False, line_kws={'color':'black'})
plt.title('OLS Regression Line')
plt.annotate(f'Intercept = {model.params[0]:.2f}', xy=(-1.5, 30), color='red')
plt.annotate(f'Slope = {model.params[1]:.2f}', xy=(1.5, -30), color='blue')
plt.show()

Stata
set seed 1 
set obs 10000 
gen x = rnormal() 
gen u  = rnormal() 
gen y  = 5.5*x + 12*u 
reg y x 
predict yhat1 
gen yhat2 = -0.0750109  + 5.598296*x 
predict uhat1, residual 
gen uhat2=y-yhat2 
qui sum uhat* 
twoway (lfit y x, lcolor(black) lwidth(medium)) (scatter y x, mcolor(black) msize(tiny) msymbol(point)), title(OLS Regression Line) 
qui graph export "files/graph3.svg", replace

Regression (Source)

Using the previous regressions, we can show that:

  1. \(\hat{y_i} = \hat{\alpha} + \hat{\beta_1} x_i\)

  2. \(\hat{\mu_i} = y_i - \hat{y_i}\)

R
library(tidyverse)
set.seed(1)
tb <- tibble(
  x = rnorm(10000),
  u = rnorm(10000),
  y = 5.5*x + 12*u # notice that I am defining the beta1 here. The 5.5 is the "true" beta we want to estimate.
) 
reg_tb <- lm(y ~ x, data=tb) 

tb <- tb %>% 
  mutate(
    yhat1 = predict(lm(y ~ x, .)),
    yhat2 = reg_tb$coefficients[1] + reg_tb$coefficients[2]*x, 
    uhat1 = residuals(lm(y ~ x, .)),
    uhat2 = y - yhat2
  )
summary(tb[-1:-3])
     yhat1               yhat2               uhat1              uhat2         
 Min.   :-20.45096   Min.   :-20.45096   Min.   :-51.5275   Min.   :-51.5275  
 1st Qu.: -3.79189   1st Qu.: -3.79189   1st Qu.: -8.1520   1st Qu.: -8.1520  
 Median : -0.13842   Median : -0.13842   Median : -0.1727   Median : -0.1727  
 Mean   : -0.08624   Mean   : -0.08624   Mean   :  0.0000   Mean   :  0.0000  
 3rd Qu.:  3.71578   3rd Qu.:  3.71578   3rd Qu.:  7.9778   3rd Qu.:  7.9778  
 Max.   : 21.12342   Max.   : 21.12342   Max.   : 44.7176   Max.   : 44.7176  
Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
np.random.seed(1)
obs = 10000
x = np.random.normal(size=obs)
u = np.random.normal(size=obs)
y = 5.5 * x + 12 * u

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

tb = pd.DataFrame({'x': x, 'u': u, 'y': y})
tb['yhat1'] = model.predict(X)
tb['uhat1'] = y - tb['yhat1']
tb['yhat2'] = model.params[0] + model.params[1] * x
tb['uhat2'] = y - tb['yhat2']

print(tb[['yhat1','yhat2', 'uhat1','uhat2']].describe())
              yhat1         yhat2         uhat1         uhat2
count  10000.000000  10000.000000  1.000000e+04  1.000000e+04
mean       0.166975      0.166975 -1.268319e-16 -1.268319e-16
std        5.682040      5.682040  1.201339e+01  1.201339e+01
min      -20.688875    -20.688875 -4.142425e+01 -4.142425e+01
25%       -3.659775     -3.659775 -8.199882e+00 -8.199882e+00
50%        0.159473      0.159473  4.497835e-02  4.497835e-02
75%        3.933075      3.933075  8.147307e+00  8.147307e+00
max       23.018769     23.018769  5.000751e+01  5.000751e+01
Stata
set seed 1 
clear 
qui set obs 10000 
gen x = rnormal() 
gen u  = rnormal() 
gen y  = 5.5*x + 12*u 
qui reg y x 
predict uhat1, residual 
predict yhat1 
gen yhat2 = -0.0750109  + 5.598296*x 
gen uhat2=y-yhat2 
sum yhat*  uhat* 
(option xb assumed; fitted values)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       yhat1 |     10,000   -.0797351    5.719914  -23.06927   22.47639
       yhat2 |     10,000   -.0797351    5.719913  -23.06927   22.47639
       uhat1 |     10,000   -3.53e-09    12.17487   -49.8996   42.15174
       uhat2 |     10,000   -1.72e-08    12.17487   -49.8996   42.15174

Properties of OLS

Properties of OLS

We can easily show that (remember from before)

\[\sum_i^n(y_i - \hat{\alpha} - \hat{\beta_1} x_i) = 0\]

And that

\[\sum_i^n \hat{\mu} = 0\]

The graph on the next slide shows a spherical (patternless) cloud, suggesting that the residual is not correlated with the fitted values (\(\hat{y_i}\))
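
A direct numerical check of these two facts (a sketch using the same simulated data as the code below): the residuals sum to zero and are uncorrelated with the fitted values, up to floating-point error.

Python
import numpy as np
import statsmodels.api as sm

np.random.seed(1)
x = np.random.normal(size=10000)
u = np.random.normal(size=10000)
y = 5.5 * x + 12 * u

model = sm.OLS(y, sm.add_constant(x)).fit()
uhat = model.resid
yhat = model.fittedvalues

print(uhat.sum())                        # ~0 (machine precision)
print(np.corrcoef(yhat, uhat)[0, 1])     # ~0 as well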

Properties of OLS

R
library(tidyverse)
set.seed(1)
tb <- tibble(
  x = rnorm(10000),
  u = rnorm(10000),
  y = 5.5*x + 12*u # notice that I am defining the beta1 here. The 5.5 is the "true" beta we want to estimate.
) 
reg_tb <- lm(y ~ x, data=tb) 
tb <- tb %>% 
  mutate(
    yhat1 = predict(lm(y ~ x, .)),
    yhat2 = reg_tb$coefficients[1] + reg_tb$coefficients[2]*x, 
    uhat1 = residuals(lm(y ~ x, .)),
    uhat2 = y - yhat2
  )
tb %>% 
  lm(uhat1 ~ yhat1 , .) %>% 
  ggplot(aes(x=yhat1, y=uhat1)) + 
  geom_point(size = 0.1, color = "black") +
  geom_smooth(method = lm, color = "black")

Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(1)
obs = 10000
x = np.random.normal(size=obs)
u = np.random.normal(size=obs)
y = 5.5 * x + 12 * u

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

tb = pd.DataFrame({'x': x, 'u': u, 'y': y})
tb['yhat1'] = model.predict(X)
tb['uhat1'] = y - tb['yhat1']
tb['yhat2'] = model.params[0] + model.params[1] * x
tb['uhat2'] = y - tb['yhat2']
model = sm.OLS(tb['uhat1'], sm.add_constant(tb['yhat1'])).fit()
# Create a scatter plot with a regression line
sns.set(style="whitegrid")
plt.figure(figsize=(7, 4.5))
sns.scatterplot(x='yhat1', y='uhat1', data=tb, size=0.05, color='black', alpha=0.5)
sns.regplot(x='yhat1', y='uhat1', data=tb, scatter=False, color='black')
plt.xlabel('yhat1')
plt.ylabel('uhat1')
plt.title('Scatter Plot of uhat1 vs. yhat1')
plt.show()

Stata
set seed 1 
set obs 10000 
gen x = rnormal() 
gen u  = rnormal() 
gen y  = 5.5*x + 12*u 
qui reg y x 
predict yhat1 
predict uhat1, residual 
twoway (lfit uhat1 yhat1 , lcolor(black) lwidth(large)) (scatter uhat1 yhat1 , mcolor(black)  msymbol(point))
qui graph export "files/graph4.svg", replace

Properties of OLS

Similarly, we can easily show that

\[\sum_i^nx_i(y_i - \hat{\alpha} - \hat{\beta_1} x_i) = 0\]

And that

\[\sum_i^nx_i\hat{\mu} = 0\]

Meaning that the sample covariance between X and the residual will always be zero.

Properties of OLS

R
library(tidyverse)
set.seed(1)
tb <- tibble(
  x = rnorm(10000),
  u = rnorm(10000),
  y = 5.5*x + 12*u # notice that I am defining the beta1 here. The 5.5 is the "true" beta we want to estimate.
) 
reg_tb <- lm(y ~ x, data=tb) 
tb <- tb %>% 
  mutate(
    yhat1 = predict(lm(y ~ x, .)),
    yhat2 = reg_tb$coefficients[1] + reg_tb$coefficients[2]*x, 
    uhat1 = residuals(lm(y ~ x, .)),
    uhat2 = y - yhat2
  )
tb %>% 
  lm(uhat1 ~ x , .) %>% 
  ggplot(aes(x=x, y=uhat1)) + 
  geom_point(size = 0.1, color = "black") +
  geom_smooth(method = lm, color = "black")

Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(1)
obs = 10000
x = np.random.normal(size=obs)
u = np.random.normal(size=obs)
y = 5.5 * x + 12 * u

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

tb = pd.DataFrame({'x': x, 'u': u, 'y': y})
tb['yhat1'] = model.predict(X)
tb['uhat1'] = y - tb['yhat1']
tb['yhat2'] = model.params[0] + model.params[1] * x
tb['uhat2'] = y - tb['yhat2']
model = sm.OLS(tb['uhat1'], sm.add_constant(tb['yhat1'])).fit()
# Create a scatter plot with a regression line
sns.set(style="whitegrid")
plt.figure(figsize=(7, 4.5))
sns.scatterplot(x='x', y='uhat1', data=tb, size=0.05, color='black', alpha=0.5)
sns.regplot(x='x', y='uhat1', data=tb, scatter=False, color='black')
plt.xlabel('x')
plt.ylabel('uhat1')
plt.title('Scatter Plot of uhat1 vs. x')
plt.show()

Stata
set seed 1 
set obs 10000 
gen x = rnormal() 
gen u  = rnormal() 
gen y  = 5.5*x + 12*u 
qui reg y x 
predict yhat1 
predict uhat1, residual 
twoway (lfit uhat1 x , lcolor(black) lwidth(large)) (scatter uhat1 x , mcolor(black) msymbol(point))
qui graph export "files/graph5.svg", replace

Properties of OLS

Let’s say you estimate a model and find the \(\hat{\mu}\).

If you calculate the correlation between the X and \(\hat{\mu}\), you will find zero.

This is by construction! It is not evidence that CMI is not violated.

In fact, the OLS “assumes” and “forces” zero correlation.

It is intuitive: if you are “forcing” zero correlation when the correlation is not in fact zero, your coefficients will be biased.

The previous graphs indeed show zero correlation. But that is expected by construction and does not suggest that CMI holds.

At the end of the day, CMI is untestable and unverifiable.
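
A sketch of this point (simulated data, hypothetical magnitudes): even when the true error is correlated with x, so that CMI is violated and the slope is biased, the estimated residual is still uncorrelated with x by construction.

Python
import numpy as np
import statsmodels.api as sm

np.random.seed(1)
n = 10000
x = np.random.normal(size=n)
u = 2 * x + np.random.normal(size=n)         # true error correlated with x: CMI violated
y = 1 + 5.5 * x + u

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.params[1])                       # ~7.5, biased away from the true 5.5
print(np.corrcoef(x, model.resid)[0, 1])     # ~0 by construction, despite the bias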

Goodness-of-fit

Goodness-of-fit

Understanding what SSR, SSE and SST mean

  • SSE = Sum of Squares Explained = \(\sum_i^n(\hat{y_i}-\bar{y})^2\)
  • SSR = Sum of Squares Residuals = \(\sum_i^n\hat{\mu}^2\)
  • SST = Sum of Squares Total = SSE + SSR = \(\sum_i^n(y_i-\bar{y})^2\)

R-squared is simply the ratio of portion explained over the total that could be explained.

\[R^2 = \frac{SSE}{SST} = 1-\frac{SSR}{SST}\]

Goodness-of-fit

Goodness-of-fit

You can think this way:

  1. If X does not explain Y, then the best predictor of Y is \(\bar{y}\). In that case, your model does not explain anything of Y, thus \(R^2\) is zero and \(\hat{y_i}=\bar{y}\) (see the sketch below).
  2. If X partially explains Y, then \(\hat{y_i} \neq \bar{y}\), meaning that \(\hat{y_i}\) has some inclination (like the figure next slide). This means that \(SSE>0\), so \(R^2>0\) but \(R^2<1\).
  3. Whatever is not explained by \(\hat{y_i}\) is left to \(\sum_i^n\hat{\mu}^2\), meaning that SSR will be non-zero.
  4. The \(R^2\) is the ratio of the portion that you can explain, \(\sum_i^n(\hat{y_i}-\bar{y})^2\), over the total that is to be explained, \(\sum_i^n(y_i-\bar{y})^2\).
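
A tiny sketch of point 1 above (simulated, illustrative): when x is unrelated to y, the fitted values barely move around \(\bar{y}\) and \(R^2\) is close to zero.

Python
import numpy as np
import statsmodels.api as sm

np.random.seed(1)
x = np.random.normal(size=10000)
y = np.random.normal(size=10000)           # y does not depend on x

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.rsquared)                      # ~0
print(model.fittedvalues.std(), y.mean())  # fitted values barely deviate from ybar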

Goodness-of-fit

R
library(foreign) # importing dataset from a stata dta file
data <- read.dta("files/CEOSAL1.dta")
attach(data)
# Statistics of salary 
mean(salary)
[1] 1281.12
R
# OLS model
model <- lm(salary ~ roe)
salaryhat <- fitted(model)                      # Predict values for dependent variable
uhat <- resid(model)                            # Predict regression residuals
salarymean <- rep(mean(salary),length(salary))  # Generating the mean of salary 
summary(model)

Call:
lm(formula = salary ~ roe)

Residuals:
    Min      1Q  Median      3Q     Max 
-1160.2  -526.0  -254.0   138.8 13499.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   963.19     213.24   4.517 1.05e-05 ***
roe            18.50      11.12   1.663   0.0978 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1367 on 207 degrees of freedom
Multiple R-squared:  0.01319,   Adjusted R-squared:  0.008421 
F-statistic: 2.767 on 1 and 207 DF,  p-value: 0.09777
Python
import pandas as pd
import statsmodels.api as sm
data = pd.read_stata("files/CEOSAL1.dta")
print(data['salary'].mean())
1281.1196172248804
Python
# OLS model
X = data['roe']
X = sm.add_constant(X)
y = data['salary']

model = sm.OLS(y, X).fit()  # Fit the linear regression model
salaryhat = model.fittedvalues  # Predicted values for the dependent variable
uhat = model.resid  # Predict regression residuals
salarymean = pd.Series([y.mean()] * len(y))  # Generating the mean of salary
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.767
Date:                qua, 20 set 2023   Prob (F-statistic):             0.0978
Time:                        10:39:50   Log-Likelihood:                -1804.5
No. Observations:                 209   AIC:                             3613.
Df Residuals:                     207   BIC:                             3620.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        963.1913    213.240      4.517      0.000     542.790    1383.592
roe           18.5012     11.123      1.663      0.098      -3.428      40.431
==============================================================================
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Stata
use "files/CEOSAL1.DTA" , replace
sum salary 
reg salary roe 
predict salaryhat , xb              
predict uhat, resid                 
egen salarymean = mean(salary)      
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      salary |        209     1281.12    1372.345        223      14822

      Source |       SS           df       MS      Number of obs   =       209
-------------+----------------------------------   F(1, 207)       =      2.77
       Model |  5166419.04         1  5166419.04   Prob > F        =    0.0978
    Residual |   386566563       207  1867471.32   R-squared       =    0.0132
-------------+----------------------------------   Adj R-squared   =    0.0084
       Total |   391732982       208  1883331.64   Root MSE        =    1366.6

------------------------------------------------------------------------------
      salary | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         roe |   18.50119   11.12325     1.66   0.098    -3.428196    40.43057
       _cons |   963.1913   213.2403     4.52   0.000     542.7902    1383.592
------------------------------------------------------------------------------

Goodness-of-fit

R
library(foreign) # importing dataset from a stata dta file
mydata <- read.dta("files/CEOSAL1.dta")
attach(mydata)
model <- lm(salary ~ roe)
salaryhat <- fitted(model)                      # Predict values for dependent variable
uhat <- resid(model)                            # Predict regression residuals
salarymean <- rep(mean(salary),length(salary))  # Generating the mean of salary 
# r-squared is simply the ratio of portion explained over total that could be explained - Understanding what SSR, SSE and SST mean 
plot(salary ~ roe)
abline(lm(salary ~ roe), col = "blue")
abline(lm(salarymean ~ roe), col = "red")

Python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = pd.read_stata("files/CEOSAL1.dta")
X = data['roe']
y = data['salary']
salarymean = np.repeat(y.mean(), len(y))
X_mean = X.mean()
y_mean = y.mean()
slope = np.sum((X - X_mean) * (y - y_mean)) / np.sum((X - X_mean) ** 2)
intercept = y_mean - slope * X_mean
salaryhat = slope * X + intercept
# Plotting the data and regression lines
plt.scatter(X, y,  alpha=0.7)
plt.plot(X, salaryhat,  color='blue', linewidth=2)
plt.plot(X, salarymean, color='red',  linewidth=2)
plt.xlabel('roe')
plt.ylabel('salary')
plt.show()

Stata
use "files/CEOSAL1.DTA" , replace
sum salary , d
reg salary roe 
predict salaryhat , xb              
predict uhat, resid                 
egen salarymean = mean(salary)      

twoway (scatter salary roe) (lfit salary roe) (lfit salarymean roe) 
qui graph export "files/graph4_6.svg", replace

Goodness-of-fit

Manually calculating \(R^2\)

R
library(foreign) # importing dataset from a stata dta file
mydata <- read.dta("files/CEOSAL1.dta")
attach(mydata)
model <- lm(salary ~ roe)
salaryhat <- fitted(model)                      # Predict values for dependent variable
uhat <- resid(model)                            # Predict regression residuals
salarymean <- rep(mean(salary),length(salary))  # Generating the mean of salary 

# r-squared is simply the ratio of portion explained over total that could be explained
ssr  <- sum(uhat^2)
ssrB <- sum((salary    - salaryhat)^2)
sst  <- sum((salary    - salarymean)^2)
sse  <- sum((salaryhat - salarymean)^2)
sse / sst
[1] 0.01318862
Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
data = pd.read_stata("files/CEOSAL1.dta")
X = data['roe']
y = data['salary']
X = sm.add_constant(X)  # Add a constant term (intercept)
model = sm.OLS(y, X).fit()
salaryhat = model.fittedvalues
uhat = model.resid
salarymean = np.repeat(y.mean(), len(y))
# Calculate R-squared
ssr = np.sum(uhat**2)
ssrB = np.sum((y - salaryhat)**2)
sst = np.sum((y - salarymean)**2)
sse = np.sum((salaryhat - salarymean)**2)
rsquared = sse / sst
print(rsquared)
0.013188624081034108
Stata
use "files/CEOSAL1.DTA" , replace
qui reg salary roe 
predict salaryhat , xb              
predict uhat, resid                 
egen salarymean = mean(salary)      
egen sst  = total((salary    - salarymean)^2)  
egen ssr  = total((salary    - salaryhat)^2)
egen ssrB = total(uhat^2)                   
egen sse  = total((salaryhat - salarymean)^2)   
di sse / sst
.01318862

THANK YOU!