Econometria Aplicada a Finanças

Part 5

Henrique C. Martins

Variance of coefficients

When we estimate coefficients we have some “error of estimation”.

Basically, you are searching the “true” coefficient using a sample, which should be representative of the population but it is not the population itself.
This means that the coefficient estimated is estimated with error.
We would like (e.g., we will need) to impose some “structure” to that error.

Variance of coefficients

Standard error and T-stat

To assess if the variables are significantly related, you need to assess the significance of \(\beta\) coefficients.

Using the example from Wooldridge, we know that the Beta of ROE is 18.591, while the standard error of ROE is 11.123.

The standard error is a measure of the accuracy of your estimate. If you find a large standard error, your estimate does not have good accuracy.
Ideally, you would find small standard errors, meaning that your coefficient is accurately estimated.
However, you do not have good control over the magnitude of the standard errors.

Variance of coefficients

Standard error and T-stat

If you have a large standard error, probably you coefficient will not be significantly different from zero. You can test whether your coefficient is significantly different from zero computing the t-statistics as follows:

\[t_{\beta} = \frac{\hat{\beta}}{se(\hat{\beta})}\]

If \(t_{\beta}\) is large enough, you can say that \(\beta\) is significantly different from zero. Usually, \(t_{\beta}\) larger than 2 is enough to be significant.

Variance of coefficients

In the previous example, you can find the t-stat manually as follows (\(t_{\beta} =\frac{\hat{\beta}}{se(\hat{\beta})}\)):

R
Python
Stata

library(foreign) # importing dataset from a stata dta file
data <- read.dta("files/CEOSAL1.dta")
attach(data)
# OLS model
model <- lm(salary ~ roe)
summary(model)$coefficients[2,1] / summary(model)$coefficients[2,2]

[1] 1.663289

summary(model)$coefficients[2,3]

[1] 1.663289

Python

import pandas as pd
import statsmodels.api as sm
data = pd.read_stata("files/CEOSAL1.dta")
# OLS model
X = data['roe']
X = sm.add_constant(X)
y = data['salary']
model = sm.OLS(y, X).fit()  
# Extract and calculate specific coefficients
coef_beta = model.params['roe']
coef_std_error = model.bse['roe']
# Calculate t-value
t_value = coef_beta / coef_std_error
# Print the coefficient and t-value
print("Coefficient (beta):", coef_beta)

Coefficient (beta): 18.501186345214926

Python

print("Standard Error:", coef_std_error)

Standard Error: 11.123250903287634

Python

print("t-value:", t_value)

t-value: 1.6632894920806505

Stata

use "files/CEOSAL1.DTA" , replace
qui reg salary roe 
local beta = _b[roe]
local std_error = _se[roe]
local t_value = `beta' / `std_error'
display "Coefficient (beta): " `beta'
display "Standard Error: " `std_error'
display "t-value: " `t_value'

Coefficient (beta): 18.501186

Standard Error: 11.123251

t-value: 1.6632895

Variance of coefficients

Naturally, the previous analysis requires an estimate of \(\beta\) and an estimate of the \(\beta\)’s standard error.

The standard error can be defined as:

\[se(\hat{\beta_1})=\frac{\hat{\sigma}}{\sqrt{SST_x}}\]

Where \(\hat{\sigma}\) is the standard deviation of the error term in the regression, which can be calculated as:

\[\hat{\sigma} = \sqrt{\frac{SSR}{n-2}}\]

- The $n-2$ here is an adjustment for the degrees of freedom in the regression.

SST is defined as before \(\sum_i^n(y_i-\hat{y_i})^2\)

library(foreign) # importing dataset from a stata dta file
data <- read.dta("files/CEOSAL1.dta")
attach(data)
# OLS model
model <- lm(salary ~ roe)
# Extract the standard error of the coefficient for 'roe'
summary(model)$coefficients["roe", "Std. Error"]

[1] 11.12325

#calculating manually
# Extract the residuals
residuals <- resid(model)
# Number of observations (n)
n <- length(residuals)
# Calculate the mean of the independent variable (roe)
roe_mean <- mean(roe)
# Calculate the sum of squared deviations of roe from its mean (SXX)
SST <- sum((roe - roe_mean)^2)
# Calculate the sum of squared errors (SSE)
SSR <- sum(residuals^2)
# Calculate the standard error of beta
Sd_beta <- sqrt(SSR / ((n - 2)))
# Calculate S.E
Se_beta <- Sd_beta / sqrt(SST)
# Print the standard error of beta
print(Se_beta)

[1] 11.12325

Python

import pandas as pd
import numpy as np
import statsmodels.api as sm
data = pd.read_stata("files/CEOSAL1.dta")
X = data['roe']
y = data['salary']
X = sm.add_constant(X)  
model = sm.OLS(y, X).fit()
# Extract the standard error of the coefficient for 'roe'
beta_se_summary = model.bse['roe']
print("Standard Error (from summary):", beta_se_summary)
# Calculate it manually
# Extract the residuals

Standard Error (from summary): 11.123250903287634

Python

residuals = model.resid
# Number of observations (n)
n = len(residuals)
# Calculate the mean of the independent variable (roe)
roe_mean = X['roe'].mean()
# Calculate the sum of squared deviations of roe from its mean (SST)
SST = np.sum((X['roe'] - roe_mean) ** 2)
# Calculate the sum of squared errors (SSE)
SSE = np.sum(residuals ** 2)
# Calculate the standard error of beta (Sd_beta)
Sd_beta = np.sqrt(SSE / (n - 2))
# Calculate SE_beta
SE_beta = Sd_beta / np.sqrt(SST)
print("Standard Error (manually calculated):", SE_beta)

Standard Error (manually calculated): 11.123250601798315

Stata

use "files/CEOSAL1.DTA" , replace
qui reg salary roe 
gen beta_se_summary = _se[roe]
gen n = _N
predict residuals, residuals
sum roe, meanonly
gen roe_diff = roe - r(mean)
egen roe_diff_sq = total(roe_diff^2)
gen residuals_sq = residuals^2
egen residuals_sq_sum = total(residuals_sq)
gen Sd_beta = sqrt(residuals_sq_sum / (n - 2))
gen SE_beta = Sd_beta / sqrt(roe_diff_sq)
display "Standard Error (from summary): " sum(beta_se_summary)
display "Standard Error (manually calculated): " sum(SE_beta)

Standard Error (from summary): 11.123251

Standard Error (manually calculated): 11.123251

Variance of coefficients

Another comment:

\[se(\hat{\beta_1})=\frac{\hat{\sigma}}{\sqrt{SST_x}}\]

The larger \(\hat{\sigma}\) is, the larger the variance of \(\beta\). That is, the more “noise” in the association between x and Y, the harder it is to learn something about \(\beta\).
However, more variation in x, the larger the SST, so the smaller is the variance of \(\beta\).

Robust standard errors

Looking at both equations below:

\[t_{\beta} = \frac{\hat{\beta}}{se(\hat{\beta})}\]

\[se(\hat{\beta_1})=\frac{\hat{\sigma}}{\sqrt{SST_x}}\]

What happens if \(\hat{\sigma}\) is not constant (for the values of x)?

In other words, how realistic is to assume that the variance in the errors is the same for all slices of x?

Can you think of an example where that may happen?

Robust standard errors

Earnings = f(education)

PhD have a higher variance of earnings than non-educated people.

Leveragge=f(Size)

It is quite possible that small firms will have less options of leverage than large companies.

This means that a sub-sample of large companies will have higher variance in the leverage decisions (and thus the error terms) than the sub-sample of small firms

Robust standard errors

One of the key assumptions in OLS estimators is homoscedasticity

That is, the assumption is that the variance of the errors is homoscedastic (constant variance in all slices of X).

It means that throughout all observations, the error term shows the same variance.

If errors are not homoscedastic, we have the heteroscedasticity problem.

Heteroskedasticity does not cause bias or inconsistency in the OLS estimators of the \(\beta\) like the OVB would.

It also does not affect the \(R^2\).

What Heteroscedasticity does is to bias the standard errors of the estimates.

Robust standard errors

Homoskedascticity = Constant \(\hat{\sigma}\) to all slices of X.

Heteroskedascticity = Non-constant \(\hat{\sigma}\) to all slices of X.

Without homoskedascticity, OLS no longer has the minimum mean squared errors, which means that the estimated standard errors are biased, which in turn creates bias in the t-stat and the inference you’ll make with your model.

Fortunately, we have an easy solution for that.

\[Var(\hat{\beta_1}) = \frac{\sum_i^n(x_i-\bar{x})^2\hat{\mu}^2}{SST^2_x}\]

This formula simply “includes” the heteroskedascticity in the calculation of \(Var(\hat{\beta_1})\), meaning this correct the estimated standard deviation to heteroskedascticity.

We call this correction as Robust Standard Errors (White Robust).

In other words, you should always use Robust Standard Errors. It is easy to use it with R.

Robust standard errors

Using Robust Standard-errors.

R
Python
Stata

library(sandwich)
library(foreign) 
library(lmtest)

data <- read.dta("files/CEOSAL1.DTA")
model <- lm(salary ~ roe, data = data)
robust_model <- coeftest(model, vcov = vcovHC(model, type = "HC3"))
SE_beta_robust <- robust_model["roe", "Std. Error"]
cat("Robust Standard Error :", SE_beta_robust, "\n")

Robust Standard Error : 6.899434

Python

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
data = pd.read_stata("files/CEOSAL1.DTA")
model = smf.ols("salary ~ roe", data=data)
results = model.fit(cov_type='HC3')  
SE_beta_robust = results.bse['roe']
print("Robust Standard Error :", SE_beta_robust)

Robust Standard Error : 6.899433834797476

Stata

use "files/CEOSAL1.DTA" , replace

qui reg salary roe 
gen beta_se_non = _se[roe]

qui reg salary roe , robust
gen beta_se_summary = _se[roe]

di "Standard Error (non-robust): " sum(beta_se_non)
di "Standard Error (robust): " sum(beta_se_summary)

Standard Error (non-robust): 11.123251

Standard Error (robust): 6.8294482

Robust standard errors

Notice that the standard errors have changed quite significantly in this example.

Usually, the robust standard errors are larger than the traditional ones in empirical works.

But, in this example, they are smaller.

Perhaps more importantly:

Once the S.e. change, you should expect that the t-stat of the estimates also change.

Final comment: robust standard errors are robust in the case of homoskedasticity.

Warning

Thus, you should always use robust S.E.

Clustered standard errors

Almost always, someone will ask you whether you clustered your standard errors.

The intuition is the following:

When you do not cluster, you are assuming that all observations are independently and identically distributed (i.i.d.), which may or may not be true.
Imagine you are studying the effect of class size on students achievement.
How much of a effect would have the teacher of a class?

In this design, the teacher influences the achievement of all the students in the same class, and one teacher cannot be at two classes at the same time.
Thus, it would be wise to cluster the errors at the class-level. This assumes that the residual of each individual is clustered with the other individuals in the same class.

In principle, clustering solves any form of dependence of the residuals in your data.

Clustered standard errors

In corporate finance/accounting research panel data research, the tradition is to cluster at the firm-level.

The reason is that the observations of the same firm are not independent trough time, thus are correlated.

But, there is a lot of debate about this decision.

The tip is to cluster where the randomness exist. That is quite subjective. In the class size example, the randomness comes out of the teacher, since each teacher has their own ways of teaching (materials, resources, etc.).

But, it is a good practice to stress this decision a bit in your own research by also showing results with clustered s.e. at the industry-level.

Final tip: usually the minimum number of cluster is about 30. Less than that might be insufficient (but, again, the guidance in this topic is very subjective).

Clustered standard errors

The clustered standard errors are different because I am fabricating the clusters here for the sake of the coding.

In your real research, you would have the cluster at hands.

R
Python
Stata

library(sandwich)
library(foreign) 
library(lmtest)
library(plm)

data <- read.dta("files/CEOSAL1.DTA")
model <- lm(salary ~ roe, data = data)
robust_model <- coeftest(model, vcov = vcovHC(model, type = "HC3"))
#clustered
data$cluster <- rep(1:35, length.out = nrow(data))
model <- plm(salary ~ roe, data = data, index = c("cluster"))

clustered_se <- vcovHC(model, type = "HC3", cluster = "group")
SE_beta_clustered <- sqrt(clustered_se["roe", "roe"])

cat("Standard Error (robust):", SE_beta_robust, "\n")

Standard Error (robust): 6.899434

cat("Standard Error (clustered)::", SE_beta_clustered, "\n")

Standard Error (clustered):: 12.98492

Python

import pandas as pd
import statsmodels.api as sm

# Read the dataset
data = pd.read_stata("files/CEOSAL1.DTA")

# Create a new variable 'cluster' with cluster numbers ranging from 1 to 35
data['cluster'] = list(range(1, 36)) * (len(data) // 35)

# Fit the linear regression model

Error: ValueError: Length of values (175) does not match length of index (209)

Python

model = sm.OLS(data['salary'], sm.add_constant(data['roe'])).fit()

# Compute robust standard errors
robust_model = model.get_robustcov_results(cov_type='HC3')
SE_beta_robust = robust_model.cov_params().loc['roe', 'roe'] ** 0.5

# Fit the linear regression model with clustered standard errors

Error: AttributeError: 'numpy.ndarray' object has no attribute 'loc'

Python

model_clustered = sm.OLS(data['salary'], sm.add_constant(data['roe'])).fit(cov_type='cluster', cov_kwds={'groups': data['cluster']})

# Extract the clustered standard errors for 'roe'

Error: KeyError: 'cluster'

Python

clustered_se = model_clustered.HC3_se.loc['roe']

Error: NameError: name 'model_clustered' is not defined

Python

print("Robust Standard Error (HC3):", SE_beta_robust)

Robust Standard Error (HC3): 6.899433834797476

Python

print("Clustered Standard Error (HC3):", clustered_se)

Error: NameError: name 'clustered_se' is not defined

Stata

use "files/CEOSAL1.DTA" , replace

qui reg salary roe 
gen beta_se_non = _se[roe]

qui reg salary roe , robust
gen beta_se_summary = _se[roe]

egen cluster = seq(), block(6)
qui regress salary roe , vce(cluster cluster)
gen SE_beta_clustered = _se[roe]

di "Standard Error (non-robust): " sum(beta_se_non)
di "Standard Error (robust): " sum(beta_se_summary)
di "Standard Error (clustered): " sum(SE_beta_clustered)

Standard Error (non-robust): 11.123251

Standard Error (robust): 6.8294482

Standard Error (clustered): 6.4122176

THANK YOU!

Henrique Castro Martins