R
[1] 1.663289
[1] 1.663289
Part 5
When we estimate coefficients we have some “error of estimation”.
Basically, you are searching the “true” coefficient using a sample, which should be representative of the population but it is not the population itself.
This means that the coefficient estimated is estimated with error.
We would like (e.g., we will need) to impose some “structure” to that error.
Standard error and T-stat
To assess if the variables are significantly related, you need to assess the significance of \(\beta\) coefficients.
Using the example from Wooldridge, we know that the Beta of ROE is 18.591
, while the standard error of ROE is 11.123
.
The standard error is a measure of the accuracy of your estimate. If you find a large standard error, your estimate does not have good accuracy.
Ideally, you would find small standard errors, meaning that your coefficient is accurately estimated.
However, you do not have good control over the magnitude of the standard errors.
Standard error and T-stat
If you have a large standard error, probably you coefficient will not be significantly different from zero. You can test whether your coefficient is significantly different from zero computing the t-statistics as follows:
\[t_{\beta} = \frac{\hat{\beta}}{se(\hat{\beta})}\]
If \(t_{\beta}\) is large enough, you can say that \(\beta\) is significantly different from zero. Usually, \(t_{\beta}\) larger than 2 is enough to be significant.
In the previous example, you can find the t-stat manually as follows (\(t_{\beta} =\frac{\hat{\beta}}{se(\hat{\beta})}\)):
import pandas as pd
import statsmodels.api as sm
data = pd.read_stata("files/CEOSAL1.dta")
# OLS model
X = data['roe']
X = sm.add_constant(X)
y = data['salary']
model = sm.OLS(y, X).fit()
# Extract and calculate specific coefficients
coef_beta = model.params['roe']
coef_std_error = model.bse['roe']
# Calculate t-value
t_value = coef_beta / coef_std_error
# Print the coefficient and t-value
print("Coefficient (beta):", coef_beta)
Coefficient (beta): 18.501186345214926
Standard Error: 11.123250903287634
t-value: 1.6632894920806505
Coefficient (beta): 18.501186
Standard Error: 11.123251
t-value: 1.6632895
Naturally, the previous analysis requires an estimate of \(\beta\) and an estimate of the \(\beta\)’s standard error.
The standard error can be defined as:
\[se(\hat{\beta_1})=\frac{\hat{\sigma}}{\sqrt{SST_x}}\]
\[\hat{\sigma} = \sqrt{\frac{SSR}{n-2}}\]
- The $n-2$ here is an adjustment for the degrees of freedom in the regression.
[1] 11.12325
#calculating manually
# Extract the residuals
residuals <- resid(model)
# Number of observations (n)
n <- length(residuals)
# Calculate the mean of the independent variable (roe)
roe_mean <- mean(roe)
# Calculate the sum of squared deviations of roe from its mean (SXX)
SST <- sum((roe - roe_mean)^2)
# Calculate the sum of squared errors (SSE)
SSR <- sum(residuals^2)
# Calculate the standard error of beta
Sd_beta <- sqrt(SSR / ((n - 2)))
# Calculate S.E
Se_beta <- Sd_beta / sqrt(SST)
# Print the standard error of beta
print(Se_beta)
[1] 11.12325
import pandas as pd
import numpy as np
import statsmodels.api as sm
data = pd.read_stata("files/CEOSAL1.dta")
X = data['roe']
y = data['salary']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
# Extract the standard error of the coefficient for 'roe'
beta_se_summary = model.bse['roe']
print("Standard Error (from summary):", beta_se_summary)
# Calculate it manually
# Extract the residuals
Standard Error (from summary): 11.123250903287634
residuals = model.resid
# Number of observations (n)
n = len(residuals)
# Calculate the mean of the independent variable (roe)
roe_mean = X['roe'].mean()
# Calculate the sum of squared deviations of roe from its mean (SST)
SST = np.sum((X['roe'] - roe_mean) ** 2)
# Calculate the sum of squared errors (SSE)
SSE = np.sum(residuals ** 2)
# Calculate the standard error of beta (Sd_beta)
Sd_beta = np.sqrt(SSE / (n - 2))
# Calculate SE_beta
SE_beta = Sd_beta / np.sqrt(SST)
print("Standard Error (manually calculated):", SE_beta)
Standard Error (manually calculated): 11.123250601798315
use "files/CEOSAL1.DTA" , replace
qui reg salary roe
gen beta_se_summary = _se[roe]
gen n = _N
predict residuals, residuals
sum roe, meanonly
gen roe_diff = roe - r(mean)
egen roe_diff_sq = total(roe_diff^2)
gen residuals_sq = residuals^2
egen residuals_sq_sum = total(residuals_sq)
gen Sd_beta = sqrt(residuals_sq_sum / (n - 2))
gen SE_beta = Sd_beta / sqrt(roe_diff_sq)
display "Standard Error (from summary): " sum(beta_se_summary)
display "Standard Error (manually calculated): " sum(SE_beta)
Standard Error (from summary): 11.123251
Standard Error (manually calculated): 11.123251
Another comment:
\[se(\hat{\beta_1})=\frac{\hat{\sigma}}{\sqrt{SST_x}}\]
The larger \(\hat{\sigma}\) is, the larger the variance of \(\beta\). That is, the more “noise” in the association between x and Y, the harder it is to learn something about \(\beta\).
However, more variation in x, the larger the SST, so the smaller is the variance of \(\beta\).
Looking at both equations below:
\[t_{\beta} = \frac{\hat{\beta}}{se(\hat{\beta})}\]
\[se(\hat{\beta_1})=\frac{\hat{\sigma}}{\sqrt{SST_x}}\]
What happens if \(\hat{\sigma}\) is not constant (for the values of x)?
In other words, how realistic is to assume that the variance in the errors is the same for all slices of x?
Can you think of an example where that may happen?
Earnings = f(education)
PhD have a higher variance of earnings than non-educated people.
Leveragge=f(Size)
It is quite possible that small firms will have less options of leverage than large companies.
This means that a sub-sample of large companies will have higher variance in the leverage decisions (and thus the error terms) than the sub-sample of small firms
One of the key assumptions in OLS estimators is homoscedasticity
That is, the assumption is that the variance of the errors is homoscedastic (constant variance in all slices of X).
It means that throughout all observations, the error term shows the same variance.
If errors are not homoscedastic, we have the heteroscedasticity problem.
Heteroskedasticity does not cause bias or inconsistency in the OLS estimators of the \(\beta\) like the OVB would.
It also does not affect the \(R^2\).
What Heteroscedasticity does is to bias the standard errors of the estimates.
Homoskedascticity = Constant \(\hat{\sigma}\) to all slices of X.
Heteroskedascticity = Non-constant \(\hat{\sigma}\) to all slices of X.
Without homoskedascticity, OLS no longer has the minimum mean squared errors, which means that the estimated standard errors are biased, which in turn creates bias in the t-stat and the inference you’ll make with your model.
Fortunately, we have an easy solution for that.
\[Var(\hat{\beta_1}) = \frac{\sum_i^n(x_i-\bar{x})^2\hat{\mu}^2}{SST^2_x}\]
This formula simply “includes” the heteroskedascticity in the calculation of \(Var(\hat{\beta_1})\), meaning this correct the estimated standard deviation to heteroskedascticity.
We call this correction as Robust Standard Errors (White Robust).
In other words, you should always use Robust Standard Errors. It is easy to use it with R.
Using Robust Standard-errors.
library(sandwich)
library(foreign)
library(lmtest)
data <- read.dta("files/CEOSAL1.DTA")
model <- lm(salary ~ roe, data = data)
robust_model <- coeftest(model, vcov = vcovHC(model, type = "HC3"))
SE_beta_robust <- robust_model["roe", "Std. Error"]
cat("Robust Standard Error :", SE_beta_robust, "\n")
Robust Standard Error : 6.899434
Robust Standard Error : 6.899433834797476
Standard Error (non-robust): 11.123251
Standard Error (robust): 6.8294482
Notice that the standard errors have changed quite significantly in this example.
Usually, the robust standard errors are larger than the traditional ones in empirical works.
But, in this example, they are smaller.
Perhaps more importantly:
Once the S.e. change, you should expect that the t-stat of the estimates also change.
Final comment: robust standard errors are robust in the case of homoskedasticity.
Almost always, someone will ask you whether you clustered your standard errors.
The intuition is the following:
When you do not cluster, you are assuming that all observations are independently and identically distributed (i.i.d.), which may or may not be true.
Imagine you are studying the effect of class size on students achievement.
How much of a effect would have the teacher of a class?
In this design, the teacher influences the achievement of all the students in the same class, and one teacher cannot be at two classes at the same time.
Thus, it would be wise to cluster the errors at the class-level. This assumes that the residual of each individual is clustered with the other individuals in the same class.
In principle, clustering solves any form of dependence of the residuals in your data.
In corporate finance/accounting research panel data research, the tradition is to cluster at the firm-level.
But, there is a lot of debate about this decision.
The tip is to cluster where the randomness exist. That is quite subjective. In the class size example, the randomness comes out of the teacher, since each teacher has their own ways of teaching (materials, resources, etc.).
But, it is a good practice to stress this decision a bit in your own research by also showing results with clustered s.e. at the industry-level.
Final tip: usually the minimum number of cluster is about 30. Less than that might be insufficient (but, again, the guidance in this topic is very subjective).
The clustered standard errors are different because I am fabricating the clusters here for the sake of the coding.
In your real research, you would have the cluster at hands.
library(sandwich)
library(foreign)
library(lmtest)
library(plm)
data <- read.dta("files/CEOSAL1.DTA")
model <- lm(salary ~ roe, data = data)
robust_model <- coeftest(model, vcov = vcovHC(model, type = "HC3"))
#clustered
data$cluster <- rep(1:35, length.out = nrow(data))
model <- plm(salary ~ roe, data = data, index = c("cluster"))
clustered_se <- vcovHC(model, type = "HC3", cluster = "group")
SE_beta_clustered <- sqrt(clustered_se["roe", "roe"])
cat("Standard Error (robust):", SE_beta_robust, "\n")
Standard Error (robust): 6.899434
Standard Error (clustered):: 12.98492
Error: ValueError: Length of values (175) does not match length of index (209)
Error: AttributeError: 'numpy.ndarray' object has no attribute 'loc'
Error: KeyError: 'cluster'
Error: NameError: name 'model_clustered' is not defined
Robust Standard Error (HC3): 6.899433834797476
Error: NameError: name 'clustered_se' is not defined
use "files/CEOSAL1.DTA" , replace
qui reg salary roe
gen beta_se_non = _se[roe]
qui reg salary roe , robust
gen beta_se_summary = _se[roe]
egen cluster = seq(), block(6)
qui regress salary roe , vce(cluster cluster)
gen SE_beta_clustered = _se[roe]
di "Standard Error (non-robust): " sum(beta_se_non)
di "Standard Error (robust): " sum(beta_se_summary)
di "Standard Error (clustered): " sum(SE_beta_clustered)
Standard Error (non-robust): 11.123251
Standard Error (robust): 6.8294482
Standard Error (clustered): 6.4122176