7  Real research

7.1 OLS, Panel, Fixed effects, and Random Effects

library(ggplot2)
library(ggthemes)
library(readxl) 
library(jtools) # for nice tables of models - https://cran.r-project.org/web/packages/jtools/vignettes/summ.html#summ
data <- read_excel("files/data.xls")

After uploading the data, tell R the data is a panel. You need two identifiers: one for the cross-sectional part (usually firms, in accounting & finance research) and the other for the time-series part (usually year, but sometimes quarter). The following example uses a panel with annual data.

I am also creating an identifier for industries, which might be useful.

library(dplyr)
data <- data %>% group_by(id)                 %>% dplyr::mutate(id_firm = cur_group_id())
data <- data %>% group_by(setor_economatica)  %>% dplyr::mutate(id_ind  = cur_group_id())

Now, I am telling R that the dataset has a panel structure.

library(plm)
data <- pdata.frame(data, index=c("id_firm","year"))

Let’s attach the dataset to make things easier.

attach(data)

Let’s inspect the dataset. First, let’s see how many observations we have by country:

library('plyr')
count(country)
   x freq
1 AR 1497
2 BR 6804
3 CL 3050
4 CO  483
5 MX 2315
6 PE 1928

Now, by year.

count(year)
      x freq
1  1995  624
2  1996  676
3  1997  743
4  1998  749
5  1999  738
6  2000  739
7  2001  681
8  2002  671
9  2003  666
10 2004  687
11 2005  679
12 2006  685
13 2007  739
14 2008  722
15 2009  727
16 2010  717
17 2011  719
18 2012  706
19 2013  701
20 2014  690
21 2015  669
22 2016  663
23 2017  686

Now, by industry.

count(setor_economatica)
                   x freq
1       Agro e Pesca  832
2    Alimentos e Beb 1537
3           Comércio 1117
4         Construção  806
5  Eletroeletrônicos  332
6   Energia Elétrica 1539
7    Máquinas Indust  259
8          Mineração  693
9   Minerais não Met  587
10            Outros 2860
11  Papel e Celulose  299
12    Petróleo e Gas  522
13           Química  859
14 Siderur & Metalur 1104
15  Software e Dados   75
16  Telecomunicações  770
17            Textil  845
18 Transporte Serviç  517
19  Veiculos e peças  524

7.2 Generating variables

Let’s generate some variables to use later.

data$lev1    <- Debt / (Debt + Equity.market.value) # market leverage
data$lev2    <- Debt / Total.Assets                 # book leverage
data$wc_ta   <- wc / Total.Assets                   # working capital over assets
data$cash_ta <- cash / Total.Assets                 # cash holdings over assets
data$div_ta  <- Dividends / Total.Assets            # dividends over assets
data$fcf_ta  <- Free.cash.flow / Total.Assets       # free cash flow over assets
data$tang_ta <- tangible / Total.Assets             # tangibility
data$roa2    <- roa / 100                           # ROA as a fraction
library(SciViews)
data$size1   <- ln(Total.Assets)                    # size: natural log of total assets

Notice that most variables are ratios and thus are expected to fall in the 0-100% range.

There is one important exception: Size. Size in this example is the natural logarithm of Total Assets (sometimes, studies use the natural logarithm of Total Sales). Taking the natural logarithm is a traditional decision in this type of research.

So, why did I do that in Size but not in the others?

In many cases, you want to use relative differences or percentage changes. For instance, GDP growth is usually measured as the growth of each year over the previous year. The same is true for stock prices.

\[GDP\;growth = \frac{GDP_t - GDP_{t-1}}{GDP_{t-1}}\]

This is possible because you always have a previous value to use as base for comparison.

In many cases, you do not have such a base for comparison. This is the case of a firm’s Total Assets. What is the natural base for Total Assets? There is no answer.

Therefore, you take the natural logarithm so that you minimize this concern.

Additionally, keep in mind that the empirical values of Total Assets are often very dissimilar and of huge magnitude. Consider two firms, one with 1 million in total assets, the other with 1 billion. Assume that both firms increase their total assets by 1 million over one year. The first firm doubles in size, but 1 million makes little difference to a billion-dollar firm. Taking the logarithm puts the firms on a comparable scale and mitigates the magnitude effect of these changes.
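A quick numeric sketch of this point, using base R’s log() (the ln() from SciViews used above is the same natural logarithm):

log(2e6) - log(1e6)       # ~0.69: the 1-million firm doubled in size
log(1.001e9) - log(1e9)   # ~0.001: almost no change for the 1-billion firm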


7.3 Summary statistics

library(vtable)
sumtable(data, vars = c('lev1' , 'lev2' , 'wc_ta', 'cash_ta',  'size1'  , 'fcf_ta' , 'div_ta', 'roa' , 'tang_ta'))
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
lev1 12708 0.35 0.28 0 0.11 0.55 1
lev2 12706 0.41 3.7 0 0.095 0.37 244
wc_ta 15584 -0.025 2.5 -139 -0.015 0.22 1
cash_ta 15056 0.073 0.1 0 0.01 0.096 1
size1 15686 20 2 7.3 18 21 27
fcf_ta 12426 0.091 7.4 -35 0.019 0.066 818
div_ta 10295 0.023 1.4 -0.073 0 0.0043 136
roa 13689 1.6 378 -3656 -0.47 2.4 43650
tang_ta 15600 0.4 0.25 0 0.19 0.59 1.1

7.4 Winsorization

Some of the variables have weird values. For instance, the variable \(\frac{Free\;cash\;flow}{Total\;Assets}\) has a maximum value of 818. It means that there is one firm-year observation in which the free cash flow of that firm in that year is 818 times the firm’s total assets. This is very odd, to say the least.

To avoid this type of problem, we can winsorize our variables, i.e., replace the values below the 1st percentile and above the 99th percentile with the values at those percentiles.

library(DescTools)
data$w_lev1     <- Winsorize(data$lev1   , probs = c(0.01, 0.99) , na.rm = TRUE) 
data$w_lev2     <- Winsorize(data$lev2   , probs = c(0.01, 0.99) , na.rm = TRUE) 
data$w_wc_ta    <- Winsorize(data$wc_ta  , probs = c(0.01, 0.99) , na.rm = TRUE) 
data$w_cash_ta  <- Winsorize(data$cash_ta, probs = c(0.01, 0.99) , na.rm = TRUE) 
data$w_size1    <- Winsorize(data$size1  , probs = c(0.01, 0.99) , na.rm = TRUE) 
data$w_fcf_ta   <- Winsorize(data$fcf_ta , probs = c(0.01, 0.99) , na.rm = TRUE) 
data$w_div_ta   <- Winsorize(data$div_ta , probs = c(0.01, 0.99) , na.rm = TRUE) 
data$w_roa      <- Winsorize(data$roa    , probs = c(0.01, 0.99) , na.rm = TRUE) 
data$w_tang_ta  <- Winsorize(data$tang_ta, probs = c(0.01, 0.99) , na.rm = TRUE) 
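To check that the winsorization worked as intended, we can re-run the summary statistics on the winsorized variables; the extreme values (such as the maximum of 818 in fcf_ta) should now be capped at the 99th percentile.

sumtable(data, vars = c('w_lev1', 'w_lev2', 'w_wc_ta', 'w_cash_ta', 'w_size1', 'w_fcf_ta', 'w_div_ta', 'w_roa', 'w_tang_ta'))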

7.5 Panel data

Multiple time periods are always helpful. That is, in a panel dataset, you have many observations of the same unit in different time periods. This is helpful because you observe the evolution of each unit. It also allows you to control for time-invariant unobserved heterogeneity at the unit level.

A general representation of a regression using panel data is:

\[Y_{t,i} = \alpha + \beta \times X_{t,i} + \epsilon_{t,i}\]

Notice that each variable has two subscripts: t indexes the time period (here, years) and i indexes the unit (here, firms). This type of notation is common in empirical research.
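Since we declared the panel structure with pdata.frame() earlier, plm can summarize the panel’s dimensions: the number of units, the number of periods, and whether the panel is balanced (given the varying observation counts per year above, we should expect an unbalanced panel).

pdim(data)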


7.6 Functional form of relationships

As mentioned before, in many cases, you want to use the logarithm of a variable. This changes the interpretation of the coefficients you estimate.

There are three common scenarios for a regression’s functional form.

  • log-log regression: both Y and X are in logs: \(ln(Y) = \alpha + \beta \times ln(X) + \epsilon\). The interpretation of \(\beta\) in this case is: \(y\) is \(\beta\) percent higher on average for observations with one percent higher \(x\).

  • log-level regression: only Y is in logs: \(ln(Y) = \alpha + \beta \times X + \epsilon\). The interpretation of \(\beta\) in this case is: \(y\) is \(100 \times \beta\) percent higher on average for observations with one unit higher \(x\).

  • level-log regression: only X is in logs: \(Y = \alpha + \beta \times ln(X) + \epsilon\). The interpretation of \(\beta\) in this case is: \(y\) is \(\frac{\beta}{100}\) units higher on average for observations with one percent higher \(x\) (see the sketch after this list).
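To make the level-log case concrete, here is a minimal sketch using variables we already created (fform is just an illustrative name). Since size1 is the natural log of Total Assets, regressing lev1 on size1 is a level-log regression in Total Assets:

fform <- lm(lev1 ~ size1, data = data)
coef(fform)["size1"] / 100   # units of lev1 associated with 1% higher Total Assets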

Important note: misspecifying the functional form of the x’s is similar to omitted variable bias (OVB): the omitted transformation (for instance, a missing log or squared term) ends up in the error term and is correlated with the included x’s.


7.7 OLS

This is a plain OLS model using the original data (i.e., not winsorized yet).

ols <- lm(lev1 ~ size1 + fcf_ta + roa + tang_ta + cash_ta + div_ta, data = data)
summary(ols)

Call:
lm(formula = lev1 ~ size1 + fcf_ta + roa + tang_ta + cash_ta + 
    div_ta, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.44535 -0.19888 -0.04651  0.16032  1.40432 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.024252   0.031814  -0.762 0.445911    
size1        0.018368   0.001564  11.741  < 2e-16 ***
fcf_ta       0.160553   0.105327   1.524 0.127473    
roa         -0.001686   0.001054  -1.600 0.109642    
tang_ta      0.049239   0.012935   3.807 0.000142 ***
cash_ta     -0.476372   0.031985 -14.893  < 2e-16 ***
div_ta      -1.112794   0.106868 -10.413  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2462 on 6878 degrees of freedom
  (9192 observations deleted due to missingness)
Multiple R-squared:  0.07463,   Adjusted R-squared:  0.07382 
F-statistic: 92.45 on 6 and 6878 DF,  p-value: < 2.2e-16

A nicer looking table.

library(jtools)
export_summs(ols, digits = 3 , model.names = c("OLS") )
                    OLS
(Intercept)      -0.024
                 (0.032)
size1             0.018 ***
                 (0.002)
fcf_ta            0.161
                 (0.105)
roa              -0.002
                 (0.001)
tang_ta           0.049 ***
                 (0.013)
cash_ta          -0.476 ***
                 (0.032)
div_ta           -1.113 ***
                 (0.107)
N                  6885
R2                0.075
*** p < 0.001; ** p < 0.01; * p < 0.05.

Now, we are using the winsorized data.

w_ols <- lm(w_lev1 ~ w_size1 + w_fcf_ta + w_roa + w_tang_ta + w_cash_ta + w_div_ta, data = data)
#summary(w_ols)
export_summs(ols,w_ols ,model.names = c("OLS","OLS (winsorized)"))
                  OLS           OLS (winsorized)
(Intercept)      -0.02         -0.17 ***
                 (0.03)        (0.03)
size1             0.02 ***
                 (0.00)
fcf_ta            0.16
                 (0.11)
roa              -0.00
                 (0.00)
tang_ta           0.05 ***
                 (0.01)
cash_ta          -0.48 ***
                 (0.03)
div_ta           -1.11 ***
                 (0.11)
w_size1                         0.03 ***
                               (0.00)
w_fcf_ta                       -0.03
                               (0.11)
w_roa                          -0.01 ***
                               (0.00)
w_tang_ta                       0.06 ***
                               (0.01)
w_cash_ta                      -0.50 ***
                               (0.03)
w_div_ta                       -3.03 ***
                               (0.19)
N                 6885          6885
R2                0.07          0.15
*** p < 0.001; ** p < 0.01; * p < 0.05.

7.8 Heterogeneity across countries

There is a huge literature on capital structure decisions. Let’s look at the heterogeneity of leverage across countries: if average leverage differs systematically by country, it makes sense to control for this type of heterogeneity somehow.

library(gplots)
plotmeans(lev1 ~ country, data = data)
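We can also tabulate the country means directly (a quick check with dplyr; note the dplyr:: prefix on summarise, since plyr, loaded earlier, masks it):

data %>% group_by(country) %>% dplyr::summarise(mean_lev1 = mean(lev1, na.rm = TRUE))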


7.9 FE Country

The following example uses fixed effects for countries. Notice that one country (AR, the first level of the factor) is missing: it is the omitted baseline category.

fe_c <- lm(lev1 ~ size1 + fcf_ta + roa + tang_ta + cash_ta + div_ta  + factor(country) , data = data)
export_summs(fe_c)
                       Model 1
(Intercept)           -0.02
                      (0.03)
size1                  0.01 ***
                      (0.00)
fcf_ta                 0.11
                      (0.10)
roa                   -0.00
                      (0.00)
tang_ta                0.09 ***
                      (0.01)
cash_ta               -0.48 ***
                      (0.03)
div_ta                -1.14 ***
                      (0.11)
factor(country)BR      0.10 ***
                      (0.01)
factor(country)CL      0.02
                      (0.01)
factor(country)CO      0.11 ***
                      (0.03)
factor(country)MX      0.03 **
                      (0.01)
factor(country)PE      0.09 ***
                      (0.01)
N                      6885
R2                     0.10
*** p < 0.001; ** p < 0.01; * p < 0.05.

Now, we are using fixed effects for countries and years.

fe_cy <- lm(lev1 ~ size1 + fcf_ta + roa + tang_ta + cash_ta + div_ta  + factor(country) + factor(year), data = data)
export_summs(fe_c, fe_cy)
                       Model 1       Model 2
(Intercept)           -0.02         -0.01
                      (0.03)        (0.04)
size1                  0.01 ***      0.02 ***
                      (0.00)        (0.00)
fcf_ta                 0.11          0.08
                      (0.10)        (0.10)
roa                   -0.00         -0.00
                      (0.00)        (0.00)
tang_ta                0.09 ***      0.09 ***
                      (0.01)        (0.01)
cash_ta               -0.48 ***     -0.44 ***
                      (0.03)        (0.03)
div_ta                -1.14 ***     -1.12 ***
                      (0.11)        (0.10)
factor(country)BR      0.10 ***      0.10 ***
                      (0.01)        (0.01)
factor(country)CL      0.02          0.01
                      (0.01)        (0.01)
factor(country)CO      0.11 ***      0.08 *
                      (0.03)        (0.03)
factor(country)MX      0.03 **       0.01
                      (0.01)        (0.01)
factor(country)PE      0.09 ***      0.05 **
                      (0.01)        (0.02)
factor(year)1996                    -0.03
                                    (0.03)
factor(year)1997                    -0.08 *
                                    (0.03)
factor(year)1998                     0.02
                                    (0.03)
factor(year)1999                    -0.03
                                    (0.03)
factor(year)2000                     0.03
                                    (0.03)
factor(year)2001                     0.04
                                    (0.03)
factor(year)2002                     0.04
                                    (0.03)
factor(year)2003                    -0.01
                                    (0.03)
factor(year)2004                    -0.06
                                    (0.03)
factor(year)2005                    -0.08 *
                                    (0.03)
factor(year)2006                    -0.10 ***
                                    (0.03)
factor(year)2007                    -0.12 ***
                                    (0.03)
factor(year)2008                    -0.03
                                    (0.03)
factor(year)2009                    -0.12 ***
                                    (0.03)
factor(year)2010                    -0.13 ***
                                    (0.03)
factor(year)2011                    -0.08 **
                                    (0.03)
factor(year)2012                    -0.07 **
                                    (0.03)
factor(year)2013                    -0.06 *
                                    (0.03)
factor(year)2014                    -0.05
                                    (0.03)
factor(year)2015                     0.00
                                    (0.03)
factor(year)2016                    -0.04
                                    (0.03)
factor(year)2017                    -0.07 **
                                    (0.03)
N                      6885          6885
R2                     0.10          0.13
*** p < 0.001; ** p < 0.01; * p < 0.05.

7.10 FE Firms

The most common regression we estimate in Accounting & Finance is a panel using firms as units. Imagine that you include one dummy for each firm, just as we included a dummy for each country in the previous example. This is the way to do it.

The strength of a firm fixed effect model is that it controls for time-invariant unobserved heterogeneity at the firm level.

That is, if there is an omitted variable in your model that is constant over time and that you cannot measure because you do not observe it (for instance, the quality of the managerial team), you should use firm fixed effects.

FE is a very traditional and accepted solution in this type of research because our models are always a simplified representation of reality: we can never be sure we are controlling for everything that is important. By including a firm FE you are, at least, sure that you are controlling for omitted variables that do not change over time at the firm level. That is, firm FE mitigates concerns about omitted variable bias from time-invariant firm-level characteristics.

On top of the firm FE, it is always a good idea to include year FE at the same time, as follows. You do not need to present them in your final paper, but you need to mention them.

fe <- plm(w_lev1 ~ w_size1 + w_fcf_ta + w_roa + w_tang_ta + w_cash_ta + w_div_ta + factor(year) , data = data, model="within")
export_summs(fe, coefs = c("w_size1","w_div_ta","w_fcf_ta","w_roa","w_tang_ta","w_cash_ta"),  digits = 3 , model.names = c("FE") )
                     FE
w_size1           0.076 ***
                 (0.004)
w_div_ta         -0.894 ***
                 (0.152)
w_fcf_ta          0.453 ***
                 (0.117)
w_roa            -0.009 ***
                 (0.001)
w_tang_ta         0.176 ***
                 (0.019)
w_cash_ta        -0.304 ***
                 (0.029)
nobs               6885
r.squared         0.215
adj.r.squared     0.086
statistic        57.872
p.value           0.000
deviance        108.418
df.residual    5914.000
nobs.1         6885.000
*** p < 0.001; ** p < 0.01; * p < 0.05.

More technically (but really not that much), a firm FE model can be represented as follows. That is, you include one fixed term \(\alpha_i\) for each firm (equivalently, a dummy for each firm except the first, just as with countries above). I am also adding the year FE \(\alpha_t\) in the equation below.

\[Y_{t,i} = \beta \times X_{t,i} + \alpha_i + \alpha_t + \epsilon_{t,i}\]

Technically, the inclusion of the FE acts as a transformation of Y and X into deviations from unit means, \(y_{t,i} - \bar{y}_i\) and \(x_{t,i} - \bar{x}_i\), where \(\bar{y}_i\) is the mean value of y across all time periods for unit i. As a result, in FE regressions, \(\beta\) is estimated from within-unit variation: comparing observations of the same unit, y is on average \(\beta\) higher when x is one unit above its i-specific mean.

Thus, a FE model is a within-unit comparison model, and the estimates are not affected by whether a unit has a higher average X or Y. That is, the estimates are not affected by unobserved confounders that affect x or y in the same way in all time periods.
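A minimal sketch of this within transformation, assuming the variables created above: if we demean y and x by firm ourselves and run OLS on the deviations, we should approximately recover the plm() within estimate (small differences can arise from how missing values enter the firm means).

library(dplyr)
dm <- as.data.frame(data) %>%
  group_by(id_firm) %>%
  dplyr::mutate(dm_lev1  = w_lev1  - mean(w_lev1,  na.rm = TRUE),
                dm_size1 = w_size1 - mean(w_size1, na.rm = TRUE)) %>%
  ungroup()
coef(lm(dm_lev1 ~ 0 + dm_size1, data = dm))                 # manual within estimate
coef(plm(w_lev1 ~ w_size1, data = data, model = "within"))  # plm within estimate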

Important question: can you include firm FE and country FE at the same time in a regression? The answer is no. If you try, all the country FE will be dropped due to collinearity with the firm FE.
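We can verify why: within each firm, country never changes, so each country dummy is an exact linear combination of the firm dummies. A quick check using the identifiers created earlier:

# Distinct countries per firm; this should be 1 for every firm
table(tapply(as.character(data$country), data$id_firm, function(x) length(unique(x))))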

Final notes on FE models:

  1. Notice there is no intercept in the table above. The reason is that the model includes \(i\) intercepts, one for each firm. In some cases, the software reports an intercept value that is the average of all the intercepts, but that is of little value to us.

  2. There are two possible R-squared measures in FE models: the within R-squared and the overall R-squared (the traditional one). The first is more adequate, because the second effectively treats all the firm intercepts as explanatory variables of y and thus overstates the magnitude of the model’s explanatory power (you can extract the within version as shown below).
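For plm "within" models, the R-squared reported by summary() is computed on the within-transformed data, so it is the within R-squared; a sketch of how to extract it:

summary(fe)$r.squared   # within R-squared and its adjusted version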


7.11 OLS vs FE

At the end of the day, you can never be sure that the FE model is better. So, we can test whether it is an improvement over a pooled OLS, as follows. Note that the comparison model must be estimated on the same (winsorized) variables as the FE model, so we estimate a pooled model with plm using model = "pooling" rather than reusing the earlier ols object.

If the p-value is low enough, then the fixed effects model is the path to go.

pooled <- plm(w_lev1 ~ w_size1 + w_fcf_ta + w_roa + w_tang_ta + w_cash_ta + w_div_ta, data = data, model = "pooling")
pFtest(fe, pooled)

    F test for individual effects

data:  w_lev1 ~ w_size1 + w_fcf_ta + w_roa + w_tang_ta + w_cash_ta +  ...
F = 17.45, df1 = 964, df2 = 5914, p-value < 2.2e-16
alternative hypothesis: significant effects

7.12 Random Effects

An alternative to FE models is the RE (random effects) model. In FE models, we assume \(\alpha_i\) is correlated with the other X variables. This makes sense, since the unobserved time-invariant characteristics of a firm are expected to be correlated with the observed characteristics of that firm.

If we believe that \(\alpha_i\) is not correlated with X, we should use a RE model instead. You can estimate one as follows.

re <- plm(w_lev1 ~ w_size1 + w_fcf_ta + w_roa + w_tang_ta + w_cash_ta + w_div_ta  , data = data, model="random") 
export_summs(fe, re, coefs = c("w_size1","w_div_ta","w_fcf_ta","w_roa","w_tang_ta","w_cash_ta"),  digits = 3 , model.names = c("FE", "RE") )
                     FE            RE
w_size1           0.076 ***     0.033 ***
                 (0.004)       (0.003)
w_div_ta         -0.894 ***    -1.336 ***
                 (0.152)       (0.157)
w_fcf_ta          0.453 ***     0.307 **
                 (0.117)       (0.112)
w_roa            -0.009 ***    -0.009 ***
                 (0.001)       (0.001)
w_tang_ta         0.176 ***     0.149 ***
                 (0.019)       (0.016)
w_cash_ta        -0.304 ***    -0.410 ***
                 (0.029)       (0.029)
nobs               6885          6885
r.squared         0.215         0.144
adj.r.squared     0.086         0.144
statistic        57.872       856.420
p.value           0.000         0.000
deviance        108.418       146.431
df.residual    5914.000      6878.000
nobs.1         6885.000      6885.000
*** p < 0.001; ** p < 0.01; * p < 0.05.

7.13 RE vs FE

If you want to test which model is better suited to your dataset, you can use the Hausman test, as follows. If the p-value is significant, use FE; if not, use RE.

phtest(fe, re)

    Hausman Test

data:  w_lev1 ~ w_size1 + w_fcf_ta + w_roa + w_tang_ta + w_cash_ta +  ...
chisq = 9718.9, df = 6, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent