Econometria Aplicada a Finanças

Part 10

Henrique C. Martins

Regression Discontinuity Design (RDD)

Regression Discontinuity Design (RDD)

The regression discontinuity (RD) research design is a quasi-experimental method.

Here, the treatment is not a simple binary assignment as before.

Treated units are assigned based on a cutoff score of a continuous variable.

Example Mastering Metrics:

  • Age for drinking is 18 in Brazil.

Why prohibit drinking for those younger than 18?

The theory behind this effort is that legal drinking at age 18 discourages binge drinking and promotes a culture of mature alcohol consumption.

It splits who can drink and who can’t.

Regression Discontinuity Design (RDD)

The name “RDD” comes from a jump, a discontinuity, that occurs in a continuous variable.

In its simplest form, the design has:

  • An assignment (running) variable (e.g., age),

  • Two groups (above and below the cutoff),

  • The outcome variable.

  • You may also include nonlinearities and control variables.

Regression Discontinuity Design (RDD)

The main assumption that allows using RDD as a causal method is that:

Close to the cutoff, participants are similar; the only difference is which side of the cutoff each one falls on.

Regression Discontinuity Design (RDD)

  • The cutoff value occurs at 50

  • What are the differences between someone that scores 49.99 and 50.01 in the X variable?

  • The intuition is that these individuals are similar and comparable.

In the absence of treatment, the assumption is that the solid line would “continue” with the same slope and values.

There is a discontinuity, however. This implies that the counterfactual in the absence of the treatment would be the dashed line.

The discontinuity is the causal effect of the treatment (at the cutoff) on Y.

Unlike the matching and regression strategies based on treatment-control comparisons conditional on covariates, the validity of RDD is based on our willingness to extrapolate across values of the running variable, at least for values in the neighborhood of the cutoff at which treatment switches on. (Mastering Metrics)

Regression Discontinuity Design (RDD)

In the US, the drinking age is 21. Example Mastering Metrics:

Notice the x-axis.

Regression Discontinuity Design (RDD)

In the US, the drinking age is 21. Example Mastering Metrics:

Notice the x-axis.

Regression Discontinuity Design (RDD)

This is (legally, it should be) an example of a Sharp RDD.

A Sharp RDD occurs when the cutoff is mandatory, with no exceptions. In this case, there are no 17-year-olds drinking and driving.

The treatment is

\[D_a = \begin{cases} 1 & \text{if } a \geq 18 \\ 0 & \text{if } a < 18 \end{cases}\]

The alternative is a fuzzy RDD, which occurs when there is some misassignment.

  • People below the cutoff also receiving the treatment.
  • E.g., students who score 5.96 are usually approved in a course (this is a misassignment).
  • To estimate a Fuzzy RDD, you can use the cutoff indicator as an instrumental variable for actual treatment status (Angrist & Pischke, 2009).

Regression Discontinuity Design (RDD)

Things that are “good” for an RDD:

  • Age
  • Dates (you must be 6 years old to start school; 5.99 years is not allowed)
  • Ranking systems
  • Location (when people cannot “move” easily)

Regression Discontinuity Design (RDD)

A Sharp RDD can take the form of:

\[Y_i = \alpha + \beta_1 D_a + \beta_2 x_1 + \epsilon\]

Where \(D_a\) is the treatment based on the cutoff.

\(x_1\) is the running variable (the difference between the actual value of X and the cutoff).

  • For instance, months until the 18 years birthday.

  • We can also use the notation: \((x-x_0)\) regarding the difference.
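For instance, centering a hypothetical age-in-months variable at the 18-year cutoff (216 months):

```python
# Hypothetical ages measured in months; the 18-year cutoff is 216 months.
ages = [210, 215, 216, 220]
cutoff = 216
running = [a - cutoff for a in ages]  # (x - x0): negative below, zero/positive above
print(running)  # [-6, -1, 0, 4]
```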

RDD Nonlinearities

RDD Nonlinearities

Example Mastering Metrics:

This is a linear relationship.

RDD Nonlinearities

Example Mastering Metrics:

This is a nonlinear relationship.

RDD Nonlinearities

Example Mastering Metrics:

Is the relationship linear or nonlinear here? If you misjudge the relationship, it will be hard to tell a story credibly.

RDD Nonlinearities

The takeaway: RDD is a graphical method. You need to show the graphs.

Nobody will believe your story without the correct specification of the model.

In the first case:

\[Y_i = \alpha + \beta_1 D_a + \beta_2 x_1 + \epsilon\]

In the second case:

\[Y_i = \alpha + \beta_1 D_a + \beta_2 x_1 + \beta_3 x_1^2 + \epsilon\]

RDD Nonlinearities

We can also add an interaction term (notice that I am changing the notation now to match Mastering Metrics):

\[Y_i = \alpha + \beta_1 D_a + \beta_2 (x - x_0) + \beta_3 (x - x_0) D_a + \epsilon\]

This allows for different slopes before and after the cut.

RDD Nonlinearities

Or even different nonlinearities before and after the cut:

\[Y_i = \alpha + \beta_1 D_a + \beta_2 (x - x_0) + \beta_3 (x - x_0)^2 + \beta_4 (x - x_0) D_a + \beta_5 (x - x_0)^2 D_a + \epsilon\]

Example RDD

Clearly, this is not linear.

R
library(readxl)
library(ggplot2)
data  <- read_excel("files/RDD.xlsx")
ggplot(data, aes(x, y))  + 
  geom_point(size=1.2) + 
  labs(y = "", x="", title = "Evolution of Y") +
  theme(plot.title = element_text(color="black", size=20, face="bold"),
        panel.background = element_rect(fill = "grey95", colour = "grey95"),
        axis.text.y = element_text(face="bold", color="black", size = 12),
        axis.text.x = element_text(face="bold", color="black", size = 12),
        legend.title = element_blank(),
        legend.key.size = unit(1, "cm")) +
    geom_smooth(method = "lm", fill = NA)

Python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Read Excel file
data = pd.read_excel("files/RDD.xlsx")
# Generate line graph - Including all observations together
sns.set(style="whitegrid")
plt.figure(figsize=(10, 5))
scatter_plot = sns.scatterplot(x='x', y='y', data=data, s=50)
scatter_plot.set_title("Evolution of Y", fontsize=20, fontweight='bold')
scatter_plot.set_xlabel("", fontsize=12, fontweight='bold')
scatter_plot.set_ylabel("", fontsize=12, fontweight='bold')
# Add regression line
sns.regplot(x='x', y='y', data=data, scatter=False, line_kws={'color': 'blue'})
plt.show()

Example RDD

R
library(readxl)
library(ggplot2)
data  <- read_excel("files/RDD.xlsx")
# Creating  groups
data$treated <- 0
data$treated[data$x >= 101] <- 1  
# Generate a line graph - two groups
ggplot(data, aes(x, y, group=treated, color = factor(treated)))  + 
    geom_point( size=1.25) + 
    labs(y = "", x="", title = "RDD example")+
    theme(plot.title = element_text(color="black", size=25, face="bold"),
          panel.background = element_rect(fill = "grey95", colour = "grey95"),
          axis.text.y = element_text(face="bold", color="black", size = 16),
          axis.text.x = element_text(face="bold", color="black", size = 16),
          legend.title = element_blank(),
          legend.key.size = unit(2, "cm")) +
    geom_smooth(method = "lm", fill = NA)

Python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Read Excel file
data = pd.read_excel("files/RDD.xlsx")
# Create treated variable
data['treated'] = 0
data.loc[data['x'] >= 101, 'treated'] = 1
# Generate line graph with two groups
sns.set(style="whitegrid")
plt.figure(figsize=(10, 5))
scatter_plot = sns.scatterplot(x='x', y='y', hue='treated', style='treated', data=data, s=50)
scatter_plot.set_title("RDD example", fontsize=25, fontweight='bold')
scatter_plot.set_xlabel("", fontsize=16, fontweight='bold')
scatter_plot.set_ylabel("", fontsize=16, fontweight='bold')
scatter_plot.legend(title='', loc='upper left', fontsize='small')
# Add regression lines
sns.regplot(x='x', y='y', data=data[data['treated'] == 0], scatter=False, line_kws={'color': 'blue'})
sns.regplot(x='x', y='y', data=data[data['treated'] == 1], scatter=False, line_kws={'color': 'orange'})
plt.show()

Example RDD

R
library(readxl)
library(ggplot2)
data  <- read_excel("files/RDD.xlsx")
# Creating  groups
data$treated <- 0
data$treated[data$x >= 101] <- 1  
# define cut
cut <- 100
band <- 50
xlow = cut - band
xhigh = cut + band
# subset the data for the bandwidth
data <- subset(data, x > xlow & x <= xhigh, select=c(x, y,  treated))
# Generate a line graph - two groups
ggplot(data, aes(x, y, group=treated, color = factor(treated)))  + 
  geom_point( size=1.25) + 
  labs(y = "", x="", title = "RDD example")+
  theme(plot.title = element_text(color="black", size=25, face="bold"),
        panel.background = element_rect(fill = "grey95", colour = "grey95"),
        axis.text.y = element_text(face="bold", color="black", size = 16),
        axis.text.x = element_text(face="bold", color="black", size = 16),
        legend.title = element_blank(),
        legend.key.size = unit(2, "cm")) +
  geom_smooth(method = "lm", fill = NA)

Python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Read Excel file
data = pd.read_excel("files/RDD.xlsx")
# Create treated variable
data['treated'] = 0
data.loc[data['x'] >= 101, 'treated'] = 1
# Define cut, band, and subset data
cut = 100
band = 50
xlow = cut - band
xhigh = cut + band
data = data[(data['x'] > xlow) & (data['x'] <= xhigh)][['x', 'y', 'treated']]
# Generate line graph with two groups
sns.set(style="whitegrid")
plt.figure(figsize=(10, 5))
scatter_plot = sns.scatterplot(x='x', y='y', hue='treated', style='treated', data=data, s=50)
scatter_plot.set_title("RDD example", fontsize=25, fontweight='bold')
scatter_plot.set_xlabel("", fontsize=16, fontweight='bold')
scatter_plot.set_ylabel("", fontsize=16, fontweight='bold')
scatter_plot.legend(title='', loc='upper left', fontsize='small')
# Add regression lines
sns.regplot(x='x', y='y', data=data[data['treated'] == 0], scatter=False, line_kws={'color': 'blue'})
sns.regplot(x='x', y='y', data=data[data['treated'] == 1], scatter=False, line_kws={'color': 'orange'})
plt.show()

Example RDD

R
library(readxl)
library(ggplot2)
data  <- read_excel("files/RDD.xlsx")
data$treated <- 0
data$treated[data$x >= 101] <- 1  
cut <- 100
band <- 50
xlow = cut - band
xhigh = cut + band
data <- subset(data, x > xlow & x <= xhigh, select=c(x, y,  treated))
# Generating xhat - Now we are going to the RDD
data$xhat <- data$x - cut
# Generating xhat * treated to allow different inclinations (we will use the findings of the last graph, i.e. that each group has a different trend.)
data$xhat_treated <- data$xhat * data$treated
# RDD Assuming different trends
rdd <- lm(y  ~ xhat + treated  + xhat_treated, data = data)
summary(rdd)

Call:
lm(formula = y ~ xhat + treated + xhat_treated, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.9477  -3.2607   0.6875   3.2227  12.2004 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  55.83059    1.53681  36.329  < 2e-16 ***
xhat          0.29431    0.05405   5.445 3.97e-07 ***
treated      28.93921    2.20672  13.114  < 2e-16 ***
xhat_treated -0.51587    0.07644  -6.749 1.13e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.515 on 96 degrees of freedom
Multiple R-squared:  0.8942,    Adjusted R-squared:  0.8909 
F-statistic: 270.3 on 3 and 96 DF,  p-value: < 2.2e-16
Python
import pandas as pd
import statsmodels.api as sm
data = pd.read_excel("files/RDD.xlsx")
# Generate treated variable
data['treated'] = 0
data.loc[data['x'] >= 101, 'treated'] = 1
# Define cut and band
cut = 100
band = 50
xlow = cut - band
xhigh = cut + band
# Subset data
data = data[(data['x'] > xlow) & (data['x'] <= xhigh)]
# Generate xhat and xhat_treated
data['xhat'] = data['x'] - cut
data['xhat_treated'] = data['xhat'] * data['treated']
# Regression
X = data[['xhat', 'treated', 'xhat_treated']]
X = sm.add_constant(X)  # Add a constant term
y = data['y']
model = sm.OLS(y, X).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.891
Method:                 Least Squares   F-statistic:                     270.3
Date:                qua, 08 nov 2023   Prob (F-statistic):           1.14e-46
Time:                        17:22:11   Log-Likelihood:                -310.60
No. Observations:                 100   AIC:                             629.2
Df Residuals:                      96   BIC:                             639.6
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           55.8306      1.537     36.329      0.000      52.780      58.881
xhat             0.2943      0.054      5.445      0.000       0.187       0.402
treated         28.9392      2.207     13.114      0.000      24.559      33.320
xhat_treated    -0.5159      0.076     -6.749      0.000      -0.668      -0.364
==============================================================================
Omnibus:                        0.995   Durbin-Watson:                   2.114
Prob(Omnibus):                  0.608   Jarque-Bera (JB):                1.079
Skew:                          -0.221   Prob(JB):                        0.583
Kurtosis:                       2.747   Cond. No.                         151.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Stata
import excel "files/RDD.xlsx", firstrow clear
gen treated = 0
replace treated = 1 if x >= 101
gen cut = 100
gen band = 50
gen xlow = cut - band
gen xhigh = cut + band
keep if x > xlow & x <= xhigh
gen xhat = x - cut
gen xhat_treated = xhat * treated
regress y xhat treated xhat_treated
(2 vars, 200 obs)
(100 real changes made)
(100 observations deleted)

      Source |       SS           df       MS      Number of obs   =       100
-------------+----------------------------------   F(3, 96)        =    270.35
       Model |  24669.3025         3  8223.10084   Prob > F        =    0.0000
    Residual |  2920.00749        96  30.4167447   R-squared       =    0.8942
-------------+----------------------------------   Adj R-squared   =    0.8909
       Total |    27589.31        99  278.679899   Root MSE        =    5.5151

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        xhat |   .2943097   .0540479     5.45   0.000     .1870255     .401594
     treated |   28.93921   2.206717    13.11   0.000     24.55891    33.31951
xhat_treated |  -.5158703   .0764353    -6.75   0.000    -.6675932   -.3641475
       _cons |   55.83059   1.536805    36.33   0.000     52.78005    58.88112
------------------------------------------------------------------------------

Example RDD

The slope of x before the cut is 0.29 (t-stat 5.45). The interaction coefficient, -0.52 (t-stat -6.75), measures the change in slope at the cut, so the slope after the cut is 0.29 - 0.52 = -0.22.

We also have the coefficient of the treatment, which is measured by the “jump” that occurs at the cut: an estimated coefficient of 28.9 (t-stat 13.11).

If this were a real example, this would be the causal effect of receiving the treatment (i.e., being beyond the cut).
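A detail that is easy to miss: the slope after the cut is the sum of the baseline slope and the interaction coefficient, not the interaction coefficient alone. A quick arithmetic check with the estimates above:

```python
# Coefficients taken from the regression output above
b_xhat = 0.2943           # slope of the running variable before the cut
b_xhat_treated = -0.5159  # change in that slope at the cut
slope_after = b_xhat + b_xhat_treated
print(round(slope_after, 4))  # -0.2216
```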

THANK YOU!