5 Selection problem

Imagine that the population contains 5000 units, from which you can observe only 50.

You want to run a linear model to understand the relationship between x and Y.

The “true” beta of this relationship is as follows. By “true” I mean the beta you would get should you observe the population (remember though that you don’t).

summary(lm(df$y ~ df$x))
## 
## Call:
## lm(formula = df$y ~ df$x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2527.44 -1230.21     4.28  1246.20  2510.94 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.549e+03  4.103e+01   62.13   <2e-16 ***
## df$x        1.871e-01  1.407e-02   13.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1435 on 4998 degrees of freedom
## Multiple R-squared:  0.03416,    Adjusted R-squared:  0.03397 
## F-statistic: 176.8 on 1 and 4998 DF,  p-value: < 2.2e-16

So the “true” beta is 0.187. And the t-stat is 13.296

Plotting this relationship in a graph, you get:

If you run a linear model using the sample you can observe, you might get this.

Or maybe this:

Or maybe this:

Or maybe several other estimates.

So, the takeaway is: always remember that you can only observe a sample of the population. If the sample you observe is biased, you will get biased estimates.