Instrumental variables I

Materials for class on Monday, November 11, 2019

Slides

Download the slides from today’s class.

First slide

Running R from your computer

Download and unzip this folder to get started:

IV/2SLS examples

library(tidyverse)   # For ggplot, %>%, and friends
library(broom)       # For converting models to data frames
library(huxtable)    # For side-by-side regression tables
library(AER)         # For some econometrics datasets
library(wooldridge)  # More econometrics datasets
library(estimatr)    # For iv_robust

Education, wages, and parent education (fake data)

Download this data and put it in a folder named “data” in your project root:

ed_fake <- read_csv("data/father_education.csv")

We’re interested in the perennial econometrics question of whether an extra year of education causes increased wages. In this example we use simulated/fake data that includes the following variables:

| Variable name | Description |
|---------------|-------------|
| wage | Weekly wage |
| educ | Years of education |
| ability | Magical column that measures your ability to work and go to school (omitted variable) |
| fathereduc | Years of education for father |

If we could actually measure ability, we could estimate this model, which closes the confounding backdoor posed by ability and isolates just the effect of education on wages:

model_perfect <- lm(wage ~ educ + ability, data = ed_fake)
tidy(model_perfect)
| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | -80.2626923 | 5.6585424 | -14.18434 | 0 |
| educ | 9.2424696 | 0.3426683 | 26.97206 | 0 |
| ability | 0.2580724 | 0.0071809 | 35.93874 | 0 |

However, in real life we don’t have ability, so we’re stuck with a naive model:

model_naive <- lm(wage ~ educ, data = ed_fake)
tidy(model_naive)
| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | -53.08545 | 8.4920070 | -6.251225 | 0 |
| educ | 12.24036 | 0.5033011 | 24.320144 | 0 |

The naive model overestimates the effect of education on wages (12.2 vs. 9.24) because of omitted variable bias. Education suffers from endogeneity: things that are correlated with it (like ability) are hiding in the model’s error term. Any estimate we calculate will be wrong and biased because of selection effects or omitted variable bias (all different names for endogeneity).
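
Because this is fake data, we can actually watch the endogeneity happen, something a real dataset would never let us do. A minimal check using the magical ability column:

# educ is correlated with ability, and ability lives in the naive model's
# error term; that correlation is exactly what biases the naive estimate
cor(ed_fake$educ, ed_fake$ability)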

To fix the endogeneity problem, we can use an instrument to remove the endogeneity from education and instead use a special exogeneity-only version of education. Perhaps someone’s father’s education can be an instrument for education.

To be a valid instrument, it must meet three criteria:

  1. Relevance: Instrument is correlated with policy variable
  2. Exclusion: Instrument is correlated with outcome only through the policy variable
  3. Exogeneity: Instrument isn’t correlated with anything else in the model (i.e. omitted variables)

We can first test relevance by making a scatterplot and running a model of policy ~ instrument:

ggplot(ed_fake, aes(x = fathereduc, y = educ)) +
  geom_point() +
  geom_smooth(method = "lm")

check_relevance <- lm(educ ~ fathereduc, data = ed_fake)
tidy(check_relevance)
| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | 4.3955891 | 0.3986274 | 11.02681 | 0 |
| fathereduc | 0.7569158 | 0.0242786 | 31.17622 | 0 |
glance(check_relevance)
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|-----------|---------------|-------|-----------|---------|----|--------|-----|-----|----------|-------------|
| 0.4933898 | 0.4928822 | 1.595708 | 971.9564 | 0 | 2 | -1885.255 | 3776.51 | 3791.233 | 2541.192 | 998 |

This looks pretty good! The F-statistic is definitely above 10 (it’s 972!), and there’s a significant relationship between the instrument and policy. I’d say that this is relevant.
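
If you want to pull that F-statistic out of the model programmatically instead of eyeballing the glance() output, here’s one small sketch:

# Extract the model-level F-statistic from glance()
glance(check_relevance)$statistic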

To check for exclusion, we need to see if there’s a relationship between father’s education and wages that occurs only because of education. If we plot it, we’ll see a relationship:

ggplot(ed_fake, aes(x = fathereduc, y = wage)) +
  geom_point() +
  geom_smooth(method = "lm")

That’s to be expected, since in our model, father’s education causes education, which causes wages: they should be correlated. We have to use theory to justify the idea that a father’s education increases one’s weekly wage only because it increases one’s education, and there’s no real statistical test for that.

There’s not really a test for exogeneity either, since there’s no way to measure other endogenous variables in the model (that’s the whole reason we’re using IVs in the first place!). Because we have the magical ability column in this fake data, though, we can check it directly. Father’s education shouldn’t be related to ability:

ggplot(ed_fake, aes(x = ability, y = fathereduc)) +
  geom_point() +
  geom_smooth(method = "lm")

And it’s not! We can safely say that it meets the exogeneity assumption.

In real data, though, where we can’t see omitted variables like ability, there’s no statistical test for exogeneity. We just have to tell a theory-based story that the number of years of education one’s father has is not correlated with anything else in the model (including any omitted variables). Good luck with that; it’s probably not a good instrument. This relates to Scott Cunningham’s argument that instruments have to be weird. According to Scott:

The reason I think this is because an instrument doesn’t belong in the structural error term and the structural error term is all the intuitive things that determine your outcome. So it must be weird, otherwise it’s probably in the error term.

Let’s just pretend that father’s education is a valid instrument and move on :)

Now we can do two-stage least squares (2SLS) regression and use the instrument to filter out the endogenous part of education. The first stage predicts education based on the instrument (we already ran this model earlier when checking for relevance, but we’ll do it again just for fun):

first_stage <- lm(educ ~ fathereduc, data = ed_fake)

Now we want to add a column of predicted education to our original dataset. The easiest way to do that is with the augment_columns() function from the broom library:

ed_fake_with_prediction <- augment_columns(first_stage, ed_fake)
head(ed_fake_with_prediction)
| wage | educ | ability | fathereduc | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
|------|------|---------|------------|---------|---------|--------|------|--------|---------|------------|
| 146.3478 | 18.05319 | 348.2400 | 17.15832 | 17.38299 | 0.0547177 | 0.6701995 | 0.0011758 | 1.596367 | 0.0001040 | 0.4202485 |
| 147.5996 | 15.84549 | 181.1607 | 13.98853 | 14.98373 | 0.0752312 | 0.8617548 | 0.0022227 | 1.596274 | 0.0003256 | 0.5406466 |
| 161.8202 | 15.10521 | 337.3677 | 15.99431 | 16.50194 | 0.0509579 | -1.3967293 | 0.0010198 | 1.595895 | 0.0003915 | -0.8757505 |
| 105.0829 | 16.45813 | 106.4580 | 21.41317 | 20.60356 | 0.1343020 | -4.1454269 | 0.0070837 | 1.591062 | 0.0242457 | -2.6071108 |
| 167.5622 | 18.79382 | 301.5100 | 16.45763 | 16.85263 | 0.0506309 | 1.9411855 | 0.0010068 | 1.595323 | 0.0007464 | 1.2171170 |
| 172.9273 | 16.01597 | 283.6524 | 15.42531 | 16.07125 | 0.0546236 | -0.0552788 | 0.0011718 | 1.596507 | 0.0000007 | -0.0346625 |

Note a couple of these new columns. .fitted is the fitted/predicted value of education, and it’s the version of education with endogeneity arguably removed. .resid shows how far off the prediction is from educ. The other columns don’t matter so much.
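
If you want to convince yourself of what these columns contain, here’s a quick optional sanity check:

# .fitted is identical to predict() on the first stage model
all.equal(ed_fake_with_prediction$.fitted, unname(predict(first_stage)))

# .resid is the gap between actual and predicted education
all.equal(ed_fake_with_prediction$.resid,
          ed_fake_with_prediction$educ - ed_fake_with_prediction$.fitted)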

Instead of dealing with weird names like .fitted, I like to rename the fitted variable to something more understandable after I use augment_columns:

ed_fake_with_prediction <- augment_columns(first_stage, ed_fake) %>% 
  rename(educ_hat = .fitted)

head(ed_fake_with_prediction)
| wage | educ | ability | fathereduc | educ_hat | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
|------|------|---------|------------|----------|---------|--------|------|--------|---------|------------|
| 146.3478 | 18.05319 | 348.2400 | 17.15832 | 17.38299 | 0.0547177 | 0.6701995 | 0.0011758 | 1.596367 | 0.0001040 | 0.4202485 |
| 147.5996 | 15.84549 | 181.1607 | 13.98853 | 14.98373 | 0.0752312 | 0.8617548 | 0.0022227 | 1.596274 | 0.0003256 | 0.5406466 |
| 161.8202 | 15.10521 | 337.3677 | 15.99431 | 16.50194 | 0.0509579 | -1.3967293 | 0.0010198 | 1.595895 | 0.0003915 | -0.8757505 |
| 105.0829 | 16.45813 | 106.4580 | 21.41317 | 20.60356 | 0.1343020 | -4.1454269 | 0.0070837 | 1.591062 | 0.0242457 | -2.6071108 |
| 167.5622 | 18.79382 | 301.5100 | 16.45763 | 16.85263 | 0.0506309 | 1.9411855 | 0.0010068 | 1.595323 | 0.0007464 | 1.2171170 |
| 172.9273 | 16.01597 | 283.6524 | 15.42531 | 16.07125 | 0.0546236 | -0.0552788 | 0.0011718 | 1.596507 | 0.0000007 | -0.0346625 |

We can now use the new educ_hat variable in our second stage model:

second_stage <- lm(wage ~ educ_hat, data = ed_fake_with_prediction)
tidy(second_stage)
| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | -3.107953 | 14.3703343 | -0.2162757 | 0.8288171 |
| educ_hat | 9.251863 | 0.8555225 | 10.8142831 | 0.0000000 |

The estimate for educ_hat is arguably more accurate now because we’ve used the instrument to strip out the endogenous part of education, leaving only the exogenous part. (One caveat: the standard errors from this manual two-step approach aren’t quite right; the iv_robust() function we use later handles that.)

We can put all the models side-by-side to compare them:

huxreg(list("Perfect" = model_perfect, "OLS" = model_naive, "2SLS" = second_stage))
|             | Perfect     | OLS         | 2SLS       |
|-------------|-------------|-------------|------------|
| (Intercept) | -80.263 *** | -53.085 *** | -3.108     |
|             | (5.659)     | (8.492)     | (14.370)   |
| educ        | 9.242 ***   | 12.240 ***  |            |
|             | (0.343)     | (0.503)     |            |
| ability     | 0.258 ***   |             |            |
|             | (0.007)     |             |            |
| educ_hat    |             |             | 9.252 ***  |
|             |             |             | (0.856)    |
| N           | 1000        | 1000        | 1000       |
| R2          | 0.726       | 0.372       | 0.105      |
| logLik      | -4576.101   | -4991.572   | -5168.868  |
| AIC         | 9160.202    | 9989.144    | 10343.735  |

*** p < 0.001; ** p < 0.01; * p < 0.05.

Note how the coefficient for educ_hat in the 2SLS model is basically the same as the coefficient for educ in the perfect model that accounts for ability. That’s the magic of instrumental variables!

Education, wages, and parent education (real data)

This data comes from the wage2 dataset in the wooldridge R package (and it’s real!). Wages are measured in monthly earnings in 1980 dollars.

ed_real <- wage2 %>% 
  rename(education = educ, education_dad = feduc, education_mom = meduc) %>%
  na.omit()  # Get rid of rows with missing values

We want to again estimate the effect of education on wages, but this time we’ll use both one’s father’s education and one’s mother’s education as instruments. Here’s the naive estimation of the relationship, which suffers from endogeneity:

model_naive <- lm(wage ~ education, data = ed_real)
tidy(model_naive)
| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | 175.15999 | 92.838561 | 1.886716 | 0.0596366 |
| education | 59.45181 | 6.697942 | 8.876131 | 0.0000000 |

This is wrong though! Education is endogenous to unmeasured things in the model (like ability, which lives in the error term). We can isolate the exogenous part of education with an instrument.

Before doing any 2SLS models, we want to check the validity of the instruments. Remember, for an instrument to be valid, it should meet these criteria:

  1. Relevance: Instrument is correlated with policy variable
  2. Exclusion: Instrument is correlated with outcome only through the policy variable
  3. Exogeneity: Instrument isn’t correlated with anything else in the model (i.e. omitted variables)

We can check for relevance by looking at the relationship between the instruments and education:

# Combine father's and mother's education into one column so we can plot both at the same time
ed_real_long <- ed_real %>% 
  pivot_longer(cols = c(education_dad, education_mom), 
               names_to = "instrument", values_to = "instrument_value")

ggplot(ed_real_long, aes(x = instrument_value, y = education)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  facet_wrap(~ instrument)

model_check_instruments <- lm(education ~ education_dad + education_mom, 
                              data = ed_real)
tidy(model_check_instruments)
| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | 9.9132852 | 0.3195911 | 31.018649 | 0.00e+00 |
| education_dad | 0.2189424 | 0.0288925 | 7.577836 | 0.00e+00 |
| education_mom | 0.1401693 | 0.0336511 | 4.165370 | 3.52e-05 |
glance(model_check_instruments)
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|-----------|---------------|-------|-----------|---------|----|--------|-----|-----|----------|-------------|
| 0.2015354 | 0.1991158 | 1.996932 | 83.29321 | 0 | 3 | -1397.792 | 2803.584 | 2821.571 | 2631.908 | 660 |

There’s a clear relationship between both of the instruments and education, and the coefficients for each are significant. The F-statistic for the model is 83, comfortably above the rule-of-thumb threshold of 10, which is a good sign of a strong instrument.

We can check for exclusion in part by looking at the relationship between the instruments and the outcome, or wages. We should see some relationship:

ggplot(ed_real_long, aes(x = instrument_value, y = wage)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  facet_wrap(~ instrument)

And we do! Now we just have to make the case that parental education influences wages only through one’s own education and not through any other channel. Good luck with that.

The last step is to prove exogeneity: that parental education is not correlated with anything else in the model, including omitted variables like ability. Good luck with that too.

Assuming that parental education is a good instrument, we can use it to remove the endogenous part of education using 2SLS. In the first stage, we predict education using our instruments:

first_stage <- lm(education ~ education_dad + education_mom, data = ed_real)

We can then extract the predicted education and add it to our main dataset, renaming the .fitted variable to something more useful along the way:

ed_real_with_predicted <- augment_columns(first_stage, ed_real) %>% 
  rename(education_hat = .fitted)

Finally, we can use predicted education to estimate the exogenous effect of education on wages:

second_stage <- lm(wage ~ education_hat, 
                   data = ed_real_with_predicted)
tidy(second_stage)
| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | -537.7122 | 208.16441 | -2.583113 | 0.010005 |
| education_hat | 111.5614 | 15.17586 | 7.351244 | 0.000000 |

That should arguably be our actual effect! Let’s compare it to the naive model:

huxreg(list("OLS" = model_naive, "2SLS" = second_stage))
|               | OLS        | 2SLS        |
|---------------|------------|-------------|
| (Intercept)   | 175.160    | -537.712 *  |
|               | (92.839)   | (208.164)   |
| education     | 59.452 *** |             |
|               | (6.698)    |             |
| education_hat |            | 111.561 *** |
|               |            | (15.176)    |
| N             | 663        | 663         |
| R2            | 0.106      | 0.076       |
| logLik        | -4885.974  | -4897.252   |
| AIC           | 9777.947   | 9800.503    |

*** p < 0.001; ** p < 0.01; * p < 0.05.

The 2SLS effect is roughly twice as large and is arguably more accurate, since it has removed the endogeneity from education. An extra year of school leads to an extra $111.56 a month in income (in 1980 dollars).

If you don’t want to go through the hassle of doing the two stages by hand, you can use the iv_robust() function from the estimatr package to do both stages at the same time. The second-stage model goes on the left side of the |, just like a normal regression; the instruments for the first stage go on the right side of the |:

model_same_time <- iv_robust(wage ~ education | education_dad + education_mom,
                             data = ed_real)
tidy(model_same_time)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|------|----------|-----------|-----------|---------|----------|-----------|----|---------|
| (Intercept) | -537.7122 | 214.43142 | -2.507618 | 0.0123934 | -958.76099 | -116.6633 | 661 | wage |
| education | 111.5614 | 15.90091 | 7.016040 | 0.0000000 | 80.33905 | 142.7838 | 661 | wage |

We should get the same coefficient as the second stage, but the standard errors with iv_robust() are more accurate. The only problem with iv_robust() is that there’s no way to see the first stage, so if you want to check for relevance, show the F-test, or show the coefficients for the instruments, you’ll have to run a first_stage model on your own, like the sketch below.
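
For instance, here’s a minimal sketch of that manual check (the first_stage_check name is just for illustration). Because this first stage has no control variables, the model-level F-statistic is also the instrument F-test; the anova() comparison shows the more general nested-model approach that still works once you add controls:

# Run the first stage by hand so we can inspect instrument strength
first_stage_check <- lm(education ~ education_dad + education_mom, data = ed_real)

# Model-level F-statistic (here it's the instrument F-test too)
glance(first_stage_check)$statistic

# Joint F-test for the instruments via nested model comparison
anova(lm(education ~ 1, data = ed_real), first_stage_check)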

Models from iv_robust() also work with huxreg(). Note how the education variable isn’t renamed educ_hat in the iv_robust() version; it’s still using predicted education even if that’s not obvious from the name:

huxreg(list("OLS" = model_naive, "2SLS" = second_stage, 
            "2SLS automatic" = model_same_time))
|               | OLS        | 2SLS        | 2SLS automatic |
|---------------|------------|-------------|----------------|
| (Intercept)   | 175.160    | -537.712 *  | -537.712 *     |
|               | (92.839)   | (208.164)   | (214.431)      |
| education     | 59.452 *** |             | 111.561 ***    |
|               | (6.698)    |             | (15.901)       |
| education_hat |            | 111.561 *** |                |
|               |            | (15.176)    |                |
| N             | 663        | 663         | 663            |
| R2            | 0.106      | 0.076       | 0.025          |
| logLik        | -4885.974  | -4897.252   |                |
| AIC           | 9777.947   | 9800.503    |                |

*** p < 0.001; ** p < 0.01; * p < 0.05.

Education, wages, and distance to college (real data)

For this last example we’ll estimate the effect of education on wages using a different instrument—geographic proximity to colleges. This data comes from David Card’s 1995 study where he did the same thing, and it’s available in the wooldridge library as card. You can find a description of all variables here; we’ll use these:

| Variable name | Description |
|---------------|-------------|
| lwage | Annual wage (log form) |
| educ | Years of education |
| nearc4 | Living close to college (=1) or far from college (=0) |
| smsa | Living in metropolitan area (=1) or not (=0) |
| exper | Years of experience |
| expersq | Years of experience (squared term) |
| black | Black (=1), not black (=0) |
| south | Living in the south (=1) or not (=0) |

Once again, Card wants to estimate the impact of education on wage. But to solve the ability bias, he utilizes a different instrumental variable: proximity to college. He provides arguments to support each of three main characteristics of a good instrumental variable:

  1. Relevance: People who live close to a 4-year college have easier access to education at a lower cost (no commuting time, travel costs, or accommodation costs), so they have greater incentives to pursue education.
  2. Exclusion: Proximity to a college has no effect on your annual income, unless you decide to pursue further education because of the nearby college.
  3. Exogeneity: Individual ability does not depend on proximity to a college.

Therefore, he estimates a model where:

First stage:

\[ \widehat{\text{Educ}} = \beta_0 + \beta_1\text{nearc4} + \beta_{2-6}\text{Control variables} \]

Second stage:

\[ \text{lwage} = \beta_0 + \beta_1 \widehat{\text{Educ}} + \beta_{2-6}\text{Control variables} \]

He controls for five things: smsa66 + exper + expersq + black + south66.

We can do the same thing. IMPORTANT NOTE: When you include controls, every control variable needs to go in both stages. The only things from the first stage that don’t carry over to the second stage are the instruments—notice how nearc4 is only in the first stage, since it’s the instrument, but it’s not in the second stage. The other controls are all in both stages.

# Load data
data("card")

# First we'll build a naive model without any instruments so we can see the bias
# in the educ coefficient
naive_model <- lm(lwage ~ educ + smsa66 + exper + expersq + black + south66, 
                  data = card)

# Then we'll run the first stage, predicting educ with nearc4 + all the controls
first_stage <- lm(educ ~ nearc4 + smsa66 + exper + expersq + black + south66,
                  data = card)

# Then we'll add the fitted education values into the original dataset and
# rename the .fitted column so it's easier to work with
card <- augment_columns(first_stage, card) %>% 
  rename(educ_hat = .fitted)

# Finally we can run the second stage model using the predicted education from
# the first stage
second_stage <- lm(lwage ~ educ_hat + smsa66 + exper + expersq + black + south66, 
                  data = card)

# Just for fun, we can do all of this at the same time with iv_robust
model_2sls <- iv_robust(lwage ~ educ + smsa66 + exper + expersq + black + south66 | 
                          nearc4 + smsa66 + exper + expersq + black + south66,
                        data = card)
huxreg(list("OLS" = naive_model, "2SLS" = second_stage, "2SLS (robust)" = model_2sls))
|             | OLS        | 2SLS       | 2SLS (robust) |
|-------------|------------|------------|---------------|
| (Intercept) | 4.731 ***  | 3.357 ***  | 3.357 ***     |
|             | (0.069)    | (0.926)    | (0.930)       |
| educ        | 0.076 ***  |            | 0.157 **      |
|             | (0.004)    |            | (0.055)       |
| smsa66      | 0.113 ***  | 0.081 **   | 0.081 **      |
|             | (0.015)    | (0.027)    | (0.027)       |
| exper       | 0.085 ***  | 0.118 ***  | 0.118 ***     |
|             | (0.007)    | (0.023)    | (0.024)       |
| expersq     | -0.002 *** | -0.002 *** | -0.002 ***    |
|             | (0.000)    | (0.000)    | (0.000)       |
| black       | -0.177 *** | -0.104     | -0.104 *      |
|             | (0.018)    | (0.053)    | (0.052)       |
| south66     | -0.096 *** | -0.064 *   | -0.064 *      |
|             | (0.016)    | (0.028)    | (0.028)       |
| educ_hat    |            | 0.157 **   |               |
|             |            | (0.055)    |               |
| N           | 3010       | 3010       | 3010          |
| R2          | 0.269      | 0.160      | 0.143         |
| logLik      | -1352.703  | -1563.264  |               |
| AIC         | 2721.406   | 3142.528   |               |

*** p < 0.001; ** p < 0.01; * p < 0.05.

Notice how educ_hat and educ are the same in each of the 2SLS models, and they’re higher than in the naive uninstrumented model. Because the outcome is log wages, we can say that an extra year of education causes a roughly 15.7% increase in wages.
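
As a side note, reading the 0.157 coefficient directly as 15.7% relies on the approximation that log coefficients are percent changes; the exact implied change is a bit larger:

# Exact percent change implied by a log coefficient of 0.157
exp(0.157) - 1   # = 0.17, about a 17% increase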

Clearest and muddiest things

Go to this form and answer these three questions:

  1. What was the muddiest thing from class today? What are you still wondering about?
  2. What was the clearest thing from class today?
  3. What was the most exciting thing you learned?

I’ll compile the questions and send out answers after class.