Instrumental variables I

Materials for class on Monday, November 11, 2019

Slides

Download the slides from today’s class.

First slide

Running R from your computer

Download and unzip this folder to get started:

IV/2SLS examples

library(tidyverse)   # For ggplot, %>%, and friends
library(broom)       # For converting models to data frames
library(huxtable)    # For side-by-side regression tables
library(AER)         # For some econometrics datasets
library(wooldridge)  # More econometrics datasets
library(estimatr)    # For iv_robust

Education, wages, and parent education (fake data)

Download this data and put it in a folder named “data” in your project root:

ed_fake <- read_csv("data/father_education.csv")

We’re interested in the perennial econometrics question of whether an extra year of education causes increased wages. In this example we use simulated/fake data that includes the following variables:

| Variable name | Description |
|---------------|-------------|
| wage | Weekly wage |
| educ | Years of education |
| ability | Magical column that measures your ability to work and go to school (omitted variable) |
| fathereduc | Years of education for father |

If we could actually measure ability, we could estimate this model, which closes the confounding backdoor posed by ability and isolates just the effect of education on wages:

model_perfect <- lm(wage ~ educ + ability, data = ed_fake)
tidy(model_perfect)
| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | -80.2626923 | 5.6585424 | -14.18434 | 0 |
| educ | 9.2424696 | 0.3426683 | 26.97206 | 0 |
| ability | 0.2580724 | 0.0071809 | 35.93874 | 0 |

However, in real life we don’t have ability, so we’re stuck with a naive model:

model_naive <- lm(wage ~ educ, data = ed_fake)
tidy(model_naive)
| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | -53.08545 | 8.4920070 | -6.251225 | 0 |
| educ | 12.24036 | 0.5033011 | 24.320144 | 0 |

The naive model overestimates the effect of education on wages (12.2 vs. 9.24) because of omitted variable bias. Education suffers from endogeneity: things that are correlated with it (like ability) are hiding in the model’s error term. Any estimate we calculate will be wrong and biased because of selection effects or omitted variable bias (all different names for endogeneity).
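
Because this is fake data, we can actually watch the endogeneity happen, something a real dataset would never let us do. A minimal check using the magical ability column:

# educ is correlated with ability, and ability lives in the naive model's
# error term; that correlation is exactly what biases the naive estimate
cor(ed_fake$educ, ed_fake$ability)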

To fix the endogeneity problem, we can use an instrument to remove the endogeneity from education and instead use a special exogeneity-only version of education. Perhaps someone’s father’s education can be an instrument for education.

To be a valid instrument, it must meet three criteria:

  1. Relevance: Instrument is correlated with policy variable
  2. Exclusion: Instrument is correlated with outcome only through the policy variable
  3. Exogeneity: Instrument isn’t correlated with anything else in the model (i.e. omitted variables)

We can first test relevance by making a scatterplot and running a model of policy ~ instrument:

ggplot(ed_fake, aes(x = fathereduc, y = educ)) +
  geom_point() +
  geom_smooth(method = "lm")

check_relevance <- lm(educ ~ fathereduc, data = ed_fake)
tidy(check_relevance)
| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | 4.3955891 | 0.3986274 | 11.02681 | 0 |
| fathereduc | 0.7569158 | 0.0242786 | 31.17622 | 0 |
glance(check_relevance)
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|-----------|---------------|-------|-----------|---------|----|--------|-----|-----|----------|-------------|
| 0.4933898 | 0.4928822 | 1.595708 | 971.9564 | 0 | 2 | -1885.255 | 3776.51 | 3791.233 | 2541.192 | 998 |

This looks pretty good! The F-statistic is definitely above 10 (it’s 972!), and there’s a significant relationship between the instrument and policy. I’d say that this is relevant.
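
If you want to pull that F-statistic out of the model programmatically instead of eyeballing the glance() output, here’s one small sketch:

# Extract the model-level F-statistic from glance()
glance(check_relevance)$statistic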

To check for exclusion, we need to see if there’s a relationship between father’s education and wages that occurs only because of education. If we plot it, we’ll see a relationship:

ggplot(ed_fake, aes(x = fathereduc, y = wage)) +
  geom_point() +
  geom_smooth(method = "lm")

That’s to be expected, since in our model, father’s education causes education, which causes wages: they should be correlated. We have to use theory to justify the idea that a father’s education increases one’s weekly wage only because it increases one’s education, and there’s no real statistical test for that.

There’s not really a test for exogeneity either, since there’s no way to measure other endogenous variables in the model (that’s the whole reason we’re using IVs in the first place!). Because we have the magical ability column in this fake data, though, we can check it directly. Father’s education shouldn’t be related to ability:

ggplot(ed_fake, aes(x = ability, y = fathereduc)) +
  geom_point() +
  geom_smooth(method = "lm")

And it’s not! We can safely say that it meets the exogeneity assumption.

In real data, though, where we can’t see omitted variables like ability, there’s no statistical test for exogeneity. We just have to tell a theory-based story that the number of years of education one’s father has is not correlated with anything else in the model (including any omitted variables). Good luck with that; it’s probably not a good instrument. This relates to Scott Cunningham’s argument that instruments have to be weird. According to Scott:

The reason I think this is because an instrument doesn’t belong in the structural error term and the structural error term is all the intuitive things that determine your outcome. So it must be weird, otherwise it’s probably in the error term.

Let’s just pretend that father’s education is a valid instrument and move on :)

Now we can do two-stage least squares (2SLS) regression and use the instrument to filter out the endogenous part of education. The first stage predicts education based on the instrument (we already ran this model earlier when checking for relevance, but we’ll do it again just for fun):

first_stage <- lm(educ ~ fathereduc, data = ed_fake)

Now we want to add a column of predicted education to our original dataset. The easiest way to do that is with the augment_columns() function from the broom library:

ed_fake_with_prediction <- augment_columns(first_stage, ed_fake)
head(ed_fake_with_prediction)
| wage | educ | ability | fathereduc | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
|------|------|---------|------------|---------|---------|--------|------|--------|---------|------------|
| 146.3478 | 18.05319 | 348.2400 | 17.15832 | 17.38299 | 0.0547177 | 0.6701995 | 0.0011758 | 1.596367 | 0.0001040 | 0.4202485 |
| 147.5996 | 15.84549 | 181.1607 | 13.98853 | 14.98373 | 0.0752312 | 0.8617548 | 0.0022227 | 1.596274 | 0.0003256 | 0.5406466 |
| 161.8202 | 15.10521 | 337.3677 | 15.99431 | 16.50194 | 0.0509579 | -1.3967293 | 0.0010198 | 1.595895 | 0.0003915 | -0.8757505 |
| 105.0829 | 16.45813 | 106.4580 | 21.41317 | 20.60356 | 0.1343020 | -4.1454269 | 0.0070837 | 1.591062 | 0.0242457 | -2.6071108 |
| 167.5622 | 18.79382 | 301.5100 | 16.45763 | 16.85263 | 0.0506309 | 1.9411855 | 0.0010068 | 1.595323 | 0.0007464 | 1.2171170 |
| 172.9273 | 16.01597 | 283.6524 | 15.42531 | 16.07125 | 0.0546236 | -0.0552788 | 0.0011718 | 1.596507 | 0.0000007 | -0.0346625 |

Note a couple of these new columns. .fitted is the fitted/predicted value of education, and it’s the version of education with endogeneity arguably removed. .resid shows how far off the prediction is from educ. The other columns don’t matter so much.
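
If you want to convince yourself of what these columns contain, here’s a quick optional sanity check:

# .fitted is identical to predict() on the first stage model
all.equal(ed_fake_with_prediction$.fitted, unname(predict(first_stage)))

# .resid is the gap between actual and predicted education
all.equal(ed_fake_with_prediction$.resid,
          ed_fake_with_prediction$educ - ed_fake_with_prediction$.fitted)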

Instead of dealing with weird names like .fitted, I like to rename the fitted variable to something more understandable after I use augment_columns:

ed_fake_with_prediction <- augment_columns(first_stage, ed_fake) %>% 
  rename(educ_hat = .fitted)

head(ed_fake_with_prediction)
| wage | educ | ability | fathereduc | educ_hat | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
|------|------|---------|------------|----------|---------|--------|------|--------|---------|------------|
| 146.3478 | 18.05319 | 348.2400 | 17.15832 | 17.38299 | 0.0547177 | 0.6701995 | 0.0011758 | 1.596367 | 0.0001040 | 0.4202485 |
| 147.5996 | 15.84549 | 181.1607 | 13.98853 | 14.98373 | 0.0752312 | 0.8617548 | 0.0022227 | 1.596274 | 0.0003256 | 0.5406466 |
| 161.8202 | 15.10521 | 337.3677 | 15.99431 | 16.50194 | 0.0509579 | -1.3967293 | 0.0010198 | 1.595895 | 0.0003915 | -0.8757505 |
| 105.0829 | 16.45813 | 106.4580 | 21.41317 | 20.60356 | 0.1343020 | -4.1454269 | 0.0070837 | 1.591062 | 0.0242457 | -2.6071108 |
| 167.5622 | 18.79382 | 301.5100 | 16.45763 | 16.85263 | 0.0506309 | 1.9411855 | 0.0010068 | 1.595323 | 0.0007464 | 1.2171170 |
| 172.9273 | 16.01597 | 283.6524 | 15.42531 | 16.07125 | 0.0546236 | -0.0552788 | 0.0011718 | 1.596507 | 0.0000007 | -0.0346625 |

We can now use the new educ_hat variable in our second stage model:

second_stage <- lm(wage ~ educ_hat, data = ed_fake_with_prediction)
tidy(second_stage)
| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | -3.107953 | 14.3703343 | -0.2162757 | 0.8288171 |
| educ_hat | 9.251863 | 0.8555225 | 10.8142831 | 0.0000000 |

The estimate for educ_hat is arguably more accurate now because we’ve used the instrument to strip out the endogenous part of education, leaving only the exogenous part. (One caveat: the standard errors from this manual two-step approach aren’t quite right; the iv_robust() function we use later handles that.)

We can put all the models side-by-side to compare them:

huxreg(list("Perfect" = model_perfect, "OLS" = model_naive, "2SLS" = second_stage))
|             | Perfect     | OLS         | 2SLS       |
|-------------|-------------|-------------|------------|
| (Intercept) | -80.263 *** | -53.085 *** | -3.108     |
|             | (5.659)     | (8.492)     | (14.370)   |
| educ        | 9.242 ***   | 12.240 ***  |            |
|             | (0.343)     | (0.503)     |            |
| ability     | 0.258 ***   |             |            |
|             | (0.007)     |             |            |
| educ_hat    |             |             | 9.252 ***  |
|             |             |             | (0.856)    |
| N           | 1000        | 1000        | 1000       |
| R2          | 0.726       | 0.372       | 0.105      |
| logLik      | -4576.101   | -4991.572   | -5168.868  |
| AIC         | 9160.202    | 9989.144    | 10343.735  |

*** p < 0.001; ** p < 0.01; * p < 0.05.

Note how the coefficient for educ_hat in the 2SLS model is basically the same as the coefficient for educ in the perfect model that accounts for ability. That’s the magic of instrumental variables!

Education, wages, and parent education (real data)

This data comes from the wage2 dataset in the wooldridge R package (and it’s real!). Wages are measured in monthly earnings in 1980 dollars.

ed_real <- wage2 %>% 
  rename(education = educ, education_dad = feduc, education_mom = meduc) %>%
  na.omit()  # Get rid of rows with missing values

We want to again estimate the effect of education on wages, but this time we’ll use both one’s father’s education and one’s mother’s education as instruments. Here’s the naive estimation of the relationship, which suffers from endogeneity:

model_naive <- lm(wage ~ education, data = ed_real)
tidy(model_naive)
| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | 175.15999 | 92.838561 | 1.886716 | 0.0596366 |
| education | 59.45181 | 6.697942 | 8.876131 | 0.0000000 |

This is wrong though! Education is endogenous to unmeasured things in the model (like ability, which lives in the error term). We can isolate the exogenous part of education with an instrument.

Before doing any 2SLS models, we want to check the validity of the instruments. Remember, for an instrument to be valid, it should meet these criteria:

  1. Relevance: Instrument is correlated with policy variable
  2. Exclusion: Instrument is correlated with outcome only through the policy variable
  3. Exogeneity: Instrument isn’t correlated with anything else in the model (i.e. omitted variables)

We can check for relevance by looking at the relationship between the instruments and education:

# Combine father's and mother's education into one column so we can plot both at the same time
ed_real_long <- ed_real %>% 
  pivot_longer(cols = c(education_dad, education_mom), 
               names_to = "instrument", values_to = "instrument_value")

ggplot(ed_real_long, aes(x = instrument_value, y = education)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  facet_wrap(~ instrument)

model_check_instruments <- lm(education ~ education_dad + education_mom, 
                              data = ed_real)
tidy(model_check_instruments)
| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | 9.9132852 | 0.3195911 | 31.018649 | 0.00e+00 |
| education_dad | 0.2189424 | 0.0288925 | 7.577836 | 0.00e+00 |
| education_mom | 0.1401693 | 0.0336511 | 4.165370 | 3.52e-05 |
glance(model_check_instruments)
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|-----------|---------------|-------|-----------|---------|----|--------|-----|-----|----------|-------------|
| 0.2015354 | 0.1991158 | 1.996932 | 83.29321 | 0 | 3 | -1397.792 | 2803.584 | 2821.571 | 2631.908 | 660 |

There’s a clear relationship between both of the instruments and education, and the coefficients for each are significant. The F-statistic for the model is 83, comfortably above the rule-of-thumb threshold of 10, which is a good sign of a strong instrument.

We can check for exclusion in part by looking at the relationship between the instruments and the outcome, or wages. We should see some relationship:

ggplot(ed_real_long, aes(x = instrument_value, y = wage)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  facet_wrap(~ instrument)

And we do! Now we just have to make the case that parental education influences wages only through one’s own education and not through any other channel. Good luck with that.

The last step is to prove exogeneity: that parental education is not correlated with anything else in the model, including omitted variables like ability. Good luck with that too.

Assuming that parental education is a good instrument, we can use it to remove the endogenous part of education using 2SLS. In the first stage, we predict education using our instruments:

first_stage <- lm(education ~ education_dad + education_mom, data = ed_real)

We can then extract the predicted education and add it to our main dataset, renaming the .fitted variable to something more useful along the way:

ed_real_with_predicted <- augment_columns(first_stage, ed_real) %>% 
  rename(education_hat = .fitted)

Finally, we can use predicted education to estimate the exogenous effect of education on wages:

second_stage <- lm(wage ~ education_hat, 
                   data = ed_real_with_predicted)
tidy(second_stage)
| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | -537.7122 | 208.16441 | -2.583113 | 0.010005 |
| education_hat | 111.5614 | 15.17586 | 7.351244 | 0.000000 |

That should arguably be our actual effect! Let’s compare it to the naive model:

huxreg(list("OLS" = model_naive, "2SLS" = second_stage))
|               | OLS        | 2SLS        |
|---------------|------------|-------------|
| (Intercept)   | 175.160    | -537.712 *  |
|               | (92.839)   | (208.164)   |
| education     | 59.452 *** |             |
|               | (6.698)    |             |
| education_hat |            | 111.561 *** |
|               |            | (15.176)    |
| N             | 663        | 663         |
| R2            | 0.106      | 0.076       |
| logLik        | -4885.974  | -4897.252   |
| AIC           | 9777.947   | 9800.503    |

*** p < 0.001; ** p < 0.01; * p < 0.05.

The 2SLS effect is roughly twice as large and is arguably more accurate, since it has removed the endogeneity from education. An extra year of school leads to an extra $111.56 a month in income (in 1980 dollars).

If you don’t want to go through the hassle of doing the two stages by hand, you can use the iv_robust() function from the estimatr package to do both stages at the same time. The second-stage model goes on the left side of the |, just like a normal regression; the instruments for the first stage go on the right side of the |:

model_same_time <- iv_robust(wage ~ education | education_dad + education_mom,
                             data = ed_real)
tidy(model_same_time)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|------|----------|-----------|-----------|---------|----------|-----------|----|---------|
| (Intercept) | -537.7122 | 214.43142 | -2.507618 | 0.0123934 | -958.76099 | -116.6633 | 661 | wage |
| education | 111.5614 | 15.90091 | 7.016040 | 0.0000000 | 80.33905 | 142.7838 | 661 | wage |

We should get the same coefficient as the second stage, but the standard errors with iv_robust() are more accurate. The only problem with iv_robust() is that there’s no way to see the first stage, so if you want to check for relevance, show the F-test, or show the coefficients for the instruments, you’ll have to run a first_stage model on your own, like the sketch below.
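
For instance, here’s a minimal sketch of that manual check (the first_stage_check name is just for illustration). Because this first stage has no control variables, the model-level F-statistic is also the instrument F-test; the anova() comparison shows the more general nested-model approach that still works once you add controls:

# Run the first stage by hand so we can inspect instrument strength
first_stage_check <- lm(education ~ education_dad + education_mom, data = ed_real)

# Model-level F-statistic (here it's the instrument F-test too)
glance(first_stage_check)$statistic

# Joint F-test for the instruments via nested model comparison
anova(lm(education ~ 1, data = ed_real), first_stage_check)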

Models from iv_robust() also work with huxreg(). Note how the education variable isn’t renamed educ_hat in the iv_robust() version; it’s still using predicted education even if that’s not obvious from the name:

huxreg(list("OLS" = model_naive, "2SLS" = second_stage, 
            "2SLS automatic" = model_same_time))
|               | OLS        | 2SLS        | 2SLS automatic |
|---------------|------------|-------------|----------------|
| (Intercept)   | 175.160    | -537.712 *  | -537.712 *     |
|               | (92.839)   | (208.164)   | (214.431)      |
| education     | 59.452 *** |             | 111.561 ***    |
|               | (6.698)    |             | (15.901)       |
| education_hat |            | 111.561 *** |                |
|               |            | (15.176)    |                |
| N             | 663        | 663         | 663            |
| R2            | 0.106      | 0.076       | 0.025          |
| logLik        | -4885.974  | -4897.252   |                |
| AIC           | 9777.947   | 9800.503    |                |

*** p < 0.001; ** p < 0.01; * p < 0.05.

Education, wages, and distance to college (real data)

For this last example we’ll estimate the effect of education on wages using a different instrument—geographic proximity to colleges. This data comes from David Card’s 1995 study where he did the same thing, and it’s available in the wooldridge library as card. You can find a description of all variables here; we’ll use these:

| Variable name | Description |
|---------------|-------------|
| lwage | Annual wage (log form) |
| educ | Years of education |
| nearc4 | Living close to college (=1) or far from college (=0) |
| smsa | Living in metropolitan area (=1) or not (=0) |
| exper | Years of experience |
| expersq | Years of experience (squared term) |
| black | Black (=1), not black (=0) |
| south | Living in the south (=1) or not (=0) |

Once again, Card wants to estimate the impact of education on wage. But to solve the ability bias, he utilizes a different instrumental variable: proximity to college. He provides arguments to support each of three main characteristics of a good instrumental variable:

  1. Relevance: People who live close to a 4-year college have easier access to education at a lower cost (no commuting time, travel costs, or accommodation costs), so they have greater incentives to pursue education.
  2. Exclusion: Proximity to a college has no effect on your annual income, unless you decide to pursue further education because of the nearby college.
  3. Exogeneity: Individual ability does not depend on proximity to a college.

Therefore, he estimates a model where:

First stage:

\[ \widehat{\text{Educ}} = \beta_0 + \beta_1\text{nearc4} + \beta_{2-6}\text{Control variables} \]

Second stage:

\[ \text{lwage} = \beta_0 + \beta_1 \widehat{\text{Educ}} + \beta_{2-6}\text{Control variables} \]

He controls for five things: smsa66 + exper + expersq + black + south66.

We can do the same thing. IMPORTANT NOTE: When you include controls, every control variable needs to go in both stages. The only things from the first stage that don’t carry over to the second stage are the instruments—notice how nearc4 is only in the first stage, since it’s the instrument, but it’s not in the second stage. The other controls are all in both stages.

# Load data
data("card")

# First we'll build a naive model without any instruments so we can see the bias
# in the educ coefficient
naive_model <- lm(lwage ~ educ + smsa66 + exper + expersq + black + south66, 
                  data = card)

# Then we'll run the first stage, predicting educ with nearc4 + all the controls
first_stage <- lm(educ ~ nearc4 + smsa66 + exper + expersq + black + south66,
                  data = card)

# Then we'll add the fitted education values into the original dataset and
# rename the .fitted column so it's easier to work with
card <- augment_columns(first_stage, card) %>% 
  rename(educ_hat = .fitted)

# Finally we can run the second stage model using the predicted education from
# the first stage
second_stage <- lm(lwage ~ educ_hat + smsa66 + exper + expersq + black + south66, 
                  data = card)

# Just for fun, we can do all of this at the same time with iv_robust
model_2sls <- iv_robust(lwage ~ educ + smsa66 + exper + expersq + black + south66 | 
                          nearc4 + smsa66 + exper + expersq + black + south66,
                        data = card)
huxreg(list("OLS" = naive_model, "2SLS" = second_stage, "2SLS (robust)" = model_2sls))
|             | OLS        | 2SLS       | 2SLS (robust) |
|-------------|------------|------------|---------------|
| (Intercept) | 4.731 ***  | 3.357 ***  | 3.357 ***     |
|             | (0.069)    | (0.926)    | (0.930)       |
| educ        | 0.076 ***  |            | 0.157 **      |
|             | (0.004)    |            | (0.055)       |
| smsa66      | 0.113 ***  | 0.081 **   | 0.081 **      |
|             | (0.015)    | (0.027)    | (0.027)       |
| exper       | 0.085 ***  | 0.118 ***  | 0.118 ***     |
|             | (0.007)    | (0.023)    | (0.024)       |
| expersq     | -0.002 *** | -0.002 *** | -0.002 ***    |
|             | (0.000)    | (0.000)    | (0.000)       |
| black       | -0.177 *** | -0.104     | -0.104 *      |
|             | (0.018)    | (0.053)    | (0.052)       |
| south66     | -0.096 *** | -0.064 *   | -0.064 *      |
|             | (0.016)    | (0.028)    | (0.028)       |
| educ_hat    |            | 0.157 **   |               |
|             |            | (0.055)    |               |
| N           | 3010       | 3010       | 3010          |
| R2          | 0.269      | 0.160      | 0.143         |
| logLik      | -1352.703  | -1563.264  |               |
| AIC         | 2721.406   | 3142.528   |               |

*** p < 0.001; ** p < 0.01; * p < 0.05.

Notice how educ_hat and educ are the same in each of the 2SLS models, and they’re higher than in the naive uninstrumented model. Because the outcome is log wages, we can say that an extra year of education causes a roughly 15.7% increase in wages.
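
As a side note, reading the 0.157 coefficient directly as 15.7% relies on the approximation that log coefficients are percent changes; the exact implied change is a bit larger:

# Exact percent change implied by a log coefficient of 0.157
exp(0.157) - 1   # = 0.17, about a 17% increase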

Clearest and muddiest things

Go to this form and answer these three questions:

  1. What was the muddiest thing from class today? What are you still wondering about?
  2. What was the clearest thing from class today?
  3. What was the most exciting thing you learned?

I’ll compile the questions and send out answers after class.