I've spent time over the last few months attempting to enhance my skills in the statistical sub-field of causal inference.

Oversimplifying a bit, causal inference comprises a set of methodologies and techniques that help analysts make the jump from association or correlation to cause and effect. How does one progress from noting a correlation between factors X and Y to soundly concluding that X causes Y? The association between X and Y could arise, for example, from X causing Y, from Y causing X, or from both X and Y being influenced by a third factor Z. The field of causal inference aims to disentangle these possibilities.
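A tiny simulation makes the third possibility concrete. In this sketch (my illustration, not part of the original analysis), a lurking variable Z drives both X and Y, so X and Y are strongly correlated even though neither causes the other:

```r
# Hypothetical example: Z causes both X and Y; X has no effect on Y.
set.seed(1)
n <- 10000
z <- rnorm(n)           # the unmeasured confounder
x <- 2*z + rnorm(n)     # X depends only on Z
y <- 3*z + rnorm(n)     # Y depends only on Z -- X never enters
cor(x, y)               # strong positive correlation despite no X -> Y link
cor(resid(lm(x ~ z)), resid(lm(y ~ z)))   # roughly 0 once Z is held fixed
```

Conditioning on Z (here via residuals) makes the spurious X-Y association vanish, which is exactly what causal inference methods try to accomplish when Z cannot be controlled by design.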

The gold standard design for causal inference is the randomized experiment, in which subjects are randomly assigned to the different treatment groups. With randomization, the effects of uncontrolled or confounding factors (the Z's) should, within sampling limits, be "equal" or "balanced" across treatments or values of X. With the Z's thus "controlled," the analyst can be much more confident that a correlation between X and Y actually indicates causation.
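The balancing effect of randomization is easy to demonstrate. In this toy sketch (mine, with made-up data), coin-flip assignment leaves a would-be confounder like age with nearly identical distributions in the two groups:

```r
# Random assignment balances a confounder across treatment groups.
set.seed(2)
n <- 100000
age <- sample(20:70, n, replace=TRUE)    # a potential confounder
treat <- sample(0:1, n, replace=TRUE)    # coin-flip treatment assignment
tapply(age, treat, mean)                 # group means nearly identical
abs(diff(tapply(age, treat, mean)))      # difference is just sampling noise
</imports>
```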

But what about the in-situ data gathering generally seen in the data science world, where data are observational and confounders are free to roam? What is one to do? The answer: consider causal inference techniques that attempt to statistically mimic the randomized experiment.

Years ago, I routinely deployed analysis of covariance methods that included confounders directly in the linear models, the reasoning being that if X retained a strong coefficient after adjusting for the confounders, then X caused Y. Not a very compelling solution. Over the years, I've also tried propensity scores, which attempt to summarize the confounders in a single number. Then there are matching techniques like those used below. The idea behind matching is to construct a control group that resembles the treatment group on the distribution of confounders. If matching succeeds, the confounders are balanced and hence other things are essentially equal -- so a strong residual relationship between X and Y is evidence of cause and effect. That is, provided the critical ignorability assumption holds: that all important confounders have been identified a priori.
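The core idea of greedy nearest-neighbor matching can be sketched in a few lines of base R. This is a simplified illustration on a single numeric confounder with invented data; MatchIt's actual implementation (used below) is considerably more sophisticated:

```r
# Greedy nearest-neighbor matching on one confounder (age), each control
# usable at most once. Data are simulated for illustration only.
set.seed(3)
treated_age <- sample(30:60, 50, replace=TRUE)    # smaller, older treated group
control_age <- sample(25:65, 500, replace=TRUE)   # larger control pool
available <- rep(TRUE, length(control_age))
matches <- integer(length(treated_age))
for (i in seq_along(treated_age)) {
    d <- abs(control_age - treated_age[i])
    d[!available] <- Inf                 # exclude already-used controls
    matches[i] <- which.min(d)           # closest remaining control
    available[matches[i]] <- FALSE
}
# The matched controls now track the treated distribution closely.
mean(treated_age); mean(control_age[matches])
```

With a large enough control pool, the matched controls mirror the treated group on the confounder, which is what "balance" means in this context.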

While searching the web for causal modeling didactics, I came across a well-assembled, intermediate-level Coursera curriculum, A Crash Course in Causality: Inferring Causal Effects from Observational Data. It turned out to be just what the statistical doctor ordered. I invested a day absorbing many of Professor Jason Roy's accessible lectures, including a full analysis example worked through in R. I then took what I'd learned and applied it to my own data with my own R code.

What follows is an application of matching techniques to data from the American Community Survey census data. The question I'm asking is whether sample respondents with a terminal master's degree earn more than those with a terminal bachelor's. The treatment is thus MA, the control BA. The confounders considered are age (MA respondents are older), sex (more female MA respondents), and marital status (MA respondents are more likely to be married), among others.

I first subset all terminal bachelor's and master's respondents into a data set, then take a random sample of 250,000 of these cases. Finally, I invoke the R MatchIt package for dense, greedy, nearest neighbor matching of treatment (has a terminal MA) to control (has a terminal BA) on the identified confounders. This process ultimately produces an R data.table with a single matched control for each treatment record. It turns out that MA respondents do indeed have considerably higher average incomes than BAs. But when matched on the confounders, the mean annual difference shrinks by over $2,500. This makes sense, since, among other things, the MA sample averages several years older than the BA sample.

The technology used is JupyterLab with Microsoft R Open 3.4.4. For the matching work, the MatchIt and tableone packages are deployed.

Load libraries and set the sampling seed and working directory.

In [27]:

```
options(warn=-1)
options(scipen=10)
options(datatable.print.topn=100)
options(datatable.showProgress=FALSE)
suppressMessages(library(data.table))
suppressMessages(library(pryr))
suppressMessages(library(plyr))
suppressMessages(library(dplyr))
suppressMessages(library(dtplyr))
suppressMessages(library(fst))
suppressMessages(library(purrr))
suppressMessages(library(RColorBrewer))
suppressMessages(library(stringr))
suppressMessages(library(lubridate))
suppressMessages(library(ggplot2))
suppressMessages(library(knitr))
suppressMessages(library(kableExtra))
suppressMessages(library(IRdisplay))
suppressMessages(library(MatchIt))
suppressMessages(library(tableone))
suppressMessages(library(rcbalance))
wdir <- "C:/bigdata/pums/2016/population"
setwd(wdir)
set.seed(321)
cat("\n\n")
```

Define a few useful functions.

In [2]:

```
meta <- function(dt, dict=FALSE)
{
    print(dim(dt))
    print(object_size(dt))
    if (dict==TRUE) print(str(dt))
    cat("\n\n")
}
cat("\n\n")
```

In [3]:

```
frequenciesdyn <- function(DTstr, xstr)
{
    return(eval(parse(text=sprintf('%s[,.(count=.N),.(%s)]', DTstr, xstr))) %>%
           arrange(desc(count)) %>% mutate(percent=100*count/sum(count)))
}
cat("\n\n")
```

In [4]:

```
allfreqs <- function(dtn)
{
    dt <- eval(parse(text=paste("data.table(", dtn, ")", sep="")))
    nmes <- names(dt)
    lst <- map(nmes, function(nm) frequenciesdyn(dtn, nm))
    names(lst) <- nmes
    return(lst)
}
cat("\n\n")
```

Read several attributes from my fst-saved file of the 2016 American Community Survey (ACS), Public Use MicroData Sample (PUMS) data. The data set consists of over 3.1M records.

In [5]:

```
cols <- c("serialno","agep","schl","sex","mar", "hisp","rac1p","pincp")
pumspop2016 <- read_fst("pop2016.fst",columns=cols,as.data.table=TRUE)
meta(pumspop2016,dict=TRUE)
cat("\n\n")
```

Subset the data to include respondents between 20 and 70 years old whose educational attainment includes terminal BA/BS or MA/MS degrees.

In [6]:

```
pumspop2016 <- pumspop2016[agep >= 20 & agep <= 70 & pincp > 100 & (schl == 21 | schl == 22)]
meta(pumspop2016)
cat("\n\n")
```

Of the 547,965 qualifiers, 381,225 earned a BA/BS and 166,740 an MA/MS.

In [7]:

```
frequenciesdyn("pumspop2016","schl")
cat("\n\n")
```

Create additional attributes, compute a coarsened age attribute, and rename the variables.

In [8]:

```
invisible(pumspop2016[,`:=` (sex = sex - 1,
                             education = schl,
                             race = ifelse(hisp > 1, 3, car::recode(rac1p, "1=1;2=2;6=4;else=5")),
                             married = as.integer(mar == 1),
                             single = as.integer(mar == 5),
                             agecat = as.integer(round((agep - min(agep))/2)),
                             treat = schl - 21,
                             mar = NULL, schl = NULL, hisp = NULL, rac1p = NULL)
                     ][,`:=` (white = as.integer(race == 1),
                              black = as.integer(race == 2),
                              asian = as.integer(race == 4),
                              hispanic = as.integer(race == 3))
                     ])
setnames(pumspop2016,c("agep","pincp"), c("age", "income"))
meta(pumspop2016,dict=TRUE)
tail(pumspop2016,10)
cat("\n\n")
```

Tabulate frequencies of the final attributes.

In [9]:

```
allfreqs('pumspop2016[,!c("serialno","age","income"),with=FALSE]')
cat("\n\n")
```

Take a random sample of 250,000 of the cases to be used for subsequent matching.

In [10]:

```
nrr <- nrow(pumspop2016)
howmany <- 250000
samppumspop2016 <- pumspop2016[sample(nrr,howmany)]
meta(samppumspop2016,dict=TRUE)
cat("\n\n")
```

The "treated" group, those with an MA/MS degree, match up well, population to sample.

In [11]:

```
frequenciesdyn("pumspop2016","treat")
frequenciesdyn("samppumspop2016","treat")
cat("\n\n")
```

Consider matching on agecat, sex, married, single, white, black, asian, and hispanic. Take a look at how these compare between treatment and control in the sample. SMDs > 0.1 are cause for concern: agecat, married, single, sex, asian, and hispanic might all benefit from matching.

In [12]:

```
matchvars <- c("agecat","sex","married","single","white","black","asian","hispanic")
tab1 <- CreateTableOne(vars=matchvars,strata="treat",data=samppumspop2016,test=FALSE)
print(tab1,smd=TRUE)
cat("\n\n")
```

Create the matched data.table on the sampled data using the matchit function. This "dense" matching is computationally expensive, consuming over 70 minutes for the 76,000 treatment records; performance is O(N²) in the number of treated.

In [13]:

```
ptmpre <- proc.time()
form <- as.formula('treat ~ agecat + sex + married + single + white + black + hispanic + asian')
matchstr <- matchit(form,samppumspop2016,method="nearest")
ptmpost <- proc.time()
print(ptmpost-ptmpre)
cat("\n\n")
```

Define a function to convert matchit's final output structure to a data.table.

In [14]:

```
mkmatch <- function(matchstr)
{
    wcontrol <- as.integer(as.vector(matchstr$match.matrix[,1]))
    wtreat <- as.integer(as.vector(unlist(dimnames(matchstr$match.matrix)[1])))
    dt <- data.table(matchstr$model$data)
    return(rbind(dt[wtreat][, match := 1:length(wtreat)],
                 dt[wcontrol][, match := 1:length(wcontrol)]))
}
cat("\n\n")
```

Build the matched data.table with a total of 151,810 records -- 75,905 treated and 75,905 control.

In [15]:

```
vars <- c("agecat","sex","white","black","asian","hispanic","married","single","income","match","treat")
matchsamp <- mkmatch(matchstr)[,vars,with=FALSE]
meta(matchsamp,dict=TRUE)
cat("\n\n")
```

Examine the SMDs post-match. They look quite good -- the confounders are well balanced between treatment and control.

In [16]:

```
matchvars <- c("agecat","sex","married","single","white","black","asian","hispanic")
tab2 <- CreateTableOne(vars=matchvars,strata="treat",data=matchsamp,test=FALSE)
print(tab2,smd=TRUE)
cat("\n\n")
```

Split matchsamp into treatment and control.

In [17]:

```
splitmatch <- split(matchsamp,matchsamp[,treat])
head(splitmatch[[1]])
head(splitmatch[[2]])
cat("\n\n")
```

In [18]:

```
head(cbind(splitmatch[[1]],splitmatch[[2]]),10)
cat("\n\n")
```

Further evidence that treatment and control match well: 99.9% of treatment records were matched to a control with identical values on all confounders.

In [19]:

```
nrow(splitmatch[[1]])
cat("\n")
sum(splitmatch[[1]]$agecat==splitmatch[[2]]$agecat)
sum(splitmatch[[1]]$sex==splitmatch[[2]]$sex)
sum(splitmatch[[1]]$married==splitmatch[[2]]$married)
sum(splitmatch[[1]]$single==splitmatch[[2]]$single)
sum(splitmatch[[1]]$white==splitmatch[[2]]$white)
sum(splitmatch[[1]]$black==splitmatch[[2]]$black)
sum(splitmatch[[1]]$hispanic==splitmatch[[2]]$hispanic)
sum(splitmatch[[1]]$asian==splitmatch[[2]]$asian)
cat("\n\n")
```

In [20]:

```
sum(splitmatch[[1]]$agecat==splitmatch[[2]]$agecat &
splitmatch[[1]]$sex==splitmatch[[2]]$sex &
splitmatch[[1]]$married==splitmatch[[2]]$married &
splitmatch[[1]]$single==splitmatch[[2]]$single &
splitmatch[[1]]$white==splitmatch[[2]]$white &
splitmatch[[1]]$black==splitmatch[[2]]$black &
splitmatch[[1]]$hispanic==splitmatch[[2]]$hispanic &
splitmatch[[1]]$asian==splitmatch[[2]]$asian
)
cat("\n\n")
```

Notice that the mean coarsened age (agecat) of the matched controls equals that of the matched treated.

In [21]:

```
samppumspop2016[,.(.N, mean_agecat=round(mean(agecat))),.(treat)][order(treat)]
matchsamp[,.(.N, mean_agecat=round(mean(agecat))),.(treat)][order(treat)]
cat("\n\n")
```

The mean income of the matched controls is over $2,500 higher than that of the unmatched sample controls. This difference can be interpreted as the bias introduced by the imbalance in age and other confounders in the unmatched sample. That said, it certainly pays to get an MA/MS degree.

In [22]:

```
samppumspop2016[,.(.N, mean_income=round(mean(income))),.(treat)][order(treat)]
matchsamp[,.(.N, mean_income=round(mean(income))),.(treat)][order(treat)]
cat("\n\n")
```

Ditto for the 5% trimmed means.

In [23]:

```
samppumspop2016[,.(.N, mean_income=round(mean(income,trim=.05))),.(treat)][order(treat)]
matchsamp[,.(.N, mean_income=round(mean(income,trim=.05))),.(treat)][order(treat)]
cat("\n\n")
```

Perform a permutation assessment of the difference between the mean incomes of the matched treatment and control groups. The actual difference of more than $12,000 dwarfs what would be expected by chance. So, even "controlling" for the confounders age, sex, marital status, and race, there's a significant income difference between those with a terminal master's degree and those with a terminal bachelor's.

In [24]:

```
ptmpre <- proc.time()
nr <- nrow(matchsamp)
ntreat <- nr/2
nctl <- nr-ntreat
nreps <- 10000
hold <- list()
for (i in 1:nreps)
{
    income <- matchsamp[sample(nr), income]
    hold[[i]] <- c(round(mean(income[1:ntreat])), round(mean(income[(nctl+1):nr])))
}
head(hold)
ptmpost <- proc.time()
print(ptmpost-ptmpre)
cat("\n\n")
```

In [25]:

```
f <- function(h) h[1]-h[2]
diff <- data.table(diff=unlist(map(hold,f)))
pal <- brewer.pal(9,"Blues")
titstr <- "Density Plot of the Mean Income Permutation Differences"
xstr <- "Income Difference in $"
ystr <- "Density\n"
ggplot(diff, aes(diff)) +
    geom_density() +
    theme(legend.position="none", plot.background=element_rect(fill=pal[2]),
          panel.background=element_rect(fill=pal[2])) +
    labs(title=titstr, x=xstr, y=ystr)
cat("\n\n")
```
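The permutation logic can be pushed one step further to an explicit empirical p-value: the fraction of shuffled-label differences at least as extreme as the observed one. Here is a self-contained sketch of that step using simulated stand-in incomes (my invented numbers, not the ACS data above):

```r
# Permutation p-value sketch: compare the observed treated-vs-control mean
# income difference to the distribution of differences under random shuffling.
set.seed(4)
n <- 2000
income <- c(rnorm(n/2, 62000, 15000),    # stand-in "treated" incomes
            rnorm(n/2, 50000, 15000))    # stand-in "control" incomes
treat <- rep(1:0, each = n/2)
actual <- mean(income[treat == 1]) - mean(income[treat == 0])
nulls <- replicate(10000, {
    s <- sample(income)                  # shuffle incomes across groups
    mean(s[1:(n/2)]) - mean(s[(n/2 + 1):n])
})
pval <- mean(abs(nulls) >= abs(actual))  # two-sided empirical p-value
pval
```

When the actual difference sits far in the tail of the null distribution, as in the density plot above, the empirical p-value is essentially zero.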

Save the data to R fst files.

In [26]:

```
write_fst(samppumspop2016,"samppumspop2016.fst")
write_fst(matchsamp,"matchsamp.fst")
cat("\n\n")
```

This blog barely skims the surface of causal inference. With typical data set sizes large enough to support matching on a host of confounders, I'd love to see causal models become more pervasive in today's analytics world. These techniques can substantially strengthen the methodology behind inferences drawn from the large observational, database-driven, or otherwise non-randomized data sets that data scientists routinely work with.

Causal modeling algorithms and computational performance will continue to improve. In subsequent blogs, I'll look at more efficient sparse matching, as well as propensity models and other causal algorithms.