Mixing & Matching in R for Data Science

I've spent time over the last few months attempting to enhance my skills in the statistical sub-field of causal inference.

Overly simplified, causal inference comprises methodologies and techniques that help analysts make the jump from association or correlation to cause and effect. How can one progress from noting a correlation between factors X and Y to soundly concluding that X causes Y? The association between X and Y could arise from X causing Y, from Y causing X, or from both X and Y being influenced by a third factor Z. The field of causal inference offers tools to sort out these possibilities.
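
To make that concrete, here's a tiny simulation I cooked up purely for illustration (invented data, not the census data used later): a lurking Z drives both X and Y, so X and Y correlate even though neither causes the other.

set.seed(123)
z <- rnorm(10000)                 # hidden common cause
x <- z + rnorm(10000)             # X driven by Z, not by Y
y <- z + rnorm(10000)             # Y driven by Z, not by X
cor(x, y)                         # roughly 0.5, yet X does not cause Y
coef(summary(lm(y ~ x + z)))      # X's coefficient collapses once Z is held fixed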

The gold standard design for causal inference is the experiment, where subjects are randomly assigned to the different treatment groups. With randomization, the effects of uncontrolled or confounding factors (the Z's) should, within sampling limitations, be "equal" or "balanced" across treatments or values of X. With the Z's thus "controlled," the analyst can be much more confident that a correlation between X and Y actually indicates causality.
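
A quick simulated sketch of why randomization works: since assignment is a coin flip made independently of any Z -- even a Z the experimenter never measured -- the Z's average out across the arms.

set.seed(42)
n <- 100000
z <- rnorm(n)                    # a confounder the experimenter never measured
treat <- rbinom(n, 1, 0.5)       # coin-flip assignment, independent of z
tapply(z, treat, mean)           # arm means of z are nearly identical by design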

But what about the in-situ data gathering schemes generally seen in the data science world, where data are observational, and confounders are free to roam? What is one to do? The answer: consider causal inference techniques that attempt to statistically mimic the randomized experiment.

Years ago, I routinely deployed analysis of covariance methods that included the confounders directly in the linear models: if X remained a strong predictor of Y after adjusting for the confounders, X was deemed causal. Not a very compelling solution. Over the years, I've also tried propensity scores, which attempt to summarize the confounders in a single score. Then there are matching techniques like those used below. The idea behind matching is to construct a control group that looks like the treatment group on the distribution of confounders. If the matching succeeds, the confounders are balanced and hence other things are essentially equal -- so a strong residual relationship between X and Y is evidence of cause and effect. That is, provided the critical ignorability assumption holds: that all important confounders have been identified a priori.
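
For the curious, a minimal propensity score sketch on made-up data (the variables z1 and z2 and the generating model below are hypothetical, purely for illustration): a logistic regression collapses the confounders into a single treatment probability that can then be matched or stratified on.

set.seed(7)
df <- data.frame(z1 = rnorm(5000), z2 = rbinom(5000, 1, 0.4))
df$treat <- rbinom(5000, 1, plogis(-0.5 + df$z1 + df$z2))    # treatment depends on confounders
psmod <- glm(treat ~ z1 + z2, family = binomial, data = df)  # propensity model
df$pscore <- predict(psmod, type = "response")               # one score summarizing z1 and z2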

While searching the web for causal modeling didactics, I came across a well-assembled, intermediate-level Coursera curriculum, A Crash Course in Causality: Inferring Causal Effects from Observational Data. It turned out to be just what the statistical doctor ordered. I invested a day absorbing many of Professor Jason Roy's accessible lectures, including a full analysis example worked through in R. I then took what I'd learned and applied it to my own data with my own R code.

What follows is an application of matching techniques to data from the American Community Survey census. The question I'm asking is whether sample respondents with a terminal master's degree earn more than those with a terminal bachelor's. The treatment is thus MA, the control BA. The confounders considered are age (MA respondents are older), sex (more MA respondents are female), and marital status (MA respondents are more likely to be married), among others.

I first subset all terminal bachelor's and master's respondents into a data set, then take a random sample of 250,000 of these cases. Finally, I invoke the R MatchIt package for dense, greedy, nearest-neighbor matching of treatment (has a terminal MA) to control (has a terminal BA) on the identified confounders. This process ultimately produces an R data.table with a single matched control for each treatment record. It turns out that MA respondents do indeed have considerably higher average incomes than BAs. But when matched on the confounders, the mean annual income difference shrinks by over $2,500. This makes sense, since, among other things, the MA sample averages several years older than the BA sample.

The technology used is JupyterLab with Microsoft R Open 3.4.4. For the matching work, the MatchIt and tableone packages are deployed.

Load libraries and set the sampling seed and working directory.

In [27]:
options(warn=-1)
options(scipen=10)
options(datatable.print.topn=100)
options(datatable.showProgress=FALSE)

suppressMessages(library(data.table))
suppressMessages(library(pryr))
suppressMessages(library(plyr))
suppressMessages(library(dplyr))
suppressMessages(library(dtplyr))
suppressMessages(library(fst))
suppressMessages(library(purrr))
suppressMessages(library(RColorBrewer))
suppressMessages(library(stringr))
suppressMessages(library(lubridate))
suppressMessages(library(ggplot2))
suppressMessages(library(knitr))
suppressMessages(library(kableExtra))
suppressMessages(library(IRdisplay))
suppressMessages(library(MatchIt))
suppressMessages(library(tableone))
suppressMessages(library(rcbalance))

wdir <- "C:/bigdata/pums/2016/population"
setwd(wdir)

set.seed(321)

cat("\n\n")

Define a few useful functions.

In [2]:
# meta: print a data.table's dimensions and memory footprint, and
# optionally (dict=TRUE) its structure.
meta <- function(dt,dict=FALSE)
{
    print(dim(dt))
    print(object_size(dt))
    if (dict==TRUE) print(str(dt))
    cat("\n\n")
}

cat("\n\n")

In [3]:
# frequenciesdyn: build and evaluate a dynamic data.table query computing the
# frequency and percent distribution of column xstr in the data.table named
# DTstr (both arguments passed as strings).
frequenciesdyn <- function(DTstr, xstr)
{    
        return(eval(parse(text=sprintf('%s[,.(count=.N),.(%s)]', DTstr, xstr))) %>% 
               arrange(desc(count)) %>% mutate(percent=100*count/sum(count)))
} 

cat("\n\n")

In [4]:
# allfreqs: run frequenciesdyn on every column of the data.table named dtn,
# returning a named list of frequency tables.
allfreqs <- function(dtn)
{
    dt <- eval(parse(text=paste("data.table(",dtn,")",sep="")))
    nmes <-  names(dt)
    lst <- map(nmes,function(nm) frequenciesdyn(dtn,nm))
    names(lst) <- nmes
    return(lst)
}
               
cat("\n\n")

Read several attributes from my fst-saved file of the 2016 American Community Survey (ACS) Public Use Microdata Sample (PUMS). The data set consists of over 3.1M records.

In [5]:
cols <- c("serialno","agep","schl","sex","mar", "hisp","rac1p","pincp")
pumspop2016 <- read_fst("pop2016.fst",columns=cols,as.data.table=TRUE)

meta(pumspop2016,dict=TRUE)

cat("\n\n")
[1] 3156487       8
101 MB
Classes 'data.table' and 'data.frame':	3156487 obs. of  8 variables:
 $ serialno: int  64 64 80 107 107 107 107 134 134 180 ...
 $ agep    : int  84 84 78 46 52 20 16 66 68 64 ...
 $ schl    : int  19 21 19 20 18 16 13 16 21 18 ...
 $ sex     : int  2 1 2 1 2 1 2 2 1 1 ...
 $ mar     : int  1 1 3 1 1 5 5 1 1 1 ...
 $ hisp    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ rac1p   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pincp   : int  22510 14100 45800 65000 0 0 0 3900 57000 82000 ...
 - attr(*, ".internal.selfref")=<externalptr> 
NULL




Subset the data to include respondents between 20 and 70 years old, with annual personal income above $100, whose educational attainment is a terminal BA/BS (schl 21) or MA/MS (schl 22).

In [6]:
pumspop2016 <- pumspop2016[agep >= 20 & agep <= 70 & pincp > 100 & (schl == 21 | schl == 22)]

meta(pumspop2016)

cat("\n\n")
[1] 547965      8
17.5 MB




Of the 547,965 qualifiers, 381,225 earned a BA/BS and 166,740 an MA/MS.

In [7]:
frequenciesdyn("pumspop2016","schl")

cat("\n\n")
schl  count  percent
  21 381225 69.57105
  22 166740 30.42895

Create additional attributes, compute a coarsened age attribute, and rename the variables.

In [8]:
invisible(pumspop2016[,`:=` (
    sex = sex-1,                                     # recode 1/2 to 0/1
    education = schl,
    race = ifelse(hisp>1, 3, car::recode(rac1p,"1=1;2=2;6=4;else=5")),  # 1=white, 2=black, 3=hispanic, 4=asian, 5=other
    married = as.integer(mar==1),
    single = as.integer(mar==5),
    agecat = as.integer(round(agep-min(agep))/2),    # coarsen age into two-year bins
    treat = schl - 21,                               # 0 = BA/BS control, 1 = MA/MS treatment
    mar = NULL, schl = NULL, hisp = NULL, rac1p = NULL
)][,
    `:=` (white = as.integer(race==1),
          black = as.integer(race==2),
          asian = as.integer(race==4),
          hispanic = as.integer(race==3))
])

setnames(pumspop2016,c("agep","pincp"), c("age", "income"))
meta(pumspop2016,dict=TRUE)

tail(pumspop2016,10)

cat("\n\n")
[1] 547965     14
37.3 MB
Classes 'data.table' and 'data.frame':	547965 obs. of  14 variables:
 $ serialno : int  134 313 445 589 1543 1938 2420 2444 2811 2811 ...
 $ age      : int  68 35 36 35 66 38 54 28 42 41 ...
 $ sex      : num  0 1 0 0 1 0 0 0 0 1 ...
 $ income   : int  57000 35000 351200 2400 17100 90000 60000 7040 104400 42000 ...
 $ education: int  21 21 21 21 21 21 21 21 21 21 ...
 $ race     : num  1 1 1 1 1 1 1 1 4 4 ...
 $ married  : int  1 0 1 0 0 1 0 0 1 1 ...
 $ single   : int  0 0 0 0 1 0 0 1 0 0 ...
 $ agecat   : int  24 7 8 7 23 9 17 4 11 10 ...
 $ treat    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ white    : int  1 1 1 1 1 1 1 1 0 0 ...
 $ black    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ asian    : int  0 0 0 0 0 0 0 0 1 1 ...
 $ hispanic : int  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr> 
NULL


serialno age sex income education race married single agecat treat white black asian hispanic
 1503467  39   0  20000        22    1       1      0      9     1     1     0     0        0
 1507901  26   0  14000        21    1       0      1      3     0     1     0     0        0
 1511160  55   0 383000        21    1       1      0     17     0     1     0     0        0
 1511495  57   0  38000        22    1       1      0     18     1     1     0     0        0
 1513103  50   0  75000        22    1       1      0     15     1     1     0     0        0
 1513315  67   0  24000        22    1       1      0     23     1     1     0     0        0
 1513570  25   0   7550        21    1       0      1      2     0     1     0     0        0
 1513570  23   1   7500        21    1       0      1      1     0     1     0     0        0
 1513570  23   1   3000        21    1       0      1      1     0     1     0     0        0
 1515965  38   1  83500        22    1       1      0      9     1     1     0     0        0

Tabulate frequencies of the final attributes.

In [9]:
allfreqs('pumspop2016[,!c("serialno","age","income"),with=FALSE]')

cat("\n\n")
$sex
sex  count  percent
  1 289836 52.89316
  0 258129 47.10684

$education
education  count  percent
       21 381225 69.57105
       22 166740 30.42895

$race
race  count   percent
   1 416385 75.987517
   4  45308  8.268411
   3  37703  6.880549
   2  36342  6.632175
   5  12227  2.231347

$married
married  count  percent
      1 351954 64.22928
      0 196011 35.77072

$single
single  count  percent
     0 419525 76.56055
     1 128440 23.43945

$agecat
agecat count   percent
     5 24327 4.4395171
     7 24222 4.4203553
     4 24050 4.3889664
     6 23999 4.3796593
     3 23962 4.3729070
    12 23888 4.3594025
    13 23646 4.3152391
    17 23263 4.2453441
    16 23134 4.2218025
     8 23016 4.2002683
    18 22637 4.1311033
    15 22504 4.1068316
    21 22370 4.0823775
    19 22310 4.0714279
    22 22301 4.0697855
    14 22218 4.0546385
    20 22034 4.0210597
    10 21808 3.9798162
     9 21805 3.9792687
     2 21765 3.9719690
    23 21710 3.9619319
    24 21464 3.9170385
    11 21380 3.9017091
     1 14528 2.6512642
    25  8129 1.4834889
     0  1495 0.2728276

$treat
treat  count  percent
    0 381225 69.57105
    1 166740 30.42895

$white
white  count  percent
    1 416385 75.98752
    0 131580 24.01248

$black
black  count   percent
    0 511623 93.367825
    1  36342  6.632175

$asian
asian  count   percent
    0 502657 91.731589
    1  45308  8.268411

$hispanic
hispanic  count   percent
       0 510262 93.119451
       1  37703  6.880549

Take a random sample of 250,000 of the cases to be used for subsequent matching.

In [10]:
nrr <- nrow(pumspop2016)

howmany <- 250000
samppumspop2016 <- pumspop2016[sample(nrr,howmany)]
meta(samppumspop2016,dict=TRUE)

cat("\n\n")
[1] 250000     14
17 MB
Classes 'data.table' and 'data.frame':	250000 obs. of  14 variables:
 $ serialno : int  201772 839857 781065 1213674 1056696 1248394 961245 1172636 882951 1096151 ...
 $ age      : int  43 42 52 40 28 63 28 25 45 42 ...
 $ sex      : num  1 1 1 0 1 1 1 0 1 1 ...
 $ income   : int  74000 160000 24000 13000 39000 32560 62800 4000 78000 6000 ...
 $ education: int  22 22 21 21 22 22 21 21 21 22 ...
 $ race     : num  1 1 3 1 1 1 1 2 4 1 ...
 $ married  : int  0 0 1 0 1 1 0 0 1 1 ...
 $ single   : int  1 1 0 1 0 0 1 1 0 0 ...
 $ agecat   : int  11 11 16 10 4 21 4 2 12 11 ...
 $ treat    : num  1 1 0 0 1 1 0 0 0 1 ...
 $ white    : int  1 1 0 1 1 1 1 0 0 1 ...
 $ black    : int  0 0 0 0 0 0 0 1 0 0 ...
 $ asian    : int  0 0 0 0 0 0 0 0 1 0 ...
 $ hispanic : int  0 0 1 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr> 
NULL




The "treated" group, those with an MA/MS degree, match up well, population to sample.

In [11]:
frequenciesdyn("pumspop2016","treat")
frequenciesdyn("samppumspop2016","treat")

cat("\n\n")
treat  count  percent
    0 381225 69.57105
    1 166740 30.42895

treat  count percent
    0 174095  69.638
    1  75905  30.362

Consider matching on agecat, sex, married, single, white, black, asian, and hispanic. Take a look at how these compare between treatment and control in the sample. SMDs greater than 0.1 are cause for concern: agecat, married, and single all exceed that threshold, while sex, asian, and hispanic might also benefit from matching.

In [12]:
matchvars <- c("agecat","sex","married","single","white","black","asian","hispanic")

tab1 <- CreateTableOne(vars=matchvars,strata="treat",data=samppumspop2016,test=FALSE)
print(tab1,smd=TRUE)

cat("\n\n")
                      Stratified by treat
                       0             1            SMD   
  n                    174095        75905              
  agecat (mean (sd))    12.22 (7.06) 13.76 (6.55)  0.226
  sex (mean (sd))        0.51 (0.50)  0.56 (0.50)  0.094
  married (mean (sd))    0.62 (0.49)  0.70 (0.46)  0.183
  single (mean (sd))     0.26 (0.44)  0.17 (0.38)  0.214
  white (mean (sd))      0.76 (0.43)  0.76 (0.43)  0.003
  black (mean (sd))      0.06 (0.25)  0.07 (0.25)  0.018
  asian (mean (sd))      0.08 (0.27)  0.09 (0.29)  0.049
  hispanic (mean (sd))   0.07 (0.26)  0.06 (0.23)  0.068
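
For those who'd rather not take tableone's word for it, the SMD is easy to compute by hand. A small sketch -- the smd function below is my own, not part of tableone -- using the pooled-variance form of the standardized mean difference:

# standardized mean difference between treated and control for one covariate
smd <- function(x, treat) {
    m1 <- mean(x[treat == 1]); m0 <- mean(x[treat == 0])
    v1 <- var(x[treat == 1]);  v0 <- var(x[treat == 0])
    abs(m1 - m0)/sqrt((v1 + v0)/2)
}
smd(samppumspop2016$agecat, samppumspop2016$treat)   # should agree with the 0.226 above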


Create the matched data.table on the sampled data using the matchit function. This "dense" matching is computationally expensive, consuming over 70 minutes for the 76,000 treatment records. Performance is roughly O(N^2) in the number of treated records.

In [13]:
ptmpre <- proc.time()

form <- as.formula('treat ~ agecat + sex + married + single + white + black + hispanic + asian')

matchstr <- matchit(form,samppumspop2016,method="nearest")

ptmpost <- proc.time()
print(ptmpost-ptmpre)

cat("\n\n")
   user  system elapsed 
3876.14  461.93 4337.33 


Define a function to convert matchit's final output structure to a data.table.

In [14]:
# mkmatch: convert matchit's output to a matched-pair data.table. The row
# names of match.matrix index the treated records; its first column holds the
# row indices of their nearest-neighbor controls. The match column ties each
# treated record to its control.
mkmatch <- function(matchstr)
{
    wcontrol <- as.integer(as.vector(matchstr$match.matrix[,1]))
    wtreat <- as.integer(as.vector(unlist(dimnames(matchstr$match.matrix)[1])))
    dt <- data.table(matchstr$model$data)
    return(rbind(dt[wtreat][,match:=1:length(wtreat)],dt[wcontrol][,match:=1:length(wcontrol)]))
}

cat("\n\n")

Build the matched data.table with a total of 151,810 records -- 75,905 treated and 75,905 control.

In [15]:
vars <- c("agecat","sex","white","black","asian","hispanic","married","single","income","match","treat")
matchsamp <- mkmatch(matchstr)[,vars,with=FALSE]

meta(matchsamp,dict=TRUE)

cat("\n\n")
[1] 151810     11
7.9 MB
Classes 'data.table' and 'data.frame':	151810 obs. of  11 variables:
 $ agecat  : int  11 11 4 21 11 6 19 25 6 15 ...
 $ sex     : num  1 1 1 1 1 0 1 0 1 1 ...
 $ white   : int  1 1 1 1 1 1 1 1 1 0 ...
 $ black   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ asian   : int  0 0 0 0 0 0 0 0 0 1 ...
 $ hispanic: int  0 0 0 0 0 0 0 0 0 0 ...
 $ married : int  0 0 1 1 1 0 0 0 1 1 ...
 $ single  : int  1 1 0 0 0 1 1 0 0 0 ...
 $ income  : int  74000 160000 39000 32560 6000 93000 80300 50000 55001 60000 ...
 $ match   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ treat   : num  1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr> 
NULL




Examine the SMDs post-match. They look quite good -- the confounders are well balanced between treatment and control.

In [16]:
matchvars <- c("agecat","sex","married","single","white","black","asian","hispanic")

tab2 <- CreateTableOne(vars=matchvars,strata="treat",data=matchsamp,test=FALSE)
print(tab2,smd=TRUE)

cat("\n\n")
                      Stratified by treat
                       0            1            SMD   
  n                    75905        75905              
  agecat (mean (sd))   13.76 (6.55) 13.76 (6.55) <0.001
  sex (mean (sd))       0.56 (0.50)  0.56 (0.50) <0.001
  married (mean (sd))   0.70 (0.46)  0.70 (0.46) <0.001
  single (mean (sd))    0.17 (0.38)  0.17 (0.38) <0.001
  white (mean (sd))     0.76 (0.43)  0.76 (0.43) <0.001
  black (mean (sd))     0.07 (0.25)  0.07 (0.25) <0.001
  asian (mean (sd))     0.09 (0.29)  0.09 (0.29) <0.001
  hispanic (mean (sd))  0.06 (0.23)  0.06 (0.23) <0.001


Split matchsamp into treatment and control.

In [17]:
splitmatch <- split(matchsamp,matchsamp[,treat])

head(splitmatch[[1]])
head(splitmatch[[2]])

cat("\n\n")
agecat sex white black asian hispanic married single income match treat
    11   1     1     0     0        0       0      1  72000     1     0
    11   1     1     0     0        0       0      1  80000     2     0
     4   1     1     0     0        0       1      0  47500     3     0
    21   1     1     0     0        0       1      0  45000     4     0
    11   1     1     0     0        0       1      0 100000     5     0
     6   0     1     0     0        0       0      1 109000     6     0

agecat sex white black asian hispanic married single income match treat
    11   1     1     0     0        0       0      1  74000     1     1
    11   1     1     0     0        0       0      1 160000     2     1
     4   1     1     0     0        0       1      0  39000     3     1
    21   1     1     0     0        0       1      0  32560     4     1
    11   1     1     0     0        0       1      0   6000     5     1
     6   0     1     0     0        0       0      1  93000     6     1

In [18]:
head(cbind(splitmatch[[1]],splitmatch[[2]]),10)

cat("\n\n")
agecat sex white black asian hispanic married single income match ... sex white black asian hispanic married single income match treat
    11   1     1     0     0        0       0      1  72000     1 ...   1     1     0     0        0       0      1  74000     1     1
    11   1     1     0     0        0       0      1  80000     2 ...   1     1     0     0        0       0      1 160000     2     1
     4   1     1     0     0        0       1      0  47500     3 ...   1     1     0     0        0       1      0  39000     3     1
    21   1     1     0     0        0       1      0  45000     4 ...   1     1     0     0        0       1      0  32560     4     1
    11   1     1     0     0        0       1      0 100000     5 ...   1     1     0     0        0       1      0   6000     5     1
     6   0     1     0     0        0       0      1 109000     6 ...   0     1     0     0        0       0      1  93000     6     1
    19   1     1     0     0        0       0      1   6500     7 ...   1     1     0     0        0       0      1  80300     7     1
    25   0     1     0     0        0       0      0   2500     8 ...   0     1     0     0        0       0      0  50000     8     1
     6   1     1     0     0        0       1      0  35050     9 ...   1     1     0     0        0       1      0  55001     9     1
    15   1     0     0     1        0       1      0 105000    10 ...   1     0     0     1        0       1      0  60000    10     1

Further evidence that treatment and control match well: 99.9% of treatment records have a control match that is identical on all the confounders.

In [19]:
nrow(splitmatch[[1]])
cat("\n")

sum(splitmatch[[1]]$agecat==splitmatch[[2]]$agecat)
sum(splitmatch[[1]]$sex==splitmatch[[2]]$sex)
sum(splitmatch[[1]]$married==splitmatch[[2]]$married)
sum(splitmatch[[1]]$single==splitmatch[[2]]$single)
sum(splitmatch[[1]]$white==splitmatch[[2]]$white)
sum(splitmatch[[1]]$black==splitmatch[[2]]$black)
sum(splitmatch[[1]]$hispanic==splitmatch[[2]]$hispanic)
sum(splitmatch[[1]]$asian==splitmatch[[2]]$asian)

cat("\n\n")
75905

75871
75889
75904
75890
75895
75882
75900
75897

In [20]:
sum(splitmatch[[1]]$agecat==splitmatch[[2]]$agecat &
    splitmatch[[1]]$sex==splitmatch[[2]]$sex &
    splitmatch[[1]]$married==splitmatch[[2]]$married &
    splitmatch[[1]]$single==splitmatch[[2]]$single &
    splitmatch[[1]]$white==splitmatch[[2]]$white &
    splitmatch[[1]]$black==splitmatch[[2]]$black &
    splitmatch[[1]]$hispanic==splitmatch[[2]]$hispanic &
    splitmatch[[1]]$asian==splitmatch[[2]]$asian
   )

cat("\n\n")
75871

Notice that the (rounded) mean age category of the matched controls equals that of the matched treated.

In [21]:
samppumspop2016[,.(.N,  mean_agecat=round(mean(agecat))),.(treat)][order(treat)]
matchsamp[,.(.N, mean_agecat=round(mean(agecat))),.(treat)][order(treat)]

cat("\n\n")
treat      N mean_agecat
    0 174095          12
    1  75905          14

treat     N mean_agecat
    0 75905          14
    1 75905          14

The mean income of the matched controls is over $2,500 more than that of the unmatched sample controls. This difference can be interpreted as the bias caused by the imbalance in age and other confounders in the unmatched sample. That said, it certainly still pays to get an MA/MS degree.

In [22]:
samppumspop2016[,.(.N, mean_income=round(mean(income))),.(treat)][order(treat)]
matchsamp[,.(.N, mean_income=round(mean(income))),.(treat)][order(treat)]

cat("\n\n")
treat      N mean_income
    0 174095       67047
    1  75905       84030

treat     N mean_income
    0 75905       70015
    1 75905       84030

Ditto for the 5% trimmed means.

In [23]:
samppumspop2016[,.(.N, mean_income=round(mean(income,trim=.05))),.(treat)][order(treat)]
matchsamp[,.(.N, mean_income=round(mean(income,trim=.05))),.(treat)][order(treat)]

cat("\n\n")
treat      N mean_income
    0 174095       56834
    1  75905       72328

treat     N mean_income
    0 75905       59518
    1 75905       72328
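
Since matching produces explicit pairs via the match column, a paired comparison is a natural quick check before the permutation exercise that follows. A sketch:

# within-pair MA-minus-BA income differences, aligned by the match index
pairdiff <- matchsamp[treat == 1][order(match), income] -
            matchsamp[treat == 0][order(match), income]
t.test(pairdiff)       # mean of the pair differences and its confidence interval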

Perform a permutation assessment of the difference between the mean incomes of the matched treatment and control groups. The actual difference of more than $12,000 blows away what would be expected by chance. So even "controlling" for the confounders age, sex, marital status, and race, there's a significant difference in income between those with a terminal master's degree and those with a terminal bachelor's.

In [24]:
ptmpre <- proc.time()

nr <- nrow(matchsamp)
ntreat <- nr/2
nctl <- nr-ntreat

# permutation null: shuffle income across all matched records, then split the
# shuffled vector into two equal pseudo-groups and record their mean incomes
nreps <- 10000
hold <- list()

for (i in 1:nreps)
{
    income <- matchsamp[sample(nr),income]
    hold[[i]] <- c(round(mean(income[1:ntreat])), round(mean(income[(nctl+1):nr])))
}

head(hold)

ptmpost <- proc.time()
print(ptmpost-ptmpre)

cat("\n\n")
[[1]] 76700 77345
[[2]] 77121 76923
[[3]] 76930 77114
[[4]] 76888 77156
[[5]] 76954 77090
[[6]] 77269 76776
   user  system elapsed 
  55.70   10.93   66.69 


In [25]:
# treated-minus-control mean difference for each permuted replicate
f <- function(h) h[1]-h[2]
diff <- data.table(diff=unlist(map(hold,f)))

pal <- brewer.pal(9,"Blues")

titstr <- "Density Plot of the Mean Income Permutation Differences"
xstr <- "Income Difference in $"
ystr <- "Density\n"

ggplot(diff, aes(diff)) +
  geom_density() +
  theme(legend.position = "none", plot.background = element_rect(fill = pal[2]), 
  panel.background = element_rect(fill = pal[2])) +
  labs(title=titstr,x=xstr,y=ystr)

cat("\n\n")

Save the data to R fst files.

In [26]:
write_fst(samppumspop2016,"samppumspop2016.fst")
write_fst(matchsamp,"matchsamp.fst")

cat("\n\n")

This blog barely skims the surface of causal inference. With typical data set sizes now large enough to support matching on a host of confounders, I'd love to see causal models become more pervasive in today's analytics world. These techniques could substantially strengthen the methodology behind inferences drawn from the large observational, database-driven, or otherwise non-randomized data sets that data scientists routinely work with.

Causal modeling algorithms and computational performance will continue to improve. In subsequent blogs, I'll take a look at more efficient sparse matching, as well as propensity score models and other causal algorithms.