8 Matching Methods for Causal Inference Using R. Nearest Neighbor Matching, Optimal Matching, Full Matching, Genetic Matching, Exact Matching, Coarsened Exact Matching, Subclassification, Cardinality Matching

8 Matching Methods for Causal Inference Using R

Matching for causal inference is based on the idea that two groups of subjects can be matched on some or all characteristics to see if certain interventions affect outcomes in one group more than in another.

In this tutorial, we will explore 8 different matching methods for causal inference using R.

  1. Nearest Neighbor Matching
  2. Optimal Matching
  3. Full Matching
  4. Genetic Matching
  5. Exact Matching
  6. Coarsened Exact Matching
  7. Subclassification Matching
  8. Cardinality Matching

Causal inference has well-established packages in R, but not in Python. This tutorial provides an example of using R packages for causal analysis in a Python notebook.

  • If you are a Python user, please follow the steps in the tutorial to set up the environment to run the R packages.
  • If you are an R user, please ignore the python code and use the R code directly. R code are the ones starting with %R or %%R.

Resources for this post:

8 Matching Methods for Causal Inference Using R – GrabNGoInfo.com

Let’s get started!

Step 1: Install and Import Libraries

In step 1, we will install and import libraries.

Firstly, let’s install dowhy for dataset creation.

# Install dowhy
!pip install dowhy

Then, we activate the R magic command in the Google Colab notebook using rpy2 and install the R packages MatchingrgenoudMatchItoptmatch, and Rglpk. This tutorial uses both Python and R in the same Colab notebook. For details on using R with Colab, please check out my tutorial How to Use R with Google Colab Notebook.

# Need to run this before installing the Rglpk package
!sudo apt-get install libglpk-dev 

# activate R magic
%load_ext rpy2.ipython

# Install R package Matching
%R install.packages('Matching') # For generic matching

# Install R package rgenoud
%R install.packages('rgenoud') # For generic matching

# Install R package MatchIt
%R install.packages('MatchIt')

# Install R package optmatch
%R install.packages('optmatch') # for optimal matching

# Install R package Rglpk
%R install.packages('Rglpk', dependencies=TRUE) # for cardinality matching

After the installation is completed, we can import the libraries.

  • The datasets is imported from dowhy for dataset creation.
  • pandas and numpy are imported for data processing.
  • Packages from rpy2 are imported for Python to R conversion.
  • Then we imported the R libraries for different matching methods.
# Package to create synthetic data for causal inference
from dowhy import datasets

# Data processing
import pandas as pd
import numpy as np

# Python to R conversion
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

# Import R libraries
Matching = importr('Matching')
rgenoud = importr('rgenoud')
MatchIt = importr('MatchIt')
optmatch = importr('optmatch')
Rglpk = importr('Rglpk')

# Read R libraries
%R library(Matching) # For genetic matching
%R library(rgenoud) # For genetic matching
%R library(MatchIt)
%R library(optmatch) # For optimal matching
%R library(Rglpk) # For cardinality matching

Step 2:Create Dataset

In step 2, we will create a synthetic dataset for the causal inference.

  • Firstly, we set a random seed using np.random.seed to make the dataset reproducible.
  • Then a dataset with the true causal impact of 10, four confounders, 10,000 samples, a binary treatment variable, and a continuous outcome variable is created.
  • After that, we created a dataframe for the data. In the dataframe, the columns W0, W1, W2, and W3 are the four confounders, v0 is the treatment indicator, and y is the outcome.
# Set random seed
np.random.seed(42)

# Create a synthetic dataset
data = datasets.linear_dataset(
    beta=10,
    num_common_causes=4,
    num_samples=10000,
    treatment_is_binary=True,
    outcome_is_binary=False)

# Create Dataframe
df = data['df']

# Take a look at the data
df.head()
Causal Inference Dataset – GrabNGoInfo.com

Next, let’s rename v0 to treatment, rename y to outcome, and convert the boolean values to 0s and 1s.

# Rename columns
df = df.rename({'v0': 'treatment', 'y': 'outcome'}, axis=1)

# Create the treatment variable, and change boolean values to 1 and 0
df['treatment'] = df['treatment'].apply(lambda x: 1 if x == True else 0)

# Take a look at the data
df.head()
Causal Inference Dataset – GrabNGoInfo.com

Finally, the Python dataframe is converted to R dataframe in order to use R packages on the dataset.

# Convert Python dataframe to R dataframe
with localconverter(ro.default_converter + pandas2ri.converter):
  dfr = ro.conversion.py2rpy(df)

# Create a variable name in R's global environment
from rpy2.robjects import globalenv
globalenv['dfr'] = dfr

Step 3: Pre-Matching Standardized Mean Difference of Confounders

In step 3, we will check the standardized mean difference (SMD) between the treatment group and the non-treatment group.

method=NULL means that no matching is conducted. We can see that the output shows the same counts for the Matched as All under Sample Sizes.

%%R
# No Matching
no_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method=NULL, distance = 'glm')

# Show summary of balance
summary(no_match)

Output

Call:
matchit(formula = treatment ~ W0 + W1 + W2 + W3, data = dfr, 
    method = NULL, distance = "glm")

Summary of Balance for All Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.9283        0.2442          4.2211     0.3304    0.4709
W0             -0.0880       -0.7794          0.7118     1.0705    0.1930
W1              1.2066       -0.1808          1.6741     1.4654    0.4058
W2              0.5548        0.1221          0.4368     0.9879    0.1186
W3              0.2208        0.1267          0.0936     1.0771    0.0273
         eCDF Max
distance   0.8150
W0         0.2776
W1         0.6456
W2         0.1737
W3         0.0462


Sample Sizes:
          Control Treated
All          2269    7731
Matched      2269    7731
Unmatched       0       0
Discarded       0       0

Step 4: Nearest Neighbor Matching – Method 1

In step 4, we will talk about the nearest neighbor matching.

Nearest neighbor matching is a type of greedy matching. It matches the nearest control at the moment, and remove the matched control from the rest of the matching.

Nearest neighbor matching is fast but sensitive to the order of samples. It is not optimal for minimizing the total distance because the samples that are matched later in the process can only choose from the contorl units that have not been matched.

We can use method="nearest" in the matchit R package to implement the nearest neighbor matching.

The summary function prints out the matching results. un = FALSE means that we will print out the matched results only. The unmatched results will not be printed out because we already have it from the previous step.

%%R
# Nearest neighbot matching
nearest_neighbor_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="nearest", ratio=1, replace=F, distance = 'glm', caliper=0.2)

# Summary of nearest neighbor matching results
summary(nearest_neighbor_match, un = FALSE)

Output

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Warning:
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]:  Fewer control units than treated units; not all treated units will get a match.

Call:
matchit(formula = treatment ~ W0 + W1 + W2 + W3, data = dfr, 
    method = "nearest", distance = "glm", replace = F, caliper = 0.2, 
    ratio = 1)

Summary of Balance for Matched Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.5881        0.5403          0.2948     1.1726    0.0492
W0             -0.4646       -0.5606          0.0988     1.1593    0.0218
W1              0.4439        0.2544          0.2286     2.7730    0.0476
W2              0.3118        0.2554          0.0569     1.0740    0.0148
W3              0.1524        0.1507          0.0017     1.0204    0.0077
         eCDF Max Std. Pair Dist.
distance   0.0915          0.2951
W0         0.0446          1.1005
W1         0.0857          0.7930
W2         0.0516          1.0996
W3         0.0258          1.0875

Sample Sizes:
          Control Treated
All          2269    7731
Matched       852     852
Unmatched    1417    6879
Discarded       0       0

Step 5: Optimal Matching – Method 2

In step 5, we will talk about the optimal matching.

Optimal matching is also called optimal pair matching. Different from greedy matching, optimal matching minimizes the total distance across all pairs.

When there are not many close matches for the treatment group, optimal matching can be helpful for finding the best matches.

We can use method="optimal" in the matchit R package to implement the nearest neighbor matching.

%%R
# Optimal matching
optimal_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="optimal", ratio=1, replace=F, distance = 'glm')

# Summary of optimal matching results
summary(optimal_match, un = FALSE)

Output

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Warning:
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]:  Fewer control units than treated units; not all treated units will get a match.


Call:
matchit(formula = treatment ~ W0 + W1 + W2 + W3, data = dfr, 
    method = "optimal", distance = "glm", replace = F, ratio = 1)

Summary of Balance for Matched Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.7657        0.2442          3.2175     0.6536    0.1834
W0             -0.4465       -0.7794          0.3427     0.9543    0.0894
W1              0.4642       -0.1808          0.7784     0.6324    0.1669
W2              0.3393        0.1221          0.2192     0.9503    0.0568
W3              0.1724        0.1267          0.0455     1.0725    0.0133
         eCDF Max Std. Pair Dist.
distance   0.6580          3.2176
W0         0.1401          1.0983
W1         0.4121          1.0245
W2         0.0886          1.1258
W3         0.0300          1.1027

Sample Sizes:
          Control Treated
All          2269    7731
Matched      2269    2269
Unmatched       0    5462
Discarded       0       0

Step 6: Full Matching – Method 3

In step 6, we will talk about the full matching.

Full matching is also called optimal full matching.

  • It is called full matching because all the test and control units are assigned to a subclass and are utilized in the matching.
  • It is optimal because the weighted average distances between the treated and control units in each subclass are minimized.

Full matching outputs weights that are computed based on subclasses. The weights can work similar to propensity score weights and be used to estimate a weighted treatment effect.

In the R code, the matchit package method="full" argument implements the full matching.

%%R
# Full matching
full_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="full", ratio=1, replace=F, distance = 'glm', caliper=0.2)

# Summary of full matching results
summary(full_match, un = FALSE)

Output

Call:
matchit(formula = treatment ~ W0 + W1 + W2 + W3, data = dfr, 
    method = "full", distance = "glm", replace = F, caliper = 0.2, 
    ratio = 1)

Summary of Balance for Matched Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.9283        0.9274          0.0060     0.8897    0.0862
W0             -0.0880       -0.1904          0.1054     1.4125    0.0704
W1              1.2066        0.9552          0.3033     2.1055    0.0703
W2              0.5548        0.2860          0.2714     1.1665    0.1229
W3              0.2208        0.5464         -0.3238     1.0508    0.1160
         eCDF Max Std. Pair Dist.
distance   0.4580          0.0175
W0         0.2174          0.9880
W1         0.3642          0.7212
W2         0.2609          1.1358
W3         0.2905          1.0876

Sample Sizes:
              Control Treated
All           2269.      7731
Matched (ESS)    8.73    7731
Matched       2269.      7731
Unmatched        0.         0
Discarded        0.         0

Step 7: Genetic Matching – Method 4

In step 7, we will talk about the genetic matching.

Genetic matching uses a generic search algorithm to find weights for each covariate to achieve optimal balance. The current matching is with replacement and the balance is evaluated using t-tests and Kolmogorov-Smirnov tests.

In the R code, the matchit package method="genetic" argument implements the genetic matching. pop.size controls the run time through the population size for the optimization algorithm.

%%R
# Genetic matching
generic_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="genetic", pop.size=20)

# Summary of genetic matching
summary(generic_match, un = FALSE)

Output

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Warning:
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]:  Fewer control units than treated units; not all treated units will get a match.


Call:
matchit(formula = treatment ~ W0 + W1 + W2 + W3, data = dfr, 
    method = "genetic", pop.size = 20)

Summary of Balance for Matched Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.9999        0.2442          4.6628     0.0000    0.7506
W0              0.3246       -0.7794          1.1367     1.0358    0.3077
W1              2.0227       -0.1808          2.6590     0.8712    0.6452
W2              0.8072        0.1221          0.6914     0.9482    0.1908
W3              0.2993        0.1267          0.1717     1.0941    0.0497
         eCDF Max Std. Pair Dist.
distance   1.0000          4.6628
W0         0.4328          1.1388
W1         0.9246          2.6590
W2         0.2847          0.7268
W3         0.0771          1.1187

Sample Sizes:
          Control Treated
All          2269    7731
Matched      2269    2269
Unmatched       0    5462
Discarded       0       0

Step 8: Exact Matching – Method 5

In step 8, we will talk about the exact matching.

Exact matching mathces each unit in the treatment unit with all the control units with the same values for all the covariates.

In the R code, the matchit package method="exact" argument implements the exact matching. The exact matching matches on all the four covariates in the equation. There are no other options for exact matching.

%%R

# Exact matching
exact_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="exact")

# Summary of exact matching
summary(exact_match, un = FALSE)

We got the RInterpreterError of No exact matches were found because there are no exactly same covariates between any treatment units and control units.

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Error: No exact matches were found.


Error: No exact matches were found.
---------------------------------------------------------------------------
RRuntimeError                             Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/rpy2/ipython/rmagic.py in eval(self, code)
    267                 # Need the newline in case the last line in code is a comment.
--> 268                 value, visible = ro.r("withVisible({%s\n})" % code)
    269             except (ri.embedded.RRuntimeError, ValueError) as exception:

10 frames
RRuntimeError: Error: No exact matches were found.


During handling of the above exception, another exception occurred:

RInterpreterError                         Traceback (most recent call last)
<decorator-gen-119> in R(self, line, cell, local_ns)

/usr/local/lib/python3.7/dist-packages/rpy2/ipython/rmagic.py in eval(self, code)
    271                 warning_or_other_msg = self.flush()
    272                 raise RInterpreterError(code, str(exception),
--> 273                                         warning_or_other_msg)
    274             text_output = self.flush()
    275             return text_output, value, visible[0]

RInterpreterError: Failed to parse and evaluate line '\n# Exact matching\nexact_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="exact")\n\n# Summary of exact matching\nsummary(exact_match, un = FALSE)'.
R error message: 'Error: No exact matches were found.'

Step 9: Coarsened Exact Matching – Method 6

In step 9, we will talk about the coarsened exact matching.

Coarsened Exact Matching (CEM) performs exact matching on the binned covariates. The number of bins impact the matching results.

  • Fewer bins allow the units with very different covariate values to be placed into the same subclass.
  • More bins may cause units to be dropped because of the difficulty in finding matches. A general rule of thumb is that when we have more sample units and not many covariates, the number of bins can be relatively high. If we have a small number of sample units or many covariates, the number of bins should be relatively low.

In the R code, the matchit package method="cem" argument implements the exact matching. The cutpoints option allows the user to specify the number of bins for each covariate.

%%R
# Coarsened exact matching
coarsened_exact_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="cem", cutpoints = list(W0 = 3, W1 = 3, W2 = 2, W3 = 4))

# Summary of coarsened exact matching
summary(coarsened_exact_match, un = FALSE)

Output

Call:
matchit(formula = treatment ~ W0 + W1 + W2 + W3, data = dfr, 
    method = "cem", cutpoints = list(W0 = 3, W1 = 3, W2 = 2, 
        W3 = 4))

Summary of Balance for Matched Data:
   Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
W0       -0.0622       -0.4240          0.3726     1.0282    0.1166   0.2208
W1        1.0008        0.0154          1.1890     1.2883    0.3212   0.6188
W2        0.5251        0.2760          0.2515     0.9595    0.0557   0.1021
W3        0.1943        0.0879          0.1057     0.9368    0.0284   0.0614
   Std. Pair Dist.
W0          0.7960
W1          1.3109
W2          0.6804
W3          0.5993

Sample Sizes:
              Control Treated
All           2269.      7731
Matched (ESS)  257.62    6570
Matched       2041.      6570
Unmatched      228.      1161
Discarded        0.         0

Step 10: Subclassification – Method 7

In step 10, we will talk about the subclassification matching.

Subclassification matching form subclasses with similar distribution between the treatment and the control group.Different subclassification schemes exist, including the matching based on the propensity score. When using propensity score for the subclassification matching, it can be thought of as a coarsened exact matching with the propensity score as the only covariate.

In the R code, the matchit package method="subclass" argument implements the exact matching. The default number of subclasses is 6. The default distance measure is the propensity score from a logistic regression model. method = "subclass" does not work with argument caliper.

%%R
# Subclassification matching
subclassification_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="subclass", ratio=1, replace=F, distance = 'glm')

# Summary of subclassification matching
summary(subclassification_match, un = FALSE)

Output

Call:
matchit(formula = treatment ~ W0 + W1 + W2 + W3, data = dfr, 
    method = "subclass", distance = "glm", replace = F, ratio = 1)

Summary of Balance Across Subclasses
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.9283        0.8554          0.4500     0.2598    0.1143
W0             -0.0880       -0.3030          0.2214     1.3454    0.0653
W1              1.2066        0.8618          0.4161     1.4052    0.0881
W2              0.5548        0.2872          0.2702     1.4618    0.1071
W3              0.2208        0.6779         -0.4546     0.7733    0.1130
         eCDF Max
distance   0.4580
W0         0.2347
W1         0.3651
W2         0.2624
W3         0.2653

Sample Sizes:
              Control Treated
All           2269.      7731
Matched (ESS)   11.59    7731
Matched       2269.      7731
Unmatched        0.         0
Discarded        0.         0

Step 11: Cardinality Matching – Method 8

In step 11, we will talk about cardinality matching.

Cardinality matching selects the largest subset of units that satisfy user-supplied balance constraints (e.g., mean difference) and the ratio between the treated and the control units.

It does not consider the distance between units. It also does not assign units to pairs, or assign units to subclasses.

Subset selection is conducted by solving a mixed integer programming optimization problem with linear constraints. It maximizes the size of the matched sample subject to the constraints on balance and sample size.

In the R code, the matchit package method="cardinality" argument implements the exact matching. The option tols controls the balance tolerances.

%%R
# Cardinality matching
cardinality_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="cardinality", tols = 1)

# Summary of cardinality matching
summary(cardinality_match, un = FALSE)

Output

Call:
matchit(formula = treatment ~ W0 + W1 + W2 + W3, data = dfr, 
    method = "cardinality", tols = 1)

Summary of Balance for Matched Data:
   Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
W0        0.1915       -0.7794          0.9996     1.0252    0.2714   0.3848
W1        0.6478       -0.1808          0.9998     1.1129    0.2302   0.4279
W2        1.0695        0.1221          0.9562     0.8501    0.2648   0.3834
W3        0.9724        0.1267          0.8412     0.8401    0.2427   0.3640

Sample Sizes:
          Control Treated
All          2269    7731
Matched      2269    2269
Unmatched       0    5462
Discarded       0       0

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.

Recommended Tutorials

References

[1] DoWhy Documentation

[2] rpy2 R and pandas Dataframe Documentation

[3] Stuart EA. Matching methods for causal inference: A review and a look forward. Stat Sci. 2010 Feb 1;25(1):1-21. doi: 10.1214/09-STS313. PMID: 20871802; PMCID: PMC2943670.

[4] R MatchIt Documentation

[5] Matching Methods

[6] Optimal Full Matching Documentation

[7] Daniel Ho; Kosuke Imai; Gary King; and Elizabeth Stuart (2007), “Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference,” Political Analysis 15(3): 199-236, http://gking.harvard.edu/files/abs/matchp-abs.shtml.

[8] Daniel Ho; Kosuke Imai; Gary King; and Elizabeth Stuart (2007b) “Matchit: Nonparametric Preprocessing for Parametric Causal Inference,” Journal of Statistical Software, http://gking.harvard.edu/matchit/. Website: https://r.iq.harvard.edu/docs/matchit/2.4-15/

Leave a Comment

Your email address will not be published. Required fields are marked *