Matching for causal inference is based on the idea that two groups of subjects can be matched on some or all characteristics to see if certain interventions affect outcomes in one group more than in another.

In this tutorial, we will explore 8 different matching methods for causal inference using R.

- Nearest Neighbor Matching
- Optimal Matching
- Full Matching
- Genetic Matching
- Exact Matching
- Coarsened Exact Matching
- Subclassification Matching
- Cardinality Matching

Causal inference has well-established packages in R, but not in Python. This tutorial provides an example of using R packages for causal analysis in a Python notebook.

- If you are a Python user, please follow the steps in the tutorial to set up the environment to run the R packages.

- If you are an R user, please ignore the python code and use the R code directly. R code are the ones starting with %R or %%R.

**Resources for this post:**

- Click here for the Colab notebook with code.
- More video tutorials on Causal Inference
- More blog posts on Causal Inference
- Video version of the tutorial on YouTube

Let’s get started!

### Step 1: Install and Import Libraries

In step 1, we will install and import libraries.

Firstly, let’s install dowhy for dataset creation.

# Install dowhy !pip install dowhy

Then, we activate the R magic command in the Google Colab notebook using `rpy2`

and install the R packages `Matching`

, `rgenoud`

, `MatchIt`

, `optmatch`

, and `Rglpk`

. This tutorial uses both Python and R in the same Colab notebook. For details on using R with Colab, please check out my tutorial How to Use R with Google Colab Notebook.

# Need to run this before installing the Rglpk package !sudo apt-get install libglpk-dev # activate R magic %load_ext rpy2.ipython # Install R package Matching %R install.packages('Matching') # For generic matching # Install R package rgenoud %R install.packages('rgenoud') # For generic matching # Install R package MatchIt %R install.packages('MatchIt') # Install R package optmatch %R install.packages('optmatch') # for optimal matching # Install R package Rglpk %R install.packages('Rglpk', dependencies=TRUE) # for cardinality matching

After the installation is completed, we can import the libraries.

- The
`datasets`

is imported from`dowhy`

for dataset creation. `pandas`

and`numpy`

are imported for data processing.- Packages from
`rpy2`

are imported for Python to R conversion. - Then we imported the R libraries for different matching methods.

# Package to create synthetic data for causal inference from dowhy import datasets # Data processing import pandas as pd import numpy as np # Python to R conversion import rpy2.robjects as ro from rpy2.robjects.packages import importr from rpy2.robjects import pandas2ri from rpy2.robjects.conversion import localconverter # Import R libraries Matching = importr('Matching') rgenoud = importr('rgenoud') MatchIt = importr('MatchIt') optmatch = importr('optmatch') Rglpk = importr('Rglpk') # Read R libraries %R library(Matching) # For genetic matching %R library(rgenoud) # For genetic matching %R library(MatchIt) %R library(optmatch) # For optimal matching %R library(Rglpk) # For cardinality matching

### Step 2ï¼šCreate Dataset

In step 2, we will create a synthetic dataset for the causal inference.

- Firstly, we set a random seed using
`np.random.seed`

to make the dataset reproducible. - Then a dataset with the true causal impact of 10, four confounders, 10,000 samples, a binary treatment variable, and a continuous outcome variable is created.
- After that, we created a dataframe for the data. In the dataframe, the columns W0, W1, W2, and W3 are the four confounders, v0 is the treatment indicator, and y is the outcome.

# Set random seed np.random.seed(42) # Create a synthetic dataset data = datasets.linear_dataset( beta=10, num_common_causes=4, num_samples=10000, treatment_is_binary=True, outcome_is_binary=False) # Create Dataframe df = data['df'] # Take a look at the data df.head()

Next, let’s rename `v0`

to `treatment`

, rename `y`

to `outcome`

, and convert the boolean values to 0s and 1s.

# Rename columns df = df.rename({'v0': 'treatment', 'y': 'outcome'}, axis=1) # Create the treatment variable, and change boolean values to 1 and 0 df['treatment'] = df['treatment'].apply(lambda x: 1 if x == True else 0) # Take a look at the data df.head()

Finally, the Python dataframe is converted to R dataframe in order to use R packages on the dataset.

# Convert Python dataframe to R dataframe with localconverter(ro.default_converter + pandas2ri.converter): dfr = ro.conversion.py2rpy(df) # Create a variable name in R's global environment from rpy2.robjects import globalenv globalenv['dfr'] = dfr

### Step 3: Pre-Matching Standardized Mean Difference of Confounders

In step 3, we will check the standardized mean difference (SMD) between the treatment group and the non-treatment group.

`method=NULL`

means that no matching is conducted. We can see that the output shows the same counts for the `Matched`

as `All`

under Sample Sizes.

%%R # No Matching no_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method=NULL, distance = 'glm') # Show summary of balance summary(no_match)

Output

Call: matchit(formula = treatment ~ W0 + W1 + W2 + W3, data = dfr, method = NULL, distance = "glm") Summary of Balance for All Data: Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean distance 0.9283 0.2442 4.2211 0.3304 0.4709 W0 -0.0880 -0.7794 0.7118 1.0705 0.1930 W1 1.2066 -0.1808 1.6741 1.4654 0.4058 W2 0.5548 0.1221 0.4368 0.9879 0.1186 W3 0.2208 0.1267 0.0936 1.0771 0.0273 eCDF Max distance 0.8150 W0 0.2776 W1 0.6456 W2 0.1737 W3 0.0462 Sample Sizes: Control Treated All 2269 7731 Matched 2269 7731 Unmatched 0 0 Discarded 0 0

### Step 4: Nearest Neighbor Matching – Method 1

In step 4, we will talk about the nearest neighbor matching.

Nearest neighbor matching is a type of greedy matching. It matches the nearest control at the moment, and remove the matched control from the rest of the matching.

Nearest neighbor matching is fast but sensitive to the order of samples. It is not optimal for minimizing the total distance because the samples that are matched later in the process can only choose from the contorl units that have not been matched.

We can use `method="nearest"`

in the `matchit`

R package to implement the nearest neighbor matching.

The summary function prints out the matching results. `un = FALSE`

means that we will print out the matched results only. The unmatched results will not be printed out because we already have it from the previous step.

%%R # Nearest neighbot matching nearest_neighbor_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="nearest", ratio=1, replace=F, distance = 'glm', caliper=0.2) # Summary of nearest neighbor matching results summary(nearest_neighbor_match, un = FALSE)

Output

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Warning: WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Fewer control units than treated units; not all treated units will get a match. Call: matchit(formula = treatment ~ W0 + W1 + W2 + W3, data = dfr, method = "nearest", distance = "glm", replace = F, caliper = 0.2, ratio = 1) Summary of Balance for Matched Data: Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean distance 0.5881 0.5403 0.2948 1.1726 0.0492 W0 -0.4646 -0.5606 0.0988 1.1593 0.0218 W1 0.4439 0.2544 0.2286 2.7730 0.0476 W2 0.3118 0.2554 0.0569 1.0740 0.0148 W3 0.1524 0.1507 0.0017 1.0204 0.0077 eCDF Max Std. Pair Dist. distance 0.0915 0.2951 W0 0.0446 1.1005 W1 0.0857 0.7930 W2 0.0516 1.0996 W3 0.0258 1.0875 Sample Sizes: Control Treated All 2269 7731 Matched 852 852 Unmatched 1417 6879 Discarded 0 0

### Step 5: Optimal Matching – Method 2

In step 5, we will talk about the optimal matching.

Optimal matching is also called optimal pair matching. Different from greedy matching, optimal matching minimizes the total distance across all pairs.

When there are not many close matches for the treatment group, optimal matching can be helpful for finding the best matches.

We can use `method="optimal"`

in the `matchit`

R package to implement the nearest neighbor matching.

%%R # Optimal matching optimal_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="optimal", ratio=1, replace=F, distance = 'glm') # Summary of optimal matching results summary(optimal_match, un = FALSE)

Output

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Warning: WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Fewer control units than treated units; not all treated units will get a match. Call: matchit(formula = treatment ~ W0 + W1 + W2 + W3, data = dfr, method = "optimal", distance = "glm", replace = F, ratio = 1) Summary of Balance for Matched Data: Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean distance 0.7657 0.2442 3.2175 0.6536 0.1834 W0 -0.4465 -0.7794 0.3427 0.9543 0.0894 W1 0.4642 -0.1808 0.7784 0.6324 0.1669 W2 0.3393 0.1221 0.2192 0.9503 0.0568 W3 0.1724 0.1267 0.0455 1.0725 0.0133 eCDF Max Std. Pair Dist. distance 0.6580 3.2176 W0 0.1401 1.0983 W1 0.4121 1.0245 W2 0.0886 1.1258 W3 0.0300 1.1027 Sample Sizes: Control Treated All 2269 7731 Matched 2269 2269 Unmatched 0 5462 Discarded 0 0

### Step 6: Full Matching – Method 3

In step 6, we will talk about the full matching.

Full matching is also called optimal full matching.

- It is called full matching because all the test and control units are assigned to a subclass and are utilized in the matching.
- It is optimal because the weighted average distances between the treated and control units in each subclass are minimized.

Full matching outputs weights that are computed based on subclasses. The weights can work similar to propensity score weights and be used to estimate a weighted treatment effect.

In the R code, the `matchit`

package `method="full"`

argument implements the full matching.

%%R # Full matching full_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="full", ratio=1, replace=F, distance = 'glm', caliper=0.2) # Summary of full matching results summary(full_match, un = FALSE)

Output

Call: matchit(formula = treatment ~ W0 + W1 + W2 + W3, data = dfr, method = "full", distance = "glm", replace = F, caliper = 0.2, ratio = 1) Summary of Balance for Matched Data: Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean distance 0.9283 0.9274 0.0060 0.8897 0.0862 W0 -0.0880 -0.1904 0.1054 1.4125 0.0704 W1 1.2066 0.9552 0.3033 2.1055 0.0703 W2 0.5548 0.2860 0.2714 1.1665 0.1229 W3 0.2208 0.5464 -0.3238 1.0508 0.1160 eCDF Max Std. Pair Dist. distance 0.4580 0.0175 W0 0.2174 0.9880 W1 0.3642 0.7212 W2 0.2609 1.1358 W3 0.2905 1.0876 Sample Sizes: Control Treated All 2269. 7731 Matched (ESS) 8.73 7731 Matched 2269. 7731 Unmatched 0. 0 Discarded 0. 0

### Step 7: Genetic Matching – Method 4

In step 7, we will talk about the genetic matching.

Genetic matching uses a generic search algorithm to find weights for each covariate to achieve optimal balance. The current matching is with replacement and the balance is evaluated using t-tests and Kolmogorov-Smirnov tests.

In the R code, the `matchit`

package `method="genetic"`

argument implements the genetic matching. `pop.size`

controls the run time through the population size for the optimization algorithm.

%%R # Genetic matching generic_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="genetic", pop.size=20) # Summary of genetic matching summary(generic_match, un = FALSE)

Output

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Warning: WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Fewer control units than treated units; not all treated units will get a match. Call: matchit(formula = treatment ~ W0 + W1 + W2 + W3, data = dfr, method = "genetic", pop.size = 20) Summary of Balance for Matched Data: Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean distance 0.9999 0.2442 4.6628 0.0000 0.7506 W0 0.3246 -0.7794 1.1367 1.0358 0.3077 W1 2.0227 -0.1808 2.6590 0.8712 0.6452 W2 0.8072 0.1221 0.6914 0.9482 0.1908 W3 0.2993 0.1267 0.1717 1.0941 0.0497 eCDF Max Std. Pair Dist. distance 1.0000 4.6628 W0 0.4328 1.1388 W1 0.9246 2.6590 W2 0.2847 0.7268 W3 0.0771 1.1187 Sample Sizes: Control Treated All 2269 7731 Matched 2269 2269 Unmatched 0 5462 Discarded 0 0

### Step 8: Exact Matching – Method 5

In step 8, we will talk about the exact matching.

Exact matching mathces each unit in the treatment unit with all the control units with the same values for all the covariates.

In the R code, the `matchit`

package `method="exact"`

argument implements the exact matching. The exact matching matches on all the four covariates in the equation. There are no other options for exact matching.

%%R # Exact matching exact_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="exact") # Summary of exact matching summary(exact_match, un = FALSE)

We got the `RInterpreterError`

of `No exact matches were found`

because there are no exactly same covariates between any treatment units and control units.

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Error: No exact matches were found. Error: No exact matches were found. --------------------------------------------------------------------------- RRuntimeError Traceback (most recent call last) /usr/local/lib/python3.7/dist-packages/rpy2/ipython/rmagic.py in eval(self, code) 267 # Need the newline in case the last line in code is a comment. --> 268 value, visible = ro.r("withVisible({%s\n})" % code) 269 except (ri.embedded.RRuntimeError, ValueError) as exception: 10 frames RRuntimeError: Error: No exact matches were found. During handling of the above exception, another exception occurred: RInterpreterError Traceback (most recent call last) <decorator-gen-119> in R(self, line, cell, local_ns) /usr/local/lib/python3.7/dist-packages/rpy2/ipython/rmagic.py in eval(self, code) 271 warning_or_other_msg = self.flush() 272 raise RInterpreterError(code, str(exception), --> 273 warning_or_other_msg) 274 text_output = self.flush() 275 return text_output, value, visible[0] RInterpreterError: Failed to parse and evaluate line '\n# Exact matching\nexact_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="exact")\n\n# Summary of exact matching\nsummary(exact_match, un = FALSE)'. R error message: 'Error: No exact matches were found.'

### Step 9: Coarsened Exact Matching – Method 6

In step 9, we will talk about the coarsened exact matching.

Coarsened Exact Matching (CEM) performs exact matching on the binned covariates. The number of bins impact the matching results.

- Fewer bins allow the units with very different covariate values to be placed into the same subclass.
- More bins may cause units to be dropped because of the difficulty in finding matches. A general rule of thumb is that when we have more sample units and not many covariates, the number of bins can be relatively high. If we have a small number of sample units or many covariates, the number of bins should be relatively low.

In the R code, the `matchit`

package `method="cem"`

argument implements the exact matching. The `cutpoints`

option allows the user to specify the number of bins for each covariate.

%%R # Coarsened exact matching coarsened_exact_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="cem", cutpoints = list(W0 = 3, W1 = 3, W2 = 2, W3 = 4)) # Summary of coarsened exact matching summary(coarsened_exact_match, un = FALSE)

Output

Call: matchit(formula = treatment ~ W0 + W1 + W2 + W3, data = dfr, method = "cem", cutpoints = list(W0 = 3, W1 = 3, W2 = 2, W3 = 4)) Summary of Balance for Matched Data: Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max W0 -0.0622 -0.4240 0.3726 1.0282 0.1166 0.2208 W1 1.0008 0.0154 1.1890 1.2883 0.3212 0.6188 W2 0.5251 0.2760 0.2515 0.9595 0.0557 0.1021 W3 0.1943 0.0879 0.1057 0.9368 0.0284 0.0614 Std. Pair Dist. W0 0.7960 W1 1.3109 W2 0.6804 W3 0.5993 Sample Sizes: Control Treated All 2269. 7731 Matched (ESS) 257.62 6570 Matched 2041. 6570 Unmatched 228. 1161 Discarded 0. 0

### Step 10: Subclassification – Method 7

In step 10, we will talk about the subclassification matching.

Subclassification matching form subclasses with similar distribution between the treatment and the control group.Different subclassification schemes exist, including the matching based on the propensity score. When using propensity score for the subclassification matching, it can be thought of as a coarsened exact matching with the propensity score as the only covariate.

In the R code, the `matchit`

package `method="subclass"`

argument implements the exact matching. The default number of subclasses is 6. The default distance measure is the propensity score from a logistic regression model. `method = "subclass"`

does not work with argument `caliper`

.

%%R # Subclassification matching subclassification_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="subclass", ratio=1, replace=F, distance = 'glm') # Summary of subclassification matching summary(subclassification_match, un = FALSE)

Output

Call: matchit(formula = treatment ~ W0 + W1 + W2 + W3, data = dfr, method = "subclass", distance = "glm", replace = F, ratio = 1) Summary of Balance Across Subclasses Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean distance 0.9283 0.8554 0.4500 0.2598 0.1143 W0 -0.0880 -0.3030 0.2214 1.3454 0.0653 W1 1.2066 0.8618 0.4161 1.4052 0.0881 W2 0.5548 0.2872 0.2702 1.4618 0.1071 W3 0.2208 0.6779 -0.4546 0.7733 0.1130 eCDF Max distance 0.4580 W0 0.2347 W1 0.3651 W2 0.2624 W3 0.2653 Sample Sizes: Control Treated All 2269. 7731 Matched (ESS) 11.59 7731 Matched 2269. 7731 Unmatched 0. 0 Discarded 0. 0

### Step 11: Cardinality Matching – Method 8

In step 11, we will talk about cardinality matching.

Cardinality matching selects the largest subset of units that satisfy user-supplied balance constraints (e.g., mean difference) and the ratio between the treated and the control units.

It does not consider the distance between units. It also does not assign units to pairs, or assign units to subclasses.

Subset selection is conducted by solving a mixed integer programming optimization problem with linear constraints. It maximizes the size of the matched sample subject to the constraints on balance and sample size.

In the R code, the `matchit`

package `method="cardinality"`

argument implements the exact matching. The option `tols`

controls the balance tolerances.

%%R # Cardinality matching cardinality_match <- matchit(treatment ~ W0 + W1 + W2 + W3, data=dfr, method="cardinality", tols = 1) # Summary of cardinality matching summary(cardinality_match, un = FALSE)

Output

Call: matchit(formula = treatment ~ W0 + W1 + W2 + W3, data = dfr, method = "cardinality", tols = 1) Summary of Balance for Matched Data: Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max W0 0.1915 -0.7794 0.9996 1.0252 0.2714 0.3848 W1 0.6478 -0.1808 0.9998 1.1129 0.2302 0.4279 W2 1.0695 0.1221 0.9562 0.8501 0.2648 0.3834 W3 0.9724 0.1267 0.8412 0.8401 0.2427 0.3640 Sample Sizes: Control Treated All 2269 7731 Matched 2269 2269 Unmatched 0 5462 Discarded 0 0

For more information about data science and machine learning, please check out myÂ YouTube channelÂ andÂ Medium PageÂ or follow me onÂ LinkedIn.

### Recommended Tutorials

- GrabNGoInfo Machine Learning Tutorials Inventory
- One-Class SVM For Anomaly Detection
- Multivariate Time Series Forecasting with Seasonality and Holiday Effect Using Prophet in Python
- Hyperparameter Tuning For XGBoost
- Recommendation System: User-Based Collaborative Filtering
- Four Oversampling And Under-Sampling Methods For Imbalanced Classification Using Python
- How to detect outliers | Data Science Interview Questions and Answers

### References

[2] rpy2 R and pandas Dataframe Documentation

[3] Stuart EA. Matching methods for causal inference: A review and a look forward. Stat Sci. 2010 Feb 1;25(1):1-21. doi: 10.1214/09-STS313. PMID: 20871802; PMCID: PMC2943670.

[5] Matching Methods

[6] Optimal Full Matching Documentation

[7] Daniel Ho; Kosuke Imai; Gary King; and Elizabeth Stuart (2007), “Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference,” Political Analysis 15(3): 199-236, http://gking.harvard.edu/files/abs/matchp-abs.shtml.

[8] Daniel Ho; Kosuke Imai; Gary King; and Elizabeth Stuart (2007b) “Matchit: Nonparametric Preprocessing for Parametric Causal Inference,” Journal of Statistical Software, http://gking.harvard.edu/matchit/. Website: https://r.iq.harvard.edu/docs/matchit/2.4-15/