Causal Inference One-to-one Matching on Confounders Using Python and R. Using R Matching package for causal inference with Mahalanobis Distance Matching (MDM) in Google Colab notebook.

Causal Inference One-to-one Matching on Confounders Using Python and R

Causal inference is the process of determining the effect of a treatment. The causal impact can be evaluated by randomized experiments or observational studies.

  • A randomized experiment randomly separates the samples into the treatment group and the control group. The causal impact can be calculated by getting the difference between the treatment group and the control group.
  • When only the observational data is available, we can use causal inference algorithms to calculate the causal impact. Such algorithms include but are not limited to difference-in-difference, matching, Inverse Probability Treatment Weighting (IPTW), and counterfactual modeling.

In this tutorial, we will talk about how to use Mahalanobis Distance Matching (MDM) for causal inference using the R package Matching. You will learn:

  • How do the matching algorithms work?
  • How to do Mahalanobis Distance Matching (MDM) using R?
  • How to do causal inference using matched data?
  • How much improvement in accuracy is made by causal inference?

Resources for this post:

  • If you prefer the video version of the tutorial, watch the video below on YouTube.
Causal Inference One-to-one Matching on Confounders – GrabNGoInfo.com

Let’s get started!

Step 0: Matching Algorithms

Matching takes a sample in the treatment group and finds a similar sample in the non-treatment group. The goal of matching is to create the treatment group and the control group that are comparable.

The matching process involves the following steps:

  1. Compute a distance between the subject in the treatment group and the samples in the control group. The distance can be calculated using the confounders or using the propensity scores. In this tutorial, we will do the matching on confounders.
  2. Match subjects in the treatment and control group using the shortest distance. There are different ways of calculating distances, we will use the Mahalanobis Distance Matching (MDM) in this tutorial. It is similar to the Euclidean distance. The difference is that Mahalanobis Distance Matching (MDM) uses the standardized data while Euclidean distance uses the original data.
  3. Define a caliper as the maximum distance threshold we are willing to accept.
    • A small caliper means a small distance threshold, a better balance between treatment and control group, a smaller number of matched pairs, and we get results with less bias and more variance.
    • A large caliper means a large distance threshold, a worse balance between treatment and control group, a larger number of matched pairs and we get results with more bias and less variance.

Step 1: Install and Import Libraries

In step 1, we will install and import libraries.

Firstly, let’s install dowhy for dataset creation.

# Install dowhy
!pip install dowhy

Then, we activate the R magic command in the Google Colab notebook using rpy2 and install the R packages tableone and Matching. This tutorial uses both Python and R in the same Colab notebook. To get details on using R with Colab, please check out my tutorial How to Use R with Google Colab Notebook.

# activate R magic
%load_ext rpy2.ipython

# Install R package tableone
%R install.packages('tableone')

# Install R package Matching
%R install.packages('Matching')

After the installation is completed, we can import the libraries.

  • The datasets is imported from dowhy for dataset creation.
  • pandas and numpy are imported for data processing.
  • Packages from rpy2 are imported for Python to R conversion.
  • Then we imported the two R libraries, tableone and Matching.
# Package to create synthetic data for causal inference
from dowhy import datasets

# Data processing
import pandas as pd
import numpy as np

# Python to R conversion
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

# Import R libraries
tableone = importr('tableone')
Matching = importr('Matching')

# Read R libraries
%R library(tableone)
%R library(Matching)

Step 2:Create Dataset

In step 2, we will create a synthetic dataset for the causal inference.

  • Firstly, we set a random seed using np.random.seed to make the dataset reproducible.
  • Then a dataset with the true causal impact of 10, four confounders, 10,000 samples, a binary treatment variable, and a continuous outcome variable is created.
  • After that, we created a dataframe for the data. In the dataframe, the columns W0, W1, W2, and W3 are the four confounders, v0 is the treatment indicator, and y is the outcome.
# Set random seed
np.random.seed(42)

# Create a synthetic dataset
data = datasets.linear_dataset(
    beta=10,
    num_common_causes=4,
    num_samples=10000,
    treatment_is_binary=True,
    outcome_is_binary=False)

# Create Dataframe
df = data['df']

# Take a look at the data
df.head()
Causal Inference Dataset – GrabNGoInfo.com

Next, let’s rename v0 to treatment, rename y to outcome, and convert the boolean values to 0s and 1s.

# Rename columns
df = df.rename({'v0': 'treatment', 'y': 'outcome'}, axis=1)

# Create the treatment variable, and change boolean values to 1 and 0
df['treatment'] = df['treatment'].apply(lambda x: 1 if x == True else 0)

# Take a look at the data
df.head()

Finally, the Python dataframe is converted to R dataframe in order to use R packages on the dataset.

# Convert Python dataframe to R dataframe
with localconverter(ro.default_converter + pandas2ri.converter):
  dfr = ro.conversion.py2rpy(df)

# Create a variable name in R's global environment
from rpy2.robjects import globalenv
globalenv['dfr'] = dfr

Step 3: Direct Differences

In step 3, we will get the direct difference between the treatment group and the non-treatment group.

We can see that for the group that received the treatment, the average outcome is about 14, while for the group that did not receive the treatment, the average outcome is about -2.

# Get the average outcome for treatment group and non-treatment group
df.groupby('treatment')['outcome'].mean()

Output

treatment
0    -2.191159
1    13.940382
Name: outcome, dtype: float64

If we do not use causal inference and take the direct difference between the two groups, we will draw the conclusion that the treatment increased the outcome by 16, which is much higher than the true causal impact of 10.

# Print the direct difference
print('The direct differenc is', df.groupby('treatment')['outcome'].mean()[1] -df.groupby('treatment')['outcome'].mean()[0])
The direct differenc is 16.131541137888554

Step 4: Standardized Mean Difference of Confounders

In step 4, we will check the standardized mean difference (SMD) between the treatment group and the non-treatment group.

  • var=confounders means that we will summarize the data for the four confounders.
  • strata='treatment' means that the variables will be grouped by the treatment type.
  • test=TRUE means the p-value for the significance testing will be reported.
  • smd=TRUE means that the standardized mean difference will be reported.
%%R 
# Confounders
confounders <- c('W0', 'W1', 'W2', 'W3')

# Create table one, test=TRUE means with significance testing
table1 <- CreateTableOne(var=confounders, strata='treatment', data=dfr, test=TRUE)

# Print the results
print(table1, smd=TRUE)

Standardized Mean Differences(SMD) greater than 0.1 means that the data is imbalanced between the treatment and the control group. We can see that most of the confounders have SMD greater than 0.1 and a significant p-value, meaning that the confounders that different between the treatment group and the non-treatment group.

                Stratified by treatment
                 0            1            p      test SMD   
  n               2269         7731                          
  W0 (mean (SD)) -0.78 (0.94) -0.09 (0.97) <0.001       0.724
  W1 (mean (SD)) -0.18 (0.68)  1.21 (0.83) <0.001       1.825
  W2 (mean (SD))  0.12 (1.00)  0.55 (0.99) <0.001       0.435
  W3 (mean (SD))  0.13 (0.97)  0.22 (1.01) <0.001       0.095

Step 5: Greedy Matching on Confounders Using Mahalanobis Distance

In step 5, we will perform Mahalanobis Distance Matching (MDM) on the dataset using Match in the R Matching package.

  • Tr takes in the treatment indicator column.
  • M=1 means one record in the treatment group is matched with one record in the control group.
  • X takes the list of confounders to match on.
  • Match returns objects index.treated and index.control to reconstruct the matched dataset.
%%R 

# Greedy match on Mahalanobis distance
greedymatch_mdm <- Match(Tr=dfr[['treatment']], M=1, X=dfr[confounders])
matched_mdm <- dfr[unlist(greedymatch_mdm[c('index.treated', 'index.control')]),]

# Matched data
matchedtab1_mdm <- CreateTableOne(vars=confounders, strata='treatment', data=matched_mdm, test = TRUE)
print(matchedtab1_mdm, smd=TRUE)
                Stratified by treatment
                 0            1            p      test SMD   
  n               7731         7731                          
  W0 (mean (SD)) -0.33 (0.89) -0.09 (0.97) <0.001       0.260
  W1 (mean (SD))  0.70 (0.57)  1.21 (0.83) <0.001       0.709
  W2 (mean (SD))  0.40 (0.93)  0.55 (0.99) <0.001       0.159
  W3 (mean (SD))  0.20 (0.95)  0.22 (1.01)  0.152       0.023

TableOne results show that after matching, the test and control are still significantly different for some confounders, so we need to add a caliper.

A caliper is the distance threshold for any match. The matched pairs with a distance greater than the caliper distance will be dropped.

caliper=.2 means that matches greater than 20% of the standard deviations of each confounder will be dropped.

%%R 

# Greedy match
greedymatch_mdm <- Match(Tr=dfr[['treatment']], M=1, X=dfr[confounders], caliper=0.2)
matched_mdm <- dfr[unlist(greedymatch_mdm[c('index.treated', 'index.control')]),]

# Matched data
matchedtab1_mdm <- CreateTableOne(vars=confounders, strata='treatment', data=matched_mdm, test = TRUE)
print(matchedtab1_mdm, smd=TRUE)

We can see that after adding the caliper, the Standardized Mean Distance (SMD) decreased and the p-values are not significant, meaning that the matched treatment and control group are not significantly different from each other for all the confounders.

                Stratified by treatment
                 0            1            p      test SMD   
  n                659          659                          
  W0 (mean (SD)) -0.43 (0.72) -0.41 (0.72)  0.685       0.022
  W1 (mean (SD))  0.33 (0.45)  0.37 (0.45)  0.171       0.075
  W2 (mean (SD))  0.26 (0.73)  0.27 (0.73)  0.783       0.015
  W3 (mean (SD))  0.07 (0.71)  0.08 (0.72)  0.849       0.010

Step 6: Causal Inference Using Mahalanobis Distance Matching

In step 6, we will calculate the causal impact on the matched dataset. Firstly, we get the outcomes for the matched treatment samples and the matched control samples. Then a paired t-test is performed to check if the mean difference is equal to 0.

%%R
# Get the matched treatement outcomes
y_treatment_mdm <- matched_mdm$outcome[matched_mdm$treatment==1]

# Get the matched control outcomes
y_control_mdm <- matched_mdm$outcome[matched_mdm$treatment==0]

# t-test
t.test(y_treatment_mdm,
       y_control_mdm,
       paired=TRUE,
       conf.level=0.95)

From the output of the paired t-test, we can see that:

  • The p-value is close to 0, meaning that the outcome mean of the treatment group is different from the outcome mean of the control group.
  • The 95% confidence interval is between 10.12 and 10.21.
  • The outcome mean difference is 10.1649, which is very close to the true causal impact of 10. Compared with the direct difference without matching of 16.13, the Mahalanobis Distance Matching (MDM) increased the accuracy by about 60%.
	Paired t-test

data:  y_treatment_mdm and y_control_mdm
t = 445.89, df = 658, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 10.12014 10.20966
sample estimates:
mean difference 
        10.1649 

Put All Code Together

# ----------------------------------------------- #
# Step 1: Install and Import Libraries
# ----------------------------------------------- #

# Install dowhy
!pip install dowhy

# activate R magic
%load_ext rpy2.ipython

# Install R package tableone
%R install.packages('tableone')

# Install R package Matching
%R install.packages('Matching')

# Package to create synthetic data for causal inference
from dowhy import datasets

# Data processing
import pandas as pd
import numpy as np

# Python to R conversion
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

# Import R libraries
tableone = importr('tableone')
Matching = importr('Matching')

# Read R libraries
%R library(tableone)
%R library(Matching)

# ----------------------------------------------- #
# Step 2:Create Dataset
# ----------------------------------------------- #

# Set random seed
np.random.seed(42)

# Create a synthetic dataset
data = datasets.linear_dataset(
    beta=10,
    num_common_causes=4,
    num_samples=10000,
    treatment_is_binary=True,
    outcome_is_binary=False)

# Create Dataframe
df = data['df']

# Take a look at the data
df.head()

# Rename columns
df = df.rename({'v0': 'treatment', 'y': 'outcome'}, axis=1)

# Create the treatment variable, and change boolean values to 1 and 0
df['treatment'] = df['treatment'].apply(lambda x: 1 if x == True else 0)

# Take a look at the data
df.head()

# Convert Python dataframe to R dataframe
with localconverter(ro.default_converter + pandas2ri.converter):
  dfr = ro.conversion.py2rpy(df)

# Create a variable name in R's global environment
from rpy2.robjects import globalenv
globalenv['dfr'] = dfr

# ----------------------------------------------- #
# Step 3: Direct Differences
# ----------------------------------------------- #

# Get the average outcome for treatment group and non-treatment group
df.groupby('treatment')['outcome'].mean()

# Print the direct difference
print('The direct differenc is', df.groupby('treatment')['outcome'].mean()[1] -df.groupby('treatment')['outcome'].mean()[0])

# ----------------------------------------------- #
# Step 4: Standardized Mean Difference of Confounders
# ----------------------------------------------- #

%%R 
# Confounders
confounders <- c('W0', 'W1', 'W2', 'W3')

# Create table one, test=TRUE means with significance testing
table1 <- CreateTableOne(var=confounders, strata='treatment', data=dfr, test=TRUE)

# Print the results
print(table1, smd=TRUE)

# ----------------------------------------------- #
# Step 5: Greedy Matching on Confounders Using Mahalanobis Distance
# ----------------------------------------------- #

%%R 

# Greedy match on Mahalanobis distance
greedymatch_mdm <- Match(Tr=dfr[['treatment']], M=1, X=dfr[confounders])
matched_mdm <- dfr[unlist(greedymatch_mdm[c('index.treated', 'index.control')]),]

# Matched data
matchedtab1_mdm <- CreateTableOne(vars=confounders, strata='treatment', data=matched_mdm, test = TRUE)
print(matchedtab1_mdm, smd=TRUE)

%%R 

# Greedy match
greedymatch_mdm <- Match(Tr=dfr[['treatment']], M=1, X=dfr[confounders], caliper=0.2)
matched_mdm <- dfr[unlist(greedymatch_mdm[c('index.treated', 'index.control')]),]

# Matched data
matchedtab1_mdm <- CreateTableOne(vars=confounders, strata='treatment', data=matched_mdm, test = TRUE)
print(matchedtab1_mdm, smd=TRUE)

# ----------------------------------------------- #
# Step 6: Causal Inference Using Mahalanobis Distance Matching
# ----------------------------------------------- #

%%R
# Get the matched treatement outcomes
y_treatment_mdm <- matched_mdm$outcome[matched_mdm$treatment==1]

# Get the matched control outcomes
y_control_mdm <- matched_mdm$outcome[matched_mdm$treatment==0]

# t-test
t.test(y_treatment_mdm,
       y_control_mdm,
       paired=TRUE,
       conf.level=0.95)

Summary

In this tutorial, we talked about how to use Mahalanobis Distance Matching (MDM) for causal inference using the R package Matching. You learned:

  • How do the matching algorithms work?
  • How to do Mahahanois Distance Matching (MDM) using R?
  • How to do causal inference using matched data?
  • How much improvement in accuracy is made by causal inference?

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.

Recommended Tutorials

References

[1] DoWhy Documentation

[2] rpy2 R and pandas Dataframe Documentation

[3] Stuart EA. Matching methods for causal inference: A review and a look forward. Stat Sci. 2010 Feb 1;25(1):1-21. doi: 10.1214/09-STS313. PMID: 20871802; PMCID: PMC2943670.

[4] R MatchIt Documentation

[5] Coursera course: A Crash Course in Causality

[6] R Matching Documentation on Match

[7] R CreateTableOne Documentation

Leave a Comment

Your email address will not be published. Required fields are marked *