Causal inference is the process of determining the effect of a treatment. The causal impact can be evaluated by randomized experiments or observational studies.

- A randomized experiment randomly separates the samples into the treatment group and the control group. The causal impact can be calculated by getting the difference between the treatment group and the control group.
- When only the observational data is available, we can use causal inference algorithms to calculate the causal impact. Such algorithms include but are not limited to difference-in-difference, matching, Inverse Probability Treatment Weighting (IPTW), and counterfactual modeling.

In this tutorial, we will talk about how to use Mahalanobis Distance Matching (MDM) for causal inference using the R package `Matching`

. You will learn:

- How do the matching algorithms work?
- How to do Mahalanobis Distance Matching (MDM) using R?
- How to do causal inference using matched data?
- How much improvement in accuracy is made by causal inference?

**Resources for this post:**

- Python code is at the end of the post. Click here for the Colab notebook.
- More video tutorials on Causal Inference
- More blog posts on Causal Inference

- If you prefer the video version of the tutorial, watch the video below on YouTube.

Let’s get started!

### Step 0: Matching Algorithms

Matching takes a sample in the treatment group and finds a similar sample in the non-treatment group. The goal of matching is to create the treatment group and the control group that are comparable.

The matching process involves the following steps:

- Compute a distance between the subject in the treatment group and the samples in the control group. The distance can be calculated using the confounders or using the propensity scores. In this tutorial, we will do the matching on confounders.
- Match subjects in the treatment and control group using the shortest distance. There are different ways of calculating distances, we will use the Mahalanobis Distance Matching (MDM) in this tutorial. It is similar to the Euclidean distance. The difference is that Mahalanobis Distance Matching (MDM) uses the standardized data while Euclidean distance uses the original data.
- Define a caliper as the maximum distance threshold we are willing to accept.
- A small caliper means a small distance threshold, a better balance between treatment and control group, a smaller number of matched pairs, and we get results with less bias and more variance.
- A large caliper means a large distance threshold, a worse balance between treatment and control group, a larger number of matched pairs and we get results with more bias and less variance.

### Step 1: Install and Import Libraries

In step 1, we will install and import libraries.

Firstly, let’s install dowhy for dataset creation.

# Install dowhy !pip install dowhy

Then, we activate the R magic command in the Google Colab notebook using rpy2 and install the R packages tableone and Matching. This tutorial uses both Python and R in the same Colab notebook. To get details on using R with Colab, please check out my tutorial How to Use R with Google Colab Notebook.

# activate R magic %load_ext rpy2.ipython # Install R package tableone %R install.packages('tableone') # Install R package Matching %R install.packages('Matching')

After the installation is completed, we can import the libraries.

- The
`datasets`

is imported from`dowhy`

for dataset creation. `pandas`

and`numpy`

are imported for data processing.- Packages from
`rpy2`

are imported for Python to R conversion. - Then we imported the two R libraries,
`tableone`

and`Matching`

.

# Package to create synthetic data for causal inference from dowhy import datasets # Data processing import pandas as pd import numpy as np # Python to R conversion import rpy2.robjects as ro from rpy2.robjects.packages import importr from rpy2.robjects import pandas2ri from rpy2.robjects.conversion import localconverter # Import R libraries tableone = importr('tableone') Matching = importr('Matching') # Read R libraries %R library(tableone) %R library(Matching)

### Step 2ï¼šCreate Dataset

In step 2, we will create a synthetic dataset for the causal inference.

- Firstly, we set a random seed using
`np.random.seed`

to make the dataset reproducible. - Then a dataset with the true causal impact of 10, four confounders, 10,000 samples, a binary treatment variable, and a continuous outcome variable is created.
- After that, we created a dataframe for the data. In the dataframe, the columns W0, W1, W2, and W3 are the four confounders, v0 is the treatment indicator, and y is the outcome.

# Set random seed np.random.seed(42) # Create a synthetic dataset data = datasets.linear_dataset( beta=10, num_common_causes=4, num_samples=10000, treatment_is_binary=True, outcome_is_binary=False) # Create Dataframe df = data['df'] # Take a look at the data df.head()

Next, let’s rename `v0`

to `treatment`

, rename `y`

to `outcome`

, and convert the boolean values to 0s and 1s.

# Rename columns df = df.rename({'v0': 'treatment', 'y': 'outcome'}, axis=1) # Create the treatment variable, and change boolean values to 1 and 0 df['treatment'] = df['treatment'].apply(lambda x: 1 if x == True else 0) # Take a look at the data df.head()

Finally, the Python dataframe is converted to R dataframe in order to use R packages on the dataset.

# Convert Python dataframe to R dataframe with localconverter(ro.default_converter + pandas2ri.converter): dfr = ro.conversion.py2rpy(df) # Create a variable name in R's global environment from rpy2.robjects import globalenv globalenv['dfr'] = dfr

### Step 3: Direct Differences

In step 3, we will get the direct difference between the treatment group and the non-treatment group.

We can see that for the group that received the treatment, the average outcome is about 14, while for the group that did not receive the treatment, the average outcome is about -2.

# Get the average outcome for treatment group and non-treatment group df.groupby('treatment')['outcome'].mean()

Output

treatment 0 -2.191159 1 13.940382 Name: outcome, dtype: float64

If we do not use causal inference and take the direct difference between the two groups, we will draw the conclusion that the treatment increased the outcome by 16, which is much higher than the true causal impact of 10.

# Print the direct difference print('The direct differenc is', df.groupby('treatment')['outcome'].mean()[1] -df.groupby('treatment')['outcome'].mean()[0])

The direct differenc is 16.131541137888554

### Step 4: Standardized Mean Difference of Confounders

In step 4, we will check the standardized mean difference (SMD) between the treatment group and the non-treatment group.

`var=confounders`

means that we will summarize the data for the four confounders.`strata='treatment'`

means that the variables will be grouped by the treatment type.`test=TRUE`

means the p-value for the significance testing will be reported.`smd=TRUE`

means that the standardized mean difference will be reported.

%%R # Confounders confounders <- c('W0', 'W1', 'W2', 'W3') # Create table one, test=TRUE means with significance testing table1 <- CreateTableOne(var=confounders, strata='treatment', data=dfr, test=TRUE) # Print the results print(table1, smd=TRUE)

Standardized Mean Differences(SMD) greater than 0.1 means that the data is imbalanced between the treatment and the control group. We can see that most of the confounders have SMD greater than 0.1 and a significant p-value, meaning that the confounders that different between the treatment group and the non-treatment group.

Stratified by treatment 0 1 p test SMD n 2269 7731 W0 (mean (SD)) -0.78 (0.94) -0.09 (0.97) <0.001 0.724 W1 (mean (SD)) -0.18 (0.68) 1.21 (0.83) <0.001 1.825 W2 (mean (SD)) 0.12 (1.00) 0.55 (0.99) <0.001 0.435 W3 (mean (SD)) 0.13 (0.97) 0.22 (1.01) <0.001 0.095

### Step 5: Greedy Matching on Confounders Using Mahalanobis Distance

In step 5, we will perform Mahalanobis Distance Matching (MDM) on the dataset using `Match`

in the R `Matching`

package.

`Tr`

takes in the treatment indicator column.`M=1`

means one record in the treatment group is matched with one record in the control group.`X`

takes the list of confounders to match on.`Match`

returns objects`index.treated`

and`index.control`

to reconstruct the matched dataset.

%%R # Greedy match on Mahalanobis distance greedymatch_mdm <- Match(Tr=dfr[['treatment']], M=1, X=dfr[confounders]) matched_mdm <- dfr[unlist(greedymatch_mdm[c('index.treated', 'index.control')]),] # Matched data matchedtab1_mdm <- CreateTableOne(vars=confounders, strata='treatment', data=matched_mdm, test = TRUE) print(matchedtab1_mdm, smd=TRUE)

Stratified by treatment 0 1 p test SMD n 7731 7731 W0 (mean (SD)) -0.33 (0.89) -0.09 (0.97) <0.001 0.260 W1 (mean (SD)) 0.70 (0.57) 1.21 (0.83) <0.001 0.709 W2 (mean (SD)) 0.40 (0.93) 0.55 (0.99) <0.001 0.159 W3 (mean (SD)) 0.20 (0.95) 0.22 (1.01) 0.152 0.023

TableOne results show that after matching, the test and control are still significantly different for some confounders, so we need to add a caliper.

A caliper is the distance threshold for any match. The matched pairs with a distance greater than the caliper distance will be dropped.

`caliper=.2`

means that matches greater than 20% of the standard deviations of each confounder will be dropped.

%%R # Greedy match greedymatch_mdm <- Match(Tr=dfr[['treatment']], M=1, X=dfr[confounders], caliper=0.2) matched_mdm <- dfr[unlist(greedymatch_mdm[c('index.treated', 'index.control')]),] # Matched data matchedtab1_mdm <- CreateTableOne(vars=confounders, strata='treatment', data=matched_mdm, test = TRUE) print(matchedtab1_mdm, smd=TRUE)

We can see that after adding the caliper, the Standardized Mean Distance (SMD) decreased and the p-values are not significant, meaning that the matched treatment and control group are not significantly different from each other for all the confounders.

Stratified by treatment 0 1 p test SMD n 659 659 W0 (mean (SD)) -0.43 (0.72) -0.41 (0.72) 0.685 0.022 W1 (mean (SD)) 0.33 (0.45) 0.37 (0.45) 0.171 0.075 W2 (mean (SD)) 0.26 (0.73) 0.27 (0.73) 0.783 0.015 W3 (mean (SD)) 0.07 (0.71) 0.08 (0.72) 0.849 0.010

### Step 6: Causal Inference Using Mahalanobis Distance Matching

In step 6, we will calculate the causal impact on the matched dataset. Firstly, we get the outcomes for the matched treatment samples and the matched control samples. Then a paired t-test is performed to check if the mean difference is equal to 0.

%%R # Get the matched treatement outcomes y_treatment_mdm <- matched_mdm$outcome[matched_mdm$treatment==1] # Get the matched control outcomes y_control_mdm <- matched_mdm$outcome[matched_mdm$treatment==0] # t-test t.test(y_treatment_mdm, y_control_mdm, paired=TRUE, conf.level=0.95)

From the output of the paired t-test, we can see that:

- The p-value is close to 0, meaning that the outcome mean of the treatment group is different from the outcome mean of the control group.
- The 95% confidence interval is between 10.12 and 10.21.
- The outcome mean difference is 10.1649, which is very close to the true causal impact of 10. Compared with the direct difference without matching of 16.13, the Mahalanobis Distance Matching (MDM) increased the accuracy by about 60%.

Paired t-test data: y_treatment_mdm and y_control_mdm t = 445.89, df = 658, p-value < 2.2e-16 alternative hypothesis: true mean difference is not equal to 0 95 percent confidence interval: 10.12014 10.20966 sample estimates: mean difference 10.1649

### Put All Code Together

# ----------------------------------------------- # # Step 1: Install and Import Libraries # ----------------------------------------------- # # Install dowhy !pip install dowhy # activate R magic %load_ext rpy2.ipython # Install R package tableone %R install.packages('tableone') # Install R package Matching %R install.packages('Matching') # Package to create synthetic data for causal inference from dowhy import datasets # Data processing import pandas as pd import numpy as np # Python to R conversion import rpy2.robjects as ro from rpy2.robjects.packages import importr from rpy2.robjects import pandas2ri from rpy2.robjects.conversion import localconverter # Import R libraries tableone = importr('tableone') Matching = importr('Matching') # Read R libraries %R library(tableone) %R library(Matching) # ----------------------------------------------- # # Step 2ï¼šCreate Dataset # ----------------------------------------------- # # Set random seed np.random.seed(42) # Create a synthetic dataset data = datasets.linear_dataset( beta=10, num_common_causes=4, num_samples=10000, treatment_is_binary=True, outcome_is_binary=False) # Create Dataframe df = data['df'] # Take a look at the data df.head() # Rename columns df = df.rename({'v0': 'treatment', 'y': 'outcome'}, axis=1) # Create the treatment variable, and change boolean values to 1 and 0 df['treatment'] = df['treatment'].apply(lambda x: 1 if x == True else 0) # Take a look at the data df.head() # Convert Python dataframe to R dataframe with localconverter(ro.default_converter + pandas2ri.converter): dfr = ro.conversion.py2rpy(df) # Create a variable name in R's global environment from rpy2.robjects import globalenv globalenv['dfr'] = dfr # ----------------------------------------------- # # Step 3: Direct Differences # ----------------------------------------------- # # Get the average outcome for treatment group and non-treatment group df.groupby('treatment')['outcome'].mean() # Print the direct difference print('The direct differenc is', df.groupby('treatment')['outcome'].mean()[1] -df.groupby('treatment')['outcome'].mean()[0]) # ----------------------------------------------- # # Step 4: Standardized Mean Difference of Confounders # ----------------------------------------------- # %%R # Confounders confounders <- c('W0', 'W1', 'W2', 'W3') # Create table one, test=TRUE means with significance testing table1 <- CreateTableOne(var=confounders, strata='treatment', data=dfr, test=TRUE) # Print the results print(table1, smd=TRUE) # ----------------------------------------------- # # Step 5: Greedy Matching on Confounders Using Mahalanobis Distance # ----------------------------------------------- # %%R # Greedy match on Mahalanobis distance greedymatch_mdm <- Match(Tr=dfr[['treatment']], M=1, X=dfr[confounders]) matched_mdm <- dfr[unlist(greedymatch_mdm[c('index.treated', 'index.control')]),] # Matched data matchedtab1_mdm <- CreateTableOne(vars=confounders, strata='treatment', data=matched_mdm, test = TRUE) print(matchedtab1_mdm, smd=TRUE) %%R # Greedy match greedymatch_mdm <- Match(Tr=dfr[['treatment']], M=1, X=dfr[confounders], caliper=0.2) matched_mdm <- dfr[unlist(greedymatch_mdm[c('index.treated', 'index.control')]),] # Matched data matchedtab1_mdm <- CreateTableOne(vars=confounders, strata='treatment', data=matched_mdm, test = TRUE) print(matchedtab1_mdm, smd=TRUE) # ----------------------------------------------- # # Step 6: Causal Inference Using Mahalanobis Distance Matching # ----------------------------------------------- # %%R # Get the matched treatement outcomes y_treatment_mdm <- matched_mdm$outcome[matched_mdm$treatment==1] # Get the matched control outcomes y_control_mdm <- matched_mdm$outcome[matched_mdm$treatment==0] # t-test t.test(y_treatment_mdm, y_control_mdm, paired=TRUE, conf.level=0.95)

### Summary

In this tutorial, we talked about how to use Mahalanobis Distance Matching (MDM) for causal inference using the R package `Matching`

. You learned:

- How do the matching algorithms work?
- How to do Mahahanois Distance Matching (MDM) using R?
- How to do causal inference using matched data?
- How much improvement in accuracy is made by causal inference?

For more information about data science and machine learning, please check out myÂ YouTube channelÂ andÂ Medium PageÂ or follow me onÂ LinkedIn.

### Recommended Tutorials

- GrabNGoInfo Machine Learning Tutorials Inventory
- One-Class SVM For Anomaly Detection
- 3 Ways for Multiple Time Series Forecasting Using Prophet in Python
- Four Oversampling And Under-Sampling Methods For Imbalanced Classification Using Python
- Multivariate Time Series Forecasting with Seasonality and Holiday Effect Using Prophet in Python
- How to detect outliers | Data Science Interview Questions and Answers
- Time Series Anomaly Detection Using Prophet in Python
- How to Use R with Google Colab Notebook

### References

[2] rpy2 R and pandas Dataframe Documentation

[3] Stuart EA. Matching methods for causal inference: A review and a look forward. Stat Sci. 2010 Feb 1;25(1):1-21. doi: 10.1214/09-STS313. PMID: 20871802; PMCID: PMC2943670.

[5] Coursera course: A Crash Course in Causality