
Causal Inference Logit Propensity Score Matching (PSM)

How can Python and R users use the R Matching package for causal inference with logit Propensity Score Matching (PSM)? Causal inference has well-established packages in R, but not in Python. This tutorial provides an example of using R packages for causal analysis in a Python notebook.

  • If you are a Python user, please follow the steps in the tutorial to set up the environment to run the R package Matching for propensity score matching. To learn more about how to run R in a Python notebook, please check out my previous tutorial How to Use R with Google Colab Notebook.
  • If you are an R user, please ignore the Python code and use the R code directly. The R code is in the lines starting with %R and the cells starting with %%R.

In this tutorial, you will learn:

  • How do the Propensity Score Matching (PSM) algorithms work?
  • How to conduct Propensity Score Matching (PSM) or Logit Propensity Score Matching (PSM) using the R package Matching?
  • How to check the balance of the matched dataset?
  • How to calculate causal impact using a paired t-test?

Resources for this post:

  • If you prefer the video version of the tutorial, watch the video below on YouTube. Click here for the Colab notebook.
Causal Inference Logit Propensity Score Matching (PSM) – GrabNGoInfo.com

Let’s get started!

Step 0: Propensity Score Matching (PSM) Algorithms for Causal Inference

Propensity Score Matching (PSM) takes each unit in the treatment group and finds a unit with a similar propensity score, or a similar logit of the propensity score, in the non-treatment group. The goal of matching is to create treatment and control groups that are comparable on the confounders.

The Propensity Score Matching (PSM) process involves the following steps:

  1. Create a matched dataset based on similarities in the probability of getting the treatment. The probability of getting the treatment given the confounders is called the propensity score (see the short sketch after this list).
  2. Evaluate the balance of the matched dataset.
    • If the matched dataset is balanced, meaning that the treatment and control groups have similar confounder distributions, move forward to the next step.
    • If the matched dataset is not balanced, go back to the first step, adjust the hyperparameters, and redo the propensity score matching.
  3. Measure the causal impact. In this tutorial, we will use a paired t-test to measure the impact of the treatment.
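
Before working with the real data, here is a toy sketch of step 1 in R: the propensity score is the fitted probability of treatment from a logistic regression of the treatment indicator on the confounders. The data and the names W, treat, and ps below are made up purely for illustration (Python users need the R magic set up in Step 1 to run R cells).

%%R
# Toy sketch (hypothetical data, not the tutorial dataset): the propensity score is
# P(treatment = 1 | confounders), estimated here with a logistic regression
set.seed(1)
W <- rnorm(500)                                        # a single made-up confounder
treat <- rbinom(500, 1, plogis(0.8 * W))               # treatment probability depends on W
ps <- glm(treat ~ W, family=binomial)$fitted.values    # estimated propensity scores
head(ps)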

Step 1: Install and Import Libraries

In step 1, we will install and import libraries.

Firstly, let’s install dowhy for dataset creation.

# Install dowhy
!pip install dowhy

Then, we activate the R magic command in the Google Colab notebook using rpy2 and install the R packages tableone, Matching, gtools, lmtest, and sandwich.

# activate R magic
%load_ext rpy2.ipython

# Install R package tableone
%R install.packages('tableone')

# Install R package Matching
%R install.packages('Matching')

# Install R package gtools
%R install.packages('gtools')

# Install R package lmtest
%R install.packages('lmtest')

# Install R package sandwich
%R install.packages('sandwich')

After the installation is completed, we can import the libraries.

  • The datasets module is imported from dowhy for dataset creation.
  • pandas and numpy are imported for data processing.
  • Packages from rpy2 are imported for Python to R conversion.
  • Then we import the R libraries tableone, Matching, gtools, lmtest, and sandwich.
# Package to create synthetic data for causal inference
from dowhy import datasets

# Data processing
import pandas as pd
import numpy as np

# Python to R conversion
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

# Import R libraries
tableone = importr('tableone')
Matching = importr('Matching')
gtools = importr('gtools')
lmtest = importr('lmtest')
sandwich = importr('sandwich')

# Read R libraries
%R library(tableone)
%R library(Matching)
%R library(gtools) # For the logit function
%R library(lmtest)
%R library(sandwich)

Step 2: Create Dataset

In step 2, we will create a synthetic dataset for the causal inference.

  • Firstly, we set a random seed using np.random.seed to make the dataset reproducible.
  • Then a dataset with the true causal impact of 10, 4 confounders, 10,000 samples, a binary treatment variable, and a continuous outcome variable is created.
  • After that, we create a dataframe from the data. In the dataframe, the columns W0, W1, W2, and W3 are the four confounders, v0 is the treatment indicator, and y is the outcome.
# Set random seed
np.random.seed(42)

# Create a synthetic dataset
data = datasets.linear_dataset(
    beta=10,
    num_common_causes=4,
    num_samples=10000,
    treatment_is_binary=True,
    outcome_is_binary=False)

# Create Dataframe
df = data['df']

# Take a look at the data
df.head()
Dowhy synthetic data for causal inference — GrabNGoInfo.com

Next, let’s rename v0 to treatment, rename y to outcome, and convert the boolean values to 0s and 1s.

# Rename columns
df = df.rename({'v0': 'treatment', 'y': 'outcome'}, axis=1)

# Create the treatment variable, and change boolean values to 1 and 0
df['treatment'] = df['treatment'].apply(lambda x: 1 if x == True else 0)

# Take a look at the data
df.head()
Dowhy synthetic data for causal inference — GrabNGoInfo.com

Finally, the Python dataframe is converted to an R dataframe so that the R packages can be used on the dataset.

# Convert Python dataframe to R dataframe
with localconverter(ro.default_converter + pandas2ri.converter):
  dfr = ro.conversion.py2rpy(df)

# Make the converted dataframe available as 'dfr' in R's global environment
from rpy2.robjects import globalenv
globalenv['dfr'] = dfr

Step 3: Unadjusted Mean Differences

In step 3, we will get the unadjusted difference in mean outcomes between the treatment group and the non-treatment group.

%%R
# Get the treatment group outcomes
y_treatment <- dfr$outcome[dfr$treatment==1]

# Get the control group outcomes
y_control <- dfr$outcome[dfr$treatment==0]

# Get the difference in the group mean outcomes
unadjusted_diff <- mean(y_treatment) - mean(y_control)
print(unadjusted_diff)

If we do not use causal inference and take the direct difference between the two groups, we would conclude that the treatment increased the outcome by about 16, which is much higher than the true causal impact of 10.

[1] 16.13221
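
The same unadjusted number is simply the difference between the two group means, which can also be computed in one line:

%%R
# Difference of group means: mean(outcome | treatment=1) - mean(outcome | treatment=0)
diff(tapply(dfr$outcome, dfr$treatment, mean))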

Step 4: Pre-Matching Standardized Mean Difference of Confounders

In step 4, we will check the standardized mean difference (SMD) between the treatment group and the non-treatment group.

  • vars=confounders means that we will summarize the data for the four confounders.
  • strata='treatment' means that the variables will be grouped by the treatment type.
  • test=TRUE means the p-value for the significance testing will be reported.
  • smd=TRUE means that the standardized mean difference will be reported.
%%R 
# Confounders
confounders <- c('W0', 'W1', 'W2', 'W3')

# Create table one, test=TRUE means with significance testing
table1 <- CreateTableOne(vars=confounders, strata='treatment', data=dfr, test=TRUE)

# Print the results
print(table1, smd=TRUE)

A Standardized Mean Difference (SMD) greater than 0.1 is commonly used as the threshold for imbalance between the treatment and the control group. We can see that most confounders have an SMD greater than 0.1 with a significant p-value, meaning that the confounders differ between the treatment group and the non-treatment group.

                Stratified by treatment
                 0            1            p      test SMD   
  n               2269         7731                          
  W0 (mean (SD)) -0.78 (0.94) -0.09 (0.97) <0.001       0.724
  W1 (mean (SD)) -0.18 (0.68)  1.21 (0.83) <0.001       1.825
  W2 (mean (SD))  0.12 (1.00)  0.55 (0.99) <0.001       0.435
  W3 (mean (SD))  0.13 (0.97)  0.22 (1.01) <0.001       0.095
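
For intuition, the SMD that tableone reports for a continuous confounder is the difference in group means divided by the pooled standard deviation. Below is a rough check for W0 using the dfr dataframe from above; the result should be close to the 0.724 shown in the table.

%%R
# Standardized mean difference for W0: |difference in means| / pooled standard deviation
m1 <- mean(dfr$W0[dfr$treatment == 1]); s1 <- sd(dfr$W0[dfr$treatment == 1])
m0 <- mean(dfr$W0[dfr$treatment == 0]); s0 <- sd(dfr$W0[dfr$treatment == 0])
abs(m1 - m0) / sqrt((s1^2 + s0^2) / 2)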

Step 5: Predict Propensity Score

In step 5, we will build a model to predict propensity scores.

  • The function glm() fits a generalized linear model.
  • The propensity score is the probability of getting the treatment, so we need to use the confounders to predict the probability of getting the treatment.
  • family=binomial means that we are running a logistic regression.
  • propensity_model$fitted.values gets the predicted propensity scores and saves it into a variable called propensity_score.
  • summary(propensity_model) gives us the summary of the model.
%%R
# Propensity model
propensity_model <- glm(treatment ~ W0 + W1 + W2 + W3, family=binomial, data=dfr)

# Propensity score
propensity_score <- propensity_model$fitted.values

# Model results
summary(propensity_model)

Output

Call:
glm(formula = treatment ~ W0 + W1 + W2 + W3, family = binomial, 
    data = dfr)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.4960   0.0010   0.0374   0.2078   2.6537  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.05212    0.05309  -0.982    0.326    
W0           2.17105    0.06673  32.537  < 2e-16 ***
W1           4.52534    0.11658  38.816  < 2e-16 ***
W2           1.34954    0.05369  25.134  < 2e-16 ***
W3           0.28627    0.04405   6.500 8.06e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10710.1  on 9999  degrees of freedom
Residual deviance:  3578.4  on 9995  degrees of freedom
AIC: 3588.4

Number of Fisher Scoring iterations: 8
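
As a sanity check on what the fitted values represent, each propensity score is the inverse logit of the linear predictor built from the coefficients above. A minimal sketch, assuming propensity_model and propensity_score from the cell above are still in the R session:

%%R
# The fitted propensity scores equal the inverse logit of the linear predictor
linear_predictor <- predict(propensity_model, type='link')  # intercept + W0..W3 terms
manual_score <- plogis(linear_predictor)                    # 1 / (1 + exp(-linear_predictor))

# Should be essentially zero
max(abs(manual_score - propensity_score))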

Step 6: Propensity Score Matching (PSM) Using R Package Matching

In step 6, we will discuss using the R Matching package to do the Propensity Score Matching (PSM).

After getting the propensity score, we can pass it to the Match function for Propensity Score Matching (PSM).

  • Tr=dfr[['treatment']] passes the treatment indicator variable (a vector of 0s and 1s).
  • M=1 means that we are conducting one-to-one matching. Every treatment unit will be matched with 1 control unit.
  • X=propensity_score means that the match is based on the propensity score values.
  • replace=F means matching without replacement: each sample in the control group is matched with at most one sample in the treatment group, and a matched control unit is not put back into the matching pool.
  • A caliper is the distance threshold for any match. caliper=0.2 means that matched pairs whose propensity scores differ by more than 0.2 standard deviations of the propensity score will be dropped.

After matching is completed, the matched dataset is passed to the CreateTableOne function for balance checking.

%%R
# Propensity score matching
psm_matching <- Match(Tr=dfr[['treatment']], M=1, X=propensity_score, replace=F, caliper = 0.2)

# Matched data
matched_psm_matching <- dfr[unlist(psm_matching[c('index.treated', 'index.control')]),]

# Matched data SMD
matchedtab1_psm_matching <- CreateTableOne(vars=confounders, strata='treatment', data=matched_psm_matching, test = TRUE)
print(matchedtab1_psm_matching, smd=TRUE)

We can see that 792 matched pairs were created, and the standardized mean difference (SMD) for all the confounders is below 0.1 with insignificant p-values.

                Stratified by treatment
                 0            1            p      test SMD   
  n                792          792                          
  W0 (mean (SD)) -0.58 (0.92) -0.52 (0.90)  0.224       0.061
  W1 (mean (SD))  0.28 (0.53)  0.29 (0.62)  0.543       0.031
  W2 (mean (SD))  0.25 (0.98)  0.32 (0.97)  0.182       0.067
  W3 (mean (SD))  0.17 (0.97)  0.18 (1.02)  0.804       0.012
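
As a side note, the Match function can also estimate the treatment effect in the same call when the outcome is supplied through the Y argument, and summary() then reports the estimate along with its standard error. A minimal sketch, assuming the same dfr and propensity_score as above (by default the estimand is the average treatment effect on the treated):

%%R
# Optional: pass the outcome Y so that Match computes the treatment effect directly
psm_with_outcome <- Match(Y=dfr[['outcome']], Tr=dfr[['treatment']], M=1,
                          X=propensity_score, replace=F, caliper=0.2)

# Prints the estimate, standard error, and the number of matched observations
summary(psm_with_outcome)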

Step 7: Logit Propensity Score Matching (PSM) Using R Package Matching

Another way of doing Propensity Score Matching (PSM) is to match on the logit propensity score.

The code for matching is very similar to the regular Propensity Score Matching (PSM). The only difference is that X=propensity_score is replaced with X=logit(propensity_score).

%%R
# Logit propensity score matching
logit_psm_matching <- Match(Tr=dfr[['treatment']], M=1, X=logit(propensity_score), replace=F, caliper = 0.2)

# Matched data
matched_logit_psm_matching <- dfr[unlist(logit_psm_matching[c('index.treated', 'index.control')]),]

# Matched data SMD
matchedtab1_logit_psm_matching <- CreateTableOne(vars=confounders, strata='treatment', data=matched_logit_psm_matching, test = TRUE)
print(matchedtab1_logit_psm_matching, smd=TRUE)

After matching on the logit propensity score, we can see that 880 matched pairs were created, and the standardized mean difference (SMD) for all the confounders is below 0.1 with insignificant p-values.

                Stratified by treatment
                 0            1            p      test SMD   
  n                880          880                          
  W0 (mean (SD)) -0.58 (0.91) -0.52 (0.90)  0.170       0.065
  W1 (mean (SD))  0.25 (0.53)  0.27 (0.53)  0.628       0.023
  W2 (mean (SD))  0.26 (0.97)  0.30 (0.98)  0.361       0.044
  W3 (mean (SD))  0.15 (0.98)  0.17 (1.03)  0.758       0.015
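
For reference, the logit function from gtools used above is the log-odds transform; for probabilities strictly between 0 and 1 it matches base R's qlogis:

%%R
# logit(p) = log(p / (1 - p)); gtools::logit and base R's qlogis agree on (0, 1)
p <- c(0.1, 0.5, 0.9)
log(p / (1 - p))
qlogis(p)
logit(p)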

Step 8: Estimating Treatment Effect Using Matched Data

In step 8, we will estimate the treatment effect using the matched data.

A paired t-test is used on both the Propensity Score Matching (PSM) dataset and the logit Propensity Score Matching (PSM) dataset.

%%R
# Get the matched treatment outcomes
y_treatment_psm_matching <- matched_psm_matching$outcome[matched_psm_matching$treatment==1]

# Get the matched control outcomes
y_control_psm_matching <- matched_psm_matching$outcome[matched_psm_matching$treatment==0]

# t-test
t.test(y_treatment_psm_matching,
       y_control_psm_matching,
       paired=TRUE,
       conf.level=0.95)

For the Propensity Score Matching (PSM) dataset, we can see that

  • The estimated causal impact is 10.39.
  • The 95 percent confidence interval is 10.01 to 10.77.
  • The test is statistically significant because the p-value is close to 0.
	Paired t-test

data:  y_treatment_psm_matching and y_control_psm_matching
t = 53.552, df = 791, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 10.00758 10.76915
sample estimates:
mean difference 
       10.38837 
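
Because the matched treated and control outcomes are aligned pair by pair, the paired t-test point estimate is simply the mean of the within-pair differences. A quick check using the vectors from the cell above:

%%R
# The paired t-test estimate equals the mean of the pairwise outcome differences
mean(y_treatment_psm_matching - y_control_psm_matching)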

For the logit Propensity Score Matching (PSM) dataset, we can see that

  • The estimated causal impact is 10.35.
  • The 95 percent confidence interval is 10.00 to 10.70.
  • The test is statistically significant because the p-value is close to 0.
%%R
# Get the matched treatment outcomes
y_treatment_logit_psm_matching <- matched_logit_psm_matching$outcome[matched_logit_psm_matching$treatment==1]

# Get the matched control outcomes
y_control_logit_psm_matching <- matched_logit_psm_matching$outcome[matched_logit_psm_matching$treatment==0]

# t-test
t.test(y_treatment_logit_psm_matching,
       y_control_logit_psm_matching,
       paired=TRUE,
       conf.level=0.95)

Output

	Paired t-test

data:  y_treatment_logit_psm_matching and y_control_logit_psm_matching
t = 57.491, df = 879, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
  9.996423 10.703077
sample estimates:
mean difference 
       10.34975 

From the paired t-test results, we can see that logit Propensity Score Matching (PSM) gives a slightly more accurate estimate of the causal impact than regular Propensity Score Matching (PSM) for this dataset.

Put All Code Together

#------------------------------------------------#
# Step 1: Install and Import Libraries
#------------------------------------------------#

# Install dowhy
!pip install dowhy

# activate R magic
%load_ext rpy2.ipython

# Install R package tableone
%R install.packages('tableone')

# Install R package Matching
%R install.packages('Matching')

# Install R package gtools
%R install.packages('gtools')

# Install R package lmtest
%R install.packages('lmtest')

# Install R package sandwich
%R install.packages('sandwich')

# Package to create synthetic data for causal inference
from dowhy import datasets

# Data processing
import pandas as pd
import numpy as np

# Python to R conversion
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

# Import R libraries
tableone = importr('tableone')
Matching = importr('Matching')
gtools = importr('gtools')
lmtest = importr('lmtest')
sandwich = importr('sandwich')

# Read R libraries
%R library(tableone)
%R library(Matching)
%R library(gtools) # For the logit function
%R library(lmtest)
%R library(sandwich)

#------------------------------------------------#
# Step 2: Create Dataset
#------------------------------------------------#

# Set random seed
np.random.seed(42)

# Create a synthetic dataset
data = datasets.linear_dataset(
    beta=10,
    num_common_causes=4,
    num_samples=10000,
    treatment_is_binary=True,
    outcome_is_binary=False)

# Create Dataframe
df = data['df']

# Take a look at the data
df.head()

# Rename columns
df = df.rename({'v0': 'treatment', 'y': 'outcome'}, axis=1)

# Create the treatment variable, and change boolean values to 1 and 0
df['treatment'] = df['treatment'].apply(lambda x: 1 if x == True else 0)

# Take a look at the data
df.head()

# Convert Python dataframe to R dataframe
with localconverter(ro.default_converter + pandas2ri.converter):
  dfr = ro.conversion.py2rpy(df)

# Make the converted dataframe available as 'dfr' in R's global environment
from rpy2.robjects import globalenv
globalenv['dfr'] = dfr

#------------------------------------------------#
# Step 3: Unadjusted Mean Differences
#------------------------------------------------#

%%R
# Get the treatment group outcomes
y_treatment <- dfr$outcome[dfr$treatment==1]

# Get the control group outcomes
y_control <- dfr$outcome[dfr$treatment==0]

# Get the difference in the group mean outcomes
unadjusted_diff <- mean(y_treatment) - mean(y_control)
print(unadjusted_diff)

#------------------------------------------------#
# Step 4: Pre-Matching Standardized Mean Difference of Confounders
#------------------------------------------------#

%%R 
# Confounders
confounders <- c('W0', 'W1', 'W2', 'W3')

# Create table one, test=TRUE means with significance testing
table1 <- CreateTableOne(vars=confounders, strata='treatment', data=dfr, test=TRUE)

# Print the results
print(table1, smd=TRUE)

#------------------------------------------------#
# Step 5: Predict Propensity Score
#------------------------------------------------#

%%R
# Propensity model
propensity_model <- glm(treatment ~ W0 + W1 + W2 + W3, family=binomial, data=dfr)

# Propensity score
propensity_score <- propensity_model$fitted.values

# Model results
summary(propensity_model)

#------------------------------------------------#
# Step 6: Propensity Score Matching (PSM) Using R Package Matching
#------------------------------------------------#

%%R
# Propensity score matching
psm_matching <- Match(Tr=dfr[['treatment']], M=1, X=propensity_score, replace=F, caliper = 0.2)

# Matched data
matched_psm_matching <- dfr[unlist(psm_matching[c('index.treated', 'index.control')]),]

# Matched data SMD
matchedtab1_psm_matching <- CreateTableOne(vars=confounders, strata='treatment', data=matched_psm_matching, test = TRUE)
print(matchedtab1_psm_matching, smd=TRUE)

#------------------------------------------------#
# Step 7: Logit Propensity Score Matching (PSM) Using R Package Matching
#------------------------------------------------#

%%R
# Logit propensity score matching
logit_psm_matching <- Match(Tr=dfr[['treatment']], M=1, X=logit(propensity_score), replace=F, caliper = 0.2)

# Matched data
matched_logit_psm_matching <- dfr[unlist(logit_psm_matching[c('index.treated', 'index.control')]),]

# Matched data SMD
matchedtab1_logit_psm_matching <- CreateTableOne(vars=confounders, strata='treatment', data=matched_logit_psm_matching, test = TRUE)
print(matchedtab1_logit_psm_matching, smd=TRUE)

#------------------------------------------------#
# Step 8: Estimating Treatment Effect Using Matched Data
#------------------------------------------------#

%%R
# Get the matched treatment outcomes
y_treatment_psm_matching <- matched_psm_matching$outcome[matched_psm_matching$treatment==1]

# Get the matched control outcomes
y_control_psm_matching <- matched_psm_matching$outcome[matched_psm_matching$treatment==0]

# t-test
t.test(y_treatment_psm_matching,
       y_control_psm_matching,
       paired=TRUE,
       conf.level=0.95)

%%R
# Get the matched treatment outcomes
y_treatment_logit_psm_matching <- matched_logit_psm_matching$outcome[matched_logit_psm_matching$treatment==1]

# Get the matched control outcomes
y_control_logit_psm_matching <- matched_logit_psm_matching$outcome[matched_logit_psm_matching$treatment==0]

# t-test
t.test(y_treatment_logit_psm_matching,
       y_control_logit_psm_matching,
       paired=TRUE,
       conf.level=0.95)

Summary

In this tutorial, we talked about how to do Propensity Score Matching (PSM) using the R Matching package. You learned:

  • How do the Propensity Score Matching (PSM) algorithms work?
  • How to conduct Propensity Score Matching (PSM) or Logit Propensity Score Matching (PSM) using the R package Matching?
  • How to check the balance of the matched dataset?
  • How to calculate causal impact using a paired t-test?

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.


