`CausalInference` is a Python package for causal analysis. It provides functionality such as propensity score trimming, covariate matching, counterfactual modeling, subclassification, and inverse probability weighting.

In this tutorial, we will talk about how to do propensity score trimming using `CausalInference`, and how that impacts the causal impact analysis results. Other functionalities will be introduced in future tutorials.

Let’s get started!

### Step 1: Install and Import Libraries

In step 1, we will install and import libraries.

First, let’s install `dowhy` for dataset creation and `causalinference` for propensity score trimming.

```python
# Install dowhy
!pip install dowhy

# Install causal inference
!pip install causalinference
```

You will see the message below after the libraries are successfully installed.

```
Successfully installed dowhy-0.8 pydot-1.4.2
Successfully installed causalinference-0.1.3
```

After the installation is completed, we can import the libraries.

- `datasets` is imported from `dowhy` for dataset creation.
- `pandas` and `numpy` are imported for data processing.
- `CausalModel` is imported from the `causalinference` package for propensity score trimming and causality analysis.

```python
# Package to create synthetic data for causal inference
from dowhy import datasets

# Data processing
import pandas as pd
import numpy as np

# Causal inference
from causalinference import CausalModel
```

### Step 2: Create Dataset

In step 2, we will create a synthetic dataset for the causal inference.

- First, we set a random seed using `np.random.seed` to make the dataset reproducible.
- Then a dataset with a true causal impact of 10, four confounders, 10,000 samples, a binary treatment variable, and a continuous outcome variable is created.
- After that, we create a dataframe for the data. In the dataframe, the columns `W0`, `W1`, `W2`, and `W3` are the four confounders, `v0` is the treatment indicator, and `y` is the outcome.

```python
# Set random seed
np.random.seed(42)

# Create a synthetic dataset
data = datasets.linear_dataset(
    beta=10,
    num_common_causes=4,
    num_samples=10000,
    treatment_is_binary=True,
    outcome_is_binary=False)

# Create Dataframe
df = data['df']

# Take a look at the data
df.head()
```

Next, let’s rename `v0` to `treatment`, rename `y` to `outcome`, and convert the boolean values to 0 and 1.

```python
# Rename columns
df = df.rename({'v0': 'treatment', 'y': 'outcome'}, axis=1)

# Convert the boolean treatment indicator to 1 and 0
df['treatment'] = df['treatment'].astype(int)

# Take a look at the data
df.head()
```
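To make the idea of a confounded dataset concrete, here is a rough NumPy-only sketch of the kind of data-generating process described in this step. It is not how `dowhy` builds its dataset internally; the coefficient vectors `treatment_coefs` and `outcome_coefs` are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, true_effect = 10_000, 10.0

# Four confounders that drive both the treatment and the outcome
W = rng.normal(size=(n_samples, 4))

# Hypothetical coefficients (illustration only, not dowhy's values)
treatment_coefs = np.array([1.0, 1.0, 1.0, 1.0])
outcome_coefs = np.array([1.0, 1.0, 1.0, 1.0])

# Binary treatment: units with larger confounder values are treated more often
p_treat = 1.0 / (1.0 + np.exp(-(W @ treatment_coefs)))
treatment = rng.binomial(1, p_treat)

# Continuous outcome: true effect of 10 plus confounder effects plus noise
outcome = true_effect * treatment + W @ outcome_coefs + rng.normal(size=n_samples)

# Raw difference in means is biased because of the confounders
raw_diff = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()

# A linear regression that adjusts for the confounders recovers the effect
design = np.column_stack([np.ones(n_samples), treatment, W])
coefs, *_ = np.linalg.lstsq(design, outcome, rcond=None)
adjusted_effect = coefs[1]

print(f"raw difference:  {raw_diff:.2f}")
print(f"adjusted effect: {adjusted_effect:.2f}")
```

In this sketch the raw difference overstates the effect, because treated units also tend to have larger confounder values, while the adjusted estimate lands close to 10.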

### Step 3: Raw Difference

In step 3, we will initialize `CausalModel` and print the pre-trimming summary statistics. `CausalModel` takes three arguments:

- `Y` is the observed outcome.
- `D` is the treatment indicator.
- `X` is the covariates matrix.

`CausalModel` takes arrays as inputs, so `.values` is used when reading the data.

```python
# Run causal model
causal = CausalModel(
    Y=df['outcome'].values,
    D=df['treatment'].values,
    X=df[['W0', 'W1', 'W2', 'W3']].values)

# Print summary statistics
print(causal.summary_stats)
```

`causal.summary_stats` prints out the raw summary statistics. The output shows that:

- There are 2,269 units in the control group and 7,731 units in the treatment group.
- The average outcome for the treatment group is 13.94, and the average outcome for the control group is -2.191. So the raw difference between the treatment and the control group is 16.132.
- `Nor-diff` is the standardized mean difference (SMD) for covariates between the treatment group and the control group. An SMD greater than 0.1 in absolute value is commonly taken to mean that the covariate is imbalanced between the treatment and the control group. We can see that most of the covariates have an SMD greater than 0.1.
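The two quantities discussed above can be reproduced by hand. The sketch below uses a common definition of the standardized mean difference (difference in group means divided by the square root of the average of the two group variances). I am assuming the `Nor-diff` column follows this form; the helper names and the toy arrays are hypothetical.

```python
import numpy as np

def raw_difference(y, d):
    """Difference in mean outcome between treated (d == 1) and control (d == 0)."""
    return y[d == 1].mean() - y[d == 0].mean()

def standardized_mean_difference(x, d):
    """Per-covariate SMD: mean difference over the root of the averaged variances."""
    x_treated, x_control = x[d == 1], x[d == 0]
    avg_var = (x_treated.var(axis=0, ddof=1) + x_control.var(axis=0, ddof=1)) / 2
    return (x_treated.mean(axis=0) - x_control.mean(axis=0)) / np.sqrt(avg_var)

# Toy data: three treated units followed by three control units
d = np.array([1, 1, 1, 0, 0, 0])
y = np.array([5.0, 6.0, 7.0, 1.0, 2.0, 3.0])
x = np.array([[2.0], [4.0], [6.0], [1.0], [2.0], [3.0]])

diff = raw_difference(y, d)
smd = standardized_mean_difference(x, d)

print(diff)
print(smd)
```

With the tutorial's data, the same functions could be applied to `df['outcome'].values`, `df['treatment'].values`, and the `W0` to `W3` columns.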

### Step 4: Propensity Score Estimation

In step 4, we will estimate the propensity scores. The propensity score is the predicted probability of receiving treatment. It is calculated by running a logistic regression with the treatment variable as the target and the covariates as the features.
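As a minimal illustration of that idea, the sketch below fits a logistic regression with Newton's method in plain NumPy and uses the fitted probabilities as propensity scores. It is not the `causalinference` implementation; the function names and the simulated data are assumptions made for this example.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones so the model has an intercept."""
    return np.column_stack([np.ones(len(X)), X])

def fit_logistic(X, d, n_iter=25):
    """Fit logistic regression of treatment d on covariates X by Newton's method."""
    Z = add_intercept(X)
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        gradient = Z.T @ (d - p)
        hessian = (Z * (p * (1 - p))[:, None]).T @ Z
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

def propensity_scores(X, d):
    """Predicted probability of treatment for every unit."""
    beta = fit_logistic(X, d)
    return 1.0 / (1.0 + np.exp(-add_intercept(X) @ beta))

# Simulated example: treatment probability depends on two covariates
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
true_p = 1.0 / (1.0 + np.exp(-(0.5 + X[:, 0] - 0.5 * X[:, 1])))
d = rng.binomial(1, true_p)

scores = propensity_scores(X, d)
print(scores[:5])
```

Because the model includes an intercept, the average fitted score matches the observed share of treated units, which is a quick sanity check on the fit.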

There are two methods for propensity score estimation, `est_propensity_s` and `est_propensity`.

- `est_propensity` allows users to add the interaction or quadratic features.
- `est_propensity_s` automatically chooses the features based on a sequence of likelihood ratio tests.

In this step, we will use `est_propensity_s` to run the propensity score estimation.

```python
# Automated propensity score estimation
causal.est_propensity_s()

# Propensity model results
print(causal.propensity)
```

From the model results, we can see that the feature selection algorithm decided to include only the raw features, and to not include interaction or quadratic terms.

To get the propensity score, use `causal.propensity['fitted']`.

```python
# Propensity scores
causal.propensity['fitted']
```

Output

```
array([0.99295272, 0.99217314, 0.00156753, ..., 0.69143426, 0.99983862, 0.99943713])
```

### Step 5: Propensity Score Trimming

In step 5, we will talk about propensity score trimming.

Propensity score trimming improves the balance between the treatment group and the control group by dropping units with extreme propensity scores.

The rationale behind the propensity score trimming is that

- for the units with extremely high propensity scores of being in the treatment group, it’s hard to find reliably comparable units in the control group.
- similarly, for the units with extremely low propensity scores of being in the treatment group, it’s hard to find reliably comparable units in the treatment group.

By default, the `causalinference` package sets the cutoff value to 0.1 after the propensity score estimation. We can check the cutoff value by running `causal.cutoff`.

```python
# Check the default propensity score trimming cutoff value
causal.cutoff
```

Output

```
0.1
```

Running `causal.trim()` will remove all the units with propensity scores greater than 0.9 or less than 0.1.
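The trimming rule itself is just a filter on the propensity scores. The following sketch shows the rule on a small made-up array; `trim_mask` is a hypothetical helper, not a function from the `causalinference` package.

```python
import numpy as np

def trim_mask(pscore, cutoff=0.1):
    """Boolean mask of units whose propensity score lies in [cutoff, 1 - cutoff]."""
    if not 0 <= cutoff <= 0.5:
        raise ValueError("cutoff must be between 0 and 0.5")
    return (pscore >= cutoff) & (pscore <= 1 - cutoff)

# Made-up propensity scores for six units
pscore = np.array([0.02, 0.12, 0.35, 0.50, 0.88, 0.97])

keep = trim_mask(pscore, cutoff=0.1)
trimmed_scores = pscore[keep]

print(keep)
print(trimmed_scores)
```

The same mask would be applied to the outcome, treatment, and covariate arrays so that all of them keep the same units.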

Alternatively, we can use an automated optimal cutoff search procedure to find the cutoff value that minimizes the asymptotic sampling variance of the trimmed sample. Instead of running `causal.trim()`, we will run `causal.trim_s()`.

```python
# Trim using the optimal cutoff value
causal.trim_s()

# Check the optimal cutoff value
causal.cutoff
```

We can see that the optimal propensity score cutoff value is about 0.084.

```
0.0844995651762354
```
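For readers curious about how such a search could work, below is a sketch of one variance-minimizing cutoff rule based on the quantity `g = 1 / (e * (1 - e))`, where `e` is the propensity score. This is my understanding of the general approach and is an assumption, not a copy of the package's code; `select_cutoff` here is a hypothetical helper.

```python
import numpy as np

def select_cutoff(pscore):
    """Sketch of a variance-minimizing trimming cutoff (assumed rule, see text)."""
    g = 1.0 / (pscore * (1.0 - pscore))
    # If no unit has an extreme weight, no trimming is needed
    if g.max() <= 2 * g.mean():
        return 0.0
    sorted_g = np.sort(g)
    cumsum_g = np.cumsum(sorted_g)
    # For each candidate value, count and sum the units with g <= that value
    count_le = np.searchsorted(sorted_g, sorted_g, side="right")
    sum_le = cumsum_g[count_le - 1]
    # Keep the largest gamma with gamma <= 2 * mean(g | g <= gamma)
    gamma = sorted_g[sorted_g * count_le <= 2 * sum_le].max()
    # Solve 1 / (a * (1 - a)) = gamma for the smaller root a
    return 0.5 - np.sqrt(0.25 - 1.0 / gamma)

# Simulated scores, clipped away from exactly 0 and 1
rng = np.random.default_rng(1)
pscore = np.clip(rng.uniform(size=5000), 0.001, 0.999)

cutoff = select_cutoff(pscore)
kept = (pscore >= cutoff) & (pscore <= 1 - cutoff)

print(cutoff, kept.sum())
```

When every unit has a moderate score, for example all scores equal to 0.5, this rule returns a cutoff of 0 and nothing is trimmed.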

### Step 6: After-Trimming Difference

In step 6, we will check the difference between the treatment and the control group after the propensity score trimming.

`causal.summary_stats` prints out the summary statistics after trimming. The output shows that:

- The number of units in the control group decreased from 2,269 to 1,189, and the units in the treatment group decreased from 7,731 to 1,463.
- The raw difference between the treatment and the control group decreased from 16.132 to 11.02, which is much closer to the true treatment impact of 10.
- The standardized mean difference (SMD) for covariates between the treatment group and the control group decreased for every covariate.

```python
# Print summary statistics
print(causal.summary_stats)
```
