`CausalImpact` is an R package created by Google that estimates the impact of an intervention on a time series. In this tutorial, we will talk about how to tune the hyperparameters of the time series causal impact model using the R package `CausalImpact`.

**Resources for this post:**

- Click here for the Colab notebook
- More video tutorials on Causal Inference and Time Series
- More blog posts on Causal Inference and Time Series
- Video tutorial for this post on YouTube

Let’s get started!

### Step 1: Install and Import R Libraries

In step 1, we will install and import the R libraries. To learn how to use the statistical package R with Google Colab notebook, please check out my tutorial How to Use R with Google Colab Notebook.

Firstly, let’s install `CausalImpact` for time series causal analysis.

```r
# Install R version of causal impact
install.packages("CausalImpact")
```

After the installation is completed, we can import the library.

```r
# Import library
library(CausalImpact)
```

### Step 2: Create Dataset

In step 2, we will create a synthetic time-series dataset for the causal impact analysis. The ground truth of the causal impact is usually not available. The benefit of using a synthetic dataset is that we can validate the accuracy of the model results.

The `CausalImpact` package requires two types of time series:

- A response time series that is directly affected by the intervention.
- One or more control time series that are not impacted by the intervention.

The idea is to build a time series model to predict the counterfactual outcome. In other words, the model will use the control time series to predict what the response time series outcome would have been if there had been no intervention.

In this example, we created one response time series variable and two control time series variables.

- To make the dataset reproducible, a random seed is set at the beginning of the code.
- Then two autoregressive moving average (ARMA) processes are created.
- For the first control time series variable `X1`, the autoregressive (AR) part has two coefficients, 0.95 and 0.04, and the moving average (MA) part has two coefficients, 0.6 and 0.3. A constant value of 10 is added to the ARMA process.
- For the second control time series variable `X2`, the AR part has two coefficients, 0.85 and 0.01, and the MA part has two coefficients, 0.7 and 0.2. A constant value of 20 is added to the ARMA process.
- After setting the ARMA coefficients, 5000 samples are generated for `X1` and `X2`.
- The response time series variable `y` is a function of the control time series variables: it equals 10 times `X1` plus 2 times `X2`, plus a random noise term.
- The intervention starts at index 3001, and the true causal impact is 20.

```r
# Set up a seed for reproducibility
set.seed(42)

# Create the control ARMA time-series
X1 <- 10 + arima.sim(n = 5000, list(ar = c(0.95, 0.04), ma = c(0.6, 0.3)))
X2 <- 20 + arima.sim(n = 5000, list(ar = c(0.85, 0.01), ma = c(0.7, 0.2)))

# Create the response time-series
y <- 10*X1 + 2*X2 + rnorm(5000)

# Add the true causal impact
y[3001:5000] <- y[3001:5000] + 20
```

A time series usually has a time variable indicating the frequency of the data collected. We created 5000 dates beginning on January 1st, 2000 using the `seq.Date` function.

After that, a dataset is created with the control variables `X1` and `X2`, the response variable `y`, and the dates as the index.

```r
# Create dates
time.points <- seq.Date(as.Date("2000-01-01"), by = 1, length.out = 5000)

# Create dataset
df <- zoo(cbind(y, X1, X2), time.points)

# Take a look at the data
head(df)
```

Output:

```
                  y       X1       X2
2000-01-01 150.0829 11.37125 18.09003
2000-01-02 154.4319 11.75323 17.99653
2000-01-03 161.6506 12.22989 19.06100
2000-01-04 165.4685 12.27458 20.92759
2000-01-05 173.5110 12.66373 24.11447
2000-01-06 185.8874 13.39407 26.82574
```

### Step 3: Set Pre and Post Periods

In step 3, we will set the pre and the post intervention periods.

We can see that the time-series start date is `2000-01-01`, the time-series end date is `2013-09-08`, and the treatment start date is `2008-03-19`.

```r
# Print out the time series start date
df[1]

# Print out the intervention start date
df[3001]

# Print out the time series end date
df[5000]
```

Output:

```
                  y       X1       X2
2000-01-01 150.0829 11.37125 18.09003
                  y       X1       X2
2008-03-19 187.9989 13.31977 16.97817
                  y       X1       X2
2013-09-08 297.0325 23.79600 19.60739
```

Next, let’s visualize the time-series data.

```r
# Visualize the time series
plot(df, plot.type = "single", col = 1:ncol(df))
legend("bottomleft", colnames(df), col = 1:ncol(df), lty = 1, cex = .65)
```

In the chart, the red line is the first control time series, the green line is the second control time series, and the black line is the response time series.

The R `CausalImpact` package requires the pre and the post intervention periods to be specified.

In our dataset, the intervention starts on the 3001st day, so the pre-intervention period includes the first 3000 days, and the post-intervention period includes the last 2000 days. That corresponds to a start date of `2000-01-01` and an end date of `2008-03-18` for the pre-period, and a start date of `2008-03-19` and an end date of `2013-09-08` for the post-period.

```r
# Set pre-period
pre_period <- c(as.Date("2000-01-01"), as.Date("2008-03-18"))

# Set post-period
post_period <- c(as.Date("2008-03-19"), as.Date("2013-09-08"))
```

### Step 4: Causal Impact on Time Series with Default Hyperparameters

In step 4, we will execute the causal impact analysis on the time series using the default hyperparameters.

The causality analysis has two assumptions:

- Assumption 1: There are one or more control time series that are highly correlated with the response variable, but not impacted by the intervention. Violation of this assumption can result in wrong conclusions about the existence, the direction, or the magnitude of the treatment effect.
- Assumption 2: The correlation between the control and the response time series is the same for pre and post intervention.

The synthetic time series data we created satisfy the two assumptions.
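As a quick sanity check on Assumption 1 (this check is my addition, not part of the original tutorial), we can confirm that the control series are highly correlated with the response during the pre-intervention period:

```r
# Correlation between the response and each control series
# over the pre-intervention period (the first 3000 days)
cor(as.numeric(y[1:3000]), as.numeric(X1[1:3000]))
cor(as.numeric(y[1:3000]), as.numeric(X2[1:3000]))
```

Values close to 1 indicate that the controls are good predictors of the counterfactual response.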

The R `CausalImpact` package has a function called `CausalImpact` that implements a Bayesian structural time series (BSTS) model on the backend. It has three required inputs:

- `data` takes the dataset name.
- `pre.period` takes the starting and the ending index values for the pre-intervention period.
- `post.period` takes the starting and the ending index values for the post-intervention period.

After saving the output object in a variable called `impact_default`, we can run `summary(impact_default)` to get the results summary.

```r
# Causal impact model
impact_default <- CausalImpact(data = df, pre.period = pre_period, post.period = post_period)

# Causal impact summary
summary(impact_default)
```

The summary results tell us that:

- The absolute causal effect is 20, which is the same as the true causal impact.
- The 95% confidence interval for the absolute causal effect is -10 to 51.

To learn more about the `CausalImpact` function, please check out my tutorial Time Series Causal Impact Analysis in R.

```
Posterior inference {CausalImpact}

                         Average        Cumulative
Actual                   145            290492
Prediction (s.d.)        125 (15)       250164 (30543)
95% CI                   [95, 155]      [189354, 310902]

Absolute effect (s.d.)   20 (15)        40328 (30543)
95% CI                   [-10, 51]      [-20410, 101138]

Relative effect (s.d.)   16% (12%)      16% (12%)
95% CI                   [-8.2%, 40%]   [-8.2%, 40%]

Posterior tail-area probability p:   0.08571
Posterior prob. of a causal effect:  91%

For more details, type: summary(impact, "report")
```
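Besides `summary`, the package also provides a `plot` method for the fitted object. As a quick illustration (output not shown here), plotting the default model displays the observed versus predicted series along with the pointwise and cumulative effects:

```r
# Plot the original series vs. the counterfactual prediction,
# the pointwise causal effect, and the cumulative effect
plot(impact_default)
```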

### Step 5: Hyperparameter Tuning for niter

In step 5, we will talk about how to tune the hyperparameter `niter` for `CausalImpact`.

`niter` is the number of MCMC samples. Bayesian inference usually needs at least 1000 samples to get reasonable results. The default value for `niter` in the `CausalImpact` package is 1000, and we changed it to 3000.

```r
# Causal impact model
impact_niter <- CausalImpact(data = df, pre.period = pre_period, post.period = post_period,
                             model.args = list(niter = 3000))

# Causal impact summary
summary(impact_niter)
```

The results show an absolute effect of 20 and a confidence interval of -9.8 to 50, which is very similar to the results with the default value.

```
Posterior inference {CausalImpact}

                         Average         Cumulative
Actual                   145             290492
Prediction (s.d.)        125 (16)        250136 (31086)
95% CI                   [95, 155]       [190342, 310049]

Absolute effect (s.d.)   20 (16)         40356 (31086)
95% CI                   [-9.8, 5e+01]   [-19556.9, 1e+05]

Relative effect (s.d.)   16% (12%)       16% (12%)
95% CI                   [-7.8%, 40%]    [-7.8%, 40%]

Posterior tail-area probability p:   0.09604
Posterior prob. of a causal effect:  90%

For more details, type: summary(impact, "report")
```

### Step 6: Hyperparameter Tuning for standardize.data

In step 6, we will talk about how to tune the hyperparameter `standardize.data` for `CausalImpact`.

`standardize.data` is a boolean value indicating whether to standardize all the time series before fitting the model. The default value is `TRUE`.

In this example, we set `standardize.data = FALSE`.

```r
# Causal impact model
impact_sd <- CausalImpact(data = df, pre.period = pre_period, post.period = post_period,
                          model.args = list(standardize.data = FALSE))

# Causal impact summary
summary(impact_sd)
```

The results show an absolute effect of 20 and a confidence interval of -11 to 51. It has the same point estimate but a slightly wider confidence interval, indicating that standardizing the data works better for this dataset.

```
Posterior inference {CausalImpact}

                         Average        Cumulative
Actual                   145            290492
Prediction (s.d.)        125 (16)       250594 (31166)
95% CI                   [94, 156]      [188173, 312563]

Absolute effect (s.d.)   20 (16)        39898 (31166)
95% CI                   [-11, 51]      [-22071, 102318]

Relative effect (s.d.)   16% (12%)      16% (12%)
95% CI                   [-8.8%, 41%]   [-8.8%, 41%]

Posterior tail-area probability p:   0.09161
Posterior prob. of a causal effect:  91%

For more details, type: summary(impact, "report")
```

### Step 7: Hyperparameter Tuning for prior.level.sd

In step 7, we will talk about how to tune the hyperparameter `prior.level.sd` for `CausalImpact`.

`prior.level.sd` is the prior standard deviation of the Gaussian local level.

- The default value for `prior.level.sd` is 0.01. This default value usually works well for stable time series with low residual volatility after controlling for covariates.
- When there is uncertainty about the stability of the data, the `CausalImpact` package documentation suggests setting `prior.level.sd` to 0.1 as a safer option, but this option may lead to very wide prediction intervals.
- We can also leave the value unspecified by setting `prior.level.sd` to `NULL`.

```r
# Causal impact model
impact_prior_level_sd <- CausalImpact(data = df, pre.period = pre_period, post.period = post_period,
                                      model.args = list(prior.level.sd = 0.1))

# Causal impact summary
summary(impact_prior_level_sd)
```

The summary results for `prior.level.sd = 0.1` tell us that:

- The absolute causal effect is 19, which is slightly lower than the true causal impact of 20.
- The 95% confidence interval for the absolute causal effect is -131 to 167, which is much wider than the confidence interval from the default values.

The larger `prior.level.sd` produced worse results than the default value, probably because the true level standard deviation of the data is quite different from 0.1.

```
Posterior inference {CausalImpact}

                         Average        Cumulative
Actual                   145            290492
Prediction (s.d.)        126 (77)       252764 (153142)
95% CI                   [-22, 276]     [-43494, 552992]

Absolute effect (s.d.)   19 (77)        37728 (153142)
95% CI                   [-131, 167]    [-262500, 333986]

Relative effect (s.d.)   15% (61%)      15% (61%)
95% CI                   [-104%, 132%]  [-104%, 132%]

Posterior tail-area probability p:   0.39874
Posterior prob. of a causal effect:  60%

For more details, type: summary(impact, "report")
```

Next, let’s set `prior.level.sd` to `NULL`.

```r
# Causal impact model
impact_prior_level_sd <- CausalImpact(data = df, pre.period = pre_period, post.period = post_period,
                                      model.args = list(prior.level.sd = NULL))

# Causal impact summary
summary(impact_prior_level_sd)
```

The results for `prior.level.sd = NULL` show the same causal impact value and confidence interval as the default hyperparameter results.

```
Posterior inference {CausalImpact}

                         Average        Cumulative
Actual                   145            290492
Prediction (s.d.)        125 (15)       250164 (30573)
95% CI                   [95, 155]      [189317, 310608]

Absolute effect (s.d.)   20 (15)        40328 (30573)
95% CI                   [-10, 51]      [-20116, 101175]

Relative effect (s.d.)   16% (12%)      16% (12%)
95% CI                   [-8%, 40%]     [-8%, 40%]

Posterior tail-area probability p:   0.08462
Posterior prob. of a causal effect:  92%

For more details, type: summary(impact, "report")
```

### Step 8: Hyperparameter Tuning for nseasons

In step 8, we will talk about how to tune the hyperparameter `nseasons` for `CausalImpact`.

`nseasons` specifies the seasonal component of the model.

- The default value is 1, meaning that there is no seasonality in the time series data.
- Changing the value to a positive integer greater than 1 automatically includes the seasonal component. For example, `nseasons = 7` means that there is weekly seasonality.
- Currently the `CausalImpact` package only supports one seasonal component, but we can include multiple seasonal components by fitting a Bayesian structural time series (BSTS) model ourselves and passing the fitted model in as `bsts.model`.
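As a sketch of that workaround (assuming the `bsts` package and the variables created in Step 2; this code is my addition, not part of the original tutorial), we can mask the post-period response, fit a BSTS model with two seasonal components, and pass the fitted model to `CausalImpact`:

```r
library(bsts)

# Keep the observed post-period response, then mask it so the
# model is fit on the pre-intervention data only
post.period.response <- as.numeric(y[3001:5000])
y.masked <- y
y.masked[3001:5000] <- NA

# State specification: local level plus two seasonal components
ss <- AddLocalLevel(list(), y.masked)
ss <- AddSeasonal(ss, y.masked, nseasons = 7)                        # weekly
ss <- AddSeasonal(ss, y.masked, nseasons = 12, season.duration = 30) # ~monthly

# Fit the BSTS model with the control series as regressors
bsts.model <- bsts(y.masked ~ X1 + X2, state.specification = ss, niter = 1000)

# Pass the custom model to CausalImpact
impact_bsts <- CausalImpact(bsts.model = bsts.model,
                            post.period.response = post.period.response)
summary(impact_bsts)
```

When a custom `bsts.model` is supplied, `pre.period` and `post.period` are not passed; the periods are implied by which response values are `NA`.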

In this example, we will set the hyperparameter `nseasons = 7` to include weekly seasonality.

```r
# Causal impact model
impact_nseasons <- CausalImpact(data = df, pre.period = pre_period, post.period = post_period,
                                model.args = list(nseasons = 7))

# Causal impact summary
summary(impact_nseasons)
```

The summary results tell us that:

- The absolute causal effect is 21, which is slightly higher than the true causal impact of 20.
- The 95% confidence interval for the absolute causal effect is -8.4 to 48, which is narrower than the confidence interval from the default values.

Since the synthetic dataset does not include seasonality, there is no obvious improvement in the estimation results, but this hyperparameter should give better results for time series with weekly seasonality.

```
Posterior inference {CausalImpact}

                         Average        Cumulative
Actual                   145            290492
Prediction (s.d.)        125 (15)       249319 (29698)
95% CI                   [97, 154]      [194240, 307292]

Absolute effect (s.d.)   21 (15)        41173 (29698)
95% CI                   [-8.4, 48]     [-16800.3, 96251]

Relative effect (s.d.)   17% (12%)      17% (12%)
95% CI                   [-6.7%, 39%]   [-6.7%, 39%]

Posterior tail-area probability p:   0.08449
Posterior prob. of a causal effect:  92%

For more details, type: summary(impact, "report")
```

### Step 9: Hyperparameter Tuning for season.duration

In step 9, we will talk about how to tune the hyperparameter `season.duration` for `CausalImpact`.

`season.duration` specifies the number of data points in each season.

- The default value for `season.duration` is 1. For example, `nseasons = 7, season.duration = 1` means that the time series data has weekly seasonality and each data point represents one day.
- If we would like to include monthly seasonality, `nseasons` needs to be set to 12 and `season.duration` needs to be set to 30, indicating that every 30 days represent one month.

In this example, we will use `nseasons = 12, season.duration = 30` for the hyperparameters.

```r
# Causal impact model
impact_season_duration <- CausalImpact(data = df, pre.period = pre_period, post.period = post_period,
                                       model.args = list(nseasons = 12, season.duration = 30))

# Causal impact summary
summary(impact_season_duration)
```

The summary results tell us that:

- The absolute causal effect is 21, which is slightly higher than the true causal impact of 20.
- The 95% confidence interval for the absolute causal effect is -8.1 to 50, which is narrower than the confidence interval from the default values.

Since the synthetic dataset does not include seasonality, there is no obvious improvement in the estimation results, but this hyperparameter should give better results for time series with monthly seasonality.

```
Posterior inference {CausalImpact}

                         Average         Cumulative
Actual                   145             290492
Prediction (s.d.)        124 (15)        248451 (30281)
95% CI                   [95, 153]       [190160, 306782]

Absolute effect (s.d.)   21 (15)         42041 (30281)
95% CI                   [-8.1, 5e+01]   [-16290.5, 1e+05]

Relative effect (s.d.)   17% (12%)       17% (12%)
95% CI                   [-6.6%, 40%]    [-6.6%, 40%]

Posterior tail-area probability p:   0.08287
Posterior prob. of a causal effect:  92%

For more details, type: summary(impact, "report")
```

### Step 10: Hyperparameter Tuning for dynamic.regression

In step 10, we will talk about how to tune the hyperparameter `dynamic.regression` for `CausalImpact`.

`dynamic.regression` is a boolean value indicating whether to include time-varying regression coefficients.

Since combining time-varying regression coefficients with a time-varying local trend or local level often leads to over-specification, this hyperparameter defaults to `FALSE`.

We can manually change the value to `TRUE` if the data has local trends for certain time periods.

```r
# Causal impact model
impact_dynamic_regression <- CausalImpact(data = df, pre.period = pre_period, post.period = post_period,
                                          model.args = list(dynamic.regression = TRUE))

# Causal impact summary
summary(impact_dynamic_regression)
```

The summary results tell us that:

- The absolute causal effect is 20, which is the same as the true causal impact.
- The 95% confidence interval for the absolute causal effect is -1.6 to 43, which is much narrower than the confidence interval from the default values.

The results show that changing the `dynamic.regression` hyperparameter from `FALSE` to `TRUE` improved the model performance.

```
Posterior inference {CausalImpact}

                         Average        Cumulative
Actual                   145            290492
Prediction (s.d.)        126 (11)       251350 (22530)
95% CI                   [1e+02, 147]   [2e+05, 293638]

Absolute effect (s.d.)   20 (11)        39141 (22530)
95% CI                   [-1.6, 43]     [-3146.6, 85790]

Relative effect (s.d.)   16% (9%)       16% (9%)
95% CI                   [-1.3%, 34%]   [-1.3%, 34%]

Posterior tail-area probability p:   0.03963
Posterior prob. of a causal effect:  96.037%

For more details, type: summary(impact, "report")
```

For more information about data science and machine learning, please check out my YouTube channel and Medium Page, or follow me on LinkedIn.

### Recommended Tutorials

- GrabNGoInfo Machine Learning Tutorials Inventory
- Time Series Causal Impact Analysis in R
- Time Series Causal Impact Analysis in Python
- How to Use R with Google Colab Notebook
- 3 Ways for Multiple Time Series Forecasting Using Prophet in Python
- Four Oversampling And Under-Sampling Methods For Imbalanced Classification Using Python
- Multivariate Time Series Forecasting with Seasonality and Holiday Effect Using Prophet in Python
- Time Series Anomaly Detection Using Prophet in Python
- Autoencoder For Anomaly Detection Using Tensorflow Keras
- Databricks Mount To AWS S3 And Import Data
- Hyperparameter Tuning For XGBoost
- One-Class SVM For Anomaly Detection
- Sentiment Analysis Without Modeling: TextBlob vs. VADER vs. Flair
- Recommendation System: User-Based Collaborative Filtering
- How to detect outliers | Data Science Interview Questions and Answers
- Causal Inference One-to-one Matching on Confounders Using R for Python Users
- Gaussian Mixture Model (GMM) for Anomaly Detection

### References

- Brodersen et al., “Inferring causal impact using Bayesian structural time-series models,” Annals of Applied Statistics (2015). CausalImpact 1.2.1: http://google.github.io/CausalImpact/
- R documentation for `arima.sim` (simulation from ARIMA models)