Hyperparameter Tuning for Time Series Causal Impact Analysis in R Hyperparameter tuning for Google's R package CausalImpact on time series intervention with Bayesian Structural Time Series Model (BSTS)

Hyperparameter Tuning for Time Series Causal Impact Analysis in R

CausalImpact package created by Google estimates the impact of an intervention on a time series. In this tutorial, we will talk about how to tune the hyperparameters of the time series causal impact model using the R package CausalImpact.

Resources for this post:

Hyperparameter Tuning for Time Series Causal Impact Analysis in R – GrabNGoInfo.com

Let’s get started!

Step 1: Install and Import R Libraries

In step 1, we will install and import the R libraries. To learn how to use the statistical package R with Google Colab notebook, please check out my tutorial How to Use R with Google Colab Notebook.

Firstly, let’s install CausalImpact for time series causal analysis.

# Install R version of causal impact
install.packages("CausalImpact")

After the installation is completed, we can import the library.

# Import library
library(CausalImpact)

Step 2:Create Dataset

In step 2, we will create a synthetic time-series dataset for the causal impact analysis. The ground truth of the causal impact is usually not available. The benefit of using a synthetic dataset is that we can validate the accuracy of the model results.

The CausalImpact package requires two types of time series:

  • A response time series that is directly affected by the intervention.
  • And one or more control time series that are not impacted by the intervention.

The idea is to build a time series model to predict the counterfactual outcome. In other words, the model will use the control time series to predict what the response time series outcome would have been if there had been no intervention.

In this example, we created one response time series variable and two control time series variables.

  • To make the dataset reproducible, a random seed is set at the beginning of the code.
  • Then an autoregressive moving average (ARMA) process is created.
    • For the first control time series variable X1, the autoregressive (AR) part has two coefficients 0.95 and 0.04, and the moving average (MA) part has two coefficients 0.6 and 0.3. A constant value of 10 is added to the autoregressive moving average (ARMA) process.
    • For the second control time series variable X2, the autoregressive (AR) part has two coefficients 0.85 and 0.01, and the moving average (MA) part has two coefficients 0.7 and 0.2. A constant value of 20 is added to the autoregressive moving average (ARMA) process.
  • After setting the autoregressive moving average (ARMA) coefficients, 5000 samples are generated for X1 and X2.
  • The response time series variable y is a function of the control time series variables. It equals 10 times X1 plus 2 times X2, then plus a random value.
  • The intervention happens at the index of 3000, and the true causal impact is 20.
# Set up a seed for reproducibility
set.seed(42)

# Create the control ARMA time-series
X1 <- 10 + arima.sim(n = 5000, list(ar = c(0.95, 0.04), ma = c(0.6, 0.3)))
X2 <- 20 + arima.sim(n = 5000, list(ar = c(0.85, 0.01), ma = c(0.7, 0.2)))

# Create the response time-series
y <- 10*X1 + 2*X2 + rnorm(5000)

# Add the true causal impact
y[3001:5000] <- y[3001:5000] + 20

A time series usually has a time variable indicating the frequency of the data collected. We created 5000 dates beginning on January 1st, 2000 using the seq.Date function.

After that, a dataset is created with the control variables X1X2, the response varialbe y, and the dates as the index.

# Create dates
time.points <- seq.Date(as.Date("2000-01-01"), by = 1, length.out = 5000)

# Create dataset
df <- zoo(cbind(y, X1, X2), time.points)

# Take a look at the data
head(df)

Output:

                  y       X1       X2
2000-01-01 150.0829 11.37125 18.09003
2000-01-02 154.4319 11.75323 17.99653
2000-01-03 161.6506 12.22989 19.06100
2000-01-04 165.4685 12.27458 20.92759
2000-01-05 173.5110 12.66373 24.11447
2000-01-06 185.8874 13.39407 26.82574

Step 3: Set Pre and Post Periods

In step 3, we will set the pre and the post intervention periods.

We can see that The time-series start date is 2000-01-01, the time-series end date is 2013-09-08, and the treatment start date is 2008-03-19.

# Print out the time series start date
df[1]

# Print out the intervention start date
df[3001]

# Print out the time series end date
df[5000]

Output:

                  y       X1       X2
2000-01-01 150.0829 11.37125 18.09003
                  y       X1       X2
2008-03-19 187.9989 13.31977 16.97817
                  y     X1       X2
2013-09-08 297.0325 23.796 19.60739

Next, let’s visualize the time-series data.

# Visualize the time series
plot(df, plot.type="single", col = 1:ncol(df))
legend("bottomleft", colnames(df), col=1:ncol(df), lty=1, cex=.65)
Causal inference time series data visualization – GrabNGoInfo.com

In the chart, the red line is the first control time series, the green line is the second control time series, and the black line is the response time series.

The R CausalImpact package requires the pre and the post intervention periods to be specified.

In our dataset, the intervention starts on the 3001st day, so the pre-intervention time period includes the first 3000 days, and the post-intervention time period includes the last 2000 days. That corresponds to the start date of 2000-01-01 and end date of 2008-03-18 for the pre-period, and the start date of 2008-03-19 and end date of 2013-09-08 for the post-period.

# Set pre-period
pre_period <- c(as.Date("2000-01-01"), as.Date("2008-03-18"))

# Set post-period
post_period <- c(as.Date("2008-03-19"), as.Date("2013-09-08")) 

Step 4: Causal Impact on Time Series with Default Hyperparameters

In step 4, we will execute the causal impact analysis on the time series using the default hyperparameters.

The causality analysis has two assumptions:

  • Assumption 1: There are one or more control time series that are highly correlated with the response variable, but not impacted by the intervention. Violation of this assumption can result in wrong conclusions about the existence, the direction, or the magnitude of the treatment effect.
  • Assumption 2: The correlation between the control and the response time series is the same for pre and post intervention.

The synthetic time series data we created satisfy the two assumptions.

The R CausalImpact package has a function called CausalImpact that implements a Bayesian Structural Time Series Model (BSTS) on the backend. It has three required inputs:

  • data takes the dataset name.
  • pre.period takes the starting and the ending index values for the pre-intervention period.
  • post.period takes the starting and the ending index values for the post-intervention period.

After saving the output object in a variable called impact_default, we can run summary(impact_default) to get the results summary.

# Causal impact model
impact_default <- CausalImpact(data=df, pre.period=pre_period, post.period=post_period)

# Causal impact summary
summary(impact_default)

The summary results tell us that:

  • The absolute causal effect is 20, which is the same as the true causal impact.
  • The 95% confidence interval for the absolute causal effect is -10 to 51.

To learn more about the CausalImpact function, please check out my tutorial Time Series Causal Impact Analysis in R.

Posterior inference {CausalImpact}

                         Average        Cumulative      
Actual                   145            290492          
Prediction (s.d.)        125 (15)       250164 (30543)  
95% CI                   [95, 155]      [189354, 310902]
                                                        
Absolute effect (s.d.)   20 (15)        40328 (30543)   
95% CI                   [-10, 51]      [-20410, 101138]
                                                        
Relative effect (s.d.)   16% (12%)      16% (12%)       
95% CI                   [-8.2%, 40%]   [-8.2%, 40%]    

Posterior tail-area probability p:   0.08571
Posterior prob. of a causal effect:  91%

For more details, type: summary(impact, "report")

Step 5: Hyperparameter Tuning for niter

In step 5, we will talk about how to tune the hyperparameter niter for CausalImpact.

niter is the number of MCMC samples. Bayesian inference usually needs at least 1000 samples to get reasonable results. The default value for niter in the CausalImpact package is 1000, and we changed it to 3000.

# Causal impact model
impact_niter <- CausalImpact(data=df, pre.period=pre_period, post.period=post_period, model.args = list(niter = 3000))

# Causal impact summary
summary(impact_niter)

The results show the absolute effect of 20 and the confidence interval of -9.8 to 50, which is very similar to the default value.

Posterior inference {CausalImpact}

                         Average         Cumulative       
Actual                   145             290492           
Prediction (s.d.)        125 (16)        250136 (31086)   
95% CI                   [95, 155]       [190342, 310049] 
                                                          
Absolute effect (s.d.)   20 (16)         40356 (31086)    
95% CI                   [-9.8, 5e+01]   [-19556.9, 1e+05]
                                                          
Relative effect (s.d.)   16% (12%)       16% (12%)        
95% CI                   [-7.8%, 40%]    [-7.8%, 40%]     

Posterior tail-area probability p:   0.09604
Posterior prob. of a causal effect:  90%

For more details, type: summary(impact, "report")

Step 6: Hyperparameter Tuning for standardize.data

In step 6, we will talk about how to tune the hyperparameter standardize.data for CausalImpact.

standardize.data is a boolean value indicating whether to standardize all the time series before fitting the model. The default value is TRUE.

In this example, we set the standardize.data = FALSE.

# Causal impact model
impact_sd <- CausalImpact(data=df, pre.period=pre_period, post.period=post_period, model.args = list(standardize.data = FALSE))

# Causal impact summary
summary(impact_sd)

The results show an absolute effect of 20 and a confidence interval of -11 to 51. It has the same point estimation but a wider confidence interval, indicating that standardizing the data provides better results.

Posterior inference {CausalImpact}

                         Average        Cumulative      
Actual                   145            290492          
Prediction (s.d.)        125 (16)       250594 (31166)  
95% CI                   [94, 156]      [188173, 312563]
                                                        
Absolute effect (s.d.)   20 (16)        39898 (31166)   
95% CI                   [-11, 51]      [-22071, 102318]
                                                        
Relative effect (s.d.)   16% (12%)      16% (12%)       
95% CI                   [-8.8%, 41%]   [-8.8%, 41%]    

Posterior tail-area probability p:   0.09161
Posterior prob. of a causal effect:  91%

For more details, type: summary(impact, "report")

Step 7: Hyperparameter Tuning for prior.level.sd

In step 7, we will talk about how to tune the hyperparameter prior.level.sd for CausalImpact.

prior.level.sd is the prior local level Gaussian standard deviation.

  • The default value for prior.level.sd is 0.01. This default value usually works well for stable time series with low residual volatility after controlling covariates.
  • When there is uncertainty about the stability of the data, the CausalImpact package documentation suggests setting prior.level.sd to 0.1 as a safer option. But this option may lead to very wide prediction intervals.
  • We can also not specify a value and set prior.level.sd to NULL.
# Causal impact model
impact_prior_level_sd <- CausalImpact(data=df, pre.period=pre_period, post.period=post_period, model.args = list(prior.level.sd = 0.1))

# Causal impact summary
summary(impact_prior_level_sd)

The summary results for prior.level.sd=0.1 tell us that:

  • The absolute causal effect is 19, which is slightly lower than the true causal impact of 20.
  • The 95% confidence interval for the absolute causal effect is -131 to 167, which is much wider than the confidence interval from the default values.

The larger prior.level.sd produced worse results than the default value probably because true prior.level.sd is quite different from 0.1.

Posterior inference {CausalImpact}

                         Average         Cumulative       
Actual                   145             290492           
Prediction (s.d.)        126 (77)        252764 (153142)  
95% CI                   [-22, 276]      [-43494, 552992] 
                                                          
Absolute effect (s.d.)   19 (77)         37728 (153142)   
95% CI                   [-131, 167]     [-262500, 333986]
                                                          
Relative effect (s.d.)   15% (61%)       15% (61%)        
95% CI                   [-104%, 132%]   [-104%, 132%]    

Posterior tail-area probability p:   0.39874
Posterior prob. of a causal effect:  60%

For more details, type: summary(impact, "report")

Next, let’s set prior.level.sd to NULL.

# Causal impact model
impact_prior_level_sd <- CausalImpact(data=df, pre.period=pre_period, post.period=post_period, model.args = list(prior.level.sd = NULL))

# Causal impact summary
summary(impact_prior_level_sd)

The results for prior.level.sd=NULL show the same causal impact value and confidence interval as the default hyperparameter results.

Posterior inference {CausalImpact}

                         Average      Cumulative      
Actual                   145          290492          
Prediction (s.d.)        125 (15)     250164 (30573)  
95% CI                   [95, 155]    [189317, 310608]
                                                      
Absolute effect (s.d.)   20 (15)      40328 (30573)   
95% CI                   [-10, 51]    [-20116, 101175]
                                                      
Relative effect (s.d.)   16% (12%)    16% (12%)       
95% CI                   [-8%, 40%]   [-8%, 40%]      

Posterior tail-area probability p:   0.08462
Posterior prob. of a causal effect:  92%

For more details, type: summary(impact, "report")

Step 8: Hyperparameter Tuning for seasons

In step 8, we will talk about how to tune the hyperparameter nseasons for CausalImpact.

nseasons specifies the seasonal components of the model.

  • The default value is 1, meaning that there is no seasonality in the time series data.
  • Changing the value to a positive integer greater than 1 automatically includes the seasonal component. For example, nseasons=7 means that there is weekly seasonality.
  • Currently the CausalImpact package only supports one seasonal component, but we can include multiple seasonal components using the Bayesian Structural Time Series (BSTS) model, and pass the fitted model in as bsts.model.

In this example, we will set the hyperparameter nseasons=7 to include weekly seasonality.

# Causal impact model
impact_nseasons <- CausalImpact(data=df, pre.period=pre_period, post.period=post_period, model.args = list(nseasons = 7))

# Causal impact summary
summary(impact_nseasons)

The summary results tell us that:

  • The absolute causal effect is 21, which is slightly higher than the true causal impact of 20.
  • The 95% confidence interval for the absolute causal effect is -8.4 to 48, which is narrower than the confidence interval from the default values.

Since the synthetic dataset does not include seasonality, there is no obvious improvement in the estimation results. But this hyperparameter should give better results for the time series with weekly seasonalities.

Posterior inference {CausalImpact}

                         Average        Cumulative       
Actual                   145            290492           
Prediction (s.d.)        125 (15)       249319 (29698)   
95% CI                   [97, 154]      [194240, 307292] 
                                                         
Absolute effect (s.d.)   21 (15)        41173 (29698)    
95% CI                   [-8.4, 48]     [-16800.3, 96251]
                                                         
Relative effect (s.d.)   17% (12%)      17% (12%)        
95% CI                   [-6.7%, 39%]   [-6.7%, 39%]     

Posterior tail-area probability p:   0.08449
Posterior prob. of a causal effect:  92%

For more details, type: summary(impact, "report")

Step 9: Hyperparameter Tuning for season.duration

In step 9, we will talk about how to tune the hyperparameter season.duration for CausalImpact.

season.duration specifies the number of data points in each season.

  • The default value for season.duration is 1. For example, nseasons=7, season.duration=1 means that the time series data has weekly seasonality and each data point represents one day.
  • If we would like to include the monthly seasonality, nseasons needs to be set to 12 and season.duration needs to be set to 30, indicating that every 30 days represent one month.

In this example, we will use nseasons=12, season.duration=30 for the hyperparameters.

# Causal impact model
impact_season_duration <- CausalImpact(data=df, pre.period=pre_period, post.period=post_period, model.args = list(nseasons=12, season.duration=30))

# Causal impact summary
summary(impact_season_duration)

The summary results tell us that:

  • The absolute causal effect is 21, which is slightly higher than the true causal impact of 20.
  • The 95% confidence interval for the absolute causal effect is -8.1 to 50, which is narrower than the confidence interval from the default values.

Since the synthetic dataset does not include seasonality, there is no obvious improvement in the estimation results. But this hyperparameter should give better results for the time series with monthly seasonalities.

Posterior inference {CausalImpact}

                         Average         Cumulative       
Actual                   145             290492           
Prediction (s.d.)        124 (15)        248451 (30281)   
95% CI                   [95, 153]       [190160, 306782] 
                                                          
Absolute effect (s.d.)   21 (15)         42041 (30281)    
95% CI                   [-8.1, 5e+01]   [-16290.5, 1e+05]
                                                          
Relative effect (s.d.)   17% (12%)       17% (12%)        
95% CI                   [-6.6%, 40%]    [-6.6%, 40%]     

Posterior tail-area probability p:   0.08287
Posterior prob. of a causal effect:  92%

For more details, type: summary(impact, "report")

Step 10: Hyperparameter Tuning for dynamic.regression

In step 10, we will talk about how to tune the hyperparameter dynamic.regression for CausalImpact.

dynamic.regression is a boolean value indicating whether to include time-varying regression coefficients.

Since including a time-varying local trend or a time-varying local level often leads to over-specification, this hyperparameter defaults to FALSE.

We can manually change the value to TRUE if the data has local trends for certain time periods.

# Causal impact model
impact_dynamic_regression <- CausalImpact(data=df, pre.period=pre_period, post.period=post_period, model.args = list(dynamic.regression=TRUE))

# Causal impact summary
summary(impact_dynamic_regression)

The summary results tell us that:

  • The absolute causal effect is 20, which is the same as the true causal impact.
  • The 95% confidence interval for the absolute causal effect is -1.6 to 43, which is much narrower than the confidence interval from the default values.

The results show that changing the dynamic.regression hyperparameter from FALSE to TRUE improved the model performance.

Posterior inference {CausalImpact}

                         Average        Cumulative      
Actual                   145            290492          
Prediction (s.d.)        126 (11)       251350 (22530)  
95% CI                   [1e+02, 147]   [2e+05, 293638] 
                                                        
Absolute effect (s.d.)   20 (11)        39141 (22530)   
95% CI                   [-1.6, 43]     [-3146.6, 85790]
                                                        
Relative effect (s.d.)   16% (9%)       16% (9%)        
95% CI                   [-1.3%, 34%]   [-1.3%, 34%]    

Posterior tail-area probability p:   0.03963
Posterior prob. of a causal effect:  96.037%

For more details, type: summary(impact, "report")

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.

Recommended Tutorials

References

Leave a Comment

Your email address will not be published. Required fields are marked *