Welcome to GrabNGoInfo! Correlation vs. causation is one of the most commonly asked data science interview questions. In this tutorial, you will learn:

- What are the general strategies for answering the question about correlation and causation?
- How to answer the question in a clear and concise way?
- How to answer the follow-up question of how to calculate correlation?
- How to answer the follow-up question of how to measure causal impact?

Let’s get started!

### Correlation vs. Causation Data Science Interview Question

The correlation vs. causation question is usually asked with a specific example. For instance, the interviewer may ask:

*“If we collect data for monthly ice cream sales and monthly shark attacks around the United States each year, we would find that the two variables are highly correlated. Does this mean that consuming ice cream causes shark attacks?”* [1]

### General Strategies

We will answer the question in three steps.

- In the first step, provide the **definitions** for correlation and causality.
- In the second step, talk about the **value range** for correlation and causality, and how to interpret the values.
- In the third step, list the **algorithms** for calculating correlation and for causal inference.

### Concise Answer

Ice cream sales and shark attacks are highly correlated, but this does not mean consuming ice cream causes shark attacks. This is because correlation does not imply causation.

- Correlation is a measure of the direction and strength of the association between two variables [1]. It ranges from -1 to 1: values close to 1 indicate a strong positive correlation, values close to -1 indicate a strong negative correlation, and values close to 0 indicate little or no correlation.
- Causation means that a change in one variable produces a change in another variable. Unlike correlation, a causal impact has no fixed range, so it can be very large or very small, and its magnitude has to be judged against the scale of the outcome variable. For example, a causal impact of $1,000 is large for a monthly salary but small for a house price.

In this example, ice cream sales and shark attacks are correlated, but neither causes the other. Both are driven by a confounding factor such as temperature: in warmer months, more people buy ice cream and more people swim in the ocean.
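To make the confounding argument concrete, here is a minimal simulation sketch. All numbers are made up for illustration: a hypothetical `temperature` variable drives both `ice_cream` and `shark_attacks`, and neither series affects the other. The helper functions `pearson` and `residuals` are written here for the example and are not from any library.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def residuals(y, x):
    """What is left of y after a simple least-squares regression on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: temperature influences both outcomes independently.
temperature = [rng.uniform(0, 35) for _ in range(2000)]
ice_cream = [50 + 10 * t + rng.gauss(0, 40) for t in temperature]
shark_attacks = [1 + 0.2 * t + rng.gauss(0, 1) for t in temperature]

# Raw correlation is high even though there is no causal link.
raw_corr = pearson(ice_cream, shark_attacks)

# Partial correlation: correlate the parts not explained by temperature.
partial_corr = pearson(residuals(ice_cream, temperature),
                       residuals(shark_attacks, temperature))

print(f"raw correlation:     {raw_corr:.2f}")
print(f"partial correlation: {partial_corr:.2f}")
```

In this simulated setup, the raw correlation is strong while the correlation after removing the temperature effect is close to zero, which is the pattern we would expect when a confounder explains the association.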

### Follow-up Question 1: How to calculate correlation?

Two commonly used correlation measures are the Pearson correlation and the Spearman correlation.

- The Pearson correlation measures the linear relationship between two continuous variables. It tells us whether a change in one variable comes with a proportional change in the other. The Pearson correlation is calculated from the raw values of the two variables.
- The Spearman correlation measures the monotonic relationship between two continuous or ordinal variables [3]. It tells us whether the other variable tends to change in a consistent direction when one variable changes, though not necessarily in proportion. The Spearman correlation is calculated from the ranks of the two variables.
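The difference between the two measures can be sketched with a small example. The code below uses only the standard library and hand-written helpers (`pearson`, `average_ranks`, and `spearman` are names chosen for this illustration); in practice, functions such as `scipy.stats.pearsonr` and `scipy.stats.spearmanr` are the usual choice. The data is a made-up monotonic but non-linear relationship, `y = x ** 3`.

```python
def pearson(x, y):
    """Pearson correlation computed from the raw values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: y always increases with x, but not linearly.
x = list(range(1, 11))
y = [v ** 3 for v in x]

pearson_r = pearson(x, y)
spearman_r = spearman(x, y)

print(f"Pearson:  {pearson_r:.3f}")   # below 1: the relationship is not linear
print(f"Spearman: {spearman_r:.3f}")  # 1.000: the relationship is perfectly monotonic
```

Because `y` increases whenever `x` increases, the ranks line up perfectly and the Spearman correlation is 1, while the Pearson correlation falls short of 1 because the points do not lie on a straight line.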

### Follow-up Question 2: How to measure causal impact?

The causal impact can be evaluated by randomized experiments or observational studies.

- A randomized experiment randomly assigns the samples to a treatment group and a control group. The causal impact can then be calculated as the difference in the average outcome between the treatment group and the control group.
- When only observational data is available, we can use causal inference algorithms to estimate the causal impact. Such algorithms include, but are not limited to, difference-in-difference, Propensity Score Matching (PSM), Inverse Probability of Treatment Weighting (IPTW), and counterfactual modeling.
  - Difference-in-difference compares the outcomes of the treatment and control groups over time, and checks whether the difference between the groups changes after the treatment intervention.
  - Propensity Score Matching (PSM) constructs a quasi-experiment by matching the samples with and without treatment using propensity scores. The causal impact is calculated using the samples after matching.
  - IPTW uses the inverse probability of receiving treatment as the weight to account for the sample imbalance when calculating the causal impact.
  - Counterfactual modeling is also called potential outcome modeling. It uses a model to predict what would have happened without the treatment, and calculates the causal impact as the difference between the actual results and the counterfactual estimate.
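Two of the estimators above can be sketched on small, made-up datasets. The code below is an illustration under stated assumptions, not a production implementation: the group averages, the binary confounder `z`, and the helper names (`make_rows`, `weighted_mean`) are all hypothetical. The propensity score is estimated as the share of treated units within each level of `z`; with real data it would typically come from a model such as logistic regression. The naive difference in means is included for comparison, since that is the estimate that would be valid under random assignment.

```python
from collections import defaultdict
from statistics import mean

# --- Difference-in-difference on hypothetical average outcomes ---
avg_outcome = {
    ("treated", "pre"): 120.0, ("treated", "post"): 145.0,
    ("control", "pre"): 100.0, ("control", "post"): 110.0,
}
change_treated = avg_outcome[("treated", "post")] - avg_outcome[("treated", "pre")]
change_control = avg_outcome[("control", "post")] - avg_outcome[("control", "pre")]
did_estimate = change_treated - change_control  # 25 - 10 = 15

# --- IPTW on hypothetical observational data ---
TRUE_EFFECT = 5.0  # assumed effect used to generate the data

def make_rows(z, t, n):
    """n identical units with confounder z, treatment t, and outcome y."""
    return [{"z": z, "t": t, "y": 10.0 + 20.0 * z + TRUE_EFFECT * t}
            for _ in range(n)]

# Units with z = 1 are treated more often and also have higher outcomes.
rows = (make_rows(0, 1, 20) + make_rows(0, 0, 60)
        + make_rows(1, 1, 60) + make_rows(1, 0, 20))

# Naive difference in means ignores the confounder and is biased here.
naive_estimate = (mean(r["y"] for r in rows if r["t"] == 1)
                  - mean(r["y"] for r in rows if r["t"] == 0))

# Propensity score: probability of treatment given z (share treated per stratum).
counts = defaultdict(lambda: [0, 0])  # z -> [number treated, number of units]
for r in rows:
    counts[r["z"]][0] += r["t"]
    counts[r["z"]][1] += 1
propensity = {z: n_treated / n for z, (n_treated, n) in counts.items()}

def weighted_mean(group):
    """Inverse-probability-weighted mean outcome for one treatment group."""
    numerator = denominator = 0.0
    for r in rows:
        if r["t"] != group:
            continue
        p = propensity[r["z"]]
        weight = 1 / p if group == 1 else 1 / (1 - p)
        numerator += weight * r["y"]
        denominator += weight
    return numerator / denominator

iptw_estimate = weighted_mean(1) - weighted_mean(0)

print(f"difference-in-difference: {did_estimate:.1f}")
print(f"naive difference in means: {naive_estimate:.1f}")
print(f"IPTW estimate:             {iptw_estimate:.1f}")
```

In this constructed example the naive comparison overstates the effect because treated units are concentrated where outcomes are already higher, while the weighting rebalances the two groups and recovers the effect that was built into the data.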

### References

[1] Correlation Does Not Imply Causation: 5 Real-World Examples

[3] A comparison of the Pearson and Spearman correlation methods