The t-test is one of the most commonly asked topics in data science interviews. In this tutorial, we will talk about how to answer two interview questions for a data scientist position:

- What is a t-test?
- What are the assumptions for a t-test?

Let's get started!

## T-test Definition

When answering the question about the t-test definition, we would like to cover what a t-test is, the goal of a t-test, the types of a t-test, and how to interpret t-test results.

- A t-test is a statistical test to compare the means of two sample groups.
- The goal of the t-test is to determine whether there is a statistically significant difference between the means of the two populations that the two sample groups were drawn from.
- In terms of the alternative hypothesis, there are two types of t-tests:
  - A two-sided t-test is used to decide if the two populations have different means.
  - A one-sided t-test is used to decide if one population's mean is larger or smaller than the other population's mean.
- The t-test results are interpreted through the p-value. A significance threshold of 0.05 is commonly used: when the p-value falls below it, we reject the null hypothesis of equal means.
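To make the two-sided versus one-sided distinction concrete, here is a minimal sketch using `scipy.stats.ttest_ind`. The data is simulated, the group names `control` and `treatment` are hypothetical, and the `alternative` argument assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical sales amounts for two independent groups (simulated data).
control = rng.normal(loc=100.0, scale=15.0, size=200)
treatment = rng.normal(loc=105.0, scale=15.0, size=200)

# Two-sided: the alternative hypothesis is that the population means differ.
two_sided = stats.ttest_ind(treatment, control, alternative="two-sided")

# One-sided: the alternative hypothesis is that the treatment mean is larger.
one_sided = stats.ttest_ind(treatment, control, alternative="greater")

alpha = 0.05  # conventional significance threshold
print(f"two-sided: t={two_sided.statistic:.3f}, p={two_sided.pvalue:.4f}")
print(f"one-sided: t={one_sided.statistic:.3f}, p={one_sided.pvalue:.4f}")
print("reject H0 of equal means (two-sided):", two_sided.pvalue < alpha)
```

Both calls produce the same t statistic; only the p-value changes, because the one-sided test looks at a single tail of the t distribution.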

## Follow-up Questions: T-test Assumptions

When answering the question of t-test assumptions, we will talk about what the assumptions are, how to check if the assumptions are valid, and what to do if the assumptions are violated.

**Randomness**: The data needs to be randomly sampled from the population.

- This assumption can be checked by examining the design of the experiment and the sampling strategy.
- If the randomness assumption is violated, we need to consider redesigning the experiment and resampling the data.

**Independence**: The data points should be independent, both within each group and between groups.

- The independence assumption can be checked by examining the design of the experiment and the sampling strategy.
- If two data points are collected from the same subject to make comparisons, those data points are dependent, and a paired t-test should be used instead.
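As an illustration of the paired case, the sketch below uses `scipy.stats.ttest_rel` on simulated before/after measurements from the same subjects. The variable names and numbers are hypothetical. It also shows that the paired t-test is the same as a one-sample t-test on the per-subject differences.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical measurements on the same 30 subjects, before and after a change.
before = rng.normal(loc=70.0, scale=10.0, size=30)
after = before + rng.normal(loc=2.0, scale=3.0, size=30)

# Paired t-test: accounts for the dependence between the two measurements.
paired = stats.ttest_rel(after, before)

# Equivalent formulation: one-sample t-test of the differences against zero.
differences = after - before
one_sample = stats.ttest_1samp(differences, popmean=0.0)

print(f"paired:     t={paired.statistic:.3f}, p={paired.pvalue:.4f}")
print(f"one-sample: t={one_sample.statistic:.3f}, p={one_sample.pvalue:.4f}")
```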

**Normality**: The t-test assumes that both populations are normally distributed.

- The normality of the data can be checked using visualizations such as a histogram or a QQ plot. It can also be checked by statistical tests such as the Shapiro-Wilk test, D’Agostino’s K-squared test, or the Kolmogorov-Smirnov test.
- When the normality assumption is violated, try a data transformation such as logarithm or square root, and check the normality of the transformed dataset.
- If the dataset is still not normally distributed after transformation, we can use the Wilcoxon rank-sum test (also known as the Mann-Whitney U test), a non-parametric test that does not assume a normal data distribution.
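The workflow above can be sketched as follows, assuming two independent groups of strictly positive values. The helper `looks_normal` and the simulated data are hypothetical, and the Shapiro-Wilk test stands in for any of the normality tests listed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical right-skewed, strictly positive data (simulated).
group_a = rng.lognormal(mean=3.0, sigma=0.8, size=150)
group_b = rng.lognormal(mean=3.2, sigma=0.8, size=150)


def looks_normal(values, alpha=0.05):
    """Return True if the Shapiro-Wilk test does not reject normality."""
    _, p_value = stats.shapiro(values)
    return p_value >= alpha


if looks_normal(group_a) and looks_normal(group_b):
    method = "t-test on raw data"
    result = stats.ttest_ind(group_a, group_b)
elif looks_normal(np.log(group_a)) and looks_normal(np.log(group_b)):
    # Note: this compares the means of the log-values, not the raw means.
    method = "t-test on log-transformed data"
    result = stats.ttest_ind(np.log(group_a), np.log(group_b))
else:
    # Non-parametric fallback for two independent samples.
    method = "Wilcoxon rank-sum (Mann-Whitney U) test"
    result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(method, f"p={result.pvalue:.4f}")
```

With large samples, normality tests tend to flag even small departures from normality, so it is reasonable to look at a histogram or QQ plot alongside the test result.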

**Equal Variance**: The population variances are the same.

- Equal variance can be checked using visualizations such as a boxplot or a histogram. It can also be checked by statistical tests such as Levene's test or the F-test.
- The Welch test should be used when the equal variance assumption is violated and the normality assumption is satisfied.
- When both the equal variance and the normality assumptions are violated, the non-parametric Wilcoxon rank-sum test (Mann-Whitney U test) should be used.
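Here is a minimal sketch of checking equal variance and switching to the Welch test, using `scipy.stats.levene` and the `equal_var` argument of `scipy.stats.ttest_ind`. The data is simulated and the 0.05 cutoff for Levene's test is an assumption for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical groups with clearly different spreads (simulated data).
group_a = rng.normal(loc=50.0, scale=5.0, size=120)
group_b = rng.normal(loc=52.0, scale=15.0, size=80)

# Levene's test: the null hypothesis is that the variances are equal.
# SciPy centers on the median by default (the Brown-Forsythe variant).
_, levene_p = stats.levene(group_a, group_b)
equal_var = levene_p >= 0.05

# equal_var=True runs Student's t-test; equal_var=False runs the Welch test.
result = stats.ttest_ind(group_a, group_b, equal_var=equal_var)

test_name = "Student's t-test" if equal_var else "Welch test"
print(f"Levene p={levene_p:.4f} -> {test_name}, p={result.pvalue:.4f}")
```

The Welch test does not pool the two variances, which is why it remains valid when the groups have different spreads.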

**Data Type**: The t-test is for testing the equality of population means.

- If we are interested in testing the equality of population proportions, the z-test should be used.
- Take a marketing campaign as an example. If we are interested in understanding the response rate difference between the test and the control groups, the z-test should be used. On the other hand, if the goal is to understand whether the test and the control groups have different sales amounts, a t-test should be used.
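For the response rate case, a two-proportion z-test can be sketched with only the standard library. The function name `two_proportion_ztest` and the campaign counts are hypothetical, and the sketch assumes a pooled standard error under the null hypothesis of equal proportions.

```python
import math


def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the equality of two population proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of equal proportions.
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical campaign: 120 of 1000 responded in the test group,
# 90 of 1000 responded in the control group.
z, p_value = two_proportion_ztest(120, 1000, 90, 1000)
print(f"z={z:.3f}, p={p_value:.4f}")
```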

**Outliers**: The t-test is sensitive to outliers and works best when there are no outliers in the dataset.

- Outliers can be detected using the interquartile range (IQR). Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are commonly flagged as outliers, where Q1 and Q3 are the first and third quartiles. Please check out my previous tutorial on How to detect outliers for other outlier detection techniques.
- When there are outliers in a dataset, try data transformation techniques such as logarithm or square root.
- If there are still outliers after data transformation, use the non-parametric Wilcoxon rank-sum test (Mann-Whitney U test).
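The IQR rule can be sketched with NumPy as follows. The helper `iqr_outlier_mask` and the sample values are hypothetical, and the quartiles use NumPy's default linear interpolation, which is one of several common conventions.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Hypothetical sales amounts with one extreme value.
sales = np.array([12.0, 14.0, 15.0, 15.5, 16.0, 17.0, 18.0, 19.0, 95.0])
mask = iqr_outlier_mask(sales)
print("outliers:", sales[mask])
```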

### Recommended Tutorials

- GrabNGoInfo Machine Learning Tutorials Inventory
- One-Class SVM For Anomaly Detection
- Multivariate Time Series Forecasting with Seasonality and Holiday Effect Using Prophet in Python
- Recommendation System: User-Based Collaborative Filtering
- Four Oversampling And Under-Sampling Methods For Imbalanced Classification Using Python
- Causal Inference One-to-one Matching on Confounders Using R for Python Users
- 3 Ways for Multiple Time Series Forecasting Using Prophet in Python
- How to detect outliers | Data Science Interview Questions and Answers

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.