What is a t-test? Data Science Interview Questions and Answers

T-test definition, application, interpretation, assumptions, and alternatives when assumptions are violated


The t-test is one of the most commonly asked topics in data science interviews. In this tutorial, we will cover how to answer two common interview questions for a data scientist position:

  • What is a t-test?
  • What are the assumptions for a t-test?

Let’s get started!

Resources for this post:

T-test Assumptions – GrabNGoInfo.com

T-test Definition

When answering the question about the t-test definition, we would like to cover what a t-test is, the goal of a t-test, the types of t-tests, and how to interpret t-test results.

  • A t-test is a statistical test to compare the means of two sample groups.
  • The goal of the t-test is to determine whether there is a statistically significant difference between the means of the two populations from which the sample groups were drawn.
  • There are two types of t-tests:
    • A two-sided t-test is used to decide if the two populations have different means.
    • A one-sided t-test is used to decide if one population’s mean is larger or smaller than the other population’s mean.
  • T-test results are interpreted using the p-value: if the p-value is below a chosen threshold (commonly 0.05), we reject the null hypothesis of equal means.
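As a quick illustration, the two-sided and one-sided tests described above can be run with SciPy. This is a minimal sketch with simulated data; the group means, sample sizes, and the 0.05 threshold are assumptions chosen for the example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=10.0, scale=2.0, size=50)  # simulated sample A
group_b = rng.normal(loc=11.0, scale=2.0, size=50)  # simulated sample B

# Two-sided t-test: are the two population means different?
t_stat, p_two_sided = stats.ttest_ind(group_a, group_b)

# One-sided t-test: is the mean of population A smaller than that of B?
t_stat_one, p_one_sided = stats.ttest_ind(group_a, group_b, alternative='less')

# Interpret the two-sided result with the usual 0.05 threshold
if p_two_sided < 0.05:
    print("Reject the null hypothesis of equal means")
else:
    print("Fail to reject the null hypothesis of equal means")
```

Note that `scipy.stats.ttest_ind` performs an independent two-sample t-test; its `alternative` parameter ('two-sided', 'less', or 'greater') selects between the two-sided and one-sided versions.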

Follow-up Questions: T-test Assumptions

When answering the question of t-test assumptions, we will talk about what the assumptions are, how to check if the assumptions are valid, and what to do if the assumptions are violated.

  1. Randomness: The data needs to be randomly sampled from the population.
    • This assumption can be checked by examining the design of the experiment and the sampling strategy.
    • If the randomness assumption is violated, we need to consider re-designing the experiment and re-sampling the data.
  2. Independence: The data within each group and between groups should be independent.
    • The independence assumption can be checked by examining the design of the experiment and the sampling strategy.
    • If two data points are collected from the same subject to make comparisons, those data points are dependent, and a paired t-test needs to be implemented.
  3. Normality: The t-test assumes that both populations are normally distributed.
    • The normality of the data can be checked using visualizations such as a histogram or a Q-Q plot. It can also be checked by statistical tests such as the Shapiro-Wilk test, D’Agostino’s K-squared test, or the Kolmogorov-Smirnov test.
    • When the normality assumption is violated, try data transformation such as logarithm or square root, and check the normality of the transformed dataset.
    • If the dataset is still not normally distributed after transformation, we can use the Wilcoxon rank-sum test (also known as the Mann-Whitney U test), a non-parametric test that does not assume a normal data distribution.
  4. Equal Variance: The population variances are the same.
    • Equal variance can be checked using visualizations such as a boxplot or a histogram. It can also be checked by statistical tests such as Levene’s test or the F-test.
    • Welch’s t-test should be used when the equal variance assumption is violated but the normality assumption is satisfied.
    • When both the equal variance and the normality assumptions are violated, the non-parametric Wilcoxon test should be used.
  5. Data Type: The t-test is for testing the equality of population means.
    • If we are interested in testing the equality of population proportions, the z-test should be used.
    • Take a marketing campaign as an example: if we are interested in the response rate difference between the test and control groups, a z-test should be used. On the other hand, if the goal is to determine whether the test and control groups have different sales amounts, a t-test should be used.
  6. Outliers: The t-test is sensitive to outliers and works best when there are no outliers in the dataset.
    • Outliers can be detected using the Interquartile Range (IQR). Values more than 1.5 times the IQR below the first quartile or above the third quartile are flagged as outliers. Please check out my previous tutorial on How to detect outliers for other outlier detection techniques.
    • When there are outliers in a dataset, try data transformation techniques such as logarithm or square root.
    • If there are still outliers after data transformation, use the non-parametric Wilcoxon test.
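The normality and equal-variance checks above, together with their fallbacks, can be sketched in SciPy. This is an illustrative workflow on simulated data; the sample sizes and the 0.05 decision thresholds are assumptions for the example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=60)
group_b = rng.normal(loc=10.5, scale=3.0, size=60)  # deliberately larger variance

# Assumption 3 (Normality): Shapiro-Wilk test on each group
normal_a = stats.shapiro(group_a).pvalue > 0.05
normal_b = stats.shapiro(group_b).pvalue > 0.05

# Assumption 4 (Equal variance): Levene's test
equal_var = stats.levene(group_a, group_b).pvalue > 0.05

if normal_a and normal_b:
    # Student's t-test if variances are equal; Welch's t-test otherwise
    # (equal_var=False turns ttest_ind into Welch's t-test)
    result = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
else:
    # Non-parametric fallback: Wilcoxon rank-sum / Mann-Whitney U test
    result = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

print(f"p-value: {result.pvalue:.4f}")
```

This decision flow mirrors the assumptions list: test normality first, then variance equality, and fall back to the non-parametric test only when normality fails even after transformation.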

Recommended Tutorials

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.
