
p-value

— data-science

Snippet

The probability of seeing data at least as extreme as what was actually observed, given that the null hypothesis is true, as measured by a specific Test Statistic. This is Fisher's Approach.

There is an important tension to understand between Fisher's use of p-values and the Neyman Pearson Approach. In Fisher's view, the p-value is not a fixed decision threshold but a continuous measure of evidence against $H_0$, used to judge the strength of evidence. That is, a p-value of 0.06 isn't fundamentally different from 0.05, just slightly weaker evidence: it's a continuous standard of evidence. The Neyman Pearson Approach, by contrast, fixes a significance level in advance and treats the test as a binary reject/don't-reject decision.

In the Gaussian setting, the p-value is computed via the Survival Function.
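
For example (a minimal sketch assuming scipy; the observed statistic is a made-up illustrative value), a two-sided p-value for a z-statistic via the Normal survival function:

```python
from scipy.stats import norm

z_obs = 2.1  # hypothetical observed z-statistic

# Survival function sf(z) = 1 - CDF(z) = P(Z > z) for Z ~ N(0, 1);
# doubling gives the two-sided p-value.
p_two_sided = 2 * norm.sf(abs(z_obs))
print(p_two_sided)  # ~0.036
```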

In short, the p-value is a Random Variable that stochastically dominates the uniform distribution (it is super-uniform); this is exactly what gives you Type 1 Error (Alpha) control, i.e. you won't falsely reject more often than your level. Let $P$ be a p-value computed from a Test Statistic under the null hypothesis $H_0$. Then under certain conditions (e.g. a continuous test statistic), the distribution of $P$ under $H_0$ satisfies

$$\Pr_{H_0}(P \le \alpha) \le \alpha \quad \text{for all } \alpha \in [0, 1].$$

This says that the CDF of $P$ lies below or equal to the CDF of the uniform distribution on $[0, 1]$, i.e., $F_P(\alpha) \le F_{\mathrm{Unif}[0,1]}(\alpha) = \alpha$.
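
A quick sanity check of this (a toy simulation assuming numpy and scipy): with a continuous statistic and a true null, $\Pr(P \le \alpha)$ should come out close to $\alpha$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# p-values from one-sample t-tests on pure noise (H0 is true by construction)
pvals = np.array([
    stats.ttest_1samp(rng.normal(size=30), popmean=0.0).pvalue
    for _ in range(10_000)
])

# With a continuous test statistic, P(p <= alpha) should be ~ alpha
for alpha in (0.01, 0.05, 0.10):
    print(alpha, (pvals <= alpha).mean())
```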

Common Choices of Test Statistics

  • Linear Regression: Under $H_0$, the test statistic for each coefficient follows a t-distribution (asymptotically Normal); see the sketch after this list.
  • Generalized Linear Models (GLM) (e.g. Logistic Regression): Typically use the Wald Test; sometimes the Likelihood Ratio Test (LR) or Score Tests.
  • Essential property: The Test Statistic must have a known distribution under $H_0$ to compute p-values.
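
A minimal sketch (synthetic data, statsmodels) of coefficient t-statistics and their p-values in linear regression:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 0.5 * X[:, 0] + rng.normal(size=n)  # second feature has no true effect

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.tvalues)  # t-statistic per coefficient
print(fit.pvalues)  # two-sided p-values from the t-distribution
```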

Motivations for Choices

  • Wald Test, t-test: computationally simple, widely used.
  • Likelihood Ratio Test: more powerful when comparing nested models (see the sketch after this list).
  • Score Tests: useful when fitting the full model is computationally difficult.
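
A sketch of a Likelihood Ratio Test between nested logistic regressions (synthetic data; under $H_0$ the LR statistic is asymptotically $\chi^2$ with one degree of freedom per restricted parameter):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
y = rng.binomial(1, 1 / (1 + np.exp(-0.8 * X[:, 0])))  # only feature 0 matters

full = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
reduced = sm.Logit(y, sm.add_constant(X[:, [0]])).fit(disp=0)  # drop feature 1

# LR statistic: twice the log-likelihood gap; df = number of dropped parameters
lr = 2 * (full.llf - reduced.llf)
print(chi2.sf(lr, df=1))  # p-value for H0: feature 1's coefficient is 0
```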

Lower Bound on p-value

  • An extremely low p-value could indicate:
    • A real effect ($H_0$ false).
    • Overfitting or multiple testing (inflated false positives).
    • Violation of test assumptions (invalid inference).
    • A large sample size (large $n$), causing negligible effects to appear statistically significant.
  • Better practice:
    • Report effect sizes (not just p-values).
    • Provide Confidence Intervals.
    • Correct for multiple comparisons (Bonferroni's method, BH-FDR); see the sketch after this list.
    • Explore Bayesian alternatives (e.g., Posterior Inclusion Probability).
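
A minimal sketch of multiple-comparison correction with statsmodels' multipletests (the raw p-values here are made up for illustration):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.740])

# Bonferroni controls the family-wise error rate; 'fdr_bh' controls the FDR
for method in ("bonferroni", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, reject, np.round(p_adj, 3))
```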

Independent p-value

  • Many multiple-testing procedures assume the p-values are independent. Common situations that violate this (a toy simulation of the overlap case follows this list):
    • Shared data or variables (p-values are likely to be correlated).
    • Multiple regressions on the same response: p-values become correlated because the tests are not independent samples.
    • Different subsets of data.
      • e.g. testing a drug's effectiveness on subgroups of patients: if patients overlap or share characteristics, the tests are not independent.
    • Hierarchical testing.
      • e.g. first test for a general effect, then test subgroup effects.
    • Correlation in errors.
      • If tests have correlated error terms, the p-values won't be independent.
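
A toy simulation of the overlapping-subgroups case (numpy/scipy, entirely synthetic): two one-sample t-tests that share half their patients yield noticeably correlated p-values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two "subgroup" tests that share half their patients
p_a, p_b = [], []
for _ in range(2_000):
    shared = rng.normal(size=50)  # overlapping patients
    a = np.concatenate([shared, rng.normal(size=50)])
    b = np.concatenate([shared, rng.normal(size=50)])
    p_a.append(stats.ttest_1samp(a, 0.0).pvalue)
    p_b.append(stats.ttest_1samp(b, 0.0).pvalue)

print(np.corrcoef(p_a, p_b)[0, 1])  # noticeably positive, not ~0
```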

For non-linear models

  • Essentially, you use resampling (permutation/bootstrap) to simulate the null distribution of a test statistic.
    • Example choices: difference in prediction accuracy, cross-entropy loss/MSE, a feature importance score, AUC, precision, recall.
  • Classical $t$-tests/$F$-tests rely on closed-form null distributions. In nonlinear models, these aren't available. Instead:
    • The Permutation Test gives you an empirical p-value (see the sketch at the end of this section):

      $$\hat{p} = \frac{1 + \#\{b : T^{*}_b \ge T_{\text{obs}}\}}{1 + B}$$

      where $T_{\text{obs}}$ is the statistic on the original data and $T^{*}_1, \dots, T^{*}_B$ are recomputed on permuted data.
  • Model-Agnostic Feature Importance
    • SHAP values: approximate Shapley Values (from Game Theory) for feature attribution; p-values can be derived via permutation or bootstrap on SHAP scores.
    • LIME / Integrated Gradients: estimate feature contributions locally; empirical testing over perturbations yields significance.
  • Bayesian Inference (e.g. Bayesian Neural Networks)
    • Place priors on weights → posterior distributions over predictions.
    • Use credible intervals (e.g., 95%) for hypothesis testing: if 0 is not in the credible interval for a feature's effect → statistically significant.
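
A minimal sketch of the permutation test above (numpy only, synthetic data; absolute Pearson correlation stands in for whichever statistic you actually care about):

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_pvalue(statistic, x, y, n_perm=999):
    """Empirical p-value: permuting y breaks any x-y association,
    simulating draws of the statistic under the null."""
    t_obs = statistic(x, y)
    t_null = np.array([statistic(x, rng.permutation(y)) for _ in range(n_perm)])
    # +1 in numerator and denominator keeps the p-value valid (never exactly 0)
    return (1 + np.sum(t_null >= t_obs)) / (1 + n_perm)

x = rng.normal(size=200)
y = 0.3 * x + rng.normal(size=200)
stat = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
print(permutation_pvalue(stat, x, y))
```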