p-value
data-science
Snippet
The probability of observing data at least as extreme as what was seen, assuming the null hypothesis is true, computed under a specific Test Statistic. This is Fisher's Approach.
There is an important tension to understand between the p-value's role in the Neyman-Pearson Approach, where it is compared to a fixed decision threshold, and Fisher's view, where it is a continuous measure of evidence against $H_0$ used to judge the strength of evidence. On the latter reading, a p-value of 0.06 isn't fundamentally different from 0.05; it's just slightly weaker evidence on a continuous scale.
In the Gaussian setting, the p-value is computed via the Survival Function.
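A minimal sketch of this computation in Python, assuming a standard-normal (z) test statistic; the observed value `z_obs` is made up:

```python
import scipy.stats as st

# Observed z-statistic (hypothetical). Under H0, Z ~ N(0, 1).
z_obs = 2.17

# Survival function sf(z) = 1 - CDF(z) = P(Z >= z).
p_one_sided = st.norm.sf(z_obs)
p_two_sided = 2 * st.norm.sf(abs(z_obs))  # double the tail for a two-sided test

print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```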
In short, under the null the p-value is a Random Variable that stochastically dominates the uniform distribution (it is "super-uniform"), which gives you Type 1 Error (Alpha) control: you won't falsely reject more often than your level. Let $P$ be a p-value computed from a Test Statistic $T$ under the null hypothesis $H_0$. Then under certain conditions (e.g. a continuous test statistic), the distribution of $P$ under $H_0$ satisfies

$$\Pr(P \le \alpha) \le \alpha \quad \text{for all } \alpha \in [0, 1].$$

This says that the CDF of $P$ lies below or equal to the CDF of the uniform distribution on $[0, 1]$, i.e., $F_P(\alpha) \le F_{\mathrm{Unif}(0,1)}(\alpha) = \alpha$.
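A quick simulation of this property, using one-sample t-tests on Gaussian null data (sample size and replication count are arbitrary):

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)

# Draw many datasets under H0 (true mean = 0) and record the t-test p-value.
pvals = np.array([
    st.ttest_1samp(rng.normal(size=30), popmean=0.0).pvalue
    for _ in range(5000)
])

# Under H0 the rejection rate at each level alpha should be <= alpha
# (approximately equal here, since the t statistic is continuous).
for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}, rejection rate = {(pvals <= alpha).mean():.3f}")
```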
Common Choices of Test Statistics
- Linear Regression: Under $H_0: \beta_j = 0$, the coefficient's test statistic follows a t-distribution (asymptotically Normal).
- Generalized Linear Models (GLM) (e.g. Logistic Regression): Typically use the Wald Test; sometimes the Likelihood Ratio Test (LR) or Score Tests.
- Essential property: the Test Statistic must have a known distribution under $H_0$ to compute p-values.
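A sketch of the linear-regression case with statsmodels; the data are synthetic and the coefficient values are made up:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Toy data: y depends on the first feature only; the second is a null feature.
n = 200
X = rng.normal(size=(n, 2))
y = 0.5 * X[:, 0] + rng.normal(size=n)

# OLS fit; each coefficient gets a t-statistic and a two-sided p-value
# under H0: beta_j = 0.
fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.tvalues)   # t-statistics
print(fit.pvalues)   # p-values from the t distribution
```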
Motivations for Choices
- Wald Test, t-test: computationally simple, widely used.
- Likelihood Ratio Test: more powerful when models are nested.
- Score Tests: useful when fitting the full model is computationally difficult.
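A sketch of a Likelihood Ratio Test between nested logistic regressions, assuming statsmodels and synthetic data where only the first feature matters:

```python
import numpy as np
import scipy.stats as st
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Synthetic binary outcome driven by x1 only; x2 is a candidate extra feature.
n = 500
X = rng.normal(size=(n, 2))
y = rng.binomial(1, 1 / (1 + np.exp(-0.8 * X[:, 0])))

full = sm.Logit(y, sm.add_constant(X)).fit(disp=0)             # x1 + x2
reduced = sm.Logit(y, sm.add_constant(X[:, [0]])).fit(disp=0)  # x1 only

# LR statistic: 2 * (loglik_full - loglik_reduced) ~ chi2(df=1) under H0.
lr = 2 * (full.llf - reduced.llf)
print(f"LR = {lr:.3f}, p = {st.chi2.sf(lr, df=1):.4f}")
```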
Lower Bound on p-value
- An extremely low p-value could indicate:
- Real effect ($H_0$ false).
- Overfitting or multiple testing (inflated false positives).
- Violation of test assumptions (invalid inference).
- Large sample size (very large $n$), causing negligible effects to appear statistically significant.
- Better practice:
- Report effect sizes (not just p-values).
- Provide Confidence Intervals.
- Correct for multiple comparisons (Bonferroni's method, BH-FDR).
- Explore Bayesian alternatives (e.g., Posterior Inclusion Probability).
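A sketch of the multiple-comparison corrections using statsmodels' `multipletests`; the raw p-values here are hypothetical:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from six tests.
raw = [0.001, 0.008, 0.020, 0.040, 0.150, 0.700]

# Bonferroni controls the family-wise error rate; fdr_bh (Benjamini-Hochberg)
# controls the false discovery rate.
for method in ("bonferroni", "fdr_bh"):
    reject, adjusted, _, _ = multipletests(raw, alpha=0.05, method=method)
    print(method, reject, adjusted.round(3))
```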
Independence of p-values
- Many procedures (e.g. BH-FDR) assume independent p-values. Common situations that violate this:
- Shared data or variables (p-values likely to be correlated).
- Multiple regressions on the same response: p-values become correlated because the tests are not independent samples.
- Different subsets of data.
- ex) Testing a drug's effectiveness on subgroups of patients: if patients overlap or share characteristics, the tests are not independent.
- Hierarchical testing
- ex) First test for a general effect, then test subgroup effects afterward.
- Correlation in errors
- If tests have correlated error terms, p-values won't be independent.
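A small simulation of one such violation: two t-tests sharing observations produce correlated p-values (all sample sizes are arbitrary):

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3)

# Two one-sample t-tests whose samples overlap in 40 shared observations.
p1, p2 = [], []
for _ in range(2000):
    shared = rng.normal(size=40)
    a = np.concatenate([shared, rng.normal(size=20)])
    b = np.concatenate([shared, rng.normal(size=20)])
    p1.append(st.ttest_1samp(a, popmean=0.0).pvalue)
    p2.append(st.ttest_1samp(b, popmean=0.0).pvalue)

# Independent p-values would have correlation ~ 0; these are clearly positive.
print(f"corr(p1, p2) = {np.corrcoef(p1, p2)[0, 1]:.2f}")
```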
For non-linear models
- Essentially, you use resampling (permutation/bootstrap) to simulate the null distribution of a test statistic.
- Example test statistics: difference in prediction accuracy, cross-entropy loss/MSE, a feature importance score, AUC, precision, recall.
- Classical $t$-tests/$F$-tests rely on closed-form null distributions. In nonlinear models, these aren't available. Instead:
- The Permutation Test gives you an empirical p-value (see the sketch at the end of this section):

$$\hat{p} = \frac{1 + \#\{b : T^{(b)} \ge T_{\mathrm{obs}}\}}{1 + B},$$

where $T_{\mathrm{obs}}$ is the statistic on the original data and $T^{(1)}, \dots, T^{(B)}$ are its values under $B$ random permutations.
- Model-Agnostic Feature Importance
- SHAP values: Approximate Shapley Values (from Game Theory) for feature attribution; p-values can be derived via permutation or bootstrap on SHAP scores.
- LIME / Integrated Gradients: Estimate feature contributions locally; empirical testing over perturbations yields significance.
- Bayesian Inference (e.g. Bayesian Neural Networks)
- Place priors on weights, yielding posterior distributions over predictions.
- Use credible intervals (e.g., 95%) for hypothesis testing: if 0 is not in the credible interval for a feature's effect, the effect is statistically significant.
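A minimal sketch of the permutation test referenced above, using held-out accuracy as the test statistic (scikit-learn, synthetic data; the model choice and B are arbitrary):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Synthetic data: only the first of five features carries signal.
n = 400
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

def held_out_accuracy(X, y):
    """Fit on a train split and score on the held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

T_obs = held_out_accuracy(X, y)

# Null distribution: permuting y breaks any X-y association.
B = 100
T_null = np.array([held_out_accuracy(X, rng.permutation(y)) for _ in range(B)])

# Empirical p-value with the +1 correction, matching the formula above.
p_hat = (1 + np.sum(T_null >= T_obs)) / (1 + B)
print(f"observed accuracy = {T_obs:.3f}, permutation p = {p_hat:.4f}")
```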