p-value
data-science
Snippet
The probability of observing data at least as extreme as what was seen, assuming the null hypothesis is true, computed under a specific Test Statistic. This is Fisher's Approach.
There is an important tension to understand between the p-value's role in the Neyman-Pearson Approach, where it is compared to a fixed decision threshold, and Fisher's view, where it is a continuous measure of evidence against $H_0$ used to judge the strength of evidence. On the latter reading, a p-value of 0.06 isn't fundamentally different from 0.05; it's just slightly weaker evidence on a continuous scale.
In the Gaussian setting, the p-value is computed via the Survival Function.
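A minimal sketch of this computation in Python, assuming a standard-normal (z) test statistic; the observed value `z_obs` is made up:

```python
import scipy.stats as st

# Observed z-statistic (hypothetical). Under H0, Z ~ N(0, 1).
z_obs = 2.17

# Survival function sf(z) = 1 - CDF(z) = P(Z >= z).
p_one_sided = st.norm.sf(z_obs)
p_two_sided = 2 * st.norm.sf(abs(z_obs))  # double the tail for a two-sided test

print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```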
In short, under the null the p-value is a Random Variable that stochastically dominates the uniform distribution (it is "super-uniform"), which gives you Type 1 Error (Alpha) control: you won't falsely reject more often than your level. Let $P$ be a p-value computed from a Test Statistic $T$ under the null hypothesis $H_0$. Then under certain conditions (e.g. a continuous test statistic), the distribution of $P$ under $H_0$ satisfies

$$\Pr(P \le \alpha) \le \alpha \quad \text{for all } \alpha \in [0, 1].$$

This says that the CDF of $P$ lies below or equal to the CDF of the uniform distribution on $[0, 1]$, i.e., $F_P(\alpha) \le F_{\mathrm{Unif}(0,1)}(\alpha) = \alpha$.
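A quick simulation of this property, using one-sample t-tests on Gaussian null data (sample size and replication count are arbitrary):

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)

# Draw many datasets under H0 (true mean = 0) and record the t-test p-value.
pvals = np.array([
    st.ttest_1samp(rng.normal(size=30), popmean=0.0).pvalue
    for _ in range(5000)
])

# Under H0 the rejection rate at each level alpha should be <= alpha
# (approximately equal here, since the t statistic is continuous).
for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}, rejection rate = {(pvals <= alpha).mean():.3f}")
```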
Common Choices of Test Statistics
- Linear Regression: Under $H_0: \beta_j = 0$, the coefficient's test statistic follows a t-distribution (asymptotically Normal).
- Generalized Linear Models (GLM) (e.g. Logistic Regression): Typically use the Wald Test; sometimes the Likelihood Ratio Test (LR) or Score Tests.
- Essential property: the Test Statistic must have a known distribution under $H_0$ to compute p-values.
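A sketch of the linear-regression case with statsmodels; the data are synthetic and the coefficient values are made up:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Toy data: y depends on the first feature only; the second is a null feature.
n = 200
X = rng.normal(size=(n, 2))
y = 0.5 * X[:, 0] + rng.normal(size=n)

# OLS fit; each coefficient gets a t-statistic and a two-sided p-value
# under H0: beta_j = 0.
fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.tvalues)   # t-statistics
print(fit.pvalues)   # p-values from the t distribution
```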
Motivations for Choices
- Wald Test, t-test: computationally simple, widely used.
- Likelihood Ratio Test: more powerful when models are nested.
- Score Tests: useful when fitting the full model is computationally difficult.
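A sketch of a Likelihood Ratio Test between nested logistic regressions, assuming statsmodels and synthetic data where only the first feature matters:

```python
import numpy as np
import scipy.stats as st
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Synthetic binary outcome driven by x1 only; x2 is a candidate extra feature.
n = 500
X = rng.normal(size=(n, 2))
y = rng.binomial(1, 1 / (1 + np.exp(-0.8 * X[:, 0])))

full = sm.Logit(y, sm.add_constant(X)).fit(disp=0)             # x1 + x2
reduced = sm.Logit(y, sm.add_constant(X[:, [0]])).fit(disp=0)  # x1 only

# LR statistic: 2 * (loglik_full - loglik_reduced) ~ chi2(df=1) under H0.
lr = 2 * (full.llf - reduced.llf)
print(f"LR = {lr:.3f}, p = {st.chi2.sf(lr, df=1):.4f}")
```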
Lower Bound on p-value
- An extremely low p-value could indicate:
- Real effect ($H_0$ false).
- Overfitting or multiple testing (inflated false positives).
- Violation of test assumptions (invalid inference).
- Large sample size (very large $n$), causing negligible effects to appear statistically significant.
- Better practice:
- Report effect sizes (not just p-values).
- Provide Confidence Intervals.
- Correct for multiple comparisons (Bonferroni's method, BH-FDR).
- Explore Bayesian alternatives (e.g., Posterior Inclusion Probability).
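A sketch of the multiple-comparison corrections using statsmodels' `multipletests`; the raw p-values here are hypothetical:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from six tests.
raw = [0.001, 0.008, 0.020, 0.040, 0.150, 0.700]

# Bonferroni controls the family-wise error rate; fdr_bh (Benjamini-Hochberg)
# controls the false discovery rate.
for method in ("bonferroni", "fdr_bh"):
    reject, adjusted, _, _ = multipletests(raw, alpha=0.05, method=method)
    print(method, reject, adjusted.round(3))
```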
Independence of p-values
- Many procedures (e.g. BH-FDR) assume independent p-values. Common situations that violate this:
- Shared data or variables (p-values likely to be correlated).
- Multiple regressions on the same response: p-values become correlated because the tests are not independent samples.
- Different subsets of data.
- ex) Testing a drug's effectiveness on subgroups of patients: if patients overlap or share characteristics, the tests are not independent.
- Hierarchical testing
- ex) First test for a general effect, then test subgroup effects afterward.
- Correlation in errors
- If tests have correlated error terms, p-values won't be independent.
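A small simulation of one such violation: two t-tests sharing observations produce correlated p-values (all sample sizes are arbitrary):

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3)

# Two one-sample t-tests whose samples overlap in 40 shared observations.
p1, p2 = [], []
for _ in range(2000):
    shared = rng.normal(size=40)
    a = np.concatenate([shared, rng.normal(size=20)])
    b = np.concatenate([shared, rng.normal(size=20)])
    p1.append(st.ttest_1samp(a, popmean=0.0).pvalue)
    p2.append(st.ttest_1samp(b, popmean=0.0).pvalue)

# Independent p-values would have correlation ~ 0; these are clearly positive.
print(f"corr(p1, p2) = {np.corrcoef(p1, p2)[0, 1]:.2f}")
```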
For non-linear models
- Essentially, you use resampling (permutation/bootstrap) to simulate the null distribution of a test statistic.
- Example test statistics: difference in prediction accuracy, cross-entropy loss/MSE, a feature importance score, AUC, precision, recall.
- Classical $t$-tests/$F$-tests rely on closed-form null distributions. In nonlinear models, these aren't available. Instead:
- The Permutation Test gives you an empirical p-value (see the sketch at the end of this section):

$$\hat{p} = \frac{1 + \#\{b : T^{(b)} \ge T_{\mathrm{obs}}\}}{1 + B},$$

where $T_{\mathrm{obs}}$ is the statistic on the original data and $T^{(1)}, \dots, T^{(B)}$ are its values under $B$ random permutations.
- Model-Agnostic Feature Importance
- SHAP values: Approximate Shapley Values (from Game Theory) for feature attribution; p-values can be derived via permutation or bootstrap on SHAP scores.
- LIME / Integrated Gradients: Estimate feature contributions locally; empirical testing over perturbations yields significance.
- Bayesian Inference (e.g. Bayesian Neural Networks)
- Place priors on weights, yielding posterior distributions over predictions.
- Use credible intervals (e.g., 95%) for hypothesis testing: if 0 is not in the credible interval for a feature's effect, the effect is statistically significant.
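A minimal sketch of the permutation test referenced above, using held-out accuracy as the test statistic (scikit-learn, synthetic data; the model choice and B are arbitrary):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Synthetic data: only the first of five features carries signal.
n = 400
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

def held_out_accuracy(X, y):
    """Fit on a train split and score on the held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

T_obs = held_out_accuracy(X, y)

# Null distribution: permuting y breaks any X-y association.
B = 100
T_null = np.array([held_out_accuracy(X, rng.permutation(y)) for _ in range(B)])

# Empirical p-value with the +1 correction, matching the formula above.
p_hat = (1 + np.sum(T_null >= T_obs)) / (1 + B)
print(f"observed accuracy = {T_obs:.3f}, permutation p = {p_hat:.4f}")
```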