### April 10, 2018

### John P. A. Ioannidis

*P* values and accompanying methods of statistical significance testing are creating challenges in biomedical science and other disciplines. The vast majority (96%) of articles that report *P* values in the abstract, full text, or both include some values of .05 or less.^{1} However, many of the claims that these reports highlight are likely false.^{2} Recognizing the major importance of the statistical significance conundrum, the American Statistical Association (ASA) published^{3} a statement on *P* values in 2016. The status quo is widely believed to be problematic, but how exactly to fix the problem is far more contentious. The contributors to the ASA statement also wrote 20 independent, accompanying commentaries focusing on different aspects and prioritizing different solutions. Another large coalition of 72 methodologists recently proposed^{4} a specific, simple move: lowering the routine *P* value threshold for claiming statistical significance from .05 to .005 for new discoveries. The proposal met with strong endorsement in some circles and concerns in others.

*P* values are misinterpreted, overtrusted, and misused. The language of the ASA statement enables the dissection of these 3 problems. Multiple misinterpretations of *P* values exist, but the most common one is that they represent the “probability that the studied hypothesis is true.”^{3} A *P* value of .02 (2%) is wrongly considered to mean that the null hypothesis (eg, the drug is as effective as placebo) is 2% likely to be true and the alternative (eg, the drug is more effective than placebo) is 98% likely to be correct. Overtrust ensues when it is forgotten that “proper inference requires full reporting and transparency.”^{3} Better-looking (smaller) *P* values alone do not guarantee full reporting and transparency. In fact, smaller *P* values may hint at selective reporting and nontransparency. The most common misuse of the *P* value is to make “scientific conclusions and business or policy decisions” based on “whether a *P* value passes a specific threshold” even though “a *P* value, or statistical significance, does not measure the size of an effect or the importance of a result,” and “by itself, a *P* value does not provide a good measure of evidence.”^{3}

These 3 major problems mean that passing a statistical significance threshold (traditionally *P* = .05) is wrongly equated with a finding or an outcome (eg, an association or a treatment effect) being true, valid, and worth acting on. These misconceptions affect researchers, journals, readers, and users of research articles, and even media and the public who consume scientific information. Most claims supported with *P* values slightly below .05 are probably false (ie, the claimed associations and treatment effects do not exist). Even among those claims that are true, few are worth acting on in medicine and health care.
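Why a *P* value is not the probability that a hypothesis is true can be illustrated with a short calculation. The sketch below is not taken from the cited sources. It assumes a simplified setting with no bias, a hypothetical share of tested hypotheses that are truly non-null (`prior`), and a fixed `power`, and it computes the fraction of results passing a threshold that reflect real effects. All parameter values are illustrative assumptions.

```python
def positive_predictive_value(prior: float, power: float, alpha: float) -> float:
    """Fraction of 'significant' results that reflect a true effect.

    prior: assumed fraction of tested hypotheses that are truly non-null
    power: probability that a true effect yields P < alpha
    alpha: significance threshold (false-positive rate under the null)
    Assumes no bias and no selective reporting (a strong simplification).
    """
    true_positives = prior * power
    false_positives = (1.0 - prior) * alpha
    return true_positives / (true_positives + false_positives)


# Hypothetical scenario: 1 in 10 tested hypotheses is true and power is 50%.
# Power is held fixed across thresholds only to keep the comparison simple;
# at a fixed sample size, a stricter threshold would also lower power.
for alpha in (0.05, 0.005):
    ppv = positive_predictive_value(prior=0.10, power=0.50, alpha=alpha)
    print(f"alpha = {alpha}: share of significant results that are true = {ppv:.2f}")
```

Under these assumed inputs, the share of significant results that are true depends heavily on the prior and the power, not on the threshold alone. With lower priors or with bias, the share falls further.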

Lowering the threshold for claiming statistical significance is an old idea. Several scientific fields have carefully considered how low a *P* value should be for a research finding to have a sufficiently high chance of being true. For example, adoption of genome-wide significance thresholds (*P* < 5 × 10^{−8}) in population genomics has made discovered associations highly replicable and these associations also appear consistently when tested in new populations. The human genome is very complex, but the extent of multiplicity of significance testing involved is known, the analyses are systematic and transparent, and a requirement for *P* < 5 × 10^{−8} can be cogently arrived at.
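The genome-wide threshold is commonly motivated as a Bonferroni-type correction. The sketch below assumes roughly 1 million independent tests across the genome, a frequently quoted approximation that is not stated in the text above, and shows how a family-wise level of .05 maps to a per-test threshold. The function names are illustrative.

```python
def bonferroni_threshold(family_alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate near family_alpha."""
    return family_alpha / n_tests


def expected_false_positives(per_test_alpha: float, n_null_tests: int) -> float:
    """Expected number of null tests that pass the per-test threshold."""
    return per_test_alpha * n_null_tests


N_TESTS = 1_000_000  # assumed number of independent tests (approximation)

per_test = bonferroni_threshold(0.05, N_TESTS)
print(f"Per-test threshold: {per_test:.1e}")
print(f"Expected false positives at .05: {expected_false_positives(0.05, N_TESTS):,.0f}")
print(f"Expected false positives at corrected threshold: "
      f"{expected_false_positives(per_test, N_TESTS):.2f}")
```

The calculation is possible only because the number of tests is known in advance, which is the condition that most other biomedical research does not meet.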

However, for most other types of biomedical research, the multiplicity involved is unclear and the analyses are nonsystematic and nontransparent. For most observational exploratory research that lacks preregistered protocols and analysis plans, it is unclear how many analyses were performed and what various analytic paths were explored. Hidden multiplicity, nonsystematic exploration, and selective reporting may affect even experimental research and randomized trials. Even though it is now more common to have a preexisting protocol and statistical analysis plan and preregistration of the trial posted on a public database, there are still substantial degrees of freedom regarding how to analyze data and outcomes and what exactly to present. In addition, many studies in contemporary clinical investigation focus on smaller benefits or risks; therefore, the risk of various biases affecting the results increases.

Moving the *P* value threshold from .05 to .005 will shift about one-third of the statistically significant results of past biomedical literature to the category of just “suggestive.”^{1} This shift is essential for those who believe (perhaps crudely) in black and white, significant or nonsignificant categorizations. For the vast majority of past observational research, this recategorization would be welcome. For example, mendelian randomization studies show that only a few past claims from observational studies with *P* < .05 represent causal relationships.^{5} Thus, the proposed reduction in the level for declaring statistical significance may dismiss mostly noise with relatively little loss of valuable information. For randomized trials, the proportion of true effects that emerge with *P* values in the window from .005 to .05 will be higher, perhaps the majority in several fields. However, most findings would not represent treatment effects that are large enough for outcomes that are serious enough to make them worthy of further action. Thus, the reduction in the *P* value threshold may largely do more good than harm, despite also removing an occasional true and useful treatment effect from the coveted significance zone. Regardless, the need for also focusing on the magnitude of all treatment effects and their uncertainty (such as with confidence intervals) cannot be overstated.

Lowering the threshold of statistical significance is a temporizing measure. It would work as a dam that could help gain time and prevent drowning by a flood of statistical significance, while promoting better, more-durable solutions.^{6} These solutions may involve abandoning statistical significance thresholds or *P* values entirely. If any thresholds are to continue in use, even lower thresholds are probably preferable for most observational research. Comprehensive reviews (termed *umbrella reviews*) that have evaluated multiple systematic reviews of observational studies propose a *P* < 10^{−6} threshold.^{5} In addition, falsification end-point methods (ie, using such *P* value thresholds that almost all well-established null associations will not be able to pass them) also point to very low *P* values.^{7} With the advent of big data, statistical significance will increasingly mean very little because extremely low *P* values are routinely obtained for signals that are too small to be useful even if true.
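The big data point can be made concrete with a minimal sketch. It assumes a 2-group comparison with equal group sizes, a known common standard deviation, and an observed standardized difference exactly equal to a tiny assumed value (`effect_sd = 0.01`). These inputs are hypothetical and chosen only for illustration.

```python
import math


def two_sided_p_from_effect(effect_sd: float, n_per_group: int) -> float:
    """Two-sided P value of a z test for a standardized mean difference.

    effect_sd: observed difference in means, in standard deviation units
    n_per_group: participants in each of the 2 groups
    Assumes a known common standard deviation (normal approximation).
    """
    z = effect_sd * math.sqrt(n_per_group / 2.0)
    return math.erfc(abs(z) / math.sqrt(2.0))


# The same negligible difference (0.01 SD) at increasing sample sizes.
for n in (1_000, 100_000, 10_000_000):
    p = two_sided_p_from_effect(effect_sd=0.01, n_per_group=n)
    print(f"n per group = {n:>10,}: P = {p:.3g}")
```

The effect size is identical in every row; only the sample size changes. A difference that is far from any threshold in a modest study passes even very stringent thresholds in a very large data set.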

Adopting lower *P* value thresholds may help promote a reformed research agenda with fewer, larger, and more carefully conceived and designed studies with sufficient power to pass these more demanding thresholds. However, collateral harms may also emerge. Bias may escalate rather than decrease if researchers and other interested parties (eg, for-profit sponsors) try to find ways to make the results have lower *P* values. Selected study end points may become even less clinically relevant because it is easier to reach lower *P* values with weak surrogate end points than with hard clinical outcomes. Moreover, results that pass a lower *P* value threshold may be limited by greater regression to the mean and new discoveries may have even more exaggerated effect sizes than before.
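The concern about exaggerated effect sizes can be sketched with a small simulation. It assumes many hypothetical studies of the same true effect, each with the same standard error and normally distributed estimates, and it averages only the estimates that reach significance in the expected direction. The true effect (0.2), standard error (0.1), and number of simulated studies are arbitrary illustrative choices.

```python
import random
from statistics import NormalDist, mean


def mean_significant_estimate(true_effect: float, std_error: float, alpha: float,
                              n_studies: int = 200_000, seed: int = 0) -> float:
    """Average estimate among simulated studies that pass the threshold.

    Each study's estimate is drawn from Normal(true_effect, std_error).
    A study counts as significant when its z score exceeds the two-sided
    critical value for alpha in the expected (positive) direction.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    significant = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, std_error)
        if estimate / std_error > z_crit:
            significant.append(estimate)
    return mean(significant)


TRUE_EFFECT = 0.2  # hypothetical true effect
STD_ERROR = 0.1    # hypothetical standard error (modest power)

for alpha in (0.05, 0.005):
    avg = mean_significant_estimate(TRUE_EFFECT, STD_ERROR, alpha)
    print(f"alpha = {alpha}: mean significant estimate = {avg:.3f} "
          f"({avg / TRUE_EFFECT:.2f}x the true effect)")
```

When power is modest, as assumed here, the estimates that survive the stricter threshold overstate the true effect more than those that survive the conventional one. With well-powered studies the inflation shrinks.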

Because the proposed threshold of *P* < .005 is imperfect, other more difficult but more durable alternative solutions should also be contemplated (Table). These solutions vary based on how quickly and easily they can be adopted. They can target the use and interpretation of the past biomedical literature accumulated to date or the design and deployment of the new literature that will accumulate in the future. The situation is dire for the past literature because there is no perfect remedy after the fact. In the long term, the scientific workforce will need to be more properly trained in using the best fit-for-purpose statistical inference tools, and biases will need to be addressed preemptively rather than retrospectively. However, these may continue to be largely unachievable goals.

Data are becoming more complex. If time for rigorous training in methods and statistics for researchers and for research users remains limited, subpar medical statistics and concomitant misinterpretations may continue. Nevertheless, hopefully several fields will adopt better standards for *P* values, will decrease their dependence on *P* values, and will enhance the adoption of other useful inferential tools (eg, Bayesian statistics) when appropriate. The rapidity and extent of these changes are unpredictable. Low adoption in the past may cause some pessimism. However, a fresh start and a rapid acceleration of adoption of better practices are always possible. Incentives from major journals and funders as well as radical changes in training curricula may be necessary to achieve more widespread and effective shifts.