Data driven hypotheses without disclosure (‘HARKing’)

What is this about?

HARKing, i.e. Hypothesizing After the Results are Known, or post hoc testing as it is more widely known, is familiar to many researchers. In scientific methodology or statistics classes in graduate school, many of us were told that the practice is flawed, but few of us have ever heard the rationale behind that verdict. HARKing is considered a detrimental research practice (1). This thematic page tries to explain the logic behind HARKing and hopefully sheds some light on its nature and validity.

Why is this important?

HARKing can increase the chance of falsely rejecting the null hypothesis, i.e. of committing a type I error (2). Whenever a statistical analysis is done, theories or hypotheses are formalized as mathematical models (3). Models are built from a main outcome measure and the factors that are supposed to influence it (3). These factors are usually derived either from published research or from data gathered in experiments or surveys. Once a model with satisfactory explanatory or predictive properties has been built, it needs to be externally validated, i.e. tested on a new, similar dataset (4). This is needed because a model might fit the data on which it was built so closely that it becomes too specific, a problem known as overfitting, and loses its ability to generalize to similar datasets (4). In more technical terms, some of the explanatory or predictive factors in the model might correlate with the real causes of the effect only in our dataset, but not in other similar datasets.
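The inflation of the type I error described above can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not a description of any particular study: every candidate factor is pure noise, each is tested with a two-sided z-test with known standard deviation, and the names `z_stat` and `simulate` are hypothetical. A "HARKed" study reports whichever factor happened to look strongest as if it had been the hypothesis all along; a validated study tests that factor again on fresh data.

```python
import random
from statistics import NormalDist, fmean


def z_stat(sample):
    """z statistic for H0: mean = 0, assuming a known standard deviation of 1."""
    return fmean(sample) * len(sample) ** 0.5


def simulate(n_studies=2000, n_factors=10, n=30, alpha=0.05, seed=1):
    """Return (false positive rate with HARKing, false positive rate after validation)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    harked = validated = 0
    for _ in range(n_studies):
        # Assumption: every factor is pure noise, so any "finding" is a type I error.
        zs = [z_stat([rng.gauss(0, 1) for _ in range(n)]) for _ in range(n_factors)]
        # HARKing: the strongest-looking factor is presented as the a priori hypothesis.
        harked += max(abs(z) for z in zs) > z_crit
        # Validation: the selected factor is tested once more on a new, similar dataset.
        # Because all factors are noise, fresh noise stands in for that new dataset.
        z_new = z_stat([rng.gauss(0, 1) for _ in range(n)])
        validated += abs(z_new) > z_crit
    return harked / n_studies, validated / n_studies


if __name__ == "__main__":
    harked_rate, validated_rate = simulate()
    print(f"false positives with HARKing:     {harked_rate:.3f}")
    print(f"false positives after validation: {validated_rate:.3f}")
```

With ten independent noise factors and a nominal level of 0.05, the probability that at least one of them looks significant is 1 - 0.95^10, or about 0.40, which is what the first rate should approximate. The second rate should stay close to the nominal 0.05.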

Replication of studies is the way through which HARKing can be recognized (5), but only after the damage has been done. Pre-registration of studies, with clearly stated hypotheses and a planned statistical analysis, is how we can hope to prevent HARKing.

For whom is this important?

Students, PhD Students, Scientists, Researchers, Postdocs

What are the best practices?

The most prominent examples in practice are diagnostic studies and hypothesis-generating studies. When developing new diagnostic models, authors tend to combine multiple prognostic factors and then test the resulting model with ROC analysis on the whole sample, without validating it on a separate sample. Moreover, the need for model validation is sometimes not disclosed in the discussion section. Hypothesis-generating studies are usually done on “big data” from databases such as The Cancer Genome Atlas. The primary goal of such studies is to build models based on large datasets and to “get the feeling for the data” or, in more technical language, to do exploratory data analysis. Sometimes such studies do not disclose the need for model validation (i.e. confirmatory data analysis). Sometimes, after ANOVA, a correction for multiple comparisons, also known as post hoc testing, is applied. These post hoc tests use more stringent criteria for statistical significance, with the purpose of partly replacing model validation. However, whether more stringent significance criteria can replace model validation is a highly debated topic in the world of statistics (6).
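The diagnostic-study example can be sketched in code. The following is a minimal illustration under assumptions that are not taken from any real study: all candidate markers are random noise, the "model" is simply the single marker (and direction) with the best apparent AUC, and the function names `auc` and `apparent_vs_validated` are hypothetical. It compares the AUC obtained on the data used to choose the marker with the AUC of the same marker on a separate sample.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve: the share of case/control pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def apparent_vs_validated(n=60, n_markers=20, n_repeats=200, seed=7):
    """Return (mean AUC on the development half, mean AUC on the held-out half)."""
    rng = random.Random(seed)
    half = n // 2
    labels = [i % 2 for i in range(n)]  # alternating cases (1) and controls (0)
    apparent = validated = 0.0
    for _ in range(n_repeats):
        # Assumption: no marker carries any diagnostic information.
        markers = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_markers)]
        best_auc, best_marker, best_sign = 0.0, None, 1
        for m in markers:
            a = auc(m[:half], labels[:half])
            # A marker may discriminate in either direction; flipping its sign gives 1 - a.
            for sign, value in ((1, a), (-1, 1 - a)):
                if value > best_auc:
                    best_auc, best_marker, best_sign = value, m, sign
        apparent += best_auc
        # Validation: same marker, same direction, but a sample not used for selection.
        held_out = [best_sign * x for x in best_marker[half:]]
        validated += auc(held_out, labels[half:])
    return apparent / n_repeats, validated / n_repeats


if __name__ == "__main__":
    dev_auc, val_auc = apparent_vs_validated()
    print(f"mean AUC on development data: {dev_auc:.2f}")
    print(f"mean AUC on separate sample:  {val_auc:.2f}")
```

Although none of the markers has any real diagnostic value, the AUC on the development data should look clearly better than chance, while the AUC on the separate sample should fall back to about 0.5. This is the optimism that goes unnoticed when a model is tested only on the sample on which it was built.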

Another case that is often confused with HARKing is planned multiple comparisons after ANOVA. Here, the fact that the comparisons are planned means that the model was built before the experiment, and the comparisons specified by that model are carried out after the data have been gathered (7).
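One practical consequence of planning comparisons in advance can be shown with a short calculation. The sketch below assumes a Bonferroni correction, which the text above does not prescribe and which is only one of several possible corrections; the group count, the number of planned comparisons and the function name `bonferroni_threshold` are hypothetical.

```python
from math import comb


def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance threshold under a Bonferroni correction."""
    return alpha / n_comparisons


if __name__ == "__main__":
    alpha = 0.05
    n_groups = 5                      # hypothetical number of groups in the ANOVA
    all_pairs = comb(n_groups, 2)     # every possible pairwise comparison
    planned = 2                       # hypothetical comparisons fixed before data collection

    print("unplanned, all pairs:", all_pairs, "tests at", bonferroni_threshold(alpha, all_pairs))
    print("planned comparisons: ", planned, "tests at", bonferroni_threshold(alpha, planned))
```

Because the planned comparisons are fixed before the data are seen, only those comparisons enter the correction, so each can be tested at a less stringent threshold than when every pair of groups is compared after the fact.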


1. Bouter LM, Tijdink J, Axelsen N, Martinson BC, Ter Riet G. Ranking major and minor research misbehaviors: results from a survey among participants of four World Conferences on Research Integrity. Res Integr Peer Rev. 2016;1:17.

2. Kerr NL. HARKing: hypothesizing after the results are known. Pers Soc Psychol Rev. 1998;2(3):196-217.

3. Introduction to Process Modeling. 2012 [cited 6.27.2019.]. In: NIST/SEMATECH e-Handbook of Statistical Methods [Internet]. NIST, [cited 6.27.2019.]. Available from:

4. Burnham KP, Anderson DR. Inference and Principle of Parsimony. In: Burnham KP, Anderson DR, editors. Model Selection and Multimodel Inference. Berlin: Springer; 2010. p. 29-37.

5. Van Bavel JJ, Mende-Siedlecki P, Brady WJ, Reinero DA. Contextual sensitivity in scientific reproducibility. Proc Natl Acad Sci U S A. 2016;113(23):6454-9.

6. Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology. 1990;1(1):43-6.

7. Althouse AD. Adjust for Multiple Comparisons? It's Not That Simple. Ann Thorac Surg. 2016;101(5):1644-5.

Benjamin Benzon contributed to this theme.

Latest contribution was July 11, 2019