The Transparency Revolution
Political science, like the social sciences in general, is in the midst of significant changes in how research is conducted and reported. I refer to these reforms as the “transparency revolution,” a movement that emphasizes ensuring the accuracy and plausibility of published research findings.
One concern that has been raised is that researchers often fail to report and adjust for multiple testing. If scholars selectively report statistically significant findings and bury null results in the file drawer, then the published literature will be distorted. Not only will published effect sizes be too large, but the precision of these effects will be overestimated. Uri Simonsohn and his colleagues have referred to this practice of selective reporting as “p-hacking,” while Andrew Gelman has used the less pejorative phrase “the garden of forking paths.” This movement has come on the heels of the “credibility revolution,” which has raised empirical standards for disentangling cause from effect.
Like all revolutions, this one is not without controversy. For example, the recent DA-RT guidelines for political science journals have weathered criticism. Proposed solutions for dealing with “p-hacking,” such as preregistration of studies and the filing of pre-analysis plans, have been met with skepticism. One thing missing from these debates is empirical evidence on the extent of selective reporting in political science. This is difficult to measure since we typically do not observe how the sausage is made in a research project.
I recently published a paper in Political Analysis with my co-authors Annie Franco and Gabor Simonovits (“Underreporting in Political Science Survey Experiments: Comparing Questionnaires to Published Results”) to bring some data to this debate. We address the “black box” of the research process by examining TESS (Time-sharing Experiments in the Social Sciences), an NSF program that provides grants to researchers to conduct survey experiments on representative samples. The advantageous feature of TESS is that the questionnaires from the experiments are made public. Therefore, we have a registry of what researchers intended to analyze alongside what they actually report. Given that time on TESS surveys is a scarce and precious resource, the assumption that the questionnaire content is core to the central hypotheses of the project is likely justified. Further, since authors are aware that the questionnaires will eventually be made public, the degree of underreporting measured here is likely a lower bound on what would be found in survey experiments generally.
We find extensive evidence of underreporting. About 30% of published papers report fewer experimental conditions than are listed in the questionnaire, and about 60% report fewer outcome variables than are listed in the questionnaire. Putting these two pieces of data together, about 80% of papers fail to report all experimental conditions and outcomes. These findings suggest that the actual Type I error rates are likely much higher than the nominal rates implied by the published results.
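To see why unreported outcomes inflate Type I error rates, consider a simple simulation. This is not from the paper; it is a minimal sketch with illustrative parameter values (five outcomes per study, a 0.05 significance threshold) showing that if a researcher measures several outcomes with no true effects and reports only whichever one clears the threshold, the chance of reporting at least one "significant" finding far exceeds the nominal 5%.

```python
import random

random.seed(42)

ALPHA = 0.05         # nominal significance threshold
N_OUTCOMES = 5       # hypothetical number of outcomes measured per study
N_STUDIES = 100_000  # simulated studies, all with no true effects

# Under the null hypothesis, each outcome's p-value is uniform on [0, 1].
# A study yields a false positive if at least one outcome clears ALPHA.
false_positives = 0
for _ in range(N_STUDIES):
    p_values = [random.random() for _ in range(N_OUTCOMES)]
    if min(p_values) < ALPHA:
        false_positives += 1

observed = false_positives / N_STUDIES
analytic = 1 - (1 - ALPHA) ** N_OUTCOMES  # probability that any of k tests is significant
print(f"Observed family-wise error rate: {observed:.3f}")
print(f"Analytic rate 1-(1-alpha)^k:     {analytic:.3f}")
```

With five null outcomes, the family-wise error rate is roughly 23% rather than 5%, which is why reporting only the significant subset of outcomes distorts the published literature.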
So What Do We Do?
It appears that the problem of selective underreporting is real. In a follow-up study examining psychology papers, we find that this underreporting has a systematic pattern. As one might expect, the reported results have larger effect sizes and are more likely to be statistically significant than the unreported results. At first blush, these findings suggest that pre-analysis plans might be an important institutional solution, making clear ahead of time what scholars intended to analyze. However, if the problem is mainly due to editors and reviewers “streamlining” manuscripts by filtering out insignificant results, then pre-analysis plans may not be efficacious. More dramatic reforms, such as publishing articles blind to their results, may be in order.
Lastly, I want to emphasize that I do not believe any of these results signify fraud or bad intentions on the part of researchers. I concur with Andrew Gelman that it is better to think of these reporting decisions as a “garden of forking paths” rather than “p-hacking.” Regardless of the framing, however, the consequence of such research practices is to distort the published literature. Scholars follow norms in the discipline, and selective reporting clearly is a norm. Individual researchers are not incentivized to unilaterally deviate from it. Rather, collective action is needed to bolster the transparency and credibility of empirical research in political science.