Michèle B. Nuijten and co's statcheck program re-examines the statistics reported in peer-reviewed science and flags anomalies associated with fakery, from duplicated data to internal inconsistencies.
As Nuijten writes, this isn’t just about catching deliberate data-cooking — it’s also good at catching self-delusion in data-analysis: “it was strange that we have all kinds of blinding procedures during data collection to avoid expectancy effects among experimenters, but that there was no such blinding during the analysis phase. As though the analyst is not prone to error and bias!”
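To get a feel for what this kind of automated audit involves, here's a minimal sketch in Python (statcheck itself is an R package; the regex, tolerance and example strings below are illustrative assumptions of mine, not its actual implementation): parse an APA-style result, recompute the p-value from the reported test statistic and degrees of freedom, and flag a mismatch.

```python
# A minimal sketch (in Python, not Nuijten's actual R package) of the kind of
# consistency check statcheck automates: parse an APA-style t-test result,
# recompute the p-value from the reported statistic and degrees of freedom,
# and flag the result if the reported p-value doesn't match.
import re
from scipy import stats

# Matches strings like "t(28) = 2.20, p = .04" (APA style, no leading zero on p).
APA_T = re.compile(r"t\((\d+)\)\s*=\s*(-?\d+\.?\d*),\s*p\s*([=<>])\s*(\.\d+)")

def check_t_results(text, tolerance=0.005):
    """Return a list of (reported string, recomputed p, consistent?) tuples."""
    findings = []
    for df, t_val, comparator, p_rep in APA_T.findall(text):
        recomputed = 2 * stats.t.sf(abs(float(t_val)), int(df))  # two-tailed p
        if comparator == "=":
            consistent = abs(recomputed - float(p_rep)) <= tolerance
        elif comparator == "<":
            consistent = recomputed < float(p_rep)
        else:  # ">"
            consistent = recomputed > float(p_rep)
        findings.append((f"t({df}) = {t_val}, p {comparator} {p_rep}",
                         round(recomputed, 4), consistent))
    return findings

# Example: the second result reports p = .01, but the statistics imply p ≈ .054.
print(check_t_results("t(28) = 2.20, p = .04 ... t(30) = 2.00, p = .01"))
```

The real package handles more test types and accounts for rounding in the reported values; the point is simply that published statistics carry enough redundancy to check themselves.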
In September, statcheck was run on a large corpus of psychology papers, triggering re-examinations of more than 50,000 studies on PubPeer. Psychology is in the midst of a reproducibility crisis that calls into question many of the field's received, foundational ideas.
Now researchers applying the same kind of statistical scrutiny have suggested that Japanese bone specialist Yoshihiro Sato “fabricated data in many, if not all, of 33 randomized controlled trials they analyzed.”
Sato is the latest, but not the last. At the rate this is going, I’d be surprised if statcheck compliance didn’t become a standard badge in peer-reviewed pieces.
But this raises an important question: will statcheck work against researchers whose fakes are specifically designed to beat it? Recent history is littered with promising data-analysis techniques that performed brilliantly in retrospect, on material produced before they existed, but became much less effective once people who wanted to fool the algorithm could respond. Think of how Google's PageRank did an amazing job of surfacing the latent relevance structure in the links that formed the web, only to be swamped by link farms and more sophisticated schemes designed to exploit PageRank's weaknesses.
Since then, we've had Wall Street's hacking of the bond rating agencies, which let banks package high-risk subprime debt as low-risk AAA bonds; not to mention petty scams like the Kindle Unlimited fraud.
I project a stats-fraud arms-race, like the one we’re seeing in adversarial stylometry.
Bolland’s team looked at how much the results Sato reported deviated from what would normally be expected for the patients he was purportedly studying. In other words, when scientists make up data, they tend to do so in a way that’s too smooth and fails to reflect the natural world’s tendency to be messy. Although catching such homogeneity is tough with the naked eye, statistical analysis can find it pretty easily.
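Here's a toy illustration of that idea, under assumptions of my own rather than Bolland's actual analysis: in genuinely randomized trials, baseline differences between groups are pure noise, so the p-values from baseline comparisons across many trials should be roughly uniform between 0 and 1. Invented numbers that are "too balanced" pile up near 1, and a simple distributional test catches it.

```python
# A toy illustration (not Bolland and colleagues' actual method) of detecting
# data that is "too smooth": compute baseline-comparison p-values from the
# summary statistics reported in many trials, then test whether the collection
# departs from the uniform distribution expected under honest randomization.
import numpy as np
from scipy import stats

def baseline_p_value(m1, s1, n1, m2, s2, n2):
    """Welch t-test p-value computed from reported means, SDs and group sizes."""
    se = np.sqrt(s1**2 / n1 + s2**2 / n2)
    t = (m1 - m2) / se
    df = se**4 / ((s1**2 / n1)**2 / (n1 - 1) + (s2**2 / n2)**2 / (n2 - 1))
    return 2 * stats.t.sf(abs(t), df)

def homogeneity_check(summaries):
    """Kolmogorov-Smirnov test of the baseline p-values against uniform(0, 1)."""
    pvals = [baseline_p_value(*row) for row in summaries]
    return stats.kstest(pvals, "uniform")

# Hypothetical reported baselines (mean, SD, n per group) that are suspiciously
# similar in every one of 30 trials: the KS test returns a tiny p-value,
# flagging the collection as too smooth to be chance.
too_smooth = [(62.0, 8.0, 100, 62.1, 8.0, 100) for _ in range(30)]
print(homogeneity_check(too_smooth))
```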
So how off were the results? It turns out that some of the data Sato reported had a 5.2 × 10⁻⁸² chance of being plausible. That's basically zero.
Sato’s studies also had improbably positive results and much less mortality over time considering that the patients in his research tended to be older people with significant health problems. “Taken together with the implausible productivity of the group, internal inconsistencies for outcome data in their work, duplication of data, numerous misleading statements and errors, and concerns regarding ethical oversight,” the authors wrote, “our analysis suggests that the results of at least some of these trials are not reliable.”
statcheck [Michèle B. Nuijten]
There’s a way to spot data fakery. All journals should be using it [Ivan Oransky and Adam Marcus/Statnews]
Here’s why more than 50,000 psychology studies are about to have PubPeer entries [Dalmeet Singh Chawla/Retraction Watch]
(via Skepchick)
(Image: XKCD: Significance)