Last year I wrote a response to an article in Perspectives on Psychological Science by Robert Plomin, John DeFries, Valerie Knopik and Jenae Neiderhiser titled “Top Ten Replicated Findings from Behavioral Genetics.” Mine was titled “Weak Genetic Explanation 20 Years Later.” There is a lot of arguing in there, but one point in particular has struck readers: I referred to GWAS as “unapologetic, high-tech p-hacking.”
Just yesterday Dorothy Bishop posted,
Certainly socks it to them! ‘Genome-wide association is unapologetic, high-tech p-hacking’: Turkheimer, E. (2016). Weak genetic explanation 20 years later. Perspectives on Psychological Science, 11(1), 24-28. doi:10.1177/1745691615617442, p. 27. https://t.co/RFfgRdf9vI
— Dorothy Bishop (@deevybee) February 18, 2019
Other people I respect made it clear they disagreed with me:
That’s just not p hacking. Best not to double down on this.
— Ruben C. Arslan (@rubenarslan) December 14, 2018
So to be clear at the outset, I am not accusing anyone of p-hacking in the sense of cheating on their data. As I have said before, unapologetic p-hacking is way better than secretive p-hacking, and by some lights isn’t p-hacking at all. GWAS methods are public and open, but that doesn’t mean they are scientifically desirable.
To understand GWAS you have to review a little of the history of behavior genetics (and these comments are about GWAS as applied to behavior). It started with twin and family studies, which through the nineties showed over and over again that everything is heritable but, despite ingenious and complex twin models, never actually produced much about the etiology of the phenomena they investigated. But the genome project was on the horizon, so everyone figured it was just a matter of time before we identified the genes for schizophrenia and intelligence.
It didn’t happen. Initial attempts using linkage or association were published weekly in Nature, but none of them replicated. Notwithstanding Plomin et al.’s title about replicability, molecular genetics was at the heart of the replication crisis. Where were the genes?
Then GWAS came along, allowing investigators to use big samples to search the genome for associations with SNPs. At first this didn’t work either, because the effects of individual SNPs were tiny, and correcting for the enormous number of statistical tests pushed the required significance level so low that almost nothing survived. But eventually the sample sizes got big enough, and genome-wide significant SNPs started to be found.
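To make the arithmetic concrete: the conventional genome-wide threshold of p < 5 × 10⁻⁸ is essentially a Bonferroni correction for roughly a million independent SNP tests, and crossing it with a tiny effect takes a very large sample. Here is a rough back-of-the-envelope sketch; the one-million-tests figure and the Fisher z approximation are my own simplifying assumptions, not numbers from any particular GWAS.

```python
# Back-of-the-envelope sketch (my illustration, not from any specific GWAS):
# where the genome-wide significance threshold comes from, and how large a
# sample a tiny correlation needs to cross it.
import numpy as np
from scipy.stats import norm

n_tests = 1_000_000                  # ~1M roughly independent SNP tests (assumption)
alpha_gw = 0.05 / n_tests            # Bonferroni-style correction -> 5e-8

z_crit = norm.isf(alpha_gw / 2)      # two-sided critical value, about 5.45 SEs

# Sample size needed for a correlation r to reach genome-wide significance,
# using Fisher's z approximation: z = atanh(r) * sqrt(n - 3).
for r in (0.05, 0.02, 0.01):         # i.e., R^2 of .0025, .0004, .0001
    n_needed = (z_crit / np.arctanh(r)) ** 2 + 3
    print(f"r = {r:.2f}  (R^2 = {r**2:.4f}):  n needed ~ {n_needed:,.0f}")
```

The point of the exercise is only that, under these assumptions, any nonzero correlation, however trivial, will eventually clear the threshold if the sample keeps growing.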
There is more to be said (next post), but it is worth pausing here to evaluate GWAS at this point in its evolution. It declared public methods and adhered to them, and in that sense was not p-hacking. On the other hand, the method it adhered to, of increasing sample size endlessly until some unpredicted correlation reaches an arbitrary level of significance, sounds a lot like p-hacking to me.
Suppose Brian Wansink, the nutrition researcher who was brought down by revelations of p-hacking and other questionable research practices, had adopted the following strategy in response to criticism of his experiments. Instead of designing individual studies with hypotheses that were always susceptible to hacking, he got funding for an enormous nationwide program that monitored pizza restaurants across the country. The behaviors of hundreds of thousands of pizza eaters were recorded, as were many thousands of tiny characteristics of the environments they ate in. Then his team searched for correlations between characteristics of restaurants and eating behaviors, at stringent levels of significance. At first nothing was significant, so the sample was pushed up from hundreds of thousands to nearly a million pizza eaters, and finally some significant “hits” emerged. It turns out that eating in a pale green restaurant is associated with a 1 milligram increase in pizza consumption, R² = .0004, p < 10⁻⁸.
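Those numbers are not far-fetched: with a sample near a million, a correlation accounting for four hundredths of a percent of the variance really does clear such a threshold. A quick simulation shows it; the n of 900,000 and the r of .02 are made up here simply to match the example.

```python
# Quick simulation (my illustration, with made-up numbers matching the
# pale-green-restaurant example): a correlation of r = .02 (R^2 = .0004)
# becomes overwhelmingly "significant" once n approaches a million.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n, r_true = 900_000, 0.02

greenness = rng.standard_normal(n)                                   # restaurant color score
pizza = r_true * greenness + np.sqrt(1 - r_true**2) * rng.standard_normal(n)

r_obs, p = pearsonr(greenness, pizza)
print(f"r = {r_obs:.4f}, R^2 = {r_obs**2:.6f}, p = {p:.2e}")
# Prints something like r ~ .02 with p many orders of magnitude below 1e-8:
# statistically unimpeachable, substantively trivial.
```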
What would you make of such a finding? I for one would be deeply skeptical. Not because I doubt the mathematical operation of significance testing (I’d be willing to believe that the tiny correlation was not the result of sampling error), but because I doubt the scientific method of trading sample size against significance, especially at the extremes, and because I don’t know of a single example, at least in behavior, where a tiny correlation of that kind actually provided a meaningful footprint in the direction of a useful causal finding. Correlations of that kind are within what Meehl called the “crud factor,” the tendency for everything to correlate with everything, the result of Brownian motion, Jungian synchronicity, and changes in the earth’s magnetic field.
Now, is it “p-hacking”? It’s close enough for me. It is using the mechanics of significance testing to put a scientific gloss on a process that consists essentially of printing out a big table of correlation coefficients and circling the ones that are p < 10⁻⁸.