
I did a news search to see what researchers are saying recently about the data analyzed by 23andMe. One news article led me to a company blog post by a bioinformatics researcher, Gabe Rudy, "GATK is a Research Tool. Clinics Beware"

http://blog.goldenhelix.com/?p=1534

in which he applied his own industry knowledge to his updated 23andMe report. His conclusions suggest that the product needs much more work:

"I promptly sent an email off to 23andMe’s exome team letting them know about what is clearly a bug in the GATK variant caller. They confirmed it was a bug that went away after updating to a newer release. I talked to 23andMe’s bioinformatician behind the report face-to-face a bit at this year’s ASHG conference, and it sounds like it was most likely a bug in the tool’s multi-sample variant calling mode as this phantom insertion was a real insertion in one of the other samples."

. . . .

"But because GATK has been used so prolifically in publications and is backed by the Broad Institute, it can be viewed as a 'safe' choice. As small labs and clinical centers around the world are starting to set up their DNA-seq pipelines for gene panel and exome sequencing, they may choose GATK with the assumption that the output doesn’t need to be validated.

"And that would be a mistake.

"GATK is as susceptible to bugs as much as any complex software. Their new mixed licensing model (free for academic, fee for commercial) is intended to add more dedicated support resources to the team. I suggest they think about adding dedicated testers as well."

So for those of us following along at home, the crucial idea is that most of the "information" that 23andMe provides paying clients has not been validated. Not only has it not been validated as to correctness of the genome analysis software (the industry scientist's observation), it has even less been validated as a clue to clinically significant disease risk for the majority of diseases that afflict people in developed countries. Pay your money for the service at the new lower price if you like, but be prepared to see your personal genome results repackaged and reinterpreted for years to come before you learn anything from them that will help you improve your health.




Interesting article, although I think the author incorrectly blames the software tool rather than 23andMe.

The GATK is a research tool in active development. "Clinics beware" makes little sense because there is no real alternative (aside from other research tools). This is new territory for everyone. It strikes me almost the same way as if someone said, "Users beware: Linux is a research tool". Whether or not it is a research tool doesn't change the fact that misuse of the tool will lead to poor results.

It's been a little over a year since I last directly used the GATK's caller, but at that time indel calling (the thing that produced this blogger's error) was experimental and clearly labeled as such in loud capital letters.

Also, the GATK does not make one use insane parameters such as allowing variant calls supported by 0 reads; that's the choice of the person running the software...
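To make the "0 supporting reads" point concrete, here is a minimal sketch of the kind of sanity filter a pipeline owner could apply downstream of any caller. It is not GATK's API: the `VariantCall` record, its `alt_reads` field, and the `drop_unsupported` helper are hypothetical names invented for this illustration, and the threshold is an assumption the person running the pipeline would have to choose.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class VariantCall:
    """Hypothetical, simplified record for one variant call in one sample."""
    chrom: str
    pos: int
    ref: str
    alt: str
    alt_reads: int  # reads in this sample that support the alternate allele


def drop_unsupported(calls: Iterable[VariantCall],
                     min_alt_reads: int = 1) -> List[VariantCall]:
    """Keep only calls backed by at least `min_alt_reads` supporting reads."""
    if min_alt_reads < 1:
        raise ValueError("min_alt_reads must be at least 1")
    return [c for c in calls if c.alt_reads >= min_alt_reads]


# Made-up example data: one insertion with no supporting reads, one SNV with 12.
calls = [
    VariantCall("chr1", 1000, "A", "AT", 0),
    VariantCall("chr1", 2000, "G", "C", 12),
]
kept = drop_unsupported(calls)
```

With the default threshold, only the call at position 2000 survives; a real pipeline would likely want a stricter, validated cutoff.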


Thanks very much for your reply.

> This is new territory for everyone.

I have developed a habit of liberally upvoting most of your comments on threads related to biology, as I consistently learn from them and see you taking the time and effort to correct popular misconceptions as you participate here. So knowing that I'm asking someone who'll give me a straight answer, I wonder if you could comment specifically on my statement above

"So for those of us following along at home, the crucial idea is that most of the 'information' that 23andMe provides paying clients has not been validated. Not only has it not been validated as to correctness of the genome analysis software (the industry scientist's observation), it has even less been validated as a clue to clinically significant disease risk for the majority of diseases that afflict people in developed countries. Pay your money for the service at the new lower price if you like, but be prepared to see your personal genome results repackaged and reinterpreted for years to come before you learn anything from them that will help you improve your health."

I'm heartily in agreement with the idea of doing fundamental research on the human genome and gathering large datasets to analyze to look for genetic clues to human health and disease. I participate each week during the school year in the University of Minnesota "journal club" on behavior genetics, in which a group of scientists (mostly psychologists, but a few mathematicians and economists) who investigate genetic influences on human behavior meet to discuss the latest papers on new research. The overwhelming impression I get is that commercial businesses like 23andMe certainly mean well, and are trying to make available new gene analysis tools to a broader public. But that they are running ahead of their ability, based on current science, to deliver actionable information to the clients who pay for their services. There is still an astounding lack of replicability and of large effect sizes in almost any genome study related to common human diseases or to socially meaningful human behaviors. Much more research needs to be done.


Thanks for your kind words; I feel the same way re: your posts as well.

> So for those of us following along at home, the crucial idea is that most of the 'information' that 23andMe provides paying clients has not been validated. Not only has it not been validated as to correctness of the genome analysis software (the industry scientist's observation), it has even less been validated as a clue to clinically significant disease risk for the majority of diseases that afflict people in developed countries.

In general, what you say here is going to be true. I can't comment specifically on what 23andMe claim to demonstrate, because I haven't seen their actual output, but it's usually very difficult to go from genetic data to individual risk prediction. In fact, trying to do so is low-yield enough that I don't even expect individual risk prediction to be interesting for most diseases (at least, not for the next several years). So heuristically I will assert that any claims about individual risk prediction, for most diseases, are unlikely to be clinically important. The obvious exceptions are Mendelian genetic conditions. If they find that you are homozygous for CFTR ∆F508, you're almost certainly going to develop cystic fibrosis (well, you'd probably already have clinical symptoms by the time you get the test).

But a good number of disease conditions aren't (typically) Mendelian. Will you develop heart disease? If you have certain high-impact mutations in the LDL receptor, we might be able to say with reasonable certainty that you will develop heart disease by a certain age. But high-impact mutations are (usually) rare for most diseases.

The majority of what we have been discovering over the past few years are common variants of modest impact (powerful statistical associations but with odds ratios barely differing from 1). When we try to answer questions like, "Why do black Americans have more heart disease in the US?" we don't get smoking-gun mono- or oligogenic answers.[1] Even combining these markers (that are known to be robustly associated with a phenotype) into a single score doesn't do a whole lot more than just knowing the biomarkers that we already measure.[2]
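As a rough illustration of "combining these markers into a single score", one common approach is a weighted sum of risk-allele counts, with each variant weighted by the log of its odds ratio. The sketch below assumes that approach; the SNP names, odds ratios, and genotype are made-up placeholders, not values from the cited papers.

```python
import math
from typing import Dict


def genetic_risk_score(allele_counts: Dict[str, int],
                       odds_ratios: Dict[str, float]) -> float:
    """Sum of (risk-allele count x ln(per-allele odds ratio)) over variants."""
    if allele_counts.keys() != odds_ratios.keys():
        raise ValueError("allele_counts and odds_ratios must cover the same SNPs")
    return sum(allele_counts[snp] * math.log(odds_ratios[snp])
               for snp in allele_counts)


# Hypothetical per-allele odds ratios, all close to 1 as described above.
odds_ratios = {"snp_a": 1.10, "snp_b": 1.05, "snp_c": 1.08}
# Hypothetical genotype: number of risk alleles (0, 1, or 2) carried per SNP.
allele_counts = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

score = genetic_risk_score(allele_counts, odds_ratios)
relative_odds = math.exp(score)  # odds relative to carrying no risk alleles
```

Because each odds ratio barely differs from 1, the combined relative odds stay modest even for someone carrying several risk alleles, which is the point made above.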

Finally, we have to wrestle with the issue of causality. For example, increased HDL-cholesterol is epidemiologically associated with decreased risk of heart disease. So people say, "HDL is good cholesterol." OK, maybe. My colleagues tested that hypothesis with one single nucleotide polymorphism in a gene that appears only to modify HDL-C levels, LIPG. (Most genes that modify HDL-C also modify LDL-C or triglycerides, and pleiotropic effects ruin your ability to assess a single biomarker.) They said, "Epidemiologically, an X% increase in HDL-C associates with a Y% decrease in risk of heart attack. Also, this LIPG SNP associates with a J% increase in HDL-C. Therefore, based on that J% increase in HDL-C, we expect a K% decrease in heart disease if HDL-C is a causal biomarker." What was the result? The LIPG SNP had the expected effect on HDL-C levels, but had absolutely zero association with your risk for heart attack (the OR was like 0.99 with a CI that easily included 1.0). In contrast, an LDL-C SNP score had a robust association in the expected direction (more genetic variants that are known to raise LDL-C? more heart disease risk).[3]
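The arithmetic behind that argument can be sketched as follows, assuming the epidemiological association scales log-linearly with the change in HDL-C. Every number here is a placeholder chosen for illustration (the X, Y, J, and K above are not given in this thread), and the function names are hypothetical.

```python
import math


def expected_odds_ratio(or_per_unit: float, snp_effect_units: float) -> float:
    """Odds ratio expected for the SNP if the biomarker were causal.

    or_per_unit: epidemiological odds ratio per 1-unit increase in the biomarker.
    snp_effect_units: change in the biomarker, in the same units, per SNP allele.
    Assumes the log-odds of disease scale linearly with the biomarker.
    """
    return math.exp(math.log(or_per_unit) * snp_effect_units)


def interval_contains(ci_low: float, ci_high: float, value: float) -> bool:
    """True if `value` lies inside the confidence interval."""
    return ci_low <= value <= ci_high


# Placeholder inputs, NOT the published estimates.
or_per_sd_hdl = 0.60        # hypothetical OR per 1 SD higher HDL-C
snp_effect_sd = 0.25        # hypothetical SD increase in HDL-C per allele
observed_ci = (0.90, 1.09)  # hypothetical CI around an observed OR near 0.99

expected = expected_odds_ratio(or_per_sd_hdl, snp_effect_sd)
consistent_with_causal = interval_contains(*observed_ci, expected)
consistent_with_null = interval_contains(*observed_ci, 1.0)
```

With these placeholder inputs the expected odds ratio is about 0.88, which falls outside the hypothetical interval while 1.0 falls inside it. That is the shape of the result described above: the data fit "no effect" better than the effect predicted under causality.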

In other words, is HDL a causal protective factor? It appears that, for at least one cause of high HDL-C, it is not. (This doesn't give me license to dismiss the entire HDL hypothesis and I wouldn't intend to do that without exhaustive scientific work, of course.)

How much of this is out there? Probably tons. It's probably only known by domain experts, or (more commonly) nobody.

So if you're still following this minimally coherent post, we're discovering common variants with very weak effects that often impact biomarkers which may or may not be causal for the diseases of interest, and our view is being revised all of the time.

Is it interesting? From a research perspective, yes, it's awesome. We're finding all of these landmarks in this enormous genomic map. This gives us insight into the architecture of diseases. It gives us smarter therapeutic targets. It helps us evaluate potential therapeutic targets via tools such as Mendelian Randomization to save years of time and billions of dollars avoiding clinical trials that are unlikely to bear fruit (one example of which I discussed above with the LIPG paragraph).

But to an individual alive today wondering about the direct clinical utility of this information? Not today, not right now, not in my opinion. If it's any indicator, I haven't done 23andMe, and if I did, it would be for entertainment purposes.

Will this stuff be in clinics in a few years? Yes, probably. Will it be useful? My guess is that it will be most useful, even then, for research. Eventually there will be clinical significance, but I suspect that the clinical significance will largely go hand-in-hand with the development of therapeutics that have specific genetic targets (e.g., perhaps you have some predisposition to developing cancer, but with your genotype we know that if we give you these 3 tyrosine kinase inhibitors you're very unlikely to develop cancer). I could see that developing.

> Pay your money for the service at the new lower price if you like, but be prepared to see your personal genome results repackaged and reinterpreted for years to come before you learn anything from them that will help you improve your health.

Yes, this is absolutely true; there is much more work to be done, and if 23andMe's users are lucky they'll have to keep re-downloading their genomic data as they are updated with new information. If they end up stuck with what we've got now, well, that's stable but incredibly boring.

> But that they are running ahead of their ability, based on current science, to deliver actionable information to the clients who pay for their services. There is still an astounding lack of replicability and of large effect sizes in almost any genome study related to common human diseases or to socially meaningful human behaviors.

I think that there is actually remarkably good reproducibility of the genetic associations with common human diseases, but these are mostly of small effect. A lot of the associations in the candidate-gene/pre-human-genome era do appear to be spurious, however.

1 = http://www.ncbi.nlm.nih.gov/pubmed/21347282?dopt=Abstract

2 = http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2845522/

3 = http://www.thelancet.com/journals/lancet/article/PIIS0140-67...


It's a mistake to infer that this means "regular" 23andMe results are similarly affected. The exome project was clearly labeled as a research one and the data was returned with no guarantees whatsoever. The "regular" 23andMe SNPs go through a multitude of comprehensive checks before they are used in reports. Not to say that occasional issues don't crop up, but it's apples and oranges.

Edit: I was the first engineer at 23andMe, so I have a decent idea about what's involved in the analysis and quality control.


"Pay your money for the service at the new lower price if you like, but be prepared to see your personal genome results repackaged and reinterpreted for years to come before you learn anything from them that will help you improve your health."

This is precisely the spirit of the project from my interpretation. Increase the data set that is available for trend analysis, and therefore increase the quality and precision of the predictive analytics available.



