It may now be possible for anyone, even if they follow rigorous privacy and anonymity practices, to be identified by DNA data from people they do not even know.
A paper published in January in the journal Science describes a process by which it's possible to identify by name the donors of DNA samples, even without any demographic or personal information. The technique was developed by a team of geneticists at MIT's Whitehead Institute for Biomedical Research and is intended to demonstrate that science and technology have surpassed the techniques and laws currently in place for safeguarding private medical data, according to Yaniv Erlich, a fellow at Whitehead and member of the research team.
The point was not to reveal private information, but to demonstrate a systemic weakness that will require research, debate and new laws and technology to overcome, Erlich says. The technique relies on the custom of passing family names down through the father's side of the family. By statistically modeling the distribution of family names, the researchers were able to narrow the list of possible contributors of DNA samples. They then pinpointed individuals using a range of other publicly available sources, none of which were directly connected to the original donors and none of which included protected personal data.
This isn't a specific exploit against an effective wall of security, Erlich says. Instead, it demonstrates that genomic research may have grown beyond our ability to conceal the identities of the sources of DNA samples. The team started with a list of genomes that had already been sequenced, mapped and published for the use of genetic researchers. They analyzed the material to find identifying markers on the Y chromosome -- which is present only in men -- because surnames are generally passed down through fathers. They compared those Y markers to databases that list such markers along with the surnames of those from whom the samples were taken, but were not able to match all the samples with surnames using confirmed data. They determined which surnames were most likely to belong to which samples using scientifically accepted statistical models that were designed, among other things, to track the movement of regional populations by following the spread of family names.
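The marker-matching step described above can be illustrated with a toy sketch. This is not the researchers' statistical model. It is a simple exact-match score under stated assumptions: the reference records, the repeat counts and the function name `rank_surnames` are all hypothetical, and only the marker labels (DYS19, DYS390, DYS391) are names of real Y-chromosome markers.

```python
from collections import defaultdict

# Hypothetical reference records: (surname, {marker: repeat count}).
# The values are invented for illustration only.
REFERENCE = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 11}),
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 10}),
    ("Jones", {"DYS19": 15, "DYS390": 23, "DYS391": 10}),
]


def rank_surnames(sample, reference):
    """Rank surnames by the best fraction of shared markers that match."""
    best = defaultdict(float)
    for surname, profile in reference:
        # Compare only the markers present in both profiles.
        shared = [m for m in sample if m in profile]
        if not shared:
            continue
        score = sum(sample[m] == profile[m] for m in shared) / len(shared)
        # Keep the best-scoring record for each surname.
        best[surname] = max(best[surname], score)
    # Highest score first; ties broken alphabetically.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


# An anonymous sample's markers (hypothetical values).
sample = {"DYS19": 14, "DYS390": 24, "DYS391": 11}
ranking = rank_surnames(sample, REFERENCE)
print(ranking)
```

A ranking like this only suggests a likely surname. As the article goes on to describe, it took further searching of public records to move from a surname to an individual.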
The next step was more hack than science: The team used record-search engines on the Internet, obituaries, genealogical websites and demographic data from the National Institutes of Health's Human Genetic Cell Repository. Researchers then linked 50 of the samples to the names of those who contributed them.
Until now, the risk that private genetic data could be made public was considered fairly limited. Data about samples was kept separate from data about donors, and demographic data about the donors could only be supplied after identifiers were removed.