26 May 2013

Deanonymisation

'Identifying Participants in the Personal Genome Project by Name' [PDF], a Harvard University Data Privacy Lab white paper by Latanya Sweeney, Akua Abu and Julia Winn, offers a sobering but, for this author, unsurprising report on the deanonymisation of data in the Personal Genome Project.

Sweeney, Abu and Winn state that
We linked names and contact information to publicly available profiles in the Personal Genome Project. These profiles contain medical and genomic information, including details about medications, procedures and diseases, and demographic information, such as date of birth, gender, and postal code. By linking demographics to public records such as voter lists, and mining for names hidden in attached documents, we correctly identified 84 to 97 percent of the profiles for which we provided names. Our ability to learn their names is based on their demographics, not their DNA, thereby revisiting an old vulnerability that could be easily thwarted with minimal loss of research value. So, we propose technical remedies for people to learn about their demographics to make better decisions.
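The linkage step the authors describe can be pictured as a join on three quasi-identifiers. What follows is only a minimal sketch, not the authors' code: the records, the field names (`dob`, `gender`, `zip`) and the `link_profiles` helper are all made up for illustration. It assumes a profile counts as re-identified only when exactly one voter record shares its date of birth, gender and postal code.

```python
from collections import defaultdict

# Hypothetical, invented records for illustration only.
profiles = [
    {"profile_id": "P-001", "dob": "1970-01-15", "gender": "F", "zip": "02138"},
    {"profile_id": "P-002", "dob": "1985-06-30", "gender": "M", "zip": "02139"},
    {"profile_id": "P-003", "dob": "1990-11-02", "gender": "F", "zip": "02140"},
]
voter_list = [
    {"name": "Alice Example", "dob": "1970-01-15", "gender": "F", "zip": "02138"},
    {"name": "Bob Example", "dob": "1985-06-30", "gender": "M", "zip": "02139"},
    {"name": "Brian Example", "dob": "1985-06-30", "gender": "M", "zip": "02139"},
]

# The demographic fields shared by both datasets (an assumption of this sketch).
QUASI_IDENTIFIERS = ("dob", "gender", "zip")


def demographic_key(record):
    """Return the (dob, gender, zip) tuple used to join the two datasets."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link_profiles(profiles, voters):
    """Map profile IDs to names where exactly one voter shares the demographics."""
    names_by_key = defaultdict(list)
    for voter in voters:
        names_by_key[demographic_key(voter)].append(voter["name"])

    linked = {}
    for profile in profiles:
        candidates = names_by_key.get(demographic_key(profile), [])
        # Zero candidates: no match. Several candidates: ambiguous, so skip.
        if len(candidates) == 1:
            linked[profile["profile_id"]] = candidates[0]
    return linked


matches = link_profiles(profiles, voter_list)
print(matches)
```

In this toy data only the first profile is linked: the second shares its demographics with two voters and the third with none. The point is that nothing in the join touches DNA; the three demographic fields do all the work.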
They explain that
The freedom to decide with whom to share one’s own medical and genomic information seems critical to protecting personal privacy in today's data-rich networked society. Individuals are often in the best position to make decisions about sharing extensive amounts of personal information for many worthy purposes like research. A person can weigh harms and benefits as relevant to her own life. In comparison, decisions by policy makers and committees do not usually allow fine-grained personal distinctions, but instead dictate sweeping actions that apply the same to everyone. But how good are the decisions individuals will make? A person may have far less expertise than vetted committee members or veteran policy makers. And potential harms and vulnerabilities may be hidden; if so, an individual may not be able to make good decisions.
For example, sharing information about sexual abuse, abortions, or depression medication may be liberating for one person yet harmful for another. Further, if the information is shared without the explicit appearance of name or address, a person may be more likely to share the information publicly because of a false belief she is anonymous. It is important to help people make good data sharing decisions. If people share data widely and thousands of people get subsequently harmed doing so, policy makers may respond and take away the freedom to make personal data sharing decisions, thereby depriving society of individual choice. To make smarter decisions, people need an understanding of actual risks and ways technology can help.
The authors comment that
Launched in 2006, the Personal Genome Project (PGP) aims to sequence the genotypic and phenotypic information of 100,000 informed volunteers and display it publicly online in an extensive public database [1]. Information provided in the PGP includes DNA information, behavioral traits, medical conditions, physical characteristics, and environmental factors. A general argument for the disclosure of such information is its utility. The PGP founders believe this information will aid researchers in establishing correlations between certain traits and conducting research in personalized medicine. They also foresee its use as a tool for individuals to learn about their own genetic risk profiles for disease, uncover ancestral data, and examine biological characteristics [2]. According to the project’s principal founder, Harvard geneticist George Church, the only real utility of this type of information is as data reflecting physical and genomic characteristics [3]. Along with Steven Pinker and Esther Dyson, he volunteered his information publicly as one of the first ten participants in the project. Currently, 2,593 individuals disclose their information publicly at the PGP website.
The PGP operates under a privacy protocol it terms “open consent”[4]. Individual volunteers freely choose to disclose as much personal data as they want, often including identifying demographic data, such as date of birth, gender, and postal code (ZIP). Online, the profiles appear in a “de-identified state,” being void of the direct appearance of the participant’s name or address. The result provides volunteers with seeming anonymity and a participant is assigned an identification number as the reference to his profile. Participants may upload information directly from an external DNA sequencing service (e.g., from 23andMe), but these services often provide documents having additional personal information including the participant name. PGP participants are required to sign a range of consent forms and pass an entrance exam.
The consent form does not in any way guarantee participants a degree of privacy. To the contrary, the form explicitly states that participation may even reveal other non-disclosed information about the participant:
“If you have previously made available or intend to make available genetic or other medical or trait information in a confidential setting, for example in another research study, the data that you provide to the PGP may be used, on its own or in combination with your previously shared data, to identify you as a participant in otherwise private and/or confidential research. This means that any data or other information you may have shared pursuant to a promise of confidentiality or privacy may become public despite your intent that they be kept private and confidential. This could result in certain adverse effects for you, including ones not considered or anticipated by this consent form”.
Risks mentioned by the form include public disclosure and identification and the use of personal genomic information for non-medical purposes including cloning provided cell lines. It is emphasized that all risk lies with the individual. Once a participant uploads information to his online profile, the PGP offers almost no means to amend or modify information. Participants basically display all the contents of the profile or none at all unless they know how to edit files directly. Some of these files use complicated and unusual formats (e.g., a continuity of care report that holds the participant’s personal health record).