23 December 2015

Genomic Beacons

Yet another genomic privacy article, this time 'Privacy Risks from Genomic Data-Sharing Beacons' [PDF] by Suyash S. Shringarpure and Carlos D. Bustamante in (2015) 97(Nov) The American Journal of Human Genetics 1–16.

The authors indicate
The human genetics community needs robust protocols that enable secure sharing of genomic data from participants in genetic research. Beacons are web servers that answer allele-presence queries—such as “Do you have a genome that has a specific nucleotide (e.g., A) at a specific genomic position (e.g., position 11,272 on chromosome 1)?”—with either “yes” or “no.” Here, we show that individuals in a beacon are susceptible to re-identification even if the only data shared include presence or absence information about alleles in a beacon. Specifically, we propose a likelihood-ratio test of whether a given individual is present in a given genetic beacon. Our test is not dependent on allele frequencies and is the most powerful test for a specified false-positive rate. Through simulations, we showed that in a beacon with 1,000 individuals, re-identification is possible with just 5,000 queries. Relatives can also be identified in the beacon. Re-identification is possible even in the presence of sequencing errors and variant-calling differences. In a beacon constructed with 65 European individuals from the 1000 Genomes Project, we demonstrated that it is possible to detect membership in the beacon with just 250 SNPs. With just 1,000 SNP queries, we were able to detect the presence of an individual genome from the Personal Genome Project in an existing beacon. Our results show that beacons can disclose membership and implied phenotypic information about participants and do not protect privacy a priori. We discuss risk mitigation through policies and standards such as not allowing anonymous pings of genetic beacons and requiring minimum beacon sizes.
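To make the query model in the abstract concrete, here is a minimal Python sketch of a beacon that answers only "yes" or "no". The class name ToyBeacon and the representation of a genome as a set of (chromosome, position, allele) tuples are assumptions made for illustration; this is not the GA4GH Beacon API.

```python
class ToyBeacon:
    """Hypothetical in-memory beacon: answers allele-presence queries only."""

    def __init__(self, genomes):
        # Each genome is assumed to be a set of (chromosome, position, allele)
        # tuples. The beacon keeps only the union, so it never reveals which
        # genome (or how many genomes) carries a given allele.
        self._alleles = set()
        for genome in genomes:
            self._alleles.update(genome)

    def query(self, chromosome, position, allele):
        # "Do you have a genome with this nucleotide at this position?"
        return (chromosome, position, allele) in self._alleles


# Example use with made-up data.
beacon = ToyBeacon([
    {("1", 11272, "A"), ("2", 500, "G")},
    {("1", 11272, "A"), ("3", 42, "T")},
])
has_allele = beacon.query("1", 11272, "A")
lacks_allele = beacon.query("1", 11272, "C")
```

Even in this toy form the answer carries no allele frequency, only presence or absence, which is the property the authors go on to show is not enough to prevent re-identification.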
They go on to comment
In the coming decade, a great deal of human genomic data, along with linked phenotypes in electronic health records, will be collected in the context of health care. A major goal of the human genomics community is to enable efficient sharing, aggregation, and analysis of these data in order to understand the genetic contributors of health and disease. Previous large-scale data-sharing approaches have had limited success because of the potential for privacy breaches and risks of participant re-identification. Homer et al. and others showed that subjects in a genome-wide association study could be re-identified with the use of allele frequencies, resulting in the removal of publicly available allele-frequency data.
The Beacon Project by the Global Alliance for Genomics and Health (GA4GH) aims to simplify data sharing through a web service (“beacon”) that provides only allele-presence information. Users can query institutional beacons for information about genomic data available at the institution. Queries are of the form “Do you have a genome that has a specific nucleotide (e.g., A) at a specific genomic position (e.g., position 11,272 on chromosome 1)?” and the beacon server can answer “yes” or “no.” Beacons are intended to be easily set up and to allow data sharing while protecting participant privacy. By providing only allele-presence information, beacons are safe from attacks that require allele frequencies.
However, a privacy breach from a beacon would be troubling given that beacons often summarize data with a particular disease of interest. For instance, identifying that a given genome is part of the SFARI beacon, which contains genomic data from families with a child affected by autism spectrum disorder, means that the individual belongs to a family where some member has autism spectrum disorder. Thus, beacons could leak not only membership information but also phenotype information. Although genetic privacy is protected to some extent by the Genetic Information Nondiscrimination Act (GINA), the offered protections are limited, and GINA does not apply to long-term care insurance, life insurance, disability insurance, or other special cases.
Therefore, all data-sharing mechanisms, including beacons, must protect participant privacy. To examine the question of re-identification in a beacon, we have developed a likelihood-ratio test (LRT) that uses allele presence or absence responses from a beacon to predict whether a given individual genome is present in the beacon database. Our approach is independent of allele frequencies. The statistical properties of the LRT guarantee that it is the most powerful test for this problem. A variation of our LRT can detect relatives of the query individual in the beacon. Our results suggest that anonymous-access beacons do not protect individual privacy and are open to re-identification attacks. As a result, they can also disclose phenotype information about individuals whose genomes are present in the beacon.
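The general shape of a likelihood-ratio test on yes/no responses can be sketched in a few lines of Python. This is a simplified illustration, not the authors' derivation: it assumes the beacon is queried only at positions where the target genome carries the allele, and that each response is an independent "yes" with some fixed probability under each hypothesis. The function name and the two probabilities (p_yes_absent, p_yes_present) are hypothetical placeholders; the paper derives its own quantities from beacon size and error assumptions.

```python
import math


def log_likelihood_ratio(responses, p_yes_absent, p_yes_present):
    """Log-likelihood ratio of 'target is in the beacon' vs 'target is not'.

    responses      -- list of booleans, one beacon answer per queried allele
    p_yes_absent   -- assumed chance of "yes" if the target is NOT in the beacon
    p_yes_present  -- assumed chance of "yes" if the target IS in the beacon
                      (below 1 to allow for sequencing or calling mismatches)
    """
    llr = 0.0
    for yes in responses:
        if yes:
            llr += math.log(p_yes_present / p_yes_absent)
        else:
            llr += math.log((1.0 - p_yes_present) / (1.0 - p_yes_absent))
    return llr


# Made-up responses and probabilities, for illustration only.
mostly_yes = [True] * 9 + [False]
all_no = [False] * 10
llr_mostly_yes = log_likelihood_ratio(mostly_yes, 0.5, 0.99)
llr_all_no = log_likelihood_ratio(all_no, 0.5, 0.99)
```

A large positive value favours membership and a negative value favours absence. In practice the decision threshold would be chosen to meet a specified false-positive rate, which is the sense in which the quoted passage describes the test as most powerful.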
'On Non-cooperative Genomic Privacy' by Mathias Humbert, Erman Ayday, Jean-Pierre Hubaux and Amalio Telenti in FC 2015: Financial Cryptography and Data Security (Springer, 2015) 407-426 comments
Over the last few years, the vast progress in genome sequencing has highly increased the availability of genomic data. Today, individuals can obtain their digital genomic sequences at reasonable prices from many online service providers. Individuals can store their data on personal devices, reveal it on public online databases, or share it with third parties. Yet, it has been shown that genomic data is very privacy-sensitive and highly correlated between relatives. Therefore, individuals’ decisions about how to manage and secure their genomic data are crucial. People of the same family might have very different opinions about (i) how to protect and (ii) whether or not to reveal their genome. We study this tension by using a game-theoretic approach. First, we model the interplay between two purely-selfish family members. We also analyze how the game evolves when relatives behave altruistically. We define closed-form Nash equilibria in different settings. We then extend the game to N players by means of multi-agent influence diagrams that enable us to efficiently compute Nash equilibria. Our results notably demonstrate that altruism does not always lead to a more efficient outcome in genomic-privacy games. They also show that, if the discrepancy between the genome-sharing benefits that players perceive is too high, they will follow opposite sharing strategies, which has a negative impact on the familial utility.
'Family tree and ancestry inference: is there a need for a ‘generational’ consent?' by Susan E. Wallace, Elli G. Gourna, Viktoriya Nikolova and Nuala A. Sheehan in (2015) 16 BMC Medical Ethics 87 comments
Genealogical research and ancestry testing are popular recreational activities but little is known about the impact of the use of these services on clients’ biological and social families. Ancestry databases are being enriched with self-reported data and data from deoxyribonucleic acid (DNA) analyses, but also are being linked to other direct-to-consumer genetic testing and research databases. As both family history data and DNA can provide information on more than just the individual, we asked whether companies, as a part of the consent process, were informing clients, and through them clients’ relatives, of the potential implications of the use and linkage of their personal data. 
Methods 
We used content analysis to analyse publically-available consent and informational materials provided to potential clients of ancestry and direct-to-consumer genetic testing companies to determine what consent is required, what risks associated with participation were highlighted, and whether the consent or notification of third parties was suggested or required. 
Results 
We identified four categories of companies providing: 1) services based only on self-reported data, such as personal or family history; 2) services based only on DNA provided by the client; 3) services using both; and 4) services using both that also have a research component. The amount of information provided on the potential issues varied significantly across the categories of companies. ‘Traditional’ ancestry companies showed the greatest awareness of the implications for family members, while companies only asking for DNA focused solely on the client. While in some cases companies included text recommending clients inform their relatives, showing they recognised the issues, often it was located within lengthy terms and conditions or privacy statements that may not be read by potential clients. 
Conclusions 
We recommend that companies should make it clearer that clients should inform third parties about their plans to participate, that third parties’ data will be provided to companies, and that that data will be linked to other databases, thus raising privacy and issues on use of data. We also suggest investigating whether a ‘generational consent’ should be created that would include more than just the individual in decisions about participating in genetic investigations.