26 August 2016

NIST on Deidentification

The draft NIST Special Publication 800-188, De-Identifying Government Datasets [PDF], by Simson L. Garfinkel, states:
De-identification removes identifying information from a dataset so that the remaining data cannot be linked with specific individuals. Government agencies can use de-identification to reduce the privacy risk associated with collecting, processing, archiving, distributing or publishing government data. Previously NIST published NISTIR 8053, “De-Identification of Personal Information,” which provided a survey of de-identification and re-identification techniques. This document provides specific guidance to government agencies that wish to use de-identification. Before using de-identification, agencies should evaluate their goals in using de-identification and the potential risks that de-identification might create. Agencies should decide upon a de-identification release model, such as publishing de-identified data, publishing synthetic data based on identified data, or providing a query interface to identified data that incorporates de-identification. Agencies can use a Disclosure Review Board to oversee the process of de-identification; they can also adopt a de-identification standard with measurable performance levels. Several specific techniques for de-identification are available, including de-identification by removing identifiers and transforming quasi-identifiers and the use of formal de-identification models that rely upon Differential Privacy. De-identification is typically performed with software tools which may have multiple features; however, not all tools that mask personal information provide sufficient functionality for performing de-identification. This document also includes an extensive list of references, a glossary, and a list of specific de-identification tools, although the mention of these tools is only to convey the range of tools currently available, and is not intended to imply recommendation or endorsement by NIST.
The document goes on to state:
The US Government collects, maintains, and uses many kinds of datasets. Every federal agency creates and maintains internal datasets that are vital for fulfilling its mission, such as delivering services to taxpayers or ensuring regulatory compliance. Federal agencies can use de-identification to make government datasets available while protecting the privacy of the individuals whose data are contained within those datasets.
Increasingly these government datasets are being made available to the public. For datasets that contain personal information, agencies generally remove that personal information prior to making the datasets publicly available. De-identification is a term used within the US Government to describe the removal of personal information from data that are collected, used, archived, and shared. De-identification is not a single technique, but a collection of approaches, algorithms, and tools that can be applied to different kinds of data with differing levels of effectiveness. In general, the potential risk to privacy posed by a dataset’s release decreases as more aggressive de-identification techniques are employed, but data quality decreases as well.
The modern practice of de-identification comes from three distinct intellectual traditions:
• For four decades, official statistical agencies have researched and investigated methods broadly termed Statistical Disclosure Limitation (SDL) or Statistical Disclosure Control.
• In the 1990s there was an increase in the unrestricted release of microdata, or individual responses from surveys or administrative records. Initially these releases merely stripped obviously identifying information such as names and Social Security numbers (what are now called direct identifiers). Following some releases, researchers discovered that it was possible to re-identify individuals by triangulating with some of the remaining attributes (now called quasi-identifiers or indirect identifiers). The result of this research was the development of the k-anonymity model for protecting privacy, which is reflected in the HIPAA Privacy Rule.
• In the 2000s, computer science research in the area of cryptography involving private information retrieval, database privacy, and interactive proof systems developed the theory of differential privacy, which is based on a mathematical definition of the privacy loss to an individual resulting from queries on a database containing that individual’s personal information (the standard formal statement is sketched after this list). Starting with this definition, researchers in the field of differential privacy have developed a variety of mechanisms for minimizing the amount of privacy loss associated with various database operations.
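For reference, the formal definition at the heart of this third tradition is standard in the literature and can be sketched as follows (paraphrased here, not quoted from the NIST draft): a randomized mechanism \mathcal{M} satisfies \varepsilon-differential privacy if, for every pair of datasets D and D' that differ in a single individual's record, and for every set S of possible outputs,

    \[ \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] \]

Smaller values of \varepsilon bound more tightly how much any one person's presence in the data can change what an observer of the output can learn.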
In recognition of both the growing importance of de-identification within the US Government and the paucity of efforts addressing de-identification as a holistic field, NIST began research in this area in 2015. As part of that investigation, NIST researched and published NIST Interagency Report 8053, De-Identification of Personal Information.
Since the publication of NISTIR 8053, NIST has continued research in the area of de-identification. NIST met with de-identification experts within and outside the United States Government, convened a Government Data De-Identification Stakeholder’s Meeting in June 2016, and conducted an extensive literature review.
The decisions and practices regarding the de-identification and release of government data can be integral to the mission and proper functioning of a government agency. As such, these activities should be managed by an agency’s leadership in a way that assures performance and results consistent with the agency’s mission and legal authority. Before engaging in de-identification, agencies should clearly articulate their goals in performing the de-identification, the kinds of data that they intend to de-identify, and the uses that they envision for the de-identified data. Agencies should also conduct a risk assessment that takes into account the potential adverse actions that might result from the release of the de-identified data; this risk assessment should include analysis of the risk that the data might be re-identified and the risk posed by the mere release of the de-identified data itself.
One way that agencies can manage this risk is by creating a formal Disclosure Review Board (DRB) consisting of stakeholders within the organization and representatives of the organization’s leadership. The DRB should evaluate applications for de-identification that describe the data to be released, the techniques that will be used to minimize the risk of disclosure, and how the effectiveness of those techniques will be evaluated.
Several specific models have been developed for the release of de-identified data. These include:
• The Release and Forget model: The de-identified data may be released to the public, typically by being published on the Internet.
• The Data Use Agreement (DUA) model: The de-identified data may be made available to qualified users under a legally binding data use agreement that details what can and cannot be done with the data.
• The Simulated Data with Verification model: The original dataset is used to create a simulated dataset that preserves many of the aspects of the original dataset. The simulated dataset is released, either publicly or to vetted researchers. The simulated data can be used to develop queries or analytic software; these queries and/or software can then be provided to the agency and applied to the original data. The results of the queries and/or analytics can then be subjected to Statistical Disclosure Limitation and the results provided to the researchers.
• The Enclave model: The de-identified data may be kept in a segregated enclave that restricts the export of the original data and instead accepts queries from qualified researchers, runs the queries on the de-identified data, and responds with the results (a minimal sketch of such a query interface follows this list).
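To make the query-interface idea concrete, here is a minimal Python sketch, not drawn from the NIST draft, of an enclave-style endpoint that answers counting queries under the Laplace mechanism of differential privacy; the record fields, the choice of epsilon, and the function names are illustrative assumptions.

    import random

    def laplace_noise(scale):
        # The difference of two i.i.d. exponential variables with mean
        # `scale` is Laplace-distributed with scale `scale`.
        return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

    def private_count(records, predicate, epsilon):
        # A counting query has sensitivity 1: adding or removing one
        # person's record changes the true answer by at most 1, so the
        # Laplace mechanism adds noise of scale 1/epsilon to achieve
        # epsilon-differential privacy.
        true_count = sum(1 for r in records if predicate(r))
        return true_count + laplace_noise(1.0 / epsilon)

    # Enclave-style use: the raw records never leave the server; only
    # the noisy aggregate is returned to the researcher.
    records = [{"age": 34, "smoker": True},
               {"age": 51, "smoker": False},
               {"age": 29, "smoker": True}]
    print(private_count(records, lambda r: r["smoker"], epsilon=0.5))

The design point is that the noise is calibrated to the query's sensitivity divided by epsilon, so the privacy guarantee holds regardless of what the researcher asks.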
Agencies can create or adopt standards to guide those performing de-identification. The standards can specify disclosure limitation techniques, or they can specify privacy guarantees that the de-identified data must uphold. There are many techniques available for de-identifying data; most of these techniques are specific to a particular data modality. Some techniques are based on ad hoc procedures, while others are based on formal privacy models that make it possible to rigorously calculate the amount of data manipulation required to assure a particular level of privacy protection.
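As an illustration of how one such formal model can be checked mechanically, the following Python sketch, again not from the publication, drops a direct identifier, generalizes two quasi-identifiers, and verifies that the release satisfies k-anonymity; the field names and the ten-year age bands are assumptions made for the example.

    from collections import Counter

    def generalize_age(age):
        # Coarsen an exact age into a ten-year band, one possible recoding.
        low = (age // 10) * 10
        return "%d-%d" % (low, low + 9)

    def k_anonymize(records, k):
        # Direct identifiers (here, "name") are dropped by building new
        # rows that carry only generalized quasi-identifiers.
        released = [{"age_band": generalize_age(r["age"]),
                     "zip3": r["zip"][:3]}
                    for r in records]
        # k-anonymity requires every combination of quasi-identifier
        # values to be shared by at least k rows.
        groups = Counter((row["age_band"], row["zip3"]) for row in released)
        if groups and min(groups.values()) < k:
            raise ValueError("release violates k-anonymity; generalize further")
        return released

    sample = [{"name": "A", "age": 34, "zip": "20500"},
              {"name": "B", "age": 37, "zip": "20502"},
              {"name": "C", "age": 31, "zip": "20501"}]
    print(k_anonymize(sample, k=3))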
De-identification is generally performed by software. Features required of this software include detection of identifying information; calculation of re-identification probabilities; performing de-identification; mapping identifiers to pseudonyms; and providing for the selective revelation of pseudonyms. Today there are several non-commercial open source programs for performing de-identification but only a few commercial products. Currently there are no performance standards, certifications, or third-party testing programs available for de-identification software.
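The pseudonym-mapping and selective-revelation features listed above might be implemented along the lines of the following sketch; the class and method names are invented for illustration and do not correspond to any particular tool.

    import secrets

    class PseudonymVault:
        # Maps identifiers to random pseudonyms and retains the mapping,
        # under separate access control, so that individual pseudonyms
        # can later be selectively revealed.

        def __init__(self):
            self._forward = {}   # identifier -> pseudonym
            self._reverse = {}   # pseudonym -> identifier

        def pseudonymize(self, identifier):
            # Issue a stable random token for each distinct identifier.
            if identifier not in self._forward:
                token = secrets.token_hex(8)
                self._forward[identifier] = token
                self._reverse[token] = identifier
            return self._forward[identifier]

        def reveal(self, pseudonym):
            # Selective revelation: recover the original identifier.
            return self._reverse[pseudonym]

    vault = PseudonymVault()
    p = vault.pseudonymize("123-45-6789")
    print(p, "->", vault.reveal(p))

Keeping the mapping in a separate, access-controlled store is what distinguishes reversible pseudonymization from de-identification that discards identifiers outright.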