The draft NIST Special Publication 800-188, De-Identifying Government Datasets [PDF], by Simson L. Garfinkel comments:
De-identification removes identifying information from a dataset so that the remaining data cannot be linked with specific individuals. Government agencies can use de-identification to reduce the privacy risk associated with collecting, processing, archiving, distributing or publishing government data.
Previously NIST published NISTIR 8053, "De-Identification of Personal Information," which provided a survey of de-identification and re-identification techniques. This document provides specific guidance to government agencies that wish to use de-identification. Before using de-identification, agencies should evaluate their goals in using de-identification and the potential risks that de-identification might create. Agencies should decide upon a de-identification release model, such as publishing de-identified data, publishing synthetic data based on identified data, or providing a query interface to identified data that incorporates de-identification. Agencies can use a Disclosure Review Board to oversee the process of de-identification; they can also adopt a de-identification standard with measurable performance levels. Several specific techniques for de-identification are available, including de-identification by removing identifiers and transforming quasi-identifiers, and the use of formal de-identification models that rely upon differential privacy. De-identification is typically performed with software tools which may have multiple features; however, not all tools that mask personal information provide sufficient functionality for performing de-identification. This document also includes an extensive list of references, a glossary, and a list of specific de-identification tools, although these tools are mentioned only to convey the range of tools currently available, not to imply recommendation or endorsement by NIST.
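To make the first technique in that list concrete, here is a minimal sketch of de-identification by removing direct identifiers and generalizing quasi-identifiers. The column names and generalization rules (truncating ZIP codes, bucketing ages) are illustrative assumptions, not anything the draft prescribes:

```python
# Sketch: de-identify records by (1) dropping direct identifiers and
# (2) generalizing quasi-identifiers. Column names and generalization
# rules are hypothetical examples.

DIRECT_IDENTIFIERS = {"name", "ssn"}  # removed outright

def de_identify(record):
    """Drop direct identifiers and coarsen quasi-identifiers."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    out["zip"] = out["zip"][:3] + "**"          # 20740 -> 207**
    decade = (out["age"] // 10) * 10
    out["age"] = f"{decade}-{decade + 9}"       # 34 -> 30-39
    return out

record = {"name": "A. Smith", "ssn": "078-05-1120",
          "zip": "20740", "age": 34, "diagnosis": "flu"}
print(de_identify(record))
# {'zip': '207**', 'age': '30-39', 'diagnosis': 'flu'}
```

Note that generalization alone carries no formal guarantee; whether the result is safe to release depends on how many records share each combination of quasi-identifier values.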
The document goes on to state:
The US Government collects, maintains, and uses many kinds of datasets. Every federal agency creates and maintains internal datasets that are vital for fulfilling its mission, such as delivering services to taxpayers or ensuring regulatory compliance. Federal agencies can use de-identification to make government datasets available while protecting the privacy of the individuals whose data are contained within those datasets.

Increasingly these government datasets are being made available to the public. For the datasets that contain personal information, agencies generally first remove that personal information from the dataset prior to making the datasets publicly available. De-identification is a term used within the US Government to describe the removal of personal information from data that are collected, used, archived, and shared. De-identification is not a single technique, but a collection of approaches, algorithms, and tools that can be applied to different kinds of data with differing levels of effectiveness. In general, the potential risk to privacy posed by a dataset's release decreases as more aggressive de-identification techniques are employed, but data quality decreases as well.
The modern practice of de-identification comes from three distinct intellectual traditions:
• For four decades, official statistical agencies have researched and investigated methods broadly termed Statistical Disclosure Limitation (SDL) or Statistical Disclosure Control.
• In the 1990s there was an increase in the unrestricted release of microdata, or individual responses from surveys or administrative records. Initially these releases merely stripped obviously identifying information such as names and Social Security numbers (what are now called direct identifiers). Following some releases, researchers discovered that it was possible to re-identify individual data by triangulating with some of the remaining identifiers (now called quasi-identifiers or indirect identifiers). The result of this research was the development of the k-anonymity model for protecting privacy, which is reflected in the HIPAA Privacy Rule (see the sketch after this list).
• In the 2000s, computer science research in the area of cryptography involving private information retrieval, database privacy, and interactive proof systems developed the theory of differential privacy, which is based on a mathematical definition of the privacy loss to an individual resulting from queries on a database containing that individual's personal information. Starting with this definition, researchers in the field of differential privacy have developed a variety of mechanisms for minimizing the amount of privacy loss associated with various database operations (a minimal example also follows this list).
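To make the k-anonymity model concrete: a table is k-anonymous if every combination of quasi-identifier values it contains is shared by at least k records. This is a minimal sketch, assuming the quasi-identifier columns are already known; it only measures the property, and achieving a target k would require further generalization or suppression:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest equivalence class over the quasi-identifier
    columns; the table is k-anonymous for this k."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())

rows = [
    {"zip": "207**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "207**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "208**", "age": "40-49", "diagnosis": "flu"},
]
print(k_anonymity(rows, ["zip", "age"]))  # 1 -- the 208** record is unique
```

And here is a minimal sketch of one such differential-privacy mechanism, the Laplace mechanism for counting queries; the epsilon value is an illustrative assumption:

```python
import math
import random

def laplace_noise(scale):
    """Sample from Laplace(0, scale) by inverting its CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon):
    """Answer a counting query with epsilon-differential privacy. A count
    has sensitivity 1 (one person changes it by at most 1), so Laplace
    noise with scale 1/epsilon suffices."""
    return true_count + laplace_noise(1.0 / epsilon)

print(round(dp_count(1234, epsilon=0.5)))  # noisy count, e.g. 1231
```

Smaller epsilon means less privacy loss but noisier answers; that tradeoff is the differential-privacy analogue of the risk/quality tradeoff described earlier.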
In recognition of both the growing importance of de-identification within the US Government and the paucity of efforts addressing de-identification as a holistic field, NIST began research in this area in 2015. As part of that investigation, NIST researched and published NIST Interagency Report 8053, De-Identification of Personal Information. Since the publication of NISTIR 8053, NIST has continued research in the area of de-identification. NIST met with de-identification experts within and outside the United States Government, convened a Government Data De-Identification Stakeholder's Meeting in June 2016, and conducted an extensive literature review.
The decisions and practices regarding the de-identification and release of government data can be integral to the mission and proper functioning of a government agency. As such, these activities should be managed by an agency's leadership in a way that assures performance and results in a manner that is consistent with the agency's mission and legal authority.
Before engaging in de-identification, agencies should clearly articulate their goals in performing the de-identification, the kinds of data that they intend to de-identify, and the uses that they envision for the de-identified data. Agencies should also conduct a risk assessment that takes into account the potential adverse actions that might result from the release of the de-identified data; this risk assessment should include analysis of the risk that might result from the data being re-identified and the risk that might result from the mere release of the de-identified data itself.
One way that agencies can manage this risk is by creating a formal Disclosure Review Board (DRB) consisting of stakeholders within the organization and representatives of the organization's leadership. The DRB should evaluate applications for de-identification that describe the data to be released, the techniques that will be used to minimize the risk of disclosure, and how the effectiveness of those techniques will be evaluated.
Several specific models have been developed for the release of de-identified data. These include:

• The Release and Forget model: The de-identified data may be released to the public, typically by being published on the Internet.

• The Data Use Agreement (DUA) model: The de-identified data may be made available to qualified users under a legally binding data use agreement that details what can and cannot be done with the data.

• The Simulated Data with Verification model: The original dataset is used to create a simulated dataset that contains many of the aspects of the original dataset. The simulated dataset is released, either publicly or to vetted researchers. The simulated data can be used to develop queries or analytic software; these queries and/or software can then be provided to the agency and applied to the original data. The results of the queries and/or analytics processes can then be subjected to Statistical Disclosure Limitation and the results provided to the researchers.

• The Enclave model: The de-identified data may be kept in some kind of segregated enclave that restricts the export of the original data, and instead accepts queries from qualified researchers, runs the queries on the de-identified data, and responds with results (a minimal query-interface sketch follows this list).
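As a concrete illustration of the enclave idea: researchers never see rows, only answers to aggregate queries, and small cells are suppressed (a classic Statistical Disclosure Limitation rule). The threshold and schema here are assumptions for illustration, not anything the draft specifies:

```python
# Sketch of an enclave-style query interface: only counting queries are
# answered, and counts below a suppression threshold are withheld.
# The threshold of 5 and the sample data are hypothetical.

SUPPRESSION_THRESHOLD = 5

def answer_count_query(records, predicate):
    """Run a counting query inside the enclave, applying cell suppression."""
    count = sum(1 for r in records if predicate(r))
    return count if count >= SUPPRESSION_THRESHOLD else None  # None = suppressed

data = [{"age": 30 + i, "state": "MD"} for i in range(20)]
print(answer_count_query(data, lambda r: r["state"] == "MD"))  # 20
print(answer_count_query(data, lambda r: r["age"] == 31))      # None
```

A production enclave would also need to limit how researchers combine queries, since even suppressed aggregates can leak information across overlapping queries.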
Agencies can create or adopt standards to guide those performing de-identification. The standards can specify disclosure techniques, or they can specify privacy guarantees that the de-identified data must uphold. There are many techniques available for de-identifying data; most of these techniques are specific to a particular modality. Some techniques are based on ad hoc procedures, while others are based on formal privacy models that make it possible to rigorously calculate the amount of data manipulation required to assure a particular level of privacy protection.
De-identification is generally performed by software. Features required of this software include detection of identifying information; calculation of re-identification probabilities; performing de-identification; mapping identifiers to pseudonyms; and providing for the selective revelation of pseudonyms.
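One way to implement one of those features, mapping identifiers to pseudonyms, is with a keyed hash: the same identifier always yields the same pseudonym, the mapping cannot be reversed without the key, and withholding the key from data recipients enables the selective revelation the text mentions. This is a sketch under those assumed requirements; actual de-identification tools may use different constructions:

```python
import hashlib
import hmac

# Hypothetical key; in practice it would come from a secrets manager and
# be retained only by the party authorized to re-identify.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(identifier: str) -> str:
    """Map an identifier to a stable, non-reversible pseudonym
    (HMAC-SHA256, truncated for readability)."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

print(pseudonymize("078-05-1120"))                                 # stable pseudonym
print(pseudonymize("078-05-1120") == pseudonymize("078-05-1120"))  # True
```

Selective revelation would then amount to keeping an access-controlled escrow table from pseudonym back to identifier.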
Today there are several non-commercial open source programs for performing de-identification, but only a few commercial products. Currently there are no performance standards, certification, or third-party testing programs available for de-identification software.