'Feasibility of Reidentifying Individuals in Large National Physical Activity Data Sets From Which Protected Health Information Has Been Removed With Use of Machine Learning' by
Liangyuan Na, Cong Yang, Chi-Cheng Lo, Fangyuan Zhao, Yoshimi Fukuoka and Anil Aswani in (2018) 1(8)
JAMA Network Open e186040 comments
Policymakers have raised the possibility of identifying individuals or their actions based on activity
data, whereas device manufacturers and exercise-focused social networks maintain that sharing
deidentified data poses no privacy risks. Wearable device users are concerned with privacy
issues, and ethical consequences have been discussed. There are also potential legal
requirements from the Health Insurance Portability and Accountability Act (HIPAA) on the privacy of
activity data. One key unresolved question is whether it is possible to reidentify activity data. A
better understanding of the feasibility of such reidentification will provide guidance to researchers,
health care providers (ie, hospitals and physicians), and policymakers on creating practical privacy
regulations for activity data.
Reidentification of data is not just theoretical but has been demonstrated in several contexts.
For instance, demographics in an anonymized data set can function as a quasi-identifier that is
capable of being used to reidentify individuals. Reidentification is also possible using online search
data, movie rating data, social network data, and genetic data. However, a key feature in these
examples is a type of data sparsity, specifically, a large number of characteristics for each individual,
which leads to a diversity of combinations in such a way that any particular combination of the data is
identifying. For example, individuals’ movie ratings are highly revealing because of the many
permutations of likes and dislikes. As another example, the particular genetic sequence
combinations (and especially single-nucleotide polymorphisms) of a single individual are unique and
capable of identifying that individual.
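
To illustrate how a larger number of characteristics makes a combination identifying, the following minimal Python sketch counts the fraction of records whose combination of attributes occurs exactly once. The records, the field names (age, sex, zip3), and the helper unique_fraction are hypothetical and invented for illustration; they are not drawn from the data sets discussed in this article.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Return the fraction of records whose combination of `keys` is unique."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Hypothetical toy records (invented values, for illustration only).
records = [
    {"age": 34, "sex": "F", "zip3": "947"},
    {"age": 34, "sex": "F", "zip3": "941"},
    {"age": 34, "sex": "M", "zip3": "947"},
    {"age": 51, "sex": "F", "zip3": "947"},
    {"age": 51, "sex": "F", "zip3": "947"},
]

print(unique_fraction(records, ["age"]))                 # 0.0
print(unique_fraction(records, ["age", "sex"]))          # 0.2
print(unique_fraction(records, ["age", "sex", "zip3"]))  # 0.6
```

In this toy example, no record is unique by age alone, but three of the five records become unique once sex and a partial postal code are added.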
In contrast, physical activity data do not feature the type of data sparsity found in the above
examples because health data from a single individual often exhibit high variability. For example,
for heart rate, variability is a constant and expected feature in healthy and unhealthy individuals.
However, this variability does not protect against reidentification. A previous study found that high
temporal resolution data from wearable devices transform this variability into repeated patterns that
can be used for reidentification. In response, commercial organizations have argued that aggregated
sets of wearable device data (without the high resolution) cannot be reidentified. It was recently
reported that location information from activity trackers could be used to identify the location of
military sites. Although this is not strictly an example of reidentifying specific individuals, it is
nonetheless an example of the potential loss of privacy attributable to sharing of physical activity
data. As a result, many location data are no longer being shared by commercial organizations;
however, to our knowledge, reidentification excluding location data has not been studied or
demonstrated.
The primary aim of this study was to examine the feasibility of reidentifying activity data
(collected from wearable devices) that have been partially aggregated. In this article, we specifically
considered aggregations of an individual's activity data into walking intensity at the resolution of
20-minute intervals. This resolution represents a substantial level of aggregation compared with the
raw digital accelerometer data that were used for reidentification in a previous study. We further
studied other levels of aggregation (from 15-minute intervals to 24-hour intervals) in the
same manner.
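
The kind of aggregation described above can be sketched in Python as follows. This is a minimal illustration that assumes the input is a list of per-minute step counts for one day; the function name aggregate, the per-minute input format, and the use of a simple sum are assumptions made for this sketch and are not the processing pipeline used in the study.

```python
def aggregate(per_minute, interval_minutes=20):
    """Sum per-minute counts into consecutive bins of `interval_minutes`."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return [
        sum(per_minute[i:i + interval_minutes])
        for i in range(0, len(per_minute), interval_minutes)
    ]


# Hypothetical day of data: one step recorded in every minute (1440 minutes).
day = [1] * 1440

print(len(aggregate(day, 20)))    # 72 bins at 20-minute resolution
print(len(aggregate(day, 15)))    # 96 bins at 15-minute resolution
print(len(aggregate(day, 1440)))  # 1 bin at 24-hour resolution
```

Coarser intervals leave fewer values per day, which is why aggregation is often assumed to reduce the information available for reidentification.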
The scenario that we envisioned is summarized in Figure 1, and we give one specific scenario
to better describe the threat model considered in this article. This scenario involves an accountable
care organization (ACO), such as the Kaiser Permanente network, that has stored their patients’
demographic data, complete health records, and physical activity data, which were collected as part
of a weight loss intervention conducted by the ACO. This intervention involved recording physical
activity data using a smartphone, activity tracker, or smartwatch. This scenario also involves an
employer who has access to the names, demographic information, and physical activity data of their
employees. The employer has access to physical activity data because they were collected by a
smartphone, activity tracker, or smartwatch during the employees’ participation in a wellness
program in exchange for a discount on health insurance premiums. There is a potential danger to
privacy when the ACO shares deidentified data with the employer if the employer is able to reidentify
the data using demographics and physical activity data. We evaluated the feasibility of this scenario
by attempting to match a second data set of physical activity data and demographic information to a
first data set of record numbers, physical activity data, and demographic information. From the
standpoint of machine learning, matching record numbers is algorithmically and mathematically
equivalent to matching names or other identifying information.
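
The matching task in this threat model can be sketched in Python as follows. This is a deliberately simple nearest-neighbor stand-in, not the machine learning method evaluated in this article: for each identified record, it keeps only deidentified records with the same demographics and then picks the one whose activity vector is closest in Euclidean distance. The function match_records, the field names, and all values are hypothetical.

```python
import math


def match_records(deidentified, identified, demo_keys):
    """Map each identified name to the closest deidentified record number.

    Candidates must agree on every key in `demo_keys`; among them, the one
    with the smallest Euclidean distance between activity vectors is chosen.
    Returns None for a name when no candidate shares its demographics.
    """
    matches = {}
    for person in identified:
        candidates = [
            r for r in deidentified
            if all(r[k] == person[k] for k in demo_keys)
        ]
        if not candidates:
            matches[person["name"]] = None
            continue
        best = min(
            candidates,
            key=lambda r: math.dist(r["activity"], person["activity"]),
        )
        matches[person["name"]] = best["record"]
    return matches


# Hypothetical toy data sets (invented values, for illustration only).
deidentified = [
    {"record": 101, "age": 34, "sex": "F", "activity": [120, 0, 300]},
    {"record": 102, "age": 34, "sex": "F", "activity": [10, 500, 40]},
    {"record": 103, "age": 51, "sex": "M", "activity": [0, 0, 900]},
]
identified = [
    {"name": "A", "age": 34, "sex": "F", "activity": [15, 480, 35]},
    {"name": "B", "age": 51, "sex": "M", "activity": [5, 10, 850]},
    {"name": "C", "age": 70, "sex": "F", "activity": [1, 2, 3]},
]

print(match_records(deidentified, identified, ["age", "sex"]))
# {'A': 102, 'B': 103, 'C': None}
```

Replacing the record numbers with names would not change the computation, which is the sense in which matching record numbers stands in for matching identifying information.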