05 January 2019

Reidentification

'Feasibility of Reidentifying Individuals in Large National Physical Activity Data Sets From Which Protected Health Information Has Been Removed With Use of Machine Learning' by Liangyuan Na, Cong Yang, Chi-Cheng Lo,Fangyuan Zhao, Yoshimi Fukuoka and Anil Aswani in (2018) 1(8) JAMA Network Open e186040 comments
Policymakers have raised the possibility of identifying individuals or their actions based on activity data, whereas device manufacturers and exercise-focused social networks maintain that sharing deidentified data poses no privacy risks. Wearable device users are concerned with privacy issues, and ethical consequences have been discussed. There are also potentially legal requirements from the Health Insurance Portability and Accountability Act (HIPAA) on the privacy of activity data. One key unresolved question is whether it is possible to reidentify activity data. A better understanding on the feasibility of such reidentification will provide guidance to researchers, health care providers (ie, hospitals and physicians), and policymakers on creating practical privacy regulations for activity data. 
Reidentification of data is not just theoretical but has been demonstrated in several contexts. For instance, demographics in an anonymized data set can function as a quasi-identifier that is capable of being used to reidentify individuals. Reidentification is also possible using online search data, movie rating data, social network data, and genetic data. However, a key feature in these examples is a type of data sparsity, specifically, a large number of characteristics for each individual, which leads to a diversity of combinations in such a way that any particular combination of the data is identifying. For example, individuals’ movie ratings are highly revealing because of the many permutations of likes and dislikes. As another example, the particular genetic sequence combinations (and especially single-nucleotide polymorphisms) of a single individual are unique and capable of identifying that individual. 
In contrast, physical activity data do not feature the type of data sparsity found in the above examples because health data from a single individual often exhibit high variability. For example, for heart rate, variability is a constant and expected feature in healthy and unhealthy individuals. However, this variability does not protect against reidentification. A previous study found that high temporal resolution data from wearable devices transform this variability into repeated patterns that can be used for reidentification. In response, commercial organizations have argued that aggregated sets of wearable device data (without the high resolution) cannot be reidentified. It was recently reported that location information from activity trackers could be used to identify the location of military sites. Although this is not strictly an example of reidentifying specific individuals, it is nonetheless an example of the potential loss of privacy attributable to sharing of physical activity data. As a result, many location data are no longer being shared by commercial organizations; however, to our knowledge, reidentification excluding location data has not been studied or demonstrated. 
The primary aim of this study was to examine the feasibility of reidentifying activity data (collected from wearable devices) that have been partially aggregated. In this article, we specifically considered aggregations of an individual's activity data into walking intensity at the resolution of 20-minute intervals. This intensity represents a substantial level of aggregation compared with the raw digital accelerometer data that were used for reidentification in a previous study. We further studied other different levels of aggregation (from 15-minute intervals to 24-hour intervals) in the same manner. 
The scenario that we envisioned is summarized in Figure 1, and we gave one specific scenario to better describe the threat model considered in this article. This scenario involves an accountable care organization (ACO), such as the Kaiser Permanente network, that has stored their patients’ demographic data, complete health records, and physical activity data, which were collected as part of a weight loss intervention conducted by the ACO. This intervention involved recording physical activity data using a smartphone, activity tracker, or smartwatch. This scenario also involved an employer who has access to the names, demographic information, and physical activity data of their employees. The employer has access to physical activity data because they were collected by a smartphone, activity tracker, or smartwatch during the employees’ participation in a wellness program in exchange for a discount on health insurance premiums. There is a potential danger to privacy when the ACO shares deidentified data with the employer if the employer is able to reidentify the data using demographics and physical activity data. We evaluated the feasibility of this scenario by attempting to match a second data set of physical activity data and demographic information to a first data set of record numbers, physical activity data, and demographic information. From the standpoint of machine learning, matching record numbers is algorithmically and mathematically equivalent to matching names or other identifying information