18 December 2017

Reidentification of Australian Health Data

Recalling past items on health data sharing (eg here and here) and restrictions on reidentification (eg here), it is interesting to see a solid Australian study of reidentification.

'Health Data in an Open World' by Chris Culnane, Benjamin I. P. Rubinstein and Vanessa Teague comments:
With the aim of informing sound policy about data sharing and privacy, we describe successful re-identification of patients in an Australian de-identified open health dataset. As in prior studies of similar datasets, a few mundane facts often suffice to isolate an individual. Some people can be identified by name based on publicly available information. Decreasing the precision of the unit-record level data, or perturbing it statistically, makes re-identification gradually harder at a substantial cost to utility. We also examine the value of related datasets in improving the accuracy and confidence of re-identification. Our re-identifications were performed on a 10% sample dataset, but a related open Australian dataset allows us to infer with high confidence that some individuals in the sample have been correctly re-identified. Finally, we examine the combination of the open datasets with some commercial datasets that are known to exist but are not in our possession. We show that they would further increase the ease of re-identification.
The authors note
In August 2016, pursuing the Australian government’s policy of open government data, the federal Department of Health published online the de-identified longitudinal medical billing records of 10% of Australians, about 2.9 million people. For each selected patient, all publicly-reimbursed medical and pharmaceutical bills for the years 1984 to 2014 were included. Suppliers' and patients' IDs were encrypted, though it was obvious which bills belonged to the same person.
In September 2016 we decrypted IDs of suppliers (doctors, midwives etc) and informed the department. The dataset was then taken offline. In this paper we show that patients can also be re-identified, without decryption, by linking the unencrypted parts of the record with known information about the individual. Our aim is to inform policy about data sharing and privacy with a scientific demonstration of the ease of re-identification of this kind of data. We notified the Department of Health of these findings in December 2016.
Access to high quality, and at times sensitive, data is a modern necessity for many areas of research. The challenge we face is in how to deliver that access, whilst still protecting the privacy of the individuals in the associated datasets. There is a misconception that this is either a solved problem, or an easy problem to solve. Whilst there are a number of proposals (Australian Government Productivity Commission, 2017), they need further research, development, and analysis. 
One thing is certain: open publication of de-identified data is not a secure solution for sensitive unit-record level data.
Our motivation in this work is to highlight the challenges and demonstrate the surprising ease with which de-identification can fail. Conquering this challenge will require open and transparent discussion and research, in advance of any future releases. This report concludes with some specific alternative suggestions, including the use of differential privacy for published data, and secure, controlled access to sensitive data for researchers.
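As a rough illustration of the differential privacy suggestion (a sketch of mine, not code from the report), the standard Laplace mechanism publishes a count with random noise whose scale is the query's sensitivity divided by the privacy parameter epsilon. The function names, the example count of 120 and the epsilon of 0.5 are all hypothetical; a counting query is assumed, so the sensitivity is 1.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with
    rate 1/scale follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_count(true_count, epsilon, rng):
    """Release a count under epsilon-differential privacy (Laplace mechanism).

    Assumes adding or removing one person changes the count by at
    most 1, so the sensitivity is 1 and the noise scale is 1/epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical example: 120 patients billed for some procedure.
rng = random.Random(0)
print(noisy_count(120, epsilon=0.5, rng=rng))
```

A smaller epsilon means more noise and stronger protection, which is the utility trade-off the authors describe.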
Our findings replicate those of similar studies of other de-identified datasets:
• A few mundane facts taken together often suffice to isolate an individual. 
• Some patients can be identified by name from publicly available information. 
• Decreasing the precision of the data, or perturbing it statistically, makes re-identification gradually harder at a substantial cost to utility.
We first examine uniqueness according to basic medical procedures such as childbirth. We show that some individuals are unique given public information, and show also that many patients are unique given a few basic facts such as year of birth and dates of childbirth.
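To make the idea of uniqueness concrete, here is a minimal sketch (mine, not the authors' method) that measures what fraction of records are the only ones with a given combination of known facts. The field names and the four toy records are invented for illustration and bear no relation to the released dataset.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Fraction of records whose values for `keys` occur exactly once."""
    if not records:
        return 0.0
    signatures = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(signatures)
    unique = sum(1 for s in signatures if counts[s] == 1)
    return unique / len(records)


# Invented toy records: year of birth plus dates of childbirth.
records = [
    {"year_of_birth": 1980, "childbirth_dates": ("2005-03-14", "2008-11-02")},
    {"year_of_birth": 1980, "childbirth_dates": ("2005-03-14",)},
    {"year_of_birth": 1980, "childbirth_dates": ("2010-07-21",)},
    {"year_of_birth": 1975, "childbirth_dates": ("2010-07-21",)},
]

print(unique_fraction(records, ["year_of_birth"]))                      # 0.25
print(unique_fraction(records, ["childbirth_dates"]))                   # 0.5
print(unique_fraction(records, ["year_of_birth", "childbirth_dates"]))  # 1.0
```

Each fact alone leaves several candidates, but the combination isolates every record, which is the pattern the authors report.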
Although the data is only a 10% sample, we can quantify the confidence of re-identifications, which can be high. We use a second dataset of population-wide billing frequencies, which sometimes shows that the person is unique in the whole population.
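One simple way to see how a population-wide count can give confidence in a match found in a sample is the following back-of-envelope sketch. It is my own simplification, not necessarily the calculation used in the paper, and it assumes the sample was drawn uniformly at random, that the target is known to have the matching features, and that the population count is exact. Under those assumptions, if k people in the whole population share the features, each is equally likely to be the person behind a matching sample record, so the confidence is 1/k.

```python
def match_confidence(population_matches):
    """Probability that a matching sample record belongs to the target.

    Assumes uniform random sampling and that exactly
    `population_matches` people in the whole population (including
    the target) share the features used for the match.
    """
    if population_matches < 1:
        raise ValueError("the target must be among the population matches")
    return 1.0 / population_matches


print(match_confidence(1))  # 1.0: unique in the whole population
print(match_confidence(4))  # 0.25: three other people fit equally well
```

When the population count is 1, the match is certain even though only 10% of people were sampled.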
We then examine uniqueness according to the characteristics of commercial datasets we know of but do not have. We find high uniqueness rates that would allow linking with a commercial pharmaceutical dataset. We also explain that, consistent with “Unique in the shopping mall” (de Montjoye, Radaelli, Singh, & Pentland, 2015), financial transactions in the dataset are sufficient for easy re-identification by the patient’s bank.