19 June 2014

Big Bad Data

'Policing by Numbers: Big Data and the Fourth Amendment' by Elizabeth E. Joh in (2014) 89 Washington Law Review 35 argues that
The age of “big data” has come to policing. In Chicago, police officers are paying particular attention to members of a “heat list”: those identified by a risk analysis as most likely to be involved in future violence. In Charlotte, North Carolina, the police have compiled foreclosure data to generate a map of high-risk areas that are likely to be hit by crime. In New York City, the N.Y.P.D. has partnered with Microsoft to employ a “Domain Awareness System” that collects and links information from sources like CCTVs, license plate readers, radiation sensors, and informational databases. In Santa Cruz, California, the police have reported a dramatic reduction in burglaries after relying upon computer algorithms that predict where new burglaries are likely to occur. Unlike the data crunching performed by Target, Walmart, or Amazon, the introduction of big data to police work raises new and significant challenges to the regulatory framework that governs conventional policing. This article identifies three uses of big data and the questions that these tools raise about conventional Fourth Amendment analysis. Two of these examples, predictive policing and mass surveillance systems, have already been adopted by a small number of police departments around the country. A third example — the potential use of DNA databank samples — presents an untapped source of big data analysis. While seemingly quite distinct, these three examples of big data policing suggest the need to draw new Fourth Amendment lines now that the government has the capability and desire to collect and manipulate large amounts of digitized information.
'Privacy and Data-Based Research' by Ori Heffetz and Katrina Ligett in (2014) 28(2) Journal of Economic Perspectives 75–98 comments 
On August 9, 2006, the “Technology” section of the New York Times contained a news item titled “A Face Is Exposed for AOL Searcher No. 4417749,” in which reporters Michael Barbaro and Tom Zeller (2006) tell a story about big data and privacy: Buried in a list of 20 million Web search queries collected by AOL and recently released on the Internet is user No. 4417749. The number was assigned by the company to protect the searcher’s anonymity, but it was not much of a shield. No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” to “60 single men” to “dog that urinates on everything.” And search by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.” It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga. . . . Ms. Arnold, who agreed to discuss her searches with a reporter, said she was shocked to hear that AOL had saved and published three months’ worth of them. “My goodness, it’s my whole personal life,” she said. “I had no idea somebody was looking over my shoulder.” . . . “We all have a right to privacy,” she said. “Nobody should have found this all out.” 
Empirical economists are increasingly users, and even producers, of large datasets with potentially sensitive information. Some researchers have for decades handled such data (for example, certain kinds of Census data), and routinely think and write about privacy. Many others, however, are not accustomed to thinking about privacy, perhaps because their research traditionally relies on already-publicly-available data, or because they gather their data through relatively small, “mostly harmless” surveys and experiments. This ignorant bliss may not last long; detailed data of unprecedented quantity and accessibility are now ubiquitous. Common examples include a private database from an Internet company, data from a field experiment on massive groups of unsuspecting subjects, and confidential administrative records in digital form from a government agency. The AOL story above is from 2006; our ability to track, store, and analyze data has since then dramatically improved. While big data become difficult to avoid, getting privacy right is far from easy—even for data scientists.
This paper aims to encourage data-based researchers to think more about issues such as privacy and anonymity. Many of us routinely promise anonymity to the subjects who participate in our studies, either directly through informed consent procedures, or indirectly through our correspondence with Institutional Review Boards. But what is the informational content of such promises? Given that our goal is, ultimately, to publish the results of our research—formally, to publish functions of the data—under what circumstances, and to what extent, can we guarantee that individuals’ privacy will not be breached and their anonymity will not be compromised?
These questions may be particularly relevant in a big data context, where there may be a risk of more harm due to both the often-sensitive content and the vastly larger numbers of people affected. As we discuss below, it is also in a big data context that privacy guarantees of the sort we consider may be most effective. Our paper proceeds in three steps. First, we retell the stories of several privacy debacles that often serve as motivating examples in work on privacy. The first three stories concern intentional releases of de-identified data for research purposes. The fourth story illustrates how individuals’ privacy could be breached even when the data themselves are not released, but only a seemingly innocuous function of personal data is visible to outsiders. None of our stories involves security horrors such as stolen data, broken locks and passwords, or compromised secure connections. Rather, in all of them information was released that had been thought to have been anonymized, but, as was soon pointed out, was rather revealing.
Second, we shift gears and discuss differential privacy, a rigorous, portable privacy notion introduced roughly a decade ago by computer scientists aiming to enable the release of information while providing provable privacy guarantees. At the heart of this concept is the idea that the addition or removal of a single individual from a dataset should have nearly no effect on any publicly released functions of the data, but achieving this goal requires introducing randomness into the released outcome. We discuss simple applications, highlighting a privacy-accuracy tension: randomness leads to more privacy, but less accuracy.
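To make the mechanism the authors describe concrete, the following is a minimal illustrative sketch (not from the paper) of the Laplace mechanism, a standard way of achieving differential privacy for a simple counting query. The dataset, function name and epsilon values are hypothetical; the sketch only shows the two points made above: the released answer is the true answer plus calibrated random noise, and a smaller privacy parameter epsilon buys more privacy at the cost of accuracy.

```python
import numpy as np

def private_count(values, predicate, epsilon):
    """Release a count under epsilon-differential privacy via the Laplace mechanism.

    A counting query has sensitivity 1: adding or removing one individual
    changes the true count by at most 1, so Laplace noise with scale
    1/epsilon suffices. (Illustrative sketch, not the paper's code.)
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Privacy-accuracy tension: smaller epsilon means a stronger privacy
# guarantee but a noisier (less accurate) released count.
ages = [23, 35, 62, 41, 58, 29, 70, 44]  # hypothetical survey data
for eps in (0.1, 1.0, 10.0):
    noisy = private_count(ages, lambda a: a >= 60, eps)
    print(f"epsilon={eps}: noisy count of ages >= 60 is {noisy:.1f}")
```

Running the loop repeatedly illustrates the tension: with epsilon = 0.1 the reported count often lands far from the true value of 2, while with epsilon = 10 it is nearly exact but carries a much weaker privacy guarantee.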
Third, we offer lessons and reflections, discuss some limitations, and briefly mention additional applications. We conclude with reflections on current promises of “anonymity” to study participants—promises that, given common practices in empirical research, are not guaranteed to be kept. We invite researchers to consider either backing such promises with meaningful privacy-preserving techniques, or qualifying them. While we are not aware of major privacy debacles in economics research to date, the stakes are only getting higher.
'Governing, Exchanging, Securing: Big Data and the Production of Digital Knowledge' (Columbia Public Law Research Paper No. 14-390) by Bernard E. Harcourt comments 
The emergence of Big Data challenges the conventional boundaries between governing, exchange, and security. It ambiguates the lines between commerce and surveillance, between governing and exchanging, between democracy and the police state. The new digital knowledge reproduces consuming subjects who wittingly or unwittingly allow themselves to be watched, tracked, linked and predicted in a blurred amalgam of commercial and governmental projects. Linking back and forth from consumer data to government information to social media, these new webs of information become available to anyone who can purchase the information. How is it that governmental, commercial and security interests have converged, coincided, and also diverged in various ways in the production of Big Data? Which sectors have stimulated the production and mining of this information? How have the various projects aligned or contradicted each other? In this paper, I begin to explore these questions along two dimensions. First, I sketch in broad strokes the historical development and growth of the digital realm. I offer some categories to understand the mass of data that surrounds us today, and lay some foundation for the notion of a digital knowledge. Then, I investigate the new political economy of data that has emerged, as a way to suggest some of the larger forces that are at play in our new digital age.
'The deluge' by Fleur Johns in (2013) 1 London Review of International Law 9–34 comments
Scarcity is a critical adjuvant for international legal thought. Scarcity of water, diminishing forests, depleted natural resources; these are conditions about which international lawyers often worry. Yet abundance and proliferation are also sources of concern for international law. Mounting waste, rising obesity rates, the multiplying agendas of terrorist cells, an ever-mutating array of pathogens; a sense of ‘too much’ triggers as much global anxiety as ‘not enough’. 
Among the dimensions of excess by which the globe is said to be afflicted is a ‘data deluge’ being experienced globally. According to a May 2012 United Nations (UN) report, the world’s stock of digital data is expected to increase forty-four times between 2007 and 2020, doubling every twenty months. A new order of measurement has been coined to cope: the term zettabytes was introduced in 2010. This inundation is often greeted with awe. In scale and inevitability, it smacks of the sublime. Big data is said to define our age, even as it frequently exceeds our grasp. 
Data can be ‘big’ in a number of ways; in conceptual terms, in volume, in complexity. The term ‘big data’ alludes to the amassing and analysis of vast amounts of recombinable digital information that eludes corralling for a variety of reasons. The ‘global data economy’, with which big data is often associated, is sometimes framed as an ‘economy of goods and ideas . . . that trades in personal information’ with consumer-monitoring, advertising and Internet-based transacting comprising its ‘backbone’. Elsewhere, the global data economy appears as a comprehensively reconfigured and recharged version of the pre-existing global economy, whereby processing and manufacturing times are reduced, performance analysis capacity intensified, product and service-customisation increased, decision-making optimised, and ‘entirely new business models’ invented thanks to the harnessing of big data. Visions of economic yield surrounding big data have a significant financial dimension as well. Imagining augmented databases, data sets or data analysis capacities as income-yielding assets raises the prospect of their financing and securitisation generating value, independently of any other business innovations with which they might be associated. 
At the forefront of the worries by which the term ‘big data’ is often accompanied is a sense that ‘stewardship’ is lacking in the assembly, storage, management, analysis, distribution, business or developmental deployment and monetisation of data. Particular concerns surround accumulations and transfers of personal data, that is, data traceable to and about a natural person. Along silent data streams all around us — being channelled, pooled, commodified, corrupted, diverted, erased—human freedom and agency are draining away, or so it is often supposed. 
Uneasiness abounds: surely someone, somewhere, in all this, needs to have a hand—or a coordinated group of hands—on some tiller? An invisible hand would suffice for some, but invocations of the market don’t always reassure. ‘The challenges are great, and will only be solved by focused effort and collaboration’, wrote Clifford Lynch, executive director of the Coalition for Networked Information, in Nature a few years ago. ‘As the volume of data, and the need to manage it grows, disciplinary consensus [and] leadership will be very powerful factors in addressing [those] challenges’, Lynch continued. Then again, ‘necessary community standards are lacking’, Lynch acknowledged. On what, then, might ‘disciplinary consensus’ rest? Indeed, which discipline or disciplines and which leader or leaders are, or should be, in play? 
Breathiness, glee and hope are, at the same time, plentiful in the proximity of the term ‘big data’. The unprecedented global ‘data flood’ is life-giving and value-generating in many accounts. Personal data flows are said to yield ‘the new “oil”’ for the twenty-first century and big data ‘a new asset class touching all aspects of society’, according to the influential World Economic Forum (WEF). The WEF has identified a ‘new wave of opportunity for economic and societal value creation’ with the vast masses of digital data being created by and about people the world over. This value-creation opportunity demands, according to the WEF, ‘a new way of thinking about individuals’. It also demands global legal reform. ‘Unlocking the full potential of data’ would, the WEF has argued, require redress of the ‘lack of global legal interoperability’ and resolution of ‘points of tension’ around privacy, ownership, transparency and value distribution. 
Development opportunities potentially associated with big data engender particular excitement. ‘At any one point in time and space’, the UN has observed, ‘such data may . . . provid[e] an opportunity to figuratively take the pulse of communities’, a possibility that is ‘immensely consequential for society, perhaps especially for developing countries’. Households in developing countries might, the UN has speculated, leverage real-time digital data to improve their access to food, tailor energy use, gain better access to micro-credit or other financial support, access health advice, and contribute to early warning mechanisms and social movements while benefiting from the same. A growing community of ‘mobile money intellectuals’ has highlighted the potential that mobile telephone infrastructure and associated data streams may hold for better integrating and servicing the poor in global financial services markets. Yet, social scientists working on these developments have also warned against their unreserved celebration, cautioning that ‘[t]he failure to link technological questions to normative political questions [and, I would add, legal questions] can lead to undesirable outcomes’. 
On the whole, policy speculations about the burgeoning data economy and related global development opportunities seem largely dominated by ‘big-push logic’ — a logic that has met with circumspection in development economics over the past decade. That logic ‘stresses that poor economies need some sort of large demand expansion, to expand the size of the market, so that entrepreneurs will find it profitable to incur the fixed costs of industrialization’ (or post-industrial infrastructure improvement) and suggests that ‘anything that stimulates demand will do’, including resource discovery. Critics of this logic have observed that natural resource booms are sometimes accompanied by declining per-capita GDP. 
Public international law has had much to say, for a long time, about the ‘big push’ of natural resource discovery and the struggles and responsibilities to which it may give rise. International law has been a key battleground for claim-making and norm-making in such contexts. If data are, indeed, the ‘resources’ likely to fuel much twenty-first century development, then one might expect data’s mining and monetisation—amid developing countries’ growing numbers of consumers especially—to elicit a new set of stakeholder claims, conflicts and challenges to which international law should already be attuned. One might anticipate, for example, the emergence of something akin to the doctrine of permanent sovereignty over natural resources put forward by Third World states in the context of the commodities boom of the 1970s. Data-oriented, collective claims have been advanced in debates surrounding access to vaccines, expressed as assertions of ‘viral sovereignty’ over a national population’s genetic data. Yet, no comparable claims appear to have yet been formulated or anticipated in relation to the economic developments identified with big data. Online protest about individuals’ treatment as commodities rather than consumers, and generic scholarly recognition of a ‘digital divide’: this seems more or less the extent of collective global ‘speaking back’ to the WEF’s sweeping, big data-centred vision and its like. 
Before being carried away too readily by the WEF’s ‘new wave of opportunity’, this article would have readers pause. The goal of this article is to initiate and lay some groundwork for a conversation about big data within public international law scholarship. Such a public international law conversation is never likely to cover the field — or rather intersecting fields — engaged by the theme of big data. Nonetheless, if media historian Lisa Gitelman is right that ‘every discipline . . . has its own norms and standards for the imagination of data’, then nascent orthodoxies may already be at work in global law and policy surrounding personal data that are helping to structure and shape the emergent global data economy, and might yet do so further. If so, then the ‘models of intelligibility’ surrounding data and the person that global policy-making is bringing to bear merit elucidation and critique. What are these models’ lawful dimensions and what could be their potential ramifications for public international law? What questions and difficulties might a WEF-evoked vision of ‘global legal interoperability’ towards enhanced data flow pose for public international lawyers? This article does not purport to answer these questions but rather to raise them (and some related questions), to highlight what may be at stake in addressing them, and to propose an agenda for their collaborative pursuit. In short, it contends that there is much more at issue in the governance of the emerging global data economy than technical interface between existing legal systems and well-aired privacy concerns. 
The first part of this article sets out some emergent orthodoxies by which much of the international legal and policy literature concerning the global data economy appears marked. In each case, it offers some grounds for calling those orthodoxies into question. The second part of this article identifies some questions that developments in the global data economy call upon public international lawyers to address in broad-ranging, counter-disciplinary collaboration.