'‘For good measure’: data gaps in a big data world' by Sarah Giest and Annemarie Samuels, Policy Sciences (2020)
Comments
Policy and data scientists have paid ample attention to the amount of data being collected and the challenge for policymakers to make use of it. However, far less attention has been paid to the quality and coverage of this data, specifically as it pertains to minority groups. The paper argues that while there is seemingly more data for policymakers to draw on, the quality of the data, in combination with potential known or unknown data gaps, limits government’s ability to create inclusive policies. In this context, the paper defines primary, secondary, and unknown data gaps, which cover scenarios of knowingly or unknowingly missing data and how such gaps are potentially compensated for through alternative measures. Based on a review of the literature from various fields and a variety of examples highlighted throughout the paper, we conclude that the big data movement, combined with more sophisticated methods in recent years, has opened up new opportunities for government to use existing data in different ways as well as to fill data gaps through innovative techniques. Focusing specifically on the representativeness of such data, however, shows that data gaps affect the economic opportunities, social mobility, and democratic participation of marginalized groups. The big data movement in policy may thus create new forms of inequality that are harder to detect and whose impact is more difficult to predict.
The authors argue:
Since the amount of data has increased, there is a widespread techno-optimist notion that so-called big data will provide better information and that this better information will in turn facilitate better decisions. Big data largely refers to collections of data so large, varied, and dynamic that they cannot be handled through conventional processing methods; the term often combines enormous volumes of digital data with advanced data analysis (Klievink et al. 2017; Vydra and Klievink 2019). In this context, some specifically point toward the new forms of social data generated by internet users (Mergel et al. 2016). However, marginalized groups often produce less data, ‘because they are less involved in the formal economy and its data-generating activities [or because they] have unequal access to and relatively less fluency in the technology necessary to engage online’ (Barocas and Selbst 2016, 685). In other words, some people do not engage in the activities that advanced analytics is designed to capture (Lerman 2013). Therefore, while there is seemingly more data for policymakers to draw on (Giest 2017), mining data can reproduce existing patterns of discrimination and exclusion by drawing on biased data. At the core of this paper is thus the idea that even though the volume of data has increased in recent years, the quality of the data, in combination with potential known or unknown data gaps, limits government’s ability to create inclusive policies. Simply put, having a lot of data does not necessarily mean that the data are representative and reliable (Desouza and Smith 2014) or that governments are able to utilize them. In this context, Lerman (2013) and Hand (2020) talk about ‘big data’s exclusions’ and ‘dark data’ respectively. Both conclude that the data used can have hidden gaps that differ depending on how the data were collected and analyzed as well as on the kind of questions being asked.
In addition, these gaps might contain non-random and systematic omissions, which can lead to data that exclude or underrepresent people at the margins, whether because of poverty, geography, or lifestyle (Lerman 2013; Hand 2020).
Beyond this, however, data gaps affecting marginalized groups in policymaking have received limited attention over the years. The literature on this topic focuses largely on the Global South in the context of data agency and bottom-up data generation as well as defiance (e.g., Milan and Treré 2019). Another stream of the literature highlights potential biases in big data, zooming in on social media data (e.g., Hargittai 2018; Olteanu et al. 2019). For this paper, we are particularly interested in how these data gaps manifest in different areas of government decision-making and how they potentially impact policymaking and public services. We define data gaps as data on particular elements or social groups that are knowingly or unknowingly missing when policy is made on the basis of large datasets. We distinguish among three categories, which are summarized in Table 1. A data gap may occur either when part of the data needed for policymaking is absent or when it is present but underused or of low quality. Importantly, the gap may be either known or unknown. In each case, the data gap may lead to an incomplete picture for policymaking.
First, data may be unavailable, and this gap is known to government. In this scenario, where the gap has been detected, government can compensate with alternative measures, which, as will be discussed below, have their own pitfalls. Policymakers may also decide not to follow up on collecting the missing data. This is what we define as a ‘primary data gap’. In a second version of this scenario, the data gap might be unknown to government. Here, hidden data gaps can lead policymakers to rely on datasets that unintentionally underrepresent certain groups, which can have wider repercussions for public decision-making and may mean that smaller, potentially vulnerable groups are overlooked. In a scenario where the gap is known and data are available, government may encounter additional hurdles. These can arise because the required data are proprietary and in the hands of private companies, or because government lacks the expertise or resources to utilize them. Finally, the data that are available may be of poor quality and thus unable to serve as a good ‘fix’ for the gap they are meant to fill. This is what we call a ‘secondary data gap’. These aspects are particularly relevant when we turn to ‘inclusive policymaking’. The OECD (2019) draws attention to an approach to policymaking that better understands how policies are designed and implemented. This, according to the OECD, builds on reliable and relevant information in order to make informed decisions. If reliable information is missing, or if a dataset is perceived as complete while in fact containing gaps, this creates a problem for those affected by policies based on incomplete data.
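The three categories can be read as a small decision rule. The sketch below is only one possible reading of the paragraph above, not the authors' own formalization, and Table 1 of the paper may draw the lines differently. All names (`GapType`, `DataSituation`, `classify_gap`, and the field names) are hypothetical, and the choice to treat every unknown gap as ‘hidden’ is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class GapType(Enum):
    PRIMARY = "primary"      # data unavailable, gap known to government
    SECONDARY = "secondary"  # gap known, data exist but are hard to use or poor
    HIDDEN = "hidden"        # gap not known to government


@dataclass(frozen=True)
class DataSituation:
    # Hypothetical field names standing in for the conditions in the text.
    gap_known: bool               # is government aware that data are missing?
    data_available: bool          # do data that could fill the gap exist?
    proprietary: bool = False     # data held by private companies
    lacks_capacity: bool = False  # government lacks expertise or resources
    poor_quality: bool = False    # available data are a poor 'fix'


def classify_gap(s: DataSituation) -> Optional[GapType]:
    """Map a situation to a gap category under the assumptions stated above."""
    if not s.gap_known:
        # Assumption: any gap government is unaware of counts as hidden.
        return GapType.HIDDEN
    if not s.data_available:
        return GapType.PRIMARY
    if s.proprietary or s.lacks_capacity or s.poor_quality:
        return GapType.SECONDARY
    # Known gap with usable data: no remaining gap in this reading.
    return None


if __name__ == "__main__":
    example = DataSituation(gap_known=True, data_available=True, proprietary=True)
    print(classify_gap(example))
```

The order of the checks mirrors the order of the text: awareness of the gap comes first, then availability, then the hurdles that turn available data into a secondary gap.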
The following sections discuss the primary, secondary, and hidden data gaps in more detail, using examples. The analysis also shows how flaws in the data affect public decision-making and service delivery. The final section raises larger questions about data input and output in times of big data and how these change the way governments see marginalized groups and design policies for them.