the murky swamp of mass atrocity data

Evangelists of “big data,” the possibility of computed knowledge at unprecedented scale, often describe our contemporary world as a “sea” of information. Data scientists have more and better knowledge of how humans behave, how they interact, how they cooperate, and how they conflict, generated as much by our own actions (through the Internet, mostly) as by those who surveil us. For some problems, the dataset is a near-perfect match. Commercial airlines use “frequent flyer” programs to track when their customers fly, and to where; electoral strategists manipulate marketing information to infer norms, cultural preferences, and political opinions among likely voters. Amid an unfathomable sea, these data are intimate and human. Sgt. Pepper’s “day in the life,” once framed by a cup of coffee, is now an ever-present data-stream. We wake up, we create data; we go to the bodega, we create data; we set up shop in a six-by-six cubicle. We create data.

Violent conflict, especially on a mass scale, is never so neat. Acts of violence don’t create data, but rather destroy them. Both local and global information economies suffer during conflict, as warring rumors proliferate and trickle into the exchange of information (knowledge, data) beyond a community’s borders. Observers create complex categories to simplify events, and to (barely) fathom violence as it scales and fragments and coheres and collapses. A “mass atrocity” is a fiction; an analytically and morally useful one, but a fiction nonetheless. We expect system to follow scale, but it rarely does. So rarely, in fact, that observers identify little more than one hundred mass atrocity events since the end of the cataclysmic Second World War. One hundred is a large number, but it’s a negligible fraction of the individual acts of violence that those events comprise.

Mass atrocity data have improved in fits and starts. The Global Database of Events, Language, and Tone (GDELT), a massive open-source computing effort, uses an automated, iterative data-stream to collect events. GDELT ingests information, imperfectly, to create a more perfect portrait of where events, including violence, occur globally. John Beieler, a political science PhD student at Penn State, recently experimented with the GDELT dataset of violent events in the Central African Republic (CAR) and South Sudan, both of which are embroiled in ongoing mass atrocities. Beieler used the dataset to assess the likelihood of future mass atrocities in either country, but came up short. Local and international media sources feature both conflicts: gruesome portraits grace A1, and prominent global officials publish opinion pieces to “bear witness” to CAR and South Sudan’s respective horrors. But media publications cover these events as “mass atrocities,” and not as a sequential series of individual violent events. In a coda, Beieler contrasts this with Egypt, which, because of a glut of foreign journalism, the availability of citizen reporting tools like Twitter, and robust foreign diplomatic engagement, appears as both “mass repression” and a sequential series. Our understanding of a conflict’s progression through time (what it is, as a global event) determines its media coverage, and therefore its usefulness as a big data subject.

The convergence of scarce media, knowledge, and data is not unique to massive datasets, nor to time-bounded events. The information that local aid groups use to assist conflict-affected communities is small, in comparison. Small data are complementary, not subordinate, to their massive counterparts. Humanitarian networks, mediators, and civil society organizations want to know where violence occurred and, consequently, where vulnerabilities persist. While time is a useful data point, location is essential. Without location, aid groups won’t know where to go or how far to extend their operations. As Christopher Neu, a peacebuilding technologist, observes, the usefulness of public small data rests on an ethical quandary: In a live conflict, do humanitarian small data expose the same vulnerabilities they aim to fix? Where GDELT’s big data are open-source, small data are inherently proprietary: they’re generated by a user, one who sometimes risks physical safety to report a violent event’s location. Proximity, which peacebuilders praise as the nuance big data lacks, also muddies the data pond it aspires to clarify.
