What can you do with the Peru Data?

Somebody asked a fair question in the comments surrounding the release of the Peru dataset: what can you do with it?

That is a very big question that I can’t fully address in a blog post.  Still, I’ll try to offer a few useful thoughts.  Perhaps some readers will jump in with better ideas.  Also, I’d be delighted to hear from anyone who downloads the data and does something interesting with it.

Here’s some background.

First of all, it is event data.  This means that each line in the spreadsheet is a discrete occurrence, such as a battle or a massacre.  Each event comes with several pieces of information, such as the date, location, number of people killed, violent actors involved and type of event.
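To make the row-per-event structure concrete, here is a minimal sketch using Python's standard library.  The rows, column names and values are invented for illustration; the actual Peru dataset's fields may differ:

```python
import csv
import io

# Invented example rows in the general shape of conflict event data;
# the real dataset's columns and values will differ.
raw = (
    "date,region,event_type,perpetrator,killed\n"
    "1983-05-01,Ayacucho,massacre,Shining Path,12\n"
    "1983-06-14,Lima,bombing,Shining Path,3\n"
    "1984-02-09,Ayacucho,battle,Armed Forces,7\n"
)

events = list(csv.DictReader(io.StringIO(raw)))

# Each row is one discrete event; aggregate statistics come from summing rows.
total_killed = sum(int(row["killed"]) for row in events)
print(total_killed)  # 22
```

The key point is that totals, trends and breakdowns are all derived by aggregating these individual event records.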

The methodology documents posted on the conflict data page give a fair amount of detail on what is in the data and what the inclusion criteria are.  It could also be useful to read this data description for the Colombia conflict database (also posted on the conflict data page).  Of course, they are different conflicts and different databases, but the methodologies are very similar.

This paper by David Fielding and Anja Shortland used the Peru data to demonstrate escalation cycles (my phrase, not the authors’) in the conflict:

We show that an increase in civilian abuse by one side was strongly associated with subsequent increases in abuse by the other. In this type of war, foreign intervention could substantially reduce the impact on civilians of a sudden rise in conflict intensity, by moderating the resulting ‘cycle of violence’.

I’m afraid that the published version of their paper is behind a paywall but it should be possible to get hold of it if you really want to.

I believe that Fielding and Shortland didn’t use the event character of the data specifically, instead aggregating the events into monthly time series.  In this paper, however, we focused entirely on the events themselves, analyzing their sizes and timings:

Many collective human activities, including violence, have been shown to exhibit universal patterns. The size distributions of casualties both in whole wars from 1816 to 1980 and terrorist attacks have separately been shown to follow approximate power-law distributions. However, the possibility of universal patterns ranging across wars in the size distribution or timing of within-conflict events has barely been explored. Here we show that the sizes and timing of violent events within different insurgent conflicts exhibit remarkable similarities. We propose a unified model of human insurgency that reproduces these commonalities, and explains conflict-specific variations quantitatively in terms of underlying rules of engagement. Our model treats each insurgent population as an ecology of dynamically evolving, self-organized groups following common decision-making processes. Our model is consistent with several recent hypotheses about modern insurgency, is robust to many generalizations, and establishes a quantitative connection between human insurgency, global terrorism and ecology. Its similarity to financial market models provides a surprising link between violent and non-violent forms of human behaviour.

The Peru dataset was one of many we used in that article, which was about patterns in the size distributions and timings of events that appear in war after war, not just the war in Peru.
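For readers curious about what this kind of size-distribution analysis involves, here is a hedged sketch of estimating a power-law exponent from event sizes by maximum likelihood, using the standard continuous approximation (Clauset, Shalizi and Newman).  The event sizes below are invented, and this is not the actual estimation pipeline from the paper:

```python
import math

# Invented event sizes (deaths per event); real sizes would come from the dataset.
sizes = [1, 1, 2, 1, 3, 5, 1, 2, 8, 1, 1, 2, 13, 1, 4, 2, 1, 21, 1, 3]
xmin = 1  # smallest event size included in the fit

# Continuous-approximation maximum-likelihood estimate of the exponent:
#   alpha = 1 + n / sum(ln(x / xmin))
n = len(sizes)
alpha = 1 + n / sum(math.log(x / xmin) for x in sizes)
print(round(alpha, 2))
```

A careful analysis would also choose xmin from the data and test goodness of fit, but the one-line estimator above captures the core calculation.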

The reader’s comment also asked about possible projects for undergraduates.  I’m not sure how to answer this question without knowing more about what kinds of undergraduates we’re talking about and what kinds of skills they have.  But students could certainly do various data manipulation exercises such as breaking down the data by region, perpetrator or type of event.
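As an illustration of that kind of exercise, here is a sketch using invented events; the field names and values are hypothetical, not taken from the actual dataset:

```python
from collections import Counter

# Invented (region, perpetrator, killed) triples standing in for real events.
events = [
    ("Ayacucho", "Shining Path", 12),
    ("Lima", "Shining Path", 3),
    ("Ayacucho", "Armed Forces", 7),
    ("Junin", "Shining Path", 5),
]

deaths_by_region = Counter()
deaths_by_perpetrator = Counter()
for region, perpetrator, killed in events:
    deaths_by_region[region] += killed
    deaths_by_perpetrator[perpetrator] += killed

print(deaths_by_region.most_common(1))       # [('Ayacucho', 19)]
print(deaths_by_perpetrator.most_common(1))  # [('Shining Path', 20)]
```

Even this simple breakdown exercise forces students to think about coding decisions, missing fields and how categories were defined.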

I hope that this post was useful.  I would be happy to respond to further questions.



Secret Data Sunday – International Rescue Committee Edition

I haven’t posted for a while on this subject so here’s some background.

The International Rescue Committee (IRC) did a series of surveys in the Democratic Republic of Congo (DRC).  The final installment summed up the IRC findings as follows:

Based on the results of the five IRC studies, we now estimate that 5.4 million excess deaths have occurred between August 1998 and April 2007. An estimated 2.1 million of those deaths have occurred since the formal end of war in 2002.
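For readers unfamiliar with the term, "excess deaths" are deaths above what a baseline mortality rate would have produced over the same period.  The basic arithmetic can be sketched as follows; all numbers here are invented for illustration and are not the IRC's figures:

```python
# Invented rates, purely to illustrate the excess-death arithmetic.
baseline_rate = 1.5    # expected deaths per 1,000 people per month (pre-war)
observed_rate = 2.2    # measured deaths per 1,000 people per month during the war
population = 60_000_000
months = 12

excess = (observed_rate - baseline_rate) / 1000 * population * months
print(int(excess))  # 504000
```

The estimate is therefore only as good as the measured death rate and, crucially, the assumed baseline; both were contested in the IRC's case.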

The IRC’s estimate of 5.4 million excess deaths received massive publicity, some of it critical, but journalists and scholars have mostly taken the IRC claim at face value.  The IRC work had substantial methodological flaws that were exposed in detail in the Human Security Report and you should definitely have a look if you haven’t seen this critique. But I won’t rehash all these issues in the present blog post.  Instead, I will just discuss data.

One of the main clouds hanging over the IRC work is the fact that three other surveys find child mortality rates to be steadily falling during the period when the IRC claims there was a massive spike in these rates.  (See this post and this post for more information.)  In particular, there are two DHS surveys and a MICS survey that strongly contradict the IRC claims.

And guess what?

The DHS and MICS data are publicly available but the IRC hides its data.

As always, I don’t draw the conclusion of data hiding lightly but, rather, I’ve tried pretty hard to persuade the relevant actors to come clean.

Frankly, I don’t think I’m under any obligation to make all these efforts.  I haven’t sent any emails to the DHS or MICS people because there’s no need to bother, given that their data are free for the taking.  But the IRC hasn’t posted their data so I resorted to emails.

I wrote multiple times over many months with no success to Ben Coghlan of the Burnet Institute in Australia.  He led the last two rounds of the IRC research, including an academic publication in the Lancet, so he was a sensible starting point.

In the end, it would have been better if Coghlan had just done a Taleb and told me to “fuck off” straight away rather than stringing me along.  First he asked what I wanted to do with the data.  I feel that this is not an appropriate question since data access shouldn’t really depend on such plans.  But I told him that I wanted to get to the bottom of why the IRC data were so inconsistent with the other data.  After prompting, he said he needed to delay because he was just finishing his PhD.  I made the obvious reply, pointing out that even while completing a PhD he should still be able to spare ten minutes to send a dataset.  On my next prompt he replied by asking me, rather disingenuously I thought, how my project was getting on.  I replied that I hadn’t been able to get out of the starting blocks because he hadn’t sent me any data.  I gave up after two more prompts.

Next I tried Jeannie Annan, the Senior Director of Research and Evaluation at the IRC.  She replied that she didn’t have the data and that I should try… Ben Coghlan and Les Roberts, who led the early rounds of the surveys.

I knew that Les Roberts would never cough up the data (too long a story for this blog post) but wrote him anyway.  He didn’t reply.

I wrote back to Jeannie Annan saying that both Coghlan and Roberts were uncooperative but that, ultimately, this is IRC work and that the IRC needs to take responsibility for it. In my view:

  1. The IRC should have the data if they stand behind their work.
  2. If the IRC doesn’t have the data then they should insist that Roberts and Coghlan hand it over.
  3. If Roberts and Coghlan refuse to provide them with the data then the IRC should retract the work.

She didn’t reply.

Here’s where this unfortunate situation stands.

The IRC estimate of 5.4 million excess deaths in the DRC exerts a big influence on the conflict field and on the perceptions of the general public.  It is widely, but erroneously, believed that this DRC conflict has been the deadliest since World War 2.  The IRC estimate survives largely as conventional wisdom, despite the critique of the Human Security Report.

The IRC and the academics involved keep their data well hidden, choking off further discussion.

PS – Note that this is not only a tale of an NGO that doesn’t uphold scientific standards – there are also academics involved.  I say this because last week at least one person commented that, although Taleb’s behavior is appalling, he’s not really an academic.


Pinker versus Taleb: A Non-deadly Quarrel over the Decline of Violence

As promised, I’ve just posted the slides of the talk I gave yesterday at York University (with some overnight modifications).

You can get background, with links to further reading, here.

Somewhat bizarrely, Steven Pinker’s 2011 book was rocketing to the top of the Amazon best seller list, thanks to a Bill Gates tweet, right when I was talking about it at York.  So I guess my timing is good.

Secret Data Sunday – Nassim Nicholas Taleb Edition

When data are central to scientific discussions, as is typically the case, then the relevant data should be open to all.

OK, we don’t have to be totally rigid about this.  People may sink a lot of effort into building a data set so it’s reasonable for data builders to milk their data monopoly for some grace period.  In my opinion, you get one publication.  Then you put your data into the public domain.

And public domain means public domain.  It’s not OK to hide your data from people you don’t like, from people you think are incompetent, from people you suspect of having engaged in acts of moral turpitude, etc.  You post your data so everyone can have them.

If you put your data into the public domain and someone does something stupid with it then it’s fine to say that.  It’s a virtue to be nice but being nice isn’t a requirement.  But as far as I’m concerned you share your data or you’re not doing science.

Readers of the blog should be well aware that there has been a dispute about the decline of war (or not), primarily between Steven Pinker and Nassim Nicholas Taleb.  You can track my participation in this debate from a bunch of my blog entries and the links they contain.  I’m in the middle of preparing a conference talk on this subject, and I’ll post the slides later this week, so more is coming.

I planned a little data work to support the talk so I emailed Taleb asking him for the data he used to launch his attack on Pinker’s work.  Here is his reply.

1) It is not professional to publish a “flaw” without first contacting the authors. You did it twice.

2) Your 2 nitpicking “flaws” betrayed total ignorance of the subject.

So I will ask you to fuck off.

He is referring to this post (which did contain an error that I corrected after a reader pointed it out.)

What can I say?

The main thing is that if he wants to do science then it’s not OK to just declare someone to be ignorant and withhold data.

Beyond that I’d say that if he still objects to something in my post he should be specific, either in the comments or to me directly.  As always, I’ll issue a correction or clarification if I get something wrong.

Third, it isn’t really standard practice to clear criticisms of someone’s work in advance with the person being criticized.  Doing so could be a reasonable strategy in some cases, and it’s reasonable to send criticism to the person being criticized.  Correcting errors, as I do, is essential.

Anyway, I take away from this episode that Taleb isn’t doing science and also that he probably doesn’t have great confidence in his work on this subject or else he wouldn’t hide his data.

Secret Data Sunday – AAPOR Investigates the Trump-Clinton Polling Miss Using Data you Can’t See

The long-awaited report from the American Association for Public Opinion Research (AAPOR) on the performance of polling in the Trump-Clinton race is out.  You will see that this material is less of a stretch for the blog than it might seem to be at first glance and I plan a second post on it.

Today I just want to highlight the hidden data issue which rears its head very early in the report:

The committee is composed of scholars of public opinion and survey methodology as well as election polling practitioners. While a number of members were active pollsters during the election, a good share of the academic members were not. This mix was designed to staff the committee both with professionals having access to large volumes of poll data they knew inside and out, and with independent scholars bringing perspectives free from apparent conflicts of interest. The report addresses the following questions:

So on the one hand we have pollsters “having access to large volumes of poll data” and on the other hand we have “independent scholars” who… err… don’t normally have access to large volumes of polling data because the pollsters normally hide it from them.  (I’m not sure what the apparent conflict of interest of the pollsters is, but I guess it’s that they might be inclined to cover up errors they may have made in their election forecasts.)

You might well ask why all these datasets aren’t in the public domain.


Sadly, there is no good answer to that question.

But the reason all these important data remain hidden is pretty obvious.  Pollsters don’t want independent analysts to embarrass them by finding flaws in their data or their analysis.

This is a bad reason.

There is a strong public interest in having the data available.  The data would help all of us, not just the AAPOR committee, understand what went wrong with polling in the Trump-Clinton race.  The data would also help us learn why Trump won, which is clearly an important question.


But we don’t have the data.

I understand that there are valid commercial reasons for holding polling data privately while you sell some stories about it.  But a month should be more than sufficient for this purpose.

It is unacceptable to say that sharing requires resources that you don’t have because sharing data just doesn’t require a lot of resources.  Yes, I know that I’ve whinged a bit on the blog about sharing all that State Department data and I’m doing it in tranches.  Still, this effort is costing me only about 15-30 minutes per dataset.  It’s really not a big deal.

I suppose somebody might say that these datasets are collected privately and so it’s OK to permanently keep them private.  But election polls drive public discussions and probably affect election outcomes.  There is a really strong public interest in disclosure.

There is further material in the report on data openness:

Since provision of microdata is not required by the AAPOR Transparency Initiative, we are particularly grateful to ABC News, CNN, Michigan State University, Monmouth University, and University of Southern California/Los Angeles Times for joining in the scientific spirit of this investigation and providing microdata. We also thank the employers of committee members (Pew Research Center, Marquette University, SurveyMonkey, The Washington Post, and YouGov) for demonstrating this same commitment.

I’ve written before about how AAPOR demands transparency on everything except the main thing you would think of when it comes to survey transparency – showing your data.

I’ll return to this AAPOR problem in a future Secret Data Sunday.  But for now I just want to say that the Committee’s appeal to a “scientific spirit” falls flat.  Nobody outside the committee can audit the AAPOR report, and it will be unnecessarily difficult to further develop lines of inquiry initiated by the report, for one simple reason: nobody outside the committee has access to all of the data the committee analyzed.  This is not science.

OK, that’s all I want to say today.  I’ll return to the main points of the report in a future post.

I’ve Done Something or Other and Say that 470,000 People were Killed in Syria – Would you Like to Interview Me?

Let’s go back to February of 2016 when the New York Times ran this headline:

Death Toll from War in Syria now 470,000, Group Finds

The headline is more conservative than a caption in the same article which reads:

At least [my emphasis] 470,000 Syrians have died as a result of the war, according to the Syrian Center for Policy Research.

This switch between the headline and the caption is consistent with a common pattern of converting an estimate that might be either too high or too low into a bare minimum.

Other respected outlets, such as PBS and Time, jumped onto the 470,000 bandwagon, with the Guardian claiming primacy in this story with an early exclusive that quotes the report’s author:

“We use very rigorous research methods and we are sure of this figure,” Rabie Nasser, the report’s author, told the Guardian. “Indirect deaths will be greater in the future, though most NGOs [non-governmental organisations] and the UN ignore them.

“We think that the UN documentation and informal estimation underestimated the casualties due to lack of access to information during the crisis,” he said.

Oddly, none of the news articles say anything about what this rigorous methodology is.  The Guardian refers to “counting”, which I would normally interpret as saying that the Syrian Center for Policy Research (SCPR) has a list of 470,000 people killed, but it is not at all clear that they really have such a list.

This report was the source for all the media attention.  The figure of 470,000 appears just once in the report, in a throwaway line in the conclusion:

 The armed conflict badly harmed human development in Syria where the fatalities in 2015 reached about 470,000 deaths, the life expectancy at birth estimated at 55.4 years, and the school age non-attendance rate projected at 45.2 per cent; consequently, the HDI of Syria is estimated to have lost 29.8 per cent of its HDI value in 2015 compared to 2010.

The only bit of the report that so much as hints at where the 470,000 number came from is this:

The report used results and methodology from a forthcoming SCPR report on the human development in Syria that is based on a comprehensive survey conducted in the mid of 2014 and covered all regions in Syria. The survey divided Syria into 698 studied regions and questionnaire three key informants, with specific criteria that guarantee inclusiveness and transparency, from each region. Moreover, the survey applied a strict system of monitoring and reviewing to ensure the correctness of responses. About 300 researchers, experts, and programmers participated in this survey.

This is nothing.

The hunger for scraps of information on the number of people killed in Syria is, apparently, so great that it is feasible to launch a bunch of news headlines just by saying you’ve looked into this question and come up with a number that is larger than what was previously thought.  (I strongly suspect that having a bigger number which you use to dump on any smaller numbers is a key part of getting noticed.)

That said, the above quote does promise a new report with more details and eventually a new report was released – but the details in the new report on methodology are still woefully inadequate.  They divide Syria up, interview three key informants in each area and then, somehow, calculate the number of dead people based on these interviews.  I have no idea what this calculation looks like.  There is a bit of description on how SCPR picked their key informants but, beyond that, the new report provides virtually no information relevant for evaluating the 470,000 figure.  The SCPR doesn’t even provide a copy of their questionnaire and I can hardly even guess at what it looks like.

One thing is clear though – they did not use the standard sample survey method for estimating the number of violent deaths.  Under this approach you pick a bunch of households at random, do interviews on the number of people who have lived and died in each one and extrapolate a national death rate based on death rates observed in your sample households.  If the SCPR had done something like this then at least I would’ve had a sense of where the 470,000 number came from, although I’d still want to know details.
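The extrapolation step of that standard household-survey approach is simple arithmetic.  Here is a sketch with purely invented numbers, just to make the mechanics concrete:

```python
# Purely invented numbers, only to illustrate the extrapolation step of a
# standard household sample survey; not the SCPR's (unknown) method.
sampled_people = 5_000          # people in the randomly sampled households
sampled_violent_deaths = 15     # violent deaths those households reported
national_population = 20_000_000

death_rate = sampled_violent_deaths / sampled_people
estimated_national_deaths = death_rate * national_population
print(round(estimated_national_deaths))  # 60000
```

A real survey would also report sampling uncertainty around such an estimate, which is another reason the bare 470,000 figure is hard to evaluate.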

I emailed Rabie Nasser asking for details but didn’t hear back.  Who knows.  Maybe my message went into his spam folder.  There are other people associated with this work and I’ll try to contact them and will report back if I hear something interesting.

I want to be clear.  I’m not saying that this work is useless for estimating the number of people killed in the Syrian war.  In fact, I suspect that the SCPR generated some really useful information on this question and on other issues as well.  But until they explain what they actually did I would just disregard the work, particularly the 470,000 figure.  I’m not saying that I think this number is too high or that it is too low.  I just think that it is floating in thin air without any methodological moorings to enable us to understand it.

Journalists should lay off press releases taking the form of “I did some unspecified research and here are my conclusions.”