Secret Data Sunday – International Rescue Committee Edition

I haven’t posted for a while on this subject, so here’s some background.

The International Rescue Committee (IRC) did a series of surveys in the Democratic Republic of Congo (DRC).  The final installment summed up the IRC findings as follows:

Based on the results of the five IRC studies, we now estimate that 5.4 million excess deaths have occurred between August 1998 and April 2007. An estimated 2.1 million of those deaths have occurred since the formal end of war in 2002.

The IRC’s estimate of 5.4 million excess deaths received massive publicity, some of it critical, but journalists and scholars have mostly taken the IRC claim at face value.  The IRC work had substantial methodological flaws that were exposed in detail in the Human Security Report, and you should definitely have a look at that critique if you haven’t seen it.  But I won’t rehash all these issues in the present blog post.  Instead, I will just discuss data.

One of the main clouds hanging over the IRC work is the fact that three other surveys find child mortality rates to be steadily falling during the period when the IRC claims there was a massive spike in these rates.  (See this post and this post for more information.)  In particular, there are two DHS surveys and a MICS survey that strongly contradict the IRC claims.
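To see why the methodological details matter so much, recall how excess-death estimates are built: you take a death rate measured by the surveys, subtract an assumed baseline (pre-war) death rate, and multiply by population and time.  A key part of the Human Security Report critique concerned the IRC’s choice of baseline.  Here is a minimal sketch of that arithmetic in Python – the population, time span, and rates below are made-up illustrative inputs, not the IRC’s actual figures:

```python
# A minimal sketch of excess-mortality arithmetic.  All numbers are
# hypothetical illustrations, NOT the IRC's actual inputs.

def excess_deaths(observed_cdr, baseline_cdr, population, months):
    """Excess deaths from crude death rates (CDRs) quoted in deaths
    per 1,000 people per month:
    (observed - baseline) * population/1,000 * months."""
    return (observed_cdr - baseline_cdr) * (population / 1000) * months

POPULATION = 60_000_000   # rough DRC population (assumption)
MONTHS = 105              # August 1998 to April 2007, approximately
OBSERVED_CDR = 2.2        # hypothetical measured crude death rate

for baseline in (1.5, 1.8, 2.0):
    total = excess_deaths(OBSERVED_CDR, baseline, POPULATION, MONTHS)
    print(f"baseline {baseline}/1,000/month -> {total:,.0f} excess deaths")
```

With the observed rate held fixed, nudging the assumed baseline from 1.5 up to 2.0 deaths per 1,000 per month cuts the illustrative total by more than two thirds.  That sensitivity is exactly why the underlying survey data matter: you cannot check either rate without them.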

And guess what?

The DHS and MICS data are publicly available but the IRC hides its data.

As always, I don’t draw the conclusion of data hiding lightly; rather, I’ve tried pretty hard to persuade the relevant actors to come clean.

Frankly, I don’t think I’m under any obligation to make all these efforts.  I haven’t sent any emails to the DHS or MICS people because there’s no need to bother, given that their data are free for the taking.  But the IRC hasn’t posted their data, so I resorted to emails.

I wrote multiple times over many months, with no success, to Ben Coghlan of the Burnet Institute in Australia.  He led the last two rounds of the IRC research, including an academic publication in The Lancet, so he was a sensible starting point.

In the end, it would have been better if Coghlan had just done a Taleb and told me to “fuck off” straight away rather than stringing me along.  First he asked what I wanted to do with the data.  I feel that this is not an appropriate question, since data access shouldn’t really depend on one’s plans.  But I told him that I wanted to get to the bottom of why the IRC data were so inconsistent with the other data.  After prompting, he said he needed to delay because he was just finishing his PhD.  I made the obvious reply, pointing out that even while completing a PhD he should still be able to spare ten minutes to send a dataset.  On my next prompt he replied by asking me, rather disingenuously I thought, how my project was getting on.  I replied that I hadn’t been able to get out of the starting blocks because he hadn’t sent me any data.  I gave up after two more prompts.

Next I tried Jeannie Annan, the Senior Director of Research and Evaluation at the IRC.  She replied that she didn’t have the data and that I should try … Ben Coghlan and Les Roberts, who led the early rounds of the surveys.

I knew that Les Roberts would never cough up the data (too long a story for this blog post) but wrote to him anyway.  He didn’t reply.

I wrote back to Jeannie Annan saying that both Coghlan and Roberts were uncooperative but that, ultimately, this is IRC work and that the IRC needs to take responsibility for it. In my view:

  1. The IRC should have the data if they stand behind their work.
  2. If the IRC doesn’t have the data then they should insist that Roberts and Coghlan hand it over.
  3. If Roberts and Coghlan refuse to provide them with the data then the IRC should retract the work.

She didn’t reply.

Here’s where this unfortunate situation stands.

The IRC estimate of 5.4 million excess deaths in the DRC exerts a big influence on the conflict field and on the perceptions of the general public.  It is widely, but erroneously, believed that this DRC conflict has been the deadliest since World War 2.  The IRC estimate survives largely as conventional wisdom, despite the critique of the Human Security Report.

The IRC and the academics involved keep their data well hidden, choking off further discussion.

PS – Note that this is not only a tale of an NGO that doesn’t uphold scientific standards – there are also academics involved.  I say this because last week at least one person commented that, although Taleb’s behavior is appalling, he’s not really an academic.


Secret Data Sunday – Nassim Nicholas Taleb Edition

When data are central to scientific discussions, as is typically the case, then the relevant data should be open to all.

OK, we don’t have to be totally rigid about this.  People may sink a lot of effort into building a dataset, so it’s reasonable for data builders to milk their data monopoly for some grace period.  In my opinion, you get one publication.  Then you put your data into the public domain.

And public domain means public domain.  It’s not OK to hide your data from people you don’t like, from people you think are incompetent, from people you suspect of having engaged in acts of moral turpitude, etc.  You post your data so everyone can have them.

If you put your data into the public domain and someone does something stupid with it, then it’s fine to say so.  It’s a virtue to be nice, but being nice isn’t a requirement.  But as far as I’m concerned, you share your data or you’re not doing science.

Readers of the blog should be well aware that there has been a dispute about the decline of war (or not), primarily between Steven Pinker and Nassim Nicholas Taleb.  You can track my participation in this debate from a bunch of my blog entries and the links they contain.  I’m in the middle of preparing a conference talk on this subject, and I’ll post the slides later this week, so more is coming.

I planned a little data work to support the talk so I emailed Taleb asking him for the data he used to launch his attack on Pinker’s work.  Here is his reply.

1) It is not professional to publish a “flaw” without first contacting the authors. You did it twice.

2) Your 2 nitpicking “flaws” betrayed total ignorance of the subject.

So I will ask you to fuck off.

He is referring to this post (which did contain an error that I corrected after a reader pointed it out).

What can I say?

The main thing is that if he wants to do science then it’s not OK to just declare someone to be ignorant and withhold data.

Beyond that I’d say that if he still objects to something in my post he should be specific, either in the comments or to me directly.  As always, I’ll issue a correction or clarification if I get something wrong.

Third, it isn’t really standard to clear criticisms of someone’s work in advance with the person being criticized.  Doing this could be a reasonable strategy in some cases, and it’s reasonable to send criticism to the person being criticized.  Correcting errors, as I do, is essential.

Anyway, I take away from this episode that Taleb isn’t doing science and also that he probably doesn’t have great confidence in his work on this subject or else he wouldn’t hide his data.

Secret Data Sunday – AAPOR Investigates the Trump-Clinton Polling Miss Using Data You Can’t See

The long-awaited report from the American Association for Public Opinion Research (AAPOR) on the performance of polling in the Trump-Clinton race is out.  You will see that this material is less of a stretch for the blog than it might seem at first glance, and I plan a second post on it.

Today I just want to highlight the hidden-data issue, which rears its head very early in the report:

The committee is composed of scholars of public opinion and survey methodology as well as election polling practitioners. While a number of members were active pollsters during the election, a good share of the academic members were not. This mix was designed to staff the committee both with professionals having access to large volumes of poll data they knew inside and out, and with independent scholars bringing perspectives free from apparent conflicts of interest. The report addresses the following questions:

So on the one hand we have pollsters “having access to large volumes of poll data” and on the other hand we have “independent scholars” who … err … don’t normally have access to large volumes of polling data because the pollsters normally hide it from them.  (I’m not sure what the apparent conflict of interest of the pollsters is, but I guess it’s that they might be inclined to cover up errors they may have made in their election forecasts.)

You might well ask: why aren’t all these datasets in the public domain?


Sadly, there is no good answer to that question.

But the reason all these important data remain hidden is pretty obvious.  Pollsters don’t want independent analysts to embarrass them by finding flaws in their data or their analysis.

This is a bad reason.

There is a strong public interest in having the data available.  The data would help all of us, not just the AAPOR committee, understand what went wrong with polling in the Trump-Clinton race.  The data would also help us learn why Trump won, which is clearly an important question.


But we don’t have the data.

I understand that there are valid commercial reasons for holding polling data privately while you sell some stories about it.  But a month should be more than sufficient for this purpose.

It is unacceptable to say that sharing requires resources you don’t have, because sharing data just doesn’t require a lot of resources.  Yes, I know that I’ve whinged a bit on the blog about sharing all that State Department data and that I’m doing it in tranches.  Still, this effort is costing me only about 15-30 minutes per dataset.  It’s really not a big deal.

I suppose somebody might say that these datasets are collected privately and so it’s OK to permanently keep them private.  But election polls drive public discussions and probably affect election outcomes.  There is a really strong public interest in disclosure.

There is further material in the report on data openness:

Since provision of microdata is not required by the AAPOR Transparency Initiative, we are particularly grateful to ABC News, CNN, Michigan State University, Monmouth University, and University of Southern California/Los Angeles Times for joining in the scientific spirit of this investigation and providing microdata. We also thank the employers of committee members (Pew Research Center, Marquette University, SurveyMonkey, The Washington Post, and YouGov) for demonstrating this same commitment.

I’ve written before about how AAPOR demands transparency on everything except the main thing you would think of when it comes to survey transparency – showing your data.

I’ll return to this AAPOR problem in a future Secret Data Sunday.  But for now I just want to say that the Committee’s appeal to a “scientific spirit” falls flat.  Nobody outside the committee can audit the AAPOR report, and it will be unnecessarily difficult to further develop lines of inquiry initiated by the report, for one simple reason: nobody outside the committee has access to all of the data the committee analyzed.  This is not science.

OK, that’s all I want to say today.  I’ll return to the main points of the report in a future post.

Secret Data Sunday – The Iraq Child and Maternal Mortality Survey

Many readers of the blog know that there was a major cock-up over child mortality figures for Iraq.  In fact, exaggerated child mortality figures have been used to justify the 2003 invasion of Iraq, both prospectively and retrospectively.

Here I won’t repeat the basics one more time, although anyone unfamiliar with this debacle should click on the above link which, in turn, offers further links providing more details.

Today I just inject one new point into this discussion – the dataset for the UNICEF survey that wildly overestimated Iraq’s child mortality rates is not available.  (To be clear, estimates from this dataset are available, but the underlying data you need to audit the survey are hidden.)

The hidden survey is called the Iraq Child and Maternal Mortality Survey (ICMMS).  This graph (which you can enlarge on your screen) shows the ICMMS to be way out of line with no fewer than four subsequent surveys, all debunking the stratospheric ICMMS child mortality estimates.  The datasets for three of the four contradicting surveys are publicly available and open to scrutiny.  (I will return to the fourth of the contradicting surveys in a future blog post.)

But the ICMMS dataset is nowhere to be found – and I’ve looked for it.

For starters, I emailed UNICEF but couldn’t find anyone there who had it or was willing to share it.

I also requested the dataset multiple times from Mohamed Ali, the consulting statistician on the survey, who is now at the World Health Organization (WHO).

At one point Mohamed directed me to the acting head of the WHO office in Iraq, who blew me off before I had a chance to request the data from him.  But, then, you have to wonder what the current head of the WHO office in Iraq has to do with a 1990s UNICEF survey anyway.

I persisted with Mohamed, who then told me that if he still had the data it would be somewhere on some floppy disk.  This nostalgic reminder of an old technology is kind of cute, but it doesn’t let him off the hook for the dataset, which I never received on a floppy disk or otherwise.

There is a rather interesting further wrinkle in this saga of futility.  The ICMMS dataset was heavily criticized in research commissioned for the UN’s Oil-for-Food report:

It is clear, however, that widely quoted claims made in 1995 of 500,000 deaths of children under 5 as a result of sanctions were far too high;

John Blacker, Mohamed Ali and Gareth Jones then responded to this criticism with a 2007 academic article defending the ICMMS dataset:

A response to criticism of our estimates of under-5 mortality in Iraq, 1980-98.

Abstract

According to estimates published in this journal, the number of deaths of children under 5 in Iraq in the period 1991-98 resulting from the Gulf War of 1991 and the subsequent imposition of sanctions by the United Nations was between 400,000 and 500,000. These estimates have since been held to be implausibly high by a working group set up by an Independent Inquiry Committee appointed by the United Nations Secretary-General. We believe the working group’s own estimates are seriously flawed and cannot be regarded as a credible challenge to our own. To obtain their estimates, they reject as unreliable the evidence of the 1999 Iraq Child and Maternal Mortality Survey–despite clear evidence of its internal coherence and supporting evidence from another, independent survey. They prefer to rely on the 1987 and 1997 censuses and on data obtained in a format that had elsewhere been rejected as unreliable 30 years earlier.

For the record, the Blacker, Ali and Jones article is weak and unconvincing, and I may make it the subject of a future blog post.  But today I just concentrate on the (non)availability of the ICMMS dataset, so I won’t wander off into a critique of their article.

Thinking purely in terms of data availability, the 2007 article raises some interesting questions.  Was Mohamed Ali still working off floppy disks in 2007 when he published this article?  Surely he must have copied the dataset onto a hard disk to do the analysis.  And what about his co-authors?  They must have the dataset too, no?

Unfortunately, John Blacker has passed away, but Gareth Jones is still around, so I emailed him asking for the ICMMS dataset that he had defended so gamely.

He replied that he didn’t have access to the dataset when he wrote the 2007 article and still doesn’t have access now.  [MS – I reviewed the correspondence a few weeks after writing this post and realized that Jones did have access to the data way back when he worked for UNICEF but lost it after retiring a long time ago.  So he has seen the data but didn’t have it when writing his academic article defending it.]

Let that point sink in for a moment.  Jones co-authored an article in an academic journal, the only point of which was to defend the quality of a dataset.  Yet he didn’t have access to the dataset that he defended?  Sorry, but this doesn’t work for me.  As far as I’m concerned, when you write an article that is solely about the quality of a dataset, you need to at least take a little peek at the dataset itself.

I see two possibilities here and can’t decide which is worse.  Either these guys are pretending not to have a dataset that they actually do have, because they don’t want to make it public, or they have been defending the integrity of a dataset that they don’t even have.  Either way, they should stop the charade and declare that the ICMMS was just a big fat mistake.

I have known for a long time that the ICMMS was crap, but the myth it generated lives on.  It is time for the principal defenders of this sorry survey to officially flush it down the toilet.