Secret Data Sunday – International Rescue Committee Edition

I haven’t posted for a while on this subject, so here’s some background.

The International Rescue Committee (IRC) did a series of surveys in the Democratic Republic of Congo (DRC).  The final installment summed up the IRC findings as follows:

Based on the results of the five IRC studies, we now estimate that 5.4 million excess deaths have occurred between August 1998 and April 2007. An estimated 2.1 million of those deaths have occurred since the formal end of war in 2002.

The IRC’s estimate of 5.4 million excess deaths received massive publicity, some of it critical, but journalists and scholars have mostly taken the IRC claim at face value.  The IRC work had substantial methodological flaws that were exposed in detail in the Human Security Report, and you should definitely have a look at that critique if you haven’t seen it.  But I won’t rehash all these issues in the present blog post.  Instead, I will just discuss data.

One of the main clouds hanging over the IRC work is the fact that three other surveys find child mortality rates to be steadily falling during the period when the IRC claims there was a massive spike in these rates.  (See this post and this post for more information.)  In particular, there are two DHS surveys and a MICS survey that strongly contradict the IRC claims.
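To see why this contradiction matters, it helps to spell out how excess-death estimates like the IRC’s are constructed: you take a measured mortality rate, subtract an assumed baseline rate, and multiply the gap by the population and the length of the period.  Here is a minimal back-of-the-envelope sketch; the rates, population, and time span below are illustrative assumptions on my part, not the IRC’s actual inputs.

```python
# Back-of-the-envelope excess-death calculation.
# All numbers here are illustrative assumptions, NOT the IRC's actual inputs.

def excess_deaths(measured_cmr, baseline_cmr, population, months):
    """Excess deaths = (measured rate - baseline rate) x population x time.

    Rates are crude mortality rates in deaths per 1,000 people per month.
    """
    return (measured_cmr - baseline_cmr) / 1000.0 * population * months

# Hypothetical inputs: a survey-measured rate of 2.0 and an assumed
# baseline of 1.5 deaths per 1,000 per month, applied to a population
# of 60 million over the roughly 105 months from August 1998 to April 2007.
print(excess_deaths(measured_cmr=2.0, baseline_cmr=1.5,
                    population=60_000_000, months=105))
# -> 3150000.0
```

The sketch makes the key point obvious: the whole estimate hinges on the gap between the measured and baseline rates.  If independent surveys show mortality falling rather than spiking during the same period, then the excess-death total can shrink dramatically.  That is exactly why access to the underlying survey data matters.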

And guess what?

The DHS and MICS data are publicly available but the IRC hides its data.

As always, I don’t level a charge of data hiding lightly; rather, I’ve tried pretty hard to persuade the relevant actors to come clean.

Frankly, I don’t think I’m under any obligation to make all these efforts.  I haven’t sent any emails to the DHS or MICS people because there’s no need to bother, given that their data are free for the taking.  But the IRC hasn’t posted their data so I resorted to emails.

Over many months I wrote multiple times, without success, to Ben Coghlan of the Burnet Institute in Australia.  He led the last two rounds of the IRC research, which included an academic publication in the Lancet, so he was a sensible starting point.

In the end, it would have been better if Coghlan had just done a Taleb and told me to “fuck off” straight away rather than stringing me along.  First he asked what I wanted to do with the data.  I feel that this is not an appropriate question, since data access shouldn’t really depend on one’s plans.  But I told him that I wanted to get to the bottom of why the IRC data were so inconsistent with the other data.  After prompting, he said he needed to delay because he was just finishing his PhD.  I made the obvious reply, pointing out that even while completing a PhD he should still be able to spare ten minutes to send a dataset.  On my next prompt he replied by asking me, rather disingenuously I thought, how my project was getting on.  I replied that I hadn’t been able to get out of the starting blocks because he hadn’t sent me any data.  I gave up after two more prompts.

Next I tried Jeannie Annan, the Senior Director of Research and Evaluation at the IRC.  She replied that she didn’t have the data and that I should try … Ben Coghlan and Les Roberts, who led the early rounds of the surveys.

I knew that Les Roberts would never cough up the data (too long a story for this blog post) but I wrote to him anyway.  He didn’t reply.

I wrote back to Jeannie Annan saying that both Coghlan and Roberts were uncooperative but that, ultimately, this is IRC work and that the IRC needs to take responsibility for it. In my view:

  1. The IRC should have the data if they stand behind their work.
  2. If the IRC doesn’t have the data then they should insist that Roberts and Coghlan hand it over.
  3. If Roberts and Coghlan refuse to provide them with the data then the IRC should retract the work.

She didn’t reply.

Here’s where this unfortunate situation stands.

The IRC estimate of 5.4 million excess deaths in the DRC exerts a big influence on the conflict field and on the perceptions of the general public.  It is widely, but erroneously, believed that the DRC conflict has been the deadliest since World War II.  The IRC estimate survives largely as conventional wisdom, despite the critique of the Human Security Report.

The IRC and the academics involved keep their data well hidden, choking off further discussion.

PS – Note that this is not only a tale of an NGO that doesn’t uphold scientific standards – there are also academics involved.  I say this because last week at least one person commented that, although Taleb’s behavior is appalling, he’s not really an academic.


Secret Data Sunday – Nassim Nicholas Taleb Edition

When data are central to scientific discussions, as is typically the case, then the relevant data should be open to all.

OK, we don’t have to be totally rigid about this.  People may sink a lot of effort into building a dataset, so it’s reasonable for data builders to milk their data monopoly for some grace period.  In my opinion, you get one publication.  Then you put your data into the public domain.

And public domain means public domain.  It’s not OK to hide your data from people you don’t like, from people you think are incompetent, from people you suspect of having engaged in acts of moral turpitude, etc.  You post your data so everyone can have them.

If you put your data into the public domain and someone does something stupid with it, then it’s fine to say so.  It’s a virtue to be nice, but being nice isn’t a requirement.  But as far as I’m concerned, you share your data or you’re not doing science.

Readers of the blog should be well aware that there has been a dispute about the decline of war (or not), primarily between Steven Pinker and Nassim Nicholas Taleb.  You can track my participation in this debate through a bunch of my blog entries and the links they contain.  I’m in the middle of preparing a conference talk on this subject, and I’ll post the slides later this week … so more is coming.

I planned a little data work to support the talk so I emailed Taleb asking him for the data he used to launch his attack on Pinker’s work.  Here is his reply.

1) It is not professional to publish a “flaw” without first contacting the authors. You did it twice.

2) Your 2 nitpicking “flaws” betrayed total ignorance of the subject.

So I will ask you to fuck off.

He is referring to this post (which did contain an error that I corrected after a reader pointed it out).

What can I say?

The main thing is that if he wants to do science then it’s not OK to just declare someone to be ignorant and withhold data.

Beyond that I’d say that if he still objects to something in my post he should be specific, either in the comments or to me directly.  As always, I’ll issue a correction or clarification if I get something wrong.

Third, it isn’t really standard practice to clear criticisms of someone’s work with that person in advance.  Doing so could be a reasonable courtesy in some cases, and it is certainly reasonable to send your criticism to the person being criticized.  Correcting errors, as I do, is essential.

Anyway, I take away from this episode that Taleb isn’t doing science and also that he probably doesn’t have great confidence in his work on this subject or else he wouldn’t hide his data.

Secret Data Sunday – AAPOR Investigates the Trump-Clinton Polling Miss Using Data You Can’t See

The long-awaited report from the American Association for Public Opinion Research (AAPOR) on the performance of polling in the Trump-Clinton race is out.  You will see that this material is less of a stretch for the blog than it might seem at first glance, and I plan a second post on it.

Today I just want to highlight the hidden data issue which rears its head very early in the report:

The committee is composed of scholars of public opinion and survey methodology as well as election polling practitioners. While a number of members were active pollsters during the election, a good share of the academic members were not. This mix was designed to staff the committee both with professionals having access to large volumes of poll data they knew inside and out, and with independent scholars bringing perspectives free from apparent conflicts of interest. The report addresses the following questions:

So on the one hand we have pollsters “having access to large volumes of poll data” and on the other hand we have “independent scholars” who … errr … don’t normally have access to large volumes of polling data, because the pollsters normally hide it from them.  (I’m not sure what the apparent conflict of interest of the pollsters is, but I guess it’s that they might be inclined to cover up errors they may have made in their election forecasts.)

You might well ask: how come all these datasets aren’t in the public domain?

[image: elephant]

Sadly, there is no good answer to that question.

But the reason all these important data remain hidden is pretty obvious.  Pollsters don’t want independent analysts to embarrass them by finding flaws in their data or their analysis.

This is a bad reason.

There is a strong public interest in having the data available.  The data would help all of us, not just the AAPOR committee, understand what went wrong with polling in the Trump-Clinton race.  The data would also help us learn why Trump won, which is clearly an important question.


But we don’t have the data.

I understand that there are valid commercial reasons for holding polling data privately while you sell some stories about it.  But a month should be more than sufficient for this purpose.

It is unacceptable to say that sharing requires resources that you don’t have, because sharing data just doesn’t require a lot of resources.  Yes, I know that I’ve whinged a bit on the blog about sharing all that State Department data and that I’m doing it in tranches.  Still, this effort is costing me only about 15-30 minutes per dataset.  It’s really not a big deal.
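For what it’s worth, here is a hypothetical sketch of what those 15-30 minutes might look like in practice.  The filenames and identifier columns are invented for illustration; the point is just that the mechanics of a release are trivial.

```python
# Hypothetical data-release script.  The filenames and identifier
# columns below are invented for illustration.
import pandas as pd

df = pd.read_csv("survey_responses_raw.csv")

# Drop direct identifiers before posting; everything else ships as-is.
identifiers = ["respondent_name", "phone_number", "gps_coordinates"]
df = df.drop(columns=identifiers)

df.to_csv("survey_responses_public.csv", index=False)
print(f"Wrote {len(df)} rows and {len(df.columns)} columns for release.")
```

Add a short codebook describing the columns and you’re done.  Genuinely sensitive fields can require more care, but that is anonymization work the original survey team has usually already done.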

I suppose somebody might say that these datasets are collected privately and so it’s OK to permanently keep them private.  But election polls drive public discussions and probably affect election outcomes.  There is a really strong public interest in disclosure.

There is further material in the report on data openness:

Since provision of microdata is not required by the AAPOR Transparency Initiative, we are particularly grateful to ABC News, CNN, Michigan State University, Monmouth University, and University of Southern California/Los Angeles Times for joining in the scientific spirit of this investigation and providing microdata. We also thank the employers of committee members (Pew Research Center, Marquette University, SurveyMonkey, The Washington Post, and YouGov) for demonstrating this same commitment.

I’ve written before about how AAPOR demands transparency on everything except the main thing you would think of when it comes to survey transparency – showing your data.

I’ll return to this AAPOR problem in a future Secret Data Sunday.  But for now I just want to say that the Committee’s appeal to a “scientific spirit” falls flat.  Nobody outside the committee can audit the AAPOR report, and it will be unnecessarily difficult to further develop lines of inquiry initiated by the report, for one simple reason: nobody outside the committee has access to all of the data the committee analyzed.  This is not science.

OK, that’s all I want to say today.  I’ll return to the main points of the report in a future post.

Secret Data Sunday: Why Does It Matter?

Bernie Sanders made some useful comments last week about the attempt, ultimately successful, to prevent Ann Coulter from speaking at Berkeley:

“What are you afraid of ― her ideas? Ask her the hard questions,” he concluded. “Confront her intellectually. Booing people down, or intimidating people, or shutting down events, I don’t think that that works in any way.”

I totally agree with Sanders, and anyone on the fence on this issue should read this article about how Georgetown students politely put tough questions to Sebastian Gorka, who had no answers and fled.

Sure, you might say, but what does this have to do with people who hide their data?

The connection has to do with confidence … or lack thereof.

If you fear that Ann Coulter will run rings around you in a debate then why not try to shut her down before she gets the chance?  But if you are confident that you can outmaneuver Seb Gorka then why not exchange views with him in public?

Similarly, if you are afraid that an independent researcher might expose embarrassing weaknesses in your data and/or analysis, then you are drawn to hiding your data.  But if you are confident in your data and your work, then you are not afraid of outside scrutiny.  In fact, you positively welcome outside scrutiny because you might learn something useful from it.

Another parallel between the two situations is that, in both cases, remaining closed should not be an allowable option.  Berkeley should have let Coulter speak (after having invited her) regardless of whether or not the dominant locals there are afraid of her.  Similarly, the dataset for the UN-sponsored Iraq Child and Maternal Mortality Survey really should be in the public domain, even though releasing it will embarrass UNICEF and some people associated with the survey.

Over the next few weeks I’ll continue to give examples of people hiding important conflict datasets.  I believe that lack of confidence is a common denominator that runs underneath all these situations.

We need to draw appropriate inferences when we ask for data and the answer is “no”.