Secret Data Sunday – Iraq Family Health Survey

The WHO-sponsored Iraq Family Health Survey (IFHS) led to a nice publication in the New England Journal of Medicine that came complete with an editorial puff piece extolling its virtues. According to the NEJM website this publication has generated 60 citations and counting. If you cast a wider net than just medical publications, the citation count must run well into the hundreds.

But the IFHS virtues don’t stop there. The NEJM paper, and the accompanying report, are well written and supply plenty of good methodological information about the survey. The authors are pretty up front about the limitations of their work, notably that they had to skip interviews in some areas due to security concerns. Moreover, the IFHS is an important survey not least because its estimate of 150,000 violent deaths discredited the Burnham et al. estimate of 600,000 violent deaths for almost exactly the same time period. (The Burnham et al. survey hid its methodology and was afflicted by serious ethical and data integrity problems.)

I have cited the IFHS multiple times in my own work and generally believe in it.  At the same time, the IFHS people did several questionable things with their analysis that I would like to correct, or at least investigate, by reanalyzing the IFHS data.

But here’s the rub.  The WHO has not released the IFHS dataset.

I and other people have requested it many times.  The field work was conducted way back in 2006.  So what is the WHO waiting on?

I’ll leave a description of my unrealized reanalysis to a future post. This is because my plans just don’t matter for the issue at hand; the IFHS data should be in the public domain whether or not I have a good plan for analyzing them.  (See this post on how the International Rescue Committee hides its DRC data in which I make the same point.)

There is an interesting link between the IFHS and the Iraq Child and Maternal Mortality Survey, another important dataset that is also unavailable. The main point of contact for both surveys is Mohamed Ali of the WHO. Regarding the IFHS, Mohamed seemed to tell me in an email that only the Iraqi government is empowered to release the dataset. If so, this suggests a new (at least for me) and disturbing problem:

Apparently, the WHO uses public money to sponsor surveys but then sells out the general public by ceding their data distribution rights to local governments, in this case to Iraq.  

This practice of allowing governments that benefit from UN-sponsored research to withhold data from the public that pays for the research is unacceptable. It’s great that the WHO sponsors survey research in needy countries, but open data should be a precondition for this service.


Secret Data Sunday – Nassim Nicholas Taleb Edition

When data are central to scientific discussions, as is typically the case, then the relevant data should be open to all.

OK, we don’t have to be totally rigid about this.  People may sink a lot of effort into building a data set so it’s reasonable for data builders to milk their data monopoly for some grace period.  In my opinion, you get one publication.  Then you put your data into the public domain.

And public domain means public domain. It’s not OK to hide your data from people you don’t like, from people you think are incompetent, from people you suspect of having engaged in acts of moral turpitude, etc. You post your data so everyone can have them.

If you put your data into the public domain and someone does something stupid with it then it’s fine to say that.  It’s a virtue to be nice but being nice isn’t a requirement.  But as far as I’m concerned you share your data or you’re not doing science.

Readers of the blog should be well aware that there has been a dispute about the decline of war (or not), primarily between Steven Pinker and Nassim Nicholas Taleb. You can track my participation in this debate from a bunch of my blog entries and the links they contain. I’m in the middle of preparing a conference talk on this subject, and I’ll post the slides later this week…so more is coming.

I planned a little data work to support the talk so I emailed Taleb asking him for the data he used to launch his attack on Pinker’s work.  Here is his reply.

1) It is not professional to publish a “flaw” without first contacting the authors. You did it twice.

2) Your 2 nitpicking “flaws” betrayed total ignorance of the subject.

So I will ask you to fuck off.

He is referring to this post (which did contain an error that I corrected after a reader pointed it out).

What can I say?

The main thing is that if he wants to do science then it’s not OK to just declare someone to be ignorant and withhold data.

Beyond that I’d say that if he still objects to something in my post he should be specific, either in the comments or to me directly.  As always, I’ll issue a correction or clarification if I get something wrong.

Third, it isn’t really standard to clear criticisms of someone’s work in advance with the person being criticized. Doing this could be a reasonable strategy in some cases. And it’s reasonable to send criticism to the person being criticized. Correcting errors, as I do, is essential.

Anyway, I take away from this episode that Taleb isn’t doing science and also that he probably doesn’t have great confidence in his work on this subject or else he wouldn’t hide his data.

Secret Data Sunday – AAPOR Investigates the Trump-Clinton Polling Miss Using Data You Can’t See

The long-awaited report from the American Association for Public Opinion Research (AAPOR) on the performance of polling in the Trump-Clinton race is out. You will see that this material is less of a stretch for the blog than it might seem at first glance, and I plan a second post on it.

Today I just want to highlight the hidden data issue which rears its head very early in the report:

The committee is composed of scholars of public opinion and survey methodology as well as election polling practitioners. While a number of members were active pollsters during the election, a good share of the academic members were not. This mix was designed to staff the committee both with professionals having access to large volumes of poll data they knew inside and out, and with independent scholars bringing perspectives free from apparent conflicts of interest. The report addresses the following questions:

So on the one hand we have pollsters “having access to large volumes of poll data” and on the other hand we have “independent scholars” who…errr…don’t normally have access to large volumes of polling data, because the pollsters normally hide it from them. (I’m not sure what the apparent conflict of interest of the pollsters is, but I guess it’s that they might be inclined to cover up errors they may have made in their election forecasts.)

You might well ask how come all these datasets aren’t in the public domain?

Sadly, there is no good answer to that question.

But the reason all these important data remain hidden is pretty obvious.  Pollsters don’t want independent analysts to embarrass them by finding flaws in their data or their analysis.

This is a bad reason.

There is a strong public interest in having the data available. The data would help all of us, not just the AAPOR committee, understand what went wrong with polling in the Trump-Clinton race. The data would also help us learn why Trump won, which is clearly an important question.

But we don’t have the data.

I understand that there are valid commercial reasons for holding polling data privately while you sell some stories about it.  But a month should be more than sufficient for this purpose.

It is unacceptable to say that sharing requires resources that you don’t have because sharing data just doesn’t require a lot of resources.  Yes, I know that I’ve whinged a bit on the blog about sharing all that State Department data and I’m doing it in tranches.  Still, this effort is costing me only about 15-30 minutes per dataset.  It’s really not a big deal.

I suppose somebody might say that these datasets are collected privately and so it’s OK to permanently keep them private.  But election polls drive public discussions and probably affect election outcomes.  There is a really strong public interest in disclosure.

There is further material in the report on data openness:

Since provision of microdata is not required by the AAPOR Transparency Initiative, we are particularly grateful to ABC News, CNN, Michigan State University, Monmouth University, and University of Southern California/Los Angeles Times for joining in the scientific spirit of this investigation and providing microdata. We also thank the employers of committee members (Pew Research Center, Marquette University, SurveyMonkey, The Washington Post, and YouGov) for demonstrating this same commitment.

I’ve written before about how AAPOR demands transparency on everything except the main thing you would think of when it comes to survey transparency – showing your data.

I’ll return to this AAPOR problem in a future Secret Data Sunday. But for now I just want to say that the Committee’s appeal to a “scientific spirit” falls flat. Nobody outside the committee can audit the AAPOR report, and it will be unnecessarily difficult to further develop lines of inquiry initiated by the report for one simple reason: nobody outside the committee has access to all of the data the committee analyzed. This is not science.

OK, that’s all I want to say today.  I’ll return to the main points of the report in a future post.

Secret Data Sunday – The Iraq Child and Maternal Mortality Survey

Many readers of the blog know that there was a major cock-up over child mortality figures for Iraq.  In fact, exaggerated child mortality figures have been used to justify the 2003 invasion of Iraq, both prospectively and retrospectively.

Here I won’t repeat the basics one more time, although anyone unfamiliar with this debacle should click on the above link which, in turn, offers further links providing more details.

Today I just inject one new point into this discussion – the dataset for the UNICEF survey that wildly overestimated Iraq’s child mortality rates is not available.  (To be clear, estimates from this dataset are available but the underlying data you need to audit the survey are hidden.)

The hidden survey is called the Iraq Child and Maternal Mortality Survey  (ICMMS).  This graph (which you can enlarge on your screen) reveals the ICMMS as way out of line with no fewer than four subsequent surveys, all debunking the stratospheric ICMMS child mortality estimates.  The datasets for three of the four contradicting surveys are publicly available and open to scrutiny (I will return to the fourth of the contradicting surveys in a future blog post.)

But the ICMMS dataset is nowhere to be found – and I’ve looked for it.

For starters, I emailed UNICEF but couldn’t find anyone there who had it or was willing to share it.

I also requested the dataset multiple times from Mohamed Ali, the consulting statistician on the survey who now is at the World Health Organization (WHO).

At one point Mohamed directed me to the acting head of the WHO office in Iraq who blew me off before I had a chance to request the data from him.  But, then, you have to wonder what the current head of the WHO office in Iraq has to do with a 1990’s UNICEF survey, anyway.

I persisted with Mohamed who then told me that if he still has the data it would be somewhere on some floppy disk.  This nostalgic reminder of an old technology is kind of cute but doesn’t let him off the hook for the dataset which I never received on a floppy disk or otherwise.

There is a rather interesting further wrinkle on this saga of futility.  The ICMMS dataset was heavily criticized in research commissioned for the UN’s oil for food report:

It is clear, however, that widely quoted claims made in 1995 of 500,000 deaths of children under 5 as a result of sanctions were far too high;

John Blacker, Mohamed Ali and Gareth Jones then responded to this criticism with a 2007 academic article defending the ICMMS dataset:

A response to criticism of our estimates of under-5 mortality in Iraq, 1980-98.

Abstract

According to estimates published in this journal, the number of deaths of children under 5 in Iraq in the period 1991-98 resulting from the Gulf War of 1991 and the subsequent imposition of sanctions by the United Nations was between 400,000 and 500,000. These estimates have since been held to be implausibly high by a working group set up by an Independent Inquiry Committee appointed by the United Nations Secretary-General. We believe the working group’s own estimates are seriously flawed and cannot be regarded as a credible challenge to our own. To obtain their estimates, they reject as unreliable the evidence of the 1999 Iraq Child and Maternal Mortality Survey–despite clear evidence of its internal coherence and supporting evidence from another, independent survey. They prefer to rely on the 1987 and 1997 censuses and on data obtained in a format that had elsewhere been rejected as unreliable 30 years earlier.

For the record, the Blacker, Ali and Jones article is weak and unconvincing and I may make it the subject of a future blog post.  But today I just concentrate on the (non)availability of the ICMMS dataset so I won’t wander off into a critique of their article.

Thinking purely in terms of data availability, the 2007 article raises some interesting questions.  Was Mohamed Ali still working off of floppy disks in 2007 when he published this article?  Surely he must have copied the dataset onto a hard disk to do the analysis.  And what about his co-authors?  They must have the dataset too, no?

Unfortunately, John Blacker has passed away but Gareth Jones is still around so I emailed him asking for the ICMMS dataset which he had defended so gamely.

He replied that he didn’t have access to the dataset when he wrote the 2007 article and still doesn’t have access now.  [MS – I reviewed the correspondence a few weeks after writing this post and realized that Jones did have access to the data way back when he worked for UNICEF but lost it after retiring a long time ago.  So he has seen the data but didn’t have it when writing his academic article defending it.]

Let that point sink in for a moment. Jones co-authored an article in an academic journal, the only point of which was to defend the quality of a dataset. Yet, he didn’t have access to the dataset that he defended? Sorry but this doesn’t work for me. As far as I’m concerned, when you write an article that is solely about the quality of a dataset then you need to at least take a little peek at the dataset itself.

I see two possibilities here and can’t decide which is worse.  Either these guys are pretending that they don’t have a dataset that they actually do have because they don’t want to make it public or they have been defending the integrity of a dataset they don’t even have.  Either way, they should stop the charade and declare that the ICMMS was just a big fat mistake.

I have known for a long time that the ICMMS was crap but the myth it generated lives on. It is time for the principal defenders of this sorry survey to officially flush it down the toilet.

Data Dump Friday – Part 2, some Iraq Polling Data

Happy Friday.

I have just posted here three new things.

  1. A list of all the data sets that Steve Koczela obtained from the State Department through his successful FOIA application.
  2. An Iraq poll from April 2006 fielded by the Iraq Center for Research and Strategic Studies (ICRSS).  [Note – this organization seems to be defunct.  Perhaps someone out there knows something about this?]
  3. An Iraq poll also from April 2006 and asking the same questions as the ICRSS poll but fielded by the notorious combination of D3 Systems and KA Research Limited (KARL).

We already saw a head to head comparison of these two polls that left no doubt that much of the D3/KA data were fabricated (see also this post).

More next week!

Data Dump Friday – Event Data on the Colombia Conflict, 1988 – 2005

Hello everybody.  Today initiates a new regular feature of the blog which I will call “Data Dump Friday”.

Each week I will post some conflict data on this page until I run out of stuff to post.  Unless I get a research assistant to streamline this operation this will go on for a long time.  (If you’re aware of some data set you think I have and you want to have please email me at m.spagat@rhul.ac.uk or put a comment up on the blog.)

The back story is that I’ve been planning for a long time to create a massive data page and then make a grand announcement when it’s done.  But it has finally dawned on me that I will never reach a point when suddenly I have a big bloc of free time to complete this chore.  So I’ve decided to do this project in dribs and drabs until I’m done.

The first installment is now up.  It is event data on the Colombian Conflict, 1988 – 2005.