Data Dump Friday is Back Again: More Iraq Opinion Polls plus a film of Lions Hugging a Woman

Here’s the film you’re looking for.

Also, after being lazy for more than a month I’ve started uploading the remaining Iraq opinion polls to our old friend, the conflict data page.


Open the Door to all the Hidden Election Polling Data

The UK House of Lords has issued a call for evidence on the effects of political polling and digital media on politics.  Submissions are due next week, so maybe someone out there wants to dash something off… or maybe someone would be so kind as to give me feedback on my proposal.  Below I give a draft.

Comments welcome!

Note that everything I say applies equally to political polling in the US and around the globe but, quite reasonably, the Lords ask about British polling so the proposal is written about British polling.

(OK, the proposal is about election polling, not war.  But this post is very much in keeping with the open data theme of the blog so I believe it will be of general interest to my readers.)


The Proposal[1]

I have one specific suggestion that could, if implemented, substantially improve political life in the UK: require collectors of political polling data to release their detailed micro datasets into the public domain.

A Preliminary Clarification

Some readers may think, wrongly, that pollsters generally do provide detailed micro datasets already.  Occasionally they do.  But normally they just publish summary tables, while withholding the interview-by-interview results (anonymized, of course).  Researchers need such detailed data to make valid estimates.
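
To make the distinction concrete, here is a minimal Python sketch of the difference between the interview-level records researchers need and the summary table that typically gets published.  Everything in it (variables, values, weights) is invented for illustration and does not come from any real pollster’s file.

# Hypothetical interview-level records ("micro data"), already anonymized:
# one row per respondent.  All values are invented for illustration.
microdata = [
    {"age_band": "18-34", "region": "London", "vote": "Labour",       "weight": 1.2},
    {"age_band": "18-34", "region": "North",  "vote": "Conservative", "weight": 0.9},
    {"age_band": "65+",   "region": "North",  "vote": "Conservative", "weight": 1.1},
    {"age_band": "65+",   "region": "London", "vote": "Labour",       "weight": 0.8},
    # ...thousands more rows in a real poll
]

# What normally gets published instead: one weighted summary table,
# with the row-level detail thrown away.
totals = {}
for row in microdata:
    totals[row["vote"]] = totals.get(row["vote"], 0.0) + row["weight"]
total_weight = sum(totals.values())
summary_table = {party: round(100 * w / total_weight, 1) for party, w in totals.items()}
print(summary_table)

Once the responses have been collapsed into summary_table, nobody outside the polling organization can re-weight them, model turnout differently or cross the results with other variables.  Reanalysis needs the rows.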

The Argument

Let me develop this idea in steps.

Political pollsters face two main challenges.  First, they cannot draw well-behaved random samples of voters for their polls.  This is mainly because most people selected for interviews refuse to participate.  Moreover, the political views of the refusers differ systematically from those of the participants.  Second, it is difficult to predict which poll participants will turn out to vote.  Yet good election prediction relies on good turnout prediction.

These two challenges dictate that political polling datasets cannot simply interpret themselves.  Rather, pollsters must use their knowledge, experience, intuition, wisdom and other wiles to model their way out of the shortcomings of their data.  There now exists a growing array of techniques that can be deployed to address political polling challenges.  But good applications of these techniques embody substantial elements of professional judgment, about which experts disagree.

This New York Times article leaves little doubt about the point of the last paragraph.  The NYT gave the detailed micro data from one of their proprietary Trump-Clinton polls to four analytical teams and asked for projections.  The results ranged from Clinton +4 to Trump +1.[2]  These are all valid estimates made by serious professionals.  Yet they differ quite substantively because the teams differ in some of their key judgments.

The key point is that for the foreseeable future there will not be one correct analytical technique that, if applied properly, will always lead to a correct treatment of new polling data.  Rather, there will be a useful range of valid analyses that can be made from any political polling dataset.
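
To see how this plays out, here is a toy Python sketch of two analytical teams applying different, but both defensible, turnout models to the same raw responses.  The respondents, weights and turnout probabilities are all invented; this is not a reconstruction of what any of the NYT teams actually did.

# Toy raw poll: each respondent has a vote intention, a demographic
# weight and two plausible but different estimated turnout probabilities.
respondents = [
    # (vote,    demographic_weight, turnout_model_a, turnout_model_b)
    ("Clinton", 1.0, 0.9, 0.7),
    ("Clinton", 1.3, 0.6, 0.4),
    ("Trump",   0.8, 0.8, 0.9),
    ("Trump",   1.1, 0.7, 0.9),
    ("Clinton", 0.9, 0.5, 0.3),
]

def clinton_margin(model):
    """Weighted Clinton-minus-Trump margin under one turnout model."""
    clinton = trump = 0.0
    for vote, weight, turnout_a, turnout_b in respondents:
        turnout = turnout_a if model == "a" else turnout_b
        if vote == "Clinton":
            clinton += weight * turnout
        else:
            trump += weight * turnout
    return 100 * (clinton - trump) / (clinton + trump)

print(f"Team A: Clinton {clinton_margin('a'):+.1f}")
print(f"Team B: Clinton {clinton_margin('b'):+.1f}")

With only five made-up respondents the swing is exaggerated, but the mechanism is the one at work in the NYT exercise: identical raw data, different defensible judgments about weighting and turnout, different headline numbers.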

Presently we are robbed of all but one analysis of most political polling datasets that are collected in the UK.  This is because polling data are held privately and never released into the public domain.  This data black hole wastes opportunities in two distinct directions.  First, we cannot learn as much as possible about the state of public opinion during elections.  Second, by limiting the range of experimentation that is applied to each dataset we retard the development process for improving our analytical techniques.

An Important Caveat

Much political polling data are collected by private companies that must make a profit on their investment.  These organizations might feel threatened by this open data proposal.  However, these concerns can easily be addressed by allowing an appropriate interval of time for data collectors to monopolize their datasets.  This could work much in the way that patents provide creative incentives for inventors by giving them a window of time to reap high rewards before their inventions can be copied by competitors.  The only difference here is that these monopolization intervals for pollsters should be much shorter than patent terms, probably only two weeks or so.

Parliament Should Defend the Public Interest

There is a strong public interest in making full use of political polling data.  Yet even public organizations like the BBC collect political polling data (although not in the 2017 election), write up general summaries and then consign their detailed micro data to oblivion.  If public organizations cannot be convinced to do a better job of serving the public interest then they should be forced to do so.  Even private companies should be forced, by legislation if necessary, to place their political polling data into the public domain after they have been allowed a decent interval designed to feed their bottom lines.

I do not argue that all private survey data should be released to the public.  There must be a public interest test that has to be satisfied before public release can be mandated.  This test would not be satisfied for most privately collected survey data.  But election polling does meet this public interest standard and should be claimed as a public resource to benefit everyone in the UK.

 

[1] I urge the committee to consult with the leadership of the Royal Statistical Society on this question.  I have not coordinated my submission with them but I believe that they would back it.

[2] The official NYT estimate was Clinton +1.

Fabrication in Survey Data: A Sustainable Ecosystem

Here is a presentation I gave a few weeks ago on fabrication in survey data.

It includes some staple material from the blog but, mainly, I set off in a new direction – trying to explain why survey data get fabricated in the first place.

While writing the presentation I realized that these conditions are similar to those that led to the Grenfell Tower fire.  I only hint at these connections in the presentation but I plan to pursue this angle in the future.

Secret Data Sunday – Iraq Family Health Survey

The WHO-sponsored Iraq Family Health Survey (IFHS) led to a nice publication in the New England Journal of Medicine that came complete with an editorial puff piece extolling its virtues.  According to the NEJM website this publication has generated 60 citations and we’re still counting.  If you cast a net wider than just medical publications then the citation count must run well into the hundreds.

But the IFHS virtues don’t stop there.  The NEJM paper, and the accompanying report, are well written and supply plenty of good methodological information about the survey.  The authors are pretty up front about the limitations of their work, notably that they had to skip interviews in some areas due to security concerns.  Moreover, the IFHS is an important survey not least because its estimate of 150,000 violent deaths discredited the Burnham et al. estimate of 600,000 violent deaths for almost exactly the same time period.  (The Burnham et al. survey hid its methodology and was afflicted by serious ethical and data integrity problems.)

I have cited the IFHS multiple times in my own work and generally believe in it.  At the same time, the IFHS people did several questionable things with their analysis that I would like to correct, or at least investigate, by reanalyzing the IFHS data.

But here’s the rub.  The WHO has not released the IFHS dataset.

I and other people have requested it many times.  The field work was conducted way back in 2006.  So what is the WHO waiting on?

I’ll leave a description of my unrealized reanalysis to a future post. This is because my plans just don’t matter for the issue at hand; the IFHS data should be in the public domain whether or not I have a good plan for analyzing them.  (See this post on how the International Rescue Committee hides its DRC data in which I make the same point.)

There is an interesting link between the IFHS and the Iraq Child and Maternal Mortality Survey, another important dataset that is also unavailable.  The main point of contact for both surveys is Mohamed Ali of the WHO.  Regarding the IFHS, Mohamed seemed to tell me in an email that only the Iraqi government is empowered to release the dataset.  If so, this suggests a new (at least for me) and disturbing problem:

Apparently, the WHO uses public money to sponsor surveys but then sells out the general public by ceding their data distribution rights to local governments, in this case to Iraq.  

This practice of allowing governments that benefit from UN-sponsored research to withhold data from the public that pays for the research is unacceptable.  It’s great that the WHO sponsors survey research in needy countries, but open data should be a precondition for this service.

 

 

How Many People were Killed in the Libyan Conflict – Some field work that raises more questions than it answers

Hana Salama asked me for an opinion on this article. I had missed it but it is, potentially, interesting to me so I am happy to oblige her.

I’ve now absorbed it but find myself even more puzzled than I was after reading that Syria survey I blogged on a few weeks back.  Again, it looks like some people did some useful field work but the write-up is so bad that it’s hard to know exactly what they did.  In fact, the Libya work is more opaque than the Syria work, to the point where I wonder what, if anything, was actually done.

For orientation here is the core of the abstract:

Methods

A systematic cross-sectional field survey and non-structured search was carried out over fourteen provinces in six Libyan regions, representing the primary sites of the armed conflict between February 2011 and February 2012. Thirty-five percent of the total area of Libya and 62.4% of the Libyan population were involved in the study. The mortality and injury rates were determined and the number of displaced people was calculated during the conflict period.

Results

A total of 21,490 (0.5%) persons were killed, 19,700 (0.47%) injured and 435,000 (10.33%) displaced. The overall mortality rate was found to be 5.1 per 1000 per year (95% CI 4.1–7.4) and injury rate was found to be 4.7 per 1000 per year (95% CI 3.9–7.2) but varied by both region and time, reaching peak rates by July–August 2011.
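
As a quick sanity check, the abstract’s counts, percentages and rates hang together arithmetically if one assumes the roughly 4.2 million people the paper’s Table 1 says were covered and treats February 2011 to February 2012 as one year of exposure.  A back-of-the-envelope version in Python:

# Rough consistency check of the headline numbers, assuming the
# ~4.2 million people covered (Table 1 of the paper) and one year of exposure.
covered_population = 4_200_000
killed, injured, displaced = 21_490, 19_700, 435_000

for label, count in [("killed", killed), ("injured", injured), ("displaced", displaced)]:
    print(f"{label}: {100 * count / covered_population:.2f}% of covered population")

print(f"mortality rate: {1000 * killed / covered_population:.1f} per 1,000 per year")
print(f"injury rate: {1000 * injured / covered_population:.1f} per 1,000 per year")

Those land close to the published 0.5%, 0.47%, 10.33%, 5.1 and 4.7 (the small gaps presumably reflect the exact covered-population figure).  So the arithmetic is internally consistent; the real question is where the counts themselves came from.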

I’m not sure but I think the researchers (hereafter Daw et al.) tried to count war deaths (plus injuries and displacement numbers) rather than trying to statistically estimate these numbers.  (See this paper on the distinction.)

Actually, I read the whole paper thinking that Daw et al. drew a random sample and did statistical estimation but then I changed my mind.  I got my initial impression at the beginning because they say

This epidemiological community-based study was guided by previously published studies and guidelines.

They then cite the (horrible) Roberts et al. (2004) Iraq survey as providing a framework for their research (see this and follow the links).  Since Roberts et al. was a sample survey I figured that Daw et al. was also a sample survey.  They then go on to say that

Face to face interviews were carried out with at least one member of each affected family….

This also seemed to point in the direction of a sample survey conducted on a bunch of randomly selected households.  (With this method you pick a bunch of households at random, find out how many people lived and died in each one and then extrapolate a national death rate from the in-sample death data.)
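
As an aside, here is a minimal Python sketch of that extrapolation logic, with entirely invented numbers, assuming a simple random sample of households.  A real cluster survey along the lines of Roberts et al. would also need design weights and a proper confidence interval, which I skip here.

# Toy sample-survey estimation with invented numbers: interview a random
# sample of households, compute the in-sample death rate, then extrapolate
# to the population of the covered areas.
households_sampled = 900                   # hypothetical number of households interviewed
people_sampled = households_sampled * 6    # assume roughly 6 people per household
deaths_reported = 27                       # war deaths reported across the sampled households

death_rate = deaths_reported / people_sampled
covered_population = 4_200_000             # the ~4.2 million people in the covered areas
estimated_deaths = death_rate * covered_population

print(f"in-sample death rate: {1000 * death_rate:.1f} per 1,000")
print(f"extrapolated deaths: {estimated_deaths:,.0f}")

That is the logic; whether Daw et al. actually did anything like it is exactly what the paper fails to make clear.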

But then I realized that the above quote continues with

…listed in the registry of the Ministry of Housing and Planning

Hmmmm….so they interviewed all affected families listed in the registry of some Ministry.  This registry cannot have been a registry of every family living in the areas covered by the survey because there are far more families there than could have been interviewed on this project.  (The areas covered contain around 4.2 million people according to Table 1 of the paper and surely Daw et al. did not conduct hundreds of thousands of interviews.)

So I’m guessing that the interviews were just of people from families on an official list of victims; killed, injured or displaced.  This guess places a lot of emphasis on one interpretation of the words “listed” and “affected” but it does make some sense.

To be clear, even interviewing one representative from every affected family would have been a gargantuan task since Daw et al. identify around 40,000 casualties (killings plus injuries) and more than 400,000 displaced people.  So we would still be talking about tens of thousands of interviews.

To be honest, now I’m wondering if all these interviews really happened.  That’s an awful lot of interviews and they would have been conducted in the middle of a war.

So now I’m back to thinking that maybe it was a sample survey of a few thousand households.  But if so then the write-up has the large flaw that there is no description whatsoever of how its sample was drawn (if, indeed, there was a sample).

Something is definitely wrong here.  I shouldn’t have to get out a Ouija board to divine the authors’ methodology.

The Syria survey discussed a few weeks ago seems to be in a different category.  For that one I have a lot of questions about what they did combined with doubts about whether their methods make sense.  But this Libya write-up seems weird to the point where I wonder whether they were actually out in the field at all.

Maybe an email to Dr. Daw will clear things up in a positive way.  With the Syria paper, emailing the lead author got me nowhere but maybe here it will work.  I’m afraid that the best-case scenario is that Daw et al. did some useful field work that was obscured by a poor write-up and that there is a better paper waiting to be written.