Data Dump Friday is Back Again: More Iraq Opinion Polls plus a film of Lions Hugging a Woman

Here’s the film you’re looking for.

Also, after being lazy for more than a month, I’ve started uploading the remaining Iraq opinion polls to our old friend, the conflict data page.


Open the Door to all the Hidden Election Polling Data

The UK House of Lords has issued a call for evidence on the effects of political polling and digital media on politics.  Submissions are due next week, so maybe someone out there wants to dash something off… or maybe someone would be so kind as to give me feedback on my proposal.  A draft is below.

Comments welcome!

Note that everything I say applies equally to political polling in the US and around the globe, but, quite reasonably, the Lords ask about British polling, so the proposal is written about British polling.

(OK, the proposal is about election polling, not war.  But this post is very much in keeping with the open data theme of the blog so I believe it will be of general interest to my readers.)


The Proposal[1]

I have one specific suggestion that could, if implemented, substantially improve political life in the UK: require collectors of political polling data to release their detailed micro datasets into the public domain.

A Preliminary Clarification

Some readers may think, wrongly, that pollsters generally provide detailed micro datasets already.  Occasionally they do.  But normally they publish only summary tables while withholding the interview-by-interview results (which would, of course, be anonymized before release).  Researchers need such detailed data to make their own valid estimates.

The Argument

Let me develop this idea in steps.

Political pollsters face two main challenges.  First, they cannot draw well-behaved random samples of voters for their polls.  This is mainly because most people selected for interviews refuse to participate.  Moreover, the political views of the refusers differ systematically from those of the participants.  Second, it is difficult to predict which poll participants will turn out to vote.  Yet good election prediction relies on good turnout prediction.

These two challenges mean that political polling datasets cannot simply speak for themselves.  Rather, pollsters must use their knowledge, experience, intuition, wisdom and other wiles to model their way out of the shortcomings of their data.  There is now a growing array of techniques that can be deployed to address these challenges.  But good applications of these techniques embody substantial elements of professional judgment, about which experts disagree.
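
To make this concrete, here is a toy sketch (in Python, with numbers I invented for illustration, not taken from any real poll) of one standard technique from that array, post-stratification weighting, which corrects for groups that are harder to reach than others:

```python
# Toy post-stratification example with invented numbers (not from any real poll).
# Young voters are assumed harder to reach, so the raw sample over-represents
# the old; reweighting to known population shares corrects the raw estimate.
from collections import Counter

# Hypothetical respondents: (age group, stated vote)
sample = [
    ("old", "A"), ("old", "A"), ("old", "B"), ("old", "A"), ("old", "A"),
    ("old", "B"), ("old", "A"),
    ("young", "B"), ("young", "B"), ("young", "A"),
]

population_share = {"old": 0.5, "young": 0.5}  # assumed census figures

# Each respondent's weight = population share of their group / sample share of their group
group_counts = Counter(group for group, _ in sample)
weight = {g: population_share[g] / (group_counts[g] / len(sample)) for g in group_counts}

raw_a = sum(1 for _, vote in sample if vote == "A") / len(sample)
weighted_a = (sum(weight[g] for g, vote in sample if vote == "A")
              / sum(weight[g] for g, _ in sample))

print(f"Raw support for A:      {raw_a:.0%}")       # 60%
print(f"Weighted support for A: {weighted_a:.0%}")  # about 52%
```

Real pollsters weight on many variables at once (age, region, education, past vote and so on), and the judgment calls start immediately: which variables, which population totals, which turnout model.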

This New York Times article leaves little doubt about the point of the previous paragraph.  The NYT gave the detailed micro data from one of its proprietary Trump-Clinton polls to four analytical teams and asked for projections.  The results ranged from Clinton +4 to Trump +1.[2]  These are all valid estimates made by serious professionals.  Yet they differ substantially because the teams differ in some of their key judgments.
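
To see how those judgments bite, here is a second toy sketch.  The interviews and the two likely-voter models below are entirely invented and have nothing to do with the actual NYT micro data; the point is only that the same responses can yield headline margins pointing in opposite directions:

```python
# Toy likely-voter example with invented numbers (nothing to do with the NYT data).
# The same interviews, run through two defensible turnout models, give
# headline margins that point in opposite directions.

# Hypothetical respondents: (vote, self-rated chance of voting, voted in last election)
respondents = [
    ("Clinton", 0.9, False), ("Clinton", 0.9, False), ("Clinton", 0.8, True),
    ("Trump",   0.6, True),  ("Trump",   0.6, True),  ("Trump",   0.7, True),
]

def margin(weights):
    """Clinton minus Trump, in percentage points, under the given turnout weights."""
    clinton = sum(w for r, w in zip(respondents, weights) if r[0] == "Clinton")
    trump = sum(w for r, w in zip(respondents, weights) if r[0] == "Trump")
    return 100 * (clinton - trump) / (clinton + trump)

# Model 1: believe each respondent's self-rated chance of voting
self_report = [p for _, p, _ in respondents]
# Model 2: count only people who actually voted last time
past_voters = [1.0 if voted else 0.0 for _, _, voted in respondents]

print(f"Self-report model: Clinton {margin(self_report):+.0f}")  # Clinton ahead
print(f"Past-vote model:   Clinton {margin(past_voters):+.0f}")  # Trump ahead
```

Neither model is wrong on its face; choosing between them is exactly the kind of professional judgment on which the four NYT teams diverged.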

The key point is that for the foreseeable future there will not be one correct analytical technique that, if applied properly, will always lead to a correct treatment of new polling data.  Rather, there will be a useful range of valid analyses that can be made from any political polling dataset.

At present we are robbed of all but one analysis of most political polling datasets collected in the UK.  This is because polling data are held privately and almost never released into the public domain.  This data black hole wastes opportunities in two distinct directions.  First, we cannot learn as much as possible about the state of public opinion during elections.  Second, by limiting the range of experimentation applied to each dataset, we retard the development of better analytical techniques.

An Important Caveat

Much political polling data are collected by private companies that must make a profit on their investment.  These organizations might feel threatened by this open data proposal.  However, their concerns can easily be addressed by allowing an appropriate interval of time for data collectors to monopolize their datasets.  This could work much the way patents do: patents provide creative incentives by giving inventors a window of time to reap high rewards before their inventions can be copied by competitors.  The only difference is that the monopolization interval for pollsters should be much shorter than a patent term, probably only two weeks or so.

Parliament Should Defend the Public Interest

There is a strong public interest in making full use of political polling data.  Yet even public organizations like the BBC collect political polling data (although not in the 2017 election), write up general summaries and then consign their detailed micro data to oblivion.  If public organizations cannot be convinced to do a better job of serving the public interest then they should be forced to do so.  Even private companies should be forced, by legislation if necessary, to place their political polling data into the public domain after they have been allowed a decent interval designed to feed their bottom lines.

I do not argue that all private survey data should be released to the public.  There should be a public interest test that must be satisfied before public release can be mandated.  This test would not be satisfied for most privately collected survey data.  But election polling does meet this public interest standard and should be claimed as a public resource to benefit everyone in the UK.

 

[1] I urge the committee to consult with the leadership of the Royal Statistical Society on this question.  I have not coordinated my submission with them but I believe that they would back it.

[2] The official NYT estimate was Clinton +1.

Fabrication in Survey Data: A Sustainable Ecosystem

Here is a presentation I gave a few weeks ago on fabrication in survey data.

It includes some staple material from the blog but, mainly, I set off in a new direction – trying to explain why survey data get fabricated in the first place.

While writing the presentation I realized that these conditions are similar to those that led to the Grenfell Tower fire.  I only hint at these connections in the presentation but I plan to pursue this angle in the future.

Data Dump Friday – Just Three this Week Plus a Cleanup of the Censorship Page

Hi.

I’ve now put all the State Department public opinion polls conducted in Iraq during 2005 up on the conflict data page.

I’ve also cleaned up the censorship page after realizing that its organization was worse than that of the conflict data page.

And, yes, I should unite the two pages, since there is now a lot of data posted on the censorship page that isn’t on the conflict data page, which doesn’t make a lot of sense.  So I’ll probably set up a mirror posting next week.

Look at this post if you’ve forgotten what the censorship page is about.

Secret Data Sunday – Iraq Family Health Survey

The WHO-sponsored Iraq Family Health Survey (IFHS) led to a nice publication in the New England Journal of Medicine that came complete with an editorial puff piece extolling its virtues.  According to the NEJM website this publication has generated 60 citations and we’re still counting.  If you cast a wider net than just medical publications then the citation count must run well into the hundreds.

But the IFHS’s virtues don’t stop there.  The NEJM paper, and the accompanying report, are well written and supply plenty of good methodological information about the survey.  The authors are pretty up front about the limitations of their work, notably that they had to skip interviews in some areas due to security concerns.  Moreover, the IFHS is an important survey, not least because its estimate of 150,000 violent deaths discredited the Burnham et al. estimate of 600,000 violent deaths for almost exactly the same time period.  (The Burnham et al. survey hid its methodology and was afflicted by serious ethical and data integrity problems.)

I have cited the IFHS multiple times in my own work and generally believe in it.  At the same time, the IFHS people did several questionable things with their analysis that I would like to correct, or at least investigate, by reanalyzing the IFHS data.

But here’s the rub.  The WHO has not released the IFHS dataset.

I and other people have requested it many times.  The field work was conducted way back in 2006.  So what is the WHO waiting on?

I’ll leave a description of my unrealized reanalysis to a future post.  This is because my plans just don’t matter for the issue at hand: the IFHS data should be in the public domain whether or not I have a good plan for analyzing them.  (See this post on how the International Rescue Committee hides its DRC data, in which I make the same point.)

There is an interesting link between the IFHS and the Iraq Child and Maternal Mortality Survey, another important dataset that is also unavailable.  The main point of contact for both surveys is Mohamed Ali of the WHO.  Regarding the IFHS, Mohamed seemed to tell me in an email that only the Iraqi government is empowered to release the dataset.  If so, this suggests a new (at least for me) and disturbing problem:

Apparently, the WHO uses public money to sponsor surveys but then sells out the general public by ceding data distribution rights to local governments, in this case to Iraq.

This practice of allowing governments that benefit from UN-sponsored research to withhold data from the public that pays for the research is unacceptable.  It’s great that the WHO sponsors survey research in needy countries, but open data should be a precondition for this service.