Here’s the film you’re looking for.
Also, after being lazy for more than a month I’ve started uploading the remaining Iraq opinion polls at our old friend, the conflict data page.
The UK House of Lords has issued a call for evidence on the effects of political polling and digital media on politics. Submissions are due next week so maybe someone out there wants to dash something off… or maybe someone would be so kind as to give me feedback on my proposal. Below I give a draft.
Note that everything I say applies equally to political polling in the US and around the globe but, quite reasonably, the Lords ask about British polling so the proposal is written about British polling.
(OK, the proposal is about election polling, not war. But this post is very much in keeping with the open data theme of the blog so I believe it will be of general interest to my readers.)
I have one specific suggestion that could, if implemented, substantially improve political life in the UK: require collectors of political polling data to release their detailed micro datasets into the public domain.
A Preliminary Clarification
Some readers may think, wrongly, that pollsters generally do provide detailed micro datasets already. Occasionally they do. But normally they just publish summary tables, while withholding the interview-by-interview results (anonymized, of course). Researchers need such detailed data to make valid estimates.
Let me develop this idea in steps.
Political pollsters face two main challenges. First, they cannot draw well-behaved random samples of voters for their polls. This is mainly because most people selected for interviews refuse to participate. Moreover, the political views of the refusers differ systematically from those of the participants. Second, it is difficult to predict which poll participants will turn out to vote. Yet good election prediction relies on good turnout prediction.
These two challenges dictate that political polling datasets cannot simply interpret themselves. Rather, pollsters must use their knowledge, experience, intuition, wisdom and other wiles to model their way out of the shortcomings of their data. There now exists a growing array of techniques that can be deployed to address political polling challenges. But good applications of these techniques embody substantial elements of professional judgment, about which experts disagree.
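To make one such technique concrete, here is a minimal sketch of post-stratification weighting, one standard response to nonresponse bias. Everything in it (the respondents, the age groups, the population shares) is invented for illustration and has nothing to do with any actual poll:

```python
# A toy post-stratification sketch: reweight respondents so the sample's age
# mix matches known population shares. All names and numbers are invented.
from collections import Counter

sample = [
    # (age_group, stated_vote)
    ("young", "Clinton"), ("young", "Clinton"), ("old", "Trump"),
    ("old", "Trump"), ("old", "Clinton"), ("old", "Trump"),
    ("old", "Trump"), ("old", "Clinton"),
]
population_share = {"young": 0.4, "old": 0.6}  # assumed known, e.g. from a census

# Raw (unweighted) Clinton share: young voters are under-represented here.
raw = sum(vote == "Clinton" for _, vote in sample) / len(sample)

# Weight each respondent by (population share) / (sample share) of their group.
group_n = Counter(group for group, _ in sample)
weight = {g: population_share[g] / (group_n[g] / len(sample)) for g in group_n}

weighted = sum(weight[g] for g, vote in sample if vote == "Clinton") \
    / sum(weight[g] for g, _ in sample)

print(f"raw Clinton share: {raw:.2f}")
print(f"weighted Clinton share: {weighted:.2f}")
```

Real pollsters weight on many variables at once (age, education, region, past vote), and the choice of which variables to weight on is exactly the kind of professional judgment about which experts disagree.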
This New York Times article leaves little doubt about the point of the last paragraph. The NYT gave the detailed micro data from one of their proprietary Trump-Clinton polls to four analytical teams and asked for projections. The results ranged from Clinton +4 to Trump +1. These are all valid estimates made by serious professionals. Yet they differ quite substantially because the teams differ in some of their key judgments.
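The spread in the NYT experiment is easy to reproduce in miniature. The toy poll below is entirely invented (it is not the NYT dataset), and the two likely-voter rules are generic stand-ins for the kinds of judgment calls the four teams made. Both rules are defensible, yet applied to the identical interviews they give noticeably different margins:

```python
# Toy illustration: one set of interviews, two defensible turnout models,
# two different margins. All data are invented.

respondents = [
    # (stated_vote, self-reported probability of turning out to vote)
    ("Clinton", 0.90), ("Clinton", 0.40), ("Trump", 0.95),
    ("Clinton", 0.80), ("Trump", 0.85), ("Clinton", 0.30),
    ("Trump", 0.90), ("Clinton", 0.95), ("Trump", 0.50),
    ("Clinton", 0.60),
]

def margin(turnout_weight):
    """Clinton share minus Trump share (percentage points), weighting each
    respondent by a modeled probability that they actually vote."""
    totals = {"Clinton": 0.0, "Trump": 0.0}
    for vote, p in respondents:
        totals[vote] += turnout_weight(p)
    n = totals["Clinton"] + totals["Trump"]
    return 100 * (totals["Clinton"] - totals["Trump"]) / n

# Model A: take the stated likelihood at face value as a turnout probability.
model_a = margin(lambda p: p)

# Model B: a hard "likely voter" screen; only respondents at least 80% sure
# they will vote are counted, each with weight 1.
model_b = margin(lambda p: 1.0 if p >= 0.8 else 0.0)

print(f"Model A: Clinton {model_a:+.1f}")
print(f"Model B: Clinton {model_b:+.1f}")
```

Neither rule is wrong; they simply encode different judgments about who will vote, which is why serious teams can report margins as far apart as Clinton +4 and Trump +1 from the same raw data.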
The key point is that for the foreseeable future there will not be one correct analytical technique that, if applied properly, will always lead to a correct treatment of new polling data. Rather, there will be a useful range of valid analyses that can be made from any political polling dataset.
Presently we are robbed of all but one analysis of most political polling datasets that are collected in the UK. This is because polling data are held privately and never released into the public domain. This data black hole wastes opportunities in two distinct directions. First, we cannot learn as much as possible about the state of public opinion during elections. Second, by limiting the range of experimentation that is applied to each dataset we retard the development process for improving our analytical techniques.
An Important Caveat
Much political polling data are collected by private companies that must make a profit on their investment. These organizations might feel threatened by this open data proposal. However, these concerns can easily be addressed by allowing an appropriate interval of time for data collectors to monopolize their datasets. This could work much as patents do: patents provide creative incentives by giving inventors a window of time to reap high rewards before their inventions can be copied by competitors. The only difference is that the monopolization intervals for pollsters should be much shorter than patent terms, probably only two weeks or so.
Parliament Should Defend the Public Interest
There is a strong public interest in making full use of political polling data. Yet even public organizations like the BBC collect political polling data (although not in the 2017 election), write up general summaries and then consign their detailed micro data into oblivion. If public organizations cannot be convinced to do a better job of serving the public interest then they should be forced to do so. Even private companies should be forced, by legislation if necessary, to place their political polling data into the public domain after they have been allowed a decent interval designed to feed their bottom lines.
I do not argue that all private survey data should be released to the public. There must be a public interest test that has to be satisfied before public release can be mandated. This test would not be satisfied for most privately collected survey data. But election polling does meet this public interest standard and should be claimed as a public resource to benefit everyone in the UK.
I urge the committee to consult with the leadership of the Royal Statistical Society on this question. I have not coordinated my submission with them but I believe that they would back it.
The official NYT estimate was Clinton +1.
The WHO-sponsored Iraq Family Health Survey (IFHS) led to a nice publication in the New England Journal of Medicine that came complete with an editorial puff piece extolling its virtues. According to the NEJM website this publication has generated 60 citations and we’re still counting. If you cast a net wider than just medical publications then the citation count must run well into the hundreds.
But the IFHS virtues don’t stop there. The NEJM paper, and the accompanying report, are well written and supply plenty of good methodological information about the survey. The authors are pretty up front about the limitations of their work, notably that they had to skip interviews in some areas due to security concerns. Moreover, the IFHS is an important survey not least because its estimate of 150,000 violent deaths discredited the Burnham et al. estimate of 600,000 violent deaths for almost exactly the same time period. (The Burnham et al. survey hid its methodology and was afflicted by serious ethical and data integrity problems.)
I have cited the IFHS multiple times in my own work and generally believe in it. At the same time, the IFHS people did several questionable things with their analysis that I would like to correct, or at least investigate, by reanalyzing the IFHS data.
But here’s the rub. The WHO has not released the IFHS dataset.
I and other people have requested it many times. The field work was conducted way back in 2006. So what is the WHO waiting for?
I’ll leave a description of my unrealized reanalysis to a future post. This is because my plans just don’t matter for the issue at hand: the IFHS data should be in the public domain whether or not I have a good plan for analyzing them. (See this post on how the International Rescue Committee hides its DRC data in which I make the same point.)
There is an interesting link between the IFHS and the Iraq Child and Maternal Mortality Survey, another important dataset that is also unavailable. The main point of contact for both surveys is Mohamed Ali of the WHO. Regarding the IFHS, Mohamed seemed to tell me in an email that only the Iraqi government is empowered to release the dataset. If so, this suggests a new (at least for me) and disturbing problem:
Apparently, the WHO uses public money to sponsor surveys but then sells out the general public by ceding their data distribution rights to local governments, in this case to Iraq.
This practice of allowing governments that benefit from UN-sponsored research to withhold data from the public that pays for the research is unacceptable. It’s great that the WHO sponsors survey research in needy countries but open data should be a precondition for this service.
When data are central to scientific discussions, as is typically the case, then the relevant data should be open to all.
OK, we don’t have to be totally rigid about this. People may sink a lot of effort into building a data set so it’s reasonable for data builders to milk their data monopoly for some grace period. In my opinion, you get one publication. Then you put your data into the public domain.
And public domain means public domain. It’s not OK to hide your data from people you don’t like, from people you think are incompetent, from people you suspect of having engaged in acts of moral turpitude, etc. You post your data so everyone can have them.
If you put your data into the public domain and someone does something stupid with it then it’s fine to say that. It’s a virtue to be nice but being nice isn’t a requirement. But as far as I’m concerned you share your data or you’re not doing science.
Readers of the blog should be well aware that there has been a dispute about the decline of war (or not), primarily between Steven Pinker and Nassim Nicholas Taleb. You can track my participation in this debate from a bunch of my blog entries and the links they contain. I’m in the middle of preparing a conference talk on this subject, and I’ll post the slides later this week… so more is coming.
I planned a little data work to support the talk so I emailed Taleb asking him for the data he used to launch his attack on Pinker’s work. Here is his reply.
1) It is not professional to publish a “flaw” without first contacting the authors. You did it twice.
2) Your 2 nitpicking “flaws” betrayed total ignorance of the subject.
So I will ask you to fuck off.
He is referring to this post (which did contain an error that I corrected after a reader pointed it out).
What can I say?
The main thing is that if he wants to do science then it’s not OK to just declare someone to be ignorant and withhold data.
Beyond that I’d say that if he still objects to something in my post he should be specific, either in the comments or to me directly. As always, I’ll issue a correction or clarification if I get something wrong.
Third, it isn’t standard practice to clear criticisms of someone’s work with that person in advance. Doing this could be a reasonable strategy in some cases, and it’s reasonable to send criticism to the person being criticized. Correcting errors, as I do, is essential.
Anyway, I take away from this episode that Taleb isn’t doing science and also that he probably doesn’t have great confidence in his work on this subject or else he wouldn’t hide his data.
The long-awaited report from the American Association for Public Opinion Research (AAPOR) on the performance of polling in the Trump-Clinton race is out. You will see that this material is less of a stretch for the blog than it might seem to be at first glance and I plan a second post on it.
Today I just want to highlight the hidden data issue which rears its head very early in the report:
The committee is composed of scholars of public opinion and survey methodology as well as election polling practitioners. While a number of members were active pollsters during the election, a good share of the academic members were not. This mix was designed to staff the committee both with professionals having access to large volumes of poll data they knew inside and out, and with independent scholars bringing perspectives free from apparent conflicts of interest. The report addresses the following questions:
So on the one hand we have pollsters “having access to large volumes of poll data” and on the other hand we have “independent scholars” who… errr… don’t normally have access to large volumes of polling data because the pollsters normally hide it from them. (I’m not sure what the apparent conflict of interest of the pollsters is but I guess it’s that they might be inclined to cover up errors they may have made in their election forecasts.)
You might well ask why all these datasets aren’t in the public domain.
Sadly, there is no good answer to that question.
But the reason all these important data remain hidden is pretty obvious. Pollsters don’t want independent analysts to embarrass them by finding flaws in their data or their analysis.
This is a bad reason.
There is a strong public interest in having the data available. The data would help all of us, not just the AAPOR committee, understand what went wrong with polling in the Trump-Clinton race. The data would also help us learn why Trump won, which is clearly an important question.
But we don’t have the data.
I understand that there are valid commercial reasons for holding polling data privately while you sell some stories about it. But a month should be more than sufficient for this purpose.
It is unacceptable to say that sharing requires resources that you don’t have because sharing data just doesn’t require a lot of resources. Yes, I know that I’ve whinged a bit on the blog about sharing all that State Department data and I’m doing it in tranches. Still, this effort is costing me only about 15-30 minutes per dataset. It’s really not a big deal.
I suppose somebody might say that these datasets are collected privately and so it’s OK to permanently keep them private. But election polls drive public discussions and probably affect election outcomes. There is a really strong public interest in disclosure.
There is further material in the report on data openness:
Since provision of microdata is not required by the AAPOR Transparency Initiative, we are particularly grateful to ABC News, CNN, Michigan State University, Monmouth University, and University of Southern California/Los Angeles Times for joining in the scientific spirit of this investigation and providing microdata. We also thank the employers of committee members (Pew Research Center, Marquette University, SurveyMonkey, The Washington Post, and YouGov) for demonstrating this same commitment.
I’ve written before about how AAPOR demands transparency on everything except the main thing you would think of when it comes to survey transparency – showing your data.
I’ll return to this AAPOR problem in a future Secret Data Sunday. But for now I just want to say that the Committee’s appeal to a “scientific spirit” falls flat. Nobody outside the committee can audit the AAPOR report and it will be unnecessarily difficult to further develop lines of inquiry initiated by the report for one simple reason: nobody outside the committee has access to all of the data the committee analyzed. This is not science.
OK, that’s all I want to say today. I’ll return to the main points of the report in a future post.
Many readers of the blog know that there was a major cock-up over child mortality figures for Iraq. In fact, exaggerated child mortality figures have been used to justify the 2003 invasion of Iraq, both prospectively and retrospectively.
Here I won’t repeat the basics one more time, although anyone unfamiliar with this debacle should click on the above link which, in turn, offers further links providing more details.
Today I just inject one new point into this discussion – the dataset for the UNICEF survey that wildly overestimated Iraq’s child mortality rates is not available. (To be clear, estimates from this dataset are available but the underlying data you need to audit the survey are hidden.)
The hidden survey is called the Iraq Child and Maternal Mortality Survey (ICMMS). This graph (which you can enlarge on your screen) reveals the ICMMS as way out of line with no fewer than four subsequent surveys, all debunking the stratospheric ICMMS child mortality estimates. The datasets for three of the four contradicting surveys are publicly available and open to scrutiny (I will return to the fourth of the contradicting surveys in a future blog post.)
But the ICMMS dataset is nowhere to be found – and I’ve looked for it.
For starters, I emailed UNICEF but couldn’t find anyone there who had it or was willing to share it.
I also requested the dataset multiple times from Mohamed Ali, the consulting statistician on the survey who now is at the World Health Organization (WHO).
At one point Mohamed directed me to the acting head of the WHO office in Iraq who blew me off before I had a chance to request the data from him. But, then, you have to wonder what the current head of the WHO office in Iraq has to do with a 1990s UNICEF survey, anyway.
I persisted with Mohamed who then told me that if he still has the data it would be somewhere on some floppy disk. This nostalgic reminder of an old technology is kind of cute but doesn’t let him off the hook for the dataset which I never received on a floppy disk or otherwise.
There is a rather interesting further wrinkle on this saga of futility. The ICMMS dataset was heavily criticized in research commissioned for the UN’s oil for food report:
It is clear, however, that widely quoted claims made in 1995 of 500,000 deaths of children under 5 as a result of sanctions were far too high;
John Blacker, Mohamed Ali and Gareth Jones then responded to this criticism with a 2007 academic article defending the ICMMS dataset:
According to estimates published in this journal, the number of deaths of children under 5 in Iraq in the period 1991-98 resulting from the Gulf War of 1991 and the subsequent imposition of sanctions by the United Nations was between 400,000 and 500,000. These estimates have since been held to be implausibly high by a working group set up by an Independent Inquiry Committee appointed by the United Nations Secretary-General. We believe the working group’s own estimates are seriously flawed and cannot be regarded as a credible challenge to our own. To obtain their estimates, they reject as unreliable the evidence of the 1999 Iraq Child and Maternal Mortality Survey–despite clear evidence of its internal coherence and supporting evidence from another, independent survey. They prefer to rely on the 1987 and 1997 censuses and on data obtained in a format that had elsewhere been rejected as unreliable 30 years earlier.
For the record, the Blacker, Ali and Jones article is weak and unconvincing and I may make it the subject of a future blog post. But today I just concentrate on the (non)availability of the ICMMS dataset so I won’t wander off into a critique of their article.
Thinking purely in terms of data availability, the 2007 article raises some interesting questions. Was Mohamed Ali still working off of floppy disks in 2007 when he published this article? Surely he must have copied the dataset onto a hard disk to do the analysis. And what about his co-authors? They must have the dataset too, no?
Unfortunately, John Blacker has passed away but Gareth Jones is still around so I emailed him asking for the ICMMS dataset which he had defended so gamely.
He replied that he didn’t have access to the dataset when he wrote the 2007 article and still doesn’t have access now. [MS – I reviewed the correspondence a few weeks after writing this post and realized that Jones did have access to the data way back when he worked for UNICEF but lost it after retiring a long time ago. So he has seen the data but didn’t have it when writing his academic article defending it.]
Let that point sink in for a moment. Jones co-authored an article in an academic journal, the only point of which was to defend the quality of a dataset. Yet, he didn’t have access to the dataset that he defended? Sorry, but this doesn’t work for me. As far as I’m concerned, when you write an article that is solely about the quality of a dataset then you need to at least take a little peek at the dataset itself.
I see two possibilities here and can’t decide which is worse. Either these guys are pretending that they don’t have a dataset that they actually do have because they don’t want to make it public or they have been defending the integrity of a dataset they don’t even have. Either way, they should stop the charade and declare that the ICMMS was just a big fat mistake.
I have known for a long time that the ICMMS was crap but the myth it generated lives on. It is time for the principal defenders of this sorry survey to officially flush it down the toilet.
I have just posted three new things here.
More next week!