Data Dump Friday – Part 2, some Iraq Polling Data

Happy Friday.

I have just posted here three new things.

  1. A list of all the data sets that Steve Koczela obtained from the State Department through his successful FOIA application.
  2. An Iraq poll from April 2006 fielded by the Iraq Center for Research and Strategic Studies (ICRSS).  [Note – this organization seems to be defunct.  Perhaps someone out there knows something about this?]
  2. An Iraq poll, also from April 2006, asking the same questions as the ICRSS poll but fielded by the notorious combination of D3 Systems and KA Research Limited (KARL).

We already saw a head-to-head comparison of these two polls that left no doubt that much of the D3/KA data were fabricated (see also this post).

More next week!

Special Journal Issue on Fabrication in Survey Research

The Statistical Journal of the IAOS has just released a new issue with a bunch of articles on fabrication in survey research, a subject of great interest for the blog.

Unfortunately, most of the articles are behind a paywall but, thankfully, the overview by Steve Koczela and Fritz Scheuren is open access.  It’s a beautiful piece – short, sweet, wise and accurate.  Please read it.

Here are my comments.

Way back in 1945 the legendary Leo Crespi stressed the importance of what he called "the cheater problem."  Although he did this in the flagship survey research journal, Public Opinion Quarterly, the topic has never become mainstream in the profession.  Many survey researchers seem to view fabrication as a topic not really appropriate for polite company, akin to discussing the sexual history of a bride at her wedding.  Of course, this semi-taboo is convenient for cheaters.  Maria Konnikova has a great new book about confidence artists.  Much in the book is relevant to the subject of fabrication in survey research, but one point really stands out for me: a key reason why the same cons and the same con artists move seamlessly from mark to mark is that each victim is too embarrassed to publicize his or her victimization.

Discussions of fabrication over the years have almost always focused on what is known as curbstoning, i.e., a single interviewer making up data.  (The term comes from the image of a guy sitting on a street curb filling out his forms.)  But this is just one type of cheating, and one of the great contributions of Koczela and Scheuren's journal issue, and of the impressive series of prior conferences, is that they have substantially expanded the scope of the survey fabrication field.  Now we discuss fabrication by supervisors, principal investigators and the leaders of survey companies.  We now know that hundreds of public opinion surveys, especially surveys conducted in poor countries, are contaminated by widespread duplication and near duplication of single observations.  (This journal issue publishes the key paper on duplication.)

Let me quote a bit from the to-do list of Koczela and Scheuren.

It does not only happen to small research organizations with fewer resources, as was previously believed [12].  Recent instances involve the biggest and most [prominent] names in the survey research business, academia and the US Government.

This is certainly true, but I would add that reticence about naming names is crippling.  Yes, it's helpful to know that there are many dubious surveys out there, but guidance on which ones they are would be even more valuable.

An acknowledgement by the research community that data fabrication is a common threat, particularly in remote and dangerous survey environments would allow the community to be cooperative and proactive in preventing, identifying and mitigating the effects of fabrication.

This comment about remote and dangerous survey environments fits perfectly with my critiques of Iraq surveys including this one.

Given the perceived stakes, these discussions often result in legal threats or even legal action of various types.

Ummm….yes.

…the problem of fabrication is fundamentally one of co-evolution.  The more detection and prevention methods evolve, the more fabricators may evolve to stay ahead.  And to the extent we discover and confirm fabrication, we will never know whether we found it all, or caught only the weakest of the pack.  With these truths in mind, more work is needed in developing and testing statistical methods of fabrication detection.  This is made more difficult by the lack of training datasets, a problem prolonged by a general unwillingness to openly discuss data fabrication.

Again, I couldn’t agree more.

Technical countermeasures during fielding are less useful in harder to survey areas, which also happen to be the areas where the incentive to fabricate data is the highest. Many of the recent advances in field quality control processes focus on areas where technical measures such as computer audio recording, GPS, and other mechanisms can be used [6,13].

In remote and dangerous areas, where temptation to fabricate is the highest, technical countermeasures are often sparse [9]. And perversely, these are often the most closely watched international polls, since they often represent the hotspots of American interest and activity. Robbins and Kuriakose show a heavy skew in the presence of duplicate cases in non-OECD countries, potentially a troubling indicator. These polls conducted in remote areas often have direct bearing on policy for the US and other countries. To get a sense of the impact of the polls, a brief review of the recently released Iraq Inquiry, the so-called Chilcot report, contains dozens of documents that refer, in most cases uncritically, to the impact and importance of polls.

To be honest, Koczela and Scheuren do such a great job with their short essay that I’m struggling to add value here.  What they write above is hugely pertinent to all the work I’ve done on surveys in Iraq.

By the way, a response I sometimes get to my critiques of the notorious Burnham et al. survey of deaths in the Iraq war (see, for example, here, here and here) is that it is unreasonable to expect perfection from a survey operating in such a difficult environment.  Fair enough.  But then you have to concede that we cannot expect high-quality results from such a survey either.  If I were to walk in off the street and take Harvard's PhD qualifying exam in physics (I'm assuming they have such a thing…) it would be unreasonable to expect me to do well.  I just haven't prepared for such an exam.  Fine, but that doesn't somehow make me an authority on physics.  It just gives me a perfect excuse for not being one.

Finally, Koczela and Scheuren provide a mass of resources that researchers can use to bring themselves to the frontier of the survey fabrication field.  Anyone interested in this subject needs to take a look at these resources.

Chilcot on Civilian Casualties: Part 4

In October of 2004 The Lancet published a paper by Roberts et al. that estimated the number of excess deaths for the first year and a half of the Iraq war using data from a new survey they had just conducted.  (Readers wanting a refresher course on the concept of excess deaths can go here.)

One of the best parts of the civilian casualties chapter of the Chilcot report is the front-row seat it provides for the (rather panicked) discussion that Roberts et al. provoked within the UK government.  Here the real gold takes the form of links to three separate reviews of the paper provided by government experts: Sir Roy Anderson (first report), Creon Butler (second report) and Bill Kirkup, CBE (third report).

In the next several posts I will evaluate the evaluators.  I start largely by incorporating only information that was available when they made their reports.  But I will, increasingly, take advantage of hindsight.

For orientation I quote the “Interpretation” part of the Summary of Roberts et al.:

Making conservative assumptions, we think that about 100,000 excess deaths, or more have happened since the 2003 invasion of Iraq.  Violence accounted for most of the excess deaths and airstrikes from coalition forces accounted for most violent deaths.  We have shown that collection of public-health information is possible even during periods of extreme violence.  Our results need further verification and should lead to changes to reduce non-combatant deaths from air strikes.

The UK government reaction focused exclusively, so far as I can tell, on the question of how to respond to the PR disaster ensuing from:

  1. The headline figure of 100,000 deaths, which was much bigger than any figure that had been seriously put forward before.
  2. The claim that the Coalition was directly responsible for most of the violence.  (Of course, one could argue that the Coalition was ultimately responsible for all the violence since it initiated the war in the first place, but nobody in the government took such a position.)

Today I finish with two important points that none of the three experts noticed.

First, the field work for the survey could not have been conducted as claimed in the paper.  The authors write that two teams conducted all the interviews between September 8 and September 20, i.e., in just 13 days.  There were 33 clusters, each containing 30 households.  This means that each team had to average nearly 40 interviews per day, often spread across more than a single sampling point (cluster).  These interviews had to be on top of travelling all over the country, on poor roads with security checkpoints, to reach the 33 clusters in the first place.
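To make the implied pace concrete, here is a minimal back-of-the-envelope sketch.  The cluster count, household count, team count and field dates come from the paper; the eight-hour working day is my own illustrative assumption.

```python
# Rough pace implied by the field schedule reported in Roberts et al.
clusters = 33
households_per_cluster = 30
teams = 2
field_days = 13  # September 8 to September 20

interviews = clusters * households_per_cluster        # 990 interviews
per_team_per_day = interviews / (teams * field_days)  # ~38 interviews

# Illustrative assumption (not from the paper): an 8-hour working day
# with zero time lost to travel between or within clusters.
minutes_per_interview = 8 * 60 / per_team_per_day     # ~13 minutes, at most

print(f"{interviews} interviews, about {per_team_per_day:.0f} per team per day,")
print(f"leaving at most ~{minutes_per_interview:.0f} minutes per interview before any travel.")
```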

To get a feel for the logistical challenge that faced the field teams consider this picture of the sample from a later, and much larger, survey – the Iraq Living Conditions Survey:

[Figure: map of the ILCS sample]

I know the resolution isn’t spectacular on the picture but I still hope that you can make out the blue dots.  There are around 2,200 of them, one for each cluster of interviews in this survey.

Now imagine choosing 33 of these dots at random and trying to reach all of them with two teams in 13 days.  Further imagine conducting 30 highly sensitive interviews (about deaths of family members) each time you make it to one of the blue points.  If a grieving parent asks you to stay for tea, do you tell them to just answer your questions because you need to move on instantly?

The best-case scenario is that the field teams cut corners with the cluster selection to render the logistics possible and then raced through the interviews at break-neck speed (no more than 10 minutes per interview).  In other words, the hope is that the teams succeeded in taking bad measurements of a non-random sample (which the authors then treat as random).  But, as Andrew Gelman reminds us, accurate measurement is hugely important.

The worst-case scenario is that the field teams simplified their logistical challenges by making up their data.  Recall that data fabrication is widespread in surveys done in poor countries.  Note, also, that the results of the study were meant to be released before the November 2 election in the US and the field work was completed only on September 20, so slowing down the field work to improve quality was not an option.

Second, no expert picked up on the enormous gap between the information on death certificates reported in the Roberts et al. paper and the mortality information the Iraqi Ministry of Health (MoH) was releasing at the time.  A crude back-of-the-envelope calculation reveals the immense size of this inconsistency:

  1.  The population of Iraq was, very roughly, 24 million and the number of people in the sample is reported as 7,868.  So each in-sample death translates into about 3,000 estimated deaths (24,000,000/7,868).  Thus, the 73 in-sample violent deaths become an estimate of well over 200,000 violent deaths.
  2. Iraq's MoH reported 3,858 violent deaths between April 5, 2004 and October 5, 2004, in other words a bit fewer than 4,000 deaths backed by MoH death certificates.  The MoH has no statistics prior to April 5, 2004 because its systems were in disarray before then (p. 191 of the Chilcot chapter).
  3. Points 1 and 2 together imply that death certificates should have been available for only about 2% of violent deaths (4,000/200,000).  (A sketch of this arithmetic follows the list.)
  4. Yet Roberts et al. report that their field teams tried to confirm 78 of their recorded deaths by asking respondents to produce death certificates and that 63 of these attempts (81%) were successful.
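Here is a minimal sketch of that arithmetic, using only the figures quoted above (the rounding is mine; the code is purely illustrative):

```python
# Back-of-the-envelope check of death certificate coverage, using the
# figures quoted in the numbered list above.
population = 24_000_000       # very rough population of Iraq
sample_size = 7_868           # people covered by the Roberts et al. sample
violent_deaths_in_sample = 73
moh_violent_deaths = 3_858    # MoH certificate-backed count, Apr 5 - Oct 5, 2004

scale_factor = population / sample_size                              # ~3,050
estimated_violent_deaths = scale_factor * violent_deaths_in_sample   # ~220,000

# Share of estimated violent deaths that could possibly carry an MoH certificate
implied_rate = moh_violent_deaths / estimated_violent_deaths         # roughly 2%

# Confirmation rate actually reported in the paper
reported_rate = 63 / 78                                               # ~81%

print(f"implied certificate rate: {implied_rate:.1%}; reported: {reported_rate:.0%}")
```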

The paper makes clear that the selection of the 78 cases wasn’t random and it could be that death certificate coverage is better for non-violent deaths than it is for violent deaths.

Still……

There is a big, yawning, large, humongous massive gap between 2% and 81% and something has to give.

Here are the only resolution possibilities I can think of:

  1. The MoH issued vastly more (i.e., 50 times more) death certificates for violent deaths than it has admitted to issuing.  This seems far-fetched in the extreme.
  2. The field teams for Roberts et al. fabricated their death certificate confirmation figures.  This seems likely, especially since the paper reports:

Interviewers were initially reluctant to ask to see death certificates because this might have implied they did not believe the respondents, perhaps triggering violence.  Thus, a compromise was reached for which interviewers would attempt to confirm at least two deaths per cluster.

Compromises that pressure interviewers to risk their lives are not promising and can easily lead to data fabrication.

  3. The survey picked up too many violent deaths.  I think this is true and we will return to this possibility in a follow-up post, but I don't think that this can be the main explanation for the death certificate gap.

OK, that’s enough for today.

In the next post I'll discuss more of what the expert reports actually said, rather than what they didn't say.


Show me Your Data

I would love to claim Andrew Gelman as the model for my blogging although, realistically, I'll never be able to match his torrent of wonderful material.  Thus, it's always an honour when Andrew features my work on his blog.

Andrew's piece brings high-level attention to the issue of fabrication in survey research in general and to the issue of fabrication in a series of Iraq surveys fielded by D3 Systems and KA Research Limited in particular.

Andrew writes:

I don’t don’t don’t don’t don’t trust surveys where the data are hidden.

I think he slightly slightly exaggerates the trustworthiness of surveys with hidden data, but still he's on the right track.

Andrew’s comment drove me to take a close look at the AAPOR Transparency Initiative, an effort by the American Association for Public Opinion Research (AAPOR).  AAPOR is busy getting institutions to sign on to a pledge to disclose central aspects of their methodology.  This is important work since we know that some people like to hide their methodologies and still hope to be taken seriously.  Nevertheless, AAPOR has not traditionally pushed for disclosure of data so I was skeptical of this initiative.

Thus, I was pleasantly surprised to discover that AAPOR’s disclosure standards have moved forward to include some pressure for data disclosure in addition to AAPOR’s longstanding emphasis on methodological essentials such as sampling designs, question wordings and target populations:

“Finally, reflecting the fundamental goals of transparency and replicability we share the expectation that access to datasets and related documentation will be provided to allow for independent review and verification of research claims upon request.  Datasets may be held without release for a period of up to one year after findings are publicly released to allow full opportunity for primary analysis.  In order to protect the privacy of individual respondents such datasets must be de-identified to remove variables that can reasonably be expected to identify a respondent.  Those who commission publicly disseminated research have an obligation to disclose a rationale for why eventual public release or access to the datasets is not possible if that is the case.” (Informational Module 5)

Honestly, it would be much better to say that if you want to be part of the Transparency Initiative then you have to share your data.  It’s hard to understand how an institution that hides its data can claim to be a paragon of transparency.

Still, the glass is at least half full here.  There are clear expectations that survey data should be released and you have some explaining to do if you violate this expectation.

Great.

I see that both Langer Associates and D3 Systems are charter members of AAPOR’s Transparency Initiative.  So I have requested the Iraq polls listed on this page from Langer Associates. I’ll make an announcement on the blog when the data arrive.

Check out my New Article at STATS.org

Hello everybody.

Please have a look at this new article that has just gone up on STATS.org.

It is a compact exposition of the evidence of fabrication in public opinion surveys in Iraq as well as the threats and debates flowing from this evidence.

My current plan for the blog is to do one follow-up post on some material that was left on the cutting-room floor for the STATS.org article and then move on to other stuff…unless circumstances dictate a return to the Iraq polling issue.

Have a great weekend!

More Evidence of Fabrication in D3 Polls in Iraq: Part 2

On Tuesday I provided some eye-popping comparisons of one Iraq survey fielded by D3/KA with another Iraq survey fielded by a different company at exactly the same time.  In light of this evidence any reasonable person has to agree that the D3/KA data are fabricated.  Nevertheless, today I give you a different window into the same D3/KA survey.

Recall that one of the main markers of fabrication in these surveys is that the interviews overseen by what I'm calling the "focal supervisors" have too many "empty categories".  A response category is "empty" for a group of supervisors if it is offered as a possible choice but zero respondents actually chose it.  For example, in Part 1 of this series we saw that, for every public service, zero respondents for the focal supervisors said that the service was "unavailable" or that its availability was "very good".  These are, therefore, both empty categories for the focal supervisors.

Langer Research Associates tried to rationalize all the empties for the focal supervisors by arguing that other supervisors also have empties.  Langer Associates also argued that Steve Koczela and I were unfair to compare the group of focal supervisors with the group of all other supervisors.  This is because the number of empties should be decreasing in the total number of interviews, and the all-others group did more interviews than the focal group did.  Langer does have a point here, which I addressed in this post.  Here I follow up with a couple of pictures based on the same D3/KA survey discussed on Tuesday.

Each picture takes a bunch of different combinations of supervisors and, for each combination, plots the number of empties against the number of interviews.  The first plot graphs the data for 100 combinations of three supervisors plus the focals.  The second plot graphs the data for 100 combinations of four supervisors plus the focals.

[Figure: empties versus interviews, combinations of three supervisors]

[Figure: empties versus interviews, combinations of four supervisors]

You can see that:

  1. The number of empties is, indeed, decreasing in the number of interviews.
  2. Even after adjusting for this fact, the focal supervisors still have overwhelmingly more empties than they should have, given the number of interviews they conducted.
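For readers who want to try this kind of check themselves, here is a minimal sketch of the calculation behind the plots.  It assumes the survey sits in a pandas data frame with a `supervisor` column and a set of response columns stored as Categoricals that carry every category offered on the questionnaire; the column and group names are placeholders, not the actual variable names in the D3/KA file.

```python
import random
import pandas as pd

def count_empties(df, question_cols):
    """Count offered response categories that nobody in df actually chose."""
    empties = 0
    for col in question_cols:
        offered = set(df[col].cat.categories)   # categories offered on the questionnaire
        used = set(df[col].dropna().unique())   # categories someone actually chose
        empties += len(offered - used)
    return empties

def empties_vs_interviews(df, question_cols, group_size, n_draws=100, seed=1):
    """Empties and interview counts for random combinations of supervisors."""
    rng = random.Random(seed)
    supervisors = df["supervisor"].unique().tolist()
    rows = []
    for _ in range(n_draws):
        combo = rng.sample(supervisors, group_size)
        sub = df[df["supervisor"].isin(combo)]
        rows.append({"interviews": len(sub),
                     "empties": count_empties(sub, question_cols)})
    return pd.DataFrame(rows)

# Usage sketch (survey, question_cols and focal_ids are hypothetical placeholders):
# combos = empties_vs_interviews(survey, question_cols, group_size=3)
# focal = survey[survey["supervisor"].isin(focal_ids)]
# print(len(focal), count_empties(focal, question_cols))
# combos.plot.scatter(x="interviews", y="empties")
```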

More Evidence of Fabrication in D3 Polls in Iraq: Part 1

Veteran readers know that I have posted a lot on this subject, including here, here, here, here, here and here.

To recap, a bunch of Iraq polls fielded by D3 Systems and its partner KA Research Limited contain data that appear to be fabricated.  In particular, there is a list of supervisors who consistently preside over non-credible interviews.  Steve Koczela and I dubbed these the “focal supervisors” since we focused our attention on them in our original paper on this subject.

We have known for a long time that D3/KA fielded a large number of surveys in Iraq and that we only had access to a few of them.  This changed recently, when Steve's Freedom of Information request to the US State Department came through, providing us with a mass of new Iraq polls.  Some of these were fielded by D3/KA and some were fielded by other companies.  This embarrassment of riches enables all sorts of new tests and comparisons.  I have only scratched the surface of the gold, but I can report that the lack of credibility of the D3/KA data screams off the computer screen.

Let's take a peek at two polls that ask exactly the same questions and were both fielded in April of 2006, one by D3/KA and the other by a company called the Iraq Center for Research and Strategic Studies (ICRSS).

Before looking at some numbers it is worth asking why the State Department commissioned two different companies to administer an identical questionnaire simultaneously.  The only reason I can think of is that people in the State Department were suspicious of one of the companies.

In any case, for this short blog post let’s just look at one battery of questions on the availability of various services.  We compare the following two things:

  1. The ICRSS survey in the regions covered by the focal supervisors in the comparable D3/KA survey;
  2. The focal supervisors in the D3/KA survey.

Of course, the two surveys should yield roughly the same answers since I hold the zone fixed in the comparisons.

The questions take the following form:

Q3_1:  Please tell me whether the following services for your neighborhood [in the quarter in which you live] over the past month have been very good, good, poor, very poor or not available. … Water supply

The same question is then asked for electricity, telephone service, etc.

Have a scroll through the table below:

Water Supply
                 Focals   ICRSS Survey
Very Good             0            189
Good                  0            977
Poor                245            466
Very Poor           198            128
Not Available         0              3
Don't Know            0              0
NA                    0              8

Electricity Supply
                 Focals   ICRSS Survey
Very Good             0             11
Good                  0            224
Poor                245            626
Very Poor           198            822
Not Available         0             80
Don't Know            0              0
NA                    0              8

Telephone Service (land line)
                 Focals   ICRSS Survey
Very Good             0             71
Good                  0            608
Poor                245            433
Very Poor           198            571
Not Available         0             36
Don't Know            0             40
NA                    0             12

Telephone Service (mobile)
                 Focals   ICRSS Survey
Very Good             0            266
Good                  0           1105
Poor                245            185
Very Poor           198            142
Not Available         0             40
Don't Know            0             21
NA                    0             12

Garbage Collection
                 Focals   ICRSS Survey
Very Good             0             57
Good                  0            608
Poor                245            667
Very Poor           198            373
Not Available         0             53
Don't Know            0              0
NA                    0             13

Sewage Disposal
                 Focals   ICRSS Survey
Very Good             0             64
Good                  0            574
Poor                 91            662
Very Poor           352            370
Not Available         0             87
Don't Know            0              0
NA                    0             14

Conditions of Roads
                 Focals   ICRSS Survey
Very Good             0             26
Good                  0            532
Poor                148            769
Very Poor           295            388
Not Available         0             39
Don't Know            0              5
NA                    0             12

Traffic Management
                 Focals   ICRSS Survey
Very Good             0            111
Good                  0            834
Poor                245            505
Very Poor           198            207
Not Available         0             58
Don't Know            0             35
NA                    0             21

Police Presence
                 Focals   ICRSS Survey
Very Good             0            255
Good                217            948
Poor                 24            390
Very Poor           202            124
Not Available         0             23
Don't Know            0             10
NA                    0             16

Army Presence
                 Focals   ICRSS Survey
Very Good             0            250
Good                217            834
Poor                 24            371
Very Poor           202            171
Not Available         0            109
Don't Know            0             19
NA                    0             17

This is what your face looks like now:

[Image: a shocked face]
What????

In the D3/KA survey:

  • For six of the ten services exactly 245 respondents rate the availability as "poor" and exactly 198 rate it as "very poor".
  • In two of the four cases for which the split is not 245-198, the breakdown is exactly 217-24-202.
  • Despite the overwhelming preponderance of answers of "poor" and "very poor", nobody ever answers that a service is "unavailable".
  • There are zero answers of "very good" and "don't know."

The above points easily condemn the D3/KA survey to the dustbin of lies but it’s a piece of cake to come up with more.

  • For four services the most common answer is “good” for ICRSS yet zero people give this answer for D3/KA.
  • ICRSS always has some responses of “unavailable” and “very good” but D3/KA always has zero people giving these answers.
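For the statistically minded, here is a minimal sketch of how one might formalize the comparison for a single service, using a chi-square test of homogeneity on the Water Supply counts from the table above (scipy is assumed to be available; this is my own illustration, not a test from the original analysis):

```python
from scipy.stats import chi2_contingency

# Water Supply counts from the table above (substantive categories only,
# in the order: Very good, Good, Poor, Very poor, Not available).
focals = [0, 0, 245, 198, 0]
icrss = [189, 977, 466, 128, 3]

chi2, p, dof, _ = chi2_contingency([focals, icrss])
print(f"chi-square = {chi2:.0f}, dof = {dof}, p = {p:.1e}")
# The "Not available" cell is small, so the approximation is rough, but the
# conclusion does not hinge on it: the two response distributions are wildly
# different for what are supposed to be the same areas at the same time.
```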

This is not a judgement call.  It is blatantly obvious that the D3/KA data are fabricated.