Special Journal Issue on Fabrication in Survey Research

The Statistical Journal of the IAOS has just released a new issue with a bunch of articles on fabrication in survey research, a subject of great interest for the blog.

Unfortunately, most of the articles are behind a paywall but, thankfully, the overview by Steve Koczela and Fritz Scheuren is open access.  It’s a beautiful piece – short, sweet, wise and accurate.  Please read it.

Here are my comments.

Way back in 1945 the legendary Leo Crespi stressed the importance of what he called “the cheater problem.”  Although he did this in the flagship survey research journal, Public Opinion Quarterly, the topic has never become mainstream in the profession.  Many survey researchers seem to view the topic of fabrication as not really appropriate for polite company, akin to discussing the sexual history of a bride at her wedding.  Of course, this semi-taboo is convenient for cheaters.  Maria Konnikova has a great new book about confidence artists.  Much in the book is relevant to the subject of fabrication in survey research, but one point really stands out for me: a key reason why the same cons and the same con artists move seamlessly from mark to mark is that each victim is too embarrassed to publicize his or her victimization.

Discussions of fabrication that have occurred over the years have almost always focused on what is known as curbstoning, i.e., a single interviewer making up data.  (The term comes from an image of a guy sitting on a street curb filling out his forms.)  But this is just one type of cheating, and one of the great contributions of Koczela and Scheuren’s journal edition, and of the impressive series of prior conferences, is that they have substantially expanded the scope of the survey fabrication field.  Now we discuss fabrication by supervisors, principal investigators and the leaders of survey companies.  We now know that hundreds of public opinion surveys, especially surveys conducted in poor countries, are contaminated by widespread duplication and near duplication of single observations.  (This journal issue publishes the key paper on duplication.)
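Since duplication comes up repeatedly on this blog, here is a minimal sketch of the detection idea behind that key paper as I understand it: compute, for each interview, the largest share of identical answers it has with any other interview in the same survey (the maximum percent match).  The data layout and the flagging threshold below are my own illustrative assumptions, not anything prescribed by the paper.

```python
# Maximum-percent-match sketch for flagging near-duplicate interviews.
# Assumes one row per interview and one column per question; the 0.85
# threshold is an illustrative assumption.
import pandas as pd

def max_percent_match(responses: pd.DataFrame) -> pd.Series:
    values = responses.to_numpy()
    n = len(values)
    best = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                best[i] = max(best[i], (values[i] == values[j]).mean())
    return pd.Series(best, index=responses.index, name="max_percent_match")

# Hypothetical usage:
# df = pd.read_csv("some_survey.csv")          # one row per interview
# suspects = df[max_percent_match(df) > 0.85]  # threshold is an assumption
```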

Let me quote a bit from the to-do list of Koczela and Scheuren.

It does not only happen to small research organizations with fewer resources, as was previously believed [12].  Recent instances involve the biggest and most names in the survey research business, academia and the US Government.

This is certainly true, but I would add that reticence about naming names is crippling.  Yes, it’s useful to know that there are many dubious surveys out there, but guidance on which ones they are would be far more valuable.

An acknowledgement by the research community that data fabrication is a common threat, particularly in remote and dangerous survey environments would allow the community to be cooperative and proactive in preventing, identifying and mitigating the effects of fabrication.

This comment about remote and dangerous survey environments fits perfectly with my critiques of Iraq surveys including this one.

Given the perceived stakes, these discussions often result in legal threats or even legal action of various types.

Ummm….yes.

…the problem of fabrication is fundamentally one of co-evolution.  The more detection and prevention methods evolve, the more fabricators may evolve to stay ahead.  And to the extent we discover and confirm fabrication, we will never know whether we found it all, or caught only the weakest of the pack.  With these truths in mind, more work is needed in developing and testing statistical methods of fabrication detection.  This is made more difficult by the lack of training datasets, a problem prolonged by a general unwillingness to openly discuss data fabrication.

Again, I couldn’t agree more.

Technical countermeasures during fielding are less useful in harder to survey areas, which also happen to be the areas where the incentive to fabricate data is the highest. Many of the recent advances in field quality control processes focus on areas where technical measures such as computer audio recording, GPS, and other mechanisms can be used [6,13].

In remote and dangerous areas, where temptation to fabricate is the highest, technical countermeasures are often sparse [9]. And perversely, these are often the most closely watched international polls, since they often represent the hotspots of American interest and activity. Robbins and Kuriakose show a heavy skew in the presence of duplicate cases in non-OECD countries, potentially a troubling indicator. These polls conducted in remote areas often have direct bearing on policy for the US and other countries. To get a sense of the impact of the polls, a brief review of the recently released Iraq Inquiry, the so-called Chilcot report, contains dozens of documents that refer, in most cases uncritically, to the impact and importance of polls.

To be honest, Koczela and Scheuren do such a great job with their short essay that I’m struggling to add value here.  What they write above is hugely pertinent to all the work I’ve done on surveys in Iraq.

By the way, a response I sometimes get to my critiques of the notorious Burnham et al. survey of deaths in the Iraq war (see, for example, here, here and here) is that it is unreasonable to expect perfection for a survey operating in such a difficult environment.  Fair enough.  But then you have to concede that we cannot expect high-quality results from such a survey either.  If I were to walk in off the street and take Harvard’s PhD qualification exam in physics (I’m assuming they have such a thing….) it would be unreasonable to expect me to do well.  I just haven’t prepared for such an exam.  Fine, but that doesn’t somehow make me an authority on physics.  It just gives me a perfect excuse for not being such an authority.

Finally, Koczela and Scheuren provide a mass of resources that researchers can use to bring themselves to the frontier of the survey fabrication field.  Anyone interested in this subject needs to take a look at these resources.

Dispute Resolution by Mutual Maiming

I’m puzzled by the following sequence of events.  (This story has a very clear summary.)

  1. The UN issues a report entitled “Children and Armed Conflict”.  The report singles out quite a few groups for committing grave abuses against children.  The “Saudi Arabia-led Coalition” in the war in Yemen is on this UN blacklist.  The report fingers the Coalition for killing and maiming children and for attacking hospitals and schools.  (So far I’m not puzzled.)
  2. UN Secretary General Ban Ki-moon then announces that he is caving in to pressure and will remove Saudi Arabia from the UN blacklist:

“The report describes horrors no child should have to face,” Ban said at a press conference. “At the same time, I also had to consider the very real prospect that millions of other children would suffer grievously if, as was suggested to me, countries would defund many U.N. programs. Children already at risk in Palestine, South Sudan, Syria, Yemen, and so many other places would fall further into despair.”  (The quote is from the same summary story mentioned above.)

Ban stops just short of directly naming his blackmailer, but it’s obviously Saudi Arabia.

Of course, this story is sad and pathetic.  It would be nice to live in a world in which the UN can at least speak the truth and exert moral suasion upon belligerent parties to clean up their acts even if the UN cannot force good behaviour.  Unfortunately, we do not really live in this world.

But here’s the puzzle.  Why do the Saudis think they have accomplished something with their bullying censorship?

Saudi Arabia was named in an obscure report that is read by only a handful of specialists.  Suddenly the report is famous.   What’s the take-home message for people outside the Saudi inner circle?  Is it that the UN screwed up by naming the Saudi-led Coalition but that this mistake has now been corrected and the Saudis are finally getting the respect they deserve?  I don’t think so.

It’s as if a rape victim names her rapist but then recants, saying that he threatened to kill her unless she did so – the rapist then breathes a sigh of relief now that his good name has been cleared.

The only way I can make sense of the Saudi behaviour is to think of it as just a single move in a long game.  This time Saudi Arabia turns a black-hole report into a major news item, spiced up by Saudi blackmail.

But next time the UN will think twice before embarrassing the Saudis.

Maybe it makes sense that way.  But if so then we should always assume that the Saudis are behaving much worse than the self-censoring UN says they are.

PS (Two hours after posting) – Looking at this again I realize that my title is a little weird.  This is because I started with the title but then the ideas drifted while I wrote and by the end the connection between the post and the title became obscure.

For the record, the idea is that the dispute resolution harmed both parties.   Saudi Arabia comes off as a bully and blackmailer in addition to the original charge of abusing children.  The UN demonstrates that it can’t be trusted to speak the truth.  So, at least in the short run, both sides are damaged by the dispute.

Check out my New Article at STATS.org

Hello everybody.

Please have a look at this new article that has just gone up on STATS.org.

It is a compact exposition of the evidence of fabrication in public opinion surveys in Iraq as well as the threats and debates flowing from this evidence.

My current plan for the blog is to do one follow up post on some material that was left on the cutting room floor for the STATS.org article and then move on to other stuff….unless circumstances dictate a return to the Iraq polling issue.

Have a great weekend!

Langer Research Associates Responds: Part IV

This continues the stream of posts beginning here and continuing through here, here, here and here.

Today I had wanted to write on duplicates in the D3/KA Iraq surveys, but I’ve hit a little snag in the analysis so I will postpone that subject for now.

Instead, today I’ll cover empty categories, that is, answer choices that are offered to respondents but that nobody among some broad class of respondents actually picks.

We stressed these empty categories in our original paper, finding a number of questions for which no respondent interviewed by our flagged supervisors gave certain answers that at least some respondents of other supervisors did give.

Yesterday’s discussion of duplicates is actually relevant for understanding why having so many empty categories is suspicious.  People’s opinions are not cloned from one another.  Moreover, there is randomness in how people respond to questions and how these responses are recorded.  So we would expect a lot of natural variation in real answers to real survey questions.  We would not expect all responses to converge on just a few categories.
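To put a number on that intuition, here is a back-of-the-envelope calculation (my own illustration, not something from the original paper): if real respondents pick a given answer with probability p, the chance that none of n independent respondents picks it is (1 - p)^n, which collapses quickly as n grows.

```python
# Chance that an answer option is never chosen, under a simple independence
# assumption.  The probabilities and interview counts below are made up
# purely for illustration.
def prob_category_empty(p: float, n: int) -> float:
    return (1 - p) ** n

print(prob_category_empty(0.02, 332))  # ~0.001: even a rare answer should usually show up
print(prob_category_empty(0.05, 332))  # ~4e-8: a modestly popular answer almost surely appears
```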

Quick note – today I will merge together two things that we held separate in the original paper.  Back then we had a section on substantive responses, such as how much people approved of Prime Minister Maliki or whether or not people owned shortwave radios.  Then we had another section on the responses “don’t know” and “refused to answer”.  Here I simplify things by treating the two types of empty categories as interchangeable.

The Exhaustive Review made a good point on empty categories.  We always split our sample in two: the interviews conducted by the flagged supervisors (we called them “focal supervisors” in the paper) and the interviews conducted by all the other supervisors.  This method of splitting means that there were always two to three times as many interviews in the unflagged category as there were in the flagged category.  So maybe the excess of empty categories for the flagged supervisors is just a consequence of the lower number of interviews they conducted.  In particular, the Exhaustive Review points out that when you group interviews by individual supervisors, rather than by groups of supervisors as we did, you see that many supervisors have empty categories, not just the flagged ones.

This definitely merits further investigation, which I’m still doing.  However, I can report that a clear pattern has already emerged.  Once you adjust for the number of interviews, the flagged supervisors tend to produce roughly two to four times as many empty categories as the other supervisors do.

For example, in the January 2006 PIPA survey our flagged supervisors have a total of 240 empty categories in 332 interviews.  Two non-overlapping combinations of other supervisors, with 320 and 322 interviews, had 110 and 122 empty categories, respectively.  I did find a single supervisor who had 316 empty categories…. but on only 66 interviews.

The results are similar for other surveys.  More interviews do tend to reduce the count of missing categories but the flagged supervisors consistently rack up 2 to 4 times their share of empties relative to interview counts.
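For anyone who wants to replicate this kind of adjustment, here is a minimal sketch under assumed column names: a table with one row per interview, a "supervisor" column, one column per question, and a codebook listing every answer option each question offers.  It is not the exact code behind the figures above.

```python
# Count answer options a group of supervisors never records, scaled by the
# number of interviews the group conducted.  Column names and the codebook
# structure are assumptions for illustration.
import pandas as pd

def empty_categories_per_interview(df: pd.DataFrame, supervisors, codebook: dict) -> float:
    sub = df[df["supervisor"].isin(supervisors)]
    empties = sum(
        len(set(options) - set(sub[question].dropna().unique()))
        for question, options in codebook.items()
    )
    return empties / len(sub)

# Hypothetical usage, comparing the focal supervisors against the rest:
# flagged = [...]  # focal supervisor IDs
# others = [s for s in df["supervisor"].unique() if s not in flagged]
# empty_categories_per_interview(df, flagged, codebook)
# empty_categories_per_interview(df, others, codebook)
```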

So the Exhaustive Review has made a useful point that helps to improve the analysis of these surveys.  I just wish they had made the point openly back in 2011.  And this extension of the original approach does not weaken the evidence for fabricated data in the surveys.

Langer Research Associates Responds: Part III

This post is a continuation of this one and this one with further links to be found in the first two.

I’ll start with an important announcement.  Steve Koczela just had success with a Freedom of Information Request to the US State Department.  This means that he now has a mountain of new polling data from Iraq which he will be releasing in due course.

Some of these surveys were fielded by D3/KA, giving us a great chance to test our findings out of sample.  On top of that, there are some surveys fielded by another company, which provides an even better opportunity to get to the bottom of what has been going on in these polls.

I couldn’t resist having a look today at a D3/KA survey from 2006.

The survey has a battery of questions on the quality of public services.  I give the questions at the bottom of this post.  The possible answers are: “very good”, “good”, “poor”, “very poor”, “not available” and “don’t know”. Based on previous work I predict that supervisors 36, 43 and 44 are cheaters.  So I divide the sample into two pieces: the interviews of these supervisors and the interviews of all the other supervisors.

For the supervisors I predict to be cheaters, the most common answer on these questions is that services are “very poor”.  Not a single person says that services are “very good” or that they “don’t know”.  This is strange.  You’d expect that at least one of the 443 respondents would go for one of these answers, but let’s leave that aside.  Maybe these people are all very sure that they are receiving bad services.

Much more surprising is that not a single person says that a service is “not available”.  So, overwhelmingly, services are very bad or bad … but still available.  This is weird.  Don’t you think that at least a few of these dissatisfied customers would tick the worst box of all?

All boxes get ticked for the group of the other supervisors.  These supervisors did do 1,557 interviews so you could follow the Exhaustive Review and say their fuller coverage is down to their higher numbers.  In a future post I will explain why I am not convinced on this point.  But let’s leave this aside as well for today.

Instead, let’s look at correlations between answers to different questions.  For example, to what extent are people who are happy with their trash collection also happy with their landline service, etc.?
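Here is a rough sketch of how such a comparison can be computed, assuming the answers have been coded numerically (e.g. 1 = “very good” through 4 = “very poor”, with “not available” and “don’t know” treated as missing) and that the questions sit in hypothetical columns q3a through q3j; the flagged supervisor IDs come from the prediction above.

```python
# Pairwise Pearson correlations for the service battery, computed separately
# for the predicted cheaters and for everyone else.  Column names and the
# numeric coding are assumptions for illustration.
import pandas as pd
from itertools import combinations

questions = ["q3a", "q3b", "q3c", "q3d", "q3e",
             "q3f", "q3g", "q3h", "q3i", "q3j"]   # hypothetical column names
flagged = [36, 43, 44]                            # predicted cheaters

def pairwise_correlations(df: pd.DataFrame) -> pd.Series:
    return pd.Series({f"{a} vs {b}": df[a].corr(df[b])
                      for a, b in combinations(questions, 2)})

# Hypothetical usage:
# cheaters = pairwise_correlations(df[df["supervisor"].isin(flagged)])
# others   = pairwise_correlations(df[~df["supervisor"].isin(flagged)])
# print(pd.concat([cheaters, others], axis=1, keys=["flagged", "others"]))
```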

Here’s a list of the correlations on this battery of questions.  On the left are the interviews of the predicted cheaters and on the right are the interviews of all the others.

Predicted Cheaters All the Others
1.00 0.35
1.00 0.04
1.00 0.05
1.00 0.10
1.00 0.05
1.00 0.26
1.00 0.16
1.00 0.10
1.00 0.04
1.00 0.04
0.46 0.16
0.46 0.11
0.46 0.03
0.46 0.03
0.46 0.97
0.64 0.37
0.64 0.27
0.64 0.10
0.64 0.08
0.64 0.19
0.61 0.20
1.00 0.14
1.00 0.08
1.00 0.01
1.00 0.03
1.00 0.50
0.46 0.50
0.64 0.15
0.33 0.12
0.33 0.08
0.33 0.02
0.33 0.01
0.33 0.67
-0.04 0.66
0.24 0.15
0.33 0.68
0.33 0.06
0.33 0.03
0.33 0.00
0.33 0.00
0.33 0.05
-0.04 0.04
0.24 0.10
0.33 0.04
1.00 0.07

Look at all the perfect correlations of 1.00 for the predicted cheaters!  

Every time you see a 1.00 you should hear the sound of 443 people answering questions in perfect lock step with one another.  If you are slightly happier with your electricity than I am then you are also slightly happier with your water than I am…and also slightly happier about your landline…and slightly happier with your mobile, and your garbage collection….and traffic management in your area.

I didn’t make that up.  All of the above variables are perfectly correlated.  C’mon guys.  You’re making yourselves too easy to catch.

For the supervisors not flagged in advance as likely cheaters there is never a perfect correlation between two questions.  This is what we would expect in real interviews.

Your eye may have been drawn toward the very high correlation of 0.97 for the supervisors I haven’t flagged as suspicious.  But this is for garbage collection versus sewage disposal.  In fact, it makes sense that these two would be closely linked and the much weaker connection for the likely cheaters strikes me as further evidence that they made up their data.

Quoting again from the Exhaustive Review:

Examining expected correlations is a reasonable way to search for evidence of data fabrication; it’s very difficult for a fabricator to anticipate relationships among variables and fake data accordingly. We find, however, that the lack of correlations of the type that Koczela and Spagat document appears again to be an artifact of their groupings of supervisors. (We also note that we have examined many more correlations, 96 in total, than Koczela and Spagat report.)

I have to agree with the exhaustive reviewers here.  Looking at correlations is quite a good way to uncover fabrication. Indeed, the above table is strong evidence of fabrication.  However, I’m baffled by how results like these are supposed to be “artifacts” of groupings.  I honestly don’t know what to make of this comment.

Of course, the above table only contains 45 correlations.  With the five reported in the original paper, which the exhaustive reviewers did not attempt to explain, I’m still 46 shy of the exhaustive reviewers.  I guess I’ll have to work harder.

Remember that all the analysis in this post is of a new survey not covered in my original paper.  I was able to use the list of suspicious supervisors taken from the earlier paper to immediately find big correlation anomalies, among others, in a new dataset.  In other words, this is an out-of-sample success, and an easy one at that.

Finally, here is the list of questions in the battery:

Q3a-The following services for your neighborhood over the past month have been…Water Supply?
Q3b-The following services for your neighborhood over the past month have been…Electric Supply?
Q3c-The following services for your neighborhood over the past month have been…Telephone Service (land line)?
Q3d-The following services for your neighborhood over the past month have been…Telephone Service (Mobile)?
Q3e-The following services for your neighborhood over the past month have been…Garbage Collection?
Q3f-The following services for your neighborhood over the past month have been…Sewage Disposal?
Q3g-The following services for your neighborhood over the past month have been…Conditions of roads?
Q3h-The following services for your neighborhood over the past month have been…Traffic Management?
Q3i-The following services for your neighborhood over the past month have been…Police Presence?
Q3j-The following services for your neighborhood over the past month have been…Army Presence?

Langer Research Associates Responds: Part II

Yesterday I gave some background on this discussion.

Today I will start working through the “Exhaustive Review” conducted by Langer Research Associates (hereafter, “LRA” to save space) of this paper.  (The lawyer for D3 Systems called the LRA report an “exhaustive review.”  Well, he’s a lawyer, so he must be right, so I’ll stick with his name.)

LRA opens by saying that I have posted online “false accusations” about their company.  I would ask LRA to specify what they are referring to.  I will be happy to apologize and correct any false accusation I have made.  I have already made one such correction on my blog and am confident that I will have to do this again.

Steve Koczela and I found suspicious patterns in the interviews that certain supervisors presided over in five polls that were conducted in Iraq.  We called these the “focal” supervisors since we focused on them in our paper.

The Exhaustive Review is of four different Iraq polls.  The connection between our work and LRA’s is that both sets of polls were fielded in Iraq by the same companies at around the same time.  Moreover, the Exhaustive Review makes clear that some of our “focal supervisors” appear in the four polls they analyze, so, clearly, we can learn about one set of polls from studying the other.

LRA stresses on their first page that they “have some documentary evidence at hand”:

KA/D3’s work for us in Iraq included delivery of interviewer and supervisor journals describing their field work experiences, and photos of field work as it occurred. Our review finds that we have both journals and photos of field work from the areas where Koczela and Spagat suggest that field work did not occur.

First, to be clear, it is possible that some field work did actually occur in the areas of the focal supervisors.  In the data I have there is evidence (see tomorrow’s post) for a large number of duplicated or nearly duplicated interviews.  Focal supervisors may have presided over some legitimate interviews and then done duplication with minor changes to evade detection.  (Readers, please let me know if you see me anywhere stating flatly that no field work occurred, and I will correct myself.)

Second, having pictures and journals is, indeed, positive for LRA.  Certainly, if the focal supervisors had refused to supply such required evidence while all the other supervisors complied, that would look like quite a smoking gun in the hands of the focal supervisors.

Still, anyone fabricating data would have to be incredibly stupid to fail to comply with a requirement to deliver journals and pictures.  LRA gives no indication that their Exhaustive Review extended to examining these pictures and journals.  Certainly it is possible to fake such documents.

LRA should release these journals and photos along with their data.  But for now, have a look at this description of these documents for one poll, drawn from here.  Bear in mind that the focal supervisors operated in Anbar, Baghdad and Diyala.

In addition to keeping field notes, teams carried cameras to take photos of interviews when the respondents agreed. The pictures underscore the wide range of Iraq’s population, with some respondents in Western garb, down to a knotted tie; others in traditional clothes such as the hijab (veil) and dishdasha (flowing robes).

Notably, all the photos from Anbar and Baghdad are from the neck down; no respondent in either of these provinces consented to have their faces shown, an indication itself of security concerns there. In other areas – notably the far more secure Kurdish north – respondents smiled genially for the camera.

Nonetheless, even in Anbar, where insularity is high and resentment over the U.S. invasion seethes, an interviewer reported, “I have noticed that the respondents answered very seriously and were not afraid to tell me their answers to these questions.”

Hmmm….complete unanimity on neck-down photos and bland statements about how successful the interviews were.  This doesn’t bowl me over.

Langer Research Associates Responds: Part I

Avid readers of this blog know that back in  2011 I wrote a paper with Steve Koczela finding evidence of fabrication in polls fielded in Iraq by D3 Systems and KA Research Limited. (You can find a trail of links here.)

We tried to open a dialogue with D3 and other interested parties, but D3 responded in the time-honored fashion of those standing on firm intellectual ground.


In other words, they threatened to sue us.

The D3 threat was backed by a secret analysis by Langer Research Associates which “exhaustively reviewed the Subject Document and conclusively determined that it is false, misleading and asserts facts and conclusions which are incorrect.”  At the time I asked to see the Exhaustive Review but got no response.  Finally, it has appeared on the Langer Research Associates website.

I will address the substance of the Exhaustive Review in follow-up posts.  For now I’ll just set the table with some points that are obvious but could, nevertheless, be overlooked.


  1.  I have placed our data in the public domain.  Loyal readers: please have a look.  In fact, if you’re not a loyal reader and would like to make me look stupid, here’s your chance.  Go for it!
  2. Langer Research Associates has not risen to the challenge: they have not released their data.  The strategy of suppressing relevant data is consistent with the strategy of using the legal system to suppress analysis of related data.
  3. The Exhaustive Review is not so exhaustive as to include any analysis of the polls we analyzed in our original paper.  (I know this is totally confusing but the link between the Langer data and our data is that both sets of polls were fielded in Iraq by D3 and KA.)
  4. Some of our data were already in the public domain and we would have gladly provided the rest to the exhaustive reviewers although they would have had to communicate with us in order to achieve this result.
  5. It is better to resolve research disputes with dialogue rather than with legal threats.