Suppose you and I take the same public opinion survey. What are the chances that we give the same answers to a whole bunch of questions, say 50 or more? Somewhere between slim and none, I'd say, unless we belong to the same cult.
In fact, if I answered the same survey questions on two successive days my answers wouldn't match exactly, especially on issues I'm pretty ignorant of, such as how the fire department is performing in my county. Further randomness would be injected into the responses if someone else wrote them down and a third person typed them into a computer. So even if there is someone out there who agrees with me on everything, it's still unlikely that we'd be recorded as agreeing on everything in a long public opinion survey.
In addition, it's barely more likely that we would agree on all but a few questions in a long survey. In other words, near duplicates are almost as improbable as full duplicates. But how near does a pair of interviews have to be before it starts looking suspicious?
Enter Noble Kuriakose and Michael Robbins with important new research directly addressing this question. Their work has been well covered, so you can go here, here, here or, to get it straight from the horse's mouth, here.
Kuriakose and Robbins looked at hundreds of large public opinion surveys and found it rare for two interviews to match on more than 85% of the questions, although this 85% level should be treated as a guide rather than as a magic threshold. They also found that in legitimate survey data the percent matches across interviews tend, more or less, to follow something known as a Gumbel distribution, a key property of which is that it has a single peak.
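To make the basic quantity concrete, here is a minimal Python sketch (not the Kuriakose–Robbins code, and the data are invented for illustration) of computing, for each interview, its maximum percent match with any other interview in the dataset. It is the distribution of these maxima that their method examines.

```python
# Illustrative sketch only: compute each interview's maximum percent
# match against every other interview. Toy data, not real survey data.

def percent_match(a, b):
    """Share of questions on which two interviews give identical answers."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def max_matches(interviews):
    """For each interview, its highest percent match with any other interview."""
    out = []
    for i, a in enumerate(interviews):
        best = max(percent_match(a, b)
                   for j, b in enumerate(interviews) if j != i)
        out.append(best)
    return out

# Four toy interviews answering five questions each.
interviews = [
    [1, 2, 3, 1, 2],
    [1, 2, 3, 1, 3],   # matches the first interview on 4/5 = 80% of questions
    [3, 1, 2, 2, 1],
    [2, 3, 1, 3, 2],
]
print(max_matches(interviews))  # → [0.8, 0.8, 0.0, 0.2]
```

In a clean dataset, a histogram of these maxima should show a single peak well below 100%; a second bump near 100% is the warning sign of duplicated or near-duplicated interviews.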
Kuriakose and Robbins have written a nice Stata programme that checks for duplication and near duplication in surveys. Overall this is a really nice piece of work.
Tomorrow I will show you results I have from applying this programme to our favourite Iraq polling data.