We return again to the Roberts et al. paper.

In part 5 of my postings on the Chilcot Report I promised to discuss the calculations of confidence intervals underlying these claims:

One standard calculation method (bootstrapping) leads to a central estimate of 210,000 violent deaths with a 95% confidence interval of around 40,000 to 600,000. However, if you remove the Fallujah cluster the central estimate plummets to 60,000 with a 95% confidence interval of 40,000 to 80,000. (I’ll give details on these calculations in a follow-up post.)

I have to start with some caveats.

*Caveat 1: No household data – failure to account for this makes confidence intervals too narrow*

As we know, the authors of the paper have not released a proper dataset. To do this right I would need violent deaths broken down by household, but the authors are holding this information back. Thus, I have to operate at the cluster level. This shortcut suppresses household-level variation which, in turn, makes the confidence intervals I calculate too narrow. It’s possible to get a handle on the size of this effect but I won’t go there in this blog post.

*Caveat 2: Violent deaths are not broken down by cluster – confidence intervals depend on how I resolve this ambiguity*

Roberts et al. don’t give us all the information we need to proceed optimally at the cluster level either, since they don’t tell us the number of violent deaths in each of their 33 clusters. All they say in the paper (unless I’ve missed something) is that the Fallujah cluster had 52 violent deaths and that the other 32 clusters combined had 21 violent deaths spread across 14 clusters. I believe this is the best you can do, although maybe a clever reader can mine the partial striptease to extract a few more scraps of information on how the 21 non-Fallujah violent deaths are allocated across clusters.

This ambiguity leaves many possibilities. Maybe 13 clusters had one violent death and one cluster had the remaining eight. Or maybe ten clusters had one death each, three clusters had two deaths each and the last cluster had five. Etc.

To keep things simple I’ll consider just four scenarios. The first is that there are 18 clusters with 0 deaths, 7 clusters with 1 death, 7 clusters with 2 deaths and the Fallujah cluster with 52 deaths. The second is that there are 18 clusters with 0 deaths, 13 clusters with 1 death, 1 cluster with 8 deaths and the Fallujah cluster with 52 deaths. The third and fourth scenarios are the same as the first and second except that they toss out the Fallujah cluster.

*Caveat 3: There is a first stage to the sampling procedures that tosses out 6 governorates – failure to account for this makes the confidence intervals too narrow. (I already alluded to this issue in this post.)*

I quote from the paper:

During September, 2004, many roads were not under the control of the Government of Iraq or coalition forces. Local police checkpoints were perceived by team members as target identification screens for rebel groups. To lessen risk to investigators, we sought to minimise travel distance and the number of Governorates to visit, while still sampling from all regions of the country. We did this by clumping pairs of Governorates. Pairs were adjacent Governorates that the Iraqi study team members believed to have had similar levels of violence and economic status during the preceding 3 years.

Roberts et al. randomly selected one governorate from each pair, visited only the selected governorates and ignored the non-selected ones. So, for example, Karbala and Najaf were a pair. In the event Karbala was selected and the field teams never visited Najaf. In this way Dehuk, Arbil, Tamin, Najaf, Qadisiyah and Basrah were all eliminated.

This is not OK.

The problem is that the random selection of 6 governorates out of 12 introduces variation into the measurement system that should be, but isn’t, built into the confidence intervals calculated by Roberts et al. This problem makes all the confidence intervals in the paper too narrow.

It’s worth understanding this point well so I offer an example.

Suppose I want to know the average height of students in a primary school consisting of grades 1 through 8. I get my estimate by taking a random sample of 30 students and averaging their heights. If I repeat the exercise by taking another sample of 30 I’ll get a different estimate of average student height. Any confidence interval for this sampling procedure will be based on modelling how these 30-student averages vary across different random samples.

Now suppose that I decide to save effort by streamlining my sampling procedure. Rather than taking a simple random sample of 30 students from the whole school I first choose a grade at random and then randomly select 30 students from this grade. This is an attractive procedure because now I don’t have to traipse around the whole school measuring only one or two students from each class. Now I may be able to draw my sample from just one or two classrooms. This procedure is even unbiased, i.e., I get the right answer on average.

But the streamlined procedure produces much more variation than the original one does. If, at the first stage, I happen to select the 8th grade then my estimate of the school’s average height will be much higher than the actual average. If, on the other hand, I select the 1st grade then my estimate will be much lower than the actual average. These two outcomes balance each other out (the unbiasedness property). But the *variation* in the estimates across repeated samples will be much higher under the streamlined procedure than under the original one. A proper confidence interval for the streamlined procedure therefore needs to be wider than one for the original procedure.
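The variance inflation from this two-stage shortcut is easy to see in a small simulation. Here is a minimal Python sketch with an invented school of grades 1–8 (all names and numbers below are mine, purely for illustration):

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical school: grades 1-8 with 100 students each, and mean
# height rising by grade (all numbers here are invented for illustration).
school = {g: [rng.gauss(115 + 6 * g, 5) for _ in range(100)] for g in range(1, 9)}
all_students = [h for grade in school.values() for h in grade]

def original_estimate():
    # Original procedure: a simple random sample of 30 from the whole school.
    return statistics.mean(rng.sample(all_students, 30))

def streamlined_estimate():
    # Streamlined procedure: pick one grade at random, then 30 students from it.
    grade = rng.choice(list(school))
    return statistics.mean(rng.sample(school[grade], 30))

original = [original_estimate() for _ in range(2000)]
streamlined = [streamlined_estimate() for _ in range(2000)]

# Both procedures are (nearly) unbiased, but the streamlined estimates
# vary far more across repeated samples, so its proper CI must be wider.
spread_original = statistics.stdev(original)
spread_streamlined = statistics.stdev(streamlined)
```

Running this, the two procedures give almost the same average estimate, but the spread of the streamlined estimates is several times larger.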

Analogously, the confidence intervals of Roberts et al. need to account for their first-stage randomization over governorates. Since they don’t do this all their confidence intervals are too narrow.

Unfortunately, this problem is thornier than it may appear to be at first glance. The only way to correct the error is to incorporate information about what would have happened in the excluded governorates if they had actually been selected. But since these governorates were not selected the survey itself supplies no useful information to fill this gap. We could potentially address this issue by importing information from outside the system but I won’t do this today. So I, like Roberts et al., will just ignore this problem which means that my confidence intervals, like theirs, will be too narrow.

OK, enough with the caveats. I just need to make one more observation and we’re ready to roll.

Buried deep within the paper there is an assumption that the 33 clusters are “exchangeable”. This technical term is actually crucial. In essence, it means that each cluster can potentially represent any area of Iraq. So if, for example, there was a cluster in Missan with 2 violent deaths then if we resample we can easily find a cluster in Diala just like it, in particular having 2 violent deaths. Of course, this exchangeability assumption implies that there is nothing special about the fact that the cluster with 52 violent deaths turned out to be in Fallujah. Exchangeability implies that if we sample again we might well find a cluster with 52 deaths in Baghdad or Sulaymaniya. Exchangeability seems pretty far-fetched when we think in terms of the Fallujah cluster but, if we leave this aside, the assumption is strong but perhaps not out of line with many assumptions researchers tend to make in statistical work.

We can now implement an easy computer procedure to calculate confidence intervals:

- Select 1,000 samples, each one containing 33 clusters. These samples of 33 clusters are drawn at random (with replacement) from the list of 33 clusters given above (18 0’s, 7 1’s, 7 2’s and 1 52). Thus, an individual sample can turn out to have 33 0’s or 33 52’s, although both of these outcomes are very unlikely (particularly the second one).
- Estimate the number of violent deaths for each of the 1,000 samples. As I noted in a previous post we can do this in a rough and ready way by multiplying the total number of deaths in the sample by 3,000.
- Order these 1,000 estimates from smallest to largest.
- The lower bound of the 95% confidence interval is the estimate in position 25 on the list. The upper bound is the estimate in position 975. The central estimate is the estimate at position 500.
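The steps above can be sketched in a few lines of Python. This is a minimal percentile-bootstrap implementation for the first scenario (the function and variable names are my own; the ×3,000 scale-up is the rough-and-ready factor mentioned earlier):

```python
import random

def bootstrap_ci(clusters, n_samples=1000, scale=3000, seed=1):
    """Percentile bootstrap over cluster-level violent-death counts."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_samples):
        # Draw a sample of len(clusters) clusters with replacement.
        sample = rng.choices(clusters, k=len(clusters))
        # Scale the sample's total deaths up to a national estimate.
        estimates.append(sum(sample) * scale)
    estimates.sort()
    # Positions 25, 500 and 975 of the sorted list of 1,000 estimates.
    return estimates[24], estimates[499], estimates[974]

# Scenario 1: 18 zeros, 7 ones, 7 twos, plus the Fallujah cluster (52).
clusters = [0] * 18 + [1] * 7 + [2] * 7 + [52]
low, central, high = bootstrap_ci(clusters)
```

With the Fallujah cluster removed, the same function can simply be called on the 32-cluster list `[0] * 18 + [1] * 7 + [2] * 7`.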

Following this procedure I get a confidence interval of **40,000 to 550,000 with a central estimate of 220,000.** (I’ve rounded all numbers to the nearest 10,000 as it seems ridiculous to have more precision than that.) Notice that these numbers are slightly different from the ones at the top of the post because I took 1,000 samples this time and only 100 last time. So these numbers supersede the earlier ones.

We can do the same thing **without the Fallujah cluster**. Now we take samples of 32 from a list with 18 0’s, 7 1’s and 7 2’s. This time I get a central estimate of **60,000 violent deaths with a 95% confidence interval of 40,000 to 90,000**.

Next I briefly address caveat 2 above by reallocating the 21 violent deaths that are spread over 14 clusters in an indeterminate way. Suppose now that we have **13 clusters with one violent death and 1 cluster with 8 violent deaths.** Now the estimate that includes the Fallujah cluster becomes **210,000 with a confidence interval of 30,000 to 550,000**. Without Fallujah I get an estimate of **60,000 with a range of 30,000 to 120,000**.
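As a self-contained illustration, the same percentile-bootstrap sketch can be rerun under this alternative allocation (again, all names here are mine):

```python
import random

rng = random.Random(2)

def percentile_ci(clusters, n_samples=1000, scale=3000):
    # Resample the clusters with replacement, scale each sample's total
    # deaths up by 3,000, and read off positions 25, 500 and 975.
    estimates = sorted(sum(rng.choices(clusters, k=len(clusters))) * scale
                       for _ in range(n_samples))
    return estimates[24], estimates[499], estimates[974]

# Alternative allocation: 13 clusters with 1 death, 1 cluster with 8.
with_fallujah = [0] * 18 + [1] * 13 + [8] + [52]
without_fallujah = [0] * 18 + [1] * 13 + [8]

ci_with = percentile_ci(with_fallujah)
ci_without = percentile_ci(without_fallujah)
```

Lumping eight deaths into a single cluster adds cluster-level variation, which is why the intervals widen somewhat relative to the first allocation.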

Caveats 1 and 3 mean that these intervals should be stretched further by an unknown amount.

Here are some general conclusions.

- The central estimate for the number of violent deaths depends hugely on whether Fallujah is in or out. This is no surprise.
- The bottoms of the confidence intervals do not depend very much on whether Fallujah is in or out. This may be surprising at first glance but not upon reflection. The sampling simulations that include Fallujah have just a 1/33 chance of picking the Fallujah cluster on each draw. Many of these simulations will not choose Fallujah in any of their 33 draws, and these will be the low-end estimates. So at the low end it is almost as if Fallujah never happened. These sampling outcomes correspond with reality: in three subsequent surveys of Iraq nothing like the Fallujah cluster ever appeared again. It really seems to have been an anomaly.
- The high-end estimates are massively higher when Fallujah is included than they are when it isn’t. Again, this makes sense since some of the simulations will pick the Fallujah cluster two or three times.
