New Paper on Accounting for Civilian War Casualties

Hello everybody.

The radio silence was much longer than intended but blog posts should start coming fast and furious now.  I’ve got a lot I want to get off my chest as soon as possible.

Let’s get the ball rolling with a new paper I have with Nicholas Jewell and Britta Jewell.  (Well, to be honest, it isn’t really a brand new paper but it’s newly accepted at a journal and we’re now putting it into the public domain.)

I dare say that this paper is a very readable introduction to civilian casualty recording and estimation, that is, to most of the subject matter of the blog.  I hope you will all have a look.

And, please, send in your comments.

More soon…..

PS – Here is an alternative link to the paper in case the first one doesn’t work for you.

 

Mismeasuring Deaths in Iraq: Addendum on Confidence Interval Calculations


In my last post I used a combination of bootstrapping and educated guesswork to find  confidence intervals for violent deaths in Iraq based on the data from the Roberts et al. survey.  (The need for guesswork arose because the authors have not been forthcoming with their data.)

Right after this went up a reader contacted me and asked whether the bottom of one of these confidence intervals can go below 0.

The short answer is “no” with the bootstrap method.  This technique can only take us down to 0 and no further.

Explanation

With bootstrapping we randomly select from a list of 33 clusters.  Of course, none of these clusters experienced a negative number of violent deaths. So 0 is the smallest possible count we can get for violent deaths in any simulation sample.  (In truth, the possibility of pulling 33 0’s is more theoretical than real.  This didn’t happen in any of my 1,000 draws of 33.)
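Here is a minimal sketch of this point (not the authors’ code; it uses my assumed allocation of violent deaths across the 33 clusters, i.e., 18 0’s, 7 1’s, 7 2’s and 1 52):

```python
# A minimal sketch, not the authors' code: resample the 33 cluster counts
# with replacement and check the floor.  The allocation below is my assumption
# (18 zeros, 7 ones, 7 twos and the 52-death Fallujah cluster).
import random

clusters = [0] * 18 + [1] * 7 + [2] * 7 + [52]

random.seed(1)
totals = [sum(random.choices(clusters, k=33)) for _ in range(1000)]

# Every cluster count is non-negative, so no resampled total -- and hence no
# bootstrap estimate -- can fall below 0.
print(min(totals))   # the theoretical floor is 0; an all-zero draw never actually appears
```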

Nevertheless, it turns out that if we employ the most common methods for calculating confidence intervals (not bootstrapping) then the bottom of the interval does dip below 0 when the dubious Fallujah cluster is included.

Here’s a step-by-step walk-through of the traditional method applied to the Roberts et al. data; a short code sketch reproducing these steps follows the list.  (I will assume that violent deaths are allocated across the 33 clusters as 18 0’s, 7 1’s, 7 2’s and 1 52.)

  1. Compute the mean number of violent deaths per cluster.  This is 2.2.  An indication that something is screwy here is the fact that the mean is bigger than the number of violent deaths in 32 out of the 33 clusters.  At the same time the mean is way below the number of violent deaths in the Fallujah cluster (52).  Note that without the Fallujah cluster the mean becomes 0.7, i.e., eliminating Fallujah cuts the mean by more than a factor of 3.
  2. Compute the sample standard deviation which is a measure of how strongly the number of violent deaths varies by cluster.  This is 9.0.  Note that if we eliminate the Fallujah cluster then the sample standard deviation plummets by more than a factor of 10, all the way down to 0.8.  This is just a quantitative expression of the obvious fact that the data are highly variable with Fallujah in there.  Note further that the big outlier observation affects the standard deviation more than it affects the mean.
  3. Adjust for sample size.  We do this by dividing the sample standard deviation by the square root of the sample size.  This gives us 1.6.  Here the idea is that you can tame the variation in the data by taking a large sample.  The larger the sample size the more you tame the data.  However, as we shall see, the Fallujah cluster makes it impossible to really tame the data with a sample of only 33 clusters.
  4. Unfortunately, the last step is mysterious unless you’ve put a fair amount of effort into studying statistics.  (This, alone, is a great reason to prefer bootstrapping which is very intuitive.)  Our 95% confidence interval for the mean number of violent deaths per cluster is, approximately, the average plus or minus 2 times 1.6, i.e., -1.0 to 5.4.  There’s the negative lower bound!
  5. We can translate from violent deaths per cluster to estimated violent deaths by multiplying by 33 and again by 3,000.  We end up with -100,000 to 530,000.  (I’ve been rounding at each step.  If, instead, I don’t round until the very end I get -90,000 to 530,000….this doesn’t really matter.)  Note that without Fallujah we get a confidence interval of 30,000 to 90,000 which is about what we got with bootstrapping.
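Here is the promised sketch of steps 1 through 5 (again using my assumed allocation of deaths across clusters, not the authors’ data file):

```python
# A minimal sketch of the textbook calculation above; the cluster allocation
# is my assumption (18 zeros, 7 ones, 7 twos and the 52-death Fallujah cluster).
import statistics

clusters = [0] * 18 + [1] * 7 + [2] * 7 + [52]

mean = statistics.mean(clusters)            # step 1: about 2.2 violent deaths per cluster
sd = statistics.stdev(clusters)             # step 2: sample standard deviation, about 9.0
se = sd / len(clusters) ** 0.5              # step 3: about 1.6 after the sample-size adjustment
low, high = mean - 2 * se, mean + 2 * se    # step 4: roughly -1.0 to 5.4 deaths per cluster

# step 5: scale up to estimated violent deaths (33 clusters, ~3,000 deaths per in-sample death)
print(low * 33 * 3000, high * 33 * 3000)    # roughly -90,000 to 530,000
```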

Have we learned anything here other than that I respond to reader questions?

I don’t think we’ve learned much, if anything, about violent deaths in Iraq.  We already knew that the Roberts et al. data, especially the Fallujah observation, are questionable and maybe the above calculation reinforces this view a little bit.

But, mostly, we learn something about the standard method for calculating confidence intervals; when the data are wild this method can give incredible answers.  Of course, a negative number of violent deaths is not credible.

There is an intuitive reason why the standard method fails with the Roberts et al. data; it forces a symmetric estimate onto highly asymmetric data.  Remember we get 2.2 plus or minus 3.2 average violent deaths per cluster.  The plus or minus means that the confidence interval is symmetric.  The Fallujah observation forces a wide confidence interval which has to go just as wide on the down side as it is on the up side.  In some sense the method is saying that if it’s possible to find a cluster with 52 violent deaths then it also must be possible to find a cluster with around -52 violent deaths.  But, of course, no area of Iraq  experienced -52 violent deaths.  So you wind up with garbage.

Part of the story is also the small sample size. With twice as many clusters, but the same sort of data, the lower limit would only go down to about 0.

It’s tempting to just say “garbage in, garbage out” and, up to a point, this is accurate.   But the bigger problem is that the usual method for calculating confidence intervals is not appropriate in this case.

Mismeasuring War Deaths in Iraq: Confidence Interval Calculations

We return again to the Roberts et al. paper.

In part 5 of my postings on the Chilcot Report I promised to discuss the calculations of confidence intervals underlying these claims:

One standard calculation method (bootstrapping) leads to a central estimate of 210,000 violent deaths with a 95% confidence interval of around 40,000 to 600,000.  However, if you remove the Fallujah cluster the central estimate plummets to 60,000 with a 95% confidence interval of 40,000 to 80,000.  (I’ll give details on these calculations in a follow-up post.)

I have to start with some caveats.

Caveat 1:  No household data – failure to account for this makes confidence intervals too narrow

As we know the authors of the paper have not released a proper dataset.  To do this right I would need to have violent deaths by household but the authors are holding this information back.  Thus, I have to operate at the cluster level.  This shortcut suppresses household-level variation which, in turn, constricts the widths of the confidence intervals I calculate.  It’s possible to get a handle on the sizes of these  effects but I won’t go there in this blog post.

Caveat 2: Violent deaths are not broken down by cluster – confidence intervals depend on how I resolve this ambiguity

Roberts et al. don’t provide us with all the information we need to proceed optimally at the cluster level either since they don’t tell us the number of violent deaths in each of their 33 clusters.  All they say in the paper (unless I’ve missed something) is that the Fallujah cluster had 52 violent deaths and the other 32 clusters combined had 21 violent deaths spread across 14 clusters.  I believe this is the best you can do although maybe a clever reader can mine the partial striptease to extract a few more scraps of information on how the 21 non-Fallujah violent deaths are allocated across clusters.

This ambiguity leaves many possibilities.  Maybe 13 clusters had one violent death and one cluster had the remaining eight.  Or maybe ten clusters had one death, three clusters had two deaths and the last cluster had five violent deaths.  Etc.

To keep things simple I’ll consider just four scenarios.  The first is that there are 18 clusters with 0 deaths, 7 clusters with 1 death, 7 clusters with 2 deaths and the Fallujah cluster with 52 deaths.  The second is that there are 18 clusters with 0 deaths, 13 clusters with 1 death, 1 cluster with 8 deaths and the Fallujah cluster with 52 deaths. The third and fourth scenarios are the same as the first and second except that the latter two toss out the Fallujah cluster.

Caveat 3: There is a first stage to the sampling procedures that tosses out 6 governorates – failure to account for this makes the confidence intervals too narrow.  (I already alluded to this issue in this post.)

I quote from the paper:

During September, 2004, many roads were not under the control of the Government of Iraq or coalition forces. Local police checkpoints were perceived by team members as target identification screens for rebel groups.  To lessen risk to investigators, we sought to minimise travel distance and the number of Governorates to visit, while still sampling from all regions of the country. We did this by clumping pairs of Governorates. Pairs were adjacent Governorates that the Iraqi study team members believed to have had similar levels of violence and economic status during the preceding 3 years.

Roberts et al. randomly selected one governorate from each pair, visited only the selected governorates and ignored the non-selected ones.  So, for example, Karbala and Najaf were a pair.  In the event Karbala was selected and the field teams never visited Najaf.  In this way Dehuk, Arbil, Tamin, Najaf, Qadisiyah and Basrah were all eliminated.

This is not OK.

The problem is that the random selection of 6 governorates out of 12 introduces variation into the measurement system that should be, but isn’t, built into the confidence intervals calculated by Roberts et al.  This problem makes all the confidence intervals in the paper too narrow.

It’s worth understanding this point well so I offer an example.

Suppose I want to know the average height of students in a primary school consisting of  grades 1 through 8.  I get my estimate by taking a random sample of 30 students and averaging their heights.  If I repeat the exercise by taking another sample of 30 I’ll get a different estimate of average student height.  Any confidence interval for this sampling procedure will be based on modelling how these 30-student averages vary across different random samples.

Now suppose that I decide to save effort by streamlining my sampling procedure.  Rather than taking a simple random sample of 30 students from the whole school I first choose a grade at random and then randomly select 30 students from this grade.  This is an attractive procedure because now I don’t have to traipse around the whole school measuring only one or two students from each class.  Now I may be able to draw my sample from just one or two classrooms. This procedure is even unbiased, i.e., I get the right answer on average.

But the streamlined procedure produces much more variation than the original one does.  If, at the first stage, I happen to select the 8th grade then my estimate for the school’s average height will be much higher than the actual average height.  If, on the other hand, I select the 1st grade then my estimate will be much lower than the actual average. These two outcomes balance each other out (the unbiasedness property).  But the variation in the estimates across repeated samples will be much higher under the streamlined procedure than it will be under the original one.  A proper confidence interval for the streamlined procedure will need to be wider than a proper confidence interval for the original procedure will be.
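Here is a toy simulation of the school example, with made-up heights, just to make the variance comparison concrete (the numbers are illustrative assumptions, not data from any survey):

```python
# An illustrative simulation with invented numbers: compare a simple random
# sample of 30 students with the "streamlined" pick-a-grade-first procedure.
import random
import statistics

random.seed(0)

# Hypothetical school: 100 students per grade, heights rising with grade.
school = {g: [random.gauss(110 + 7 * g, 6) for _ in range(100)] for g in range(1, 9)}
all_students = [h for heights in school.values() for h in heights]

simple, streamlined = [], []
for _ in range(2000):
    # Original procedure: 30 students at random from the whole school.
    simple.append(statistics.mean(random.sample(all_students, 30)))
    # Streamlined procedure: pick one grade at random, then 30 students from it.
    grade = random.choice(list(school))
    streamlined.append(statistics.mean(random.sample(school[grade], 30)))

# Both procedures are right on average (unbiased)...
print(round(statistics.mean(simple), 1), round(statistics.mean(streamlined), 1))
# ...but the streamlined estimates vary far more across repeated samples.
print(round(statistics.stdev(simple), 1), round(statistics.stdev(streamlined), 1))
```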

Analogously, the confidence intervals of Roberts et al. need to account for their first-stage randomization over governorates.  Since they don’t do this all their confidence intervals are too narrow.

Unfortunately, this problem is thornier than it may appear to be at first glance.  The only way to correct the error is to incorporate information about what would have happened in the excluded governorates if they had actually been selected.  But since these governorates were not selected the survey itself supplies no useful information to fill this gap.  We could potentially address this issue by importing information from outside the system but I won’t do this today.  So I, like Roberts et al., will just ignore this problem which means that my confidence intervals, like theirs, will be too narrow.

OK, enough with the caveats.  I just need to make one more observation and we’re ready to roll.

Buried deep within the paper there is an assumption that the 33 clusters are “exchangeable”. This technical term is actually crucial.  In essence, it means that each cluster can potentially represent any area of Iraq. So if, for example, there was a cluster in Missan with 2 violent deaths then if we resample we can easily find a cluster in Diala just like it, in particular having 2 violent deaths.   Of course, this exchangeability assumption implies that there is nothing special about the fact that the cluster with 52 violent deaths turned out to be in Fallujah.  Exchangeability implies that if we sample again we might well find a cluster with 52 deaths in Baghdad or Sulaymaniya.  Exchangeability seems pretty far fetched when we think in terms of the Fallujah cluster but if we leave this aside the assumption is strong but, perhaps, not out of line with many assumptions researchers tend to make in statistical work.

We can now implement an easy computer procedure to calculate confidence intervals (a code sketch follows the list):

  1. Select 1,000 samples, each one containing 33 clusters.  These samples of 33 clusters are chosen at random (with replacement) from the list of 33 clusters given above (18 0’s, 7 1’s, 7 2’s and 1 52).  Thus, an individual sample can turn out to have 33 0’s or 33 52’s although both of these outcomes are very unlikely (particularly the second one.)
  2. Estimate the number of violent deaths for each of the 1,000 samples.  As I noted in a previous post we can do this in a rough and ready way by multiplying the total number of deaths in the sample by 3,000.
  3. Order these 1,000 estimates from smallest to largest.
  4. The lower bound of the 95% confidence interval is the estimate in position 25 on the list.  The upper bound is the estimate in position 975.  The central estimate is the estimate at position 500.
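Here is a minimal sketch of that procedure (using my assumed allocation of deaths across the 33 clusters, not the authors’ data):

```python
# A minimal sketch of the resampling procedure above, not the authors' code;
# the cluster allocation (18 zeros, 7 ones, 7 twos, one 52) is my assumption.
import random

clusters = [0] * 18 + [1] * 7 + [2] * 7 + [52]

random.seed(42)
estimates = []
for _ in range(1000):                        # step 1: 1,000 resamples of 33 clusters
    sample = random.choices(clusters, k=33)  # with replacement
    estimates.append(sum(sample) * 3000)     # step 2: rough scale-up to violent deaths
estimates.sort()                             # step 3: order the estimates

# step 4: positions 25, 500 and 975 on the ordered list
low, mid, high = estimates[24], estimates[499], estimates[974]
print(low, mid, high)                        # roughly 40,000 / 220,000 / 550,000
```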

Following these procedures I get a confidence interval of 40,000 to 550,000 with a central estimate of 220,000.  (I’ve rounded all numbers to the nearest 10,000 as it seems ridiculous to have more precision than that.)  Notice that these numbers are slightly different from the ones at the top of the post because I took 1,000 samples this time and only 100 last time.  So these numbers supersede the earlier ones.

We can do the same thing without the Fallujah cluster.  Now we take samples of 32 from a list with 18 0’s, 7 1’s and 7 2’s.  This time I get a central estimate of 60,000 violent deaths with a 95% confidence interval of 40,000 to 90,000.

Next I briefly address caveat 2 above by reallocating the 21 violent deaths that are spread over 14 clusters in an indeterminate way.  Suppose now that we have 13 clusters with one violent death and 1 cluster with 8 violent deaths.  Now the estimate that includes the Fallujah cluster becomes 210,000 with a confidence interval of 30,000 to 550,000.  Without Fallujah I get an estimate of 60,000 with a range of 30,000 to 120,000.

Caveats 1 and 3 mean that these intervals should be stretched further by an unknown amount.

Here are some general conclusions.

  1. The central estimate for the number of violent deaths depends hugely on whether Fallujah is in or out.  This is no surprise.
  2. The bottoms of the confidence intervals do not depend very much on whether Fallujah is in or out.  This may be surprising at first glance but not upon reflection.  The sampling simulations that include Fallujah have just a 1/33 chance of picking Fallujah at each draw.  Many of these simulations will not choose Fallujah in any of their 33 tries.  These will be the low-end estimates.  So at the low end it is almost as if Fallujah never happened.  These sampling outcomes correspond with reality.  In three subsequent surveys of Iraq nothing like that Fallujah cluster ever appeared again. It really seems to have been an anomaly.
  3. The high-end estimates are massively higher when Fallujah is included than they are when it isn’t.  Again, this makes sense since some of the simulations will pick the Fallujah cluster two or three times.

 

Mismeasuring War Deaths in Iraq: The Partial Striptease

I now continue the discussion of the Roberts et al. paper that I started in my series on the Chilcot Report.  This is a tangent from Chilcot so I’ll hold this post and its follow-ups outside of that series.

Les Roberts never released a proper data set for his survey.  Worse, the authors are sketchy on important details in the paper, leaving us to guess on some key issues.  For example, in his report on Roberts et al. to the UK government Bill Kirkup wrote:

The authors provide a reasonable amount of detail on their figures in most of the paper.  They do, however, become noticeably reticent when it comes to the breakdown of deaths into violent and non-violent, and the breakdown of violent deaths into those attributed to the coalition and those due to terrorism or criminal acts, particularly taking into account the ‘Fallujah problem’…

Roberts et al. claim that “air strikes from coalition forces accounted for most violent deaths” but Kirkup points out that without the dubious Fallujah cluster it’s possible that the coalition accounted for less than half of the survey’s violent deaths.

Kirkup’s suspicion turns out to be correct.

However, you need to look at this email from Les Roberts to a blog to settle the issue.  It turns out that coalition air strikes account for 6 of the 21 violent deaths outside Fallujah, with 4 further deaths attributed to coalition forces using other weapons.

My primary point here is about data openness rather than about coalition air strikes.  Roberts et al. should just show their data rather than dribbling it out in dribs and drabs into the blogosphere.

Roberts gives another little top up here.  (I give that link only to document my source.  I recommend against ploughing through this Gish Gallop by Les Roberts.)  Buried deep inside a lot of nonsense Roberts writes:

The Lancet estimate [i.e. Roberts et al.], for example, assumes that no violent deaths have occurred in Anbar Province; that it is fair to subtract out the pre-invasion violence rate; and that the 5 deaths in our data induced by a US military vehicles are not “violent deaths.”

Hmmm…..5 deaths caused by US military vehicles.

Recall that each death in the sample yields around 3,000 estimated deaths. This translates into 15,000 estimated deaths caused by US military vehicles – nearly 30 per day for a year and a half.  There have, unfortunately, been a number of Iraqis killed by US military vehicles.  Iraq Body Count (IBC) has 110 such deaths in its database during the period covered by the Roberts et al. survey.  I’m sure that IBC hasn’t captured all deaths in vehicle accidents but, nevertheless, the 15,000 figure looks pretty crazy.

Again I come back to my main point – please just give us a proper dataset rather than a partial striptease.  Meanwhile, I can’t help thinking Roberts et al. are holding back on the data because it contains more embarrassments that we don’t yet know about.

PS – After providing the above quote I feel obligated to debunk it further.

  1. Roberts writes that his estimate omits deaths in Anbar Province (which contains Fallujah).  But many claims in his paper are only true if you include Anbar (Fallujah).  Indeed, this very blog post opened with one such claim.  We see that Fallujah is in for the purpose of saying that most violent deaths were caused by coalition airstrikes but Fallujah is out when it’s time to talk about how conservative the estimate is because it omits Fallujah.  Call this the “Fallujah Shell Game”.  (See the comments of Josh Dougherty here.)
  2. Roberts suggests that he bent over backwards to be fair by omitting pre-invasion violent deaths from his estimate.  But, first of all, there was only one such death so it hardly makes a difference whether this one is in or out.  Second, it’s hard to understand what the case would be for blaming a pre-invasion death on the invasion.

 

Comments Down Below!

Hello everybody.

This is just a quick note to say that there were some interesting comments that appeared on my last two posts on Chilcot (here and here).  I’ve just replied to both.

While I’m at it I have a question for Bill Kirkup (who made one of the comments).  Can he give us a little briefing on how death certificates have been handled in post-invasion Iraq?

Actually, I have a number of specific questions on this subject. I’d be happy to switch to email to clear these up and then post a summary if that works best (m.spagat@rhul.ac.uk).

Chilcot on Civilian Casualties: Part 5

This post continues my coverage of the three reports (one, two, three) written by UK government experts on the Roberts et al. 2004 article claiming that the 2003 invasion of Iraq caused a very large number of deaths.  According to the abstract of the paper:

We estimate that 98,000 more deaths than expected (8,000-194,000) happened after the invasion outside of Falluja and far more if the outlier Falluja cluster is included…Violent deaths were widespread, reported in 15 of 33 clusters, and were mainly attributed to coalition forces.  Most individuals reportedly killed by coalition forces were women and children.

Here’s some useful background.

Iraq Body Count (IBC) had already documented the violent deaths of nearly 20,000 civilians by the time the Roberts et al. paper was released.  So it was already clear that the war had caused a very large number of civilian deaths. The civilians chapter of the Chilcot Report does not suggest to me that this fact triggered deep concern within the UK government.  But the Roberts et al. paper produced a shock which I attribute mainly to its headline-grabbing figure of 100,000.

The 100,000 estimate is not directly comparable to IBC’s 20,000 count because 100,000 refers to excess deaths, i.e., violent plus non-violent deaths of civilians plus combatants beyond a baseline level, whereas IBC records only violent deaths of civilians.  There is also a phenomenally wide confidence interval of 8,000 to 194,000 surrounding the 100,000  estimate which severely complicates any comparison with another source.

Despite all these ambiguities, media coverage tended to present the Roberts et al. results as reliably demonstrating in the prestigious scientific journal, The Lancet, that the war had caused 100,000 violent deaths of civilians.  This Guardian article is typical of much misleading media coverage.  There is no mention of a confidence interval, and the excess-death estimate is portrayed as a violent-death estimate, which is then presented as civilians-only when, in fact, the estimate mixes combatants with civilians. Such media attention further upped the ante on the 100,000 figure, making it still harder to ignore.

Roberts et al. conducted a “cluster survey”.   Specifically, they selected 33 locational points in Iraq and interviewed a bunch of close neighbours at each place.  Households located so close to one another are likely to have similar violence experiences.  So it’s probably more useful to view the sample as 33 data points, one for each cluster, rather than as roughly 1,000 data points, one for each household.

This is a tiny sample.

To get a handle on the sample-size problem consider some pertinent simulations I ran a few years back on some Iraq violence data. These show just how easy it is to overestimate violent deaths by factors of 2, 3 or more when you only have around 30 clusters. By the same token, surveys of this size can easily fail to detect a single violent death even when these surveys are conducted within very violent environments.


The problem with using a mere 33 clusters to measure war violence is intuitive.  Interviewers can easily stumble onto a few unusually violent hot spots and overestimate the average level of violence by a wide margin.   On the other hand, researchers can just as easily draw a qualitatively different kind of sample consisting of 33 peaceful islands.
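A toy simulation, with invented numbers rather than real Iraq data, shows how both failure modes arise with only 33 clusters:

```python
# An illustrative toy simulation (my own invented numbers): when violence is
# concentrated in a few hot spots, 33-cluster surveys swing wildly.
import random
import statistics

random.seed(7)

# Hypothetical country of 3,000 locations; 5% are hot spots with 20 violent
# deaths each, the rest have none, so the true average is 1 death per location.
locations = [20] * 150 + [0] * 2850

ratios = []
for _ in range(2000):
    sample = random.sample(locations, 33)          # one 33-cluster survey
    ratios.append(statistics.mean(sample) / 1.0)   # estimated / true violence level

print(round(max(ratios), 1))           # some surveys overestimate several-fold
print(sum(r == 0 for r in ratios))     # hundreds of surveys find no violent deaths at all
```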

The Roberts et al. survey seems to have landed on a super turbo-charged version of this small-sample issue.  They found a total of 21 violent deaths in 32 of their clusters, i.e., less than one death per cluster.  Yet they reported no fewer than 52 violent deaths in their 33rd cluster, in Fallujah.

Such a sample yields estimates that are all over the place depending on your assumptions.  One standard calculation method (bootstrapping) leads to a central estimate of 210,000 violent deaths with a 95% confidence interval of around 40,000 to 600,000.  However, if you remove the Fallujah cluster the central estimate plummets to 60,000 with a 95% confidence interval of 40,000 to 80,000.  (I’ll give details on these calculations in a follow-up post.)

In short, there is no reliable way to create a stable estimate out of the Roberts et al. data. We would like to have an estimate that is robust to whether or not we include the extreme Fallujah outlier.  Alas, the usual methods are highly sensitive to whether the wild Fallujah observation is in or out.


Given this background I’m at a loss to explain how Sir Roy Anderson can describe the Roberts et al. methodology as “robust”.  In fact, he invokes the r-word in two successive sentences.  Yet extreme sensitivity to outliers is one of the main characteristics that earns estimates the label “non-robust”.

Sir Roy notices that the sample is small but goes nowhere from this starting point.  He seems unaware that war violence tends to cluster heavily at some locations.  Indeed, he  did not even read the Roberts et al. paper carefully enough to discern that their sample displays this pattern in spades.

Sir Roy swings and misses in another, more subtle, way.  He points, rightly, to a key measurement problem with the Roberts et al. methodology – how do we know that households reporting deaths really did suffer these reported deaths?  He notes that Roberts et al. try to defuse this issue by checking death certificates.  However, they only check for death certificates in a small non-random sample of their reported deaths and 20% of these checks were failures. So there is plenty of room to question the veracity of many of the reported deaths in the survey.

This is a good catch for Sir Roy but he doesn’t then ascend to the next level.  Suppose that out of the households that did not experience a violent death a mere 1% are recorded as having one anyway.  Since there must be more than 900 such households, this error rate would generate around 9 falsely reported deaths.  These false reports would then translate into about 27,000 estimated violent deaths.  Thus, a small rate of “false positives” can inflate the number of estimated deaths quite substantially, creating another non-robustness issue for the Roberts et al. methodology.

Someone might respond that we don’t have to worry about “false positives” because there will also be “false negatives”, i.e., households that experienced real deaths that somehow don’t get recorded.  However, this view is wrong because the situation is fundamentally asymmetric.  If roughly 50 households experienced violent deaths and 1% of these failed to report these deaths then we’d expect to miss only 0 or 1 real deaths this way.  So a 1% false negative rate will deflate an estimate by much less than a 1% false positive rate will inflate the same estimate.
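Here is the arithmetic of that asymmetry in a few lines of code, using the rough numbers above (more than 900 households with no violent death, roughly 50 with one, and about 3,000 estimated deaths per in-sample death):

```python
# A back-of-the-envelope sketch of the false-positive/false-negative asymmetry,
# using the rough numbers quoted in the post.
non_death_households = 900    # "more than 900" households with no violent death
death_households = 50         # "roughly 50 households experienced violent deaths"
per_death_scale = 3000        # each in-sample death adds ~3,000 to the estimate
error_rate = 0.01             # a 1% reporting error in each direction

false_positive_deaths = error_rate * non_death_households   # about 9 spurious reports
false_negative_deaths = error_rate * death_households       # about 0.5 missed real deaths

print(false_positive_deaths * per_death_scale)   # inflates the estimate by roughly 27,000
print(false_negative_deaths * per_death_scale)   # deflates it by only about 1,500
```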

(Alert readers will have noticed that I just described the base rate fallacy.  See the last slides of this presentation for more details.)

To summarize, Sir Roy wasted the small amount of effort he invested in his report.

Creon Butler at least had a serious go at evaluating the Roberts et al. paper, managing to notice some important new points that eluded Sir Roy.  I list the better ones here.  First on the positive side of the ledger is that Butler at least mentions the crucial Fallujah cluster.  Second, he correctly questions whether the sample is genuinely random.  Butler notes, in particular, that:

  1. The Fallujah field team did not follow the survey’s official randomization methodology when they  selected that cluster.
  2. Six of Iraq’s 18 governorates were excluded from the sample, although Butler thinks this was OK since they were randomly excluded.

Third, Butler draws attention to the preposterously wide confidence interval in the estimate for excess deaths – 8,000 to 194,000.  Fourth, Butler realizes, rightly, that the Roberts et al. figures for violent deaths suggest that hospitals should have received vastly more injured people than the figures of the Iraqi Ministry of Health (MoH) suggest they actually received.

Despite these strengths the Butler report is still weak.  As noted in post number 4 all three expert reports, including Butler’s, missed some central problems with the Roberts et al. paper.  Beyond that, Butler is strangely tolerant of the weaknesses he finds. Here are a few examples:

  1. He knows that the Fallujah field team violated the sampling protocols and then recorded a tremendous outlier observation that was then excluded from the main estimate published in the paper.  But it never seems to occur to him that such a serious data quality issue in one cluster could signal a deeper data quality problem affecting other clusters.
  2. He notices, but immediately shies away from, a weird aspect of the sampling scheme.  Twelve governorates are divided into six pairs with one governorate from each pair selected randomly for sampling and the other one excluded from the sample.  At a stretch we can view this as an acceptable way to claim national coverage for the survey while actually excluding 6 governorates from the sample.  But to do this legitimately you need to build this source of random variation into your confidence interval.  Roberts et al. don’t do this.  So even the gargantuan confidence interval of 8,000 to 194,000 is actually too narrow.
  3. Butler does a bad job of quantifying his point about injuries.  For example, he should have mentioned that the MoH recorded 15,517 injuries during the last 6 months covered by the survey.  Roberts et al. have something like 56 violent deaths during this period which translates into around 170,000 estimated violent deaths. Assuming a rule of thumb of 3 injuries per death one could predict 500,000 injuries, a number which exceeds the MoH figure by more than a factor of 30 (see the sketch after this list).  Note, moreover, that people with serious injuries should almost always put in an appearance at a hospital.  So there is really something to explain here.  Yet Butler pretty much lets this discrepancy pass.
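Here is the injury comparison Butler could have made, spelled out with the numbers above:

```python
# A sketch of the injury comparison, using the figures quoted in the post.
violent_deaths_in_sample = 56               # roughly, over the last 6 months of the survey
estimated_violent_deaths = violent_deaths_in_sample * 3000   # about 170,000
predicted_injuries = estimated_violent_deaths * 3            # rule of thumb: about 500,000
moh_injuries = 15517                        # Ministry of Health count for the same period

print(predicted_injuries / moh_injuries)    # the survey implies more than 30 times the MoH figure
```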

To summarize, in an era of grade inflation Creon Butler gets a gentleman’s pass.

Bill Kirkup of the Department of Health wrote the most perceptive UK government analysis although his paper is marred by one big error.  Here are some of his strong points:

  1.  He spots the absurdity of the confidence interval and grasps the magnitude of the problem – “A confidence interval this large makes the meaning of the estimate difficult to interpret. This point has been largely ignored in media reporting.”
  2. He is aware of what he calls the “patchy distribution of violence” in war and he realizes that this feature makes the survey’s 33 sampling points precious few.  He connects the reported results for Fallujah with this patchiness issue.  (You might say this is obvious but the other two experts missed it.)
  3. He identifies an annoying tendency for Roberts et al. to make detailed claims about types of deaths, e.g., the percent of all deaths accounted for by coalition air strikes, without providing numerical tables sufficiently detailed to flesh out these claims.  It appears that some such claims rely on data from the dubious Fallujah outlier, and we would like to know whenever this is the case.  But it is often hard to be sure of such dependence on Fallujah without more information.  This is information that the authors could easily have supplied but chose not to.  Such “reticence”, as Kirkup puts it, does not inspire confidence.
  4. He realizes that the arrival of a survey team into a neighbourhood will draw the attention of local dangerous individuals.  These violent thugs will pressure people to answer the survey questions in ways that further their agendas.  Such local dynamics decrease the reliability of the data and place the survey’s interviewees at risk.  (I blogged recently on this issue in a similar context.)

Kirkup’s big error is that, somehow, he estimates that only 23,000 of the 98,000 estimated excess deaths (outside Fallujah) were violent deaths.  But a very easy back-of-the-envelope calculation shows that such a low number can’t possibly be right.  (21 violent deaths in the sample, around 8,000 people in the sample in a country of around 24 million – 24,000,000 x 21/8,000 gives around 60,000 violent deaths).  This mistake messes up Kirkup’s report substantially.
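The back-of-the-envelope check takes just a few lines:

```python
# Kirkup's 23,000 figure fails this simple check (numbers from the post).
population = 24_000_000            # rough population of Iraq
people_in_sample = 8_000           # rough sample size
violent_deaths_in_sample = 21      # violent deaths outside Fallujah

print(population * violent_deaths_in_sample / people_in_sample)   # about 60,000 violent deaths
```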

Nevertheless, I still think that Kirkup delivered the best report because he alone grasps the fundamental low quality of the Roberts et al. paper.

Where does this leave us?

First, I’d like to soften my criticism of the government evaluators a little bit.  In this post and in the last one I’ve tried to impose a ground rule of evaluating the evaluators based only on information that was available to them when they did their reports.  (I’ll drop this straitjacket in the next post in this series.)  But it is hard to maintain this discipline and I’m sure that I’ve allowed myself to benefit in certain ways from some hindsight knowledge.  In addition, I’m sure these guys were under pressure to produce lots of stuff really fast so they couldn’t make every project into their best work.

That said, the casualties of war are a vital issue.  So the UK government should have allowed its analysts the space, and perhaps the outside consultants, they needed to give their work on civilian casualties its due.  (Of course, this applies even more strongly to the US government which has avoided a Chilcot-type enquiry in the first place.)

Finally, I’d like to give a sense of what I think a good report would have looked like.  Here’s a provisional list of key points:

  1.  We already knew that thousands of people are dying because of the Iraq war.
  2.  We should track these deaths closely and, more importantly, use the tracking data to figure out ways to save lives.  (I can’t find anything in the Chilcot Report to suggest that anyone in the government was thinking about this.)
  3. The Roberts et al. paper doesn’t change this picture qualitatively but it does suggest that people could be dying at far greater rates in the war than anyone has previously suggested.
  4. However, the Roberts et al. methodology is extremely weak and unreliable (see the technical appendix to this report) so we shouldn’t count on it except possibly on points that can be corroborated from other sources.
  5. Nevertheless, we should request the detailed data from this project and also from  Iraq Body Count and see whether we can learn something helpful from them.
  6. We should issue a public statement saying that we are not convinced by the Roberts et al. study at this moment but we have requested the data and are looking into it.  Meanwhile, we are very concerned about civilian casualties in Iraq and are working hard to reduce them.
  7. Point 6 should be reality, not just a public relations position.

Chilcot on Civilian Casualties: Part 4

In October of 2004 The Lancet published a paper by Roberts et al. that estimated the number of excess deaths for the first year and a half of the Iraq war using data from a new survey they had just conducted.  (Readers wanting a refresher course on the concept of excess deaths  can go here.)

One of the best parts of the civilian casualties chapter of the Chilcot report is the front-row seat it provides for the (rather panicked) discussion that Roberts et al. provoked within the UK government.  Here the real gold takes the form of links to three separate reviews of the paper provided by government experts.  The experts are Sir Roy Anderson of the first report, Creon Butler of the second report and Bill Kirkup, CBE of the third report.

In the next several posts I will evaluate the evaluators.  I start by relying largely on information that was available when they made their reports.  But I will, increasingly, take advantage of hindsight.

For orientation I quote the “Interpretation” part of the Summary of Roberts et al.:

Making conservative assumptions, we think that about 100,000 excess deaths, or more have happened since the 2003 invasion of Iraq.  Violence accounted for most of the excess deaths and airstrikes from coalition forces accounted for most violent deaths.  We have shown that collection of public-health information is possible even during periods of extreme violence.  Our results need further verification and should lead to changes to reduce non-combatant deaths from air strikes.

The UK government reaction focused exclusively, so far as I can tell, on the question of how to respond to the PR disaster ensuing from:

  1.  The headline figure of 100,000 deaths which was much bigger than any that had been seriously put forward before.
  2. The claim that the Coalition was directly responsible for most of the violence.  (Of course, one could argue that the Coalition was ultimately responsible for all violence since it initiated the war in the first place but nobody in the government took such a position.)

Today I finish with two important points that none of the three experts noticed.

First, the field work for the survey could not have been conducted as claimed in the paper.  The authors write that two teams conducted all the interviews between September 8 and September 20, i.e., in just 13 days.  There were 33 clusters, each containing 30 households. This means that each team had to average nearly 40 interviews per day, often spread across more than a single sampling point (cluster).  These interviews had to be on top of travelling all over the country, on poor roads with security checkpoints, to reach the 33 clusters in the first place.
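The workload arithmetic is simple:

```python
# The interview workload implied by the paper's own numbers.
clusters = 33
households_per_cluster = 30
teams = 2
field_days = 13                    # September 8 to September 20

interviews = clusters * households_per_cluster         # 990 interviews in total
print(interviews / (teams * field_days))               # roughly 38 interviews per team per day
```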

To get a feel for the logistical challenge that faced the field teams consider this picture of the sample from a later, and much larger, survey – the Iraq Living Conditions Survey:

ILCS Sample

I know the resolution isn’t spectacular on the picture but I still hope that you can make out the blue dots.  There are around 2,200 of them, one for each cluster of interviews in this survey.

Now imagine choosing 33 of these dots at random and trying to reach all of them with two teams in 13 days.  Further imagine conducting 30 highly sensitive interviews (about deaths of family members) each time you make it to one of the blue points.  If a grieving parent asks you to stay for tea do you tell them to just answer your questions because you need to move on instantly?

The best-case scenario is that the field teams cut corners with the cluster selection to render the logistics possible and then raced through the interviews at break-neck speed (no more than 10 minutes per interview).  In other words, the hope is that the teams succeeded in taking bad measurements of a non-random sample (which the authors then treat as random).  But, as Andrew Gelman reminds us, accurate measurement is hugely important.

The worst-case scenario is that field teams simplified their logistical challenges by making up their data.  Recall that data fabrication is widespread in surveys done in poor countries.  Note, also, that the results of the study were meant to be released before the November 2 election in the US and the field work was completed only on September 20; so slowing down the field work to improve quality was not an option.

Second, no expert picked up on the enormous gap between the information on death certificates reported in the Roberts et al. paper and the mortality information the Iraqi Ministry of Health (MoH) was releasing at the time.  A crude back-of-the-envelope calculation (sketched in code after the list below) reveals the immense size of this inconsistency:

  1.  The population of Iraq was, very roughly, 24 million and the number of people in the sample is reported as 7,868.  So each in-sample death translates into about 3,000 estimated deaths (24,000,000/7,868).  Thus, the 73 in-sample violent deaths become an estimate of well over 200,000 violent deaths.
  2. Iraq’s MoH reported 3,858 violent deaths between April 5, 2004 and October 5, 2004, in other words a bit fewer than 4,000 deaths backed by MoH death certificates.  The MoH has no statistics prior to April 5, 2004 because their systems were in disarray before then (p. 191 of the Chilcot chapter).
  3. Points 1 and 2 together imply that death certificates for violent deaths should have been present only about 2% of the time (4,000/200,000).
  4. Yet Roberts et al. report that their field teams tried to confirm 78 of their recorded deaths by asking respondents to produce death certificates and that 63 of these attempts (81%) were successful.
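Here is that back-of-the-envelope comparison in code form, using the figures above:

```python
# A sketch of the death-certificate gap, using the numbers quoted in the post.
population = 24_000_000
sample_size = 7_868
scale = population / sample_size                 # about 3,000 estimated deaths per in-sample death

violent_deaths_in_sample = 73
estimated_violent_deaths = violent_deaths_in_sample * scale    # well over 200,000

moh_certificates = 3_858                         # MoH violent deaths, April-October 2004
print(moh_certificates / estimated_violent_deaths)   # expected confirmation rate: about 2%
print(63 / 78)                                        # rate reported by Roberts et al.: about 81%
```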

The paper makes clear that the selection of the 78 cases wasn’t random and it could be that death certificate coverage is better for non-violent deaths than it is for violent deaths.

Still……

There is a big, yawning, large, humongous, massive gap between 2% and 81% and something has to give.


Here are the only resolution possibilities I can think of:

  1.  The MoH issued vastly more (i.e., 50 times more) death certificates  for violent deaths than it has admitted to issuing.  This seems far fetched in the extreme.
  2. The field teams for Roberts et al. fabricated their death certificate confirmation figures.  This seems likely, especially since the paper reports:

Interviewers were initially reluctant to ask to see death certificates because this might have implied they did not believe the respondents, perhaps triggering violence.  Thus, a compromise was reached for which interviewers would attempt to confirm at least two deaths per cluster.

Compromises that pressure interviewers to risk their lives are not promising and can easily lead to data fabrication.

  3. The survey picked up too many violent deaths.  I think this is true and we will return to this possibility in a follow-up post but I don’t think that this can be the main explanation for the death certificate gap.

OK, that’s enough for today.

In the next post I’ll discuss more what the expert reports actually said rather than what they didn’t say.