The Statistical Estimates of Peru’s Truth and Reconciliation Commission are Really Bad: Part 3

This post continues my series on Silvio Rendon’s rejoinder to the reply of Daniel Manrique-Vallier and Patrick Ball (MVB) to Silvio Rendon’s critique of the statistical work of Peru’s Truth and Reconciliation Commission (TRC).

OK that’s complicated so how about this?

The present series covers a second paper by Rendon that definitively shreds the TRC’s estimates.

Today we cover yet another dimension of unacceptably low quality in the TRC’s work.  Many of the TRC’s Shining Path (SP) models predict that some lists, or combinations of lists, used by the TRC record negative numbers of people killed.

You must be doing a double take right now. So let’s go back to basics on how the TRC’s estimates actually work.

The data for each of the TRC’s 58 geographical strata consists of three lists of people killed, plus information on the overlaps across these lists.  In particular, the TRC’s SP data for each stratum gives the number of deaths appearing on:

  • list A alone
  • list B alone
  • list C alone
  • lists A and B but not list C
  • lists A and C but not list B
  • lists B and C but not list A
  • lists A, B and C

The goal is to estimate the number of people in each stratum who were killed by the SP but who didn’t make it onto any of these three lists.

(Note – Here I’m only discussing estimates of SP-caused deaths because MVB have not contested Rendon’s estimates for deaths attributed to the State, which are about 40% higher than the TRC’s estimates.)

There are two key steps to the estimation:

  1. Fit a model to the above 7 data-points.
  2. Extrapolate this model to an estimate of unrecorded deaths.  This second step is enabled by an assumption that the deaths not recorded on any list are just like the deaths that are recorded on a list – except for the obvious difference that the unrecorded deaths are…well…unrecorded.  (Both steps are sketched in the code just below.)
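
To make the two steps concrete, here is a minimal sketch in Python, assuming a simple independence-style log-linear model and entirely made-up cell counts.  The TRC and Rendon fit models of this general family, but their model choices and their data differ from this illustration.

```python
# A minimal sketch of the two-step procedure.  Cell counts are invented; they
# are NOT the TRC's data for any stratum, and the independence model is only
# one of the log-linear models that could be fitted.
import numpy as np
import statsmodels.api as sm

cells = {
    (1, 0, 0): 40,  # on list A only
    (0, 1, 0): 25,  # on list B only
    (0, 0, 1): 30,  # on list C only
    (1, 1, 0): 12,  # on A and B but not C
    (1, 0, 1): 15,  # on A and C but not B
    (0, 1, 1): 8,   # on B and C but not A
    (1, 1, 1): 5,   # on all three lists
}

patterns = np.array(list(cells.keys()), dtype=float)   # 7 x 3 indicator matrix
counts = np.array(list(cells.values()), dtype=float)

# Step 1: fit a log-linear (Poisson) model.  Main effects only = "independent
# lists"; richer models add interaction columns and use up degrees of freedom.
X = sm.add_constant(patterns)
fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()

# Step 2: extrapolate the fitted model to the unobservable (0, 0, 0) cell.
x_missing = np.array([[1.0, 0.0, 0.0, 0.0]])  # constant only, no list indicators
unrecorded = float(np.exp(x_missing @ fit.params)[0])
print(round(unrecorded), round(counts.sum() + unrecorded))
```

The saturated model, by contrast, would spend all seven degrees of freedom reproducing the observed cells exactly, which is precisely the overfitting problem discussed in post 1 of this series.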

Here is a short video that explains the danger of extrapolating beyond your data range as this estimation method requires.  The discussion is set in the different context of linear regression models but the main ideas still apply.  The extrapolation method is likely to give inaccurate results if deaths not appearing on any list are qualitatively different from deaths appearing on at least one list.  In fairness, however, the same risk applies to Rendon’s estimates so let’s just set it aside for the rest of this post.

Each model of the 7 data-points in a stratum predicts the 7 data-points themselves as well as the unknown data-point (deaths not on any list).  These predictions should not exactly equal the true recorded figures; that would constitute extreme overfitting, as discussed in post 1 of this series.  Nor should the fit be extremely loose, as discussed in post 2 of this series.  The fit should be somewhere in between to give a decent chance of predicting unrecorded deaths well.  The triple of pictures below illustrates the argument nicely although, again, in a different context.

The point of the present post is that many of the TRC’s predictions for its known data-points are bat-shit crazy.  In no fewer than 26 of the TRC’s 58 strata at least one such prediction is a negative number (Table 5).

In fact, Rendon shows that 40 out of the 58 SP models either fit perfectly or predict a negative number of deaths for at least one list or list combination (Table 5).  And 57 out of the 58 models fail the TRC’s own standards for overfitting or underfitting, or they make negative predictions.

There’s quite an irony here.  MVB hang their whole critique of Rendon’s alternative SP estimates on their claim that some of these estimates are, in some convoluted sense, “impossible”.  I argue here that this is wrong, and Rendon strongly bolsters this point in his second paper.  Still, that was the MVB critique.  Now we learn that many of the TRC’s estimates are, literally, impossible: no list of deaths can contain a negative number of them.  So this critique is analogous to a jet engine calling a cat noisy.

To be clear, the TRC never predicts a negative number for deaths not recorded on any list.  The negative predictions are always for  numbers that are already known.  But this in-sample character of their impossible predictions makes the problem worse, not better.  The issue was readily apparent back when the TRC made its estimates but they plowed on anyway.

Recall, furthermore, that the TRC used an unconventional indirect method for its SP estimates.  It’s worth noting, therefore, that Rendon’s conventional direct method rules out negative estimates as a built-in feature.  So the TRC’s indirect method opens the door to the negative numbers problem.
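
The structural point can be seen in a toy calculation.  A direct log-linear fit predicts every cell as exp(linear predictor), which cannot be negative, whereas an estimate assembled by combining or differencing separately estimated quantities carries no such guarantee.  The numbers below are invented, and the simple subtraction is only a stand-in for whatever combination the TRC’s indirect construction actually uses.

```python
# Invented numbers only; the subtraction stands in for the TRC's indirect
# construction and is not a description of it.
import numpy as np

# Direct route: fitted log-linear cell counts have the form exp(x @ beta) > 0.
beta = np.array([2.3, 0.8, 0.5, 0.4])       # hypothetical coefficients
x_missing = np.array([1.0, 0.0, 0.0, 0.0])  # the unobserved no-list cell
print(np.exp(x_missing @ beta))             # ~10.0, positive by construction

# Indirect route: differencing two noisy estimates can dip below zero.
total_hat, non_sp_hat = 120.0, 135.0        # hypothetical component estimates
print(total_hat - non_sp_hat)               # -15.0, an impossible death count
```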

I feel like I’m beating a dead horse but there’s still an important remaining issue concerning the data to which I’ll turn next.

The Statistical Estimates of Peru’s Truth and Reconciliation Commission are Really Bad: Part 2

Back to blogging after diverting my energy in recent weeks to putting out a few fires.

I’ll assume here that you’ve read the first post of this new series.  So you know that overfitting is a terminal problem for the Shining Path (SP) estimates published in the statistical report of Peru’s Truth and Reconciliation Commission (TRC).  According to the TRC’s own stated standard for overfitting, the models in no fewer than 47 out of the TRC’s 58 geographical strata are inadmissible.  And many of the TRC’s models are not just somewhat out of bounds; they’re egregiously overfit.

What about underfitting?

That is, does the TRC extrapolate any SP estimates from models with inadmissibly bad fits, even according to the TRC’s own stated fitting standards? Yes they do.  The impact of this violation is small, however, since the TRC extrapolates from a really badly fitting model only in stratum 32.  Yet stratum 32 is quite an interesting case.  So I devote the rest of this post to it.

Recall that Silvio Rendon and the TRC use different methods to estimate SP-caused deaths.  The details need not concern us here.  The salient point for this post is that Daniel Manrique-Vallier and Patrick Ball (MVB), two authors of the original TRC report, assert that Rendon’s method is biased toward underestimation.

MVB do head-to-head comparisons with Rendon’s SP estimates in 9 strata and MVB’s published numbers place Rendon’s estimates below the TRC’s in 8 out of these 9 strata.  These results are broadly consistent with the idea that Rendon’s method is biased downward although they are equally consistent with the idea that the TRC’s method is biased upward.

It turns out, however, that Rendon’s SP estimates are higher than the TRC’s in stratum 32.  Here are the numbers:

  • TRC – 328
  • Rendon’s preliminary estimate – 751 (before multiple imputation)
  • Rendon’s main estimate – 877 (after multiple imputation)

Unfortunately, MVB don’t say that stratum 32 goes against the grain of their argument.  To the contrary, Figure 4 in their supplementary materials wrongly claims a tie, placing the estimates of both Rendon and the TRC just below 600.

Figure 4 also asserts that any estimate below around 330 (eyeballing the graph) is impossible.  Yet their own (dubious) methodology for defining “impossibility regions” would place this boundary at 170.

I’ve argued before that all of MVB’s figures are misleading and should be corrected.  This is because they airbrush statistical uncertainty away and assume perfect accuracy, both in the underlying data and in the matching of deaths across sources.  But figure 4 reaches a new level of wrong.

My first series argued that MVB’s defense of the TRC work is weak.  The discovery of mistakes can only further diminish its persuasiveness.

Finally, please keep your eye on the ball, which is the TRC report itself.  Here are the main take-home points so far in this series.

  1. The overfitting problem – this is sufficient to dismiss the whole SP portion of the TRC report.
  2. The underfitting problem – this makes the TRC’s problems a little bit worse.

Of course, MVB should correct the mistakes in their stratum 32 figure.  Beyond that I wonder whether there are other mistakes out there waiting to be discovered.

 

The Statistical Estimates of Peru’s Truth and Reconciliation Commission are Really Bad: Part 1

Silvio Rendon’s powerful rejoinder on the statistical work of Peru’s Truth and Reconciliation Commission (TRC) is finally out.

This is a big development so I’m starting a brand new series to cover it.  Of course, this new series is related to the earlier one.   But I’ll strive to make the present series self-contained so you can start here if you want to.  

The sequence of events leading up to the present moment goes as follows.

I.  The TRC of Peru publishes a statistical report that makes two surprising claims.

  1.  Nearly 3 times as many Peruvians were killed in the war (1980–2000) as the combined efforts of human rights NGOs, Peru’s Ombudsman office and the TRC were able to document on a case-by-case basis.
  2.  The Shining Path (SP) guerrillas killed more people than the State did – reversing the pattern of the combined list of documented deaths, which formed the basis for the TRC’s statistical work.

II.  Silvio Rendon publishes a critique of the TRC’s statistical work.  He also proposes new estimates which, compared to the TRC’s estimates, increase the number of deaths attributed to the State, decrease the numbers of deaths attributed to the SP and “Other” groups and decrease the total number of deaths for the war as a whole.  His estimates are inconsistent with the TRC’s surprising conclusions.

III.  Daniel Manrique-Vallier and Patrick Ball (MVB), two authors of the TRC’s original statistical report, reply with a critique of Rendon’s estimates, indirectly defending their own SP estimates.

My earlier series covers the above three developments.

IV.  Now we have Rendon’s rejoinder which mostly attacks the original work of the TRC but also defends his own estimates from MVB’s critique.

I make one caveat before proceeding.  Rendon’s work is replicable but I have not tried to replicate it.  I’ll just assume here that Rendon’s claims are correct.  This is a reasonable assumption since no one has discovered a substantive error in Rendon’s work in the debate so far.

Rendon’s rejoinder finds grave and terminal deficiencies in the statistical report of Peru’s TRC.  There are several issues in play but today I’ll cover just the issue of overfitting, which turns out to be quite a big problem.

There is a footnote in the TRC statistical report itself that provides a good explanation for the dangers of overfitting.  The quoted text can supplement or replace the above overfitting link or, if you prefer, you can try this very short alternative explanation:

Overfitting occurs when the model fits the data too closely and therefore models only these data and none other. As the number of parameters used to fit the model approaches the number of cells in the table, all of the available information has been used for the model fitting, and none remains for the estimation. The goal is to find a model that fits reasonably well, but not so well that the same model would fail to fit different data describing the same phenomenon.

The TRC statistical report also proposes a policy to avoid overfitting, although they ignore it for their SP estimates: reject models with goodness-of-fit p values that exceed 0.5.  (Possible p values range from 0, i.e., no fit whatsoever, up to 1, i.e., a perfect fit.)
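
Here is a minimal sketch, on invented numbers, of what that policy amounts to in practice: compute a goodness-of-fit p value for a fitted model from its residual deviance and degrees of freedom, and reject the model if the p value exceeds 0.5.

```python
# Illustrative numbers only; a real check would take the deviance and degrees
# of freedom from a fitted log-linear model for one stratum.
from scipy import stats

deviance, df_resid = 0.8, 3                   # hypothetical fit statistics
p_value = stats.chi2.sf(deviance, df_resid)   # large p value = very close fit

# A saturated model has zero deviance on zero degrees of freedom: a "perfect"
# fit (treated as a p value of 1) that this rule would always reject.
if p_value > 0.5:
    print(f"p = {p_value:.2f}: overfit by the TRC's own standard")
else:
    print(f"p = {p_value:.2f}: admissible fit")
```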

Given this policy it’s shocking to discover that the TRC bases its SP estimates on perfectly fitting models in 14 out of its 58 geographical strata.

In fact, I learned through correspondence with Silvio Rendon that these 14 cases of egregious overfitting form just the tip of an overfitting iceberg.  In a further 12 strata the TRC models are so close to perfect fits that their goodness-of-fit p values round up to 1.  Moreover, p values are between 0.7 and 1 for an additional 13 strata and between 0.5 and 0.7 for a further 8 strata.  In other words, according to their own stated standards the TRC estimated SP-caused deaths off of overfit models in 47 out of their 58 strata.  Most of these models are badly overfit.

To summarize, based on overfitting issues alone we should bin most of the TRC’s SP estimates.


And overfitting is just one of the problems that plagues these SP estimates.  Stay tuned for more in part 2 of this series.

PS – You may want to check out a parallel series on the blog about the perils of matching deaths and events across lists.  It focuses on Iraq data but co-author Josh Dougherty and I discuss connections that are relevant for the Peru discussion.

The Perils and Pitfalls of Matching War Deaths Across Lists: Part 2

This is my second post with Josh Dougherty of Iraq Body Count (IBC).  We asserted in the first one that Carpenter, Fuller and Roberts (CFR) did a terrible job of matching violent events in Iraq, 2004-2009, between the IBC dataset and the SIGACTs dataset of the US military and its coalition partners. In particular, CFR found only 1 solid match in their Karbala sample whereas 2/3 of the records and 95% of the deaths actually match.  We now present case-by-case details to explain how CFR’s matching went so badly wrong.

Here is the Karbala sample of data with 50 SIGACT records together with our codings.  Columns A-S reproduce the content of the SIGACT records themselves.  The column headings are mostly self-explanatory but we add some clarifications throughout this post.  We use Column T, which numbers the records from 1 to 50, to reference the records we discuss in this and the following post.  Columns V-Y contain our own matching conclusions (SIGACTs versus IBC).  Column AB shows our interpretation of what CFR’s reported algorithmic criteria should imply for the matching.

Both our matching and CFR’s compare the SIGACTs dataset to the IBC dataset as it existed prior to the publication of the SIGACTs in 2010. IBC has integrated a lot of the SIGACTs material into its own dataset since that time.  Thus, most of the Karbala cases we characterize in the spreadsheet as “not in IBC” (column Y) are actually in IBC now (Column Z).  However, these newer entries are based, in whole or in part, on the SIGACTs themselves. Of course, it is interesting to compare pre-2010 IBC coverage to another parallel or separately compiled dataset; this is the point of CFR’s exercise and of ours here as well.

Readers can check codings for themselves and are welcome to raise coding issues in the comments section of the blog.  You can look up IBC records at the database page here: https://www.iraqbodycount.org/database/incidents/page1. To view a specific record, simply replace “page1” at the end of the url with the incident code of the record you wish to view, such as: https://www.iraqbodycount.org/database/incidents/k7338 for record k7338.  The whole SIGACTs dataset is here.

Recall from part 1 of this series that CFR’s stated matching algorithm applies the following 3 criteria:

  1. Date +/- one calendar day
  2. Event size +/- 30%
  3. Weapon type

As noted in the first post, we cannot be precise about the algorithm’s matching implications because of some ambiguities that arise when applying the reported criteria, particularly in criteria 2 and 3.  It appears, however, that a reasonable application of the above three criteria matches 11 out of the 50 SIGACTs Karbala records.  We are, therefore, unable to explain why CFR report that they could only match 1 record on all 3 of their criteria. Recall that we do not know CFR’s actual case-by-case conclusions because they refuse to show their work.

Each criterion causes some matching ambiguities, but we focus here on criterion 2 because it causes the most relevant problems for the Karbala sample.  The main problem is that CFR do not specify whether SIGACTs or IBC records should be the basis from which to calculate percent deviations.  Consider, for example, record 4 for which SIGACTs lists 7 civilian deaths.  IBC event k685 matches record 4 on criteria 1 and 3 (reasonably construed) but lists 10 civilian deaths.  If 7 is the base for percentage calculation then 10 deviates from it by 43% which is, of course, greater than 30% and a mismatch.  But if 10 is the calculation base then 7 deviates by exactly 30% and we have a match.
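
The record 4 arithmetic, spelled out:

```python
# Record 4: whether 7 (SIGACTs) and 10 (IBC record k685) satisfy the +/- 30%
# rule depends entirely on which number serves as the base.
sigacts_deaths, ibc_deaths = 7, 10

deviation_from_sigacts = abs(ibc_deaths - sigacts_deaths) / sigacts_deaths  # 3/7
deviation_from_ibc = abs(sigacts_deaths - ibc_deaths) / ibc_deaths          # 3/10

print(f"{deviation_from_sigacts:.0%}")  # 43% -> exceeds 30%, so no match
print(f"{deviation_from_ibc:.0%}")      # 30% -> exactly at the cutoff, a match
```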

Further ambiguity stems from CFR’s failure to specify whether their 30% rule is applied in just one direction or if matching within 30% in both directions is required.  Records 30 and 36, in addition to record 4 (just discussed above), either match or do not match depending on how this ambiguity is resolved. These ambiguous cases are classified as “Borderline” in Column AB in the spreadsheet.

A third problem with criterion 2 is that IBC often uses ranges rather than single numbers and CFR do not specify how to handle ranges or even acknowledge their existence. When there is a range, does the +/- 30% rule apply to IBC’s minimum, maximum, both, or to an average of these two numbers?  We don’t know.  We have to add SIGACT records 5, 15, 34 and 42 to the list of unresolved cases when we combine this range ambiguity with the base-number ambiguity.

Criterion 3 is, potentially, the most ambiguous of all because, strictly speaking, neither SIGACTs nor IBC have a “weapon type” field.  The two projects code types in different ways and with different wordings.  Nevertheless, both datasets have some event types, such as “IED Explosion” or “gunfire,” that can be viewed as weapon types.  Sensible event- or weapon-type matches can be made subjectively from careful readings of each record, but mechanical weapons matching based just on coded fields will not work.  For example, SIGACTs has a “Criminal Event – Murder” category (in Columns N-O of our spreadsheet) whereas IBC has no such category. However, many IBC event types, such as “gunfire”, “executed”, “beheaded”, “stabbed” and “tortured, executed”, among many others, can be consistent with “Criminal Event – Murder”. Thus, rule 3 seems to consist of subjective human judgments about all these varying descriptions, even though CFR claim that “the algorithm-based matching involved little to no human judgment.”  Any attempt to replicate CFR’s judgments on this rule would be pure guesswork since it is hard to imagine an algorithm to implement rule 3, and it does not seem like there was one.  Therefore, we just ignore rule 3 and proceed as if CFR made appropriate judgment calls on weapons types in all cases, even though this assumption may give too much credit.

With the above ambiguities aside, we’ll now move on to the substantive matching errors that arise when attempting to match these two datasets algorithmically. We distinguish between 8 error types that affect CFR’s application of their algorithm to the two datasets.  The rest of this post covers the first 4 error types and the next post will cover the remaining 4.

The first 4 error types concern basic errors in documentation or coding within the SIGACTs dataset itself.  We give a broad treatment of these SIGACT errors, in part to prepare the ground for future posts.  The SIGACT errors usually translate into matching errors under the CFR algorithm, but do not always do so, and we identify cases for which the algorithm should reach correct conclusions despite SIGACT errors.  Modifications of matching criteria tend to change the ways in which SIGACT errors translate into matching errors. Thus, matching procedures that we will consider later in this series sometimes fall prey to a different range of errors than those that affect CFR’s matching.  So it is useful to provide full information on the underlying SIGACT errors at the outset.

SIGACT ERRORS

  1. Location errors – Errors in recorded locations affect at least 9 records (25, 31, 33, 37, 39, 40, 45, 48 and 50) and probably 2 more (32 and 38), affecting at least 17 deaths, i.e., at least 18% of the records and 3% of the deaths in the sample.

Many SIGACT records contain incorrectly coded locations.  Usually, but not always, these errors take the form of a wrong letter or number in a code for the Military Grid Reference System (MGRS, Column B). For example, in Record 33 both the Title (Column E) and MGRS coordinates identify Karbala as the location for a “criminal murder” that killed 6 civilians. However, the Reporting Unit (Column S) for this record is “KIRKUK NJOC” which suggests that this event occurred in Kirkuk, not Karbala: two entirely different places that are far apart.  Moreover, the Summary (Column D) of the event, a full text description, describes an attack on electricity workers from the Mullah Abdullah power plant in the “AL HWAIJAH AREA” which is southwest of the city of Kirkuk in the province of Tameem (also sometimes called Kirkuk province). IBC record k5908 is of an attack “near Hawija” that matches the characteristics of the SIGACTs event, including describing the victims as workers for the same power plant.  Taken together, all these factors confirm that Record 33 is a Kirkuk event that was mis-coded as a Karbala event.

The location error appears to stem from a flaw in the MGRS coordinates, “38SMB119152”, which, if true, would place the event in Karbala.  It seems that the letter “B” should have been an “E”.  This single small change shifts the event 250 miles north to an area near Hawija, southwest of Kirkuk, where the Summary suggests it should be.  It appears, further, that the Title of “IVO KARBALA,” i.e., in the vicinity of Karbala, was likely based on the MGRS coordinate error.  The Title coder might not have read the Summary, or may not have been familiar with the locations of Karbala, Hwaijah or Kirkuk, and therefore may not have realized that these were discrepant.

The crucial point for list matching is that this basic location error renders the record un-matchable under CFR’s algorithm and, indeed, under any reasonable algorithmic method that does not consider the possibility of documentation errors.  A careful reading of the detailed record by a knowledgeable human can ferret out the error. A sophisticated text reading computer should also be able to spot the error, but only if the system is programmed to question the information coded in casualty fields.

Record 33’s location error can be detected from the fact that the reporting unit was based in Kirkuk.  But, importantly, the field for “Reporting Unit” (column S) is also omitted from the Guardian version of the dataset used by CFR, along with the detailed event descriptions (Column D). Indeed, records 31, 37, 39 and 40 also list reporting units that appear to be inconsistent with a Karbala event – a clear red flag to an attentive researcher.  But this flag would be invisible to anyone, such as the CFR team, without access to the Reporting Unit field.

Records 48 and 50 also have location errors, but these mistakes take a different, almost opposite, form. These are not events outside Karbala that were erroneously shifted into Karbala.  Rather, they are true Karbala events, as we can see from their summaries, but with MGRS coordinates that incorrectly place them outside Karbala.  This kind of MGRS error should not affect CFR’s Karbala matching because they assume that location matching is already satisfied through their sample selection method of filtering for “Karbala” in the Title field. Subsequently, CFR only try to match on date, size and type, not location.  Thus, the particular coordinates issue in these two records should not ruin CFR’s matching.  Nevertheless, such MGRS coordinate errors would undermine CFR’s main (nationwide) matching exercise because that exercise uses locations, based on MGRS coordinates, as a matching criterion.

  2. Duplication errors, which affect 3 records (35, 43, 46) and 205 deaths, i.e., 6% of records and 35% of deaths are duplicated.

The CFR paper never mentions the topic of duplication or de-duplication even though this is a focus of many list matching/merging efforts.  It seems fairly clear that CFR did not attempt to de-duplicate the SIGACTs samples they used in their paper.  Yet, the three duplicates in this Karbala sample account for no fewer than 205 out of 558 deaths.  In fact, correcting for duplicates leaves just 353 unique deaths in the sample, not 558.

Records 42 and 43 report the same event from two different sources in slightly differing versions. These match IBC record k7338.  The duplicate records report 46 and 48 deaths respectively. However, if one ignores the possibility of duplication then k7338 can match only one of the two SIGACTs records.  De-duplication failure here creates a large phantom event that cannot be matched against IBC since the actual match has already been used up.

Records 34 and 35 are also duplicates of a large event, but this time with the added twist that deaths and injuries are interchanged in record 35.  Thus, 36 deaths and 158 injuries from an IED explosion in record 34 become 158 deaths and 34 injuries in a supposedly different IED explosion on the same day in record 35.  This improbable death-to-injury ratio in an explosion should have been enough to raise a red flag for CFR, even though they cut themselves off from the Summary Column (D) that confirms the interchange.  This time the phantom event creates 158 un-matchable deaths.

Records 46 and 47 also duplicate the same matching event although they only account for a single death.

These duplication problems in the small Karbala sample should be sufficient to establish that duplication is going to be a significant problem to overcome in any attempt to match the SIGACTs with another list. It’s also difficult to see how duplicates could be reliably identified without exploring the details in the Summary field (Column D) omitted from CFR’s version of the dataset.  Failing to eliminate duplicates across the SIGACTs winds up leaving an array of fictional events mixed in with real events and will naturally lead to many spurious conclusions about coverage failures in a comparison list.
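
To give a feel for what even a crude de-duplication pass might look like, here is a sketch that flags same-day, same-type records with similar death tolls as candidate duplicates.  The record contents below are invented stand-ins, not the actual SIGACT entries, and real de-duplication would also need the free-text Summary field.

```python
# A rough sketch, not CFR's or IBC's procedure: flag possible duplicates among
# records that share a date and coded type and have similar death tolls.
from datetime import date
from itertools import combinations

records = [  # (record_id, date, event_type, deaths) -- made-up illustrations
    (42, date(2007, 4, 28), "IED EXPLOSION", 46),
    (43, date(2007, 4, 28), "IED EXPLOSION", 48),
    (11, date(2005, 1, 5), "MURDER", 2),
]

def similar(a, b, tol=0.3):
    """Death tolls within `tol` of each other, measured against the larger count."""
    return abs(a - b) <= tol * max(a, b)

for (ra, da, ta, ka), (rb, db, tb, kb) in combinations(records, 2):
    if da == db and ta == tb and similar(ka, kb):
        print(f"records {ra} and {rb} look like possible duplicates")
```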

  3. Classification errors, taking the form here of a reversal of death and injury numbers for 1 record (35) with, supposedly, 158 deaths, i.e., affecting 2% of records and 28% of deaths.

Another form of error that occurs in some SIGACTs is that casualty numbers or types are sometimes shifted into incorrect fields, i.e., one type of casualty is misclassified as another type.  Record 35, which we already mentioned above in the section on duplication, interchanges deaths and injuries and is the only such classification error in the Karbala sample.  However, similar problems, such as misclassifying victim categories (Friendly, Host Nation, Civilian, Enemy), occur in other parts of the SIGACTs data.

Record 35 also shows how some records contain multiple error types simultaneously.

  4. Doubling the number of deaths, an error that affects 2 records (11 and 12) for a total of 8 deaths, i.e., 4% of the records and 0.9% of the deaths in the sample.

All casualty fields for record 12 are exactly doubled compared to both the Summary field (Column D) and the Title field (Column E).  Thus, the correct “CIV KIA” figure of 3 is doubled to 6.  IBC record k1878 has this event with 3 deaths and the same date and type.  Thus, without the doubling error this record would match under the CFR algorithm.  With the error, record 12 violates the +/-30% rule, regardless of whether we use 3 or 6 as the base, and becomes a mismatch.

Note that CFR could potentially have caught this error by reading the Title, which they did have at their disposal.  Comparison of these two fields shows, at a minimum, that one of them is wrong and matching must proceed with caution.  The detailed event description (Summary) confirms that the figure of 3 civilian deaths in the Title is correct and, except for the error, this should have matched under CFR’s criteria.

Record 11 makes the same error, converting 1 death into 2.  This error also contradicts both the Title and Summary.
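
Even without the Summary field, a simple cross-check of the coded casualty figure against the number stated in the Title could have flagged the doubling.  The Title text and field value below are invented illustrations, not the actual record contents.

```python
# A rough illustration (hypothetical field contents) of the sanity check that
# could have caught the doubling error: compare the coded civilian-KIA figure
# with the number stated in the record's Title.
import re

title = "CRIMINAL EVENT (MURDER) IVO KARBALA: 3 CIV KIA"  # invented example text
coded_civ_kia = 6                                         # the doubled coded value

m = re.search(r"(\d+)\s*CIV KIA", title)
if m and int(m.group(1)) != coded_civ_kia:
    print(f"flag: Title says {m.group(1)} CIV KIA but coded field says {coded_civ_kia}")
```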

 

Now is a good time to take stock.

It should be abundantly clear that errors in any dataset that is part of a matching project are potentially lethal to the project.  Any casualty dataset of even moderate complexity will probably contain some errors.  Conflict casualty datasets tend to be collected and compiled under far from ideal circumstances, so an error mitigation strategy must be a central feature of matching work.  Of course, one can get lucky and wind up working with very high quality data.  CFR were not lucky.  The SIGACTs data is highly valuable, but it is also rather messy, containing many errors that matter for case matching.

CFR and others who have relied on their findings seem not to have considered the possibility of data errors in the SIGACTs.  Rather, it appears that CFR just assumed that they had two pristine datasets with the unique weakness of incompleteness.  This misguided assumption leads them to misinterpret the many data errors as revealing incomplete coverage.  These misinterpretations are not minor.  In effect, CFR padded their discoveries of true coverage problems with a host of other issues that are unrelated to coverage, substantially exaggerating the coverage issue in the process.

And yet we have only told half the story so far.  The next post will cover an additional 4 error types.

The Perils and Pitfalls of Matching War Deaths Across Lists: Part 1

I argued in an earlier post that matching deaths across lists is a nontrivial exercise that involves a lot of judgement and that, therefore, needs to be done transparently.  Here is the promised follow up post which I do jointly with Josh Dougherty of Iraq Body Count.  In fact, we’ll make this into another multi-part series as there are many different sources and issues to explore. This is a large subject of growing importance to the conflict field, so we may also eventually convert some of this material into a journal article. Throughout this series we’ll draw heavily on Josh’s extensive experience matching violent deaths across sources for the Iraq conflict.

Today we’ll set the table with some preliminaries and offer basic findings, with more detailed exploration of the data to follow in future posts.

First, list matching for Iraq has involved a combination of event matching and victim matching.  Events are usually considered to be discrete violent incidents, such as suicide bombings, air attacks or targeted assassinations, and are typically defined by their location, date, size, type and other features.

The event matching aspect of the Iraq work means that it won’t always be directly relevant to pure victim-based matching efforts such as those underpinning the statistical work of Peru’s Truth and Reconciliation Commission (TRC), or the various efforts involving casualty lists covering the war in Syria.  We’ll talk more about pure victim-based matching in a future post.  However, matching events is ultimately still about matching deaths/victims, so the issues that arise are very similar and most of what we write here will be relevant to victim-based matching.

Second, we analyse a matching exercise from this paper by Carpenter, Fuller and Roberts (CFR) that attempts to match events from the Iraq war across two sources.  This CFR paper has been cited in some major journal articles.  In fact, Megan Price and Patrick Ball (the latter being the main author of the statistical report of the Peruvian TRC) relied heavily on CFR’s matching in some of their own papers.  Yet CFR’s matching turns out to be very bad.

Third, we won’t address here the main matching exercise of 2,500 records carried out (again badly) in the CFR paper.  We cover, rather, a robustness check matching smaller samples that CFR present towards the end of their paper, and which should be more easily digestible for readers.  A proper analysis of CFR’s main matching exercise is beyond the scope of this series, but we can say here that the kinds of problems affecting the robustness check generally carry over into the main matching exercise.  Note, however, that CFR’s main matching is done by hand with human researchers, whereas the robustness check that we cover below is described as “computer-driven” and “non-subjective”.  Still, both the human and computer versions apply essentially the same algorithmic matching criteria with very similar pre-determined parameters.  The major difference is that in one case the algorithm is applied by hand, with more room for human judgment, while in the other it is apparently applied more strictly with the help of a machine.  Indeed, CFR report that the two approaches “resulted in the same conclusions,” so they suggest that their robustness check has succeeded and that we should feel more confident in their findings.

In this exercise, CFR match samples from two sources covering events that occurred in Karbala, Iraq, between 2004 and 2009.  The sources are Iraq Body Count (IBC) and the Iraq War Logs published by WikiLeaks in 2010, also known as the official SIGACTs database of the US military.  Here are the methods of IBC.

Unfortunately, we know of no formal statement of a data collection methodology for SIGACTs; however, we do know that it is compiled by the US Department of Defense from the field reports of US and Coalition soldiers, Iraqi security forces and other Iraqi sources.  We can also learn about SIGACTs by inspecting the entries.  This one, for example, describes a “search and attack” operation in which Coalition Forces killed seven “Enemy” fighters in the Diyala governorate.  The entry displays SIGACTs’ standard data-entry fields, which include the date, time, GPS coordinates, event type, reporting unit and numbers killed and wounded.  The casualty numbers are further divided into “Enemy”, “Friendly”, “Civilian” and “Host Nation” categories.  Each record begins with a short headline and also contains a longer text description of the events.  These descriptions tend to be rather jargon-filled but can be read fluently after some practice.

We will show in the next post of this series that careful reading of the detailed text descriptions is essential for matching SIGACTs-recorded deaths against other sources correctly. The CFR work runs aground already at this data inspection stage because they worked only with a summary version of the data, published by The Guardian, which omits the detailed text descriptions. Note also that the above-cited Price and Ball paper, which closely follows the CFR lead, shares CFR’s cavalier approach to the SIGACTs data, writing incorrectly of its methodology:

SIGACTSs based on daily “Significant Activity Reports” which include “…known attacks on Coalition forces, Iraqi Security Forces, the civilian population, and infrastructure. It does not include criminal activity, nor does it include attacks initiated by Coalition or Iraqi Security Forces”

This is not true of the full SIGACTs database released in 2010, and instead comes third hand from a globalsecurity.org description of some statistics on “Enemy-initiated attacks” that appeared in a 2008 US DoD report. Those data were derived from only selected portions of the SIGACTs database and their description does not apply to the full dataset. A cursory glance at the full SIGACTs dataset would have quickly revealed that it includes criminal activity and attacks initiated by Coalition or Iraqi Security Forces.

Further background on the SIGACTs (Iraq War Logs) data can be found here and here.

CFR derive their Karbala sample, plus a separate Irbil one to which we will return later, by:

filtering the entire WL data set in the event description for the appearance of the words ‘‘Irbil’’ and ‘‘Karbala.’’

You should interpret “the entire WL data set” to mean the entire Civilian category, with at least 1 death, of the Guardian version of the SIGACTs dataset, i.e., the version that omits the detailed text descriptions of each record.  In this context, the above phrase “event description” can only refer to the headline of each record, as there is nothing else in the Guardian version of the dataset that could both approximate an “event description” and contain the word “Karbala”.

The above filtering yields a sample of 50 records containing 558 deaths.  However, strangely, CFR report only 39 records in their results table.  It would seem that CFR had an additional, unreported, filtering stage that eliminated 11 records.  Or perhaps CFR simply made a mistake.  There is no way to know at present how or why this happened because CFR do not list their 39 Karbala records or their matching interpretations for each in their paper, and have ignored or refused past data requests.  Consequently, we will simply follow CFR’s reported sampling methodology, as it appears in their paper, and proceed with matching the 50 records it produces.

CFR’s reported matching algorithm applied to this sample contains three matching requirements (a rough sketch of applying the first two appears after the list):

  1. Event dates must be within one calendar day of each other.
  2. The number killed cannot be more than + or – 30% apart.
  3. Weapon types must match.
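
Here is that sketch, using invented record structures of our own; criterion 3 is left out because, as the next post discusses, it resists mechanical treatment.

```python
# A rough sketch, with invented record structures, of applying CFR's first two
# criteria to one SIGACT/IBC candidate pair.
from datetime import date

def dates_match(d1: date, d2: date) -> bool:
    """Criterion 1: dates within one calendar day of each other."""
    return abs((d1 - d2).days) <= 1

def sizes_match(k1: int, k2: int, tol: float = 0.3) -> bool:
    """Criterion 2: death tolls within +/- 30%, measured against the larger count.
    (CFR do not say which count is the base; this is one possible reading.)"""
    return abs(k1 - k2) <= tol * max(k1, k2)

sigact = {"date": date(2006, 7, 10), "killed": 7}   # invented values
ibc = {"date": date(2006, 7, 11), "killed": 10}

print(dates_match(sigact["date"], ibc["date"]) and
      sizes_match(sigact["killed"], ibc["killed"]))
```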

CFR report one main finding on Karbala alone (again, we will return to Irbil later):

the majority of events in WL [SIGACTs] are not in IBC and vice-versa.

Indeed, CFR’s results table claims that only 1 of their 39 SIGACT records matches IBC on all three of their criteria. [Note that the first version of this post said that there were 2 matches rather than the correct number, which is 1.] They report only event, not death, statistics, but there is an obvious implication that IBC missed a high percentage of the deaths in the Karbala sample.

The problem is that their results are very wrong. When we compare each of the records in detail, the majority of records and the vast majority of deaths in the Karbala sample match with IBC. Specifically, 95% of deaths (533 out of 558) and 66% of records (33 out of 50) match with the IBC database.

However, when we apply CFR’s matching algorithm to those same records, only 24% of deaths (132 of 558) and 22% of records (11 of 50) match on all three criteria. We should note here that applying CFR’s algorithm is not as simple or straightforward as it might seem. Their three requirements all raise some ambiguities that need to be resolved by subjective judgement in practice, and the outcomes of these choices can move the final numbers around a bit.  We will discuss these issues in our next post, but any resolution of these ambiguities will still leave an enormous distance between CFR results and the truth.

It should be stressed that the CFR approach apparently seemed reasonable and reliable to the authors, journal referees and editors, and to other researchers, like Price and Ball, who build on CFR’s work. Yet their approach ultimately gets the data all wrong, and for reasons that become pretty clear when one examines the data in detail. Indeed, we find that CFR’s conclusions reflect defects in their methodology far more than they reflect holes in IBC’s coverage of conflict deaths in Karbala.

With this in mind, let’s circle back to the Peru debate which inspired the present series on matching. In the Peru discussion Daniel Manrique-Vallier and Patrick Ball (MVB) argue that some of Silvio Rendon’s point estimates for numbers of people killed in the Peruvian war are “impossible” because these point estimates are below numbers obtained by merging and deduplicating deaths appearing on multiple lists. But the results we report here should shock anyone who previously thought that counts emerging from such list mergers can simply be taken at face value and treated uncritically as absolute minima. MVB’s matching is unlikely to be anywhere near as bad as CFR’s, but we still need to see the matching details before we can begin to talk seriously about minima.

Our next post will share the Karbala sample along with our case-by-case matching interpretations and dig into the details of how and why the CFR approach got things so wrong.

Important New Violent Death Estimates for the War in Peru with Implications Beyond just Peru: Part 6

This is the latest installment in a series that considers the statistical report done for the Peruvian Truth and Reconciliation Commission (TRC), Silvio Rendon’s critique of this statistical report and a reply to Rendon from Daniel Manrique Vallier and Patrick Ball (MVB) who worked on the TRC statistical report.  The present post continues to discuss the MVB reply.

(Note that I may not resume this series until Silvio Rendon’s rejoinder is published.  Meanwhile, I’m also working with Josh Dougherty of Iraq Body Count on an offshoot post that will cover the practice and pitfalls of matching deaths across multiple lists.)

Today I’ll comment on nine figures from the MVB reply: figure 1 in the main body of the paper and figures 2-9 in the appendix.

I won’t reproduce any of the figures here because they are misleading and a picture is worth a thousand words.  The main features I object to are that the figures substitute lower (preliminary) stratum-level estimates for Rendon’s main estimates and suppress the uncertainty surrounding these estimates.  Moreover, MVB portray some of these lowered point estimates as falling within an “impossibility region,” a characterization which further assumes that MVB’s matching of deaths across sources was perfectly executed on fully accurate data.

Nevertheless, the figures do convey some interesting simulation-based information that addresses the question of when a direct estimation approach outperforms MVB’s indirect one and vice versa.  Each of the nine figures uses data from a stratum for which one can directly estimate Shining Path (SP) deaths.  (There are nine such strata before multiple imputation and two more, not covered by the figures, after multiple imputation.)

The X axis in each picture represents all the possible true values for the number of SP-caused deaths (with the true values indexed by N).  MVB perform simulations that estimate the number of SP-caused deaths many times for each stratum and for each N using both direct capture-recapture and MVB’s indirect capture-recapture methodology.  MVB then calculate the deviation of each estimate from the underlying true value, square these deviations (so that negative deviations do not cancel out positive ones) and take the mean of these squared deviations across all simulation runs for each value of N.  Finally, they graph these “mean-squared errors” for each method and each N in all nine strata.
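
In symbols, for a given method, a given stratum and an assumed true value N, with R simulation runs producing estimates \(\hat{N}_1, \dots, \hat{N}_R\):

\[
\mathrm{MSE}(N) \;=\; \frac{1}{R}\sum_{r=1}^{R}\left(\hat{N}_r - N\right)^2
\]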

For eight out of the nine strata the direct method outperforms the indirect method (i.e., has lower mean-squared errors) for values of N below some critical value, and the indirect method outperforms the direct one above this same critical value.  (For one stratum the reverse is true, but there is never a big difference between the two methods in this stratum so this doesn’t seem to matter much.)  For three strata the critical value at which the best performing method switches from direct to indirect lies inside MVB’s “impossibility region”.

In eight out of the nine strata the indirect method outperforms the direct method when the true number of people killed by the SP is set equal to the estimate that the TRC actually made for that stratum (using the indirect method).  Essentially, this rather unsurprising result says that the indirect method performs well in simulations of cases for which the TRC’s indirect estimate  delivered a correct result.  And the indirect method also performs well when the TRC’s estimate is not spot on but still reasonably close to being correct.

The direct method tends to outperform the indirect one in simulations that start from the assumption that the direct estimate is correct.  Nevertheless,  in three out of the nine strata the indirect method actually wins this contest.

Overall, these simulation results tend to favor the indirect method over the direct one, especially when the true numbers are assumed to be rather high.

That said,  the direct method in the simulations does not match Rendon’s main method because, again, MVB omit the multiple imputation step of Rendon’s procedures.  Incorporating multiple imputation should shift the balance back towards Rendon.  And, again, I would like to see a similar exercise performed on Rendon’s alternative approach that covers the whole country with ten strata.

Here’s one last point before I sign off.  As of now, the MVB reply is still just a working paper, not yet published in Research and Politics.    The main advantage of posting a working paper before publication is that you can respond to feedback.  Thus, it would be great and appropriate for MVB to take advantage of the remaining time window by purging the misleading material about impossible point estimates without uncertainty intervals from the published version of their paper.  (See post 4 and post 5 of this series in addition to the present one for further details.)  This move would help lead us toward more fruitful future discussions.

Important New Violent Death Estimates for the War in Peru with Implications Beyond just Peru: Part 5

I’ll start this post by reacting to some interesting comments to part 4 of this series which was, you may be surprised to learn,  preceded by part 1, part 2 and part 3.  I’ll assume that readers have some familiarity with these posts but I’ll also try to go slowly and remind readers of things we’ve discussed before.

Recall that there is a statistical report done for the Peruvian Truth and Reconciliation Commission (TRC), Silvio Rendon’s critique of this statistical report and a reply to Rendon from Daniel Manrique Vallier and Patrick Ball (MVB) who worked on the TRC statistical report.

Let’s focus first on the data.

The TRC statistical report and Rendon’s critique are both based on what I’ll call “the original data,” which consists of 3 lists (after some consolidation) containing a total of 25,000 unique (it is claimed) deaths, many appearing on multiple lists.

There are several issues concerning the original data:

First, only summary information from the original data is in the public domain.  The following table from the TRC statistical report shows the form of the publicly available original data:

We can see, for example, that there are 627 people recorded nationwide as killed by the State (“EST”) who appear on the list of the CVR (the TRC itself) and the DP (Defensoria del Pueblo) but not on the ODH (NGO’s) list.

There are 59 such tables, one for each geographical stratum.  Each one looks  like the above table but, of course, with smaller numbers.  Both the TRC and Rendon base their statistical work on these tables.

The problem is that these 59 tables alone do not allow us to examine the underlying matching of deaths across lists that they summarize.  Matching is a non-trivial step in the research that involves a lot of judgment.  I will examine the matching issue in an upcoming post.  Suffice to say here that until this step is opened up we are not doing open science.

To be fair, it appears possible for at least some researchers to obtain the detailed data from the Peruvian government and perform their own matching.  According to Patrick Ball:

“People with access to the detailed TRC data” is not an intrinsic category of people: it’s just people who have asked nicely and persisted (sometimes, like us, over several years), until they got access to the data. It seems to me that with sensitive data, obtaining the relevant information is incumbent upon the researcher: Rendon could have inquired of the Peruvian Ombudsman office to get the TRC data. It’s not secret, it just requires a bit of work to obtain, and he chose not to do so.

I don’t like this.  The data should simply be available for use.  Patrick may be right that, effectively, it’s open to all nice and persistent people.  But the data should be available to mean and non-persistent people as well.

Let’s move on to a few observations on the MIMDES data, the detailed version of which is in the public domain.  (Apparently it’s not online right now but has been in the past, and hard copies can be obtained.)

First, the public availability of the MIMDES data undermines excuses for forcing researchers to jump through hoops for the TRC data.  They are both detailed lists of people killed in the war.  These lists are both held by the Peruvian government.  Why is it OK to circulate one list while requiring researchers to be nice and persistent for the other?

Second, I know nothing about the data collection methodology for the MIMDES data.  OK, perhaps I should obtain and study the MIMDES reports.  But the MVB reply paper introduces the MIMDES data into this whole discussion so they should describe the MIMDES data collection methodology in their paper.  (They also should have described the data collection methodologies for the lists used in the TRC’s statistical report.)  But the MIMDES methodology seems particularly important since Patrick Ball, in his comments on this blog, urges us to treat the MIMDES sample as more or less representative of all deaths in the war.  I would need to know something about how MIMDES performed its work before entertaining such a notion.

MVB have matched the MIMDES deaths against the TRC’s deaths and the resulting figures are central to their reply to Rendon’s critique.  For three reasons, however, I recommend that we take these merged TRC-MIMDES figures with a grain of salt, at least for now.  First, MVB don’t explain how they do the matching.  Second, they say their work is unfinished.  Third, it is difficult at present for anyone to match independently since the TRC data are not really open.  (Remember that you need the detailed TRC data in order to match it against the MIMDES data.)

That said, for the rest of the post I’ll take the numbers from MVB’s TRC-MIMDES merge as given just as I’ve done in my earlier posts.

Patrick and Daniel especially emphasize one point in their separate comments on post number 4 (in which they focus exclusively on Rendon’s Shining Path (SP) estimates).  Recall that Rendon’s main estimate starts with estimates from just the geographical strata that allow for direct estimation (after multiple imputation) and then uses spatial extrapolation (kriging) to extend these estimates to the whole of Peru.  But, MVB argue, the estimates in the selected strata are biased downwards because the very fact that the data are rich enough for direct estimation already suggests that relatively few undocumented deaths remain to be discovered in these strata.  Conversely, MVB suggest, the strata where data is too sparse for direct estimation probably contain relatively many undocumented deaths.

This is a creative idea with some potential but I think that, if it exists, its effect is probably small. One of Rendon’s alternative estimates cuts Peru up into just 10 regional strata which cover the entire country rather than the 59 more localized strata in MVB’s stratification scheme.  This 10-stratum estimate is not subject to MVB’s selection bias argument.  The SP estimate in this case is around 1,000 deaths more than Rendon’s main estimate (which  requires strata selection and spatial extrapolation).  So perhaps MVB have identified a real bias although, if so, it seems to be a small one.  There are, of course, multiple changes when we move from Rendon’s main estimate to the 10-stratum one.  But MVB need their suggested bias to be huge in order to produce the 10,000 plus additional deaths required to make their TRC estimate look accurate.  The above comparison doesn’t suggest a bias effect of this order of magnitude.
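
For readers unfamiliar with the spatial extrapolation step mentioned above, here is a toy sketch using inverse-distance weighting, a much simpler cousin of Rendon’s kriging.  The coordinates and stratum estimates are made up, and the real procedure also propagates uncertainty, which this sketch ignores.

```python
# A toy stand-in for spatial extrapolation: inverse-distance weighting rather
# than kriging, on invented stratum centroids and invented direct estimates.
import numpy as np

known_xy = np.array([[0.0, 0.0], [1.0, 0.5], [0.2, 1.3], [1.5, 1.8]])  # centroids
known_est = np.array([120.0, 80.0, 200.0, 60.0])                       # direct estimates

def idw(target_xy, power=2.0):
    """Inverse-distance-weighted estimate for a stratum with no direct estimate."""
    d = np.linalg.norm(known_xy - target_xy, axis=1)
    w = 1.0 / np.maximum(d, 1e-9) ** power
    return float(np.sum(w * known_est) / np.sum(w))

print(idw(np.array([0.8, 1.0])))  # extrapolated estimate for an uncovered stratum
```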

The frailty of MVB’s stratum-bias critique is exposed by the games they play to portray Rendon’s direct stratum estimates for the SP as systematically lower than the SP numbers in their merged TRC-MIMDES dataset.

They begin by deleting the multiple imputation step of Rendon’s procedures.  Daniel Manrique-Vallier explains:

Thus, showing that the application of capture-recapture to those strata leads to contradictions, automatically invalidates all the rest of the analysis. This is because those more complex estimates depend on (and amplify) whatever happens with the original 9; you can’t extrapolate from strata that are themselves inadequate. That’s what we have shown. Specifically, we have shown that the application of capture-recapture to those 9 strata (Rendon’s necessary condition) results in estimates smaller than observed counts (contradiction). This means that the basic premise, that you can use those strata as the basis for a full blown estimation, is faulty (modus tollens). Anything that depends on this, i.e. all the rest of the conclusions, is thus similarly faulty (contradiction principle).

This makes no sense.  Applying multiple imputation before capture-recapture increases the estimate in every stratum.  These higher estimates then feed through the spatial extrapolation to increase the national estimates.  Deleting the multiple imputation step decreases the estimates at both the stratum and national levels.  Manrique-Vallier argues, in effect, that doing something to increase Rendon’s estimates can only compound the problem of his estimates being too low.  This is like saying that drinking a lot of alcohol makes it dangerous to drive so (modus tollens) sobering up can only make it more dangerous to drive.

Next MVB try to dismiss all of Rendon’s work based on their claim that some of Rendon’s point estimates (which they have lowered) are below their merged TRC-MIMDES numbers.  Simultaneously, MVB apply a far more lenient standard to evaluate Patrick Ball’s Kosovo work (see MVB’s comments on post 4).  For Kosovo, they argue, it’s not a problem for most estimates to be below documented numbers as long as the tops of Ball’s uncertainty intervals exceed these numbers.  Moreover, it’s even OK for this criterion to fail sometimes as long as Ball’s broad patterns are correct.  I actually agree with these standards but consistency requires applying them to Rendon’s Peru work as well.

That said, Patrick Ball’s last comment makes a good point about the serious challenges to data collection in Peru.  By contrast, it’s easier to collect war-death data in Kosovo, and lots of resources were devoted to doing just that.  So the true numbers in Peru might be substantially larger than MVB’s TRC-MIMDES ones.  I agree that this is possible but I would want to know a lot more about the various data collection projects in Peru before taking a strong stand on this point.

Finally, I return to Rendon’s ten-stratum estimate, which appears immune to all the criticism contained in MVB’s reply.  The central estimate is about 2,000 deaths above MVB’s TRC-MIMDES national count for SP deaths, leaving considerable room to accommodate the discovery of more deaths, especially in light of the uncertainty interval.  That said, it would be interesting to see stratum-by-stratum comparisons with TRC-MIMDES to see whether there are any substantial discrepancies.

To summarize, perhaps Rendon’s SP estimates are somewhat low.  But MVB’s reply does little to undermine Rendon’s critique beyond this minor observation.