The Statistical Estimates of Peru’s Truth and Reconciliation Commission are Really Bad: Part 1

Silvio Rendon’s powerful rejoinder on the statistical work of Peru’s Truth and Reconciliation Commission (TRC) is finally out.

This is a big development so I’m starting a brand new series to cover it.  Of course, this new series is related to the earlier one.   But I’ll strive to make the present series self-contained so you can start here if you want to.  

The sequence of events leading up to the present moment goes as follows.

I.  The TRC of Peru publishes a statistical report that makes two surprising claims.

  1.  Nearly 3 times as many Peruvians were killed in the war of 1980–2000 as the combined efforts of human rights NGOs, Peru’s Ombudsman office and the TRC were able to document on a case-by-case basis.
  2.  The Shining Path (SP) guerrillas killed more people than the State did – reversing the pattern of the combined list of documented deaths, which formed the basis for the TRC’s statistical work.

II.  Silvio Rendon publishes a critique of the TRC’s statistical work.  He also proposes new estimates which, compared to the TRC’s estimates, increase the number of deaths attributed to the State, decrease the number attributed to the SP and to “Other” groups, and decrease the total number of deaths for the war as a whole.  His estimates are inconsistent with the TRC’s surprising conclusions.

III.  Daniel Manrique-Vallier and Patrick Ball (MVB), two authors of the TRC’s original statistical report, reply with a critique of Rendon’s estimates, indirectly defending their own SP estimates.

My earlier series covers the above three developments.

IV.  Now we have Rendon’s rejoinder which mostly attacks the original work of the TRC but also defends his own estimates from MVB’s critique.

I make one caveat before proceeding.  Rendon’s work is replicable but I have not tried to replicate it.  I’ll just assume here that Rendon’s claims are correct.  This is a reasonable assumption since no one has discovered a substantive error in Rendon’s work in the debate so far.

Rendon’s rejoinder finds grave and terminal deficiencies in the statistical report of Peru’s TRC.  There are several issues in play but today I’ll cover just the issue of overfitting, which turns out to be quite a big problem.

There is a footnote in the TRC statistical report itself that provides a good explanation of the dangers of overfitting.  The quoted text below can supplement or replace the above overfitting link or, if you prefer, you can try this very short alternative explanation:

Overfitting occurs when the model fits the data too closely and therefore models only these data and none other. As the number of parameters used to fit the model approaches the number of cells in the table, all of the available information has been used for the model fitting, and none remains for the estimation. The goal is to find a model that fits reasonably well, but not so well that the same model would fail to fit different data describing the same phenomenon.

The TRC statistical report also proposes a policy to avoid overfitting, although they ignore it for their SP estimates: reject models with goodness-of-fit p values that exceed 0.5.  (Possible p values range from 0, i.e., no fit whatsoever, up to 1, i.e., a perfect fit.)
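To see what the policy means in practice, here is a minimal sketch, with invented counts and assuming Python’s statsmodels and scipy packages (this is not the TRC’s actual data, model specification or code), of how a goodness-of-fit p value is computed for a log-linear model fit to a three-list capture table, and of why a model that spends as many parameters as there are cells “fits” perfectly:

```python
# A minimal sketch (invented counts; not the TRC's data or code) of
# goodness-of-fit p values for log-linear models on a three-list table.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

# Counts for the 7 observable capture patterns across lists A, B, C,
# in the order (100), (010), (001), (110), (101), (011), (111).
y = np.array([60, 45, 30, 20, 15, 10, 5])
A = np.array([1, 0, 0, 1, 1, 0, 1])
B = np.array([0, 1, 0, 1, 0, 1, 1])
C = np.array([0, 0, 1, 0, 1, 1, 1])

def gof(X):
    """Fit a Poisson log-linear model; return (p value, residual df)."""
    res = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    p = 1.0 if res.df_resid == 0 else chi2.sf(res.deviance, res.df_resid)
    return p, res.df_resid

# Independence model: intercept + 3 main effects = 4 parameters, 3 df left over.
print(gof(np.column_stack([np.ones(7), A, B, C])))

# All two-way interactions added: 7 parameters for 7 cells, 0 df left over.
# The model reproduces the table exactly, i.e., the "perfect fit" the TRC
# footnote warns about, with nothing left for estimation.
print(gof(np.column_stack([np.ones(7), A, B, C, A*B, A*C, B*C])))
```

Under the stated policy, any model whose p value comes out above 0.5 in a check like this should be rejected as overfit.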

Given this policy it’s shocking to discover that the TRC bases its SP estimates on perfectly fitting models in 14 out of its 58 geographical strata.

In fact, I learned through correspondence with Silvio Rendon that these 14 cases of egregious overfitting are just the tip of an overfitting iceberg.  In a further 12 strata the TRC models are so close to perfect fits that their goodness-of-fit p values round up to 1.  Moreover, p values are between 0.7 and 1 for an additional 13 strata and between 0.5 and 0.7 for a further 8 strata.  In other words, according to its own stated standards the TRC estimated SP-caused deaths from overfit models in 47 out of its 58 strata.  Most of these models are badly overfit.

To summarize, based on overfitting issues alone we should bin most of the TRC’s SP estimates.


And overfitting is just one of the problems that plagues these SP estimates.  Stay tuned for more in part 2 of this series.

PS – You may want to check out a parallel series on the blog about the perils of matching deaths and events across lists.  It focuses on Iraq data but co-author Josh Dougherty and I discuss connections that are relevant for the Peru discussion.

The Perils and Pitfalls of Matching War Deaths Across Lists: Part 2

This is my second post with Josh Dougherty of Iraq Body Count (IBC).  We asserted in the first one that Carpenter, Fuller and Roberts (CFR) did a terrible job of matching violent events in Iraq, 2004-2009, between the IBC dataset and the SIGACTs dataset of the US military and its coalition partners. In particular, CFR found only 1 solid match in their Karbala sample whereas 2/3 of the records and 95% of the deaths actually match.  We now present case-by-case details to explain how CFR’s matching went so badly wrong.

Here is the Karbala sample of data with 50 SIGACT records together with our codings.  Columns A-S reproduce the content of the SIGACT records themselves.  The column headings are mostly self-explanatory but we add some clarifications throughout this post.  We use Column T, which numbers the records from 1 to 50, to reference the records we discuss in this and the following post.  Columns V-Y contain our own matching conclusions (SIGACTs versus IBC).  Column AB shows our interpretation of what CFR’s reported algorithmic criteria should imply for the matching.

Both our matching and CFR’s compare the SIGACTs dataset to the IBC dataset as it existed prior to the publication of the SIGACTs in 2010. IBC has integrated a lot of the SIGACTs material into its own dataset since that time.  Thus, most of the Karbala cases we characterize in the spreadsheet as “not in IBC” (column Y) are actually in IBC now (Column Z).  However, these newer entries are based, in whole or in part, on the SIGACTs themselves. Of course, it is interesting to compare pre-2010 IBC coverage to another parallel or separately compiled dataset; this is the point of CFR’s exercise and of ours here as well.

Readers can check codings for themselves and are welcome to raise coding issues in the comments section of the blog.  You can look up IBC records at the database page here: https://www.iraqbodycount.org/database/incidents/page1. To view a specific record, simply replace “page1” at the end of the url with the incident code of the record you wish to view, such as: https://www.iraqbodycount.org/database/incidents/k7338 for record k7338.  The whole SIGACTs dataset is here.

Recall from part 1 of this series that CFR’s stated matching algorithm applies the following 3 criteria:

  1. Date +/- one calendar day
  2. Event size +/- 30%
  3. Weapon type

As noted in the first post, we cannot be precise about the algorithm’s matching implications because of some ambiguities that arise when applying the reported criteria, particularly in criteria 2 and 3.  It appears, however, that a reasonable application of the above three criteria matches 11 out of the 50 SIGACTs Karbala records.  We are, therefore, unable to explain why CFR report that they could only match 1 record on all 3 of their criteria. Recall that we do not know CFR’s actual case-by-case conclusions because they refuse to show their work.
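For reference, a mechanical reading of criteria 1 and 2 can be written down in a few lines.  The sketch below is our own rough interpretation, not CFR’s code; the field names and the base parameter for the 30% rule are our assumptions, and the following paragraphs show why those assumptions are anything but innocent:

```python
# A rough sketch (our interpretation, not CFR's code) of criteria 1 and 2.
# Field names, and the choice of "base" for the 30% rule, are assumptions.
from datetime import date

def date_match(d_sigact: date, d_ibc: date) -> bool:
    """Criterion 1: dates within +/- one calendar day."""
    return abs((d_sigact - d_ibc).days) <= 1

def size_match(n_sigact: int, n_ibc: int, base: str = "sigact") -> bool:
    """Criterion 2: event sizes within +/- 30% of the chosen base record."""
    base_n = n_sigact if base == "sigact" else n_ibc
    return abs(n_sigact - n_ibc) <= 0.30 * base_n

def candidate_match(sigact: dict, ibc: dict, base: str = "sigact") -> bool:
    """Criteria 1 and 2 combined; criterion 3 (weapon type) is left to
    human judgment, for reasons discussed further below."""
    return (date_match(sigact["date"], ibc["date"])
            and size_match(sigact["civ_kia"], ibc["deaths"], base))
```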

Each criterion causes some matching ambiguities, but we focus here on criterion 2 because it causes the most relevant problems for the Karbala sample.  The main problem is that CFR do not specify whether SIGACTs or IBC records should be the basis from which to calculate percent deviations.  Consider, for example, record 4 for which SIGACTs lists 7 civilian deaths.  IBC event k685 matches record 4 on criteria 1 and 3 (reasonably construed) but lists 10 civilian deaths.  If 7 is the base for percentage calculation then 10 deviates from it by 43% which is, of course, greater than 30% and a mismatch.  But if 10 is the calculation base then 7 deviates by exactly 30% and we have a match.

Further ambiguity stems from CFR’s failure to specify whether their 30% rule is applied in just one direction or whether matching within 30% in both directions is required.  Records 30 and 36, in addition to record 4 (just discussed above), either match or do not match depending on how this ambiguity is resolved. These ambiguous cases are classified as “Borderline” in Column AB of the spreadsheet.

A third problem with criterion 2 is that IBC often uses ranges rather than single numbers and CFR do not specify how to handle ranges or even acknowledge their existence. When there is a range, does the +/- 30% rule apply to IBC’s minimum, maximum, both, or to an average of these two numbers?  We don’t know.  We have to add SIGACT records 5, 15, 34 and 42 to the list of unresolved cases when we combine this range ambiguity with the base-number ambiguity.
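To see how much these unresolved details can matter, here is a small illustration using record 4 (7 SIGACT deaths versus 10 in IBC event k685) together with a purely hypothetical range case; the range figures are invented for illustration and do not come from any particular record:

```python
# Illustration of the base and range ambiguities in criterion 2.
# Only record 4's 7-versus-10 figures are real; the range case is invented.
def within_30pct(a: float, b: float, base: float) -> bool:
    return abs(a - b) <= 0.30 * base

# Record 4: match or mismatch depends entirely on the choice of base.
print(within_30pct(7, 10, base=7))    # False: 3/7  = 43%
print(within_30pct(7, 10, base=10))   # True:  3/10 = 30%

# Hypothetical range case: SIGACTs reports 12 deaths, IBC reports 8 to 16.
sigact_n, ibc_min, ibc_max = 12, 8, 16
for label, ibc_n in [("min", ibc_min), ("max", ibc_max), ("mean", (ibc_min + ibc_max) / 2)]:
    print(label,
          within_30pct(sigact_n, ibc_n, base=sigact_n),  # SIGACTs as base
          within_30pct(sigact_n, ibc_n, base=ibc_n))     # IBC as base
```

Depending on which reading one adopts, the hypothetical record comes out as a match, a mismatch, or a borderline case.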

Criterion 3 is, potentially, the most ambiguous of all because, strictly speaking, neither SIGACTs nor IBC has a “weapon type” field.  The two projects code types in different ways and with different wordings.  Nevertheless, both datasets have some event types, such as “IED Explosion” or “gunfire,” that can be viewed as weapon types.  Sensible event- or weapon-type matches can be made subjectively from careful readings of each record, but mechanical weapons matching based just on coded fields will not work.  For example, SIGACTs has a “Criminal Event – Murder” category (in Columns N-O of our spreadsheet) whereas IBC has no such category. However, many IBC event types, such as “gunfire”, “executed”, “beheaded”, “stabbed” and “tortured, executed”, among many others, can be consistent with “Criminal Event – Murder”. Thus, rule 3 seems to consist of subjective human judgments about all these varying descriptions, even though CFR claim that “the algorithm-based matching involved little to no human judgment.”  Any attempt to replicate CFR’s judgments on this rule would be pure guesswork since it is hard to imagine an algorithm that could implement rule 3, and it does not seem like there was one.  Therefore, we simply ignore rule 3 and proceed as if CFR made appropriate judgment calls on weapon types in all cases, even though this assumption may give them too much credit.

Setting the above ambiguities aside, we’ll now move on to the substantive matching errors that arise when attempting to match these two datasets algorithmically. We distinguish 8 error types that affect CFR’s application of their algorithm to the two datasets.  The rest of this post covers the first 4 error types and the next post will cover the remaining 4.

The first 4 error types concern basic errors in documentation or coding within the SIGACTs dataset itself.  We give a broad treatment of these SIGACT errors, in part to prepare the ground for future posts.  The SIGACT errors usually translate into matching errors under the CFR algorithm, but do not always do so, and we identify cases for which the algorithm should reach correct conclusions despite SIGACT errors.  Modifications of matching criteria tend to change the ways in which SIGACT errors translate into matching errors. Thus, matching procedures that we will consider later in this series sometimes fall prey to a different range of errors than those that affect CFR’s matching.  So it is useful to provide full information on the underlying SIGACT errors at the outset.

SIGACT ERRORS

  1. Location errors – Errors in recorded locations affect at least 9 records (25, 31, 33, 37, 39, 40, 45, 48 and 50) and probably 2 more (32 and 38), affecting at least 17 deaths, i.e., at least 18% of the records and 3% of the deaths in the sample.

Many SIGACT records contain incorrectly coded locations.  Usually, but not always, these errors take the form of a wrong letter or number in a code for the Military Grid Reference System (MGRS, Column B). For example, in Record 33 both the Title (Column E) and MGRS coordinates identify Karbala as the location for a “criminal murder” that killed 6 civilians. However, the Reporting Unit (Column S) for this record is “KIRKUK NJOC” which suggests that this event occurred in Kirkuk, not Karbala: two entirely different places that are far apart.  Moreover, the Summary (Column D) of the event, a full text description, describes an attack on electricity workers from the Mullah Abdullah power plant in the “AL HWAIJAH AREA” which is southwest of the city of Kirkuk in the province of Tameem (also sometimes called Kirkuk province). IBC record k5908 is of an attack “near Hawija” that matches the characteristics of the SIGACTs event, including describing the victims as workers for the same power plant.  Taken together, all these factors confirm that Record 33 is a Kirkuk event that was mis-coded as a Karbala event.

The location error appears to stem from a flaw in the MGRS coordinates, “38SMB119152”, which, if true, would place the event in Karbala.  It seems that the letter “B” should have been an “E”.  This single small change shifts the event 250 miles north to an area near Hawija, southwest of Kirkuk, where the Summary suggests it should be.  It appears, further, that the Title of “IVO KARBALA,” i.e., in the vicinity of Karbala, was likely based on the MGRS coordinate error.  The Title coder might not have read the Summary or may not have been familiar with the locations of Karbala, Hwaijah or Kirkuk and, therefore, not realized that these were discrepant.

The crucial point for list matching is that this basic location error renders the record un-matchable under CFR’s algorithm and, indeed, under any reasonable algorithmic method that does not consider the possibility of documentation errors.  A careful reading of the detailed record by a knowledgeable human can ferret out the error. A sophisticated text-reading system should also be able to spot it, but only if the system is programmed to question the information coded in the record’s structured fields.

Record 33’s location error can be detected from the fact that the reporting unit was based in Kirkuk.  But, importantly, the field for “Reporting Unit” (Column S) is also omitted from the Guardian version of the dataset used by CFR, along with the detailed event descriptions (Column D). Indeed, records 31, 37, 39 and 40 also list reporting units that appear to be inconsistent with a Karbala event – a clear red flag to an attentive researcher.  But this flag would be invisible to anyone, such as the CFR team, without access to the Reporting Unit field.
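For a researcher with access to the full dataset, a check of this kind is easy to mechanize.  The sketch below is a toy heuristic of our own (the city list and the field strings are illustrative, not the actual record text) that flags records whose Reporting Unit names a different city than the Title:

```python
# A toy heuristic (ours, not CFR's or IBC's) for the red flag described above:
# flag records whose Reporting Unit names a city other than the one in the Title.
KNOWN_CITIES = {"KARBALA", "KIRKUK", "BAGHDAD", "BASRAH"}   # illustrative list only

def location_red_flag(title: str, reporting_unit: str) -> bool:
    title_cities = {c for c in KNOWN_CITIES if c in title.upper()}
    unit_cities = {c for c in KNOWN_CITIES if c in reporting_unit.upper()}
    # Flag only when both fields name known cities and the sets do not overlap.
    return bool(title_cities) and bool(unit_cities) and not (title_cities & unit_cities)

# Record 33, schematically: a Title placing the event "IVO KARBALA" but a
# Reporting Unit of "KIRKUK NJOC".
print(location_red_flag("CRIMINAL EVENT (MURDER) IVO KARBALA", "KIRKUK NJOC"))  # True
```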

Records 48 and 50 also have location errors, but these mistakes take a different, almost opposite, form. These are not events outside Karbala that were erroneously shifted into Karbala.  Rather, they are true Karbala events, as we can see from their summaries, but with MGRS coordinates that incorrectly place them outside Karbala.  This kind of MGRS error should not affect CFR’s Karbala matching because they assume that location matching is already satisfied through their sample selection method of filtering for “Karbala” in the Title field. Subsequently, CFR only try to match on date, size and type, not location.  Thus, the particular coordinates issue in these two records should not ruin CFR’s matching.  Nevertheless, such MGRS coordinate errors would undermine CFR’s main (nationwide) matching exercise because that exercise uses locations, based on MGRS coordinates, as a matching criterion.

  2. Duplication errors – Duplicates affect 3 records (35, 43, 46) and 205 deaths, i.e., 6% of the records and 35% of the deaths in the sample.

The CFR paper never mentions the topic of duplication or de-duplication even though this is a focus of many list matching/merging efforts.  It seems fairly clear that CFR did not attempt to de-duplicate the SIGACTs samples they used in their paper.  Yet, the three duplicates in this Karbala sample account for no fewer than 205 out of 558 deaths.  In fact, correcting for duplicates leaves just 353 unique deaths in the sample, not 558.

Records 42 and 43 report the same event from two different sources in slightly differing versions. These match IBC record k7338.  The duplicate records report 46 and 48 deaths respectively. However, if one ignores the possibility of duplication then k7338 can match only one of the two SIGACTs records.  De-duplication failure here creates a large phantom event that cannot be matched against IBC since the actual match has already been used up.

Records 34 and 35 are also duplicates of a large event, but this time with the added twist that deaths and injuries are interchanged in record 35.  Thus, the 36 deaths and 158 injuries from an IED explosion in record 34 become 158 deaths and 34 injuries in a supposedly different IED explosion on the same day in record 35.  This improbable death-to-injury ratio for an explosion should have been enough to raise a red flag for CFR, even though they cut themselves off from the Summary Column (D) that confirms the interchange.  This time the phantom event creates 158 un-matchable deaths.

Records 46 and 47 also duplicate the same matching event although they only account for a single death.

These duplication problems in the small Karbala sample should be sufficient to establish that duplication is going to be a significant problem to overcome in any attempt to match the SIGACTs with another list. It’s also difficult to see how duplicates could be reliably identified without exploring the details in the Summary field (Column D) omitted from CFR’s version of the dataset.  Failing to eliminate duplicates across the SIGACTs winds up leaving an array of fictional events mixed in with real events and will naturally lead to many spurious conclusions about coverage failures in a comparison list.
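As a rough indication of what such screening could look like, the sketch below flags pairs of same-day records whose casualty figures are either near-identical or mirror each other, as in records 34 and 35.  It is a heuristic of our own devising, with hypothetical field names; serious de-duplication would still need the Summary text:

```python
# A heuristic sketch (ours, with hypothetical field names) for flagging
# possible duplicate SIGACT records reported on the same day.
def near(a: int, b: int, tol: int = 2) -> bool:
    return abs(a - b) <= tol

def possible_duplicate(r1: dict, r2: dict) -> bool:
    """Flag same-day pairs whose casualty counts are near-identical or
    mirror each other (deaths and injuries swapped)."""
    if r1["date"] != r2["date"]:
        return False
    same = near(r1["deaths"], r2["deaths"]) and near(r1["injuries"], r2["injuries"])
    swapped = near(r1["deaths"], r2["injuries"]) and near(r1["injuries"], r2["deaths"])
    return same or swapped

# Records 34 and 35, schematically (dates stylized): 36 killed / 158 injured
# versus 158 killed / 34 injured, an almost exact swap.
r34 = {"date": "D", "deaths": 36, "injuries": 158}
r35 = {"date": "D", "deaths": 158, "injuries": 34}
print(possible_duplicate(r34, r35))  # True
```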

  3. Classification errors, taking the form here of a reversal of death and injury numbers for 1 record (35) with, supposedly, 158 deaths, i.e., affecting 2% of records and 28% of deaths.

Another form of error that occurs in some SIGACTs is that casualty numbers or types are sometimes shifted into incorrect fields, i.e., one type of casualty is misclassified as another type. Record 35, which we already mentioned above in the section on duplication, interchanges deaths and injuries and is the only such classification error in the Karbala sample.  However, similar problems, such as this one and the misclassification of victim categories (Friendly, Host Nation, Civilian, Enemy), occur in other parts of the SIGACTs data.

Record 35 also shows how some records contain multiple error types simultaneously.

  4. Doubling the number of deaths, an error that affects 2 records (11 and 12) for a total of 8 deaths, i.e., 4% of the records and 0.9% of the deaths in the sample.

All casualty fields for record 12 are exactly doubled relative to what is written in both the Summary field (Column D) and the Title field (Column E). Thus, the correct “CIV KIA” figure of 3 is doubled to 6.  IBC record k1878 has this event with 3 deaths and the same date and type.  Thus, without the doubling error this record would match under the CFR algorithm.  With the error, record 12 violates the +/-30% rule, regardless of whether we use 3 or 6 as the base, and becomes a mismatch.

Note that CFR could potentially have caught this error by reading the Title, which they did have at their disposal.  Comparison of these two fields shows, at a minimum, that one of them is wrong and matching must proceed with caution.  The detailed event description (Summary) confirms that the figure of 3 civilian deaths in the Title is correct and, except for the error, this should have matched under CFR’s criteria.

Record 11 makes the same error, converting 1 death into 2.  This error also contradicts both the Title and Summary.
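A consistency check of the kind that could have caught both doublings, given access to the Title field, might look like the sketch below; the column names and the sample Title string are our own schematic reconstruction, not the actual record text:

```python
# A rough sketch of a consistency check between the coded CIV KIA field and
# the casualty figure mentioned in the Title.  The regex, field names and
# sample Title string are our own assumptions, not the actual record text.
import re

def title_civ_kia(title: str):
    """Pull a civilian-killed figure out of a Title string, if one is present."""
    m = re.search(r"(\d+)\s*CIV\s*KIA", title, flags=re.IGNORECASE)
    return int(m.group(1)) if m else None

def kia_consistent(coded_civ_kia: int, title: str) -> bool:
    """Flag records whose coded CIV KIA disagrees with the Title."""
    from_title = title_civ_kia(title)
    return from_title is None or from_title == coded_civ_kia

# Record 12, schematically: the Title reports 3 civilians killed while the
# coded field says 6.
print(kia_consistent(6, "MURDER IVO KARBALA: 3 CIV KIA"))  # False, flag for review
```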

 

Now is a good time to take stock.

It should be abundantly clear that errors in any dataset that is part of a matching project are potentially lethal to the project.  Any casualty dataset of even moderate complexity will probably contain some errors. Conflict casualty datasets tend to be collected and compiled under far from ideal circumstances, so an error mitigation strategy must be a central feature of matching work.  Of course, one can get lucky and wind up working with very high quality data.  CFR were not lucky.  The SIGACTs data is highly valuable, but it is also rather messy, containing many errors that matter greatly for case matching.

CFR and others who have relied on their findings seem not to have considered the possibility of data errors in the SIGACTs.  Rather, it appears that CFR just assumed that they had two pristine datasets with the unique weakness of incompleteness.  This misguided assumption leads them to misinterpret the many data errors as revealing incomplete coverage.  These misinterpretations are not minor.  In effect, CFR padded their discoveries of true coverage problems with a host of other issues that are unrelated to coverage, substantially exaggerating the coverage issue in the process.

And yet we have only told half the story so far.  The next post will cover an additional 4 error types.

New Report on the Relationship between Drone Strikes and Suicide Attacks in Pakistan

I’ve just published a report with my PhD student Luqman Saeed and Iain Overton, the executive director of Action on Armed Violence, one of the NGOs on whose board I serve.

The above link goes to a short summary of the report which, in turn, links to the full report.  So here I’ll just give an even shorter summary that I provided for the new impact page of the Royal Holloway Department of Economics:

Drone strikes are followed by strongly elevated rates of suicide attacks. On average, roughly one additional suicide attack occurred during a 30-day window following a drone strike in Pakistan. This suicide attack caused, on average, 20 deaths and 48 injuries. The pattern of a strike followed by a suicide attack was more pronounced under Bush than under Obama.

We don’t imagine that we’ve definitively settled this question, and we cite some research that draws different conclusions (and explain why we don’t think those studies have settled the issue either).  We will definitely return to this subject.