Correction. The first part of this post argues that an anomaly in a published graph is an error that has some substantive implications. However, an alert reader, Ben Prytherch, proposed a benign explanation for the anomaly. I checked with the authors of the graph and it turned out that Ben is right. So this is a formal correction. I annotate this part of the post below and will write a follow-up post about this as well. [October 2, 2016]
Many of you know that for some months I’ve been involved in a discussion with Pasquale Cirillo and Nassim Nicholas Taleb. Steven Pinker joined me in a recent exchange of letters with Cirillo and Taleb.
For preparation I had another look at the Cirillo-Taleb paper and was taken aback by their figure 14a:
The accompanying text says:
If …events…follow a homogeneous Poisson process…their inter-arrival times need to be exponentially distributed….Figure 14 shows that ….these characteristics are satisfactorily observable in our data set. This is clearly visible in the QQ-plot of Subfigure 14a, where most inter-arrival times tend to cluster on the diagonal.
Please clear your head for the moment of the details (Poisson, QQ, etc.). The key is that the points should line up along the diagonal, which they seem to do. Great!
The diagonal for this picture should be the 45 degree line whereas the line in the above picture is more like a 35 degree line. Notice how the X axis goes out to 11 whereas the Y axis only goes up to 7.
[This is where I start to go wrong. It turns out that the Y axis is scaled differently from the X axis. If the scaling were the same then the points would line up on the 45 degree line. Personally, I think the exposition would be better if the scaling were the same on both axes but the way that Cirillo and Taleb have done this is not an error as I originally asserted. October 2, 2016]
Here is the kind of plot we should see if the data really do follow an exponential distribution as Cirillo and Taleb claim their data do [and the pictures were done with the same scaling on both axes as I would have preferred, October 2, 2016]:
For this proper [replace “proper” with “clearer”, October 2, 2016] QQ graph both axes go to 5 and the diagonal is the 45 degree line. (Ignore the fact that Cirillo and Taleb’s points are stacked above and below each other. This is only because their data points are rounded to the nearest year.)
Thus, Cirillo and Taleb’s figure 14a shows the opposite of what they claim; their data do not fit an exponential distribution. [When properly interpreted the Cirillo-Taleb graph suggests that the data do follow an exponential distribution. October 2, 2016]
I have to say that I looked at figure 14a many times without noticing this problem. Presumably they just made a mistake. [My mistake, actually, October 2, 2016]
But what a slick manoeuvre this would be in the tradition exposed so well by Darrell Huff if done on purpose. Your data need to be on a particular line. You draw a line that goes through your data. You declare success. Busy people don’t notice you haven’t drawn the right line. [Of course, Cirillo and Taleb did not engage in such trickery. October 2, 2016]
By the way, Cirillo and Taleb’s figure 14b
also [October 2, 2016] strikes me as out of tune with their accompanying text:
Moreover, no time dependence should be identified among inter-arrival times, for example when plotting an autocorrelogram (ACF). Figure 14 shows that both of these characteristics [exponential distribution and no time dependence] are satisfactorily observable in our data set.
Again, without getting into the details, they are saying that the little bars are all near 0 (ignore the huge first bar). I agree that the bars are, indeed, lowish. But what about the ones near 0.2? (These are correlations so they have to be between -1.0 and +1.0.) These larger bars do seem to be pretty much below the (unexplained) dotted blue line. Maybe this is a statistical significance line? If so then I’d agree to a formulation along the following lines:
We were unable to reject a hypothesis of 0 time dependence at the ??? level. However, we only had a few hundred observations and with more data we might well reject such a hypothesis, at least for some time lags. Still, it seems that any time dependence in the data is fairly weak.
I don’t see this as a massive smoking gun. I believe that Cirillo and Taleb are in the right ball park with their interpretation of these correlations although they have overstated their case. I do suspect, however, that if Nassim Taleb were standing in my shoes right now he would be shouting that I adamantly deny the overwhelming evidence of massive correlations. [Well, maybe he wouldn’t. He was pretty reasonable in our exchange about me correcting my error. October 2, 2016]
In any case, despite what Cirillo and Taleb seem to think, neither of these pictures directly addresses the main issue that interests them: whether or not there is a trend toward fewer wars per unit of time.
PS - I should mention that one of my colleagues, Alessio Sancetta, helped me think through this post. Of course, all errors are mine and, as always, I’d love to hear from readers and will gladly fix any mistake I may have made.
I assumed a fair amount of knowledge above so here are a few more details for anyone out there who craves them.
The data underpinning the pictures is for large wars since 1500. I don’t have it. I believe that Cirillo and Taleb have not yet released their data but are planning to do so.
Figure 14a is about the distribution of time gaps between wars. Specifically, how often does the next war happen right away (0 time gap), how often do we wait 1 year, 2 years, etc.?
To do an exponential QQ plot you first fit an exponential distribution to the data. This fitted distribution then makes predictions on gaps between wars. The predictions will be, for example, that 75% of the gaps will exceed 2 years or that 50% of the gaps will exceed 4 years, etc. You then graphically compare the predictions with the actual gap distribution. If all the predictions turn out to be exactly correct then the points will line up smack on the 45 degree line. [In my opinion the above is how one should do a QQ plot that is easy to understand. October 2, 2016]
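For anyone who wants to try the recipe above, here is a minimal Python sketch. I don't have the war data, so the gaps are simulated from an exponential distribution with a mean of 3 years. That mean, the function name `exponential_qq_points` and the choice of plotting positions (i - 0.5)/n are all my own assumptions for illustration, not anything taken from the Cirillo-Taleb paper.

```python
import math
import random

def exponential_qq_points(gaps):
    """Return (theoretical, observed) quantiles for an exponential QQ plot.

    Both lists are in the same units (e.g. years), so plotting one against
    the other with equal axis scales puts a perfect fit on the 45 degree line.
    """
    observed = sorted(gaps)
    n = len(observed)
    # Fit the exponential: the maximum-likelihood mean is the sample mean.
    mean_gap = sum(observed) / n
    # Quantile function of the fitted exponential, evaluated at the
    # plotting positions (i - 0.5) / n for i = 1..n.
    theoretical = [-mean_gap * math.log(1.0 - (i - 0.5) / n)
                   for i in range(1, n + 1)]
    return theoretical, observed

# Simulated stand-in for the real gaps (hypothetical mean of 3 years).
random.seed(0)
simulated_gaps = [random.expovariate(1 / 3.0) for _ in range(500)]
theoretical, observed = exponential_qq_points(simulated_gaps)
```

Plotting `observed` against `theoretical` with the same scale on both axes gives the kind of picture described above. A graph that standardizes only one of the two axes will show the same points along a line with a different slope.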
How do we interpret the fact that, when done correctly, the points on the right in figure 14a lie well below the 45 degree line? [What follows in the next paragraph would be true if the QQ plot had been drawn the way I describe but isn’t true of the actual Cirillo Taleb graph.]
This means that the actual gaps at the high end tend to be longer than would be predicted by the fitted exponential curve. Loosely speaking, when the exponential is predicting a gap of 7 the actual gap turns out to be more like 10, etc. In other words, the right-hand tail of the distribution of gaps between wars is stretched to the right compared to the exponential fit.
Figure 14b is checking for correlations between gaps at different time lags. For example, the bar that reaches a height near 0.2 at a lag of 5 says that a longer gap 5 wars ago tends to be associated with a longer gap until the next war. More generally, this shows that knowledge of past gaps appears to be (weakly) useful in predicting future gaps.
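The bars in a picture like figure 14b can be computed along the following lines. This is only a sketch: the gaps are again simulated (independent exponential draws, 300 of them, standing in for "a few hundred observations"), and the function name `autocorrelation` is mine. I also compute the conventional approximate 95% band of ±1.96/√n for independent data. Whether the dotted blue line in figure 14b is this band is my guess, not something the paper states.

```python
import math
import random

def autocorrelation(series, max_lag):
    """Sample autocorrelations of `series` at lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    total = sum(c * c for c in centered)  # lag-0 sum of squares
    return [
        sum(centered[t] * centered[t + k] for t in range(n - k)) / total
        for k in range(max_lag + 1)
    ]

# Simulated, independent gaps (hypothetical stand-in for the real data).
random.seed(1)
gaps = [random.expovariate(1 / 3.0) for _ in range(300)]

acf = autocorrelation(gaps, max_lag=20)

# Approximate 95% band under the hypothesis of no time dependence.
band = 1.96 / math.sqrt(len(gaps))

# Lags (excluding lag 0, which is always 1) whose bars poke outside the band.
outside = [k for k in range(1, len(acf)) if abs(acf[k]) > band]
```

The first entry of `acf` is the huge first bar, which is 1 by construction. Even with truly independent gaps, roughly one bar in twenty can be expected to land outside the band by chance, which is one reason to read a picture like this with some caution.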