Secret Data Sunday – Nassim Nicholas Taleb Edition

When data are central to scientific discussions, as is typically the case, then the relevant data should be open to all.

OK, we don’t have to be totally rigid about this.  People may sink a lot of effort into building a data set so it’s reasonable for data builders to milk their data monopoly for some grace period.  In my opinion, you get one publication.  Then you put your data into the public domain.

And public domain means public domain.  It’s not OK to hide your data from people you don’t like, from people you think are incompetent, from people you suspect of having engaged in acts of moral turpitude, etc.  You post your data so everyone can have them.

If you put your data into the public domain and someone does something stupid with them then it’s fine to say that.  It’s a virtue to be nice but being nice isn’t a requirement.  But as far as I’m concerned you share your data or you’re not doing science.

Readers of the blog should be well aware that there has been a dispute about the decline of war (or not), primarily between Steven Pinker and Nassim Nicholas Taleb.  You can track my participation in this debate from a bunch of my blog entries and the links they contain.  I’m in the middle of preparing a conference talk on this subject, and I’ll post the slides later this week, so more is coming.

I planned a little data work to support the talk so I emailed Taleb asking him for the data he used to launch his attack on Pinker’s work.  Here is his reply.

1) It is not professional to publish a “flaw” without first contacting the authors. You did it twice.

2) Your 2 nitpicking “flaws” betrayed total ignorance of the subject.

So I will ask you to fuck off.

He is referring to this post (which did contain an error that I corrected after a reader pointed it out).

What can I say?

The main thing is that if he wants to do science then it’s not OK to just declare someone to be ignorant and withhold data.

Beyond that I’d say that if he still objects to something in my post he should be specific, either in the comments or to me directly.  As always, I’ll issue a correction or clarification if I get something wrong.

Third, it isn’t standard practice to clear criticisms of someone’s work in advance with the person being criticized.  Doing so could be a reasonable strategy in some cases, and it’s reasonable to send criticism to the person being criticized.  What is essential is correcting errors, as I do.

Anyway, I take away from this episode that Taleb isn’t doing science and also that he probably doesn’t have great confidence in his work on this subject or else he wouldn’t hide his data.


  1. OK, this is certainly true and well worth pointing out.

    With, for example, the survey data I wrote about here we can never have anything unless the original data set is released.

    But we could have something like Taleb’s data if we put in enough time and effort.

    However, this observation does not let him off the hook for releasing his data. It is useful in this context to have a look at the website for the PRIO battle deaths dataset, one of the primary pieces of evidence for the decline-of-war thesis.

    This dataset is constructed entirely from open sources. Yet it is posted for all to see and use.

    And look at the supporting documentation. If you even pop these documents open for a casual look it will be obvious that there is a lot of supported judgment that went into the coding. If someone were to start this project from scratch and ignore PRIO’s supporting documents then he/she would come up with a rather different data set. The only way to reconcile the two would be through close reading of supporting documentation.

    Similarly, if I were to try to recreate Taleb’s data set I would surely come up with something rather different from what he has. Even if I tried to follow his approach I’d fail in various ways because he has produced little supporting documentation. Of course, this would be an interesting exercise but it would take a lot of effort and, ultimately, it would be inconclusive without a lot more information about what he did precisely than is currently in the public domain.

    I’m saying that there is no substitute for his data, even though it is possible to assemble an interesting new and similar dataset from scratch.

    But, also, why put up such a high barrier preventing people from replicating and validating your work? PRIO is eager for scrutiny. Why isn’t Taleb?


  2. He gave a similar reply to Roodman almost two years ago:

    This is why their paper shouldn’t be taken seriously, and shouldn’t be viewed as any notable contribution to the debate about long-term trends in the occurrence of major wars. Going by what little data they DO present, I’m quite skeptical – why start with Boudicca in 61AD and not the battle of Teutoburg Forest in 9AD – which easily surpasses the 3000 body count threshold. Hell, Rome was waging near-constant campaigns under Augustus until 14AD. And at least a few of the smaller battles in the Claudian invasion of Britain met the 3000 threshold. Going by this alone, I would wager that there’s a sizeable underestimate of warfare before 1000AD in their set. But there’s no way to really evaluate this as long as their data remains withheld.


