Where Do Your Data Come From?


This picture is a wonderful opener to an undergraduate seminar.  The story is that the great statistician Abraham Wald collected data on the placements of bullet holes on planes returning from World War II missions.  He made a little dot on a picture representation of a plane for each bullet hole observed on an actual plane.  As the dots accumulated, the starting picture on the left was transformed into something resembling the picture on the right.  (I’ve never seen the actual Wald picture.  Surely it’s not as perfect as the one above, but leave that aside.)

In the classroom you explain that engineers then used the picture to figure out where to reinforce their airplanes.  What was the right thing for them to do?  (You need to explain to students that it was not an option to just reinforce everywhere.  Doing this would be too expensive and would weigh the plane down too much.)

Some people seem to think that only a genius like Wald can solve this puzzle but I find that about a quarter of the time some student just blurts out the right answer.  That said, students have the advantage of knowing that there must be a little trick or I wouldn’t ask the question in the first place.  And only a fairly small minority of students do get it straight away.

Some students think, wrongly, that anti-aircraft guns were highly accurate.  They then cast around for reasons why the enemy decided not to shoot at some parts of the Allied airplanes.  To have any chance of getting the right answer you need to realize that bullets hit the airplanes pretty much randomly all over the place.

Once you clear the hurdle of realizing that the key issue is not targeting, a big trap still looms.  The dark areas still seem to jump out at most people as the right places to reinforce, although the only argument that could support this view is that these were where the bullets were going.  But we just said that the bullets went everywhere, so this can’t be right.

In fact, the black areas are just where we have records of bullets hitting the planes.   Why do we have records of bullets hitting some places but not others?  I can only think of one explanation that makes sense.  Planes that are hit in the white areas don’t return.

Main Conclusion – We should reinforce the white areas.

Falling for the conceptual trap and reinforcing the black areas would be a colossal error.  You would waste a lot of money reinforcing most of your planes, making them unnecessarily heavy and leaving their most vulnerable parts still exposed.
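
For readers who like to see the selection effect in numbers, here is a minimal simulation sketch in Python.  Everything in it is an assumption made for illustration: the section names, the loss probabilities and the number of hits per plane are invented, not taken from Wald's data.  Hits land uniformly at random, but a plane only makes it into the records if it survives every hit.

```python
import random
from collections import Counter

# Hypothetical plane sections and the assumed chance that a single hit
# there brings the plane down. These numbers are invented for illustration.
LOSS_PROB = {
    "engine": 0.6,
    "cockpit": 0.5,
    "tail": 0.1,
    "fuselage": 0.05,
    "wings": 0.05,
}

def simulate(n_planes=10_000, hits_per_plane=3, seed=0):
    """Return (hits on all planes, hits recorded on returning planes)."""
    rng = random.Random(seed)
    sections = list(LOSS_PROB)
    all_hits, observed_hits = Counter(), Counter()
    for _ in range(n_planes):
        # Hits land uniformly at random over the sections: no targeting.
        hits = [rng.choice(sections) for _ in range(hits_per_plane)]
        all_hits.update(hits)
        # The plane returns only if it survives every one of its hits.
        returned = all(rng.random() > LOSS_PROB[s] for s in hits)
        if returned:
            # Only returning planes can be inspected and recorded.
            observed_hits.update(hits)
    return all_hits, observed_hits

all_hits, observed_hits = simulate()
for section in LOSS_PROB:
    print(f"{section:9s} all hits: {all_hits[section]:6d}   "
          f"seen on returning planes: {observed_hits[section]:6d}")
```

Under these made-up numbers the hits on all planes are spread roughly evenly across sections, while the hits recorded on returning planes pile up on the fuselage and wings and are scarce on the engine and cockpit.  The recorded dots show where a plane can be hit and still come home, not where the bullets went.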

The power of the example lies in the fact that the error is both highly tempting and very costly to make.

Please now take a second look at the title of this post.  Abraham Wald carried out a highly successful bit of data collection and analysis.  The whole key to making it work was the way he thought about where his data were coming from.

The Wald airplane analysis is cool enough in its own right to justify a post.  But I bring it up now to highlight some misguided points some people are making in other contexts.  I’ll draw these issues out in future posts but for now I’ll make them just in the context of the present example.  These are obviously unfair criticisms but, hopefully, people will get what I’m doing here when they see these points made in other contexts.

  1. Wald created a mess because he collected biased data.  Wrong.  Wald did collect biased data but he thought carefully about where his data were coming from and managed to draw highly useful and appropriate conclusions from his data.
  2. I have just discovered that Wald’s data are biased.  It would be interesting to discuss whether he has committed statistical malpractice.  Wrong.  Wald actually discovered himself that his data were biased, pointed this out and integrated this fact into his thinking.
  3. I’m not saying that Wald caused Hitler to invade the Soviet Union but he did gather biased conflict data and then Hitler did escalate the war so you can draw your own conclusion on this one.  OK, this one is much dumber than 1 and 2 but you might be surprised about some arguments currently making the rounds about biased data collection causing the rise of ISIS.

Please stay tuned.


