More faked data

You may have seen this article in the Washington Post this weekend. It’s a study of the province-by-province results in the disputed Iranian presidential election. I first heard of it through this tweet by Mark Spieglan (@mjspieglan). Apart from its inherent interest, the article caught my eye because its topic is similar to that of my Stochasticity post from last Friday: How can one determine whether a data set is real or faked?

The authors of the article, Bernd Beber and Alexandra Scacco, look at two things in the provincial vote counts: the distribution of the last digits[1] and pairwise sequences of digits. They conclude that the vote counts were likely faked. The final digits are too unevenly distributed, and the two-digit sequences are too often successive (too many sequences like 23, and not enough like 64) to be due to chance alone.
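To make the last-digit idea concrete, here is a minimal sketch of one common way to check digits for uniformity: a chi-square statistic comparing the observed last digits with the equal shares you'd expect from chance. This is my own illustration, not necessarily the exact test Beber and Scacco ran, and the function name `last_digit_chi_square` is made up for the example.

```python
from collections import Counter


def last_digit_chi_square(counts):
    """Chi-square statistic for the last digits of `counts`.

    Compares the observed frequency of each final digit (0-9) with the
    uniform frequency expected if the digits were due to chance alone.
    Assumes `counts` is a non-empty list of non-negative integers.
    """
    observed = Counter(n % 10 for n in counts)
    expected = len(counts) / 10  # each digit should appear 10% of the time
    return sum(
        (observed.get(digit, 0) - expected) ** 2 / expected
        for digit in range(10)
    )
```

With ten possible digits there are nine degrees of freedom, so a statistic above about 16.9 would be unusual at the 5% level. Feed it a list of vote counts and see how far the last digits stray from uniform.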

At first blush, this may seem like just the opposite of Prof. Deborah Nolan’s coin flipping demonstration. Recall that (quoting myself):

One of her classroom exercises is an attempt to teach her students the true nature of random data. She selects two groups of students. One is to flip a coin a hundred times and record that sequence. The other is to dream up a hundred coin flips and record that sequence. Prof. Nolan leaves the room while this is going on and returns when the two sequences have been written out on the blackboard. She looks at the two long strings of Hs and Ts and judges which one is the real record of coin flips and which is made up. Her assessment, which is almost always correct, is based on how smooth the sequence looks. Real data streams are almost always clumpy, with relatively long runs of heads or tails. People who are faking data tend to alternate regularly between heads and tails, with only short runs of consecutive Hs or Ts.
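If you want to see the clumpiness for yourself, here is a small simulation sketch. It flips a simulated fair coin a hundred times, many times over, and records the longest run of identical outcomes in each sequence. The names `longest_run` and `simulate_longest_runs`, the trial count, and the seed are all my own choices for illustration.

```python
import random
from itertools import groupby


def longest_run(flips):
    """Length of the longest run of identical consecutive outcomes."""
    return max((len(list(group)) for _, group in groupby(flips)), default=0)


def simulate_longest_runs(n_flips=100, trials=2000, seed=0):
    """Longest run in each of `trials` simulated sequences of fair coin flips."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [longest_run(rng.choices("HT", k=n_flips)) for _ in range(trials)]


runs = simulate_longest_runs()
print("average longest run:", sum(runs) / len(runs))
print("share of sequences with a run of 5 or more:",
      sum(r >= 5 for r in runs) / len(runs))
```

Runs of five or more turn up in the large majority of real sequences, which is exactly the kind of streak people tend to avoid when they make the flips up.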

It appears that whereas Nolan is looking for clumping, Beber and Scacco are looking for smoothness. Not quite. Despite the clumpiness, Nolan is still expecting to see roughly equal numbers of heads and tails. If she looked at a series of purported coin flips and saw, say, 60 heads, she’d be a bit suspicious. And if it were 65 heads, she’d be really suspicious. (You can use Wolfram Alpha to see why.) The point of both Beber and Scacco’s work and Nolan’s classroom exercise is to examine what random data really look like and to calculate how likely it is that pure chance led to the data given.
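The suspicion about 60 or 65 heads can be checked with a few lines of code. This sketch sums the binomial probabilities for a fair coin to get the chance of seeing at least a given number of heads in 100 flips. The function name `prob_at_least` is mine, and the fair-coin probability of 0.5 is an assumption built into the default.

```python
from math import comb


def prob_at_least(k, n=100, p=0.5):
    """Probability of at least `k` heads in `n` flips with heads probability `p`."""
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i)  # binomial probability of exactly i heads
        for i in range(k, n + 1)
    )


print("P(60 or more heads):", prob_at_least(60))
print("P(65 or more heads):", prob_at_least(65))
```

For a fair coin, 60 or more heads comes up roughly 3% of the time, and 65 or more roughly 0.2% of the time. The first is enough to raise an eyebrow; the second is enough to raise both.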

Beber and Scacco didn’t just cobble together some numbers for a Washington Post article; this sort of examination of voting data is part of their normal research. Here’s a summary of their work on the 2003 Nigerian election (that’s a PDF link to a presentation at a poster session).

At the very bottom of the Post article is a funny example of something that seems, at first glance, to be highly unlikely:

Bernd Beber and Alexandra Scacco, Ph.D. candidates in political science at Columbia University, will be assistant professors in New York University’s Wilf Family Department of Politics this fall.

What, you may ask, is the likelihood that two Ph.D. candidates, currently studying at one university, will move on in the same year to essentially identical faculty positions at another university? The odds against that must be astronomical, right? Yes, unless the two are married, in which case it’s pretty commonplace. I’ll bet that’s how Beber and Scacco would interpret it.


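  1. Why look at only last digits and not first digits? Benford’s Law: the leading digits of real-world counts are not expected to be uniformly distributed (1 is the most common first digit), so unevenness there would not, by itself, suggest fakery.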