No, I’ve not discovered any particular new fraud in the temperature record.
What I’m doing is the first step: I’m stating a potential METHOD to detect fudging of data. (Or a systematic skew of the data via unintentional transformative processes.)
The idea depends on Benford’s Law, which states that the first digits of many data sets follow a logarithmic distribution: 1 is the most common leading digit, and the frequency decreases steadily down to 9, the least common.
There are limits to Benford’s Law. One is that the data are more likely to obey it when they span a few orders of magnitude. (If you are measuring a property with a narrow range, like the average melting point of tin samples, you would not expect a wide distribution of digits…)
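For anyone who wants the actual numbers, the usual statement of the law is that digit d leads with probability log10(1 + 1/d). A minimal sketch in Python (the function name `benford_expected` is just my own label for illustration):

```python
import math

def benford_expected():
    """Expected first-digit probabilities under Benford's Law, keyed by digit 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Print the expected share of each leading digit.
for digit, p in benford_expected().items():
    print(f"{digit}: {p:.3f}")
```

The nine probabilities sum to 1, with a 1 leading about 30% of the time and a 9 under 5%.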
Benford’s law can only be applied to data that are distributed across multiple orders of magnitude. For instance, one might expect that Benford’s law would apply to a list of numbers representing the populations of UK villages beginning with ‘A’, or representing the values of small insurance claims. But if a “village” is a settlement with population between 300 and 999, or a “small insurance claim” is a claim between $50 and $100, then Benford’s law will not apply.
This means that using F, where temperatures span 3 orders of magnitude, is just marginally usable while C with only 2 orders of magnitude would “have issues”.
As the records are stored in 1/10ths, the overall distribution is actually over 4 orders of magnitude for F and 3 for C, but frankly I’d rather have the extra digit of margin. I suppose one could also say that negative values add some increase in the effective range too… but that’s a bit speculative at this point.
Now, this raises a bit of a paradox, as there is the small matter of Benford’s Law being scale invariant: and here I am picking a scale on the grounds that the choice of scale changes the result…
The law can alternatively be explained by the fact that, if it is indeed true that the first digits have a particular distribution, it must be independent of the measuring units used (otherwise the law would be an effect of the units, not the data). This means that if one converts from feet to yards (multiplication by a constant), for example, the distribution must be unchanged — it is scale invariant, and the only continuous distribution that fits this is one whose logarithm is uniformly distributed.
For example, the first (non-zero) digit of the lengths or distances of objects should have the same distribution whether the unit of measurement is feet, yards, or anything else. But there are three feet in a yard, so the probability that the first digit of a length in yards is 1 must be the same as the probability that the first digit of a length in feet is 3, 4, or 5. Applying this to all possible measurement scales gives a logarithmic distribution, and combined with the fact that log10(1) = 0 and log10(10) = 1 gives Benford’s law. That is, if there is a distribution of first digits, it must apply to a set of data regardless of what measuring units are used, and the only distribution of first digits that fits that is the Benford Law.
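The feet and yards arithmetic in that quote is easy to check. A small sketch, assuming a log-uniform distribution of the significand (the helper name `p_leading_between` is mine, not from any library):

```python
import math

def p_leading_between(lo, hi):
    """Probability that the significand falls in [lo, hi) when its log10 is uniform."""
    return math.log10(hi) - math.log10(lo)

# A length whose first digit in yards is 1 has a significand in [1, 2).
p_yards_1 = p_leading_between(1, 2)
# Multiplying by 3 (yards to feet) moves that to [3, 6): first digit 3, 4, or 5.
p_feet_345 = p_leading_between(3, 6)

print(p_yards_1, p_feet_345)
```

Both come out as log10(2), which is the scale invariance the quote describes.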
I’m comfortable with setting aside that conundrum for now, and simply accepting that the leading digit in C may not obey Benford’s Law as well, since we already know there are not a lot of places on the planet with a 50, 60, 70, 80, or 90 C range. That is, the physical upper bound on temperatures will bias the sample. However, it would be worth doing the test in C as well, as Benford’s Law might still apply (given the likely increase in values in the 1x.x and 1.x ranges). In essence, if the C values also obey Benford’s Law it would be great confirmation, but if they don’t, that is not likely to be confirmation of failure so much as a flag to study more closely just what is going on with the data distribution and whether that low number of orders of magnitude has an impact.
It ought to be pretty easy to do the test. Just take the data and count the number of instances of the first digit of each data item being a 1, 2, 3, etc. and plot. It ought to be a reasonable approximation of the Benford’s Law distribution – unless the data are cooked (or I’ve managed to screw up at the get go by failure to observe some limitation in the data distributions of temperature data items that makes them unsuited to a Benford’s Law test).
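As a rough sketch of that counting step in Python (not the FORTRAN I’d actually use on the files): the function names and the sample list here are my own placeholders, and the sample values are made up for illustration, not real station data.

```python
import math
from collections import Counter

def leading_digit(value):
    """Return the first non-zero digit of a number, or None for zero."""
    # Scientific notation puts the leading digit first, e.g. '9.86...e+01'.
    text = f"{abs(value):.15e}"
    return int(text[0]) if text[0] != "0" else None

def first_digit_counts(values):
    """Count how often each digit 1..9 appears as the leading digit."""
    counts = Counter()
    for v in values:
        d = leading_digit(v)
        if d is not None:
            counts[d] += 1
    return counts

def compare_to_benford(values):
    """Return (digit, observed fraction, Benford fraction) for digits 1..9."""
    counts = first_digit_counts(values)
    total = sum(counts.values())
    rows = []
    for d in range(1, 10):
        observed = counts[d] / total if total else 0.0
        rows.append((d, observed, math.log10(1 + 1 / d)))
    return rows

# Placeholder values only, NOT real station data.
sample_temps_f = [12.3, -1.5, 98.6, 0.0, 45.1, 101.2, 7.8, 33.0]
for digit, observed, expected in compare_to_benford(sample_temps_f):
    print(f"{digit}: observed {observed:.3f}  expected {expected:.3f}")
```

Note the choices baked in: the sign is ignored and exact zeros are skipped, since a zero has no leading digit. Whether that is the right treatment for temperature records is an open question.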
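Basically, it ought to give a chart like the standard Benford’s Law curve: roughly 30% of leading digits are a 1, falling steadily to under 5% for a 9.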
Unfortunately, this test is unlikely to be definitive in either direction. Failure can simply mean that temperature data have a natural distribution that does not fit the law (though with 4 digits of F I’m having trouble thinking of how…) while a successful pass of the test may just mean that the Data Diddler was very clever.
Forensics are often like that. You get “indications of reason to suspect” more often than you get “Finger Of God’s Own Truth”. One fingerprint can just mean the person was there at an unknown time in the past (and it was OK); you need to correlate that with more data (when was the room cleaned, were they authorized to be in the space at all, ever?) before it becomes more than just a flag of suspicion.
So what good is it then?
Pretty simple. If the data conform, it lends credence to the idea that there was no ham-handed deliberate Data Diddling (and you can focus on more abstract and difficult searches, probably for computer driven diddling that can keep the statistics right).
If the test fails, it says you need another bit of exploration to show probable fraud and it tells you what that bit of exploration would be.
Basically, show that unbiased temperature data do obey the law; then you have a smoking gun. That can be simply done with some old raw data of known quality. IFF a known unprocessed broad sample obeys Benford’s Law, but the post processed GHCN v.1 or GHCN v.2 or GHCN v.3 do not (or even more deliciously, if V.1 does and V.2 DOES NOT ;-) well, then it’s hanging time in the court of Data Diddling…
Is this a brand new idea, or is there some reason to think it’s an OK use of Benford’s Law?
Well, it’s pretty well accepted as a forensic tool, and it’s even accepted in court. I think that means it has some validity (though it would need a better statistician than I am to testify).
In 1972, Hal Varian suggested that the law could be used to detect possible fraud in lists of socio-economic data submitted in support of public planning decisions. Based on the plausible assumption that people who make up figures tend to distribute their digits fairly uniformly, a simple comparison of first-digit frequency distribution from the data with the expected distribution according to Benford’s law ought to show up any anomalous results. Following this idea, Mark Nigrini showed that Benford’s law could be used in forensic accounting and auditing as an indicator of accounting and expenses fraud. In the United States, evidence based on Benford’s law is legally admissible in criminal cases at the federal, state, and local levels.
This use, though, depends on the human desire for an even distribution of made up numbers. A consistent algorithmic variation (such as increasing by 0.5 C across the board) is less likely to cause a broken distribution. A bias such as “lift some 1.x to 2.x” temps would be shown, even if done algorithmically, as it would shift the leading digit count toward a 2 and away from a 1; so in that kind of case the more ‘subtle’ adjustments can yield the most indication of biases.
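To make the “lift some 1.x to 2.x” case concrete, here is a toy simulation. Everything in it is an assumption for illustration: the data are synthetic log-uniform numbers (not temperatures), and `lift_ones` with its 30% fraction is a hypothetical bias I made up, not any known adjustment.

```python
import random
from collections import Counter

def leading_digit(value):
    """Return the first non-zero digit of a number, or None for zero."""
    text = f"{abs(value):.15e}"
    return int(text[0]) if text[0] != "0" else None

def lift_ones(value, rng, fraction=0.3):
    """Hypothetical bias: bump a fraction of 1.x values up to 2.x."""
    if leading_digit(value) == 1 and rng.random() < fraction:
        # Values here lie in [1, 1000), so the digit count of the integer
        # part gives the place value of the leading digit (1, 10, or 100).
        magnitude = 10 ** (len(str(int(value))) - 1)
        return value + magnitude
    return value

rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic values whose log10 is uniform over three orders of magnitude.
clean = [10 ** rng.uniform(0, 3) for _ in range(10000)]
biased = [lift_ones(v, rng) for v in clean]

before = Counter(leading_digit(v) for v in clean)
after = Counter(leading_digit(v) for v in biased)
for d in range(1, 10):
    print(f"{d}: before {before[d]:5d}  after {after[d]:5d}")
```

The count of leading 1s drops and the count of leading 2s rises by the same amount, while digits 3 through 9 are untouched, which is the kind of dent a first-digit tally would show.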
Basically, it’s a valid method, but it doesn’t catch everything.
Also, I’m not the first person to think of this. A quick web search showed at least one other person has thought of it, but I’ve not found any evidence of it having been done (yet…)
June 28, 2009 at 5:50 pm
PaulH wrote: “All of this is looking more and more Enron-esque with each passing day. How much longer can they cook the books before it all comes crashing down?”
Benford’s Law on the distribution of leading digits is sometimes used to catch those who “cook” financial records. However with data such as temperature that has a restricted range, can Benford’s Law be adjusted to take this into account?
See “Applications and Limitations” section in:
There is also a distribution on the second leading digits which may not be as sensitive to data with restricted ranges.
(See “Generalization to digits beyond the first” section in the above.)
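For reference, the second-digit distribution mentioned in that comment can be computed from the standard generalization: sum log10(1 + 1/(10k + d)) over first digits k = 1..9. A minimal sketch (the function name is my own):

```python
import math

def benford_second_digit():
    """Expected second-digit probabilities under Benford's Law, keyed by digit 0..9."""
    return {
        d: sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
        for d in range(10)
    }

# Print the expected share of each second digit.
for digit, p in benford_second_digit().items():
    print(f"{digit}: {p:.4f}")
```

The second digit is much flatter than the first, running from about 12% for a 0 down to about 8.5% for a 9, so a second-digit test needs more data to show a departure.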
I’ll likely put another 10 minutes or so into more creative web searches for “prior art” prior to doing the actual data test myself, but that is another area where folks could “Dig Here!” as I’m up for tea and breakfast before I do more on this line.
Someone with decent spreadsheet skills could do this fairly easily just using the common spreadsheet applications. I’ve got the data in UNIX files, so would likely use a more longhand FORTRAN approach (as I have the code to read the files already set up). However, I’m trying to catch up on a lot of other things right now, so I’m also unlikely to get around to it before Christmas.
This puts us all in a bit of a Race Condition…
A race I would be happy to lose…
So if anyone else wants to run a Benford’s Law test on the data and report back here, feel free and by all means take the credit. I’m happy to just have been a ‘useful irritant’ by presenting the idea.
If not, well, I’ll get around to it eventually…
I am also pretty sure that a similar test for “proper” distribution of final digits could be done (though not Benford’s Law). I’ve seen some indications of non-random distribution artifacts (like that temperature series in the Paducah posting where the last digit kept hopping back and forth from a .4 to a .9 repeatedly). I believe the trailing digits ought to have an even distribution with no nodal points. So the whole data set, if looked at with both tools, could be tested for anomalous leading digit, second digit, and trailing digit distributions.
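A rough sketch of the trailing digit check, assuming (as I believe, but have not shown) that an even spread over 0 through 9 is the right baseline. It uses a plain chi-square statistic against a uniform expectation; the function names are mine, and the alternating .4/.9 series is a made-up stand-in for the Paducah pattern, not the actual data.

```python
from collections import Counter

def trailing_digit(value):
    """Tenths digit of a value recorded to one decimal place."""
    return int(round(abs(value) * 10)) % 10

def chi_square_uniform(values):
    """Chi-square statistic of trailing digits against an even 10% each."""
    if not values:
        raise ValueError("need at least one value")
    counts = Counter(trailing_digit(v) for v in values)
    expected = len(values) / 10
    return sum((counts[d] - expected) ** 2 / expected for d in range(10))

# Made-up series hopping between .4 and .9, NOT real station data.
hopping = [20.4, 20.9] * 50
stat = chi_square_uniform(hopping)
# 16.919 is the usual 5% critical value for 9 degrees of freedom.
print(f"chi-square = {stat:.1f}, suspicious: {stat > 16.919}")
```

A statistic well above the critical value only says the final digits are not evenly spread; it does not say why.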
Yeah, kind of dull work… but not all that hard to do and the results can sometimes be rather interesting…