For the last couple of days I’ve been fussing with the problems caused by the inconsistency of GHCN from version to version. In v3, Russia has 2 country codes: one for the European half, one for the Asian half. In v4 there is only one abbreviation, “RS”, for all of it.
That showed up in the Russian anomaly comparison graphs of the prior GHCN v3.3 vs v4 Asia set: I’m trying to compare two versions of “one country” when there are really three country definitions (two in v3, one in v4). So, OK, I put in a footnote (sort of, really an inline comment) saying this was an issue and ignored the inconsistency.
Trying to find a way to “fix that” I thought: “Well, heck, just use WMO number. Each instrument has a unique WMO number. Associate the country with the WMO for each in a distinct table. Make WMO the key field.”
That led me to do a spot check on WMO consistency. The first block below is the v3 inventory file, where the first three digits are the “country number”, then there are 5 digits of WMO# and 3 digits of flags for instruments near that WMO site or a changed instrument at that site. For v4 there are 2 letters of country abbreviation, then three characters of various status information, then the WMO ought to be the next block. So, think you can match on those WMO Numbers?
chiefio@PiM3Devuan2:~/SQL/v3$ grep FRANCIS inventory.in
10468054000 -21.2200 27.5000 1000.0 FRANCISTOWN 991S 22FLxxno-9x-9SUCCULENT THORNSA
40778460003 19.3000 -70.3000 110.0 SAN FRANCISCO DE MACORIS D 210U 65HIxxno-9x-9WARM CROPS B
41476423001 24.4000 -104.3200 1960.0 FRANCISCO I. MADERO, DURANGO 2014R -9MVDEno-9x-9WARM GRASS/SHRUBC
42500147093 39.7675 -101.8097 1024.7 SAINT FRANCIS 1030R -9FLxxno-9x-9COOL GRASS/SHRUBC
42572494000 37.6200 -122.3800 5.0 SAN FRANCISCO 102U 6253FLxxCO15A 1COASTAL EDGES C
42574506002 37.7700 -122.4300 22.0 SAN FRANCISCO/MISSION DOLORES 70U 6253HIxxCO 1x-9COASTAL EDGES C
50998437000 13.3700 122.5200 45.0 SAN FRANCISCO 125R -9HIxxCO 1x-9WATER A
chiefio@PiM3Devuan2:~/SQL/v3$ grep FRANCIS ../v4/inventory.in
BC008948490 -21.2170 27.5000 1001.0 FRANCISTOWN
CA004012720 50.1167 -103.9167 603.0 FRANCIS
DR092205945 19.2800 -70.2500 110.0 SAN_FRANCISCO_DE_MACORIS
MXM00076843 16.7700 -93.3410 1051.9 FRANCISCO_SARABIA
MXXLT082709 24.9100 -104.4600 1700.0 FRANCISCO_PRIMO_VERD
RPXLT752551 13.3700 122.5200 45.0 SAN_FRANCISCO
SF000175820 -34.2000 24.8330 7.0 CAPE_ST_FRANCIS
USC00047767 37.7281 -122.5053 2.4 SAN_FRANCISCO_OCEANSD
USC00147093 39.7675 -101.8067 1024.7 SAINT_FRANCIS
USC00168136 30.7775 -91.3769 35.1 ST_FRANCISVILLE
USC00363018 41.1183 -75.7278 459.9 FRANCIS_E_WALTER_DAM
USW00023234 37.6197 -122.3647 2.4 SAN_FRANCISCO_INTL_AP
USW00023272 37.7706 -122.4269 45.7 SAN_FRANCISCO_DWTN
VEM00080416 10.4850 -66.8440 856.0 GENERALISIMO_FRANCISCO_DE_MIR
chiefio@PiM3Devuan2:~/SQL/v3$
For those who don’t know, “San Francisco International Airport” is located in South San Francisco (a different city). I’m pretty sure that the So. San Francisco station of the first set (72494) is the same as San Francisco INTL AP (023234) of the second one, though perhaps the thermometer moved to a slightly different LAT / LONG at the airport (or they fixed their slight location error). It looks like the v3 entry has the “A” Airstation flag set.
Clearly “Francistown” is the same one. 68054 vs 948490. LAT/LONG match within rounding band.
Then Saint_Francis almost gets a match, but it looks like one of them is offset. “00147” plus the 093 modification flags vs 147093.
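For anyone who wants to repeat that spot check, here is a minimal Python sketch (not the code used for this post). It assumes the ID layout as described above: v3 as 3 digits of country, 5 of WMO and 3 of modifier; v4 as 2 letters of country followed by 9 more characters. The function names are made up for illustration, and the IDs are copied from the grep output.

```python
def split_v3_id(station_id):
    """Split an 11-character v3 ID into country, WMO and modifier fields."""
    return {
        "country": station_id[:3],
        "wmo": station_id[3:8],
        "modifier": station_id[8:11],
    }


def wmo_appears_in_v4(v3_id, v4_id):
    """True if the 5-digit v3 WMO block shows up anywhere after the
    2-letter country abbreviation of the v4 ID."""
    return split_v3_id(v3_id)["wmo"] in v4_id[2:]


# (v3 ID, v4 ID, name) pairs judged by eye to be the same station.
pairs = [
    ("10468054000", "BC008948490", "FRANCISTOWN"),
    ("42500147093", "USC00147093", "SAINT FRANCIS"),
    ("42572494000", "USW00023234", "SAN FRANCISCO AIRPORT"),
]

for v3_id, v4_id, name in pairs:
    print(name, split_v3_id(v3_id), v4_id, wmo_appears_in_v4(v3_id, v4_id))
```

Of these three pairs, only Saint Francis carries the v3 “WMO” digits over into the v4 ID, and there the last 8 characters of both IDs are identical, which is consistent with the offset noted above.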
But just how on God’s earth to match things up? The names change. The LAT / LONG change small bits. The Country changes (both format and what country places are in as countries come and go). And it looks like the WMO# is a “variable constant” over time.
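One partial answer is to ignore the IDs and names and match on position alone. What follows is only a sketch under stated assumptions: the great-circle (haversine) distance is a standard formula, but the 2 km cutoff is an arbitrary number picked for illustration, the function names are hypothetical, and the coordinates are copied from the grep output above.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearest_v4(v3_station, v4_stations, max_km=2.0):
    """Return (v4_id, distance_km) of the closest v4 station within
    max_km of the v3 station, or None if nothing is that close."""
    best = None
    for v4_id, lat, lon in v4_stations:
        d = haversine_km(v3_station[1], v3_station[2], lat, lon)
        if d <= max_km and (best is None or d < best[1]):
            best = (v4_id, d)
    return best


# (ID, LAT, LONG) taken from the two inventory listings.
v3 = [
    ("10468054000", -21.2200, 27.5000),
    ("40778460003", 19.3000, -70.3000),
    ("42572494000", 37.6200, -122.3800),
]
v4 = [
    ("BC008948490", -21.2170, 27.5000),
    ("DR092205945", 19.2800, -70.2500),
    ("USC00047767", 37.7281, -122.5053),
    ("USW00023234", 37.6197, -122.3647),
    ("USW00023272", 37.7706, -122.4269),
]

for station in v3:
    print(station[0], nearest_v4(station, v4))
```

With that cutoff, Francistown and the San Francisco airport station pair up, but San Francisco de Macoris does not, since its coordinates differ by more than 2 km between versions. Any fixed radius will either miss moved stations or merge neighbors, so this is a first pass and not a fix.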
There are only 2 countries afflicted by this split personality problem, 1/2 Asia 1/2 Europe: Russia & Kazakhstan. I suppose I could just “crowbar” them both into 100% Asia for v3 and ignore the rest of the volatility of definitions. But really, I should not need to do such things.
There’s a great deal of “Dick With Factor” in these datasets at all levels. Even to the assignment of WMO Number to a given station in a given place. It’s almost like they are trying to hide things…
Over 8 years ago I met Tom Peterson in Asheville to get his take on the sharp reduction in stations listed in the GHCN. At that time all I had was the v2 files but Tom gave me a copy of the v3 files.
It was my intention to compare v2 with v3, but I failed miserably owing to lack of experience manipulating databases. It did not bother me much at the time since Tony Heller soon did it better than I could ever hope to do.
Now you have v4, it will be interesting to find out what has changed. Back in 2010 I asked Tom Peterson about the “Adjustments” that appeared to cool the past but got nothing out of him that made any sense.
https://diggingintheclay.wordpress.com/2010/12/28/dorothy-behind-the-curtain-part-1/
My retail corporation set up seven-digit “meaningful” facility code numbers that had similar “dick with” features. It wasn’t malicious but it was extremely careless. After a half century of workaround measures we grafted on a whole new set of key facility numbers (of EIGHT digits, so senior management could easily understand why the new numbers were better than the old ones) that were “meaningless”. The meaningful numbers were retained as attributes of the new key codes.
Except that, to make legacy programs work, the meaningful digits of the seven-digit string required pre-processes to extract relevant digits from the string, check them against a list of exception tables, then pass stuff back and down to the legacy system, do a chore, and pass the result up and forward to the new SQL-based enterprise system …
*sigh*