I’m starting the process of examining the GHCN (Global Historical Climatology Network) Version 4 data set. This is just a “first peek” at it, to see how much looks different and to get an idea of what kind of work is ahead. At first blush it looks like a lot has changed, and it is designed to make comparisons between the “versions” difficult. I hope to fix that…
Where To Get It
I downloaded this set “long hand” from:
https://www1.ncdc.noaa.gov/pub/data/ghcn/v4/
Then I unpacked the “un-adjusted” version (which DOES have adjustments in it – they note that the data may come pre-adjusted from the various country BOMs) along with the descriptions and text files. There is also a directory that looks to be individual station graphs for every station. I’m pretty sure they are all “adjusted” graphs, so they will consistently show warming (as their “adjustments” plot exactly to the CO2 predictions… strange, that: all the warming is in the adjustments and they just happen to exactly match the CO2 theory predictions…)
Eventually, after I’ve sized things up, I want to create a “GHCN v4 Analysis Station From Scratch” posting and whenever (if ever?) that happens I’ll put in an rsync command / script that does the download.
The Downloaded Stuff
Here’s what you get:
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4$ ls -l
total 124484
-rw-r--r-- 1 chiefio chiefio     3955 Mar 18 17:53 ghcnm-countries.txt
-rw-r--r-- 1 chiefio chiefio     6221 Mar 18 17:53 ghcnm-flags.txt
-rw-r--r-- 1 chiefio chiefio 40872107 Mar 18 17:55 ghcnm.tavg.latest.qcf.tar.gz
-rw-r--r-- 1 chiefio chiefio 42708888 Mar 18 17:50 ghcnm.tavg.latest.qcu.tar.gz
-rw-r--r-- 1 chiefio chiefio 43854929 Mar 18 17:55 ghcnm.tavg.latest.qfe.tar.gz
-rw-r--r-- 1 chiefio chiefio    11472 Mar 18 17:51 readme.txt
-rw-r--r-- 1 chiefio chiefio     4416 Mar 18 17:51 status.txt
drwxr-xr-x 3 chiefio chiefio       68 Mar 18 17:57 unpacking
I created the “unpacking” directory where I could look over the .qcu file (that one is the “unadjusted” set).
Here’s that record:
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4$ mkdir unpacking
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4$ cp ghcnm.tavg.latest.qcu.tar.gz unpacking/
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4$ cd unpacking/
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4/unpacking$ ls
ghcnm.tavg.latest.qcu.tar.gz
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4/unpacking$ gunzip ghcnm.tavg.latest.qcu.tar.gz
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4/unpacking$ ls
ghcnm.tavg.latest.qcu.tar
You can see that I only work on a copy of the gzipped file, and I uncompress it with the gunzip command in Linux. That gives a “tar” or tape archive format file. That then gets “untarred”. First I did it with “tv”, which says “table of contents, verbose”, to see what I was going to get, then used “xvf”, that is “eXtract, Verbosely, from the File”, where the file is the .tar file named in the command:
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4/unpacking$ tar tvf ghcnm.tavg.latest.qcu.tar
-rw-rw-r-- jared.rennie/jared.rennie 162513448 2019-03-18 08:36 ./ghcnm.v4.0.0.20190317/ghcnm.tavg.v4.0.0.20190317.qcu.dat
-rw-rw-r-- jared.rennie/jared.rennie   1888285 2019-03-18 08:36 ./ghcnm.v4.0.0.20190317/ghcnm.tavg.v4.0.0.20190317.qcu.inv
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4/unpacking$ tar xvf ghcnm.tavg.latest.qcu.tar
./ghcnm.v4.0.0.20190317/ghcnm.tavg.v4.0.0.20190317.qcu.dat
./ghcnm.v4.0.0.20190317/ghcnm.tavg.v4.0.0.20190317.qcu.inv
That unpacks into a directory as two files, so I’ll go into that directory (“cd” or Change Directory)
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4/unpacking$ cd ghcnm.v4.0.0.20190317/
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4/unpacking/ghcnm.v4.0.0.20190317$ ls
ghcnm.tavg.v4.0.0.20190317.qcu.dat ghcnm.tavg.v4.0.0.20190317.qcu.inv
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4/unpacking/ghcnm.v4.0.0.20190317$ ls -l
total 160556
-rw-r--r-- 1 chiefio chiefio 162513448 Mar 18 08:36 ghcnm.tavg.v4.0.0.20190317.qcu.dat
-rw-r--r-- 1 chiefio chiefio   1888285 Mar 18 08:36 ghcnm.tavg.v4.0.0.20190317.qcu.inv
You may notice that “owner” and “group” changed to me, since I unpacked it as me, not as the superuser “root”.
The .inv file is the “inventory” and tells you about the stations. That .dat file is the temperatures.
At this point, it is an unpacked version of GHCN v4 “unadjusted”. About 162 MB of data. So what’s it look like?
The Files
Here’s a bit of the inventory file:
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4/unpacking/ghcnm.v4.0.0.20190317$ head *.inv
ACW00011604  57.7667   11.8667   18.0 SAVE
AE000041196  25.3330   55.5170   34.0 SHARJAH_INTER_AIRP
AEM00041184  25.6170   55.9330   31.0 RAS_AL_KHAIMAH_INTE
AEM00041194  25.2550   55.3640   10.4 DUBAI_INTL
AEM00041216  24.4300   54.4700    3.0 ABU_DHABI_BATEEN_AIR
AEM00041217  24.4330   54.6510   26.8 ABU_DHABI_INTL
AEM00041218  24.2620   55.6090  264.9 AL_AIN_INTL
AF000040930  35.3170   69.0170 3366.0 NORTH_SALANG
AFM00040911  36.7000   67.2000  378.0 MAZAR_I_SHARIF
AFM00040938  34.2100   62.2280  977.2 HERAT
I notice first that it starts with 3 letters instead of the Region / Country digits. Then come 8 digits of what looks like the old WMO# but with “000” in front, which leads me to think they swapped the “modification” or “near” number to the front and put the 5 digit WMO at the end. (Hopefully they didn’t make the whole thing backwards or ‘all new’ like they did in the v1 to v2 transition.)
I’d guess the next two are LAT and LONG then an altitude and station name. Gone are all those “annoying” bits of data like nearness to water, airport flag, urban /rural flag, vegetation type and other things that would show they have largely gone to thermometers at airports in urban areas near or over pavement…
So right out of the gate there will be “some assembly required” to match these up with the prior data. I’m pretty sure I can work that out. One table with country flag number and country flag text to match them up (about 160 lines – probably manual to create), and perhaps another with WMO# to Station Name by version to match all of them up. So 7280 lines in the v2 / v3 case, more for this one as they have a bunch of USHCN and Germany added (rest of world, not so much…)
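Roughly the shape I have in mind for those two mapping tables (a sketch only; the table and column names are my own invention, not anything NOAA ships):

CREATE TABLE country_map (
    abrev   CHAR(2),       -- v4 FIPS 2 letter country abbreviation
    cnum    CHAR(3),       -- old v2 / v3 3 digit region / country number
    cname   VARCHAR(60),   -- country name as spelled in that version's file
    version CHAR(5)        -- which GHCN version the row came from
);

CREATE TABLE wmo_map (
    wmo     CHAR(5),       -- 5 digit WMO station number
    sname   VARCHAR(30),   -- station name as spelled in that version
    version CHAR(5)
);

The country_map is the mostly-manual part; wmo_map can be generated from each version’s inventory file.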
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4/unpacking/ghcnm.v4.0.0.20190317$ wc -l *.inv
27361 ghcnm.tavg.v4.0.0.20190317.qcu.inv
So 27,361 instrument records now? OK… It will be interesting looking at a global map of them. From other sources I’ve heard they were mostly added in the USA and Germany. I may need to move off of the Raspberry Pi with that data size. It is already 10 minutes for some long reports. Make that 4x as long and we’re talking 40 minutes? So I’ll plan on a validation of times on the Pi M3, but expect I’ll need one of the USB 3.0 boards with faster processors to make this really fast.
I figure that’s about 2 weeks of full time work to get ‘er done, so probably a few months elapsed time (unless someone can scare up a “Grant” to pay me a salary… otherwise I have to put a roof on the house this spring… yeah, me, on the roof, with a hammer…)
So, moving on to the “data” file:
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4/unpacking/ghcnm.v4.0.0.20190317$ head *.dat
ACW000116041961TAVG -142 k 183 k 419 k 720 k 1075 k 1546 k 1517 k 1428 k 1360 k 1121 k 457 k -92 k
ACW000116041962TAVG 60 k 32 k -207 k 582 k 855 k 1328 k 1457 k 1340 k 1110 k 941 k 270 k -179 k
ACW000116041963TAVG -766 k -606 k -152 k 488 k 1171 k 1574 k 1567 k 1543 k 1279 k 887 k 513 k -161 k
ACW000116041964TAVG 9 k -138 k 2 k 685 k 1166 k 1389 k 1453 k 1504 k 1168 k 735 k 493 k 59 k
ACW000116041965TAVG -9 k -158 k -15 k 537 k 934 k 1447 k 1434 k 1424 k 1324 k 921 k -22 k -231 k
ACW000116041966TAVG -490 k -614 k 108 k 246 k 1082 k 1642 k 1620 k 1471 k 1195 k 803 k 329 k 2 k
ACW000116041967TAVG -270 k 36 k 397 k 481 k 1052 k 1373 k 1655 k 1598 k 1318 k 997 k 559 k -96 k
ACW000116041968TAVG -306 k -183 k 220 k 714 k 935 k 1635 k 1572 k 1718 k 1331 k 781 k 180 k -56 k
ACW000116041969TAVG -134 k -494 k -185 k 497 k 962 k 1634 k 1687 k 1773 k 1379 k 932 k 321 k -275 k
ACW000116041970TAVG -483 k -704 k -75 k 261 k 1093 k 1724 k 1470 k 1609 k 1163 k 836 k 300 k 73 k
So 12 sets of “k” (or some flag for who knows what) and what looks like a temperature in 1/100 C. Cheeky little bastards to think they have a clue to 1/100 C. That takes exceptional calibration and a lab; not some bit of gear stuck out by the runway at Dallas Ft. Worth and ignored for a year+. But, OK, I can use my basic data prep / loading FORTRAN program to get that loaded. Now, for the front of the record…
We’ve got 3 Letters, then what looks like 8 digits of WMO# + “near” flag (again with the minor number ‘near’ part in front instead of in the back as they did in the past), then what looks like a 4 digit year and a flag to tell you type (in this case Temperature Average).
I think I can work with this lot too. Mostly it will be making a map of the 3 letter flag to the 3 digit Region / Country of the past. A manual task, but not too many countries nor continents to deal with.
The Descriptor Files
It comes with files that describe all this stuff. I generally like to just “eyeball” things first, then check against the documentation. I can find more errors in the documentation that way (if any) and at the same time check the quality of my instincts.
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4$ wc -l *.txt
  238 ghcnm-countries.txt
  125 ghcnm-flags.txt
  274 readme.txt
  103 status.txt
  740 total
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4$
None of these are so big as to be horrible. 238 Countries? Didn’t know there were that many… Maybe some are sub-sets like overseas departments of France and such? The “status” file is a list of things that changed as they made the set. Announcement of ‘beta’ availability and such. I’ll leave it for last. Flags ought to be interesting. The Readme can be helpful too. So I’ll start with them.
Flags File
Here’s the flags:
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4$ cat *flags*
GHCN-Monthly, Version 4, Flag File

This file describes all of the data source flags used in GHCN-Monthly version 4.
They are formatted as follows:

Column1: Alpha-Numeric Flag
Column2: Dataset Origin (Note that some sources exist in more than one dataset)

         GHCND: Global Historical Climatology Network: Daily Dataset
                ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/
         GHCNM: Global Historical Climatology Network: Monthly Dataset
                ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v3/
         ISTI:  International Surface Temperature Initiative
                ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/

Column3: Description of Flag
********************************************************************************
0 GHCND U.S. Cooperative Summary of the Day (NCDC DSI-3200)
6 GHCND CDMP Cooperative Summary of the Day (NCDC DSI-3206)
7 GHCND U.S. Cooperative Summary of the Day -- Transmitted via WxCoder3 (NCDC DSI-3207)
A GHCND U.S. Automated Surface Observing System (ASOS) real-time data (since January 1, 2006)
a GHCND Australian data from the Australian Bureau of Meteorology
B GHCND U.S. ASOS data for October 2000-December 2005 (NCDC DSI-3211)
b GHCND Belarus update
C GHCND Environment Canada
E GHCND European Climate Assessment and Dataset (Klein Tank et al., 2002)
G GHCND Official Global Climate Observing System (GCOS) or other government-supplied data
H GHCND High Plains Regional Climate Center real-time data
I GHCND International collection (non U.S. data received through personal contacts)
K GHCND U.S. Cooperative Summary of the Day data digitized from paper observer forms (from 2011 to present)
M GHCND Monthly METAR Extract (additional ASOS data)
N GHCND Community Collaborative Rain, Hail, and Snow (CoCoRaHS)
Q GHCND Data from several African countries that had been "quarantined", that is, withheld from public release until permission was granted from the respective meteorological services
R GHCND NCDC Reference Network Database (Climate Reference Network and Historical Climatology Network-Modernized)
r GHCND All-Russian Research Institute of Hydrometeorological Information World Data Center
S GHCND Global Summary of the Day (NCDC DSI-9618)
s GHCND China Meteorological Administration/National Meteorological Information Center/Climatic Data Center (http://cdc.cma.gov.cn)
T GHCND SNOwpack TELemetry (SNOTEL) data obtained from the Western Regional Climate Center
U GHCND Remote Automatic Weather Station (RAWS) data obtained from the Western Regional Climate Center
u GHCND Ukraine update
W GHCND WBAN/ASOS Summary of the Day from NCDC's Integrated Surface Data (ISD).
X GHCND U.S. First-Order Summary of the Day (NCDC DSI-3210)
z GHCND Uzbekistan update
Z GHCND/GHCNM Datzilla
m ISTI mexico
V ISTI vietnam
c ISTI channel-islands
e ISTI ecuador
p ISTI pitcairnisland
j ISTI giessen
i ISTI brazil-inmet
1 ISTI brazil
2 ISTI argentina
4 ISTI india
5 ISTI gsn-sweden
8 ISTI canada-raw
9 ISTI east-africa
! ISTI uganda
# ISTI antarctica-aws
$ ISTI antarctica-palmer
% ISTI antarctica-southpole
& ISTI ispd-swiss
( ISTI ispd-ipy
) ISTI ispd-sydney
* ISTI antarctica-scar-reader
+ ISTI spain
- ISTI uruguay-inia
. ISTI uruguay
d ISTI swiss-digihom
/ ISTI ispd-tunisia-morocco
: ISTI sacad_non-blended
; ISTI japan
  ISTI russsource-australia_de
^ ISTI russsource-australia_wwr
_ ISTI russsource-ghcn
' ISTI russsource-climat
` ISTI russsource-conus_climat
~ ISTI russsource-ak_hi_climat
? ISTI germany
w ISTI wmssc
@ ISTI central-asia
[ ISTI arctic
h ISTI histalp
] ISTI hadisd
F ISTI/GHCND usforts
D ISTI/GHCNM mcdw-unpublished
" ISTI/GHCNM ghcnsource
J ISTI/GHCNM colonialera
| ISTI/GHCNM climat-uk
} ISTI/GHCNM mcdw
k ISTI/GHCNM knmi
P ISTI/GHCNM climat-ncdc
{ ISTI/GHCNM wwr
, ISTI/GHCNM climat-bufr
3 ISTI/GHCNM climat-prelim
So, was that that “K”? Well, I’ll sort that out later. What is clear is that there’s a lot of sources and they used funny letters to code them. Got it. Other than that, not so useful.
README File
How about the README? I’ve bolded some bits of interest. Looks like I’m generally right on my guesses.
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4$ cat readme.txt

GLOBAL HISTORICAL CLIMATOLOGY NETWORK - MONTHLY (GHCN-M) Version 4
(Last Updated: 10/26/2018)

********************************************************************************
The underlying data, its software and all content are provided on an 'as is'
and 'as available' basis. We do not give any warranties, whether express or
implied, as to the suitability or usability of the product, its software or
any of its content.

Feedback is welcome and should be sent via email to: ncdc.ghcnm@noaa.gov

Should you have any questions or concerns regarding the product, please let
us know immediately so we can rectify these accordingly. Your help in this
regard is greatly appreciated.
********************************************************************************

1. INTRODUCTION

1.1 OVERVIEW

GHCN-M version 4 currently contains monthly mean temperature for over 25,000
stations across the globe. In large part GHCN-M version 4 uses the same
quality control and bias correction algorithms as version 3. The greatest
difference from previous version is a greatly expanded set of stations based
on the large data holdings in GHCN-Daily as well as data collected as part
of the International Surface Temperature Initiative databank (ISTI; Rennie
et al. 2013).

1.2 INTERNET ACCESS

The GHCNM v4 product can be found here:

FTP: ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v4

1.3 DOWNLOADING AND INSTALLING

WINDOWS (example):

GHCNM files are compressed into gzip format. For more information on gzip,
please see the following web site: www.gzip.org and for potential software
that can compress/decompress gzip files, please see: www.gzip.org/#faq4

LINUX (example):

wget ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v4/ghcnm.latest.qcu.tar.gz
tar -zxvf ghcnm.latest.qcu.tar.gz

(alternatively, if "tar" does not support decompression, a user can try:

wget ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v4/ghcnm.latest.qcu.tar.gz
gzip -d ghcnm.latest.qcu.tar.gz
tar -xvf ghcnm.latest.qcu.tar.gz)

Note: the data are placed in their own separate directory, that is named
according to the following specification:

ghcnm.v4.x.y.YYYYMMDD where

x    = integer to be incremented with major data additions
y    = integer to be incremented with minor data additions
YYYY = year specific dataset was processed and produced
MM   = month specific dataset was processed and produced
DD   = day specific dataset was processed and produced

Two files (per element) should be present in the directory: a 1) metadata
and 2) data file.

Note: there will be no increments to "x" and "y" above during the phase.

2. DATA

2.1 METADATA

The metadata has been carried over from GHCN-Monthly v3. This would include
basic geographical station information such as latitude, longitude,
elevation, station name, etc. The extended metadata information, such as
surrounding vegetation, etc. was not carried over, however still exists in
version 3.

2.1.1 METADATA FORMAT (.inv file)

Variable      Columns   Type
--------      -------   ----
ID             1-11     Integer
LATITUDE      13-20     Real
LONGITUDE     22-30     Real
STNELEV       32-37     Real
NAME          39-68     Character

Variable Definitions:

ID: Station identification code. First two characters are FIPS country code
LATITUDE: latitude of station in decimal degrees
LONGITUDE: longitude of station in decimal degrees
STNELEV: is the station elevation in meters. -999.0 = missing.
NAME: station name

2.2 DATA (.dat file)

The data within GHCNM v4 for the time being consist of monthly average
temperature.

2.2.1 DATA FORMAT

Variable      Columns   Type
--------      -------   ----
ID             1-11     Integer
YEAR          12-15     Integer
ELEMENT       16-19     Character
VALUE1        20-24     Integer
DMFLAG1       25-25     Character
QCFLAG1       26-26     Character
DSFLAG1       27-27     Character
  .             .          .
  .             .          .
  .             .          .
VALUE12      108-112    Integer
DMFLAG12     113-113    Character
QCFLAG12     114-114    Character
DSFLAG12     115-115    Character

Variable Definitions:

ID: Station identification code. First two characters are FIPS country code
YEAR: 4 digit year of the station record.
ELEMENT: element type, monthly mean temperature="TAVG"
VALUE: monthly value (MISSING=-9999). Temperature values are in hundredths
       of a degree Celsius, but are expressed as whole integers (e.g.
       divide by 100.0 to get whole degrees Celsius).

DMFLAG: data measurement flag, nine possible values for QCU/QCF and one
        for QFE:

        Quality Controlled Unadj/Adj (QCU/QCF) Flags:

        blank = no measurement information applicable
        a-i   = number of days missing in calculation of monthly mean
                temperature (currently only applies to the 1218 USHCN V2
                stations included within GHCNM)

        Quality Controlled Adj and estimated (QFE) Flags:

        E = Data has been estimated from neighbors within the PHA

QCFLAG: quality control flag, eleven possibilities within quality controlled
        unadjusted (qcu) dataset, and three possibilities within the quality
        controlled adjusted (qcf) dataset.

        Quality Controlled Unadjusted (QCU) QC Flags:

        BLANK = no failure of quality control check or could not be
                evaluated.

        Note: the following QC checks are listed in order of execution.
        When a value is flagged by a QC check it is not subjected to
        testing in subsequent checks.

        E = Identify different stations that have the "same" 12 monthly
            values for a given year (e.g. when all 12 months are duplicated
            and there are at least 3 or more non-missing data months, and
            furthermore values are considered the "same" when the absolute
            difference between the two values is less than or equal to
            0.015 deg C)

        D = monthly value is part of an annual series of values that are
            exactly the same (e.g. duplicated) within another calendar year
            in the station's record.

        R = Flag values greater than or less than known world TAVG extremes.

        K = Identifies and flags runs of the same value (non-missing) in
            five or more consecutive months

        W = monthly value is duplicated from the previous month, based upon
            regional and spatial criteria (Note: test is only applied from
            the year 2000 to the present, and in general only for near real
            time produced data sources).

        I = checks for internal consistency between TMAX and TMIN. Flag is
            set when TMIN > TMAX for a given month.

        L = monthly value is isolated in time within the station record and
            flagged, when:

            1) a non-missing value has at least 18 missing values before
               AND after in time, or
            2) a non-missing value belongs to a "cluster" of 2 adjacent (in
               time) non-missing values, and the cluster of values has at
               least 18 missing values before AND after the cluster, or
            3) a non-missing value belongs to a "cluster" of 3 adjacent (in
               time) non-missing values, and the cluster of values has at
               least 18 missing values before AND after the cluster.

        O = monthly value that is >= 5 bi-weight standard deviations from
            the bi-weight mean. Bi-weight statistics are calculated from a
            series of all non-missing values in the station's record for
            that particular month.

        S = Flags value when the station z-score satisfies any of the
            following algorithm conditions.

            Definitions:

            neighbor = any station within 500 km of target station.
            zscore   = (value - bi-weight mean) / bi-weight standard deviation
            S(Z)     = station's zscore
            N(Z)     = the set of the "5" closest non-missing neighbor
                       zscores. (Note: this set may contain less than 5
                       neighbors, but must have at least one neighbor
                       zscore for algorithm execution)

            Algorithm:

            S(Z) >= 4.0  and < 5.0  and "all" N(Z) < 1.9
            S(Z) >= 3.0  and < 4.0  and "all" N(Z) < 1.8
            S(Z) >= 2.75 and < 3.0  and "all" N(Z) < 1.7
            S(Z) >= 2.5  and < 2.75 and "all" N(Z) < 1.6
            S(Z) <= -4.0  and > -5.0  and "all" N(Z) > -1.9
            S(Z) <= -3.0  and > -4.0  and "all" N(Z) > -1.8
            S(Z) <= -2.75 and > -3.0  and "all" N(Z) > -1.7
            S(Z) <= -2.5  and > -2.75 and "all" N(Z) > -1.6

        T = Identifies and flags when the temperature z-score compared to
            the inverse distance weighted z-score of all neighbors within
            500 km (at least 2 or more neighbors are required) is greater
            than or equal to 3.0.

        M = Manually flagged as erroneous.

        Quality Controlled Adjusted (QCF) QC Flags:

        A = alternative method of adjustment used.

        M = values with a non-blank quality control flag in the "qcu"
            dataset are set to missing in the adjusted dataset and given
            an "M" quality control flag.

        X = pairwise algorithm removed the value because of too many
            inhomogeneities.

DSFLAG: data source flag for monthly value

        For more information on data source flags, please refer to
        ghcnm-flags.txt

3. CONTACT

3.1 QUESTIONS AND FEEDBACK

NCDC.GHCNM@noaa.gov
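Given those column positions, MariaDB can slice the fixed-width lines itself, which is one alternative to the FORTRAN pre-pass I mentioned. A sketch only, with my own table name, using the documented .inv columns:

CREATE TABLE inv4 (
    id    CHAR(11),
    lat   DECIMAL(8,4),
    lon   DECIMAL(9,4),
    elev  DECIMAL(7,1),
    name  VARCHAR(30)
);

LOAD DATA LOCAL INFILE 'ghcnm.tavg.v4.0.0.20190317.qcu.inv'
INTO TABLE inv4
LINES TERMINATED BY '\n'
(@line)
SET id   = SUBSTR(@line,  1, 11),        -- cols  1-11  ID
    lat  = TRIM(SUBSTR(@line, 13,  8)),  -- cols 13-20  LATITUDE
    lon  = TRIM(SUBSTR(@line, 22,  9)),  -- cols 22-30  LONGITUDE
    elev = TRIM(SUBSTR(@line, 32,  6)),  -- cols 32-37  STNELEV
    name = TRIM(SUBSTR(@line, 39, 30));  -- cols 39-68  NAME

The .dat file parses the same way: VALUE1 at columns 20-24, its three flags at 25 / 26 / 27, then each following month stepping 8 columns to the right out to column 115.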
So it looks like 3 flags in a row around that “k”, and I’ll need a finer grained reading of just which column that is to know which flag it is. Got it.
I also note they say it is an 11 “digit” Integer value starting in column 1, but also that the first two characters are the FIPS Country Code. Looks like a documentation error to me. They have 3 CHAR and then 8 INT, so “some interpretation required”… I’ll work that out when I do a detailed read of the description and match the 3 CHAR to “whatever”…
Taking a peek at the bottom of the data file is not encouraging:
chiefio@PiM3Devuan2:/WDTB/xfs/GHCNv4/unpacking/ghcnm.v4.0.0.20190317$ tail *.dat
ZIXLT6221161961TAVG 1965 J 1945 J 2004 J 1850 J 1690 J 1310 J 1240 J 1320 J 1785 J 2065 J 1950 J 2000 J
ZIXLT6221161962TAVG 2035 J 2035 J 1900 J 1835 J 1315 J 1395 J-9999 -9999 -9999 -9999 -9999 -9999
ZIXLT6221161963TAVG-9999 -9999 -9999 -9999 -9999 1265 J-9999 -9999 -9999 -9999 -9999 -9999
ZIXLT6221161964TAVG-9999 -9999 -9999 -9999 -9999 1120 J-9999 -9999 -9999 -9999 -9999 -9999
ZIXLT6221161965TAVG-9999 -9999 -9999 -9999 -9999 1150 J 1200 " 1420 " 1660 " 1840 " 2040 " 2220 "
ZIXLT6221161966TAVG 2180 " 2040 " 1840 " 1690 " 1430 " 1280 " 1210 " 1460 " 1770 " 1980 " 2090 " 2110 "
ZIXLT6221161967TAVG 2110 " 1990 " 1890 " 1920 " 1510 " 1350 " 1100 " 1380 " 1660 " 2080 " 1990 " 1910 "
ZIXLT6221161968TAVG 2180 " 2000 " 1930 " 1820 " 1560 " 1080 " 1370 " 1630 " 1760 " 2180 " 1840 " 2070 "
ZIXLT6221161969TAVG 2090 " 2150 " 1950 " 1830 " 1410 " 1310 " 1160 " 1460 " 1780 " 2100 " 2040 " 1910 "
ZIXLT6221161970TAVG 2070 " 1990 " 1930 " 1720 " 1560 " 1330 " 1330 " 1540 " 2040 " 2030 " 2130 " 2150 "
It has 5 leading CHAR and then 6 INT of what looks like WMO#. My guess would be the leading 5 CHAR are a complicated code and not just a “minor station number near a WMO” as in the past. OK, harder to work with and match up, but not impossible. IF, as described, the first two are Country, and the last 5 are WMO# (a fella can hope, can’t he?), that just leaves 4 characters of “mystery meat” to work out inside the “ID” of 11 “Integer” characters that really include letters.
But at least now you see why I don’t just blindly trust the documentation right out the gate…
So again “some assembly required”…
I’ll need to ponder a bit how to unscramble the “Region / Country” business. The basic problem being that countries change over time (is Crimea Ukraine or Russia right now?…) and really it ought not to be part of the data structure. Oh Well. Maybe just a table of WMO and the FIPS country and the 3 digit v2 / v3 country and then the v1 country? IIRC v1 WMO was a different length, so might need it in the table too to show that mapping.
IF the WMO Number can be used as the Primary Key, that would be ideal, then make all the rest of the metadata and country code(s) dependent values. That ought to straighten out the inventory data.
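In database terms that would look something like this (a sketch of the idea only, with my names, and only IF the WMO really is unique):

CREATE TABLE station (
    wmo   CHAR(6) PRIMARY KEY,  -- 6 digits, leading zero for old 5 digit IDs
    lat   DECIMAL(8,4),
    lon   DECIMAL(9,4),
    elev  DECIMAL(7,1),
    fips  CHAR(2),              -- v4 FIPS country code, dependent attribute
    cnum  CHAR(3),              -- v2 / v3 region / country number, ditto
    name  VARCHAR(30)
);

That way a country renaming itself changes a dependent column, not the station identity. (As it turns out further down, that uniqueness assumption gets shaky…)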
Having the bulk of all stations devoid of things like Airstation Flag and distance to water will make it impossible to do that kind of report, but I can still do altitude and lat / long. Thin gruel that…
From here on down I’ve included, without comment, the “Status” and “Countries” files, just as simple documentation. I think I have a handle on what I need to do to suck this into a MariaDB database and compare it with GHCN v3. So I’ll start working on that now. The rest of this can “catch up” ;-)
Status File
GHCN-Monthly, Version 4, Status File

(users can use this file to determine overall current status, including
information related to previous changes and errata).

DISCLAIMER: The status file does not record every possible change that may
have occurred from one version to another. This would be impractical for a
variety of reasons. Therefore if a user is interested in discovering every
possible difference between two different file versions, they will need to
construct their own program to determine all possible changes between the
file versions.

********************************************************************************
10/26/2018

GHCN-M version 4.0.0 is now operational. Issues may still arise during the
first few days of operations. For any questions, please email
ncdc.ghcnm@noaa.gov
********************************************************************************
09/24/2018

With the early online release of the GHCN-M v4 paper, this product is being
transitioned from a public beta (v4.b.2) to operations (v4.0.0). During this
transition, issues may arise with data output. For questions, please email
ncdc.ghcnm@noaa.gov
********************************************************************************
02/16/2018

Beginning with the 20180215 run, a third version of v4 beta will be produced.
In addition to QCU and QCF, a QFE inventory and data file will be generated.
The QFE includes estimated data for stations for the period of 1961-1990.
Data are estimated based upon neighboring station info in the Pairwise
Homogeneity Algorithm (PHA). Data is only provided between 1961 and 1990,
and will be used to assess climate trends.
********************************************************************************
11/22/2017

An update to the public BETA of GHCN-M version 4 has been made available,
known as v4.b.2. Changes include:

- Incorporating the update from the ISTI Databank, now at version 1.1.1.
- Incorporating changes and updates to the quality control procedure. For
  more information about the flags, please refer to the readme file.
- Other minor bug fixes, as pointed out from the user community.

Please remember this is still a beta version of GHCN-M version 4, which is
undergoing final testing before its official release. The underlying data,
its software and all content are provided on an 'as is' and 'as available'
basis. We do not give any warranties, whether express or implied, as to the
suitability or usability of the product, its software or any of its content.

Feedback is welcome and should be sent via email to: ncdc.ghcnm@noaa.gov

Should you have any questions or concerns regarding the product, please let
us know immediately so we can rectify these accordingly. Your help in this
regard is greatly appreciated.
********************************************************************************
07/20/2017

Beginning this week, v4 beta will begin ingesting new data and adjusting it
through the Pairwise Homogeneity Algorithm. This update essentially reverts
the change that was made on March 20th, 2017.
********************************************************************************
03/20/2017

Starting with the 20170310 run, v4 Beta will stop adjusting new data
ingested. Instead, quality controlled unadjusted (QCU) data will be appended
to both QCU and QCF files of v4 beta. During the first month of the new
year, the Pairwise Homogeneity Adjustment (PHA) will be run again to adjust
the dataset's entire period of record.
********************************************************************************
10/15/2015

A public BETA of GHCN-M version 4 has been made available, effective
10/15/2015. This is the first beta distribution, and its version is known
as v4.b.1

Please note that this is a beta version of GHCN-M version 4, which is still
undergoing final testing before its official release. The underlying data,
its software and all content are provided on an 'as is' and 'as available'
basis. We do not give any warranties, whether express or implied, as to the
suitability or usability of the product, its software or any of its content.

Feedback is welcome and should be sent via email to: ncdc.ghcnm@noaa.gov

Should you have any questions or concerns regarding the product, please let
us know immediately so we can rectify these accordingly. Your help in this
regard is greatly appreciated.
Countries File
AC Antigua and Barbuda
AE United Arab Emirates
AF Afghanistan
AG Algeria
AJ Azerbaijan
AL Albania
AM Armenia
AO Angola
AQ American Samoa [United States]
AR Argentina
AS Australia
AU Austria
AY Antarctica
BA Bahrain
BB Barbados
BC Botswana
BD Bermuda [United Kingdom]
BE Belgium
BF Bahamas, The
BG Bangladesh
BH Belize
BK Bosnia and Herzegovina
BL Bolivia
BM Burma
BN Benin
BO Belarus
BP Solomon Islands
BR Brazil
BU Bulgaria
BX Brunei
BY Burundi
CA Canada
CB Cambodia
CD Chad
CE Sri Lanka
CF Congo (Brazzaville)
CG Congo (Kinshasa)
CH China
CI Chile
CJ Cayman Islands [United Kingdom]
CK Cocos (Keeling) Islands [Australia]
CM Cameroon
CN Comoros
CO Colombia
CQ Northern Mariana Islands [United States]
CS Costa Rica
CT Central African Republic
CU Cuba
CV Cape Verde
CW Cook Islands [New Zealand]
CY Cyprus
DA Denmark
DJ Djibouti
DO Dominica
DR Dominican Republic
EC Ecuador
EG Egypt
EI Ireland
EK Equatorial Guinea
EN Estonia
ER Eritrea
ES El Salvador
ET Ethiopia
EU Europa Island [France]
EZ Czech Republic
FG French Guiana [France]
FI Finland
FJ Fiji
FK Falkland Islands (Islas Malvinas) [United Kingdom]
FM Federated States of Micronesia
FP French Polynesia
FR France
FS French Southern and Antarctic Lands [France]
GA Gambia, The
GB Gabon
GG Georgia
GH Ghana
GI Gibraltar [United Kingdom]
GJ Grenada
GK Guernsey
GL Greenland [Denmark]
GM Germany
GP Guadeloupe [France]
GQ Guam [United States]
GR Greece
GT Guatemala
GV Guinea
GY Guyana
HA Haiti
HK Hong Kong
HO Honduras
HR Croatia
HU Hungary
IC Iceland
ID Indonesia
IM Isle of Man
IN India
IO British Indian Ocean Territory [United Kingdom]
IR Iran
IS Israel
IT Italy
IV Cote D'Ivoire
IZ Iraq
JA Japan
JE Jersey
JM Jamaica
JN Jan Mayen [Norway]
JO Jordan
JQ Johnston Atoll [United States]
JU Juan De Nova Island [France]
KE Kenya
KG Kyrgyzstan
KN Korea, North
KR Kiribati
KS Korea, South
KT Christmas Island [Australia]
KU Kuwait
KZ Kazakhstan
LA Laos
LE Lebanon
LG Latvia
LH Lithuania
LI Liberia
LO Slovakia
LQ Palmyra Atoll [United States]
LS Liechtenstein
LT Lesotho
LU Luxembourg
LY Libya
MA Madagascar
MB Martinique [France]
MC Macau S.A.R
MD Moldova
MF Mayotte [France]
MG Mongolia
MH Montserrat
MI Malawi
MJ Montenegro
MK Macedonia
ML Mali
MO Morocco
MP Mauritius
MQ Midway Islands [United States]
MR Mauritania
MT Malta
MU Oman
MV Maldives
MX Mexico
MY Malaysia
MZ Mozambique
NC New Caledonia [France]
NE Niue [New Zealand]
NF Norfolk Island [Australia]
NG Niger
NH Vanuatu
NI Nigeria
NL Netherlands
NO Norway
NP Nepal
NR Nauru
NS Suriname
NT Netherlands Antilles [Netherlands]
NU Nicaragua
NZ New Zealand
PA Paraguay
PC Pitcairn Islands [United Kingdom]
PE Peru
PK Pakistan
PL Poland
PM Panama
PO Portugal
PP Papua New Guinea
PS Palau
PU Guinea-Bissau
QA Qatar
RE Reunion [France]
RI Serbia
RM Marshall Islands
RO Romania
RP Philippines
RQ Puerto Rico [United States]
RS Russia
RW Rwanda
SA Saudi Arabia
SB Saint Pierre and Miquelon [France]
SC Saint Kitts and Nevis
SE Seychelles
SF South Africa
SG Senegal
SH Saint Helena [United Kingdom]
SI Slovenia
SL Sierra Leone
SN Singapore
SO Somalia
SP Spain
ST Saint Lucia
SU Sudan
SV Svalbard [Norway]
SW Sweden
SX South Georgia and the South Sandwich Islands [United Kingdom]
SY Syria
SZ Switzerland
TD Trinidad and Tobago
TE Tromelin Island [France]
TH Thailand
TI Tajikistan
TL Tokelau [New Zealand]
TN Tonga
TO Togo
TP Sao Tome and Principe
TS Tunisia
TT Timor-Leste
TU Turkey
TV Tuvalu
TW Taiwan
TX Turkmenistan
TZ Tanzania
UG Uganda
UK United Kingdom
UP Ukraine
US United States
UV Burkina Faso
UY Uruguay
UZ Uzbekistan
VC Saint Vincent and the Grenadines
VE Venezuela
VM Vietnam
VQ Virgin Islands [United States]
WA Namibia
WF Wallis and Futuna [France]
WI Western Sahara
WQ Wake Island [United States]
WS Samoa
WZ Swaziland
XX Unknown
YM Yemen
ZA Zambia
ZI Zimbabwe
E M I am watching you do this with admiration and no small amount of envy.
I also understand that you are DOCUMENTING your process so that others can redo it and test the process you are doing, when & if they want. Though it won’t be me as there are too many demands on my time here. :- (
So you are running a properly conducted scientific experiment !
Thanks for your commitment to this !
I expect to see you sign off, at the end with a QED ! ( Quod Erat Demonstrandum ! )
It seems those crooks at NOAA/GHCN did not enjoy Tony Heller’s mockery. As usual with gumment conspiracies the answer is less transparency so they have tried to make it difficult to do a v2 vs. v3 vs. v4 comparison.
Having a corrupt DoJ puts our nation at risk so we need to fix it. Having a corrupt worldwide “Science Establishment” puts global prosperity at risk. We definitely need to fix that.
Thank you Chiefio for casting light into dark places. Let us watch the cockroaches scurrying. Too bad that none of these crooks will ever serve a day in jail.
EM, providing that you can match up the Station number to V3, you can then use the V3 supplementary data.
I would attempt to split out the Station numbers into their own field for, as you say, Key use.
It will be fascinating to compare the individual values in V4 with those in V3 to see what they have corrupted.
Other people have different versions of V4 which show Adjustments every time it was run, so we know it has happened.
I wouldn’t mind a copy of your finalised SQL tables, but I would use Access SQL and VBA to work with the data.
Would you be interested in other versions of V4 if I could locate who has them?
I am pretty sure Tony Heller has some, but I am not sure if he will share.
@A.C.Osborn:
I am generally a data Pack Rat so I’m always interested in “versions”. That said, I think I need to get this one loaded up and make some mileage toward what it says before I run after too many more copies.
Since it was “Beta” until about 5 months ago, most “differences” can be blown off with a claim of “in development and not production” anyway.. Interesting in what it can say about their process, but not going to have much effect beyond that.
The references I’m using for SQL syntax frequently point out the differences between the Oracle, Microsoft, and MySQL versions (and Postgres? I think…). Mostly these are things IN the other versions that are missing in MySQL. So I expect that most of what I do, and publish, will run with little or no change on other systems.
Visual Basic? I once spent a year+ writing production code in HP’s “Business Basic” using their Image and Query database product. A Cost Accounting system. Anything I’ve done with FORTRAN can be done in BASIC. The translation would not be hard as the nature of both languages is similar (procedural, stuff things in variables, similar commands, similar control structures). I’ve not used any of the fancy math features available in FORTRAN. It would not be my first choice, but nothing makes it a bad choice, IMHO.
FWIW, HP’s BASIC was written by a frustrated Pascal programmer IMHO. Had “BEGIN END” blocks, subroutines, functions, long variable names, DO WHILE, and more. Yeah, limited formal BASIC would work in it, but you could also write mostly Pascal and it would be “right” ;-) Both compiled and interpreted. What I wrote in it looked a lot like Pascal with a line number on every line ;-)
I had one maintenance program that was a big recursive descent (as the language was fine with recursive and re-entrant functions). It did great on the test case. Then we ran it at the first production site and it “blew up”. A Dentist where people would visit regularly for decades… The descent of the history used all memory and swap and just ran out of space before it reached the end of the record. In one mad week I got to rewrite the thing to NOT be recursive… and use a lot less memory (even if it was also a lot less “cool” ;-)
Since then I’ve tended to avoid “interesting” and “cool” and “trick” code and write “the most direct and simple code” possible….
Well this is interesting… Looks like 9999 in the altitude column may have some special meaning:
So ^MG is “Mongolia”. Then who knows what XLT5 is (or XLTx), then the WMO Number (I think…). Or maybe there’s some 6 digit version of the WMO#? But that altitude ought to be known unless they are flagging something else.
The third line has XLT but also has an altitude, so disjoint meanings in the two fields (i.e. not both coding for ‘missing’ or whatever).
I think I need to make a table of WMO# / Name and start from there… Picking one, it does have data in it
but only a few years (1965 to 1978). OK, so can’t just ignore those with 9999 altitude.
Looks like 6 digit WMO per this interesting and useful list:
StnList_s.pdf
A nice list of global sites and WMO numbers, lat, long, elev.
So, OK, that just leaves “XLT” to explain…
Sidebar:
Sky News is showing pictures of snow falling in Greece right now… (I presume now, it is a news show). So spring snow in Greece is “warming”? …
Comparing FUKUSHIMA (that ought to exist in both)
v3.3:
Then v4:
Hmmm… Putting just the records of interest next to each other, first line v3, the rest v4:
So four records with 3 different LAT LONG and Altitude (one of which is the 9999 flag and has XLT in the mystery fields). The last one matches v3 on LAT, LONG & Altitude, but not WMO. The 2nd one looks to match on 5 digits of WMO, but not on LAT, LONG, or Altitude. And is a different name at the airport.
How to unscramble that?…
Heck, how to just detect how much of it there is in the data set… Match the core 5 digits of WMO and then compare for differences in LAT, LONG, ALT? Grumble…
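That “how much of it is there” check is at least an easy query, assuming the inv4 table sketched earlier plus a matching inv3 table holding the v3.3 inventory (all names mine; the 0.01 degree and 1 meter mismatch thresholds are arbitrary picks of mine, not anything official):

SELECT SUBSTR(v3.id, 4, 5) AS wmo5,
       v3.name AS name3, v4.name AS name4,
       v3.lat,  v4.lat,
       v3.lon,  v4.lon,
       v3.elev, v4.elev
FROM   inv3 v3
JOIN   inv4 v4 ON SUBSTR(v4.id, 7, 5) = SUBSTR(v3.id, 4, 5)
WHERE  ABS(v3.lat  - v4.lat)  > 0.01   -- moved in latitude
   OR  ABS(v3.lon  - v4.lon)  > 0.01   -- moved in longitude
   OR  ABS(v3.elev - v4.elev) > 1.0;   -- moved in elevation

-- v3 IDs: 3 digit country + 5 digit WMO + 3 digit modifier, so the WMO
-- core is SUBSTR(id, 4, 5); in v4 my guess puts it in the last 5 of 11.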
A useful 49 page PDF document that describes all the stuff they do in creating the data set:
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v4/documentation/CDRP-ATBD-0859%20Rev%201%20GHCN-M%20Mean%20Temperature-v4.pdf
Trudging through it now…
Page 8 of that doc has a nice map of stations color coded by age. Then on page 30 there’s this:
So 20,000 stations were “corrected”… but most of them are under 2 C…
One wonders how much this “homogenizing” the data accounts for the loss of low excursions and the “warming” seen when the tops don’t rise but low going excursions “go away”…
21047595001 37.7500 140.4700 67.0 FUKUSHIMA
JA000047595 37.2330 140.4330 375.0 FUKUSHIMA_ARPT
JAXLT020742 37.4550 140.2820 9999.0 FUKUSHIMA
JAXLT358607 37.7500 140.4700 67.0 FUKUSHIMA
On google maps 37.7500 140.4700 turns out to be in the middle of the Abukuma river in Fukushima.
The third entry, 37.4550 140.2820, pins in the middle of an agricultural field 20 miles south south west of Fukushima, and about 43 miles west north west of the power plant.
The power plant is tagged at 37.421370, 141.028079 on google maps
37.2330 140.4330 turns out to be just off the runway of the Fukushima airport, and is about 31 ft from a white object which might be a Stevenson screen
Well this is going to be fun… NOT!
Looks like they have gone to a 6 digit WMO number, but with MOST of the old stations having their old 5 digit number, which can duplicate a newer station for 5 of the 6 digits…
I made the FORTRAN program to take in the inventory file; break it up into Country, WMO, LAT, LONG, ALT, NAME along with assigning the unknown 4 digits to Flag Fields; compute the COS(LAT); and spit it all out with TABs, ready for MariaDB to load. On a whim, looking at the tail of the file, I decided to check if there were duplicates of the 5 digit WMO. There are…
So… what to do… I think I’ll need a 6 digit WMO, but then strip the leading zero off the ones that match to the old V2 / V3 type…
I’m thinking “load both a 5 and 6 digit WMO” field so I can do both. Easy to create the data at this point and just stuff a field.
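Deriving those is a one-liner at load time (or in the FORTRAN pass, which is where I actually do it). A sketch against the inv4 table from earlier, with the column positions per my reading of the ID layout:

ALTER TABLE inv4
    ADD COLUMN wmo5   CHAR(5),
    ADD COLUMN wmo6   CHAR(6),
    ADD COLUMN coslat DOUBLE;

UPDATE inv4
SET wmo5   = SUBSTR(id, 7, 5),    -- old style 5 digit WMO core
    wmo6   = SUBSTR(id, 6, 6),    -- 6 digit form with the extra lead digit
    coslat = COS(RADIANS(lat));   -- handy later for area weighting

-- And the duplicate check that started all this:
SELECT wmo5, COUNT(*) AS n
FROM   inv4
GROUP  BY wmo5
HAVING COUNT(*) > 1;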
Still no idea what the XLT flags are for / mean. Still a bit bothered by the 9999.0 elevation flag…
In about an hour I ought to have the inventory data in a loadable format and into a database table. Then I can start trying to compare it to the v3 inventory of stations. I think I’m going to take a coffee break first though ;-)
@Larry L:
Thanks for looking those up! So they have the same metadata quality they always had, eh? Stations in the middle of rivers…
Nice to know the one at the airport is near the runway (‘density altitude’ determines if you can fly or crash and that depends on knowing the actual temperature at the runway… Great for aviation, not so great for “climate science’).
So we have a choice of the runway, in the river, or lost? Sounds about right…
Yeah, that looks like something I can work with. So I can load the wmo as both 5 and 6 and then figure out which is the keeper later, or does that leading digit (Flag 4) sometimes have a 6th WMO digit and sometimes have something else?
After a coffee & snack break, I’ll make the database table and load script, then “kick it around” a while to decide what’s best. Then a final round from FORTRAN to database once settled.
You will need to scroll this to the right to see it, but I have both F4 / WMO and WML (L for long ;) in the tab separated input file to the database:
And it’s loaded up…
So, in theory, 27,361 “stations”.
Here’s the structure I put it in for now (to kick it around and see where this pig squeals..)
And as I printed the fields in the same order as the table, the load command is fairly trivial:
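A minimal sketch of what that kind of load looks like (my file and table names; the real ones may differ):

LOAD DATA LOCAL INFILE 'invent4.tab'
INTO TABLE invent4
FIELDS TERMINATED BY '\t'   -- the FORTRAN program writes TAB separators
LINES  TERMINATED BY '\n';

No column list is needed, since file order and table order match.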
Once I’ve figured out the end state, I’ll make a posting with the FORTRAN and the final table layout.
For now I’m just going to use this one to explore the data.
Well, looks like I’ll be busy with this unscrambling problem for a little while:
So let’s just match the old 5 digit WMO with the 5 digit part of the new WMO and then inspect the records where they match… Oh, that would be 3350 of them…
OK… So there’s some interesting bits in there
So the old “minor number” or modification flag series have now been combined into one set of data, hiding the equipment differences that happened behind some homogenizing I presume. Then it looks like a “5” is prepended while the old way was 6 for continent and 51 for country. Now US codes for country in this European record. Wonder why it is a US station and in Europe? Maybe later… ;-) One hopes that they didn’t just reuse the station number from Europe in a US station somewhere….
That still leaves C00 flags as unexplained. I hope that the added digit (5 in this case) maps to some simple thing and isn’t a randomly arriving digit… i.e. systematic you can code, random not so much and will likely take a manual matching.
Here’s a Canada station with the same thing going on. Matching somewhere in Europe on 5 digits. But also matching on more….
That kind of gives the lie to the idea that they might have just assumed a 0 in the leading digit for the 6th digit of the v3 set. No such luck. Here we’ve got a 6 digit match, and the first record is in Europe while the second says Canada.
At this point it looks like the WMO number is no longer a usable key for matching stations. To be sure, more ‘dig here’ required to show if that’s just a small error sample or a trend, or maybe the USA and Canada have a small part of Europe ;-) (Maybe embassies? But I doubt it…)
OK, I’m going down this rabbit hole for a while. Don’t hold your breath… Needing to back-figure how someone scrambled their data and unscramble it is not my favorite puzzle… but I’ve done it before. Those years as a DBA … Data Base Administrator. Forensics can be like this too…
I’m off to ponder with a coffee cup…
Looks like another example of a splice – a “3 way” in Antarctica
That last digit of 1, 2, 3 in the first number – the old form Station ID, compared to the new station ID that’s all the same single number…
But at least for some stations the WMO matches…
Took 7 seconds shy of 3 minutes to make this report (same as above but with names added):
Some stay the same. Some change.
I think I need to translate “country” to a constant thing and then select on both WMO and country…
I’ve made a “country” table and loaded V2 and V3 into it (v4 “soon”; I was doing 2 & 3 to test on the easy ones first…). There are some differences (version is on the far right, V2 vs V3.3):
So we gained a “Territory” of France in Antarctic lands and a British overseas territory along with Slovenia in v3.3 (so wonder if Slovenia was in the data before but with the old country name?…)
Now I’m going to try loading up v4 and seeing if I can figure out how to match these guys up…
Still a bit of work to do, but getting there… I didn’t put a V4 in the version field on that set, so it is showing NULL, but I’ve got all three versions loaded and can now get down to matching who’s a what, and whether I can get a map of the country number to the country abbreviation that works for all countries in all versions.
I also need to settle on a caps choice. V4 has gone to lowercase…
Holy Fu*k E M !
I have just read all this with awe and a good dose of perplexity. All power to your arm and I hope the coffee is good & strong !
By the way, Antarctica is interesting. I think 9-10 countries made territorial claims there before the 1950’s. Then it was all put under an international treaty which ‘froze’ all territorial claims, and put them to one side in the interests of scientific research. (Seen as a problem to sort in the future!)
So there should be a lot of meteorological records for Antarctica from Australia, NZ, Chile, Argentina, UK, France, Russia, USA, China, Japan, Netherlands, etc etc… dating a long way back in time… (With the weather so freezing almost all the time, they had lots of time to do weather station records!)
https://en.wikipedia.org/wiki/Research_stations_in_Antarctica
A unified list of all of them might be useful in a climate sense – but extra work for you ! But I wonder whether they are listed as Antarctica or under the individual countries research bases.
Almost got it. Here’s the result of hand typing the country numbers into the file with the country as 2 letters. There are a few countries without a number, so it looks like the abbreviation will be the key field; then two countries had two numbers (Russia and Kazakhstan, IIRC) so that must be handled somehow.
I need to do a QA run on it and see if some place just changed the name a little. I caught several like that already. 240 rows in 0.01 seconds to report it ;-)
Here’s how it looks so far:
Yes the list of Antarctic stations is long
https://en.wikipedia.org/wiki/Research_stations_in_Antarctica
Antarctic data are reported under Antarctica as a 700 continent / country. See the 6th item listed in my just prior posting ;-)
Since V3 is a superset of V2 for countries (adds 2, IIRC) I’m just going to load it for comparison. Here’s a sample of the report I’m going through now.
Basically, for anything with a similar name, I’m looking to see if there is a chance one with no number and one with no abbreviation are really the same one… Then for any others (like Amsterdam Island) I’ll be hitting the Wiki to see if it has some other name that just happens to be listed in v4 instead…
Basically skipping all the matched sets of 2, and just picking out the leftovers to research. Then there will be a “final load” with a key field and I’ll have my Country Mapping pretty much done between the versions. There is an XX abbreviation defined as “unknown” so at the end of this I’ll likely use it for any V3 leftovers without an assignment. If any.
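That “leftovers” pass is basically an anti-join; a sketch of the sort of query that pulls them out (using my country_map sketch from earlier):

SELECT cnum, cname
FROM   country_map
WHERE  version = 'v3.3'
  AND  cnum NOT IN (SELECT cnum
                    FROM   country_map
                    WHERE  version = 'v4'
                      AND  cnum IS NOT NULL);
-- v3.3 country numbers that never got matched to a v4 FIPS abbreviation.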
E.M.Smith says: 19 March 2019 at 5:33 pm
“One wonders how much this “homogenizing” the data accounts for the loss of low excursions and the “warming” seen when the tops don’t rise but low going excursions “go away”… ”
Nail Hammer Head on hit.
They have literally removed all Weather from the Climate, all those well known weather events like Lake Cooling, Foehns etc need no longer apply to be included in this super dataset.
And guess what they hardly ever add to the Temps, they practically always reduce the past temps.
20,000 stations adjusted, just wow!
They really have made it difficult to compare V1, V2, V3 and especially V4.
Could they be using the 9999 site elevation as a code for “do not include in Global Calculations”?
@A C Osborn:
Most anything can be used as an “omit” flag, but it would be pretty obvious. IMHO there’s been much more “dumb stuff” than “malicious and deceptive” stuff. So I’d expect that for some source of records there was no elevation available and that’s what got stuck in. I mean, after all, using Google Maps is Soooooo…. hard 8-{ /sarc;
Just as an FYI to others:
I really really hate it when folks have their own “special” definition of fundamental words. Like “all”. What part of “all” was unclear? Apparently some part was…
There’s this nice feature in SQL that lets you dump things to a file. I wanted to dump my loaded country set so it would be easy to use an editor to pick out the lines that needed some R&D for a final ccode or whatever…
But it gives:
After too long a time digging around, knowing I’d done a ‘GRANT ALL’ on permissions, I ran into this factoid:
from here:
https://stackoverflow.com/questions/6091427/mysql-into-outfile-access-denied-but-my-user-has-all-access-and-the-fold
OK….
So “GRANT ALL” does not grant all, it is really “GRANT UNKNOWN SUBSET THEN GO FISH”.
Well GRANT FILE doesn’t seem to be letting me do it either, so I think I’m stuck in Security Hell and need to drop out to the security level and come back with full on root privs and oh, never mind… it will be easier to just write them down with pen and paper….
Security Paranoia can be a good thing in a full on industrial setting, but not for the Home Gamer… on a Raspberry Pi… with only one user… inside 2 layers of firewall… with public data…
I’ll revisit the whole issue of how to actually write output to a file some other day… though, IMHO, having a database system where BY DEFAULT you can’t write output reports to a file, is “broken by design” at the point of shipment. And certainly so when “GRANT ALL” permissions doesn’t grant the ability to write out reports…
Because I can paste it here (but for some reason vi doesn’t want to take the paste) I’m going to be publishing this to the world instead of putting it in a private temp file… all because MySQL/MariaDB won’t let me write a private file without pain. Gee, that “feature” sure is keeping the data from leaking out…
I’m cutting out those records that are done, only leaving those where they need a cnum but are new so there is none; and those with a number but I’ve not matched them to a current abbreviation.
One I can see right off the top is that I’ve left the Koreas in, as they have a name change to sort out. Some others I’m not sure about (I thought I’d caught “Bahamas, The” vs “The Bahamas”, so need to check that one). Others are just “WT? country is that and what was it then?”
Then the additional “countries” (many dependent islands): It looks like the “old” number system started with a continent number then just sequentially numbered around. I’m thinking I’ll start at 99 and count down so they don’t collide, but it will have a number even if no data exists for it. So, for example, some island off in the Pacific that wasn’t in the set before would be 599 and the next one 598, etc.
That just leaves problems like Russia that has two numbers but only one abbreviation. Keep that? Squash it? Hmmm…. and places like CEUTA (SPAIN) with no abbreviation – is it a name change or what? Or maybe just assign it to Spain? So some geography and history digging needed.
Anyone with ideas, toss ’em out there. There will be NO clean solution to this, only “good enough” and I’m open to suggestions and especially things like “Dog’s Breakfast Africa is now named King’s Paradise Democratic Socialist Republic Of Climate Change”…
EM, I just talked to my head DBA and he thinks the issue is here:
INTO OUTFILE ‘Country.list’
He suspects it is trying to dump that file in the home directory of mysql, so you need to do something like this, with a fully qualified path name where your user has permission to write out files:
INTO OUTFILE ‘/myhomedirectory/Country.list’
OK, I’ve put in an ascension date for Gv4, and I’ve found where North Korea was the D.R. of Korea, and I’ve gone and assigned numbers in the x99-x90 range for FIPS countries without a number (as they will not be assigning one anyway, and that is an unused range that’s easy to spot or filter if desired – so for each one I looked up what continent they were on, or near, for all those islands, and assigned one).
I assigned the 900 block to the XX code as it is unused in the Gv1-3 sets. As ships (800 in the old sets) are also unused in v4, I’ll be giving them XX as the FIPS code.
As I run into information (like if a given island WMO number and LAT / LONG match it to a different cnum in the past) I may change it if appropriate.
That just leaves 17 “question” records from V3 not in V4, or where the name has changed and I need to do some more matching. They are the ones with a blank FIPS code abbreviation. (I’ve also made cnum a key now, since all of v4 has one in my modified load file.) These are the ones where I need to find a FIPS code, assign them XX, or realize they were some other country then, have a FIPS abbreviation now, and I just didn’t make the match.
So here’s the size of the space now (I’ll leave in a couple of x99ish so you can see what they look like). The warnings on the load of v3 countries is just telling me I already claimed that cnum value with my v4 table load, which is by design.
I’ve left in “British Indian Ocean Territory” and BRITISH OVERSEAS TERRITORIES until I can figure out what is in each and are they maybe the same? It may also be that the batch of islands scattered about was in the “Overseas” bit and have gotten un-sorted…
There’s also some Zaire / Congo & All The Little Guineas sorting and validation needed… You would think that with only a few hundred countries we could give them all unique names, but Nooo…. everyone has to reuse parts…
So basically just need to figure out the FIPS abbreviation for those 17 with a blank, or decide to stick an XX in them and be done.
So “nice try”, but no…
(I’d done something like that earlier when I then had to run off and find that “GRANT ALL” does not grant ALL… As I know I did a GRANT ALL and have not done a GRANT FILE I’m fairly certain that’s the issue… but having worked around it with a cut/paste here I’m not going to go down the permission black hole today…)
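For anyone else who hits this wall: on recent MySQL / MariaDB there is a server variable, secure_file_priv, that restricts INTO OUTFILE to one directory (or disables it entirely) no matter what privileges have been GRANTed, which would explain GRANT FILE not helping either. A sketch of the checks (the user name and output directory here are illustrative):

SHOW VARIABLES LIKE 'secure_file_priv';
-- empty = write anywhere; a path = only that directory; NULL = disabled.

GRANT FILE ON *.* TO 'chiefio'@'localhost';
-- FILE is a global privilege, so it has to be granted ON *.*

SELECT abrev, cnum, cname
INTO OUTFILE '/var/lib/mysql-files/Country.list'  -- use whatever dir the variable shows
FIELDS TERMINATED BY '\t'
FROM country_map;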
One Down – from the wiki:
So it is the “Congo (Kinshasa)” now.
Looks like several of these are formally part of a major country, so likely now counted in it and under that FIPS code. Melilla, Madeira, Faroe Isl., Coral Sea Isl. I’ve checked so far, so I need to give them that country’s FIPS number.
Then “Chagos” starts to answer that British Indian territories thing. Again, from the Wiki:
So it looks like that BIOT may have been divided into a couple of parts, though that wiki just talks about Chagos / Diego Garcia too:
https://en.wikipedia.org/wiki/British_Indian_Ocean_Territory
So for now I’m just going to assume those two are an identity and match them.
E M, I have never heard any reference to “Coral Sea Islands” before, here in Oz. There are hundreds of islands in the Great Barrier Reef, but in the Coral Sea? Nah. I wonder also how Lord Howe Island and Norfolk Island stand in this set up. They are both small islands in the Tasman Sea between New Zealand & Oz but territorially part of Australia.
Looks like British Virgin Islands are technically part of Britain, so the same FIPS (and British Citizens and so also EU citizens… at least for another week ;-)
BELAU is now PALAU, so that x90s code can be replaced with the 536 code…
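That one is a simple in-place fix, something like this (a sketch against my guessed column names):

-- swap the placeholder x90s cnum for the real 536 code
UPDATE country
SET cnum = '536', cname = 'PALAU'
WHERE cname = 'BELAU';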
We’re getting there, one gratuitous change at a time…
https://en.wikipedia.org/wiki/Ascension_Island
So that looks to match the British Overseas Territory line… Except that the two have different cnum values and no FIPS codes.
The Argentine Antarctic will be in the Antarctic group somewhere, so I’ll need to match them. Then we have, from the wiki:
Which is entirely unclear. It looks sort of like their own country, but it has the French anthem, and the French were fond of just making bits of the globe ‘France’ and moving on… So that one needs a bit more.
@Bill in Oz:
Well, there is reality and then there is what NOAA / WMO do with thermometer records. My interest is in sorting the thermometer records into reasonably “like” buckets for comparison, so if I find some tropical islands now coded with Britain, but separate in the past, I’ve got to find some way to deal with that. This is complicated by there being lots of things in one set and just gone in the other…
So my goal is to get them sorted “close enough” for comparisons, but not worry too much about the details of actual ownership.
Realize that v1, 2 & 3 were geographically oriented sets (first digit = continent) while v4 is political entity oriented (that FIPS Nation Code). As nations are volatile and mutable things, I think that a mistake…
(Were I doing it, each station would get a LAT/LONG based ID assigned and that would be that… but since their LAT / LONG records have been dodgy at times, and they didn’t keep the past record of moves, the LAT / LONG now may not be what it was 40 years ago nor what it will be tomorrow; so that’s got issues too.)
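Just to make that musing concrete, here’s a sketch of what a LAT / LONG derived station ID could look like in SQL (table and column names here are pure illustration, not my actual schema):

-- pack rounded, shifted coordinates into one 10 digit "geographic ID":
-- 5 digits of shifted latitude, 5 of shifted longitude
SELECT stnID,
       CONCAT(LPAD(ROUND((latitude  +  90) * 100), 5, '0'),
              LPAD(ROUND((longitude + 180) * 100), 5, '0')) AS geoID
FROM invent4
LIMIT 5;

Of course an ID like that inherits every LAT / LONG error in the inventory, which is exactly the problem noted above.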
So what I’m doing right now is just unscrambling what can be unscrambled, and marking the rest as “a bit odd” with the x9x numbers and the XX FIPS abbreviations. That ought to leave me with at most a half dozen thermometers that could be a bit wrong in their assignment into some average (geographic region or country /political entity) and that’s not going to change things enough to matter for gross comparisons.
IF, then, that lets me unscramble the WMO changes, I can get to One Pure Number per station and then make another pass through the data structure with that for the final preening. Or not, if it doesn’t… and just use it as “close enough”…
Clearly the folks managing the thermometers and the temperature record “have issues” with not understanding databases, data structure, archiving, etc. etc. Way too much gratuitous change going on, and it smells greatly of “How can I justify next year’s ‘pork allotment’, er, budget increase, if I don’t change things?”

OK, in theory I’m ready to run with this one. The Region (or continent) field is still blank for most of the entries (but I’m pretty sure I can fix that pretty quickly and it isn’t needed yet).
It looks like about 14 lines are left with some kind of XX flag, record override, or similar issue to think about, but it ought to be good enough to use for getting the basic shakedown of the rest of the Gv4 load done and “first fire” evaluated.
So with that, I’m done on the country stuff for a while and next posting ought to be about temperatures and thermometers by LAT LONG and such.
Look it over. If you see anything daft or wrong in it, holler. I “moved fast” on this one and may have made assumptions that are not valid ;-)
The first three things are just showing a new build of the country table and loading it: first with the v4 data (that has had the cnum manually typed into the load file for those records), then with the v3.3 data (where it can not override the v4 data, only add the records that are not in v4 / where they don’t match). Those ought to be the ones with XX abbreviations.
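For reference, the shape of that process as a sketch (my guessed table layout and file names, not the literal program): v4 loads first and claims its cnum keys, then v3.3 loads with IGNORE so a duplicate cnum is skipped with a warning instead of overwriting:

CREATE TABLE country (
    cnum   CHAR(3) NOT NULL,
    abrev  CHAR(2),
    region CHAR(1),
    cname  CHAR(60),   -- docs say 56; bumped to 60 after a truncation warning (see below)
    PRIMARY KEY (cnum)
);

-- v4 first; its rows win any cnum collision
LOAD DATA LOCAL INFILE '/myhomedirectory/v4.country.tab'
    INTO TABLE country (cnum, abrev, cname);

-- v3.3 second; IGNORE means duplicate keys just warn and skip
-- (the "Skipped: N" in the load summary counts those collisions)
LOAD DATA LOCAL INFILE '/myhomedirectory/v3.country.tab'
    IGNORE INTO TABLE country (cnum, cname);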
This looks pretty comprehensive EM.
And as fortune has it, there have been no major changes of territorial boundaries since ~ 1945.
Going through it I noticed one oddity though.
The Channel Islands off the coast of Normandy are not listed. (A Crown territory which is technically not part of the UK – no MPs in Parliament, for example. A weird British thing!) But Guernsey Island, which is a part of the Channel Islands, is listed. But that means that Jersey, which is bigger & has far more people, is omitted. Small but puzzling.
A PS re Panama: how is data from the Panama Canal territory, formerly controlled by the USA, treated? Now it is all part of Panama. But prior to the 1990s, was weather data from this area treated as part of the USA set?
That is odd… I know Jersey is in the initial data load from GHCNv4…
So it ought to have survived the install of GHCN v3.3….
As there is nothing there to overwrite it… (and v4 ought to dominate anyway)…
Ah, there’s my mistake. Assigned the same number to two places:
So Isle of Man beat Jersey (and it wasn’t even football season ;-)
Thanks for the catch! I think I’ll do a report of how many of what exist…
I’ll bet it’s the “Skipped 2” part of the messages I ignored and that I’ve got another one:
Well, I changed 698 to 697 and tried again, and got the same warning count for the same reason ;-) (AND I was certain I checked that 697 wasn’t in use, but maybe the find needed to wrap back to the top first…)
Yeah, I know, I need to add region to all the v4 stuff. That is what I was going to do when I sat down…
Then it looks like I have two entries with duplicate primary keys. 113? Hmmm…
Well that one wasn’t one I assigned. Wonder if I typo’d it… I’d not be surprised at one mis-key out of a couple of hundred…
OK, so I moved Jersey right on top of Jan Mayen…. Got it.
Note To Self: Do “show warnings” after running an SQL program….
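In practice that’s just the two statements below, run right after the load (697 / 698 being the suspect cnum values from the collision above):

SHOW WARNINGS;   -- lists each 'Duplicate entry ... for key PRIMARY' skip

-- then see who actually holds a suspect key:
SELECT * FROM country WHERE cnum IN ('697', '698');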
Once again, thanks for the eyeballs / QA catch!
UPDATE:
Looks like Niger is supposed to be 133 and I doubled the wrong digit…
So I’ve got two fixes and a re-run, then back to QA land…
And that warning is for:
Where the input file has “Kingdom”… so 4 more letters. I’m pretty sure the docs said 56 CHAR (it isn’t the kind of number I would make up…) but I can bump it to 60 and not lose any sleep…
I’d not be at all surprised to find out either they didn’t follow their own docs, or that the field “just grew”…
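The bump itself is a one-liner; a sketch using my guessed names for the table and its name column:

-- widen the name field from the documented 56 to 60 characters
ALTER TABLE country MODIFY cname CHAR(60);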
Well, to convert the v4 data to be full of tabs and ready to load:
7 1/2 minutes… Not bad for the Little Pi That Could ;-)
Since the way I went at this is a bit on the simple-but-daft side, it duplicates the basic station and reading data 12 times for each year, making one month-line of data to load. That blows up the size of the input to about 1.2 GB from about 162 MB…
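For the curious, the same unpivot could be done inside MySQL instead of with an external conversion script. This is an alternative sketch, not what I actually ran; table and column names are invented, and the field offsets follow the fixed-width layout in the readme:

-- staging table: one raw fixed-width .dat record per row
CREATE TABLE stage ( line CHAR(120) );

-- the .dat has no tab characters, so each whole line lands in 'line'
LOAD DATA LOCAL INFILE 'ghcnm.tavg.v4.0.0.20190317.qcu.dat'
    INTO TABLE stage (line);

-- one INSERT per month, January shown; VALUEn starts at column
-- 20 + (month - 1) * 8, is 5 characters wide, in hundredths of a
-- degree C (so the -9999 missing flag becomes -99.99)
INSERT INTO temps4 (stnID, abrev, year, type, month, deg_C)
SELECT SUBSTRING(line,  1, 11),
       SUBSTRING(line,  1,  2),
       SUBSTRING(line, 12,  4),
       SUBSTRING(line, 16,  4),
       'JAN',
       CAST(TRIM(SUBSTRING(line, 20, 5)) AS DECIMAL(6, 2)) / 100
FROM stage;

Done that way it trades the 1.2 GB intermediate file for 12 passes over the staging table; on a Pi it is likely I/O bound either way.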
Then again, I’m doing this on a 500 GB disk that’s the smallest one I’ve got and it has 18% used… so it isn’t like I care about using the bits… Then, when the load is done, the database has very little redundancy in it. It is on a 1 TB disk so really doesn’t care ;-)
Being on 2 different disks means the “loading” process can just linearly read one and write into the other without a lot of head seeks. Still, at USB 2.0 speed and on a Pi M3 I’m likely pushing it. We’ll see when the load finishes. Then I’ll do a report or two and we’ll find out if this equipment is “enough machine” or do I need to go upscale to a larger SBC.
OTOH, it’s kind of fun to tweak the nose of all the “Must Have New Supercomputer!!!” guys by doing things on minimal hardware ;-)
When I first ported GIStemp, I deliberately used an old “White Box PC” (with a Pentium class CPU IIRC) that had started life as a 486 and still has the 5 1/4 inch floppy drive. Later I found out some folks were snickering about the crappy hardware… completely missing the joke on them of “if it runs on an old WB PC, why do you need a $50 Million Super Computer, eh?” Oh well, personal jokes are for your own enjoyment, not that of others…
But the truth is that I’ve always been a minimalist. As a Database Consultant my specialty was efficiency reviews precisely because I like running in the minimal resources possible. So there’s a certain Haiku kind of charm / challenge in doing this on a system anyone in the world could buy. Who knows, maybe I’ll inspire some kid in Congo to get into this Climate Skeptic Kitchen Science thing ;-) I’d like that…
Aaaannnd… I knew if I typed long enough it would finish!
So 22 minutes to load the temperature data table…
Sporadic issues with the “source” column. OK, I’ll have to check on that. (May mean another data load… back when a load took just a few minutes that wasn’t an issue; at max 3 / hour, that’s a bother.) As “source” is the very last column, that could indicate a length issue on some records, or just an alignment issue. “We’ll see” with some reports & checks.
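A quick way to size up how widespread the “source” damage is (a sketch; the table and column names are my stand-ins):

-- distribution of what actually landed in the last column
SELECT source, COUNT(*) AS n
FROM temps4
GROUP BY source
ORDER BY n DESC;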
Looks like most of it is fine. I did a spot check on “Albania” (figuring they might not have too much data…) and got over 3000 rows:
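The spot check was along these lines (again with my stand-in names; v4 station IDs begin with the 2 character country abbreviation):

SELECT COUNT(*) FROM temps4 WHERE abrev = 'AL';
SELECT * FROM temps4 WHERE abrev = 'AL' LIMIT 10;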
Station ID looks right. AL country abbreviation is right. WMO picked out the 5 used digits. Then year is clearly properly aligned. So are the type of data, the version, month, and the floating point temperature as C (even the -99.99 missing data flags have their digits right).
OK, I can live with that. I’m going to proceed with using this load to test / devo / debug some of the same reports I did on v3.3 while I look to stamp out those warnings. Since I don’t use the source flag for anything, if it is missing a beat I won’t care. It does repeat for each month, so hopefully I’ve not got any alignment issue. (Comparing some known temperatures will tell).
For now, I’m happy. I’ll likely add an index on that Abbreviation field as it will be used in searches and selections. Still at a bit under 2 minutes to work the whole data set, not too bad.
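The index I have in mind is a one-liner (sketch, my names again):

CREATE INDEX abrev_idx ON temps4 (abrev);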
So “first fire” on a Version 4 temperature data report looks good and done!
Whoot!!
(And yeah, I kinda like Albania… in a Potsylvania kind of way ;-)
It would seem that if you suck a 1.2 GB file into VI to look at the data and see where it’s not aligned right, on a Pi with 1 GB of memory, you get 600+ MB of swap used… and, a few minutes later, your keyboard and mouse respond once again ;-)
Most I’ve ever seen the Pi use…
I have a habit, of many years’ standing, of putting a swap partition on all my disks equal to memory size. As I have both 1 and 2 GB SBCs, some are 1, some are 2. In this case, I have both a 1 and a 2 on the same disk. That’s my system disk AND my home directory. The two swaps are on opposite ends of the disk. I change which is high vs low priority depending on whether disk activity will be nearest the system area or my home directory area… Minimize long head seeks during swap…. The other disk is the one with the database on it. So right now priority is set to mostly use the swap partition near where I do all that input data munging, and not go to the other end of the disk unless desperate. Avoiding the DB disk, but using it if I run over 2 GB of swap (that ought never to happen)….
But yeah, a R. Pi using 1/2 GB+ of swap ;-) Now that’s fun ;-)
I am Delighted that this is being done on the Ras Pi 3 with Devuan2 by an old guy at his kitchen table! 8-) HURRAH! Quite a feat, Maestro. Brains and experience trump money and official connections. Maybe some day real science advancement will again be the purview of Amateurs… pg…
@P.G.:
Glad I can be the bringer of delight! While it is a minor purpose (meaning I’d toss a pile of cash at this if that was what it required as I think it is that important) one of my guiding lights is that whole “minimalist” thing. I just love using small hardware to do big jobs.
I think it is the Amish roots. When I was 8, my Dad had us tear down and rebuild the “garage”. It had been a horse & buggy barn at the start of its life, and used rough-sawn REAL 2 x 4s and square nails. We took it apart, poured a cement floor, and then put it back together. One of my jobs was to straighten the square nails to reuse them…. All the lumber was stacked and then bad bits replaced, but mostly going back where it came from.
I learned a lot about minimal resources…
Something about the Great Depression and all got soaked into my bones…
So, like everyone, I covet the latest, greatest, fastest Whizzz BANG!… but what really makes me satisfied is doing the most with as close to nothing as possible ;-) Like cooking a meal for 4 at 50 ¢ / person using a camp stove and 2 pots…. Just heaven ;-) Yeah, I’m that kind of strange… I’m happiest cooking a full dinner in a Dutch oven over charcoal on the patio with vegetables from the garden. What was bought? A chunk of meat. Fuel being tree trimmings…
Dinner tonight was what is left when the roast chicken can give no more sandwiches…. So pan drippings, bones, and maybe 1/2 pound of meat bits. Add a big onion, a few carrots, about 5 inches of celery (leaf end that many folks toss), salt, pepper, handful of lentils, sprinkle of French Seasoning mix and garlic… then a couple of bouillon cubes and in the last 10 minutes some “soup stars”. All up I doubt I’ve got $2 in it. We’ll have about 6 meals of a fine chicken and vegetable stew… I like it more than dinner out.
Saw an interesting YouTube last night. Minimal “Small Home” made in a 6 x 8 “cargo” trailer. Found myself wanting one… Camp stove, miniature fridge, built in bed, TV on the wall. All up the guy had about $4000 in it built from new. It looked “just enough” and no more. Light enough I think the Subaru could tow it. Now if only I went anywhere that justified it ;-)
There’s something about “Just enough and no more” that is curiously attractive… An optimization point.
FWIW, looking at the lines that gave a warning, they look to have a \ in the “source” position perhaps as an extra value.
I’m pretty sure this is a “Dirty Data” issue. I have no use for a backslash in a field of “source” so I don’t mind it got truncated. I may “fix it” later, but not on any kind of priority…
Pingback: GHCN v4 – First Graphs | Musings from the Chiefio
Just saw this thread. I did some occasional voluntary work (2 years) on v4 data, sending them corrections to erroneous data. I stopped when they went to beta 2 and suddenly added 1338 stations and removed 638 others which didn’t look bad to me. Got disgusted.
Here is a really ugly one. It is co-located with three other IDs with entirely good data in the 1973 to 1993 period. This is because they merged in data from a couple of thousand km north.
CA006092920 45.8833 -82.5667 189 GORE_BAY_CLIMATE
Told them about it and a couple of other similar stations, and they told me they have confidence their adjustments would take care of it. Of course, due to the extreme latitude difference, the shape of the annual temperature cycle is much different for about the first 20 years than for the rest of the record, even after adjustment, so it screws up the monthly anomalies.
There is still a lot of erroneous data in the dbase, e.g. the very first ID in the inventory is incorrect because the data comes from Sweden and not Antigua and Barbuda. I told them 1-2 years ago, but it is still there. I bitched about them using only exact matches for the qc ‘E’ flag, because they couldn’t even copy the data from the original data files into v4 without having occasional differences of 1-2 hundredths. They then expanded their duplicate range to ±0.015, but it still doesn’t nearly account for all cross-station duplicates. Here are a couple of stations which should have been flagged in many years. Different math usage gave them different results.
ASN00004028 -20.1078 119.5811 9 PARDOO_STATION
ASN00004032 -20.3725 118.6317 6.4 PORT_HEDLAND_AIRPORT
An elevation of 9999.0 means “missing”.
The 3rd character of the ID is a particular network within the country, if not zero.
If ID chars 3-5 are XLT, it is a record assembled from found data. Don’t know whether the rest of the number is newly minted or if they have an actual reference. A lot of Japan is XLT, with many duplicates.
I’ll try to get back later today or tomorrow. If you have any questions I might be able to answer them.
From the above post by Bob Koss: “so it screws up the monthly anomalies.” It would be interesting to know in which direction. I hope “screws up” was just a figure of speech, though my cynical side says in reality it probably isn’t.
Pingback: GHCN v3.3 vs v4 – Top Level Entry Point | Musings from the Chiefio