I’ve recently made some changes to the GHCN Analysis database for Version Anomaly comparisons.
This involves some changes to the data formats in some of the database tables, along with some changes in the data load scripts. I have not changed the original FORTRAN program that puts tabs into the data for initial loading. Eventually that Fortran will replace the “\” in some of the data (marker for “source”) with some other character. For now, that change was made via the editor (notes in the prior posting).
So this is just going to be a set of program / schema listings, and a very tiny bit of commentary / documentation.
The Database Creation Steps
Starting from a pristine install of MariaDB (MySQL), with a ‘temps’ database defined (but no tables loaded), it is good practice to do a “drop table” first just to be sure there are no “left overs” from prior runs. This is also done on a “fresh load”. I have a script that does that for me.
These scripts are pulled into MariaDB via the “source” command that says “go read this external file and do what it says”. There is also the ability to store such programs inside the database itself, but I’m not doing that (yet). Why? First off, this works and I didn’t feel it was important enough to spend time “improving” it, so have not bothered to learn how to do “stored programs” inside the database. 2nd, as I’m in “development mode” I may need to just “dump the whole database and start over”. That’s easier if your programs are not stored inside it.
This means I have “hard coded path names” in these scripts. Those would need changing for any other system. As there are only a few scripts, it isn’t that big a workload at present. Eventually I’ll get around to doing stored programs and path names as passed parameters, but that’s usually during a “polish and package” step at the end of “development”.
Drop The Tables
In my home directory is a directory named “SQL” where I have all my database ‘stuff’, divided into table related things (in a sub-directory named ‘tables’) and programs that do something (in a sub-directory named ‘bin’). The Unix / Linux command “cat” is to “concatenate and print” but with just one file name, just prints out the file. So “cat dropall” is printing out what is in the file “dropall” in my SQL/tables directory.
chiefio@PiM3Devuan2:~/SQL/tables$ cat dropall DROP TABLE continent; DROP TABLE country; DROP TABLE invent3; DROP TABLE invent4; DROP TABLE temps3; DROP TABLE temps4; DROP TABLE mstats3; DROP TABLE mstats4; DROP TABLE anom3; DROP TABLE anom4; DROP TABLE yrcastats;
Load The Table Descriptions
Then you get to load all the table schema:
chiefio@PiM3Devuan2:~/SQL/tables$ cat tables source /SG500/xfs/chiefio/SQL/tables/continent source /SG500/xfs/chiefio/SQL/tables/country source /SG500/xfs/chiefio/SQL/tables/invent3 source /SG500/xfs/chiefio/SQL/tables/invent4 source /SG500/xfs/chiefio/SQL/tables/temps3 source /SG500/xfs/chiefio/SQL/tables/temps4 source /SG500/xfs/chiefio/SQL/tables/mstats3 source /SG500/xfs/chiefio/SQL/tables/mstats4 source /SG500/xfs/chiefio/SQL/tables/anom3 source /SG500/xfs/chiefio/SQL/tables/anom4 source /SG500/xfs/chiefio/SQL/tables/yrcastats
Of course, in any version on your machine you would change “/SG500/xfs/chiefio/SQL/tables” to whatever path name you have used to store the schema files. You could also just type all these commands by hand into MariaDB, but I’m lazy about typing so do things to make that as little as possible and automate it ;-)
This pulls in a “schema” that describes each table, and builds the empty table. (We load it further down). This just defines what stuff will be stored in the table, in what format, and is it indexed or not.
A couple of words about each table:
This is in no way optimized nor normalized. It is a “just growed” system where each design choice was made on top of the prior choices, best or not. This is normal. Once a system works and does what you desire, IF it is worth it, you go back and re-design it with all your lesson learned included from the start.
I’m NOT normalizing the tables as I have specific goals I’m trying to achieve. One of them is easy reporting Python programs with embedded SQL (so long complicated SQL doing lots of joins is an issue). I also fully expect to just load the data and ‘flow’ it all the way through. NOT doing a lot of “updates” to any given step. Normalized data is best when the average use isn’t known (so you can’t optimize for it) and when there will be updates done (so redundant data means multiple update points and potential errors / discrepancies). As neither of those apply, I’m doing more “for this purpose” design and “reload from scratch” if it ever changes.
Loading The Tables
I have a script that does the load and making the indexes for all these tables in one command. That’s useful on major changes, or when moving to a new platform. Here it is:
chiefio@PiM3Devuan2:~/SQL/bin$ cat LoadTables source /SG500/xfs/chiefio/SQL/bin/LOAD/Lcontinent source /SG500/xfs/chiefio/SQL/bin/LOAD/Lcountry source /SG500/xfs/chiefio/SQL/tables/mkindexcn source /SG500/xfs/chiefio/SQL/bin/LOAD/Linv3 source /SG500/xfs/chiefio/SQL/bin/LOAD/Linv4 source /SG500/xfs/chiefio/SQL/bin/LOAD/Ltemps3 source /SG500/xfs/chiefio/SQL/tables/mkindex3 source /SG500/xfs/chiefio/SQL/bin/LOAD/Ltemps4 source /SG500/xfs/chiefio/SQL/tables/mkindex4 source /SG500/xfs/chiefio/SQL/bin/LOAD/Lmstats3 source /SG500/xfs/chiefio/SQL/bin/LOAD/Lmstats4 source /SG500/xfs/chiefio/SQL/bin/LOAD/Lanom3 source /SG500/xfs/chiefio/SQL/tables/mkanomindex3 source /SG500/xfs/chiefio/SQL/bin/LOAD/Lanom4 source /SG500/xfs/chiefio/SQL/tables/mkanomindex4
Note it does not load the yrcastats table as that takes 4 hours on the Pi and I want to know all this worked right before I launch it. Also note that the making of indexes is via tings in files that start with ‘mk’ while the actual loading of data has names that start with ‘L’. I treat the making of an index as a table issue, so the mk bits are stored with the table schema in SQL/tables.
This is just the list of what “region” number has what continent name.
chiefio@PiM3Devuan2:~/SQL/tables$ cat continent CREATE TABLE continent ( version CHAR(5) NOT NULL, ascension CHAR (10), region CHAR(1) NOT NULL, region_name CHAR(25) ) ;
Version and ascension are so that I can document if it changes from version to version. Likely not needed for this table, but we’ll see it in many others too. (I might change this in the future if it doesn’t work as well in practice as desired; I’ve already got some tables by version instead of just using the version field).
Version is v3.3 or v4 or eventually v2 and v1 of GHCN.
Ascension is just what date is this particular copy, for the eventual comparison of different copies of, say v2 saved at different times.
I’m not yet doing much with either of those two fields and they are likely irrelevant to this table.
Region is a single number digit. That’s what GHCN prior to v4 uses. I associate it with a name (the next field).
Region_name is just the name of the continent or region. Mostly this is continents, but GHCN makes a tepid attempt to have some ocean data in it, so “Ships” is one of the “regions”. As the vast bulk of all the data are from continents, I use the two words interchangeably. But do note that “Ocean Ships” and “All those Pacific Islands” are not exactly continents and a purist ought to only say “regions”.
Here’s the load script. Note it is just a brute force “stuff in a value”. All my “load tables” scripts / programs are in a directory named “LOAD” inside the ‘bin’ directory.
chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Lcontinent INSERT INTO continent VALUES ('Gv3','7Sept2015','1','Africa'); show warnings; INSERT INTO continent VALUES ('Gv3','7Sept2015','2','Asia'); show warnings; INSERT INTO continent VALUES ('Gv3','7Sept2015','3','South America'); show warnings; INSERT INTO continent VALUES ('Gv3','7Sept2015','4','North America'); show warnings; INSERT INTO continent VALUES ('Gv3','7Sept2015','5','Australia Pacific Islands'); show warnings; INSERT INTO continent VALUES ('Gv3','7Sept2015','6','Europe'); show warnings; INSERT INTO continent VALUES ('Gv3','7Sept2015','7','Antarctica'); show warnings; INSERT INTO continent VALUES ('Gv3','7Sept2015','8','Ship Stations Ocean'); show warnings;
My original data load was from a v3 version so I manually coded the version and date of that copy. As these have not changed over the years, it is likely irrelevant and ought to be removed. The only important bit is the association of the region number (1-8) with the continent name or “ships”.
OTOH, polishing this particular stone doesn’t buy much, so it is as it is.
In order to match the country number from v3.3 to the country abbreviation used in v4, I had to make a tab separated file of country numbers, abbreviations, and names. Yes, abut 240 lines of it. This table holds that data in the database loaded from that file:
chiefio@PiM3Devuan2:~/SQL/tables$ cat country CREATE TABLE country ( cnum CHAR(3) NOT NULL, abrev CHAR(2), region CHAR(1) NOT NULL, cname CHAR(60), version CHAR(5), ascension CHAR(10), PRIMARY KEY(cnum) ) ;
And yes, *Nix folks make endless small jokes about things like “this is cat country” or naming files like “kibble’ so you can have “cat kibble”. Things bored programmers do while writing accounting software ;-) Remember to be wary of files with names like “balls” and “brains” as not thinking and saying to your assistant “I think brain is too small” or “We need to remove brains” can result in your stepping in the joke goo… Watch too for “poo” as “cat poo” is, um, an “issue”.
Note at the bottom of the list of fields is “PRIMARY KEY(cnum)”. That says that the unique field in this table is a 3 digit value (cnum CHAR(3) up the list). That number is used by v3.3 and earlier as part of the composite station identity number. Version 4 uses a 2 letter abbreviation. You can key off either of them, but cnum will be static now that it is historical. It could also be an integer instead of character as it is a 3 digit int but since it isn’t used in any math, I’m just making it characters.
Region (or continent) is a single digit (stored as a character). It is also identical to the first digit of cnum so is redundant, but useful for doing a continent lookup to get the name.
Then cname is just the name of the country. You might think 60 characters is way too many for a country name, but not to GHCN…
Again we have the version and ascension fields. Countries do change over time, and this allows for it, but I’ve really not implemented that function yet; and my never. In fact, I had to resort to a hand edit of the v4 country file to get this to work at all, so it may make sense to just dump version and ascension some day.
Here we load the country data. Notice it is ONLY loading the ‘v4/country-tabs’ file. I have a v3.3/country-tabs file and had intended to “do the matching’ in the database, but that “had issues”, so just plugged the v3 (and prior) cnums into the v4 file and did the matching by hand, then loaded it into this table.
Yes, yet another bit of historical stone that ought to be polished more…
chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Lcountry LOAD DATA LOCAL INFILE '/SG500/xfs/chiefio/SQL/v4/country-tabs' INTO TABLE country (region,cnum,abrev,cname) SET version="Gv4",ascension="17MAR2019"; show warnings;
You can see that I just hard code the version and ascension values to the v4 ones. Like I said, that likely all needs to just “go away”.
I add an index field (in addition to the Primary Key that is an index by default) so that various searches and table JOIN operations will go much faster:
chiefio@PiM3Devuan2:~/SQL/tables$ cat mkcnindex CREATE INDEX abrevx ON country (abrev);
Just in case anyone wants to run a copy of this themselves, here’s the 240 lines of the country-tabs file. Others can just scroll over it for the rest of the tables and load programs. OTOH, it’s a interesting list of all the countries…
chiefio@PiM3Devuan2:~/SQL/v4$ cat country-tabs 4 426 AC Antigua and Barbuda 2 230 AE United Arab Emirates 2 201 AF Afghanistan 1 101 AG Algeria 6 604 AJ Azerbaijan 6 601 AL Albania 6 602 AM Armenia 1 102 AO Angola 5 521 AQ American Samoa [United States] 3 301 AR Argentina 5 501 AS Australia 6 603 AU Austria 7 700 AY Antarctica 2 202 BA Bahrain 4 401 BB Barbados 1 104 BC Botswana 4 427 BD Bermuda [United Kingdom] 6 606 BE Belgium 4 423 BF Bahamas, The 2 203 BG Bangladesh 4 402 BH Belize 6 607 BK Bosnia and Herzegovina 3 302 BL Bolivia 2 216 BM Burma 1 103 BN Benin 6 605 BO Belarus 5 512 BP Solomon Islands 3 303 BR Brazil 6 608 BU Bulgaria 5 522 BX Brunei 1 106 BY Burundi 4 403 CA Canada 2 204 CB Cambodia 1 110 CD Chad 2 224 CE Sri Lanka 1 112 CF Congo (Brazzaville) 1 154 CG Congo (Kinshasa) 2 205 CH China 3 304 CI Chile 4 429 CJ Cayman Islands [United Kingdom] 5 524 CK Cocos (Keeling) Islands [Australia] 1 107 CM Cameroon 1 111 CN Comoros 3 305 CO Colombia 5 535 CQ Northern Mariana Islands [United States] 4 405 CS Costa Rica 1 109 CT Central African Republic 3 406 CU Cuba 1 108 CV Cape Verde 5 525 CW Cook Islands [New Zealand] 6 610 CY Cyprus 6 612 DA Denmark 1 114 DJ Dijibouti 4 430 DO Dominica 4 407 DR Dominican Republic 3 306 EC Ecuador 1 115 EG Egypt 6 621 EI Ireland 1 199 EK Equatorial Guinea 6 613 EN Estonia 1 116 ER Eritrea 4 408 ES El Salvador 1 117 ET Ethiopia 1 198 EU Europa Island [France] 6 611 EZ Czech Republic 3 315 FG French Guiana [France] 6 614 FI Finland 5 502 FJ Fiji 3 316 FK Falkland Islands (Islas Malvinas) [United Kingdom] 5 527 FM Federated States of Micronesia 5 528 FP French Polynesia 6 615 FR France 1 143 FS French Southern and Antarctic Lands [France] 1 150 GA Gambia, The 1 118 GB Gabon 6 616 GG Georgia 1 119 GH Ghana 6 653 GI Gibraltar [United Kingdom] 4 409 GJ Grenada 6 699 GK Guernsey 4 431 GL Greenland [Denmark] 6 617 GM Germany 4 432 GP Guadeloupe [France] 5 529 GQ Guam [United States] 6 618 GR Greece 4 410 GT Guatemala 1 120 GV Guinea 3 307 GY Guyana 4 411 HA Haiti 2 299 HK Hong Kong 4 412 HO Honduras 6 609 HR Croatia 6 619 HU Hungary 6 620 IC Iceland 5 503 ID Indonesia 6 698 IM Isle of Man 2 207 IN India 1 161 IO British Indian Ocean Territory [United Kingdom] 2 208 IR Iran 6 622 IS Israel 6 623 IT Italy 1 113 IV Cote D'Ivoire 2 209 IZ Iraq 2 210 JA Japan 6 693 JE Jersey 4 413 JM Jamaica 6 697 JN Jan Mayen [Norway] 6 624 JO Jordan 5 530 JQ Johnston Atoll [United States] 1 197 JU Juan De Nova Island [France] 1 122 KE Kenya 2 213 KG Kyrgyzstan 2 206 KN Korea, North 5 504 KR Kiribati 2 221 KS Korea, South 5 523 KT Christmas Island [Australia] 2 212 KU Kuwait 6 625 K! Kazakhstan 2 211 KZ Kazakhstan 2 214 LA Laos 6 627 LE Lebanon 6 626 LG Latvia 6 628 LH Lithuania 1 123 LI Liberia 6 641 LO Slovakia 5 599 LQ Palmyra Atoll [United States] 6 696 LS Liechtenstein 1 162 LT Lesotho 6 629 LU Luxembourg 1 124 LY Libya 1 125 MA Madagascar 4 433 MB Martinique [France] 2 234 MC Macau S.A.R 6 631 MD Moldova 1 163 MF Mayotte [France] 2 215 MG Mongolia 6 695 MH Montserrat 1 126 MI Malawi 6 632 MJ Montenegro 6 648 MK Macedonia 1 127 ML Mali 1 130 MO Morocco 1 129 MP Mauritius 5 598 MQ Midway Islands [United States} 1 128 MR Mauritania 6 630 MT Malta 2 218 MU Oman 2 235 MV Maldives 4 414 MX Mexico 5 505 MY Malaysia 1 131 MZ Mozambique 5 532 NC New Caledonia [France] 5 533 NE Niue [New Zealand] 5 534 NF Norfolk Island [Australia] 1 133 NG Niger 5 520 NH Vanuatu 1 134 NI Nigeria 6 633 NL Netherlands 6 634 NO Norway 2 217 NP Nepal 5 506 NR Nauru 3 312 NS Suriname 4 434 NT Netherlands Antilles [Netherlands] 4 415 NU Nicaragua 5 507 NZ New Zealand 3 308 PA Paraguay 5 537 PC Pitcairn Islands [United Kingdom] 3 309 PE Peru 2 219 PK Pakistan 6 635 PL Poland 4 416 PM Panama 6 636 PO Portugal 5 508 PP Papua New Guinea 5 536 PS Palau 1 121 PU Guinea-Bissau 2 220 QA Qatar 1 165 RE Reunion [France] 6 639 RI Serbia 5 531 RM Marshall Islands 6 637 RO Romania 5 509 RP Philippines 4 435 RQ Puerto Rico [United States] 2 222 RS Russia 6 638 R! Russia 1 166 RW Rwanda 2 223 SA Saudi Arabia 4 438 SB Saint Pierre and Miquelon [France] 4 417 SC Saint Kitts and Nevis 1 138 SE Seychelles 1 141 SF South Africa 1 137 SG Senegal 1 196 SH Saint Helena [United Kingdom] 6 642 SI Slovenia 1 139 SL Sierra Leone 5 511 SN Singapore 1 140 SO Somalia 6 643 SP Spain 4 436 ST Saint Lucia 1 148 SU Sudan 6 694 SV Svalbard [Norway] 6 645 SW Sweden 3 317 SX South Georgia & the South Sandwich Islands [United Kingdom] 6 647 SY Syria 6 646 SZ Switzerland 4 424 TD Trinidad and Tobago 1 168 TE Tromelin Island [France] 2 228 TH Thailand 2 227 TI Tajikistan 5 538 TL Tokelau [New Zealand] 5 517 TN Tonga 1 151 TO Togo 1 136 TP Sao Tome and Principe 1 152 TS Tunisia 5 597 TT Timor-Leste 6 649 TU Turkey 5 518 TV Tuvalu 2 236 TW Taiwan 2 229 TX Turkmenistan 1 149 TZ Tanzania 1 153 UG Uganda 6 651 UK United Kingdom 6 650 UP Ukraine 4 425 US United States 1 105 UV Burkina Faso 3 313 UY Uruguay 2 231 UZ Uzbekistan 4 437 VC Saint Vincent and the Grenadines 3 314 VE Venezuela 2 232 VM Vietnam 4 440 VQ Virgin Islands [United States] 1 132 WA Namibia 5 540 WF Wallis and Futuna [France] 1 169 WI Western Sahara 5 539 WQ Wake Island [United States] 5 541 WS Samoa 1 167 WZ Swaziland 9 900 XX Unknown 2 233 YM Yemen 1 155 ZA Zambia 1 156 ZI Zimbabwe chiefio@PiM3Devuan2:~/SQL/v4$
The Instrument Inventory
Each version has a different set of instruments. How on God’s Earth they think they can do global Calorimetry with a constantly shifting foundation of instruments is beyond me. One of the things MOST drummed into students in my Chemistry classes was that IF you fool around with the thermometers your calorimmetery WILL BE WRONG PERIOD. “Don’t even touch the thermometers!”. Yet the Climate Folks are constantly touching, moving, replacing, and molesting their thermometers. Maybe they never did calorimetry in chem class. Maybe they never did chemistry?
So these tables track what instruments are in which version. Notice that at this point I’ve divided the versions into two tables, giving up on the idea of one table and using a version field in processing them. “Version” is now just a documentation field… Partly (mostly) this is due to the actual data offered in each version about the instruments being so entirely different.
chiefio@PiM3Devuan2:~/SQL/tables$ cat invent3 CREATE TABLE invent3 ( stnID CHAR(11) NOT NULL, version CHAR(5) NOT NULL, ascension CHAR(10), type CHAR(4), region CHAR(1) NOT NULL, cnum CHAR(3) NOT NULL, wmo CHAR(5) NOT NULL, wno_near CHAR(3) NOT NULL, latitude CHAR(8), longitude CHAR(9), stn_elev CHAR(6), name CHAR(30) NOT NULL, grid_elev CHAR(4), pop_class CHAR(1), pop_size CHAR(5), topo_type CHAR(2), veg CHAR(2), prox_water CHAR(2), dist_water CHAR(2), airport CHAR(1), dist_A_to_U CHAR(2), veg_grid CHAR(16), pop_class_nitelites CHAR(1), coslong CHAR(9), PRIMARY KEY (stnID) ) ;
Loading it is nearly trivial. Since I laid out the table fields in the same order as the file, you just say load this into that.:
chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Linv3 LOAD DATA LOCAL INFILE '/SG500/xfs/chiefio/SQL/v3/3invent.out' INTO TABLE invent3; show warnings;
The “magic sauce’ here is in the creation of the inventory file to be loaded. That FORTRAN tab inserting program has not changed from the first posting.
Since those fields are documented in the GHCN description of the inventory, I’m not going to repeat it here. Just notice how different is the data available in v4, how reduced:
chiefio@PiM3Devuan2:~/SQL/tables$ cat invent4 CREATE TABLE invent4 ( stnID CHAR(11) NOT NULL, abrev CHAR(2) NOT NULL, f1 CHAR(1), f2 CHAR(1), f3 CHAR(1), f4 CHAR(1), wmo CHAR(5) NOT NULL, wnl CHAR(6) NOT NULL, latitude CHAR(8), longitude CHAR(9), stn_elev CHAR(6), name CHAR(30) NOT NULL, version CHAR(5) NOT NULL, ascension CHAR(10), type CHAR(4), coslong CHAR(9), PRIMARY KEY (stnID) ) ;
I’ve added the computed field “coslong” to enable easy plotting of ‘area correct’ global maps of temperatures and anomalies. It is done in a FORTRAN program documented elsewhere.
chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Linv4 LOAD DATA LOCAL INFILE '/SG500/xfs/chiefio/SQL/v4/invent_tabs' INTO TABLE invent4; show warnings;
Again, loading it fairly trivial and the FORTRAN program to make the tab separated file has not changed.
This is the actual temperature data from the GHCN files. Note that from here on forward each version is getting it’s own tables. Pretty much given up on the idea of a version field as a key in one table.
Notice I’ve changed the temperature fields from this point forward to DECIMAL(6,2) from 5,2. Just a bit of paranoia after running into those “crazy high” 100+ C temperatures. Adding some head room so crazy temps can load and not be truncated into something that passes for more normal, yet wrong; nor give errors on loading. This means the field can hold a value like 1234.56 which is clearly overkill. I’m just not sure if MySQL / MariaDB counts the negative sign as a width value, so if -123.45 would need 6 wide, or not, and it is easier to burn some extra disk space than to look it up or test it ;-) So two bits of data size paranoia. A TB of disk running about $25, I don’t care…
chiefio@PiM3Devuan2:~/SQL/tables$ cat temps3 CREATE TABLE temps3 ( stnID CHAR(11) NOT NULL, version CHAR(5) NOT NULL, ascension CHAR(10), region CHAR(1) NOT NULL, cnum CHAR(3) NOT NULL, wmo CHAR(5) NOT NULL, near_wmo CHAR(3), year CHAR(4), month CHAR(4), type CHAR(4) NOT NULL, deg_c DECIMAL(6,2), missing CHAR(1), qc CHAR(1), source CHAR(1), PRIMARY KEY (stnID,year,month,type) ) ;
The Station ID is different between the v3 and prior versions and v4, so you can’t match on it. Thus might as well keep them separate. I store each temperature value in a unique record by Station, Month, Year, and type. The idea being to have MIN MAX and AVG in the same table. Not got there yet. Just AVG so far.
As v3 and prior have a cnum and not an abbreviation, that’s a difference too. I’d like to have a distinct table for mapping a stnID to a cnum, region, etc. but this was expedient. Again note that “region’ and ‘wmo’, ‘near wmo’ and ‘cnum’ are redundant parts of this Station ID but not the v4 Station ID.
chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Ltemps3 LOAD DATA LOCAL INFILE '/SG500/xfs/chiefio/SQL/v3/3temps.out' INTO TABLE temps3; show warnings;
The load script once again just sucking in a pre-formatted file.
I make a couple of indexes on temps3:
chiefio@PiM3Devuan2:~/SQL/tables$ cat mkindex3 CREATE INDEX regionx ON temps3 (region); CREATE INDEX regmonx ON temps3 (region, month);
Moving on to temps4, we once again see a load of “gratuitous change’ and incompatible / non-comparable fields. It’s almost like they are trying to make comparisons between versions impossible. Not to worry, it’s only impossible for some people ;-)
Again I made the deg_C field one digit longer.
CREATE TABLE temps4 ( stnID CHAR(11) NOT NULL, abrev CHAR(2) NOT NULL, F1 CHAR(1), F2 CHAR(1), F3 CHAR(1), F4 CHAR(1), wmo CHAR(5), year CHAR(4) NOT NULL, type CHAR(4) NOT NULL, version CHAR(5) NOT NULL, ascension CHAR(10), month CHAR(4), deg_c DECIMAL(6,2), missing CHAR(1), qc CHAR(1), source CHAR(1), PRIMARY KEY (stnID,year,month,type) ) ;
Basic idea stays the same. Temperature stored by Station, Year, Month and type. Station ID is an incompatible format with the prior versions, the abbreviation for country means I needed to make that countries map table. Then some various flags (see the GHCN docs) and my best guess as to what is likely the WMO number inside that Station ID. At this point, version and ascension are basically vestigial documentation fields.
chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Ltemps4 LOAD DATA LOCAL INFILE '/SG500/xfs/chiefio/SQL/v4/temps.out' INTO TABLE temps4; show warnings;
Then the v4 table indexes:
chiefio@PiM3Devuan2:~/SQL/tables$ cat mkindex4 CREATE INDEX abrevx ON temps4 (abrev);
There isn’t a lot of planning in the indexes. IF some report or graph or load / JOIN takes too long, I try tossing an index at it. If that works, the index stays in the scripting…
Computed Statistics By Month
Here the basic processing starts to move from FORTRAN file formatting / pre-processing into inside the database system computing. As I do anomalies ONLY by comparing a single instrument to itself, inside a single month, across the years, I need to compute some statistics to make that happen. Like “What is the average January temperature for this instrument?”. Then you can subtract that from any given January for that instrument to get the “anomaly’ for that instrument in that year in that month. No “baseline” needed. No homogenizing needed. No infilling needed. No hypothetical grid box needed. Just real data and statistics about it. We make those statistics here.
chiefio@PiM3Devuan2:~/SQL/tables$ cat mstats3 CREATE TABLE mstats3 ( stnID CHAR(11) NOT NULL, month CHAR(4) NOT NULL, mean DECIMAL(6,2) NOT NULL, big DECIMAL(6,2) NOT NULL, small DECIMAL(6,2) NOT NULL, num INTEGER NOT NULL, trang DECIMAL(6,2) NOT NULL, stdev DECIMAL(6,2) NOT NULL, pctd FLOAT, pctm FLOAT, PRIMARY KEY (stnID,month) ) ;
Note the key is Station ID and Month. For a given station, in a given month, what are the average (mean), the maximum (big) the minimum (small), how many values are there (num), what is the range from maximum to minimum (trange) and what is the standard deviation in them (stdev).
Some of these were float fields before. I’ve changed them to all be DECIMAL(6.2). This means I needed to add a “ROUND” function to some other steps, and for some fields (like Standard Deviation) it is clearly way too long (nobody gets out to 9000.00 standard deviations do they?…) but I wanted to stamp out all the warnings AND not have any risk of data truncation, so went for over the top. These could easily be tuned down / optimized later.
I’ve left percent missing (pctm) and the other percent valid data (pctd) as floats since I’m not using them. They are for some future idea and could just be left out. Notice that the loading program doesn’t even use them. They are presently NULLs in the database.
This data loading program is more interesting. First off, it is “table to table’. We take data out of the temps3 temperature table, compute some statistics, and stuff it into the monthly statistics table mstats3. Very similar processing is done on the version 4 data.
So why use field names like big, small, trang? Well, min, max, range, and stddev are key words to SQL, so the field name must be something else to avoid confusing the database SQL engine. You will see those functions being used in the load program below to find the AVG average, or mean, the MAX/MIN sizes, to COUNT up how many, then to take the difference between the MAX aand MIN values to get the total range, and then find the standard deviation of the temperatures. Notice I’ve put a rounding function ROUND on the computation of the mean and standard deviation values so as to avoid truncation notices. This trims them to 2 decimal places matching the new field size.
There are opportunities to tune this more on field sizes and widths.
Note that I’ve made the “exclude -99.9 missing data flag” screen into a more general “toss anything under -90 C or over 60 C in computing statistics as they are bogus values”. Something similar is also added to the computing of anomilies further down.
chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Lmstats3 INSERT INTO mstats3 (stnID,month,mean,big,small,num,trang,stdev) SELECT stnID,month, ROUND(AVG(deg_C),2),MAX(deg_C),MIN(deg_C),COUNT(deg_C), MAX(deg_C)-MIN(deg_C), ROUND(STDDEV(deg_C),2) FROM temps3 WHERE deg_C>-90.0 AND deg_C -90.0 AND deg_C< 60.0 GROUP BY stnID,month; show warnings;
The Version 4 monthly statistics table is almost identical to the v3.3 one, but by now I’d been burned enough with trying to fit them both into one table that I’ve given up ;-)
Besides, this way, when comparing one to the other, I don’t have to traverse any v3 data computing v4 graphs, nor traverse any v4 data when making v3. I can always “join” them if I need it mushed together.
chiefio@PiM3Devuan2:~/SQL/tables$ cat mstats4 CREATE TABLE mstats4 ( stnID CHAR(11) NOT NULL, month CHAR(4) NOT NULL, mean DECIMAL(6,2) NOT NULL, big DECIMAL(6,2) NOT NULL, small DECIMAL(6,2) NOT NULL, num INTEGER NOT NULL, trang DECIMAL(6,2) NOT NULL, stdev DECIMAL(6,2) NOT NULL, pctd FLOAT, pctm FLOAT, PRIMARY KEY (stnID,month) ) ;
Note the same changes of fields to DECIMAL(6.2) and the same addition of ROUND to the loading program. Note also the same “WHERE deg_C>-90.0 AND deg_C< 60.0” quality and missing data screen.
chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Lmstats4 INSERT INTO mstats4 (stnID,month,mean,big,small,num,trang,stdev) SELECT stnID,month, ROUND(AVG(deg_C),2),MAX(deg_C),MIN(deg_C),COUNT(deg_C), MAX(deg_C)-MIN(deg_C), ROUND(STDDEV(deg_C),2) FROM temps4 WHERE deg_C>-90.0 AND deg_C< 60.0 GROUP BY stnID,month; show warnings;
There are a LOT of opportunities to explore in these statistics. So far I’m only using them to compute the anomaly tables below.
Computed Anomalies Of Temperatures
Here we finally come to the meat of it. Now that we have the temperatures and the statistics, we can easily compute the anomalies of temperatures for each instrument for each monthly temperature value.
This is very similar in structure to the temperature table, and it might be a big space saver to just store the anomaly value in the temperature table itself. OTOH, doing UPDATE is horridly slow in MySQL/MariaDB, and I really don’t need to be dredging through the whole block of all of it when I’m just processing the anomaly part for graphs and such. Then there is that newly discovered by me (known to others forever ;-) limit that you can’t update a field in a table you are using in a “WHERE” screen, so that might be a problem too. Easy to avoid it all with some small bit of disk space.
chiefio@PiM3Devuan2:~/SQL/tables$ cat anom3 CREATE TABLE anom3 ( stnID CHAR(11) NOT NULL, region CHAR(1) NOT NULL, abrev CHAR(2), cnum CHAR(3) NOT NULL, wmo CHAR(5) NOT NULL, near_wmo CHAR(3), year CHAR(4) NOT NULL, month CHAR(4) NOT NULL, deg_C DECIMAL(6,2) NOT NULL, PRIMARY KEY (stnID,year,month) ) ;
Again the big change being a small one of DECIMAL(6.2) instead of 5.2 size.
Also note that the Station ID is a broken down into region, cnum, wmo, near_wmo with redundancy. This is for easier processing later, but is redundant data.
chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Lanom3 INSERT INTO anom3 (stnID,region,abrev,cnum,wmo,near_wmo,year,month,deg_C) SELECT T.stnID,T.region,C.abrev,T.cnum,T.wmo,T.near_wmo,T.year,T.month,ROUND(T.deg_C-ST.mean,2) FROM temps3 AS T INNER JOIN country as C ON T.cnum=C.cnum INNER JOIN mstats3 AS ST ON ST.stnID=T.stnID AND ST.month=T.month WHERE T.deg_C > -90.0 AND T.deg_C -90.0 AND T.deg_C < 60.0 ; show warnings;
Changes in this loading program are just the addition of ROUND functions and the addition of the quality range screening in the WHERE statement.
Note that I do a JOIN to country to get the abbreviation used in v4 matched to this v3 data country number.
Next the anomaly table for v4, anom4, is almost identical. Just missing near_wmo, wmo and with a bit of other order of things. I likely ought to make the order of the fields match better (region, abrev, cnum in different orders) but it doesn’t really matter. The v4 Station ID on inspection has a very confused wmo number field compared to v3, so I’m not sure what’s going on there. Have they just dropped the WMO meaning, or mixing other stuff in, or just renumbered everything? Whatever. Station ID entire is the key here anyway.
chiefio@PiM3Devuan2:~/SQL/tables$ cat anom4 CREATE TABLE anom4 ( stnID CHAR(11) NOT NULL, abrev CHAR(2) NOT NULL, region CHAR(1) NOT NULL, cnum CHAR(3) NOT NULL, year CHAR(4) NOT NULL, month CHAR(4) NOT NULL, deg_C DECIMAL(6,2) NOT NULL, PRIMARY KEY (stnID,year,month) ) ;
Again the change to 6.2 and in the loading program adding the WHERE screen on insane values along with ROUND on the anomaly computation.
chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Lanom4 INSERT INTO anom4 (stnID,abrev,region,cnum,year,month,deg_C) SELECT T.stnID,T.abrev,C.region,C.cnum,T.year,T.month,ROUND(T.deg_C-ST.mean,2) FROM temps4 AS T INNER JOIN country AS C ON T.abrev=C.abrev INNER JOIN mstats4 AS ST ON ST.stnID=T.stnID AND ST.month=T.month WHERE T.deg_C > -90.0 AND T.deg_C < 60.0 ; show warnings;
Yearly By Country Anomaly Statistics
Here I tried to “bite the bullet” and combine the v3.3 and v4 data into one table. That means I get to load it with one and then UPDATE with the other. It has been nothing but a bother. UPDATE has been crazy slow. I’m been working on something faster and I think I have it now.
I frist wrote two programs (and attendant “views”) that let me update the Russia and Kazakhstan data with the combined Europe & Asia statistics. So it is entirely functional now with the older Lyrcastats loading step. It will also work with the new method.
The graphs so far, that include Russia and Kazakhstan in Asia, only have v3.3 data for Asia in the comparison while the v4 data is all of Russia and Kazakhstan (they divided them at about the Urals into 2 “countries” each in v3.3 and prior but combine them as one country each in v4).
I’ve divided the Lyrcastats process into 4 steps.
L4yrcastats L3yrcastats LKZ LRS
The L4yrcastats program just loads the version 4 data. The L3yrcastats program updates the version 3.3 data EXCEPT for Russia and Kazakhstan. The LKZ program updates the Kazakhstan data using both the "KZ" and "K!" country abbreviations. The LRS program updates the Kazakhstan data using both the "RS" and "R!" country abbreviations. I've checked that all steps run without error and with data in the right places (and NULL where it belongs before that). I've not proven that the last two use all the Asian and European data, but can do that fairly quickly. So let's look at how this works. Fist, the table schema (layout). Note the use of DECIMAL(6,2) field sizes as the change.
chiefio@PiM3Devuan2:~/SQL/tables$ cat yrcastats CREATE TABLE yrcastats ( year CHAR(4) NOT NULL, abrev CHAR(2) NOT NULL, mean3 DECIMAL(6.2), mean4 DECIMAL(6.2) NOT NULL, big3 DECIMAL(6,2), big4 DECIMAL(6,2) NOT NULL, small3 DECIMAL(6,2), small4 DECIMAL(6,2) NOT NULL, num3 INTEGER, num4 INTEGER NOT NULL, trang3 DECIMAL(6,2), trang4 DECIMAL(6,2) NOT NULL, stdev3 FLOAT, stdev4 FLOAT NOT NULL, PRIMARY KEY (year,abrev) ) ;
Note that the KEY fields are the country abbreviation abrev, and the year. This table is for holding / computing anomaly statistics for each country for each year.
I run the L4yrcastats program first as it gets the table filled with the needed abrev values for the rest to work.
chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat L4yrcastats INSERT INTO yrcastats (year,abrev,mean4,big4,small4,num4,trang4,stdev4) SELECT year,abrev, ROUND(AVG(deg_C),2),MAX(deg_C),MIN(deg_C),COUNT(deg_C), MAX(deg_C)-MIN(deg_C), ROUND(STDDEV(deg_C),2) FROM anom4 GROUP BY year,abrev ; show warnings;
It is just the top bit of the old Lyrcastats program with “ROUND” functions added on the mean and standard deviation so that warnings of truncation went away.
Next I use a “VIEW” to load the GHCN v3.3 data WITHOUT loading Russia or Kazakhstan data. I’ve tested that, in fact, this loads the v3.3 data and both Russia and Kazakhstan have empty or NULL values
chiefio@PiM3Devuan2:~/SQL/tables$ cat vanom3 CREATE VIEW anom3Y AS SELECT ROUND(AVG(A.deg_C),2) AS 'meanA' , MAX(A.deg_C) AS 'maxA' , MIN(A.deg_C) AS 'minA' , COUNT(A.deg_C) AS 'numA' , MAX(A.deg_C)-MIN(deg_C) AS 'trangA' , STDDEV(A.deg_C) AS 'stdevA' , Y.year AS 'yearA' , Y.abrev AS 'abrevA' FROM anom3 AS A INNER JOIN yrcastats AS Y ON Y.year = A.year AND Y.abrev=A.abrev WHERE A.abrev 'RS' AND A.abrev'R!' AND A.abrev 'KZ' AND A.abrev'K!' GROUP BY Y.year,Y.abrev
A couple of minor notes. First, I did the ROUND in the loading procedure (and it is still there) but got truncation warnings on “meanA” anyway. Adding the ROUND here stopped those. I’ve not checked to see if the one in the load program is now redundant.
Again I have a difference of the file name for the VIEW definition (vanom3) and the VIEW name (anom3Y). I’m thinking of changing that to be more consistent.
I put the computing of the various statistics in the VIEW along with the exclusion rules that leave out the KZ, K!, RS, and R! abbreviations / data for Kazak Asia, Kazak Europe, Russian Asia, and Russian Europe respectively.
I then run this program which after a quick check looks to be doing the right thing and is MUCH faster. About 10 minutes instead of 4 hours!
Note that the “ROUND” in the first mean3= line may no longer be needed as I added it to the VIEW. But it is working now so I ain’t touchin’ it! ;-0
chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat L3yrcastats UPDATE yrcastats AS Y SET mean3 = (SELECT ROUND(A.meanA,2) FROM anom3Y AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ), big3 = (SELECT A.maxA FROM anom3Y AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ), small3 = (SELECT A.minA FROM anom3Y AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ), num3 = (SELECT A.numA FROM anom3Y AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ), trang3 = (SELECT A.trangA FROM anom3Y AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ), stdev3 = (SELECT A.stdevA FROM anom3Y AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ) ; show warnings;
Note that it now basically just picks up data from the “VIEW” and stuffs it into the table.
MariaDB [temps]> source bin/LOAD/L3yrcastats Query OK, 23611 rows affected (10 min 0.27 sec) Rows matched: 28423 Changed: 23611 Warnings: 0
I do need to do another “soup to nuts” complete rebuild just to verify that no step was left undomumented and it all plays well together, but is sure looks clean. I did do a ‘report’ from the table at this point and it looks like it ought to look. Russia has not data for v3.3 while others do.
MariaDB [temps]> SELECT meanA from anom3Y WHERE abrevA='RS'; Empty set (1 min 23.75 sec) MariaDB [temps]> SELECT meanA FROM anom3Y WHERE abrevA="WA"; +-----------+ | meanA | +-----------+ | -0.419167 | | 0.405833 | | -0.669167 | | -0.569167 | [...] | 2.904412 | | 0.854839 | | 1.495000 | +-----------+ 83 rows in set (1 min 24.40 sec)
As I’d previously validated the following “update Russia and Kazak” steps, I think that pretty much validates all of it (other than proving both RS and R! are used in the final update).
The Final Steps:
Due to MySQL / MariaDB having an unexpected limit (not in other SQLs) where it locks a table such that you can’t use it in a WHERE clause and update it at the same time, I had to make a “VIEW” to get these bits of update code to work. This is essentially a stored “SELECT” statement that presents what looks like a table made from some other tables. (I”m using the same technique for all the updates and it is faster).
So first the “view” then the load script. Note that this VIEW also has a ROUND in it:
chiefio@PiM3Devuan2:~/SQL/tables$ cat vanom3R CREATE VIEW anom3R AS SELECT ROUND(AVG(A.deg_C),2) AS 'meanA' , MAX(A.deg_C) AS 'maxA' , MIN(A.deg_C) AS 'minA' , COUNT(A.deg_C) AS 'numA' , MAX(A.deg_C)-MIN(deg_C) AS 'trangA' , STDDEV(A.deg_C) AS 'stdevA' , Y.year AS 'yearA' , Y.abrev AS 'abrevA' FROM anom3 AS A INNER JOIN yrcastats AS Y ON Y.year = A.year AND Y.abrev='RS' WHERE A.abrev = 'RS' or A.abrev= 'R!' GROUP BY Y.year
This is largely just a bypass of the MySQL limit so that I can use Y.abrev in the WHERE clause of the update by making it into “abrevA” in this “VIEW”. I also folded the computing of the Average, min, max etc. into the view, as, well, why not?
So this VIEW computes the statistics (AVE, MAX, MIN, COUNT, Total Range, Standard Deviation) and gives them field names like meanA, maxA etc. It also puts the yrcastats key field ‘year’ as Y.year into ‘yearA’ and it puts the yrcastats Y.abrev value into ‘abrevA’ so I can use them in the following update SQL program.
Notice the use of matching “RS” for the yrcastats file but using either “RS” or “R!” (the European 1/2 of Russia) for the anomaly data? Nice, eh? I just hope it is working right (no I’ve not yet done the ‘by hand run a sample all the way through the calculations and see if it matches’ QA step. Dreading it a bit, actually.
So I’m “pretty sure” this is working as expected, and it gave no error messages, but “Some QA Required” at this point. (Hey, this is only day 2 for that code, OK?) ;-)
It does this via a JOIN of the anomalies table, anom3, with the Yearly Country Anomaly Statistics table yrcastats, that at this point has had the v4 data loaded, and the v3 data loaded but with the RS “country’ only having the Asia portion of Russia in it for v3.3 data. This fixes that.
Here’s the UPDATE SQL. Yes, it is ugly. That seems to be a limitation of UPDATE. So for each field we must go through this shadow dance of SET the field to a value from an embedded SELECT statement. At least via using the VIEW we can just grab the finished computed value and move on.
chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat LRS UPDATE yrcastats AS Y SET mean3 = (SELECT ROUND(A.meanA,2) FROM anom3R AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ), big3 = (SELECT A.maxA FROM anom3R AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ), small3 = (SELECT A.minA FROM anom3R AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ), num3 = (SELECT A.numA FROM anom3R AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ), trang3 = (SELECT A.trangA FROM anom3R AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ), stdev3 = (SELECT A.stdevA FROM anom3R AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ) WHERE abrev='RS' ; show warnings;
Then the same thing but this one limiting updates to KZ Kazakhstan and using KZ and K! anomaly data. Again note the ROUND in the view makes the ROUND in the load load program likely a redundant action.
chiefio@PiM3Devuan2:~/SQL/tables$ cat vanom3K CREATE VIEW anom3K AS SELECT ROUND(AVG(A.deg_C),2) AS 'meanA' , MAX(A.deg_C) AS 'maxA' , MIN(A.deg_C) AS 'minA' , COUNT(A.deg_C) AS 'numA' , MAX(A.deg_C)-MIN(deg_C) AS 'trangA' , STDDEV(A.deg_C) AS 'stdevA' , Y.year AS 'yearA' , Y.abrev AS 'abrevA' FROM anom3 AS A INNER JOIN yrcastats AS Y ON Y.year = A.year AND Y.abrev='KZ' WHERE A.abrev = 'KZ' or A.abrev= 'K!' GROUP BY Y.year
Essentially identical but for swapping the country abbreviations.
Oh, and note I’m naming “VIEW”s with a leading v in the file name so I know it isn’t creating a real table. I likely ought to have carried that forward into the VIEW name itself, but decided to keep it the same as the table it was replacing (anom3) with just a letter to tell me which Country it was for.
chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat LKZ UPDATE yrcastats AS Y SET mean3 = (SELECT ROUND(A.meanA,2) FROM anom3K AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ), big3 = (SELECT A.maxA FROM anom3K AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ), small3 = (SELECT A.minA FROM anom3K AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ), num3 = (SELECT A.numA FROM anom3K AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ), trang3 = (SELECT A.trangA FROM anom3K AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ), stdev3 = (SELECT A.stdevA FROM anom3K AS A WHERE Y.year=A.yearA AND Y.abrev=A.abrevA ) WHERE abrev='KZ' ; show warnings;
Basically the same thing as for RS but with “anom3K” as the table/ VIEW name.
So there you have it. All the stuff needed to build the entire database with the exception of the unchanged FORTRAN. I’d put the FORTRAN in here too but then it would be way too long, and it is already posted. As time permits I’ll add some links back to where it is posted already.
Could this be done better? Certainly. I’m using the database like a series of procedural steps with intermediary files. Not the highest and best use. OTOH, it is fairly fast to develop this way, and then one knows all the issues and pitfalls and can go back for a clean re-implementation later. This is the “Rapid Prototype” phase.
I’ve found (many times – as has several companies) that doing a Rapid Prototype / Do-over is faster, cleaner, and better than doing a whole lot of analysis up front that then has to be re-done anyway when reality breaks your battle plan.
So don’t think of this as a “Done Deal”. It is only the “make some product you can ship now while you find out how it ought to be done” rapid phase. Feel free to do it over however you like for your own needs / uses.
FWIW I ran this all the way to the Lyrcastats step (that is 4 hours on the Pi so I skipped it this time) and had no errors and no warnings. I put that run in a comment here:
So as of now I’m done fooling around with these stages for a while. I’m going to proceed to making the Africa and Europe graphs.
So that’s what’s on deck next. Though perhaps after a bit of “garden time”. I’m reaching the “too many squash” phase and picked 1/2 dozen “8 Ball” squash yesterday. Can’t let that thing go a couple of days or you have too much squash and green beans ;-)