This is just an intermediate layout for these tables. They are Version 4 of GHCN specific. Eventually I would like to make replacements that will hold the various versions (v1, v2, v3…) but that will require a bit more “integration” of the difference between the versions.
So for now, to shake out the bugs and make sure I’ve got the right ideas, I’ve made these versions of the tables; and scripts to load them.
First off, the statistics. I’ve added a standard deviation field by Station for each month.
Here’s the table layout:
CREATE TABLE mstats4 ( stnID CHAR(11) NOT NULL, month CHAR(4) NOT NULL, mean FLOAT NOT NULL, big DECIMAL(7,2) NOT NULL, small DECIMAL(7,2) NOT NULL, num INTEGER NOT NULL, trang FLOAT NOT NULL, stdev FLOAT NOT NULL, pctd FLOAT, pctm FLOAT, PRIMARY KEY (stnID,month) ) ;
I’ve still not done the percent missing and percent data statistics so those two fields at the bottom are still empty.
Here’s the script that loads the table:
chiefio@PiM3Devuan2:~/SQL/bin$ cat lmstats4.sql INSERT INTO mstats4 (stnID,month,mean,big,small,num,trang,stdev) SELECT stnID,month, AVG(deg_C),MAX(deg_C),MIN(deg_C),COUNT(deg_C), MAX(deg_C)-MIN(deg_C), STDDEV(deg_C) FROM temps4 WHERE deg_C>-90 GROUP BY stnID,month;
When run, this took a fairly long time to complete. It has one core pegged all the time, so clearly MariaDB doesn’t do as much multi-core as it might. It is not disk limited on the Pi M3. This strongly implies a system with a single faster core would be a lot faster. This implies boards like the Odroid XU4 or the RockPro64 with very fast cores would make this a lot faster.
Here’s the run result:
MariaDB [temps]> source tables/mstats4 Query OK, 0 rows affected (0.57 sec) [...] MariaDB [temps]> source tables/anom4 Query OK, 0 rows affected (0.31 sec) [...] MariaDB [temps]> source bin/lmstats4.sql Query OK, 328327 rows affected (29 min 32.63 sec) Records: 328327 Duplicates: 0 Warnings: 0 MariaDB [temps]>
So just about 1/2 an hour to create the statistics and load them… As this only needs to be done once, it can be done on the Pi M3, but at the 1/2 hour point I’m starting to look longingly at my fast SBCs ;-)
Next, the anomaly table:
This one was (and still is) a challenge. It, too, pegs one core at 100% use. After running for about 24 hours, it crapped out with an error message. I think this was taking so long as it has a JOIN on a non-index field, so about 300,000 records to match against 240 with most likely repeated linear searches and lots of “is it equal?” tests… While it was running I figured that was likely the issue but just left it to run to find out what would come of it.
First the table schema:
CREATE TABLE anom4 ( stnID CHAR(11) NOT NULL, abrev CHAR(2) NOT NULL, region CHAR(1) NOT NULL, cnum CHAR(3) NOT NULL, year CHAR(4) NOT NULL, month CHAR(4) NOT NULL, deg_C DECIMAL(5,2) NOT NULL, PRIMARY KEY (stnID,year,month) ) ;
Pretty basic as there’s a lot of fields in v3 that are dropped in v4. I add the region (continent) number and the country number (cnum) from v3 with the intent to enable comparisons of like things with v3. That has an issue with Russia and Kazakhstan… which I’ll show down below.
Then the code:
INSERT INTO anom4 (stnID,abrev,region,cnum,year,month,deg_C) SELECT T.stnID,C.abrev,C.region,C.cnum,T.year,T.month,T.deg_C-ST.mean FROM temps4 AS T INNER JOIN country AS C ON T.abrev=C.abrev INNER JOIN mstats4 AS ST ON ST.stnID=T.stnID AND ST.month=T.month WHERE T.deg_C > -90 ;
The problem is that join to the “country” table so I can pick up things like the country number used in version 3 along with the continent number.
MariaDB [temps]> source bin/lanom4.sql ERROR 1062 (23000) at line 1 in file: 'bin/lanom4.sql': Duplicate entry 'KZ000028676-1891- APR' for key 'PRIMARY'
So I checked where was this duplicate data? (Not remembering yet that I had created it…)
MariaDB [temps]> SELECT stnID, year, month, type FROM temps4 -> WHERE year=1891 AND month=' APR' and stnID='KZ000028676'; +-------------+------+-------+------+ | stnID | year | month | type | +-------------+------+-------+------+ | KZ000028676 | 1891 | APR | TAVG | +-------------+------+-------+------+ 1 row in set (0.84 sec) MariaDB [temps]> SELECT cnum,abrev,region,version,cname FROM country -> WHERE abrev='KZ'; +------+-------+--------+---------+------------+ | cnum | abrev | region | version | cname | +------+-------+--------+---------+------------+ | 211 | KZ | 2 | Gv4 | Kazakhstan | | 625 | KZ | 6 | Gv4 | Kazakhstan | +------+-------+--------+---------+------------+ 2 rows in set (0.00 sec)
Oh, Yeah… Russia and Kazakhstan were split between Europe and Asia in v3, but only one country in V4, so there are two country numbers assigned to the same abbreviation… Which also means my attempt to make an index on it will fail.
So I need to choose to either just assign all of those two to one continent or the other (and one cnum) or change the abbreviation to something else for 1/2 of the country (and have it not match V4).
Decisions decisions… unscrambling eggs can be so much bother…
I’d intended to just pick up the v3 bits and move on, not needing to get them again, but that match clearly “has issues” due to those two countries.
I think I’ll have a cup-a tea while mulling this over. Most likely I’ll just assign all of both of them to Asia and “move on” with v4 as is for now. Sorting out the individual stations into European / Asian later with different names and abbreviations (like ARussia and EKazakhstan … or something…) Alternatively I can move the attachment of a name and number elsewhere in the process. But somewhere some how I need to be able to match two things that don’t exactly match in order to do direct comparisons.
It isn’t needed for the direct anomaly inspection of just the v4 data, so simply skipping that step would be ‘OK’ for now. Decisions, decidsions…