Updated GHCN Load Scripts & Reduced Warnings

This is a technical posting about some changes I’ve made to the GHCN Versions Anomaly Database.

The most important change was to some of the input data. There were records that ended with a “\” where that was interpreted by SQL as an “escape” character, which removed the special meaning of the end of line character, which caused the following line to be discarded as excess length of the first line. As a result about 2188 lines of Antarctic Data (in 50 blocks) was not loaded for GHCN v4. I’ll need to redo the Antarctic graphs and see if it was significant or not. The other 50 records were single data points scattered through the data of 16 Million data points, so ought not to have any significant effect. These were the lines just following each block of the other data.

I also changed the one Saint Helena record that was marked as Asia (caught before it was used) and moved it to Africa where it belonged.

The one country name that was 1 character too long, a closing “]”, had the word “and” changed to “&” and is no longer too long.

The actual database table descriptions, and a few of the data loading programs, had the field named “country” changed to match the usage from further on in the development when I thought using a different field for the v3.3 marker of ‘cnum’ for country number, and ‘abrev’ for the v4 two character abbreviation for the country. Here’s a list of the table schema that changed:

chiefio@PiM3Devuan2:~/SQL/tables$ cat CHANGED
in temps3 changed 'country' to 'cnum'
in anom3 changed 'country' to 'cnum'
in anom4 changed 'country' to 'abrev'
in mkanomindex3 changed 'country' to 'cnum'
in mkindexes changed 'country' to 'cnum'
in wmo (that isn't done yet) changed 'country' to 'cnum' and added 'abrev'

Not very dramatic ;-) Then the associated “Load table foo” scripts, like Ltemps3, also had their field names changed to match. So Ltemp3, Lanom3, Lanom4. As “wmo” is in development it has no load script. Then the mkindex* are standalone processes.

Other changes including a change in how Russia and Kazakhstan are handled. These have their data split between Europa and Asia in v3.3 (and earlier) but all just assigned to their abbreviation in v4. My approach has the creation of a new ‘abrev’ for each of “R!” and “K!” to assign to the European data in v3.3. That means the data were not in the anomaly comparisons for “RS” Russia as they were a different country code in keeping with v3.3 method. I’ve changed the loading of the ‘yrcastats’ table to assign all the Russia data into the same “abrev’ code. This needs some testing / QA to assure it works as expected, but it ought to give more comparable results. This is still a work in progress as it needs QA and validation, plus warning removal.

Then I had gotten tired of typing long program names when pulling them in to MariaDB for execution. To make that less, and to clean things up a bit, I regularized some of the program names and removed the .sql suffix (as it isn’t needed).

So, let’s go through what it looks like now, as any references going forward will be using these names and functions.

First thing I did was to drop all the existing tables:

MariaDB [temps]> source tables/dropall
Query OK, 0 rows affected (1.43 sec)

Query OK, 0 rows affected (0.16 sec)

Query OK, 0 rows affected (0.21 sec)

Query OK, 0 rows affected (0.39 sec)

Query OK, 0 rows affected (0.52 sec)

Query OK, 0 rows affected (0.47 sec)

Query OK, 0 rows affected (0.15 sec)

Query OK, 0 rows affected (0.07 sec)

Query OK, 0 rows affected (0.58 sec)

Query OK, 0 rows affected (0.44 sec)

Query OK, 0 rows affected (0.22 sec)

MariaDB [temps]> show tables;
Empty set (0.14 sec)

Here’s the SQL script that does that:

chiefio@PiM3Devuan2:~/SQL/tables$ cat dropall 
DROP TABLE continent;
DROP TABLE country;
DROP TABLE invent3;
DROP TABLE invent4;
DROP TABLE temps3;
DROP TABLE temps4;
DROP TABLE mstats3;
DROP TABLE mstats4;
DROP TABLE anom3;
DROP TABLE anom4;
DROP TABLE yrcastats;

After you drop all the tables, then you must recreate them:

MariaDB [temps]> show tables
    -> ;
Empty set (0.00 sec)

MariaDB [temps]> source tables/table
Query OK, 0 rows affected (0.23 sec)

Query OK, 0 rows affected (0.13 sec)

Query OK, 0 rows affected (0.05 sec)

Query OK, 0 rows affected (0.03 sec)

Query OK, 0 rows affected (0.06 sec)

Query OK, 0 rows affected (0.06 sec)

Query OK, 0 rows affected (0.05 sec)

Query OK, 0 rows affected (0.06 sec)

Query OK, 0 rows affected (0.08 sec)

Query OK, 0 rows affected (0.11 sec)

Query OK, 0 rows affected (0.08 sec)

MariaDB [temps]> show tables
    -> ;
+-----------------+
| Tables_in_temps |
+-----------------+
| anom3           |
| anom4           |
| continent       |
| country         |
| invent3         |
| invent4         |
| mstats3         |
| mstats4         |
| temps3          |
| temps4          |
| yrcastats       |
+-----------------+
11 rows in set (0.00 sec)

This is what that script looks like:

chiefio@PiM3Devuan2:~/SQL/tables$ cat tables
source /SG500/xfs/chiefio/SQL/tables/continent
source /SG500/xfs/chiefio/SQL/tables/country
source /SG500/xfs/chiefio/SQL/tables/invent3
source /SG500/xfs/chiefio/SQL/tables/invent4
source /SG500/xfs/chiefio/SQL/tables/temps3
source /SG500/xfs/chiefio/SQL/tables/temps4
source /SG500/xfs/chiefio/SQL/tables/mstats3
source /SG500/xfs/chiefio/SQL/tables/mstats4
source /SG500/xfs/chiefio/SQL/tables/anom3
source /SG500/xfs/chiefio/SQL/tables/anom4
source /SG500/xfs/chiefio/SQL/tables/yrcastats

Each of those being a table schema. I’ll post the canonical set of them once the final loading step is proven for yrcastats.

Then the next thing to do is load up all the tables. Here’s that script. It does not run the Lyrcastats script as that one takes over 4 hours on the Raspberry Pi M3 so I prefer to see that this is all done and right before I commit to running it:

chiefio@PiM3Devuan2:~/SQL/bin$ cat LoadTables
source /SG500/xfs/chiefio/SQL/bin/LOAD/Lcontinent
source /SG500/xfs/chiefio/SQL/bin/LOAD/Lcountry
source /SG500/xfs/chiefio/SQL/tables/mkindexcn
source /SG500/xfs/chiefio/SQL/bin/LOAD/Linv3
source /SG500/xfs/chiefio/SQL/bin/LOAD/Linv4
source /SG500/xfs/chiefio/SQL/bin/LOAD/Ltemps3
source /SG500/xfs/chiefio/SQL/tables/mkindex3
source /SG500/xfs/chiefio/SQL/bin/LOAD/Ltemps4
source /SG500/xfs/chiefio/SQL/tables/mkindex4
source /SG500/xfs/chiefio/SQL/bin/LOAD/Lmstats3
source /SG500/xfs/chiefio/SQL/bin/LOAD/Lmstats4
source /SG500/xfs/chiefio/SQL/bin/LOAD/Lanom3
source /SG500/xfs/chiefio/SQL/tables/mkanomindex3
source /SG500/xfs/chiefio/SQL/bin/LOAD/Lanom4
source /SG500/xfs/chiefio/SQL/tables/mkanomindex4

If distributed the making of the indexes on the files throughout the actual file loading steps. Note that each “load” program now starts with an “L” and no longer has a .sql suffix.

Here’s the result of running it, divided into sections for easier commenting. Remember that the continent loading is done as separate lines for each “region” just stuffing in a name, so first up, continents:

MariaDB [temps]> source bin/LoadTables
Query OK, 1 row affected (0.11 sec)

Empty set (0.01 sec)

Query OK, 1 row affected (0.08 sec)

Empty set (0.00 sec)

Query OK, 1 row affected (0.01 sec)

Empty set (0.00 sec)

Query OK, 1 row affected (0.10 sec)

Empty set (0.00 sec)

Query OK, 1 row affected (0.01 sec)

Empty set (0.00 sec)

Query OK, 1 row affected (0.00 sec)

Empty set (0.00 sec)

Query OK, 1 row affected (0.01 sec)

Empty set (0.00 sec)

Query OK, 1 row affected (0.00 sec)

Empty set (0.00 sec)

As before, no warnings and no errors. Next is countries and the index on it:

Query OK, 240 rows affected (0.13 sec)               
Records: 240  Deleted: 0  Skipped: 0  Warnings: 0

Empty set (0.00 sec)

Query OK, 0 rows affected (0.24 sec)
Records: 0  Duplicates: 0  Warnings: 0

All nice now, no truncating that one too long country name ;-) After countries, we load up the “inventory”. This is the list of thermometers for each of v3.3 and v4 of GHCN.

Query OK, 7280 rows affected (1.51 sec)              
Records: 7280  Deleted: 0  Skipped: 0  Warnings: 0

Empty set (0.00 sec)

Query OK, 27361 rows affected (2.58 sec)             
Records: 27361  Deleted: 0  Skipped: 0  Warnings: 0

Empty set (0.00 sec)

All done fast and no warnings. Then we start loading the v3.3 and v4 actual temperature data. This part takes a while, but not too bad. Here is where we had those lost records from GHCN v4 using a backslash as a data marker and SQL taking it as an “ignore next character” symbol. For now I’ve just hand edited the input file to use “^” instead. This attributes the source to an Australian / Russian set instead of an Antarctic / Russian set, but as I’m not using “source flag” anyway (yet) it doesn’t matter. Eventually I’ll change the TAB inserting FORTRAN to swap in some unique character here, but for now, they used all the ones on my keyboard!

Note we also make a few indexes on each of these data tables so following steps go faster:

Query OK, 5554200 rows affected (6 min 29.43 sec)    
Records: 5554200  Deleted: 0  Skipped: 0  Warnings: 0

Empty set (0.01 sec)

Query OK, 0 rows affected (4 min 37.75 sec)         
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (5 min 11.87 sec)         
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 16811736 rows affected (20 min 57.80 sec)  
Records: 16811736  Deleted: 0  Skipped: 0  Warnings: 0

Empty set (0.02 sec)

Query OK, 0 rows affected (12 min 52.63 sec)        
Records: 0  Duplicates: 0  Warnings: 0

After the temperatures are loaded, I compute some monthly statistics. This is fairly math intensive and things slow down even more.

Query OK, 87360 rows affected (1 min 48.17 sec)
Records: 87360  Duplicates: 0  Warnings: 0

Empty set (0.01 sec)

Query OK, 328327 rows affected (29 min 35.20 sec)
Records: 328327  Duplicates: 0  Warnings: 0

Empty set (0.01 sec)

Here we do the Lanom3 and Lanom4 loading / computing of the anomaly tables using the statistics tables. First up is Lanom3, the anomalies calculations for GHCN V3.3, plus the 4 indexes that get built on that table.

MariaDB [temps]> source bin/LOAD/Lanom3
Query OK, 5278201 rows affected (12 min 51.75 sec)
Records: 5278201  Duplicates: 0  Warnings: 0

Empty set (0.05 sec)

Query OK, 0 rows affected (3 min 20.01 sec)         
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (3 min 39.06 sec)         
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (3 min 37.43 sec)         
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (3 min 44.97 sec)         
Records: 0  Duplicates: 0  Warnings: 0

No longer do we get the “Data truncated for column ‘deg_C’ at row” warnings. How was this done? By choosing how to take the result of the Decimal-float calculation, that will be a float, and trimming the excess digits ourselves. I chose to do a ROUND but if you like you can do a TRUNC instead and get the original values the old version produced by the default truncation. Rounding ought to be a bit closer to reality, but as this is rounding into the 1/100 C place and that’s all fictional precision anyway (original data in whole degrees F for the USA) having an occasional 1/100 C change ought not to do much.

Here’s the new version of Lanom3 with the change bolded:

chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Lanom3
INSERT INTO  anom3 (stnID,region,abrev,cnum,wmo,near_wmo,year,month,deg_C)
SELECT T.stnID,T.region,C.abrev,T.cnum,T.wmo,T.near_wmo,
T.year,T.month,ROUND(T.deg_C-ST.mean,2)
FROM temps3 AS T
INNER JOIN country as C ON T.cnum=C.cnum
INNER JOIN mstats3 AS ST 
ON
        ST.stnID=T.stnID
	AND ST.month=T.month 
WHERE T.deg_C > -90.0 
;

show warnings;

Lanom4 gets a similar change.

Query OK, 15450574 rows affected (19 min 7.34 sec)
Records: 15450574  Duplicates: 0  Warnings: 0

Empty set (0.09 sec)

Query OK, 0 rows affected (11 min 40.06 sec)        
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (12 min 14.11 sec)        
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (12 min 6.50 sec)         
Records: 0  Duplicates: 0  Warnings: 0

And again, we have no error messages!

As a bit of review, here’s the listing of what tables we are creating / have created:

MariaDB [temps]> show tables;
+-----------------+
| Tables_in_temps |
+-----------------+
| anom3           |
| anom4           |
| continent       |
| country         |
| invent3         |
| invent4         |
| mstats3         |
| mstats4         |
| temps3          |
| temps4          |
| yrcastats       |
+-----------------+
11 rows in set (0.46 sec)

Looking at that list, the only one left to do is the yrcastats table. Yearly Country Anomaly Statistics. We’ve loaded the continent names and the countries, loaded the inventory of thermometers for both GHCN versions, and the actual temperature data for both sets, computed the monthly statistics for both sets and then from that, created the Anomaly table values for both.

Only at the point where we are computing the yrcastats table does the data get combined from both sets, by Country Abbreviation and that causes a problem with Russia and Kazakhstan as there is only one ‘abrev’ value for GHCN v4, the ‘RS’ flag, but 2 country numbers in GHCN 3.3 data (one for Europe and one for Asia, with the cutoff about the Ural Mountains).

The prior processing did not cope with this, so ‘RS’ was assigned to Asian GHCN v3.3 and “R!’ was assigned to the European part. This preserves the v3.3 data layout as it is matched to a country abbreviation, but then during that match, only the Asian portion of v3.3 data get matched to all of ‘RS’ in the v4 anomaly table to create the yrcastats table.

What I needed to do was find a whay to process ALL abrev in v3.3 and ALL abrev in v4, matched on abrev, except for ‘RS’, ‘R!’, ‘KZ’, and ‘K!’. I used the ‘!’ second character for the European country numbers as the abbreviation was two letters but not using any special characters, so it would be guaranteed to be unique and not accidentally match any existing country code, plus it would stand out as “different’ if seen anywhere.

Loading this table is very slow. 4 hours slow on the Raspberry Pi Model 3, so not something you get to try a lot of times in one day. What I’ve settled on for the present load process is unlikely to be the most efficient. At some future data I’m going to explore alternatives and look for something more efficient.

I’ve changed the v3.3 step to exclude loading the Russia and Kazakhstan data, then I use a “view” to combine their two Country Codes (abbreviations) into one record in the yrcastats table. Yeah, kludgy. There’s got to be a better way. To try speeding up the process, at some future time I’m also going to experiment with using a “view” for the first major v3.3 load step after the v4 load.

Here’s the modified Lyrcastats SQL program that loads the data excluding the two problem cases. This is done just after the v4 data is loaded in total and the only change is all those NOT (A.abrev=’RS’ OR …) lines. The first part, the INSERT for v4, is quite fast at 4 minutes. It is the UPDATE to add v3.3 that’s slow at 4 hours. I’ve bolded one of the lines that excludes the two problem nations:

chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Lyrcastats 
INSERT INTO  yrcastats (year,abrev,mean4,big4,small4,num4,trang4,stdev4)
SELECT year,abrev,
AVG(deg_C),MAX(deg_C),MIN(deg_C),COUNT(deg_C), MAX(deg_C)-MIN(deg_C), STDDEV(deg_C)
FROM anom4 
GROUP BY year,abrev
;
show warnings;

UPDATE  yrcastats AS Y
SET 
mean3 = (SELECT AVG(A.deg_C) 
FROM anom3 AS A
WHERE Y.year=A.year AND Y.abrev=A.abrev
AND NOT (A.abrev='RS' OR A.abrev='R!' OR A.abrev='KZ' OR A.abrev='K!')
),

big3 = (SELECT MAX(A.deg_C) 
FROM anom3 AS A
WHERE Y.year=A.year AND Y.abrev=A.abrev
AND NOT (A.abrev='RS' OR A.abrev='R!' OR A.abrev='KZ' OR A.abrev='K!')
),

small3 = (SELECT MIN(A.deg_C) 
FROM anom3 AS A
WHERE Y.year=A.year AND Y.abrev=A.abrev
AND NOT (A.abrev='RS' OR A.abrev='R!' OR A.abrev='KZ' OR A.abrev='K!')
),

num3 = (SELECT COUNT(A.deg_C) 
FROM anom3 AS A
WHERE Y.year=A.year AND Y.abrev=A.abrev
AND NOT (A.abrev='RS' OR A.abrev='R!' OR A.abrev='KZ' OR A.abrev='K!')
),

trang3 = (SELECT MAX(A.deg_C)-MIN(A.deg_C) 
FROM anom3 AS A
WHERE Y.year=A.year AND Y.abrev=A.abrev
AND NOT (A.abrev='RS' OR A.abrev='R!' OR A.abrev='KZ' OR A.abrev='K!')
),

stdev3 = (SELECT STDDEV(A.deg_C) 
FROM anom3 AS A
WHERE Y.year=A.year AND Y.abrev=A.abrev
AND NOT (A.abrev='RS' OR A.abrev='R!' OR A.abrev='KZ' OR A.abrev='K!')
)
;
show warnings;
chiefio@PiM3Devuan2:~/SQL/bin/LOAD$

Desipte the extra comparison logic / tests, the run time is about the same as before:

MariaDB [temps]> source bin/LOAD/Lyrcastats
Query OK, 28423 rows affected (4 min 12.69 sec)
Records: 28423  Duplicates: 0  Warnings: 0

Empty set (0.01 sec)

Query OK, 28423 rows affected (4 hours 4 min 56.37 sec)
Rows matched: 28423  Changed: 28423  Warnings: 0

Empty set (0.05 sec)

MariaDB [temps]>

So yeah, it is ugly and slow. “Needs work” comes to mind. But for now it is what I have.

Next we need to combine the European and Asian parts of the two countries in GHCN v3.3 and load that combined set of statistics into the yrcastats table. To do this, I’m using a “view” on the anom3 table to combine the data for both Europe and Asia for the country abbreviations, and then computing and updating the fields for just the ‘RS’ or ‘KZ’ abbreviation used by the v4 data in yrcastats. In this way we end up with “apples to apples” comparisons.

First, the ‘view’ for Russia, then the loading of one country, then the other, and a bit of testing that it all worked as expected (plus an ‘Easter Egg’ found in the testing):

chiefio@PiM3Devuan2:~/SQL/tables$ cat vanom3R 
CREATE VIEW anom3R AS
SELECT AVG(A.deg_C)   AS 'meanA' ,
       MAX(A.deg_C)   AS 'maxA'  ,
       MIN(A.deg_C)   AS 'minA'  ,
       COUNT(A.deg_C) AS 'numA'  ,
       MAX(A.deg_C)-MIN(deg_C) AS 'trangA'  ,
       STDDEV(A.deg_C) AS 'stdevA' ,  
       Y.year AS 'yearA'
FROM anom3 AS A
INNER JOIN yrcastats AS Y
ON Y.year = A.year AND Y.abrev='RS'
WHERE A.abrev = 'RS' or A.abrev= 'R!'
GROUP BY Y.year

This view is created using both halves of Russia in the v3.3 data and only the one abrev for all of Russia in the v4 loaded yrcastats table. Then we can create reports from it, or “extract” the computed field from it, for Russia, which are then loaded back into the appropriate fields in yrcastats so we don’t need to compute them again for each use.

HOWEVER….

In testing I ran a simple report that said Russia had a range of 69 C in the anomalies. WT?

MariaDB [temps]> SELECT minA, maxA, trangA FROM anom3R;
+--------+-------+--------+
| minA   | maxA  | trangA |
+--------+-------+--------+
|  -4.03 | -0.03 |   4.00 |
|  -2.33 |  2.77 |   5.10 |
|  -2.19 | -1.13 |   1.06 |
| -48.23 | 21.07 |  69.30 |
[...]

IIRC the world record ever temperature is in the low 50 C somewhere. That Russia had a low anomaly of -48 the same year as a high anomaly of 21 above normal seemed a bit off to me. So “I don’t think so, Tim…” So lets do a quick report of the highs in the data and see if any of it is country 222 or 638, Russia:

MariaDB [temps]> SELECT cnum, cname, abrev FROM country
    -> WHERE abrev='RS' or abrev='R!';
+------+--------+-------+
| cnum | cname  | abrev |
+------+--------+-------+
| 638  | Russia | R!    |
| 222  | Russia | RS    |
+------+--------+-------+
2 rows in set (1.09 sec)


MariaDB [temps]> SELECT deg_C, cnum, year, month, stnID FROM temps3
    -> WHERE deg_C > 55;
+--------+------+------+-------+-------------+
| deg_C  | cnum | year | month | stnID       |
+--------+------+------+-------+-------------+
|  64.00 | 101  | 1999 | JULY  | 10160535000 |
|  60.40 | 112  | 2008 |  MAY  | 11264405000 |
|  71.00 | 112  | 1997 |  MAR  | 11264459000 |
|  82.00 | 113  | 2001 |  FEB  | 11365536000 |
|  76.00 | 124  | 2000 |  JAN  | 12462008000 |
|  67.30 | 125  | 2013 |  NOV  | 12567009000 |
|  66.40 | 127  | 1999 | JUNE  | 12761226000 |
|  73.00 | 127  | 2001 |  NOV  | 12761296000 |
|  66.70 | 127  | 1999 |  NOV  | 12761297000 |
|  86.90 | 128  | 2003 |  FEB  | 12861499000 |
|  87.20 | 131  | 2004 |  MAR  | 13167297000 |
|  90.00 | 133  | 2003 | JUNE  | 13361052000 |
|  61.00 | 133  | 2000 |  AUG  | 13361099000 |
|  74.70 | 222  | 2000 | JULY  | 22232411000 |
|  86.00 | 302  | 2010 |  APR  | 30285201000 |
|  83.00 | 302  | 2011 |  APR  | 30285201000 |
|  90.00 | 302  | 2011 | SEPT  | 30285242000 |
|  99.90 | 304  | 2000 |  DEC  | 30485629000 |
|  86.20 | 407  | 1981 |  APR  | 40778486000 |
| 154.40 | 425  | 1996 | JULY  | 42572259002 |
| 138.10 | 425  | 1996 | JULY  | 42572352002 |
| 102.30 | 425  | 1996 | JULY  | 42572417010 |
| 121.10 | 425  | 1996 | JULY  | 42572471001 |
|  56.10 | 425  | 2003 |  AUG  | 42574392001 |
| 144.20 | 425  | 1996 | JULY  | 42574750001 |
|  87.80 | 425  | 1999 |  APR  | 42591178002 |
|  70.00 | 502  | 2000 |  APR  | 50291652000 |
|  64.90 | 503  | 1992 |  AUG  | 50396237000 |
|  70.00 | 647  | 2000 |  JAN  | 64740030000 |
|  89.90 | 649  | 2003 |  DEC  | 64917285000 |
|  56.80 | 650  | 1813 | JULY  | 65033345000 |
+--------+------+------+-------+-------------+
31 rows in set (37.36 sec)

Well right there’s your “Global Warming”! A few “over 100 C” boiling hots being homogenized around could really warm up the place a lot…

So looks like I ought to put in a QA Screen up front for “Crazy High” temperatures and leave them out of the data loading process.

But otherwise the “View” looks to be working correctly.

BUT, is v4 any “better”? Nope. After a few crazy values scattered around the globe, WA Namibia, runs off to crazy land:

MariaDB [temps]> SELECT cname, cnum, abrev 
    -> FROM country WHERE abrev="WA";
+---------+------+-------+
| cname   | cnum | abrev |
+---------+------+-------+
| Namibia | 132  | WA    |
+---------+------+-------+
1 row in set (0.83 sec)

MariaDB [temps]> 

Since I’ve not done the Africa graphs yet, Namibia hasn’t screwed up them. I’ll fix this prior to doing Africa and re-doing Russia.

So with Namiba fingered as “WA” here’s the v4 temperatures above 55 C:

MariaDB [temps]> SELECT deg_C, cnum, year, month, stnID FROM temps4 
    -> WHERE deg_C > 55;
ERROR 1054 (42S22): Unknown column 'cnum' in 'field list'
MariaDB [temps]> SELECT deg_C, abrev, year, month, stnID 
    -> FROM temps4 WHERE deg_C > 55;
+--------+-------+------+-------+-------------+
| deg_C  | abrev | year | month | stnID       |
+--------+-------+------+-------+-------------+
|  57.40 | AR    | 1962 |  NOV  | ARM00087532 |
|  86.00 | BL    | 2010 |  APR  | BLXLT857862 |
|  76.00 | BP    | 1997 |  JAN  | BPXLT744568 |
| 126.40 | BR    | 1925 |  FEB  | BR027794310 |
|  77.20 | BR    | 1987 |  DEC  | BRM00082191 |
| 119.90 | BR    | 1932 |  OCT  | BRM00082994 |
|  79.10 | CI    | 1919 |  AUG  | CIXLT611794 |
|  71.50 | CO    | 2017 |  NOV  | COM00080110 |
|  81.50 | CO    | 1978 |  APR  | COXLT745561 |
|  83.40 | CO    | 1978 | JULY  | COXLT745561 |
|  83.40 | CO    | 1978 | JUNE  | COXLT745561 |
|  86.20 | DR    | 1981 |  APR  | DRM00078486 |
|  60.50 | GH    | 2011 |  MAY  | GHM00065418 |
|  74.00 | KG    | 1998 |  NOV  | KG000038473 |
|  59.05 | MR    | 2017 |  NOV  | MR000061415 |
|  60.90 | MR    | 2005 |  MAY  | MR000061421 |
|  61.10 | MX    | 2007 |  MAR  | MXM00076840 |
|  63.91 | NG    | 2006 |  APR  | NG000001036 |
|  56.40 | NG    | 2009 |  DEC  | NG000061017 |
|  60.55 | NG    | 2008 |  OCT  | NG000061043 |
|  66.35 | NG    | 2007 | SEPT  | NG000061096 |
|  61.35 | NG    | 2015 |  OCT  | NG000061096 |
|  59.45 | SG    | 2015 | JULY  | SGM00061698 |
|  66.40 | SU    | 2018 | SEPT  | SU000062640 |
|  61.85 | TH    | 2004 |  OCT  | TH000048500 |
|  61.55 | TU    | 2011 |  OCT  | TUM00017260 |
|  61.55 | TU    | 2011 |  OCT  | TUXLT060121 |
|  59.25 | TX    | 2016 |  AUG  | TX000038895 |
|  56.80 | UP    | 1813 | JULY  | UPM00033345 |
|  60.60 | WA    | 1941 |  APR  | WAXLT711578 |
|  56.00 | WA    | 1941 |  AUG  | WAXLT711578 |
|  63.60 | WA    | 1941 |  DEC  | WAXLT711578 |
|  66.15 | WA    | 1941 |  FEB  | WAXLT711578 |
|  66.89 | WA    | 1941 |  JAN  | WAXLT711578 |
|  62.10 | WA    | 1941 |  MAR  | WAXLT711578 |
|  61.30 | WA    | 1941 |  MAY  | WAXLT711578 |
|  60.90 | WA    | 1941 |  NOV  | WAXLT711578 |
|  60.55 | WA    | 1941 |  OCT  | WAXLT711578 |
|  57.60 | WA    | 1941 | JUNE  | WAXLT711578 |
|  58.55 | WA    | 1941 | SEPT  | WAXLT711578 |
|  66.05 | WA    | 1942 |  APR  | WAXLT711578 |
|  62.90 | WA    | 1942 |  DEC  | WAXLT711578 |
|  66.90 | WA    | 1942 |  FEB  | WAXLT711578 |
|  65.94 | WA    | 1942 |  JAN  | WAXLT711578 |
|  63.90 | WA    | 1942 |  MAR  | WAXLT711578 |
|  64.80 | WA    | 1942 |  MAY  | WAXLT711578 |
|  63.19 | WA    | 1942 |  NOV  | WAXLT711578 |
|  60.00 | WA    | 1942 |  OCT  | WAXLT711578 |
|  62.65 | WA    | 1942 | JULY  | WAXLT711578 |
|  58.20 | WA    | 1942 | JUNE  | WAXLT711578 |
|  56.45 | WA    | 1942 | SEPT  | WAXLT711578 |
|  60.90 | WA    | 1943 |  APR  | WAXLT711578 |
|  55.55 | WA    | 1943 |  AUG  | WAXLT711578 |
|  64.80 | WA    | 1943 |  DEC  | WAXLT711578 |
|  63.10 | WA    | 1943 |  FEB  | WAXLT711578 |
|  63.00 | WA    | 1943 |  JAN  | WAXLT711578 |
|  63.60 | WA    | 1943 |  MAR  | WAXLT711578 |
|  56.15 | WA    | 1943 |  MAY  | WAXLT711578 |
|  61.45 | WA    | 1943 |  NOV  | WAXLT711578 |
|  58.45 | WA    | 1943 |  OCT  | WAXLT711578 |
|  57.90 | WA    | 1943 | JULY  | WAXLT711578 |
|  59.05 | WA    | 1943 | JUNE  | WAXLT711578 |
|  57.20 | WA    | 1943 | SEPT  | WAXLT711578 |
|  68.65 | WA    | 1944 |  APR  | WAXLT711578 |
|  56.00 | WA    | 1944 |  AUG  | WAXLT711578 |
|  68.25 | WA    | 1944 |  FEB  | WAXLT711578 |
|  68.85 | WA    | 1944 |  JAN  | WAXLT711578 |
|  67.35 | WA    | 1944 |  MAR  | WAXLT711578 |
|  68.39 | WA    | 1944 |  MAY  | WAXLT711578 |
|  59.80 | WA    | 1944 |  OCT  | WAXLT711578 |
|  59.30 | WA    | 1944 | JULY  | WAXLT711578 |
|  60.65 | WA    | 1944 | JUNE  | WAXLT711578 |
|  56.15 | WA    | 1944 | SEPT  | WAXLT711578 |
|  60.44 | WA    | 1945 |  APR  | WAXLT711578 |
|  55.75 | WA    | 1945 |  AUG  | WAXLT711578 |
|  65.44 | WA    | 1945 |  DEC  | WAXLT711578 |
|  68.10 | WA    | 1945 |  FEB  | WAXLT711578 |
|  64.00 | WA    | 1945 |  JAN  | WAXLT711578 |
|  68.00 | WA    | 1945 |  MAR  | WAXLT711578 |
|  62.80 | WA    | 1945 |  MAY  | WAXLT711578 |
|  60.15 | WA    | 1945 |  NOV  | WAXLT711578 |
|  55.65 | WA    | 1945 |  OCT  | WAXLT711578 |
|  60.35 | WA    | 1945 | JUNE  | WAXLT711578 |
|  56.30 | WA    | 1945 | SEPT  | WAXLT711578 |
|  58.75 | WA    | 1946 |  APR  | WAXLT711578 |
|  61.70 | WA    | 1946 |  DEC  | WAXLT711578 |
|  63.00 | WA    | 1946 |  FEB  | WAXLT711578 |
|  62.00 | WA    | 1946 |  JAN  | WAXLT711578 |
|  60.80 | WA    | 1946 |  MAR  | WAXLT711578 |
|  56.35 | WA    | 1946 |  MAY  | WAXLT711578 |
|  58.80 | WA    | 1946 |  NOV  | WAXLT711578 |
|  56.65 | WA    | 1946 |  OCT  | WAXLT711578 |
|  60.95 | WA    | 1946 | JUNE  | WAXLT711578 |
|  55.25 | WA    | 1946 | SEPT  | WAXLT711578 |
|  59.80 | WA    | 1947 |  APR  | WAXLT711578 |
|  62.80 | WA    | 1947 |  DEC  | WAXLT711578 |
|  66.75 | WA    | 1947 |  FEB  | WAXLT711578 |
|  62.40 | WA    | 1947 |  JAN  | WAXLT711578 |
|  69.10 | WA    | 1947 |  MAR  | WAXLT711578 |
|  61.80 | WA    | 1947 |  MAY  | WAXLT711578 |
|  59.40 | WA    | 1947 |  NOV  | WAXLT711578 |
|  56.85 | WA    | 1947 |  OCT  | WAXLT711578 |
|  55.80 | WA    | 1947 | JULY  | WAXLT711578 |
|  55.10 | WA    | 1947 | SEPT  | WAXLT711578 |
|  57.40 | WA    | 1948 |  APR  | WAXLT711578 |
|  55.55 | WA    | 1948 |  AUG  | WAXLT711578 |
|  60.65 | WA    | 1948 |  DEC  | WAXLT711578 |
|  63.60 | WA    | 1948 |  FEB  | WAXLT711578 |
|  64.39 | WA    | 1948 |  JAN  | WAXLT711578 |
|  63.05 | WA    | 1948 |  MAR  | WAXLT711578 |
|  55.95 | WA    | 1948 |  MAY  | WAXLT711578 |
|  58.95 | WA    | 1948 |  NOV  | WAXLT711578 |
|  55.80 | WA    | 1948 |  OCT  | WAXLT711578 |
|  56.95 | WA    | 1948 | JUNE  | WAXLT711578 |
|  55.30 | WA    | 1948 | SEPT  | WAXLT711578 |
|  63.80 | WA    | 1949 |  APR  | WAXLT711578 |
|  62.25 | WA    | 1949 |  DEC  | WAXLT711578 |
|  66.60 | WA    | 1949 |  FEB  | WAXLT711578 |
|  63.60 | WA    | 1949 |  JAN  | WAXLT711578 |
|  65.50 | WA    | 1949 |  MAR  | WAXLT711578 |
|  61.95 | WA    | 1949 |  MAY  | WAXLT711578 |
|  58.25 | WA    | 1949 |  NOV  | WAXLT711578 |
|  56.40 | WA    | 1949 |  OCT  | WAXLT711578 |
|  61.60 | WA    | 1949 | JULY  | WAXLT711578 |
|  64.50 | WA    | 1949 | JUNE  | WAXLT711578 |
|  56.05 | WA    | 1949 | SEPT  | WAXLT711578 |
|  60.75 | WA    | 1950 |  APR  | WAXLT711578 |
|  61.20 | WA    | 1950 |  DEC  | WAXLT711578 |
|  65.69 | WA    | 1950 |  FEB  | WAXLT711578 |
|  66.55 | WA    | 1950 |  JAN  | WAXLT711578 |
|  64.50 | WA    | 1950 |  MAR  | WAXLT711578 |
|  62.05 | WA    | 1950 |  MAY  | WAXLT711578 |
|  60.05 | WA    | 1950 |  NOV  | WAXLT711578 |
|  58.35 | WA    | 1950 |  OCT  | WAXLT711578 |
|  60.25 | WA    | 1950 | JUNE  | WAXLT711578 |
+--------+-------+------+-------+-------------+
135 rows in set (1 min 54.50 sec)

A couple of those might be valid as The Wiki states the record ever was 59 C, that’s just one, once.

According to the World Meteorological Organization’s (WMO), the highest temperature ever recorded was 56.7 °C (134.1 °F) on 10 July 1913 in Furnace Creek (Greenland Ranch), California, United States.[8] The validity of this record is challenged as possible problems with the reading have since been discovered. Christopher C. Burt, a weather historian writing for Weather Underground, believes that the 1913 Death Valley reading is “a myth”, and is at least 2.2 or 2.8 °C (4 or 5 °F) too high.[9] Burt proposes that the highest reliably recorded temperature on Earth could be at Death Valley, but is instead 54.0 °C (129.2 °F) recorded on 30 June 2013.[10] This alleged record has since been tied twice by readings in Kuwait and Iran. The WMO has stated they stand by the 1913 record pending any future investigations.

Since the USA does not show up in that v4 list, I’m presuming that Death Valley data is not in the GHCN, so using a 55 C “max” is acceptable. What about low limits? I’m going to report on those elsewhere, but the low cutoff ought to be at least the record value:

Satellite measurements of the surface temperature of Antarctica, taken between 1982 and 2013, found a coldest temperature of −93.2 °C (−135.8 °F) on 10 August 2010, at 81.8°S 59.3°E. Although this is not comparable to an air temperature, it is believed that the air temperature at this location would have been lower than the official record lowest air temperature of −89.2 °C (−128.6 °F).

The only odd one I found in a quick look was this one in Russia:

MariaDB [temps]> SELECT deg_C, cnum, year, month, stnID FROM temps3 
    -> WHERE deg_C  -99;
+--------+------+------+-------+-------------+
| deg_C  | cnum | year | month | stnID       |
+--------+------+------+-------+-------------+
| -87.00 | 222  | 1997 |  MAR  | 22224738000 |
+--------+------+------+-------+-------------+
1 row in set (39.30 sec)

But IIRC that one was in the news about then…

So it looks to me like anything lower than -90 C is likely bogus as well.

With that out of the way, back to that “loading the Russian yearly statistics” thing…

Other than some bogus data entries (to be filtered out in a reload…) the report from anom3R looked generally right. Here’s the tail of the report:

+--------+-------+--------+
| minA   | maxA  | trangA |
+--------+-------+--------+
[...]
|  -4.83 | 12.92 |  17.75 |
| -10.88 | 11.24 |  22.12 |
|  -9.96 | 12.40 |  22.36 |
|  -8.88 | 13.56 |  22.44 |
| -10.29 | 15.14 |  25.43 |
|  -8.21 | 16.43 |  24.64 |
|  -9.44 | 11.05 |  20.49 |
|  -5.93 |  9.44 |  15.37 |
+--------+-------+--------+
259 rows in set (7.71 sec)

So 259 years of data, and typical “low anomalies” about 10 C down from the norm, typical “high anomalies” about 10 to 15 C over, and the range being about 20 +/- 5 on the range of the anomalies. That seems reasonable for Russia. So with that I’m willing to proceed with the data update of yrcastats using this “view”, and a similar one for Kazakhstan using “KZ” and “K!” where this one as “RS” and “R!”.

Loading yrcastats code below does not work

UPDATE: From here on down the modified code to load the yrcastats file just leaves it NULLS. Something about adding the additional “WHERE” filters causes it to totally fail. I’m debugging, but for now dropping back to the old Lyrcastats that just ignores the European part of Russia and Kazakhstan. At least it puts something into the table…

Looks like it is a limitation of MySQL (not necessarily other versions);

https://dev.mysql.com/doc/refman/5.6/en/update.html

You cannot update a table and select from the same table in a subquery.

So likely my use of “Y.abrev=’RS'” runs afoul of that. I’m going to swap it for A.abrev, and if that’s not enough then I’ll need a new appoach.

UPDATE2: I got LRS to work via an added field in the view and removing the Y.abrev reference in the WHERE. So that’s promising. I can do the same for the LRK. Then I need a bit more of a think no the bulk load…

This is the “first cut” at the data load program for Russia, and the result (Yeah, real “original” name for it “LRS” for Load Russia):

chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat LRS 
UPDATE  yrcastats AS Y
SET 
mean3 = (SELECT A.meanA
FROM anom3R AS A
WHERE Y.year=A.yearA AND Y.abrev='RS'
),

big3 = (SELECT A.maxA
FROM anom3R AS A
WHERE Y.year=A.yearA AND Y.abrev='RS'
),

small3 = (SELECT A.minA
FROM anom3R AS A
WHERE Y.year=A.yearA AND Y.abrev='RS'
),

num3 = (SELECT A.numA
FROM anom3R AS A
WHERE Y.year=A.yearA AND Y.abrev='RS'
),

trang3 = (SELECT A.trangA
FROM anom3R AS A
WHERE Y.year=A.yearA AND Y.abrev='RS'
),

stdev3 = (SELECT A.stdevA
FROM anom3R AS A
WHERE Y.year=A.yearA AND Y.abrev='RS'
)
WHERE abrev='RS'
;
show warnings;

Then when run, it has warnings on truncating so I’ll add a ROUND to the one for KZ (and if that’s fine, retrfit it to LRS):

MariaDB [temps]> source bin/LOAD/LRS
Query OK, 28423 rows affected, 1410 warnings (49.82 sec)
Rows matched: 28423  Changed: 28423  Warnings: 1410

+-------+------+---------------------------------------------+
| Level | Code | Message                                     |
+-------+------+---------------------------------------------+
| Note  | 1265 | Data truncated for column 'meanA' at row 2  |
| Note  | 1265 | Data truncated for column 'meanA' at row 4  |
| Note  | 1265 | Data truncated for column 'meanA' at row 5  |
| Note  | 1265 | Data truncated for column 'meanA' at row 6  
[...]
| Note  | 1265 | Data truncated for column 'meanA' at row 83 |
| Note  | 1265 | Data truncated for column 'meanA' at row 84 |
| Note  | 1265 | Data truncated for column 'meanA' at row 85 |
| Note  | 1265 | Data truncated for column 'meanA' at row 86 |
| Note  | 1265 | Data truncated for column 'meanA' at row 87 |
+-------+------+---------------------------------------------+
64 rows in set (0.00 sec)

The Warnings number is roughly the same as the number of items (Max, Min, etc) x number of years, so there will likely be other “truncate” warnings once “meanA” is rounded. I’ll work on that for a while…As there are clear size differences in the implied sizes in the anom3R view and the yrcastats table, some kind of ROUND or casting will be needed to make the warnings go away.

MariaDB [temps]> describe anom3R;
+--------+--------------+------+-----+---------+-------+
| Field  | Type         | Null | Key | Default | Extra |
+--------+--------------+------+-----+---------+-------+
| meanA  | decimal(9,6) | YES  |     | NULL    |       |
| maxA   | decimal(5,2) | YES  |     | NULL    |       |
| minA   | decimal(5,2) | YES  |     | NULL    |       |
| numA   | bigint(21)   | NO   |     | 0       |       |
| trangA | decimal(6,2) | YES  |     | NULL    |       |
| stdevA | double(16,6) | YES  |     | NULL    |       |
| yearA  | char(4)      | NO   |     | NULL    |       |
+--------+--------------+------+-----+---------+-------+
7 rows in set (1.06 sec)

MariaDB [temps]> describe yrcastats;
+--------+--------------+------+-----+---------+-------+
| Field  | Type         | Null | Key | Default | Extra |
+--------+--------------+------+-----+---------+-------+
| year   | char(4)      | NO   | PRI | NULL    |       |
| abrev  | char(2)      | NO   | PRI | NULL    |       |
| mean3  | float        | YES  |     | NULL    |       |
| mean4  | float        | YES  |     | NULL    |       |
| big3   | decimal(7,2) | YES  |     | NULL    |       |
| big4   | decimal(7,2) | NO   |     | NULL    |       |
| small3 | decimal(7,2) | YES  |     | NULL    |       |
| small4 | decimal(7,2) | NO   |     | NULL    |       |
| num3   | int(11)      | YES  |     | NULL    |       |
| num4   | int(11)      | NO   |     | NULL    |       |
| trang3 | float        | YES  |     | NULL    |       |
| trang4 | float        | NO   |     | NULL    |       |
| stdev3 | float        | YES  |     | NULL    |       |
| stdev4 | float        | NO   |     | NULL    |       |
+--------+--------------+------+-----+---------+-------+
14 rows in set (0.01 sec)

However, at 4 hours per reload, doing a few iterations and the QA on it all is going to take a while, so I’m going to post this now and then post updates if anything interesting pops up.

Here’s the KZ variants of the “view” and the LRK data load program (with rounding):

chiefio@PiM3Devuan2:~/SQL/tables$ cat vanom3K
CREATE VIEW anom3K AS
SELECT AVG(A.deg_C)   AS 'meanA' ,
       MAX(A.deg_C)   AS 'maxA'  ,
       MIN(A.deg_C)   AS 'minA'  ,
       COUNT(A.deg_C) AS 'numA'  ,
       MAX(A.deg_C)-MIN(deg_C) AS 'trangA'  ,
       STDDEV(A.deg_C) AS 'stdevA' ,  
       Y.year AS 'yearA'
FROM anom3 AS A
INNER JOIN yrcastats AS Y
ON Y.year = A.year AND Y.abrev='KZ'
WHERE A.abrev = 'KZ' or A.abrev= 'K!'
GROUP BY Y.year


chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat LKZ
UPDATE  yrcastats AS Y
SET 
mean3 = (SELECT ROUND(A.meanA,2)
FROM anom3K AS A
WHERE Y.year=A.yearA AND Y.abrev='KZ'
),

big3 = (SELECT A.maxA
FROM anom3K AS A
WHERE Y.year=A.yearA AND Y.abrev='KZ'
),

small3 = (SELECT A.minA
FROM anom3K AS A
WHERE Y.year=A.yearA AND Y.abrev='KZ'
),

num3 = (SELECT A.numA
FROM anom3K AS A
WHERE Y.year=A.yearA AND Y.abrev='KZ'
),

trang3 = (SELECT A.trangA
FROM anom3K AS A
WHERE Y.year=A.yearA AND Y.abrev='KZ'
),

stdev3 = (SELECT A.stdevA
FROM anom3K AS A
WHERE Y.year=A.yearA AND Y.abrev='KZ'
)
WHERE abrev='KZ'
;
show warnings;

Then running them:

MariaDB [temps]> source tables/vanom3K
Query OK, 0 rows affected (0.39 sec)

MariaDB [temps]>

So the view was accepted straight off. Now to do the load:

MariaDB [temps]> source bin/LOAD/LKZ
Query OK, 436 rows affected, 924 warnings (26.28 sec)
Rows matched: 28423  Changed: 436  Warnings: 924

+-------+------+---------------------------------------------+
| Level | Code | Message                                     |
+-------+------+---------------------------------------------+
| Note  | 1265 | Data truncated for column 'meanA' at row 2  |
| Note  | 1265 | Data truncated for column 'meanA' at row 4  |
| Note  | 1265 | Data truncated for column 'meanA' at row 7  |
| Note  | 1265 | Data truncated for column 'meanA' at row 9  |
| Note  | 1265 | Data truncated for column 'meanA' at row 10 |
| Note  | 1265 | Data truncated for column 'meanA' at row 12 |
[...]
| Note  | 1265 | Data truncated for column 'meanA' at row 79 |
| Note  | 1265 | Data truncated for column 'meanA' at row 80 |
| Note  | 1265 | Data truncated for column 'meanA' at row 81 |
+-------+------+---------------------------------------------+
64 rows in set (0.06 sec)

MariaDB [temps]> 

With a lot less data it goes a lot faster, but I’m still getting a truncated warning. So some digging into exactly what is going on with the data will be required. The data ought not exceed 3 digits a decimal point and 2 digits for anomalies (like -10.23) but that’s 6 digits, so maybe I need to make yrcastats field size larger… but I thought SQL didn’t count the “.” so something to look up. Or maybe just remove the bogus “way wrong” data and see if that fixes it ;-)

In Conclusion

I think this way of making yrcastats ought to be correct, but after some QA time it is clearly failing, not validated. Anyone with more SQL experience feel free to offer suggestions and corrections.

I’ll be scrubbing everything, putting in a “sanity filter” on the loading of the temperature data to filter out the clearly insane high values found above, and then reloading it all, maybe a couple of times. All up, it’s about 6 hours start to end per run, so it may be a couple of days before I’m able to figure out what’s wrong with the last steps. Until then, everything up to this last step is pretty much clean and done, and I can redo the Antarctic graphs to see how they changed with the added Russian source data.

UPDATE: I did the Antarctic graphs and they are almost indistinguishable from the others. A couple of anomaly spots near the middle years get a bit closer for v3.3 vs v4, but that’s about it.

Subscribe to feed

Advertisements

About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present "hot buttons' are the mythology of Climate Change and ancient metrology; but things change...
This entry was posted in NCDC - GHCN Issues, Tech Bits and tagged , , , , . Bookmark the permalink.

12 Responses to Updated GHCN Load Scripts & Reduced Warnings

  1. Pouncer says:

    I suggest the bad records (high or low temps beyond reasonable limits) be marked rather than deleted. You may also need to have create and mark substitute records, capped at the highest reasonable/seasonal value.

    You run something sometimes to make sure your version can reproduce the official publications, even though you know the official version is founded on wrong data. Then –having shown the official results depend on or at least include the unreasonable records– a second run that skips the “marked” unreasonable records generates corrected results. A third run includes capped substitute values rather than omitting excessive values. I would expect a few unreasonable records affect the results by only a few fractions of a percent in most cases. But by testing the results of including, capping, or omitting these records you’ll be in better position to figure out the original process.

  2. E.M.Smith says:

    @Pouncer:

    All reasonable suggestions. As I have the original data kept archived, cutting them out in any one data run isn’t problematic.

    For a “first cut” I’m just going to put a “missing data flag” of -99.9 in the field.

    This is the “unadjusted” data, so nothing is published by the Warmistas based on it. They do a lot of adjusting (often the wrong way) that they defend fiercely before they do any publishing…

    But never fear, I’ll always have “the original” if desired.

    For the kind of analysis I’m doing, a few data points don’t seem to matter one way or the other.

    SIdebar On UPDATE failure:

    I found a statement on Stack Overflow that pointed out MySQL has a limit some other SQL versions do not: You can’t select from a table with a WHERE clause AND update it. It locks the table. I suspect that’s what caused my update to return statistics like it was working, but not actually load any data. So continuing to debug but with a new “Rule” to keep in mind.

  3. Bill in Oz says:

    How curious that those temperature readings above boiling point ( 144 degrees, 114 degrees etc ) all date from 1996. I wonder if the extraordinary low temp also dates from this period. Being an employee of a totally dis-functional government, lead by an alcoholic president, with public servants not getting paid for months, does interesting things …

    Probably similar issues with other data from counties going through disfunctional periods : Eg Ukraine, Caucasus countries ( ex USSR ) etc.

    I admire your persistence EM at discovering & then chasing down these problems. Good work !
    But I’m glad I am not spending my time doing it.

  4. CoRev says:

    Good job, EM. We’re all getting a better understanding.

  5. Steven Fraser says:

    @EM: This is a very craftsman-like approach: well-described, easily-followed, thorougly documented and replicable. Bravo, sir!

    A thought about outlier anomalies…these indicate a data quality issue IMO worth a bit of investigation as to the pattern (if any) of daily values in the particular locations and date ranges, and cross-referenced with the actual values in the other dataset for those dates. Anomalous anomalies,

    It occurs to me that they are recent enough and few enough that it might be fruitful to engage in some follow-up with the BOM’s ( or whatever) to clarify the extant records content, as a way to confirm either the v3 or v4 values. That the data sets have such spurious values is a travesty, indicative of poor data quality controls in the pipeline.

  6. Larry Ledwick says:

    It would be interesting to run an audit routine that checks several issues for being within logical bounds just to filter out the data bits that need some close examination.

    A gross filter as you suggest that says anything over X is probably bad , and anything less than Y is probably bad is a very simple and necessary first cut.

    But that still leaves unexamined values that fall within those hard ceiling and floor values but are still unreasonable for the local.

    (Not that I am suggesting doing this now – just brain storming on how you would do good data hygiene testing)

    Check maximum Delta T from the low to the following day high or (daily high to subsequent low). In typical daily temperatures this number should be near or less than about 30 deg F ( 17 deg C).
    In special cases such as strong Chinook or Santa Anna wind effects or Alberta clipper like cold out breaks the number could go higher, but would only make sense in certain areas. You are not going to get down slope wind effects in New Orleans or central Florida, or arctic front outbreaks in Hawaii.

    https://en.wikipedia.org/wiki/United_States_temperature_extremes

    For the United States, the extremes are 134 °F (56.7 °C) in Death Valley, California in 1913 and −79.8 °F (−62.1 °C) recorded in Prospect Creek, Alaska in 1971.

    The largest recorded temperature change in one place over a 24-hour period occurred on January 15, 1972 in Loma, Montana, when the temperature rose from −54 to 49 °F (−47.8 to 9.4 °C).

    The most dramatic temperature changes occur in North American climates susceptible to Chinook winds. For example, the largest 2-minute temperature change of 49 °F (27.2 °C) occurred in Spearfish, South Dakota, a rise from −4 to 45 °F (−20.0 to 7.2 °C).

    Among the U.S. states, Hawaii has both the lowest state maximum of 100 °F (37.8 °C) and the highest state minimum of 12.0 °F (−11.1 °C). Tropical ocean island locations such as Hawaii often have the lowest recorded temperature ranges, sometimes with a difference of as little as 62 °F (34.4 °C).

    Might want to assign a filter by zone that accounts for things like Chinook or Santa Anna Winds or the moderating effects of the near by ocean temps.

    Obviously professional meteorologists could spend a year or two developing a proper protocol for this sort of checking (which should already be in place I might at), but the obviously have not done such checking and added appropriate * notations to some of those data values.

    Obviously a daily high value that is 20+ degrees higher than the average of the next 10 max temperatures should get some validation and audit attention. Perhaps going back to local news papers to see if some special circumstance justifies the wild value.

  7. E.M.Smith says:

    I’ve got LRS and LKZ both working. They load the respective country statistics by year for GHCN v3.3, that is under two country codes, into one country code as updates to the yearly statsics table. The load for v4 works. That just leaves my attempt to load the v3.3 data WITHOUT Russia and Kazakhstan to get fixed…. However, I ought to be able to just run the old version, that loads only Asia dats for RS and KZ then simply run the update programs and overwrite the first stuff (for RS and KZ).

    Tacky, but it would work.

    I’ve taken the opportunity to change some of the data storage formats since I’m doing all this reloading. Made most of them DECIMAL (6, 2), so 6 long not counting the decimal point, with 2 in the fractional part. Like 1234.56 for example. I’m not sure if MySQL counts the minus sign on negative numbers, so just giving extra space to be sure -123.45 will fit. Even if it is an error value. I don’t want it truncated to fit and ending up bogus but looking right…

    At yhe point where temps get turned into anomalies, I added exclusions for temperatures over 60 C or beliw -90 C. That at least keep the rest of the process more sane.

    I’m rerunning the whole load now for another final QA step on all this . Done in a couple of hours snd ought to be zero warns. I’ll pick up that last load issue tomorrow… I’m done for tonight.

    The most pleasing thing for me is finding out the few errors on warns were not enough to change any graphs visibly. I don’t have to redo them or issue corrections. I may yet post the corrected Antarctic graphs as they had the bulk of all issues, just so you can see how little they chsnged. That is partly due to my using aggregated statistics more than straight temperature items.

    One quick thing I did, checking the changed data format for Standard Deviation, was print a report sorted by it. Some values were as high as 10 in the sort list… I suspect a 10 Std. Dev. probably “has issues”. So yeah, lots of possible sanity checks can be done. Of course, the big problem is doing the QA work on the code to assure the standard dev. value calculated is correct…

    I also capture the minimum and maximum, range, and mean for each month for each instrument. Lots of stuff that can be used for…. So far I’ve done nothing like find all temperatures over 4 std. dev. from the mean…, but do generate some of the stats needed to do that kind of thing.

    Well, I can’t focus the letters on the screen well and I’m yawning every sentence, so I think it is time to sleep.

    Just one passing comment:

    Rain here typically ends middle of April. Sometimes March. We have 1 to 3 days of cool rain predicted the next few days in May… That is very unseasonably cold and wet. It could end up snowing at high elevations. Time to watch the weather reports….

  8. E.M.Smith says:

    @Bill in Oz:

    I know this is the “unadjusted” data, so before someone has gone through and “preened” it, so has “QA Issues”. Unfortunately there is no “insanity removed but not molested into a whore” version in between the “insane seasoned” data and the “Climate Silicone Whore” version.with “augmentations”… But still, you’d think the source countries would be embarrassed to send out such crap data.

    @Larry L:

    One BIG thing hanging out there, that’s on my “someday” list, is the way the electronic sensors tend to “Fail High”. Anthony Watts has found some of these. One I remember a few years back was at a big airport in Hawaii – Oahu? It was reading something like 10 C higher than anything near it (something like 40 C surrounded by 30 C) and it was months before the thing was fixed… Then they didn’t remove the bogus data…

    So there’s the question of: “If this is the normal failure mode, how many places are aged into too high, or just not recording lows right, but not yet insanely high?”

    This, then, gets compounded by their tendency to make the instrument change “splice” not have a discontinuity. So they will not accept the 10 C “jump” but instead cool the past record to make the “join” smooth at the overlap. I suspect that part of “homogenizing” is the source of some of the “cool the past” behaviour. This also shows up in the change of whitewash to latex paint on the Stevenson Screens. As latex ages, it darkens a bit and gets a tiny bit warmer. Whitewash didn’t do that, it just got dirty. Now at the change point you get a suddenly whiter screen (or a new one next to the old one)… Rather than thinking “Oh, we just put a cleaner or whiter device in place, and it is reading cooler, over the years the old shelter got darker so oldest data are fine leave in the discontinuity.” what they do is lower the ENTIRE old record by the discontinuity to assure no “jump” at the join.

    It essentially casts in stone the “rise” caused by dirt or aging latex paint.

    Finding those kinds of systematic “fail high” issues is the purpose of tossing those various statistics fields into the database. A “someday” preparation. For after I’ve done this more basic “what have we got?” sanity sweep. (Or for anyone else who wants to run a copy…that’s why I’m sharing… and if anyone is thinking “But I don’t KNOW SQL, MariaDB, or Phython!”, well, I didn’t either when I started this just a few months ago…)

    FWIW, the entire data load now runs error and warning free right up to the last yrcastats step. I’m still working on a more efficient load for the Yearly Country Anomaly Statistics data load. But, as of now, I think I’m done “fooling around” with the rest of the steps. In a little while a new posting with “just the schema and load programs” will be posted to document the changes for anyone else doing this.

    Here’s the result of the “load it all” I ran last night. As some fields changed data type (float to decimal 6,2) the times may be different. I’m not going to check them one by one, but it would be important for anyone on an efficiency kick to see what the impact was.

    This is all done on a Raspberry Pi Model 3 with 1 GB of memory running Devuan version of Debian Linux (so not a lot of hardware needed ;-) . I do have 3 hard disks on it at the moment, though that’s mostly just for my convenience. The database itself is about 8 GB fully populated (and realize each of these tables has some redundancy with the others, I’ve NOT tried to conserve space at all) so would easily fit on a 32 GB or 64 GB micro-SD card and have no hard disks. One disk has the operating system bits ( like /usr/lib /var /tmp) mostly because I’m running from an 8 GB uSD card with TWO different operating systems on it, so only a 4 GB systems partition (legacy reasons…). As MySQL/MariaDB puts the database in /var/iib that rapidly filled up for me. Then things moved “elsewhere”. I moved the database to a free partition on one hard disk and some of the systems space onto another. Finally, I’ve put swap on a third.

    Running these load steps uses a bit over 1 GB of /tmp so it went onto dedicated 2 GB disk partition. As I’m often using a browser to edit web pages while this is running, more memory would help a lot. I regularly run about 300 to 700 MB on swap space, so isolating that activity from the other disk activity is a bit of a win. An ideal workstation for this would have 2 to 4 GB of memory and a 64 GB uSD card so nothing had to roll out to “real disk” at all. Do realize that repeated 1GB write / erase cycles of /tmp on a uSD will burn it out fairly quickly. Not as fast as when 10,000 cycles was the limit (now we’re in the hundreds of thousands) but “after a few years” I have had uSD cards die on me. Another reason I put things like /var and /tmp and swap on “real disks”.

    OK, so all of that said, here’s the run results:

    MariaDB [temps]> source tables/dropall
    Query OK, 0 rows affected (0.21 sec)
    
    Query OK, 0 rows affected (0.05 sec)
    
    Query OK, 0 rows affected (0.15 sec)
    
    Query OK, 0 rows affected (0.07 sec)
    
    Query OK, 0 rows affected (0.35 sec)
    
    Query OK, 0 rows affected (0.68 sec)
    
    Query OK, 0 rows affected (0.19 sec)
    
    Query OK, 0 rows affected (0.14 sec)
    
    Query OK, 0 rows affected (0.30 sec)
    
    Query OK, 0 rows affected (0.08 sec)
    
    Query OK, 0 rows affected (0.11 sec)
    
    MariaDB [temps]> source tables/tables500
    Query OK, 0 rows affected (0.37 sec)
    
    Query OK, 0 rows affected (0.06 sec)
    
    Query OK, 0 rows affected (0.06 sec)
    
    Query OK, 0 rows affected (0.06 sec)
    
    Query OK, 0 rows affected (0.03 sec)
    
    Query OK, 0 rows affected (0.06 sec)
    
    Query OK, 0 rows affected (0.07 sec)
    
    Query OK, 0 rows affected (0.04 sec)
    
    Query OK, 0 rows affected (0.02 sec)
    
    Query OK, 0 rows affected (0.05 sec)
    
    Query OK, 0 rows affected (0.25 sec)
    
    MariaDB [temps]> source bin/LoadTables
    Query OK, 1 row affected (0.40 sec)
    
    Empty set (0.00 sec)
    
    Query OK, 1 row affected (0.02 sec)
    
    Empty set (0.00 sec)
    
    Query OK, 1 row affected (0.04 sec)
    
    Empty set (0.00 sec)
    
    Query OK, 1 row affected (0.01 sec)
    
    Empty set (0.00 sec)
    
    Query OK, 1 row affected (0.03 sec)
    
    Empty set (0.00 sec)
    
    Query OK, 1 row affected (0.02 sec)
    
    Empty set (0.00 sec)
    
    Query OK, 1 row affected (0.00 sec)
    
    Empty set (0.00 sec)
    
    Query OK, 1 row affected (0.02 sec)
    
    Empty set (0.00 sec)
    
    Query OK, 240 rows affected (0.09 sec)               
    Records: 240  Deleted: 0  Skipped: 0  Warnings: 0
    
    Empty set (0.00 sec)
    
    Query OK, 0 rows affected (0.07 sec)
    Records: 0  Duplicates: 0  Warnings: 0
    
    Query OK, 7280 rows affected (0.95 sec)              
    Records: 7280  Deleted: 0  Skipped: 0  Warnings: 0
    
    Empty set (0.00 sec)
    
    Query OK, 27361 rows affected (2.04 sec)             
    Records: 27361  Deleted: 0  Skipped: 0  Warnings: 0
    
    Empty set (0.00 sec)
    
    Query OK, 5554200 rows affected (6 min 10.75 sec)    
    Records: 5554200  Deleted: 0  Skipped: 0  Warnings: 0
    
    Empty set (0.01 sec)
    
    Query OK, 0 rows affected (4 min 23.57 sec)         
    Records: 0  Duplicates: 0  Warnings: 0
    
    Query OK, 0 rows affected (4 min 50.64 sec)         
    Records: 0  Duplicates: 0  Warnings: 0
    
    Query OK, 16811736 rows affected (19 min 56.68 sec)  
    Records: 16811736  Deleted: 0  Skipped: 0  Warnings: 0
    
    Empty set (0.03 sec)
    
    Query OK, 0 rows affected (15 min 6.24 sec)         
    Records: 0  Duplicates: 0  Warnings: 0
    
    Query OK, 87360 rows affected (1 min 53.79 sec)
    Records: 87360  Duplicates: 0  Warnings: 0
    
    Empty set (0.01 sec)
    
    Query OK, 328327 rows affected (29 min 42.17 sec)
    Records: 328327  Duplicates: 0  Warnings: 0
    
    Empty set (0.01 sec)
    
    Query OK, 5278172 rows affected (11 min 0.60 sec)
    Records: 5278172  Duplicates: 0  Warnings: 0
    
    Empty set (0.00 sec)
    
    Query OK, 0 rows affected (3 min 19.33 sec)         
    Records: 0  Duplicates: 0  Warnings: 0
    
    Query OK, 0 rows affected (3 min 45.12 sec)         
    Records: 0  Duplicates: 0  Warnings: 0
    
    Query OK, 0 rows affected (3 min 44.07 sec)         
    Records: 0  Duplicates: 0  Warnings: 0
    
    Query OK, 0 rows affected (3 min 47.60 sec)         
    Records: 0  Duplicates: 0  Warnings: 0
    
    Query OK, 15450484 rows affected (18 min 0.78 sec)
    Records: 15450484  Duplicates: 0  Warnings: 0
    
    Empty set (0.02 sec)
    
    Query OK, 0 rows affected (11 min 41.87 sec)        
    Records: 0  Duplicates: 0  Warnings: 0
    
    Query OK, 0 rows affected (12 min 38.26 sec)        
    Records: 0  Duplicates: 0  Warnings: 0
    
    Query OK, 0 rows affected (12 min 19.45 sec)        
    Records: 0  Duplicates: 0  Warnings: 0
    

    I’m now going to post all those new schema and the associated load scripts in a new posting “soon”. I’m also done with this “soup to nuts” reloading and from here forward will just be working on improving the yrcastats load process. Running the old load, then an overlay of the LRS and LKZ loads does work, but it is just tacky and slow. Once I have a “better way” I’ll add that to the next posting listing.

  9. Larry Ledwick says:

    Yes my cheap little indoor out door thermometer, fails high at 158 deg F if they lead to the outside sensor gets cut (sliding door cut it one time when I slammed it too hard). Since those sensors are basically temperature sensitive resistors, that is something to be expected. They should have an internal fixed value resistor that you can use with a manual test button to verify that the circuitry is still reading properly. Would cost about 10 – 20 cents on a mass production device to add a test or calibration feature.

    Or once a year immerse the sensor in a glass of ice water and verify it settles to 0 deg C (32 deg F) should be part of the weather service operational protocol.

  10. Pingback: GHCN Version Anomaly Database Documentation | Musings from the Chiefio

  11. E.M.Smith says:

    Just a note that I’ve added a “where” statement to the bottom of LRS and LKZ. They did in fact update those two as expected. However, they also zeroed all the others in the process. So I made all the right graphs as each was done during it’s “set” load; but then the mean3 fields were left empty except for whatever was last loaded.

    The added “WHERE” restricts LRS to loading just the RS data and LKZ to loading just the KZ. So All the fields are left full at the end.

    I did NOT update the “loading run statistics” so if you run these you will get a smaller “rows affected” statistic than quoted above.

  12. Pingback: GHCN v3.3 vs v4 – Top Level Entry Point | Musings from the Chiefio

Anything to say?

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.