GHCN Database Pondering A Redo

So I’m going through the existing database for holding GHCN data, producing anomalies, and seeing how they change between versions. It presently handles v3.3 and v4, with each version’s data in its own table, and those tables are almost identical.

So I’m thinking of combining them into one table, with “version” as a key field. Then any number of versions can be loaded without making new tables.

The problem is that parsing (or even just extracting) from One Big Table is slower than from small tables. So will it be “worth it”?

To test this I made a combined temperature data table. Here’s the new schema:

MariaDB [temps]> describe ghcn;
+-----------+--------------+------+-----+---------+-------+
| Field     | Type         | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| stnID     | char(11)     | NO   | PRI | NULL    |       |
| type      | char(4)      | NO   | PRI | NULL    |       |
| version   | char(5)      | NO   | PRI | NULL    |       |
| year      | char(4)      | NO   | PRI | NULL    |       |
| month     | char(4)      | NO   | PRI | NULL    |       |
| ascension | char(10)     | YES  |     | NULL    |       |
| region    | char(1)      | YES  |     | NULL    |       |
| cnum      | char(3)      | YES  |     | NULL    |       |
| abrev     | char(2)      | YES  |     | NULL    |       |
| wmo       | char(5)      | YES  |     | NULL    |       |
| near_wmo  | char(3)      | YES  |     | NULL    |       |
| deg_c     | decimal(6,2) | YES  |     | NULL    |       |
| missing   | char(1)      | YES  |     | NULL    |       |
| qc        | char(1)      | YES  |     | NULL    |       |
| source    | char(1)      | YES  |     | NULL    |       |
| F1        | char(1)      | YES  |     | NULL    |       |
| F2        | char(1)      | YES  |     | NULL    |       |
| F3        | char(1)      | YES  |     | NULL    |       |
| F4        | char(1)      | YES  |     | NULL    |       |
+-----------+--------------+------+-----+---------+-------+

You can see I’ve grouped all the key fields together at the top. There are also both the 4 Fx fields from the version 4 StationID (which sit right after the 2 digit country abbreviation and before what I think is the WMO number; the v4 docs that describe their StationID layout are clearly wrong, so some guessing is required) and the “near_wmo” and the like from the v3 set.

Loading the data went well except for a batch of warnings when loading the v4 data (the v3 load had no warnings):

MariaDB [temps]> source bin/LOAD/Lghcn
Query OK, 16811736 rows affected, 4956 warnings (26 min 22.57 sec)
Records: 16811736  Deleted: 0  Skipped: 0  Warnings: 4956

+---------+------+---------------------------------------------------+
| Level   | Code | Message                                           |
+---------+------+---------------------------------------------------+
| Warning | 1265 | Data truncated for column 'source' at row 1056267 |
| Warning | 1261 | Row 1056267 doesn't contain data for all columns  |
| Warning | 1265 | Data truncated for column 'source' at row 1056268 |
| Warning | 1261 | Row 1056268 doesn't contain data for all columns  |
| Warning | 1265 | Data truncated for column 'source' at row 1056269 |
| Warning | 1261 | Row 1056269 doesn't contain data for all columns  |
| Warning | 1265 | Data truncated for column 'source' at row 1056270 |
[...]
| Warning | 1265 | Data truncated for column 'source' at row 1056296 |
| Warning | 1261 | Row 1056296 doesn't contain data for all columns  |
| Warning | 1265 | Data truncated for column 'source' at row 1056297 |
| Warning | 1261 | Row 1056297 doesn't contain data for all columns  |
| Warning | 1265 | Data truncated for column 'source' at row 1056298 |
| Warning | 1261 | Row 1056298 doesn't contain data for all columns  |
+---------+------+---------------------------------------------------+
64 rows in set (0.01 sec)

So there’s a loose end there. Likely those lines truncate before the last field(s), ahead of the newline, or some such. Basically a dirty data issue.

But all the records make it in.
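If I want to chase that loose end down later, something like this ought to surface a sample of the suspect rows (just a sketch: it assumes the short lines loaded with NULL or blank trailing flags, which is what LOAD DATA does when a row runs out of fields):

-- Sample some v4 rows whose trailing flags came in empty, to eyeball the damage
SELECT stnID, year, month, deg_c, source, F1, F2, F3, F4
FROM ghcn
WHERE version="Gv4" AND (F4 IS NULL OR F4="")
LIMIT 20;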

So what’s the problem? Well, let’s say we just want to count up how many v4 records there are. How long does that take in the old per-version table, and how long in the new combined table?

MariaDB [temps]> SELECT COUNT(deg_C) from temps4;
+--------------+
| COUNT(deg_C) |
+--------------+
|     16811736 |
+--------------+
1 row in set (1 min 41.79 sec)

MariaDB [temps]> SELECT COUNT(deg_C) from ghcn WHERE version="Gv4";
+--------------+
| COUNT(deg_C) |
+--------------+
|     16811736 |
+--------------+
1 row in set (3 min 49.13 sec)

And there’s the basic problem. Over 2× as long just to count them up.

Why? It has to pick them out of the table, so it must look at all the entries, both v3 and v4, and then test the “version” field on each. This kind of performance “hit” will happen on pretty much every kind of processing done against the table, as a simple aggregate of 2 (or more) versions is fairly meaningless. Either you will be searching or aggregating just one of them (so you take the hit), or you will be comparing the two (so you take the hit twice).
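One way to see what the engine is actually doing with that WHERE is to ask for the query plan (EXPLAIN doesn’t run the query, it just describes how it would be run):

-- Show which key, if any, gets used and roughly how many rows get examined
EXPLAIN SELECT COUNT(deg_C) FROM ghcn WHERE version="Gv4";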

OTOH, the raw temperature data is mostly just run on into the Anomalies tables and Statistics tables. Most reporting and graphing comes out of them. Similarly, while I’m running this on a Raspberry Pi M3 at the moment, I expect most folks will be using bigger iron and caring less about the time it takes.

That’s my question. I’ll be kicking it around for a few days. I’m going to make a combined anomalies table too, and populate it. Basically make the thing up to where statistics are applied (but short of the yearly country statistics – yrcastats – as it’s a pain and has complications caused by Kazakhstan and Russia being handled differently in the two versions). By then I ought to know if it is overall making things better, or worse.

But until then, anyone with an opinion, toss it at the wall and see if it sticks…


56 Responses to GHCN Database Pondering A Redo

  1. Peter says:

Have you seen any of the recent work that Nick Stokes has been doing comparing v3 vs v4? Apparently the anomalies are calculated completely differently – i.e. v4 is not an update of v3 so much as a completely different product.
    Nick’s Blog is here (in case you hadn’t already seen it) https://moyhu.blogspot.com/

  2. E.M.Smith says:

I’m not particularly fond of embracing the work of True Believers, and Nick was, shall we say, “dismissive” of me early on (and attacked me for comparing v1 to v3 with rather stupid objections, one being that the way I did First Differences was “non-standard”… so I had to run off and find a reference showing it was OK to do it from either direction…).

    So no, I generally don’t read his stuff.

    Did have an interesting global map he did show up on one web search I’d done, and briefly thought of doing that, but then thought better of it.

    As I understand it, the thing with V4 is they went back to daily records for the basis, not the monthly data. Something like that. But frankly, I don’t really care HOW they did it, what I care about is that they “Changed things capriciously once again” and it too was “ALWAYS making the past colder and the present hotter”.

    I care much more about showing the magnitude and direction of the diddle changes as it illustrates the way they love to re-write history.

Were they REALLY interested in truth, then the exceedingly hot 1936 when most of our actual hot records were set would show up as hot in the history in GHCN. Since it doesn’t, we know it’s a First Class Fraud. There IS an objective reality, we DO have 1st hand recordings from the past, and at every turn, the GHCN “manufacturers” shape and mold the Data Food Product such that it is out of sync with that objective history and record. Simply put: It is unfit for purpose.

    So folks who admire how wonderful a job was done with it, and praise the folks who create it, are not high on my list of “trusted sources”, no matter their coding skill.

In particular, I’ve lived inside a 110 mile radius in California for my entire life, other than a few years working on contracts in Florida (about 5 years total spread over 10). I KNOW what’s normal here and what is not. Simply put: Nothing has really changed. Period.

It’s been hotter (I remember one day of 117 F that’s not been repeated) and it’s been colder (several episodes of snow in the Great Valley – about ’63 and ’73 or so, then one recently). So trying to sell me on the notion that it’s Boiling Hot and We’re ALL GONNA DIE OF HOT!!!!! just sounds like Chicken Little. Frankly, I remember one day of 105 F here about 1983? that’s NOT been repeated since. It was in August. In comparison, this August, I’m not running the AC today. (Nor did I much last year…)

    So listening to the breathless True Believers moaning over how HORRIBLE it is, well, let’s just say it’s hard to swallow. I find it more pleasant not to “go there”.

One last point: IIRC he was in some way connected with the Berkeley effort. Sold as an attempt to make a new clean record. Then what did they do to remove the splice artifacts in the older GHCN? Chop and chip the data into even MORE shorter segments and create a great deal MORE splice artifacts. And erase in one movement all the Good Will I’d held toward Berkeley. Anyone who doesn’t know that Splice Artifacts are EVIL in calorimetry ought not be allowed near either a thermometer or a computer…

  3. Peter says:

All points well taken. Where I am in Australia, I’m noticing it being abnormally cold for this time of year (July/August), which pretty much aligns with the work of Ninderthana (Ian Wilson – http://astroclimateconnection.blogspot.com/) on orbital physics. Back in 2013 he predicted, on the same orbital physics basis, the heat wave Australia had over the 2018/19 summer, and whilst he didn’t say anything about the northern hemisphere, I can’t help wondering if the most recent heat in Europe is from the same phenomenon.

    Meanwhile, Australia’s temperature anomaly for June 2019 is in the negative range by several degrees, and some record cold temps have been recorded very recently (4.8C at Middle Point near Darwin in the tropics?)

  4. NickF says:

    Consider index on some additional fields that are frequently used to separate sheep from goats. Version number, for instance. Possibly region as well. Then the DB need only read relevant records.
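    For instance, something along these lines (illustrative only, not tested against your data):

    -- Example: plain secondary indexes on the fields most often used to filter
    CREATE INDEX idx_version ON ghcn (version);
    CREATE INDEX idx_region  ON ghcn (region);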

    I’m not a DB expert, but fortunately I work with some top-rank practitioners and try to learn from them. My employer is a large company, and I frequently encounter applications – even from big-name vendors – where the database schema or application design flounders at our scale.

    One contractor, working with our team supporting an important package, had previously worked for the package vendor. When he first looked at our production implementation, he gasped, “We never thought it would get anywhere near THAT big!” Tech equivalent of the famous Jaws line, “Looks like we’ll need a bigger boat!”

  5. Larry Ledwick says:

    It only took what – 30 -40 years?

  6. E.M.Smith says:

    @Peter:

    I expect that Australia / New Zealand will be leading the plunge into cooler times. There’s more hysteresis in the solar radiation in the S. H., the Southern Ocean is way cold, and just a minor movement of the currents and winds can change the water and weather of the islands a lot…

    Then the discrepancy between reality and the “official reports” ought to stand out strongly.

    @NickF:

    Sound advice, so you know how database design works. I’d only point out that under “Key” in the above table description you will find “version” has “PRI” for Primary Key. So it already has an index on it. That means the example report is as fast as you can get it with indexes.

    @Larry L:

    40 years, you say? Let’s see, hot shot gets Ph.D. at about 25, gets Government Job… collects Rent Seeking Check until retirement at 65… or right about 40 years….

    New Guy gets hired. Can’t stay on the Ph.Dole if you don’t find something New, Exciting… and simply saying “What the last guy did was enough” gets your department shut down so…. maybe it’s time for a Paradigm Shift! /sarc;

  7. E.M.Smith says:

    I’ve made combined versions of both the mstats and anomalies tables, then scripts to load them, and did the load of statistics. The bulk of the statistics calculations are in v4, as there are a lot more records there. On the Pi M3 the old mstats4 table took just under 1/2 an hour. What’s the time required for the combined data load?

    MariaDB [temps]> describe mstats;
    +---------+--------------+------+-----+---------+-------+
    | Field   | Type         | Null | Key | Default | Extra |
    +---------+--------------+------+-----+---------+-------+
    | stnID   | char(11)     | NO   | PRI | NULL    |       |
    | month   | char(4)      | NO   | PRI | NULL    |       |
    | version | char(5)      | NO   | PRI | NULL    |       |
    | mean    | decimal(6,2) | NO   |     | NULL    |       |
    | big     | decimal(6,2) | NO   |     | NULL    |       |
    | small   | decimal(6,2) | NO   |     | NULL    |       |
    | num     | int(11)      | NO   |     | NULL    |       |
    | trang   | decimal(6,2) | NO   |     | NULL    |       |
    | stdev   | decimal(6,2) | NO   |     | NULL    |       |
    | pctd    | float        | YES  |     | NULL    |       |
    | pctm    | float        | YES  |     | NULL    |       |
    +---------+--------------+------+-----+---------+-------+
    11 rows in set (0.00 sec)
    
    MariaDB [temps]> source bin/LOAD/Lmstats;
    Query OK, 415687 rows affected (1 hour 2 min 50.00 sec)
    Records: 415687  Duplicates: 0  Warnings: 0
    
    Empty set (0.02 sec)
    
    MariaDB [temps]> 
    

    So again about twice as long. Yes, the v3.3 data statistics are also computed in that time, but they are a much smaller share of the total than the v4 records.

    So it’s looking like I can cut the number of tables roughly in half at the cost of roughly doubling the data processing time. Overall it is more efficient with machine resources to keep things in individual tables; but conceptually it is cleaner and simpler to have it all in one table.

    Well, on to the anomalies computation. As this is going rather fast, I think I’ll do the whole implementation and then decide (or maybe just leave them both “out there” for folks to play with).

  8. E.M.Smith says:

    Loading anomalies gets done in two passes. Why? Because at this stage the two versions still do not have a unified country marker: v3 uses the country number and v4 uses the 2 letter abbreviation.

    I likely ought to make an updating script to fill in both markers in the ghcn temperatures table, THEN do the anomalies update in one pass, but not tonight…
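    If I do get to it, a first cut might look something like this (an untested sketch: it leans on the country table having both markers, and on the placeholders the FORTRAN writes, “XX” for abrev in v3 rows and blanks for region / cnum in v4 rows):

    -- Fill the v4 style 2 letter abbreviation into the v3 rows...
    UPDATE ghcn AS T
    INNER JOIN country AS C ON T.cnum = C.cnum
    SET T.abrev = C.abrev
    WHERE T.version = "Gv3";

    -- ...and the v3 style region / country number into the v4 rows.
    UPDATE ghcn AS T
    INNER JOIN country AS C ON T.abrev = C.abrev
    SET T.cnum = C.cnum,
        T.region = C.region
    WHERE T.version = "Gv4";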

    Here’s the run output:

    MariaDB [temps]> source bin/LOAD/Lanomalies
    Query OK, 5278172 rows affected (12 min 1.21 sec)
    Records: 5278172  Duplicates: 0  Warnings: 0
    
    Empty set (0.01 sec)
    
    Query OK, 15450484 rows affected (40 min 59.88 sec)
    Records: 15450484  Duplicates: 0  Warnings: 0
    
    Empty set (0.01 sec)
    

    Not all that bad, really. Here’s the update script / SQL:

    chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Lanomalies
    INSERT INTO  anomalies (stnID,year,month,version,region,abrev,cnum,wmo,near_wmo,deg_C)
    SELECT T.stnID,T.year,T.month,T.version,T.region,C.abrev,T.cnum,T.wmo,T.near_wmo,ROUND(T.deg_C-ST.mean,2)
    FROM ghcn AS T
    INNER JOIN country as C ON T.cnum=C.cnum
    INNER JOIN mstats AS ST 
    ON
            ST.stnID=T.stnID
    	AND ST.month=T.month 
            AND ST.version=T.version
    WHERE T.deg_C > -90.0 AND T.deg_C < 60.0 AND T.version="Gv3"
    ;
    
    show warnings;
    
    INSERT INTO  anomalies (stnID,year,month,version,region,abrev,cnum,wmo,near_wmo,deg_C)
    SELECT T.stnID,T.year,T.month,T.version,C.region,T.abrev,C.cnum,T.wmo,T.near_wmo,ROUND(T.deg_C-ST.mean,2)
    FROM ghcn AS T
    INNER JOIN country as C ON T.abrev=C.abrev
    INNER JOIN mstats AS ST 
    ON
            ST.stnID=T.stnID
        AND ST.month=T.month 
            AND ST.version=T.version
    WHERE T.deg_C > -90.0 AND T.deg_C < 60.0 AND T.version="Gv4"
    ;
    
    show warnings;
    

    So a close inspection will show some fields are T.xxxx in one version’s pass and C.xxxx in the other (being picked up from the Countries table). It was easier to do it this way than to fix the missing data in each version (i.e. make them match on what is used to ID the country). I’ll revisit this sometime tomorrow…

    So at this point I’ve unified the major tables and loaded data into them.

    Guess it’s time to try a few reports / graphs and see if they are easier, or not…

  9. rms says:

    Just for kicks, now that you have one big table (MySQL) and some Python experience, perhaps load all data into a Pandas Dataframe and compare performance. Also, try it in R. Both Pandas and R have, as I understand it, simple ways to import data from MySQL and then some “magic” to get decent performance. It could be interesting and perhaps valuable to compare three tools on the simple benchmarks you show above.

  10. Another Ian says:

    E.M.

    “Basically make the thing up to where statistics are applied (but short of the yearly country statistics – yrcastats – as it’s a pain and has complications caused by Kazakhstan and Russia being handled differently in the two versions). By then I ought to know if it is overall making things better, or worse.”

    Have you looked for another round(s) of “gotchas” in V1 and V2 that might complicate this project?

  11. jim2 says:

    Here is a WUWT article that contains a different approach to comparing versions and also a few criticisms of EM’s approach, although they admit they don’t know why EM got the result he did. It includes the code they used. It would be interesting to QC THEIR work :)

    https://wattsupwiththat.com/2012/06/22/comparing-ghcn-v1-and-v3/

  12. Neil Fisher says:

    EM: “Sound advice, so you know how database design works. I’d only point out that under “Key” in the above table description you will find “version” has “PRI” for Primary Key. So it already has an index on it. That means the example report is as fast as you can get it with indexes.”

    Err, not quite – it means it forms part of a unique record identifier.
    Create an index anyway; I’m pretty sure you will like the improvement.

  13. p.g.sharrow says:

    @EMSmith; The question is where do you go from here?
    You began this quest to determine the how and why this AGW thing came into being based on the weather records. A computer forensic examination of the official Data Base by means available to anyone with skill and a limited budget. With a few hundred dollars and a lot of talent you have succeeded marvelously in creating a desktop super computer emulator from off the shelf parts!
    The creation of this “tool” has been of far more interest to me than the evaluation of a data base that I have known to be flawed from the start. Sometimes the path we walk is more important than our original destination.
    You began this quest, to write a paper on your evaluation of the usefulness of the weather data records for determination of “Global Climate Change”, ……. ok……… But what is next?
    I would say, classes on the creation and use of your “Emulator” hold a greater long term value to the world than any paper on the poor quality of the weather data base for projecting human caused climate changes. Just my nickel’s worth of opinion…pg

  14. Larry Ledwick says:

    Interesting look at a European heat wave in 1881

  15. E.M.Smith says:

    @rms:

    An interesting idea. Unfortunately I think it is likely to measure my low skill level with Python more than the relative performance. Then, for R, I have never written one line of it, so coming up from “scratch” is not a good place to be for performance comparisons….

    It is an interesting idea, to ask “What if the creation of statistics and anomalies were done in data frames instead of the database”… I’m pretty sure I’d start hitting swap issues on this hardware though. With 1 GB of memory I’ve already got to close the browser when doing things in the MySQL / MariaDB. (Right now I’ve got 308 MB of swap in use…)

    Observing these “load table” programs run, there are red D disk wait tags showing up in HTOP a lot, so I think it is disk limited much of the time. Likely why the times double on things that are not very heavy on the computes. It is ranging over a physically larger patch of disk and reading more of it (even if some of the records are not used due to being for the ‘other’ version). Those longer movements add unavoidable time…

    Which brings me to what I’d do for more performance: More memory and USB 3.0 disks.

    Each step is a little different, and some of the load data steps do have one core pegged at 100% some of the time (a lot of the time) while others are D disk wait and memory is at the limit (or beyond it if you do anything else…) 2 GB or more of memory helps significantly. When run on the systems with a USB 3.0 disk, the disk wait issues reduce a lot (though they come back with enough CPU speed… things are a balancing act.) Once you reach the A72 and A73 cores with big fat heat sinks on them, CPU is no longer limiting and you are back at D Disk Wait limiting.

    What does all this have to do with Python or R dataframes? Where do they ‘live’? Memory or Disk. IF in memory, it will end up swapping to disk anyway. The v4 data with tabs in it (and redundant copies of things like Station ID) takes up 1.4 GB on disk. I suspect that data frames do not store the data in a tree hierarchy with key fields, but in a square array with redundant field copies for each row; so this flat file representation of the data is likely representative. That means that at a minimum at any one time 1/3 of the data (and likely 1/2 as the OS and software take some space along with the graphics partition – my 1 GB has 867 MB actually available for programs, the other 128 MB being for the GPU) will be on swap space.

    We already know that the weak spots of the Pi are the USB I/O channel and heat management. So stuffing more workload onto the USB channel (even before taking the hard disk hits…) is a poor way to go. Then, to the extent Python or R can use more than one core effectively, we will start hitting heat limiting issues.

    So while it would be a fun way to torture this small SBC, I suspect what we’d find out is what we already know: It is too small to use on 1.5 to 2 GB of data in memory.

    So: To properly do what you suggest, I’d need to use one of the boards with lots of memory and fast USB 3.0 disk. And good I/O channel designs (and faster memory type, too ;-) That would be the Odroid N2 or the RockPro64. At that point I think it could be very interesting. There’s enough memory and I/O speed to hold that much data in core, and the load / store operations run at USB 3.0 speed; then with 6 to 8 cores and some of them A72 or A73 (the A73 has much better heat management but otherwise similar) you will not be CPU limited much. There I think it would be a very interesting question to ask “What happens if you put it all in memory in a data structure?”

    But to get there, I need to learn some R. Likely on this platform and with the data in tables…

    However: If you want to, say, code up the R and Python to take the temperature data, and calculate the statistics and anomalies from it, I’ll happily run it for you to get the results. The MariaDB table schema are above (and I can post the various Load Programs too) so anything you need to know about the data structures available are “good to go”.

    @Another Ian:

    Yeah, I have. v2 is essentially identical in structure to v3 and even uses the same StationIDs. It ought to be trivial to add (mostly involving me finding the cleanest, newest copy of it… and writing a load data step by copying the v3 one and changing the v3 file locations to v2). The v1 is a bit more complicated. It is much much shorter in length, and the countries and StationID are different (yet again). There WILL be complications with it, if for no other reason than you have things like the fall of the USSR changing what countries exist over much of the Eurasian space. The StationID looks like it is just shorter. IIRC it’s just the WMO number “unadorned” by the “mod flag” of v2 / v3 and without the move to alphanumeric “stuff” in v4. I think I can map that to v2 / v3 stationID but “we’ll see”.

    I can still do a lot with them, as the v1 (and some v2…) country changes only really show up at the point where I’m doing “by country” statistics. That yrcastats file and the reports from it. What I’m more likely to do once those two are “in the mix” is go to LAT / LONG boxes. Not the 8k or 16,000 of them smeared together in “homogenizing” that the “climate scientists” do, but larger boxes that capture, say, 1/4 of a continent or all of the Caribbean. That ought to give interesting and very valid comparisons of the data itself.
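    As a very rough sketch of what a “boxes” report might look like (purely illustrative: it assumes a station inventory table, here called invent, holding lat and lon per stnID, which is not one of the tables shown in this posting):

    -- 30 degree lat/lon boxes: yearly average anomaly per box, per version (sketch only)
    SELECT FLOOR(I.lat / 30) AS latbox,
           FLOOR(I.lon / 30) AS lonbox,
           A.version,
           A.year,
           ROUND(AVG(A.deg_C), 2) AS avg_anom
    FROM anomalies AS A
    INNER JOIN invent AS I ON A.stnID = I.stnID
    GROUP BY latbox, lonbox, A.version, A.year;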

    @Jim2:

    Yeah, that’s the “food fight” where Nick and fellow travelers got pissy at me. I essentially re-did the whole thing with a couple of variations showing the result stayed the same with changes of direction of First Differences AND with using a Baseline approach.

    Do note that the methods used then are quite different from what I’m doing now. That was all FORTRAN and this is database based, for one example. Then I used First Differences primarily. Here I’m using “all data” as the “baseline” (though can easily change to any other baseline period). In all cases the results are the same or very similar anyway. The past gets cooled… and the “present” end of each data set gets comparatively warmer…

    If you look at the picture up top, you will see some “pages” under it as headings. One is V1 vs V3 and is an index into a lot of that stuff using a non-database approach:
    https://chiefio.wordpress.com/v1vsv3/

    So there’s a lot more than the one article to work through…

    But do note that v3 is already a cooked version compared to v1 so finding v4 is cooked relative to v3 is finding an ongoing pattern of frog warming…

    @Neil Fisher:

    What I read said you get an index on primary key fields. I’m willing to test that assumption by tossing an index on it; but what I’d done before had indexes slowing down initial data load steps and it was faster to just load the data, then make the index…

    Lord Knows I’m not a MariaDB Expert. ( I’d say I’m about “New Journeyman” at this point) So not going to dispute anything I’ve not tested.

    https://mariadb.com/kb/en/library/getting-started-with-indexes/


    There are four main kinds of indexes; primary keys (unique and not null), unique indexes (unique and can be null), plain indexes (not necessarily unique) and full-text indexes (for full-text searching).

    The terms ‘KEY’ and ‘INDEX’ are generally used interchangeably, and statements should work with either keyword.

    Now I could easily be wrong in my ASSUMPTION that there is a key on each of the components of a composite primary key, and it could be that the only index is on the composite, but that seems a silly way to handle a composite key… (Then again, I’ve often had my expectations bent by reality as someone clearly decided on a shorter path to being done…) So let’s take a look:

    MariaDB [temps]> SHOW INDEX FROM mstats;
    +--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    | Table  | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
    +--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    | mstats |          0 | PRIMARY  |            1 | stnID       | A         |      404502 |     NULL | NULL   |      | BTREE      |         |               |
    | mstats |          0 | PRIMARY  |            2 | month       | A         |      404502 |     NULL | NULL   |      | BTREE      |         |               |
    | mstats |          0 | PRIMARY  |            3 | version     | A         |      404502 |     NULL | NULL   |      | BTREE      |         |               |
    +--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    3 rows in set (0.00 sec)
    
    MariaDB [temps]> 
    

    This scrolls way off the right side, but it sure looks to me like it is saying there are three indexes made from the composite key definition.

    @P.G.:

    Well, what I’m doing Right Now is something of a “clean up” step. The only real “Magic” in the Silicon Valley approach to things (that you see in Steve Jobs & Woz or in Elon Musk today) is that they are:

    a) Willing to fail to learn something important.
    b) Do a rapid “try” to do the learning knowing it is likely to fail.
    c) Use that information to move faster than The Other Guy to get First Mover Advantage.
    d) Once you know how it ought to be done, throw out what you did and redo it the right way.

    This is usually called the Rapid Prototyping model of development. You saw a lot of this in W.W.II developments when it was called the Yankee “Can Do” attitude… Just keep pounding on it until it works and do not ever get into Analysis Paralysis… (That is the exclusive turf of Government and mega-corporations… and they don’t like the competition ;-)

    So I’ve done a, b, and c in making the first version of all this. Now I’m at the point where I’m taking the Lessons Learned and doing some final tests (things like how much does it slow stuff down to use combined tables for a simpler design; and finding a replacement for FORTRAN in the initial data load IF that is reasonable…) preparatory for step “d” – doing it over “the right way”…

    Sometimes “d” is just polish on what you did the first time (Apple IIc vs Apple I) and sometimes it’s a complete “dump it and try again” (Space X first rockets that were blown up… vs now landing back on the launch pads reliably).

    So I figure that in about a month, maybe less, I’ll be comfortable that this is “polished enough” and I’ll have a simple “start to end” that anyone can do by being a “Script Kiddie” (of any age ;-) and without the need to install a FORTRAN compiler ;-) That’s the end state write up point.

    Of course “I could be wrong” and something will “blow up on the pad” while trying to load v2 and then you are back at “c”…

    Oh, also of note: I’m not real keen on how I handled the re-integration of Russia and Kazakhstan. Too many tables and too many steps. Yeah, it works, but it’s a bit ugly. So that’s a step that needs more polish too.

    The “good news” is that I’m over the “cringe” reaction at the idea of hitting this again. Making all those gazillion graphs was kind of a marathon effort and resulted in some “burn out” at the end. It takes a bit of time to come back from that… (So I spent too much time watching YouTube videos and posting them ;-) But I’m back in it now…

    @All:

    I am very interested in personal preference opinions on multiple tables vs longer run times.

    Do folks prefer having tables like:

    temps1, temps2, temps3, temps4
    mstats1, mstats2, mstats3, mstats4
    anom1, anom2, anom3, anom4

    and run times that are 1/2 to 1/3 as long (but still in the “minutes” range, up to about 1/2 hour)

    or tables like:

    ghcn
    mstats
    anomalies

    And run times that are longer (and some over an hour, but only done once) AND with the need to include “WHERE version=V2” or similar in many queries?

    It will work either way. The second is “cleaner” but slower. The first is faster, but more stuff laying around…

  16. jim2 says:

    EM – One thing noted in the brief WUWT article is that station IDs do not necessarily refer to the same station between versions. They used lat-long, which sounds like a reasonable method to avoid bad IDs. Of course, you have to depend on the lat-long being correct, but that’s the case for all data.
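    For instance, something like this could pair them up (illustrative only: it assumes per-version inventory tables, say invent3 and invent4, each with lat and lon columns, and rounds the coordinates so small precision differences don’t break the match):

    -- Match v3 and v4 stations on rounded coordinates rather than on their IDs
    SELECT I3.stnID AS v3_id, I4.stnID AS v4_id
    FROM invent3 AS I3
    INNER JOIN invent4 AS I4
      ON  ROUND(I3.lat, 2) = ROUND(I4.lat, 2)
      AND ROUND(I3.lon, 2) = ROUND(I4.lon, 2);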

  17. E.M.Smith says:

    Well this is interesting. Building an index now takes a lot more space. I am adding one to the ghcn (temperatures) table to test it vs the Primary Key based index(es). I almost ran out of room on my 2 GB /tmp partition (which was sized at almost double the 1 GB needed before…):

    Filesystem     1K-blocks    Used Available Use% Mounted on
    /dev/sda9        2086912 2029868     57044  98% /tmp
    

    A “close run thing”…

    So “note to self”: IF doing the combined fewer tables version, make /tmp 4 GB or larger…

    It has dropped back to 97% used at this point and hopefully that’s the high water mark as the index creation runs. I’ll post an update when (if?) it finishes.
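    It is also worth knowing where MariaDB is putting that index-build scratch space; a quick check (the tmpdir server setting is what would get pointed at a bigger partition if /tmp itself can’t grow):

    -- Where does the server write its temporary work files during index builds?
    SHOW VARIABLES LIKE 'tmpdir';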

  18. rms says:

    I’m giving it a shot. I downloaded all years of GHCN daily data (1763 to present!) with ‘curl’ called for each file, and am now loading it into a “big” dataframe that I’m saving, to reload and play with later (if possible). We’ll see what happens.

    from pathlib import Path
    import pandas as pd

    fn_data=”/volumes/archive/climate/ghcn_csv/”
    df_file=”climate_dataframe”
    print(fn_data)
    f=Path(fn_data).glob(‘*.gz’)
    col=[‘ID’,’YYYYMMDD’,’Element’,’Datavalue’,’M-Flag’,’Q-Flag’,’S-Flag’,’OBS-Time’]
    mydateparser = lambda x: pd.datetime.strptime(x, “%Y%m%d”)

    datatype={
    ‘id’: ‘object’,\
    ‘YYYYMMDD’: “object”,\
    ‘Element’:’object’, \
    ‘Datavalue’:”float64″,\
    ‘M-Flag’:”object”,\
    ‘Q-Flag’:”object”,\
    ‘S-Flag’:”object”,\
    ‘OBS-Time’:”object”}

    df = pd.read_csv(\
    fn_data+’2019.csv.gz’, \
    names=col,\
    header=None,\
    compression=’gzip’,\
    sep=’,’, \
    quotechar='”‘,\
    parse_dates=[‘YYYYMMDD’],\
    date_parser=mydateparser)
    print(df)
    print(df.describe())
    cnt=0
    for filename in f:
    if filename != fn_data+’2019.csv.gz’:
    cnt+=1
    if cnt<1000:
    print("\n",cnt,filename)
    #df1 = pd.read_csv(filename, names=col,header=None,compression='gzip', sep=',', quotechar='"')
    df1 = pd.read_csv(\
    filename, \
    names=col,\
    header=None,\
    compression='gzip',\
    sep=',', \
    quotechar='"',\
    parse_dates=['YYYYMMDD'],\
    date_parser=mydateparser)
    print(df1.describe())
    df.append(df1)
    #df.to_pickle(fn_data+df_file)
    print(df.describe())
    df.to_pickle(df_file)

  19. rms says:

    tabs for the python code didn’t work. sigh.

  20. E.M.Smith says:

    Well…. That wasn’t expected….

    I made the index and then re-ran the query from above (count of entries for v4) and the run time nearly doubled(!).

    So first, create the added index:

    MariaDB [temps]> CREATE INDEX versionx ON ghcn(version);
    Query OK, 0 rows affected (30 min 2.41 sec)         
    Records: 0  Duplicates: 0  Warnings: 0
    

    That took 1/2 an hour, so right there, if you only use it once, it’s a big up-front cost.

    Then I redid the same query:

    MariaDB [temps]> SELECT COUNT(deg_C) FROM ghcn WHERE version="Gv4";
    +--------------+
    | COUNT(deg_C) |
    +--------------+
    |     16811736 |
    +--------------+
    1 row in set (6 min 10.49 sec)
    

    Above (pre-index) it was:

    1 row in set (3 min 49.13 sec)
    

    I did notice a lot of red D disk wait notices (about 1/2 the time) in “htop”.

    My interpretation of this is that the added index was AFTER all the data in the table, so the program was doing seeks from the data to the index to the data to the index…

    But that’s just a guess. I suppose it is also possible that having the same field in the PRIMARY KEY and in a discrete index could cause some kind of decision loop in the program. A need to “pick an index” on each record? Who knows…
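    One thing that might settle it would be to re-run the count with the new index forced and compare times (just a thought; with only two distinct values in “version”, the optimizer may well prefer the full scan on its own, and EXPLAIN would show which it picked):

    -- Same count, but with the new secondary index forced, just to time the difference
    SELECT COUNT(deg_C) FROM ghcn FORCE INDEX (versionx) WHERE version="Gv4";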

    FWIW, here’s the indexes post run:

    MariaDB [temps]> SHOW INDEX FROM ghcn;
    +-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
    +-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    | ghcn  |          0 | PRIMARY  |            1 | stnID       | A         |      521564 |     NULL | NULL   |      | BTREE      |         |               |
    | ghcn  |          0 | PRIMARY  |            2 | type        | A         |      521564 |     NULL | NULL   |      | BTREE      |         |               |
    | ghcn  |          0 | PRIMARY  |            3 | version     | A         |      521564 |     NULL | NULL   |      | BTREE      |         |               |
    | ghcn  |          0 | PRIMARY  |            4 | year        | A         |     4172518 |     NULL | NULL   |      | BTREE      |         |               |
    | ghcn  |          0 | PRIMARY  |            5 | month       | A         |     4172518 |     NULL | NULL   |      | BTREE      |         |               |
    | ghcn  |          1 | versionx |            1 | version     | A         |         756 |     NULL | NULL   |      | BTREE      |         |               |
    +-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    6 rows in set (0.69 sec)
    

    I’m going to now remove the versionx index and re-run the selection and see if the results return to the original values… (Why we test things…)

  21. E.M.Smith says:

    @rms:

    One of my complaints about Python… it is alignment dependent. You can use the “pre” and “/pre” tags to preserve the layout. So <pre>, then your code, then </pre>, and it ought to be OK.

    Oh, and remember that any < needs to be replaced with &lt; and any > similarly with &gt;.

    (And if you want to know how I got that stuff to show up, I have a posting on that…)

  22. rms says:

    Maybe you could give me a dump of your database and I’ll mess around with Python and Pandas and see what I can do. Like you, I’m not an expert, but I know enough, and I have interest.

  23. E.M.Smith says:

    Yet more unexpected:

    MariaDB [temps]> DROP INDEX versionx ON ghcn;
    Query OK, 0 rows affected (1.79 sec)
    Records: 0  Duplicates: 0  Warnings: 0
    
    MariaDB [temps]> SELECT COUNT(deg_C) FROM ghcn WHERE version="Gv4";
    +--------------+
    | COUNT(deg_C) |
    +--------------+
    |     16811736 |
    +--------------+
    1 row in set (4 min 11.08 sec)
    
    MariaDB [temps]> 
    

    I did notice that the red D disk wait on mysql was almost constantly showing. My interpretation of this is that in the prior run it would “sink out” during some longer process step (exactly what is unknown, but perhaps during seeks from the data to the index and back, as then the USB driver isn’t the limit, it is just waiting…) but who knows.

    Now it was just constantly sucking up a stream of data, waiting for more as the CPU ran ahead of the disk…

    The surprise is that the 4 minutes+ is not 3 minutes nor 6 minutes… One wonders.

    I’ll need to re-run it after a fresh start and see if having the browser open causes some kind of arguing over swap space / memory…

    In any case, I think this shows that with a Primary Key Only you get best performance, and adding another index on a field already in the Primary Key is slower.

    I’ve run into lots and lots of things like that over the decades where what you would expect (the same time to run or better) doesn’t happen. That’s why I’ll typically say “I expect” and not “it will” and “why we test”…. What I believe is “for amusement only”, it is what I’ve tested that matters… Something Warmistas ought to remember, but don’t…

  24. rms says:
    from pathlib import Path
    import pandas as pd
    
    fn_data="/volumes/archive/climate/ghcn_csv/"
    df_file="climate_dataframe"
    print(fn_data)
    f=Path(fn_data).glob('*.gz')
    col=['ID','YYYYMMDD','Element','Datavalue','M-Flag','Q-Flag','S-Flag','OBS-Time']
    mydateparser = lambda x: pd.datetime.strptime(x, "%Y%m%d")
    
    datatype={
        'id': 'object',\
        'YYYYMMDD': "object",\
        'Element':'object', \
        'Datavalue':"float64",\
        'M-Flag':"object",\
        'Q-Flag':"object",\
        'S-Flag':"object",\
        'OBS-Time':"object"}
    
    df = pd.read_csv(\
        fn_data+'2019.csv.gz', \
        names=col,\
        header=None,\
        compression='gzip',\
        sep=',', \
        quotechar='"',\
        parse_dates=['YYYYMMDD'],\
        date_parser=mydateparser)
    print(df)
    print(df.describe())
    cnt=0
    for filename in f:
        if filename != fn_data+'2019.csv.gz':
            cnt+=1
            if cnt<1000:
                print("\n",cnt,filename)
                #df1 = pd.read_csv(filename, names=col,header=None,compression='gzip', sep=',', quotechar='"')
                df1 = pd.read_csv(\
                    filename, \
                    names=col,\
                    header=None,\
                    compression='gzip',\
                    sep=',', \
                    quotechar='"',\
                    parse_dates=['YYYYMMDD'],\
                    date_parser=mydateparser)          
                print(df1.describe())
                df = df.append(df1)   # append returns a new frame; keep the result
    #df.to_pickle(fn_data+df_file)
    print(df.describe())
    df.to_pickle(df_file)
    
  25. E.M.Smith says:

    @rms:

    What is a “dump of my database” as commands? Schema? Load scripts? Binary blob? ;-)

    I suspect just one table would be “enough” for testing purposes.
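    If a flat file of one table would do the job, something like this gets it out of MariaDB (a sketch; the output path is just an example, and the server’s secure_file_priv setting has to allow writing there):

    -- Dump the monthly statistics table as a tab separated file for outside tools
    SELECT * FROM mstats
    INTO OUTFILE '/tmp/mstats.tsv'
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n';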

  26. E.M.Smith says:

    Here’s the code to make the tables:

    chiefio@PiM3Devuan2:~/SQL/tables$ cat ghcn
    CREATE TABLE  ghcn (
        stnID     CHAR(11) NOT NULL,
        type      CHAR(4)  NOT NULL,
        version   CHAR(5)  NOT NULL,
        year      CHAR(4)  NOT NULL,
        month     CHAR(4)  NOT NULL,
        ascension CHAR(10),
        region    CHAR(1),
        cnum      CHAR(3),
        abrev     CHAR(2),
        wmo       CHAR(5),
        near_wmo  CHAR(3),
        deg_c     DECIMAL(6,2),
        missing   CHAR(1),
        qc        CHAR(1),
        source    CHAR(1),
        F1        CHAR(1),
        F2        CHAR(1),
        F3        CHAR(1),
        F4        CHAR(1),
        PRIMARY KEY (stnID,type,version,year,month)
        ) ;
    chiefio@PiM3Devuan2:~/SQL/tables$ cat mstats
    CREATE TABLE mstats (
        stnID   CHAR(11)     NOT NULL,
        month   CHAR(4)      NOT NULL,
        version CHAR(5)      NOT NULL,
        mean    DECIMAL(6,2) NOT NULL,
        big     DECIMAL(6,2) NOT NULL,
        small   DECIMAL(6,2) NOT NULL,
        num     INTEGER      NOT NULL,
        trang   DECIMAL(6,2) NOT NULL,
        stdev   DECIMAL(6,2) NOT NULL,
        pctd    FLOAT,
        pctm    FLOAT,
        PRIMARY KEY (stnID,month,version)
        ) ;
    chiefio@PiM3Devuan2:~/SQL/tables$ cat anomalies
    CREATE TABLE anomalies (
        stnID    CHAR(11)     NOT NULL,
        year     CHAR(4)      NOT NULL,
        month    CHAR(4)      NOT NULL,
        version  CHAR(5)      NOT NULL,
        region   CHAR(1),
        abrev    CHAR(2),
        cnum     CHAR(3),
        wmo      CHAR(5),
        near_wmo CHAR(3),
        deg_C    DECIMAL(6,2) NOT NULL,
        PRIMARY KEY (stnID,year,month,version)
        ) ;
    chiefio@PiM3Devuan2:~/SQL/tables$ 
    
  27. E.M.Smith says:

    Here’s the SQL to load the tables. You would need to replace the path names, which point to where I have the formatted input data, to match your disk locations:

    chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Lghcn
    LOAD DATA LOCAL INFILE '/SG500/xfs/chiefio/SQL/v3/ghcn.tabs' INTO TABLE ghcn;
    
    show warnings;
    
    LOAD DATA LOCAL INFILE '/SG500/xfs/chiefio/SQL/v4/ghcn.tabs' INTO TABLE ghcn;
    
    show warnings;
    
    
    chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Lmstats 
    INSERT INTO  mstats (stnID,month,version,mean,big,small,num,trang,stdev)
    SELECT stnID,month,version,
    ROUND(AVG(deg_C),2),MAX(deg_C),MIN(deg_C),COUNT(deg_C), MAX(deg_C)-MIN(deg_C), ROUND(STDDEV(deg_C),2)
    FROM  ghcn
    WHERE deg_C>-90.0  AND deg_C < 60.0
    GROUP BY version,stnID,month;
    
    show warnings;
    
    
    chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Lanomalies 
    INSERT INTO  anomalies (stnID,year,month,version,region,abrev,cnum,wmo,near_wmo,deg_C)
    SELECT T.stnID,T.year,T.month,T.version,T.region,C.abrev,T.cnum,T.wmo,T.near_wmo,ROUND(T.deg_C-ST.mean,2)
    FROM ghcn AS T
    INNER JOIN country as C ON T.cnum=C.cnum
    INNER JOIN mstats AS ST 
    ON
            ST.stnID=T.stnID
    	AND ST.month=T.month 
            AND ST.version=T.version
    WHERE T.deg_C > -90.0 AND T.deg_C < 60.0 AND T.version="Gv3"
    ;
    
    show warnings;
    
    INSERT INTO  anomalies (stnID,year,month,version,region,abrev,cnum,wmo,near_wmo,deg_C)
    SELECT T.stnID,T.year,T.month,T.version,C.region,T.abrev,C.cnum,T.wmo,T.near_wmo,ROUND(T.deg_C-ST.mean,2)
    FROM ghcn AS T
    INNER JOIN country as C ON T.abrev=C.abrev
    INNER JOIN mstats AS ST 
    ON
            ST.stnID=T.stnID
    	AND ST.month=T.month 
            AND ST.version=T.version
    WHERE T.deg_C > -90.0 AND T.deg_C < 60.0 AND T.version="Gv4"
    ;
    
    show warnings;
    chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ 
    
  28. cdquarles says:

    @rms,
    I want to be sure of my interpretation here. Your code is R, correct? If so, I should be able to use it. I can, if I have to, use Chief’s FORTRAN, maybe ported into a language I already have on my system, to load a version of his RDBMS SQL into the system that I have already. Then I can just do the box-whisker plots (and thus drop the necessity of any anomaly), by year or by date.

  29. E.M.Smith says:

    Then there’s the FORTRAN to turn the fixed format data layout into TAB delimited. I’ve not yet re-written this one in bash (or anything else…) and it is particularly ugly, what with using brute force repeats of the text instead of a loop with variables, but it is what it is. There is one WRITE block for each month, where the text month (like “JAN”) is hard coded and the index on the arrays read in changes (1, then 2, then…). There is one program each for v3.3 and v4.

    This is one of the bits I really really want to clean up. Note that the word “preen” in these file names just denotes that I deleted some “diagnostic prints” and their FORMAT statements that had not been converted to the new layout of fields.

    chiefio@PiM3Devuan2:~/SQL/v3$ cat 3ghcn_preen.f 
    C FORTRAN to read the station data files v3 GHCN file and insert Tabs
    C in the output.  Also divides "country" into Continent and Country
    C
    C Variable declarations...
    C
          CHARACTER * 1 TAB
          CHARACTER * 1 CONT
          CHARACTER * 2 COUNTRY
          CHARACTER * 5 WMO
          CHARACTER * 3 MOD
          CHARACTER * 4 YEAR
          CHARACTER * 4 TYPE
          CHARACTER * 5 VERSION
          CHARACTER * 10 ASCEN
          CHARACTER * 1 M(1:12), Q(1:12), S(1:12)
          CHARACTER * 5 T(1:12)
          CHARACTER * 11 stationID
    
          REAL F(1:12)
    C
    C Set the TAB character
          TAB=CHAR(09)
    C
    C Set some constants
          VERSION="Gv3"
          ASCEN="7Sept2015"
    C
    C Read in one line of data...
    C
        9 OPEN(1, FILE='temps.in', STATUS='OLD', ACTION='READ')
       10 READ (1, 11, END=99) CONT, COUNTRY, WMO, MOD, YEAR, TYPE,         &
         &T(1),M(1),Q(1),S(1),T(2),M(2),Q(2),S(2),T(3),M(3),Q(3),S(3),      &
         &T(4),M(4),Q(4),S(4),T(5),M(5),Q(5),S(5),T(6),M(6),Q(6),S(6),      &
         &T(7),M(7),Q(7),S(7),T(8),M(8),Q(8),S(8),T(9),M(9),Q(9),S(9),      &
         &T(10),M(10),Q(10),S(10),T(11),M(11),Q(11),S(11),T(12),M(12),Q(12),&
         &S(12)
    C
       11 FORMAT (A1,A2,A5,A3,A4,A4,12(A5,A1,A1,A1))
    C
    C Make Station ID by concatenation of the parts
    C
          stationID=CONT//COUNTRY//WMO//MOD
    C
    C Convert CHAR temp to Float Temp and adjust decimal point
          DO 20, I=1,12
             READ (T(I),*,END=20) F(I)
             F(I)=F(I)/100.0
       20 CONTINUE
    C  The Do loop from above
    
    C Write out one line of data with TAB between fields
    C
          WRITE (6, 8) StationID,TAB,TYPE,TAB,VERSION,TAB,YEAR,TAB,"JAN",TAB &
         &,ASCEN,TAB,CONT,TAB,CONT,COUNTRY,TAB,"XX",TAB,WMO,TAB,MOD,TAB,     &
         &F(1),TAB,M(1),TAB, Q(1),TAB, S(1),TAB,"",TAB,"",TAB,"",TAB,""
    
          WRITE (6, 8) StationID,TAB,TYPE,TAB,VERSION,TAB,YEAR,TAB,"FEB",TAB &
         &,ASCEN,TAB,CONT,TAB,CONT,COUNTRY,TAB,"XX",TAB,WMO,TAB,MOD,TAB,     &
         &F(2),TAB,M(2),TAB, Q(2),TAB, S(2),TAB,"",TAB,"",TAB,"",TAB,""
    
          WRITE (6, 8) StationID,TAB,TYPE,TAB,VERSION,TAB,YEAR,TAB,"MAR",TAB &
         &,ASCEN,TAB,CONT,TAB,CONT,COUNTRY,TAB,"XX",TAB,WMO,TAB,MOD,TAB,     &
         &F(3),TAB,M(3),TAB, Q(3),TAB, S(3),TAB,"",TAB,"",TAB,"",TAB,""
    
          WRITE (6, 8) StationID,TAB,TYPE,TAB,VERSION,TAB,YEAR,TAB,"APR",TAB &
         &,ASCEN,TAB,CONT,TAB,CONT,COUNTRY,TAB,"XX",TAB,WMO,TAB,MOD,TAB,     &
         &F(4),TAB,M(4),TAB, Q(4),TAB, S(4),TAB,"",TAB,"",TAB,"",TAB,""
    
          WRITE (6, 8) StationID,TAB,TYPE,TAB,VERSION,TAB,YEAR,TAB,"MAY",TAB &
         &,ASCEN,TAB,CONT,TAB,CONT,COUNTRY,TAB,"XX",TAB,WMO,TAB,MOD,TAB,     &
         &F(5),TAB,M(5),TAB, Q(5),TAB, S(5),TAB,"",TAB,"",TAB,"",TAB,""
    
          WRITE (6, 8) StationID,TAB,TYPE,TAB,VERSION,TAB,YEAR,TAB,"JUNE",   &
         &TAB,ASCEN,TAB,CONT,TAB,CONT,COUNTRY,TAB,"XX",TAB,WMO,TAB,MOD,TAB,  &
         &F(6),TAB,M(6),TAB, Q(6),TAB, S(6),TAB,"",TAB,"",TAB,"",TAB,""
    
          WRITE (6, 8) StationID,TAB,TYPE,TAB,VERSION,TAB,YEAR,TAB,"JULY",  &
         &TAB,ASCEN,TAB,CONT,TAB,CONT,COUNTRY,TAB,"XX",TAB,WMO,TAB,MOD,TAB, &
         &F(7),TAB,M(7),TAB, Q(7),TAB, S(7),TAB,"",TAB,"",TAB,"",TAB,""
    
          WRITE (6, 8) StationID,TAB,TYPE,TAB,VERSION,TAB,YEAR,TAB,"AUG",TAB &
         &,ASCEN,TAB,CONT,TAB,CONT,COUNTRY,TAB,"XX",TAB,WMO,TAB,MOD,TAB,     &
         &F(8),TAB,M(8),TAB, Q(8),TAB, S(8),TAB,"",TAB,"",TAB,"",TAB,""
    
          WRITE (6, 8) StationID,TAB,TYPE,TAB,VERSION,TAB,YEAR,TAB,"SEPT",   &
         &TAB,ASCEN,TAB,CONT,TAB,CONT,COUNTRY,TAB,"XX",TAB,WMO,TAB,MOD,TAB,  &
         &F(9),TAB,M(9),TAB, Q(9),TAB, S(9),TAB,"",TAB,"",TAB,"",TAB,""
    
          WRITE (6, 8) StationID,TAB,TYPE,TAB,VERSION,TAB,YEAR,TAB,"OCT",TAB &
         &,ASCEN,TAB,CONT,TAB,CONT,COUNTRY,TAB,"XX",TAB,WMO,TAB,MOD,TAB,     &
         &F(10),TAB,M(10),TAB, Q(10),TAB, S(10),TAB,"",TAB,"",TAB,"",TAB,""
    
          WRITE (6, 8) StationID,TAB,TYPE,TAB,VERSION,TAB,YEAR,TAB,"NOV",TAB &
         &,ASCEN,TAB,CONT,TAB,CONT,COUNTRY,TAB,"XX",TAB,WMO,TAB,MOD,TAB,     &
         &F(11),TAB,M(11),TAB, Q(11),TAB, S(11),TAB,"",TAB,"",TAB,"",TAB,""
    
          WRITE (6, 8) StationID,TAB,TYPE,TAB,VERSION,TAB,YEAR,TAB,"DEC",TAB &
         &,ASCEN,TAB,CONT,TAB,CONT,COUNTRY,TAB,"XX",TAB,WMO,TAB,MOD,TAB,     &
         &F(12),TAB,M(12),TAB, Q(12),TAB, S(12),TAB,"",TAB,"",TAB,"",TAB,""
    
        8 FORMAT (A11,A1, A4,A1, A5,A1, A4,A1, A4,A1, A10,A1, A1,A1,         &
         &A1,A2,A1, A2,A1, A5,A1, A3,A1,                                     &
         &F7.2,A1, A1,A1, A1,A1, A1,A1, A1,A1, A1,A1, A1,A1, A1 )
    
    C Retrieve another line of data...
    C
          GO TO 10
    C
    C If end of file, then stop...
    C
       99 STOP
          END
    
    chiefio@PiM3Devuan2:~/SQL/v4$ cat 4ghcn_preen.f 
    C FORTRAN to read the station data files v4 GHCN file and insert Tabs
    C in the output.  Also divides "country" into ISTI abrev, flags, WMO
    C
    C Variable declarations...
    C
          CHARACTER * 1 TAB
          CHARACTER * 2 ABREV
          CHARACTER * 1 F1,F2,F3,F4
          CHARACTER * 5 WMO
          CHARACTER * 4 YEAR
          CHARACTER * 4 TTYPE
          CHARACTER * 5 VERSION
          CHARACTER * 10 ASCEN
          CHARACTER * 1 M(1:12), Q(1:12), S(1:12)
          CHARACTER * 5 T(1:12)
          CHARACTER * 11 stationID
    
          REAL F(1:12)
    C
    C Set the TAB character
          TAB=CHAR(09)
    C
    C Set some constants
          VERSION="Gv4"
          ASCEN="17Mar2019"
    C
    C Read in one line of data...
    C
        9 OPEN(1, FILE='temps.in', STATUS='OLD', ACTION='READ')
       10 READ (1, 11, END=99) ABREV,F1,F2,F3,F4,WMO,YEAR,TTYPE,            &
         &T(1),M(1),Q(1),S(1),T(2),M(2),Q(2),S(2),T(3),M(3),Q(3),S(3),      &
         &T(4),M(4),Q(4),S(4),T(5),M(5),Q(5),S(5),T(6),M(6),Q(6),S(6),      &
         &T(7),M(7),Q(7),S(7),T(8),M(8),Q(8),S(8),T(9),M(9),Q(9),S(9),      &
         &T(10),M(10),Q(10),S(10),T(11),M(11),Q(11),S(11),T(12),M(12),Q(12),&
         &S(12)
    C
       11 FORMAT (A2,A1,A1,A1,A1,A5,A4,A4,12(A5,A1,A1,A1))
    C
    C Make Station ID by concatenation of the parts
    C
          stationID=ABREV//F1//F2//F3//F4//WMO
    C
    C Convert CHAR temp to Float Temp and adjust decimal point
          DO 20, I=1,12
             READ (T(I),*,END=20) F(I)
             F(I)=F(I)/100.0
       20 CONTINUE
    C  The Do loop from above
    
    C Write out one line of data with TAB between fields
    C
    
    C2345*789112345678921234567893123456789412345678951234567896123456789712*
    
          WRITE (6, 8) StationID,TAB,TTYPE,TAB,VERSION,TAB,YEAR,TAB,"JAN",  &
         &TAB,ASCEN,TAB, "",TAB,"",TAB, ABREV,TAB, WMO,TAB,"",TAB,          &
         &F(1),TAB, M(1),TAB, Q(1),TAB, S(1),TAB, F1,TAB, F2,TAB, F3,TAB, F4&
    
          WRITE (6, 8) StationID,TAB,TTYPE,TAB,VERSION,TAB,YEAR,TAB,"FEB",  &
         &TAB,ASCEN,TAB, "",TAB,"",TAB, ABREV,TAB, WMO,TAB,"",TAB,          &
         &F(2),TAB, M(2),TAB, Q(2),TAB, S(2),TAB, F1,TAB, F2,TAB, F3,TAB, F4&
    
          WRITE (6, 8) StationID,TAB,TTYPE,TAB,VERSION,TAB,YEAR,TAB,"MAR",  &
         &TAB,ASCEN,TAB, "",TAB,"",TAB, ABREV,TAB, WMO,TAB,"",TAB,          &
         &F(3),TAB, M(3),TAB, Q(3),TAB, S(3),TAB, F1,TAB, F2,TAB, F3,TAB, F4&
    
          WRITE (6, 8) StationID,TAB,TTYPE,TAB,VERSION,TAB,YEAR,TAB,"APR",  &
         &TAB,ASCEN,TAB, "",TAB,"",TAB, ABREV,TAB, WMO,TAB,"",TAB,          &
         &F(4),TAB, M(4),TAB, Q(4),TAB, S(4),TAB, F1,TAB, F2,TAB, F3,TAB, F4&
    
          WRITE (6, 8) StationID,TAB,TTYPE,TAB,VERSION,TAB,YEAR,TAB,"MAY",  &
         &TAB,ASCEN,TAB, "",TAB,"",TAB, ABREV,TAB, WMO,TAB,"",TAB,          &
         &F(5),TAB, M(5),TAB, Q(5),TAB, S(5),TAB, F1,TAB, F2,TAB, F3,TAB, F4&
    
          WRITE (6, 8) StationID,TAB,TTYPE,TAB,VERSION,TAB,YEAR,TAB,"JUNE", &
         &TAB,ASCEN,TAB, "",TAB,"",TAB, ABREV,TAB, WMO,TAB,"",TAB,          &
         &F(6),TAB, M(6),TAB, Q(6),TAB, S(6),TAB, F1,TAB, F2,TAB, F3,TAB, F4&
    
          WRITE (6, 8) StationID,TAB,TTYPE,TAB,VERSION,TAB,YEAR,TAB,"JULY", &
         &TAB,ASCEN,TAB, "",TAB,"",TAB, ABREV,TAB, WMO,TAB,"",TAB,          &
         &F(7),TAB, M(7),TAB, Q(7),TAB, S(7),TAB, F1,TAB, F2,TAB, F3,TAB, F4&
    
          WRITE (6, 8) StationID,TAB,TTYPE,TAB,VERSION,TAB,YEAR,TAB,"AUG",  &
         &TAB,ASCEN,TAB, "",TAB,"",TAB, ABREV,TAB, WMO,TAB,"",TAB,          &
         &F(8),TAB, M(8),TAB, Q(8),TAB, S(8),TAB, F1,TAB, F2,TAB, F3,TAB, F4&
    
          WRITE (6, 8) StationID,TAB,TTYPE,TAB,VERSION,TAB,YEAR,TAB,"SEPT", &
         &TAB,ASCEN,TAB, "",TAB,"",TAB, ABREV,TAB, WMO,TAB,"",TAB,          &
         &F(9),TAB, M(9),TAB, Q(9),TAB, S(9),TAB, F1,TAB, F2,TAB, F3,TAB, F4&
    
          WRITE (6, 8) StationID,TAB,TTYPE,TAB,VERSION,TAB,YEAR,TAB,"OCT",  &
         &TAB,ASCEN,TAB, "",TAB,"",TAB, ABREV,TAB, WMO,TAB,"",TAB,          &
         &F(10),TAB, M(10),TAB, Q(10),TAB, S(10),TAB, F1,TAB, F2,TAB,       &
         &F3,TAB, F4
    
          WRITE (6, 8) StationID,TAB,TTYPE,TAB,VERSION,TAB,YEAR,TAB,"NOV",  &
         &TAB,ASCEN,TAB, "",TAB,"",TAB, ABREV,TAB, WMO,TAB,"",TAB,          &
         &F(11),TAB, M(11),TAB, Q(11),TAB, S(11),TAB, F1,TAB, F2,TAB,       &
         &F3,TAB, F4
    
          WRITE (6, 8) StationID,TAB,TTYPE,TAB,VERSION,TAB,YEAR,TAB,"DEC",  &
         &TAB,ASCEN,TAB, "",TAB,"",TAB, ABREV,TAB, WMO,TAB,"",TAB,          &
         &F(12),TAB, M(12),TAB, Q(12),TAB, S(12),TAB, F1,TAB, F2,TAB,       &
         &F3,TAB, F4
    
    
    C2345*789112345678921234567893123456789412345678951234567896123456789712*
    
        8 FORMAT (A11,A1, A4,A1, A5,A1, 2(A4,A1), A10,A1, A1,A1, A3,A1,     & 
         & A2,A1, A5,A1, A3,A1, F7.2,A1, 6(A1,A1),A1)
    
    C Retrieve another line of data...
    C
          GO TO 10
    C
    C If end of file, then stop...
    C
       99 STOP
          END
    

    That’s pretty much the whole bit so far (of the new version). Just need to download the data (that you already know how to do) and unpack it.

  30. E.M.Smith says:

    @CDQuarles:

    The code RMS posted looks like Python to me.

    As you can see from inspection of the FORTRAN it is fairly trivial. Just reading in fixed format from the GHCN data file off their tarball, and putting tabs in it. There is a tiny bit of changing the order of fields to match the database order (so the load step becomes simpler without field specifiers) and computing a COS of one field. There is also a bit of “break into parts” and “concatenate into a whole”, but that’s also trivial.
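
    If seeing it outside of FORTRAN helps, here is a minimal Python sketch of the same trick for just one month (the sample line is made up; the slice positions match the fixed-width layout the FORTRAN reads):

    # Minimal sketch: slice a fixed-width GHCN record and emit one
    # tab separated line, for January only. The sample input is illustrative.
    line = "101603550001878TAVG  890"          # ID, year, element, Jan value (hundredths C)
    stn_id  = line[0:11]                       # station ID
    year    = line[11:15]                      # year
    element = line[15:19]                      # e.g. TAVG
    deg_c   = int(line[19:24]) / 100.0         # hundredths of a degree C to whole degrees
    print("\t".join([stn_id, element, year, "JAN", f"{deg_c:.2f}"]))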

    I can likely re-write this in bash in about 30 minutes, or in any of several other languages… except I don’t write R yet :-{

    So you are on what kind of computer (linux, Mac, Windoz)? And with what languages on it?

  31. E.M.Smith says:

    Oh, and I’m using the same COUNTRY and CONTINENT tables from before documented in prior postings.

    This SQL script:

    chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Lcountry 
    LOAD DATA LOCAL INFILE '/SG500/xfs/chiefio/SQL/v4/country-tabs' 
    INTO TABLE country (region,cnum,abrev,cname)
    SET version="Gv4",ascension="17MAR2019";
    
    show warnings;
    

    Loads into this table:

    chiefio@PiM3Devuan2:~/SQL/tables$ cat country 
    CREATE TABLE country (
        cnum CHAR(3) NOT NULL,
        abrev   CHAR(2),
        region CHAR(1) NOT NULL,
        cname  CHAR(60),
        version CHAR(5),
        ascension CHAR(10),
        PRIMARY KEY(cnum)
        ) ;
    

    This hand made mapping of Abbreviation to Country Number:

    chiefio@PiM3Devuan2:~/SQL/v4$ cat country-tabs 
    4	426	AC	Antigua and Barbuda 
    2	230	AE	United Arab Emirates 
    2	201	AF	Afghanistan
    1	101	AG	Algeria 
    6	604	AJ	Azerbaijan 
    6	601	AL	Albania
    6	602	AM	Armenia 
    1	102	AO	Angola 
    5	521	AQ	American Samoa [United States] 
    3	301	AR	Argentina 
    5	501	AS	Australia 
    6	603	AU	Austria 
    7	700	AY	Antarctica 
    2	202	BA	Bahrain 
    4	401	BB	Barbados 
    1	104	BC	Botswana 
    4	427	BD	Bermuda [United Kingdom] 
    6	606	BE	Belgium 
    4	423	BF	Bahamas, The 
    2	203	BG	Bangladesh
    4	402	BH	Belize
    6	607	BK	Bosnia and Herzegovina 
    3	302	BL	Bolivia 
    2	216	BM	Burma
    1	103	BN	Benin 
    6	605	BO	Belarus 
    5	512	BP	Solomon Islands 
    3	303	BR	Brazil
    6	608	BU	Bulgaria
    5	522	BX	Brunei
    1	106	BY	Burundi
    4	403	CA	Canada 
    2	204	CB	Cambodia
    1	110	CD	Chad 
    2	224	CE	Sri Lanka 
    1	112	CF	Congo (Brazzaville) 
    1	154	CG	Congo (Kinshasa)
    2	205	CH	China  
    3	304	CI	Chile 
    4	429	CJ	Cayman Islands [United Kingdom] 
    5	524	CK	Cocos (Keeling) Islands [Australia] 
    1	107	CM	Cameroon 
    1	111	CN	Comoros
    3	305	CO	Colombia 
    5	535	CQ	Northern Mariana Islands [United States] 
    4	405	CS	Costa Rica 
    1	109	CT	Central African Republic 
    3	406	CU	Cuba 
    1	108	CV	Cape Verde
    5	525	CW	Cook Islands [New Zealand]
    6	610	CY	Cyprus 
    6	612	DA	Denmark 
    1	114	DJ	Dijibouti
    4	430	DO	Dominica
    4	407	DR	Dominican Republic 
    3	306	EC	Ecuador 
    1	115	EG	Egypt 
    6	621	EI	Ireland 
    1	199	EK	Equatorial Guinea
    6	613	EN	Estonia 
    1	116	ER	Eritrea 
    4	408	ES	El Salvador 
    1	117	ET	Ethiopia 
    1	198	EU	Europa Island [France]
    6	611	EZ	Czech Republic 
    3	315	FG	French Guiana [France] 
    6	614	FI	Finland 
    5	502	FJ	Fiji
    3	316	FK	Falkland Islands (Islas Malvinas) [United Kingdom] 
    5	527	FM	Federated States of Micronesia 
    5	528	FP	French Polynesia 
    6	615	FR	France 
    1	143	FS	French Southern and Antarctic Lands [France]
    1	150	GA	Gambia, The 
    1	118	GB	Gabon 
    6	616	GG	Georgia
    1	119	GH	Ghana
    6	653	GI	Gibraltar [United Kingdom]
    4	409	GJ	Grenada
    6	699	GK	Guernsey
    4	431	GL	Greenland [Denmark] 
    6	617	GM	Germany 
    4	432	GP	Guadeloupe [France] 
    5	529	GQ	Guam [United States] 
    6	618	GR	Greece 
    4	410	GT	Guatemala 
    1	120	GV	Guinea 
    3	307	GY	Guyana 
    4	411	HA	Haiti
    2	299	HK	Hong Kong
    4	412	HO	Honduras 
    6	609	HR	Croatia
    6	619	HU	Hungary 
    6	620	IC	Iceland 
    5	503	ID	Indonesia 
    6	698	IM	Isle of Man
    2	207	IN	India 
    1	161	IO	British Indian Ocean Territory [United Kingdom] 
    2	208	IR	Iran 
    6	622	IS	Israel 
    6	623	IT	Italy 
    1	113	IV	Cote D'Ivoire 
    2	209	IZ	Iraq 
    2	210	JA	Japan 
    6	693	JE	Jersey
    4	413	JM	Jamaica 
    6	697	JN	Jan Mayen [Norway] 
    6	624	JO	Jordan
    5	530	JQ	Johnston Atoll [United States] 
    1	197	JU	Juan De Nova Island [France]
    1	122	KE	Kenya 
    2	213	KG	Kyrgyzstan 
    2	206	KN	Korea, North 
    5	504	KR	Kiribati 
    2	221	KS	Korea, South 
    5	523	KT	Christmas Island [Australia] 
    2	212	KU	Kuwait 
    6	625	K!	Kazakhstan
    2	211	KZ	Kazakhstan 
    2	214	LA	Laos 
    6	627	LE	Lebanon
    6	626	LG	Latvia 
    6	628	LH	Lithuania
    1	123	LI	Liberia 
    6	641	LO	Slovakia 
    5	599	LQ	Palmyra Atoll [United States] 
    6	696	LS	Liechtenstein
    1	162	LT	Lesotho 
    6	629	LU	Luxembourg 
    1	124	LY	Libya 
    1	125	MA	Madagascar 
    4	433	MB	Martinique [France]
    2	234	MC	Macau S.A.R
    6	631	MD	Moldova 
    1	163	MF	Mayotte [France]
    2	215	MG	Mongolia 
    6	695	MH	Montserrat
    1	126	MI	Malawi
    6	632	MJ	Montenegro 
    6	648	MK	Macedonia 
    1	127	ML	Mali 
    1	130	MO	Morocco 
    1	129	MP	Mauritius 
    5	598	MQ	Midway Islands [United States} 
    1	128	MR	Mauritania 
    6	630	MT	Malta 
    2	218	MU	Oman 
    2	235	MV	Maldives 
    4	414	MX	Mexico 
    5	505	MY	Malaysia 
    1	131	MZ	Mozambique 
    5	532	NC	New Caledonia [France] 
    5	533	NE	Niue [New Zealand]
    5	534	NF	Norfolk Island [Australia]
    1	133	NG	Niger 
    5	520	NH	Vanuatu 
    1	134	NI	Nigeria
    6	633	NL	Netherlands 
    6	634	NO	Norway 
    2	217	NP	Nepal 
    5	506	NR	Nauru
    3	312	NS	Suriname
    4	434	NT	Netherlands Antilles [Netherlands]
    4	415	NU	Nicaragua 
    5	507	NZ	New Zealand 
    3	308	PA	Paraguay 
    5	537	PC	Pitcairn Islands [United Kingdom] 
    3	309	PE	Peru 
    2	219	PK	Pakistan 
    6	635	PL	Poland 
    4	416	PM	Panama 
    6	636	PO	Portugal 
    5	508	PP	Papua New Guinea 
    5	536	PS	Palau
    1	121	PU	Guinea-Bissau
    2	220	QA	Qatar
    1	165	RE	Reunion [France]
    6	639	RI	Serbia
    5	531	RM	Marshall Islands 
    6	637	RO	Romania 
    5	509	RP	Philippines 
    4	435	RQ	Puerto Rico [United States] 
    2	222	RS	Russia 
    6	638	R!	Russia
    1	166	RW	Rwanda
    2	223	SA	Saudi Arabia 
    4	438	SB	Saint Pierre and Miquelon [France]
    4	417	SC	Saint Kitts and Nevis
    1	138	SE	Seychelles 
    1	141	SF	South Africa 
    1	137	SG	Senegal 
    1	196	SH	Saint Helena [United Kingdom] 
    6	642	SI	Slovenia 
    1	139	SL	Sierra Leone 
    5	511	SN	Singapore
    1	140	SO	Somalia
    6	643	SP	Spain 
    4	436	ST	Saint Lucia 
    1	148	SU	Sudan 
    6	694	SV	Svalbard [Norway] 
    6	645	SW	Sweden 
    3	317	SX	South Georgia & the South Sandwich Islands [United Kingdom]
    6	647	SY	Syria 
    6	646	SZ	Switzerland 
    4	424	TD	Trinidad and Tobago 
    1	168	TE	Tromelin Island [France]
    2	228	TH	Thailand 
    2	227	TI	Tajikistan 
    5	538	TL	Tokelau [New Zealand] 
    5	517	TN	Tonga 
    1	151	TO	Togo 
    1	136	TP	Sao Tome and Principe
    1	152	TS	Tunisia 
    5	597	TT	Timor-Leste
    6	649	TU	Turkey 
    5	518	TV	Tuvalu 
    2	236	TW	Taiwan
    2	229	TX	Turkmenistan 
    1	149	TZ	Tanzania 
    1	153	UG	Uganda
    6	651	UK	United Kingdom 
    6	650	UP	Ukraine 
    4	425	US	United States 
    1	105	UV	Burkina Faso 
    3	313	UY	Uruguay 
    2	231	UZ	Uzbekistan 
    4	437	VC	Saint Vincent and the Grenadines
    3	314	VE	Venezuela 
    2	232	VM	Vietnam 
    4	440	VQ	Virgin Islands [United States] 
    1	132	WA	Namibia 
    5	540	WF	Wallis and Futuna [France]
    1	169	WI	Western Sahara 
    5	539	WQ	Wake Island [United States]
    5	541	WS	Samoa
    1	167	WZ	Swaziland 
    9	900	XX	Unknown
    2	233	YM	Yemen
    1	155	ZA	Zambia 
    1	156	ZI	Zimbabwe 
    chiefio@PiM3Devuan2:~/SQL/v4$ 
    

    Continent is also a hand done thing, and pretty trivial (and not that used really):

    chiefio@PiM3Devuan2:~/SQL/tables$ cat continent 
    CREATE TABLE continent (
        version CHAR(5) NOT NULL,
        ascension CHAR (10),
        region CHAR(1) NOT NULL,
        region_name CHAR(25)
        ) ;
    

    Gets loaded with:

    chiefio@PiM3Devuan2:~/SQL/bin/LOAD$ cat Lcontinent 
    INSERT INTO continent VALUES ('Gv3','7Sept2015','1','Africa');
    show warnings;
    INSERT INTO continent VALUES ('Gv3','7Sept2015','2','Asia');
    show warnings;
    INSERT INTO continent VALUES ('Gv3','7Sept2015','3','South America');
    show warnings;
    INSERT INTO continent VALUES ('Gv3','7Sept2015','4','North America');
    show warnings;
    INSERT INTO continent VALUES ('Gv3','7Sept2015','5','Australia Pacific Islands');
    show warnings;
    INSERT INTO continent VALUES ('Gv3','7Sept2015','6','Europe');
    show warnings;
    INSERT INTO continent VALUES ('Gv3','7Sept2015','7','Antarctica');
    show warnings;
    INSERT INTO continent VALUES ('Gv3','7Sept2015','8','Ship Stations Ocean');
    show warnings;
    

    I’m likely to remove the version and ascension fields for the simple reason that the definition of continents seems to not change between versions at all.

  32. rms says:

    @cdquarles. the code is Python (ver 3)

    One small error: for the two creations of dataframes (df and df1), after the date parser spec, add the dtype, which I forgot in what I pasted:

                    date_parser=mydateparser,\
                    dtype=datatype)      
    

    Pandas read_csv for a dataframe documented at
    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
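
    For anyone following along, here is a minimal sketch of the kind of read_csv call I mean (the file name, column names, date format, and dtypes are illustrative placeholders, not the exact ones from my script):

    import pandas as pd

    # Placeholders: the file name, column names, date format, and dtypes
    # are illustrative, not the exact ones from my script.
    mydateparser = lambda d: pd.to_datetime(d, format="%Y%m%d")
    datatype = {"ID": str, "ELEMENT": str, "DATA_VALUE": float}

    df = pd.read_csv("2019.csv",
                     names=["ID", "DATE", "ELEMENT", "DATA_VALUE"],
                     parse_dates=["DATE"],
                     date_parser=mydateparser,
                     dtype=datatype)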

    @emsmith

    A “dump” of a database puts all the tables, schemas, and the data into a so-called “dump” file. Normally used for backup and for transferring to another server.

    https://dev.mysql.com/doc/refman/5.7/en/mysqldump-sql-format.html

    Best to dump the entire database, if possible, and put it on some cloud file server somewhere and I can grab it. Do you see my email in my postings?

    Give me what and as much as you can and I’ll play around it as I have some time over the next week or so. I’ll stick with loading it into dataframe(s) and see if I can replicate what you have timed above.

    Just for kicks, of course! We’ll see where this goes.

  33. cdquarles says:

    Eh, that’s what I get for being old. Thanks Chief and rms. I am on a PC that’s rather old now (still on the old BIOS boot system), and have both Windows and Linux on it. I’m pretty rusty on the Linux, since it is rather cranky about some stuff, where Windows just works (note, the reverse is often true, too). I have dabbled in both bash and powershell :).

  34. E.M.Smith says:

    @Jim2:

    Very UNfortunately, there is NO unique identifier for a station. The whole thing is just STUFFED with splice artifacts and the risk of splice artifacts.

    LAT & LONG can be wrong, or used for several different stations NEAR that point over time.

    WMO number is used for a given WMO station OR stations “near” it. Denoted with a modifier digit in v2 and v3 but not in v1 or v4.

    The instrument hardware and exact location (like Stevenson screen over grass off to the side 100 yards vs MMTS over concrete 100 ft from the terminal building near the jetway) can change with nothing indicated.

    The whole thing is a dog’s breakfast of instrument change, splice artifacts, missing data, estimated crap, site changes, UHI, bad sites, and lousy or entirely missing records of changes. That ought to cause any reasonable observer to declare it “not fit for purpose” for climate research and planetary calorimetry upon first inspection….

    Short form: The data are crap and the metadata crappier. Useless crap.

    But yeah, figuring out which station is which is a major problem and I strongly doubt they fixed that with LAT / LONG.

  35. E.M.Smith says:

    @rms:

    Ah, full binary blob dump…

    I don’t have any “cloud server” accounts anywhere. Know of any that don’t want money OR too much identity information?

    Yes, email addresses show up in the site admin pages.

  36. E.M.Smith says:

    @CDQuarles:

    I’ll make a bash version of the above FORTRAN later today…

  37. rms says:

    @Emsmith … send me an email and I’ll reply and connect you to a place where you can copy the dump to. I have a number of cloud file share accounts.

    A “dump” file is not a blob, nor is it binary. It’s a text file (unless there are binary data fields, in which case I guess those bits will be binary). It’s a self-contained file having everything. It’s a real basic thing to know about SQL server DBs, so you might want to explore just a bit on exporting then importing.

    You may want to use the “MySQL Workbench” app which is freely available.

  38. E.M.Smith says:

    I think we are using words a bit differently.

    You use blob in the sense of inscrutable “binary blob” as in binary blob device drivers.
    I’m using it in the sense of “big bag of bits” that can also be structured ascii or other “text” (that kind of by definition is also binary encoded).

    At some point, numeric data types and things like image types will end up encoded into some format for the dump. Doesn’t matter much if MariaDB does it as straight binary machine words, an ASCII encoding, or even EBCDIC (Extended Binary Coded Decimal Interchange Code IIRC… old IBM coding). It still is a bag-o-bits.

    So if my calling it a binary blob misled you, substitute bag-o-bits.

  39. E.M.Smith says:

    Oh, and I’m out running around the rest of the day. It will be a day or two before I can service email queues and dump / upload things. As the above SQL create table scripts and load scripts complete in minutes, it would be a lot faster to just use them to load the data you already downloaded… though more steps and a FORTRAN compile (though I’ve finished sketching out the bash version, it will be tonight before I can type it in…)

  40. rms says:

    I’ve been messing around with the by_year data in Pandas, then saving as a “pickle” file to enable experimentation without having to reimport, and also saving to MySQL to see what I can do to make it work. I don’t have enough free disk space on my iMac to import all the data, nor do I think I could handle your dump file. So hold off doing that. I’m going to work out how to create a MySQL instance where the data resides on one of my spare USB hard disks… project for next week. Ping me an email anyway and when I have something which shows how I did all this in Python, Pandas, and MySQL, I’ll share with you in case of interest.
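
    (The pickle step itself is just the standard pandas round trip; the file name and columns below are stand-ins:)

    import pandas as pd

    df = pd.DataFrame({"ID": ["USW00094728"], "TMAX": [31.1]})   # stand-in data
    df.to_pickle("by_year_sample.pkl")         # save once after the slow CSV import
    df = pd.read_pickle("by_year_sample.pkl")  # reload instantly in later sessions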

  41. E.M.Smith says:

    Well, my present database would be way too large as it has both sets of tables. Was planning to do a complete clean and reload of just one set… but I’ll hold off now.

  42. E.M.Smith says:

    @rms:

    You can just send it to my generic blog inbox: pub numerl4 all ATSIGN aol DOT com (no spaces in it…)

    @ALL:

    Well, I finished the debugging on the script for turning v3 data into tab delimited. It’s still a bit ugly as I’m just brute forcing 12 lines (one for each month). It is also rather surprisingly slow. Surprising since I’m not getting red D “disk wait” on the process in “htop” nor is the CPU core loaded up. Mostly running about 16% to 18% or so for the process, about 25%-30% overall. So what’s slowing it down?

    I think it is the interrupt to call bc for the temperature calculation (convert string of digits to float with 2 digits of precision and divided by 100).
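
    One quick way to put a number on that hunch is to time bare process spawns from Python (a rough sketch; the 12 calls per line come from the script below, and the v3 file has roughly 460,000 station-year lines):

    import subprocess, time

    # Time N trivial subprocess launches to estimate the fork/exec cost
    # (assumes a POSIX system with /bin/true present).
    N = 200
    t0 = time.time()
    for _ in range(N):
        subprocess.run(["/bin/true"])
    per_call = (time.time() - t0) / N
    # The bash script spawns bc 12 times per input line; call it ~460,000 lines.
    print(f"{per_call * 1000:.1f} ms per spawn; "
          f"~{per_call * 12 * 460000 / 3600:.1f} hours of spawning for the v3 file")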

    Whatever. It isn’t my preferred language for things like this (bulk file handling with math and string processing) and I’m mostly doing it as an “anyone with *nix can do this” option without FORTRAN or some other language folks may not know.

    Here’s the (admittedly) ugly code:

    chiefio@PiM3Devuan2:~/SQL/v3$ cat ghcn.sh
    #!/bin/bash
    ascen="7Sept2015"
    version="Gv3.3"
    while IFS='' read -r Z; do
    degc=`echo "scale=2; ${Z:19:5} / 100" | bc -l`
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "$version" "$ascen" "${Z:0:1}" "${Z:0:3}" "${Z:3:5}" "${Z:8:3}" "${Z:11:4}" "JAN" "${Z:15:4}" "$degc" "${Z:24:1}" "${Z:25:1}" "${Z:26:1}" 
    
    degc=`echo "scale=2; ${Z:27:5} / 100" | bc -l`
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "$version" "$ascen" "${Z:0:1}" "${Z:0:3}" "${Z:3:5}" "${Z:8:3}" "${Z:11:4}" "FEB" "${Z:15:4}" "$degc" "${Z:32:1}" "${Z:33:1}" "${Z:34:1}" 
    
    degc=`echo "scale=2; ${Z:35:5} / 100" | bc -l`
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "$version" "$ascen" "${Z:0:1}" "${Z:0:3}" "${Z:3:5}" "${Z:8:3}" "${Z:11:4}" "MAR" "${Z:15:4}" "$degc" "${Z:40:1}" "${Z:41:1}" "${Z:42:1}" 
    
    degc=`echo "scale=2; ${Z:43:5} / 100" | bc -l`
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "$version" "$ascen" "${Z:0:1}" "${Z:0:3}" "${Z:3:5}" "${Z:8:3}" "${Z:11:4}" "APR" "${Z:15:4}" "$degc" "${Z:48:1}" "${Z:49:1}" "${Z:50:1}" 
    
    degc=`echo "scale=2; ${Z:51:5} / 100" | bc -l`
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "$version" "$ascen" "${Z:0:1}" "${Z:0:3}" "${Z:3:5}" "${Z:8:3}" "${Z:11:4}" "MAY" "${Z:15:4}" "$degc" "${Z:56:1}" "${Z:57:1}" "${Z:58:1}" 
    
    degc=`echo "scale=2; ${Z:59:5} / 100" | bc -l`
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "$version" "$ascen" "${Z:0:1}" "${Z:0:3}" "${Z:3:5}" "${Z:8:3}" "${Z:11:4}" "JUNE" "${Z:15:4}" "$degc" "${Z:64:1}" "${Z:65:1}" "${Z:66:1}" 
    
    degc=`echo "scale=2; ${Z:67:5} / 100" | bc -l`
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "$version" "$ascen" "${Z:0:1}" "${Z:0:3}" "${Z:3:5}" "${Z:8:3}" "${Z:11:4}" "JULY" "${Z:15:4}" "$degc" "${Z:72:1}" "${Z:73:1}" "${Z:74:1}" 
    
    degc=`echo "scale=2; ${Z:75:5} / 100" | bc -l`
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "$version" "$ascen" "${Z:0:1}" "${Z:0:3}" "${Z:3:5}" "${Z:8:3}" "${Z:11:4}" "AUG" "${Z:15:4}" "$degc" "${Z:80:1}" "${Z:81:1}" "${Z:82:1}" 
    
    degc=`echo "scale=2; ${Z:83:5} / 100" | bc -l`
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "$version" "$ascen" "${Z:0:1}" "${Z:0:3}" "${Z:3:5}" "${Z:8:3}" "${Z:11:4}" "SEPT" "${Z:15:4}" "$degc" "${Z:88:1}" "${Z:89:1}" "${Z:90:1}" 
    
    degc=`echo "scale=2; ${Z:91:5} / 100" | bc -l`
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "$version" "$ascen" "${Z:0:1}" "${Z:0:3}" "${Z:3:5}" "${Z:8:3}" "${Z:11:4}" "OCT" "${Z:15:4}" "$degc" "${Z:96:1}" "${Z:97:1}" "${Z:98:1}" 
    
    degc=`echo "scale=2; ${Z:99:5} / 100" | bc -l`
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "$version" "$ascen" "${Z:0:1}" "${Z:0:3}" "${Z:3:5}" "${Z:8:3}" "${Z:11:4}" "NOV" "${Z:15:4}" "$degc" "${Z:104:1}" "${Z:105:1}" "${Z:106:1}" 
    
    degc=`echo "scale=2; ${Z:107:5} / 100" | bc -l`
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "$version" "$ascen" "${Z:0:1}" "${Z:0:3}" "${Z:3:5}" "${Z:8:3}" "${Z:11:4}" "DEC" "${Z:15:4}" "$degc" "${Z:112:1}" "${Z:113:1}" "${Z:114:1}" 
    
    done < "$1"
    

    The one for Version 4 will be along shortly…

    (I wonder if there is some language common for programming on both *nix and MicroSoft platforms that would be suitable… What do people use in the MS world, anyway? That isn’t a MS Proprietary Thing…)

  43. E.M.Smith says:

    Here’s the GHCN v4 version:

    chiefio@PiM3Devuan2:~/SQL/v4$ cat 4ghcn.sh 
    #!/bin/bash
    ascen="17Mar2019"
    version="Gv4"
    tab=\t
    while IFS='' read -r Z; do
    
    degc=`echo "scale=2; ${Z:19:5} / 100" | bc -l`
    
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "${Z:0:2}" "${Z:2:1}" "${Z:3:1}" "${Z:4:1}" "${Z:5:1}" "${Z:6:5}" "${Z:11:4}" "${Z:15:4}" "$version" "$ascen" "JAN" "$degc" "${Z:24:1}" "${Z:25:1}" "${Z:26:1}"
    
    degc=`echo "scale=2; ${Z:27:5} / 100" | bc -l`
    
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "${Z:0:2}" "${Z:2:1}" "${Z:3:1}" "${Z:4:1}" "${Z:5:1}" "${Z:6:5}" "${Z:11:4}" "${Z:15:4}" "$version" "$ascen" "FEB" "$degc" "${Z:32:1}" "${Z:33:1}" "${Z:34:1}" 
    
    degc=`echo "scale=2; ${Z:35:5} / 100" | bc -l`
    
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "${Z:0:2}" "${Z:2:1}" "${Z:3:1}" "${Z:4:1}" "${Z:5:1}" "${Z:6:5}" "${Z:11:4}" "${Z:15:4}" "$version" "$ascen" "MAR" "$degc" "${Z:40:1}" "${Z:41:1}" "${Z:42:1}" 
    
    degc=`echo "scale=2; ${Z:43:5} / 100" | bc -l`
    
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "${Z:0:2}" "${Z:2:1}" "${Z:3:1}" "${Z:4:1}" "${Z:5:1}" "${Z:6:5}" "${Z:11:4}" "${Z:15:4}" "$version" "$ascen" "APR" "$degc" "${Z:48:1}" "${Z:49:1}" "${Z:50:1}" 
    
    degc=`echo "scale=2; ${Z:51:5} / 100" | bc -l`
    
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "${Z:0:2}" "${Z:2:1}" "${Z:3:1}" "${Z:4:1}" "${Z:5:1}" "${Z:6:5}" "${Z:11:4}" "${Z:15:4}" "$version" "$ascen" "MAY" "$degc" "${Z:56:1}" "${Z:57:1}" "${Z:58:1}" 
    
    degc=`echo "scale=2; ${Z:59:5} / 100" | bc -l`
    
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "${Z:0:2}" "${Z:2:1}" "${Z:3:1}" "${Z:4:1}" "${Z:5:1}" "${Z:6:5}" "${Z:11:4}" "${Z:15:4}" "$version" "$ascen" "JUNE" "$degc" "${Z:64:1}" "${Z:65:1}" "${Z:66:1}" 
    
    degc=`echo "scale=2; ${Z:67:5} / 100" | bc -l`
    
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "${Z:0:2}" "${Z:2:1}" "${Z:3:1}" "${Z:4:1}" "${Z:5:1}" "${Z:6:5}" "${Z:11:4}" "${Z:15:4}" "$version" "$ascen" "JULY" "$degc" "${Z:72:1}" "${Z:73:1}" "${Z:74:1}" 
    
    degc=`echo "scale=2; ${Z:75:5} / 100" | bc -l`
    
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "${Z:0:2}" "${Z:2:1}" "${Z:3:1}" "${Z:4:1}" "${Z:5:1}" "${Z:6:5}" "${Z:11:4}" "${Z:15:4}" "$version" "$ascen" "AUG" "$degc" "${Z:80:1}" "${Z:81:1}" "${Z:82:1}" 
    
    degc=`echo "scale=2; ${Z:83:5} / 100" | bc -l`
    
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "${Z:0:2}" "${Z:2:1}" "${Z:3:1}" "${Z:4:1}" "${Z:5:1}" "${Z:6:5}" "${Z:11:4}" "${Z:15:4}" "$version" "$ascen" "SEPT" "$degc" "${Z:88:1}" "${Z:89:1}" "${Z:90:1}" 
    
    degc=`echo "scale=2; ${Z:91:5} / 100" | bc -l`
    
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "${Z:0:2}" "${Z:2:1}" "${Z:3:1}" "${Z:4:1}" "${Z:5:1}" "${Z:6:5}" "${Z:11:4}" "${Z:15:4}" "$version" "$ascen" "OCT" "$degc" "${Z:96:1}" "${Z:97:1}" "${Z:98:1}" 
    
    degc=`echo "scale=2; ${Z:99:5} / 100" | bc -l`
    
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "${Z:0:2}" "${Z:2:1}" "${Z:3:1}" "${Z:4:1}" "${Z:5:1}" "${Z:6:5}" "${Z:11:4}" "${Z:15:4}" "$version" "$ascen" "NOV" "$degc" "${Z:104:1}" "${Z:105:1}" "${Z:106:1}" 
    
    degc=`echo "scale=2; ${Z:107:5} / 100" | bc -l`
    
    	printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${Z:0:11}" "${Z:0:2}" "${Z:2:1}" "${Z:3:1}" "${Z:4:1}" "${Z:5:1}" "${Z:6:5}" "${Z:11:4}" "${Z:15:4}" "$version" "$ascen" "DEC" "$degc" "${Z:112:1}" "${Z:113:1}" "${Z:114:1}" 
    
    done <"$1"
    

    There are some minor “justification” differences in the scripts vs FORTRAN (aligned right side or left) but that ought not matter once loaded in the database.

    Testing that must wait until the process is done. The GHCN v3.3 run has about 1/2 an hour on it and is still only about 1/10 done… so figure maybe 5 hours? Which means the v4 will likely take a whole day… I’ll post times when I have them.

    Running both together on the Pi put it in heat limit on the CPU (with small heat sink) so to get a valid time I’ll need to run them one at a time and not use the browser… so off to a different machine…

  44. Another Ian says:

    Climate related

    “A song for the sensitive Joelle Gergis.
    (To the tune of ‘Both Sides Now’ with apologies to Joni Mitchell)

    Rows and flows of climate scare,
    And adjustments plucked out of the air,
    With stormy droughts freezing everywhere,
    We’ve looked at clouds that way.

    But now they only block the Sun,
    As models project rain on everyone.
    So much heat-held albedo it’s done,
    But those clouds got in our way.

    We’ve looked at clouds from both sides now,
    Computed parameters, and still somehow,
    No, cloud delusions we recall,
    We really do know clouds at all.

    Moons, warm June’s, and icy-caps,
    The climate chaos, and way you feel.
    Cause virtual climate is just so real.
    I’ve emoted at clouds that way.

    But now it’s just another show,
    Take the money, leave crying as you go.
    And if you care, don’t let them know.
    Don’t give yourself away.

    We’ve looked at climate from both sides now,
    From give and take, and still somehow,
    It’s CO2′s illusions I recall.
    We like the payments, that’s all.

    Fears, more fears and feeling proud,
    With hubris we shout right out loud.
    All dreams and schemes conning crowds,
    We see extinction’s rise that way.

    But now old weather’s acting strange.
    CO2 still warming but clouds have changed?
    So everyone’s lost when we’ve gained,
    We’ve a fat cheque every day.

    We’ve looked at life from both sides now,
    From win and lose and still somehow,
    It’s extinction’s illusions I recall.
    We really know bad climate, that’s all.

    We’ve looked at life from both sides now,
    From up and down and still somehow,
    It’s CO2′s illusions I recall.
    ‘Cause we really don’t know anything, at all. ”

    http://joannenova.com.au/2019/08/climate-expert-or-un-certified-seer-joelle-gergis-does-gut-wrenching-grieving-horror-volcanic-rage/#comment-2172805

  45. jim2 says:

    (I wonder if there is some language common for programming on both *nix and MicroSoft platforms that would be suitable… What do people use in the MS world, anyway? That isn’t a MS Proprietary Thing…)

    You can install BASH on Win 10.

  46. jim2 says:

    If you’re a developer working with both Linux and Windows, you can now add Bash shell on Windows 10 to run Linux commands. Here’s how to install the tool.

    https://www.windowscentral.com/how-install-bash-shell-command-line-windows-10

  47. E.M.Smith says:

    Well then, I’ll assume folks with MSWindows can just use bash or C…

  48. E.M.Smith says:

    Well, the conversion of v3.3 finally completed:

    chiefio@PiM3Devuan2:~/SQL/v3$ time ./ghcn.sh temps.in > 3gtemps.out 
    
    real	757m9.355s
    user	275m1.181s
    sys	662m59.556s
    

    757 minutes is 12.6 hours….

    The v4 file is 3 x larger… so about 36 hours to make it.

    chiefio@PiM3Devuan2:~/SQL/v4$ ls -l ../v3/*ps.out
    -rw-r--r-- 1 chiefio chiefio 389304928 Aug  6 06:08 ../v3/3gtemps.out
    -rw-r--r-- 1 chiefio chiefio 411010800 Feb  8 19:25 ../v3/3temps.out
    
    chiefio@PiM3Devuan2:~/SQL/v4$ ls -l *.out
    -rw-r--r-- 1 chiefio chiefio 1260880200 May 13 19:12 temps.out
    

    Why I like FORTRAN? How about 100 times faster?

    chiefio@PiM3Devuan2:~/SQL/v4$ time ./stationdat > junk.out 
    
    real	7m54.531s
    user	7m22.522s
    sys	0m30.879s
    chiefio@PiM3Devuan2:~/SQL/v4$ ls -l junk.out
    -rw-r--r-- 1 chiefio chiefio 1260880200 Aug  6 06:46 junk.out
    

    Note that the 3gtemps.out (the new one) is slightly smaller than the old one. I think this is because the FORTRAN prints full field widths while the script only prints the characters between the quote marks, so with the FORTRAN output I must test for “ JAN” with the leading space (in various SQL and Python programs). This might be a feature, or a bug… Visual inspection of the data shows it is all there and right, but with “alignment” changes that I think are the spaces.
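
    If those padded fields ever do cause trouble on the query side, stripping at compare time covers both outputs (a Python sketch; TRIM() does the same job on the SQL side):

    # Compare month fields from either file, ignoring FORTRAN-style padding.
    def same_month(field, wanted):
        return field.strip() == wanted

    print(same_month(" JAN", "JAN"))   # True  -- old, space padded output
    print(same_month("JAN",  "JAN"))   # True  -- new bash output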

    They have the same number of lines:

    chiefio@PiM3Devuan2:~/SQL/v3$ wc -l *ps.out
      5554200 3gtemps.out
      5554200 3temps.out
    

    Here’s the top 10 lines of both. Scroll way off to the right and you can see the alignment differences that I’m attributing to spaces in the FORTRAN printout.

    chiefio@PiM3Devuan2:~/SQL/v3$ head *ps.out
    ==> 3gtemps.out <==
    10160355000	Gv3.3	7Sept2015	1	101	60355	000	1878	JAN	TAVG	8.90	 	 	1
    10160355000	Gv3.3	7Sept2015	1	101	60355	000	1878	FEB	TAVG	9.50	 	 	1
    10160355000	Gv3.3	7Sept2015	1	101	60355	000	1878	MAR	TAVG	11.10	 	 	1
    10160355000	Gv3.3	7Sept2015	1	101	60355	000	1878	APR	TAVG	16.10	 	 	1
    10160355000	Gv3.3	7Sept2015	1	101	60355	000	1878	MAY	TAVG	19.80	 	 	1
    10160355000	Gv3.3	7Sept2015	1	101	60355	000	1878	JUNE	TAVG	22.40	 	 	1
    10160355000	Gv3.3	7Sept2015	1	101	60355	000	1878	JULY	TAVG	24.90	 	 	1
    10160355000	Gv3.3	7Sept2015	1	101	60355	000	1878	AUG	TAVG	26.80	 	 	1
    10160355000	Gv3.3	7Sept2015	1	101	60355	000	1878	SEPT	TAVG	23.20	 	 	1
    10160355000	Gv3.3	7Sept2015	1	101	60355	000	1878	OCT	TAVG	26.80	 	S	1
    
    ==> 3temps.out <==
    10160355000	Gv3  	7Sept2015 	1	101	60355	000	1878	 JAN	TAVG	   8.90	 	 	1
    10160355000	Gv3  	7Sept2015 	1	101	60355	000	1878	 FEB	TAVG	   9.50	 	 	1
    10160355000	Gv3  	7Sept2015 	1	101	60355	000	1878	 MAR	TAVG	  11.10	 	 	1
    10160355000	Gv3  	7Sept2015 	1	101	60355	000	1878	 APR	TAVG	  16.10	 	 	1
    10160355000	Gv3  	7Sept2015 	1	101	60355	000	1878	 MAY	TAVG	  19.80	 	 	1
    10160355000	Gv3  	7Sept2015 	1	101	60355	000	1878	JUNE	TAVG	  22.40	 	 	1
    10160355000	Gv3  	7Sept2015 	1	101	60355	000	1878	JULY	TAVG	  24.90	 	 	1
    10160355000	Gv3  	7Sept2015 	1	101	60355	000	1878	 AUG	TAVG	  26.80	 	 	1
    10160355000	Gv3  	7Sept2015 	1	101	60355	000	1878	SEPT	TAVG	  23.20	 	 	1
    10160355000	Gv3  	7Sept2015 	1	101	60355	000	1878	 OCT	TAVG	  26.80	 	S	1
    
  49. rms says:

    OK. I’m making my program to load the by_year data into MySQL “work and look good” and will post it on the cloud share I will inform you of at the email you provide. Looking at performance tweaking. As you know, there are millions and millions of records.
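
    Most of the performance tweaking is in how the dataframe gets written out. A minimal sketch of the sort of thing I’m trying (the connection string, table name, and chunk size are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    # Needs SQLAlchemy and a MySQL driver (e.g. pymysql) installed.
    # Placeholder credentials/host/database -- adjust to your own instance.
    engine = create_engine("mysql+pymysql://user:password@localhost/temps")

    df = pd.read_pickle("by_year_sample.pkl")   # dataframe prepared earlier
    # Write in batches; method="multi" packs many rows into each INSERT.
    df.to_sql("by_year", engine, if_exists="append", index=False,
              chunksize=10000, method="multi")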

    To others commenting about what languages to use on Windows… I still recommend Python with Pandas, which I am led to believe works well on Windows (I use Mac OSX and Ubuntu Linux). I recommend installing Anaconda first to create a Python 3.x environment. Anaconda makes it easy to get all this stuff installed and maintained correctly. https://www.anaconda.com/distribution/ You can surely also install R into this environment on Windows, as well as FORTRAN.

  50. E.M.Smith says:

    @rms:

    It’s just been so long for me since I’ve used a Windows box for any programming that I’m basically clueless about what is currently popular on it. My only Windows computer at this point is dual boot Debian (current) and Windows 7 ….

    As I’m already writing Python, that’s an easy choice.

    It would make it “one language and a database” end to end and cross platform…

    Guess I’ll take a look at that, and at your data load program.

    FWIW, I did find it kind of fun to make the tab insertion program run in bash… yeah, sick puppy ;-) It was made a little more delicious by being about 100 x slower than FORTRAN … Obfuscated AND slothful in one go!

    C and FORTRAN are about the same efficiency for most things on *nix (no real surprise when you realize the bulk of the compiler is the same package – gnu stuff) with many other high level languages being “close”. Julia (as a new kid on the block) has managed to come in at just about the same speed. Python, when interpreted, is slower, but not enough to matter for most things (and when it does, you can use pypy or similar to compile it). IMHO they are all about a “wash” on the technical front, so popularity and personal preference drive most of the decision tree.

    Interesting article here:
    https://www.technotification.com/2018/11/julia-vs-python-programming.html

    It might be fun to do the front end step in both Python and Julia just as examples and for individual choice. I’ve wanted to play with Julia and these are very trivial programs so good for a first toe in the water.

    R, for me, now, is still something of a black box. I think I remember reading somewhere an article saying it was “OK” on speed, but not great. I have looked at the syntax and structure and it isn’t that complicated a language; but somehow a bit “odd”… More like mathematics than programming…

    Well, in any case, I’ve got the “tab insert program” now running in FORTRAN and bash (though it needs a test load to the database to assure that the blank space differences don’t change anything in a bad way…) so I guess next it will be “some other language”. Since R is having library problems on the ARM platforms, and Python looks to be “everywhere”, I’ll likely just use it.

    I know that (at least in theory per what I’ve read) you can stuff things into the database directly from Python. That could be a nice feature. I think I’ll try writing just the simple “tab inserter” though, so that it is identical in function to the other two.
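
    Roughly what I have in mind is below (a sketch only: the slice positions copy the v3 bash script above, and anything database-side would still have to match the real schema):

    #!/usr/bin/env python3
    # Sketch of the v3 "tab inserter" in Python -- same function as the
    # FORTRAN and bash versions above; field positions follow the bash script.
    import sys

    VERSION, ASCEN = "Gv3.3", "7Sept2015"
    MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUNE",
              "JULY", "AUG", "SEPT", "OCT", "NOV", "DEC"]

    with open(sys.argv[1]) as f:
        for z in f:
            stn, year, ttype = z[0:11], z[11:15], z[15:19]
            region, cnum, wmo, near = z[0:1], z[0:3], z[3:8], z[8:11]
            for i, mon in enumerate(MONTHS):
                base = 19 + 8 * i                      # each month occupies 8 characters
                degc = int(z[base:base + 5]) / 100.0   # hundredths of a degree C
                m, q, s = z[base + 5], z[base + 6], z[base + 7]
                print("\t".join([stn, VERSION, ASCEN, region, cnum, wmo, near,
                                 year, mon, ttype, f"{degc:.2f}", m, q, s]))

    It would run the same way as the bash one, e.g. “python3 ghcn.py temps.in > 3ptemps.out”.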

    Then I’ll be ready to run through from data download to convert / upload to creating the stats tables. That’s the bulk of it. The reports / graphs are already done, I just have to not screw up how they work ;-) As they have an SQL interface, as long as I leave the field names the same it ought to be fine even if I change how fields are created or the order in the database.

    There WILL be changes for those cases where I’m combining two tables into one (in that a join will not be needed and field names will change) but that’s mostly just changing the SQL string the Python gets fed.

    Well, “we’ll see”… and I’ll be watching the mail box ;-)

  51. Larry Ledwick says:

    This is a Windows version, but I’m not sure it works on Windows 7

    https://www.techadvisor.co.uk/how-to/windows/bash-shell-windows-3655328/

  52. jim2 says:

    Then there are some free Fortran programs for Windows.

    https://www.thefreecountry.com/compilers/fortran.shtml

  53. rms says:

    I strongly recommend Anaconda as a first step for anyone before installing yet more programming languages on Windows (or Mac or Linux). You’ll get easy setup and it comes with all this talked-about stuff. Free. And superb quality.

  54. jim2 says:

    Looks like only Python/R is free. The other Anaconda packages cost money.

  55. E.M.Smith says:

    Well, that was interesting…

    So I decided that, rather than wait 2 days for the v4 script to run on the Pi M3, I’d just swap over to the RockPro64. It’s a whole lot faster… So I did, and it ran with about the same “loading” of the machine. About 1/3 total load (process and system). Now the RockPro64 has some very fast cores… It is a 6 core board, the “BIG” cores running at 1.8 GHz and the “little” running at 1.4 GHz (2/4 split of BIG/little), and those big cores are much faster per MHz. It has more memory too and I had 8 GB of swap space.

    Lots of relatively Big Iron…

    So I started the v4 version of the bash based tab inserter and let it run most of the day, then night came, off to bed.

    This morning I found that it had ALMOST completed:

    chiefio@rockpro64:~/SQL/v4$ ls -l *.out
    -rw-rw-r-- 1 chiefio chiefio  917070587 Aug  7 13:46 shTemps.out
    -rw-r--r-- 1 chiefio chiefio 1260880200 Mar 21 03:05 temps.out
    

    The problem is that the window where it was running was full of (and growing with more of):

    4ghcn.sh: fork: retry: Resource temporarily unavailable
    4ghcn.sh: fork: retry: Resource temporarily unavailable
    4ghcn.sh: fork: retry: Resource temporarily unavailable
    4ghcn.sh: fork: retry: Resource temporarily unavailable
    

    Why on earth would it be unable to fork a process? Nothing was limiting…

    Digging around I found this:

    https://www.linuxquestions.org/questions/linux-server-73/bash-fork-retry-resource-temporarily-unavailable-limit-issue-4175618262/

    Edit:
    Tried this java application out on multiple cloud hardware configurations.. all yielding the same result. from 1cpu – 6gb of ram to 16cpu – 60gb of ram and several hardware configs in between.

    Edit2: SOLVED
    Figured out the issue
    https://stackoverflow.com/questions/…57379#43157379

    /etc/systemd/logind.conf UserTasksMax needs to be set higher than ~12000
    Not sure how this is read in, but a reboot of the server activated this number

    I think it’s also worth noting that
    sys.kernel.threads-max
    sys.kernel.pid_max
    sys.vm.max_map_count
    Could probably be limitations if I was trying to go much higher with thread count, and these might be my next limiters if I decided to go bigger with thread count.

    Edit3:
    Instead of rebooting you can do these commands
    Code:

    sudo systemctl daemon-reload
    sudo systemctl daemon-reexec
    sudo service systemd-logind restart
    

    Then restart your shell or.. exit.. and ssh back in

    The referenced link lists several other SystemD-imposed limits that can cause grief, and how to change them, too.

    Folks may remember that I’d not bothered to change the RockPro64 to a Devuan (no systemd) based system… but left it an Armbian Ubuntu with SystemD.

    So now we know. SystemD “has ideas” about how much of your system in your cgroup you ought to be allowed to use… Sigh. So use your equipment too much, you get stopped from using your equipment…

    FWIW, I had to not only kill that process but exit a couple of windows before I was able to issue the “HALT” command (in the ROOT window I always have open and ready for dealing with The Bad Thing…) without it, too, getting the same error. As I had done an su to root, I presume that window was also assigned to my basic resource pool along with the one running the bash script.

    I’m really really not liking the constant stream of “negative surprises” from SystemD.

    Sure, you can work around them. Sure, you can “adjust” the straight jacket… Sure you can… but why put it on in the first place?
