GHCN in MYSQL quick and easy

For about 3 years I’ve thought I ought to load GHCN data into a database instead of screwing around with FORTRAN. Don’t get me wrong, I actually like FORTRAN. It was my first computer language. I’m pretty good with it. I don’t mind a “GO TO” statement at all and I’m as good with them as with BEGIN; END; pairs (that have now crept into FORTRAN anyway…) but, well, I spent a decade or so as a DBA (Data Base Administrator) and teaching classes in how to use them and how great they are…

So I felt a bit guilty.

But the database ‘products’ (they are free, so kind of hard to think of it as a product, that usually means bought…) on Linux tend to the SQL (Structured Query Language was the original meaning) family and that’s not one I’d used before. Also, IMHO, it was a bit primitive compared to the tools I really wanted to use (in the FOCUS / RAMIS II family, I could be done in about 10 minutes all up.)

So I delayed.

They say “All things come to he who waits”. And it did.

But I didn’t wait doing nothing. While in Florida I’d gotten a couple of SQL books and learned the basic structure of the language (fairly primitive, IMHO) and got a sense of the capabilities. Frankly, that was part of what caused me to take so long.

You see, a ‘watershed’ of sorts happened about 1980 in Computer Programming. Prior to that point, flat files and fixed format data read / write were the norm. Then folks got all enamored of more structured data and the era of databases took off. About the same time, the spreadsheet driven demand for CSV (Comma Separated Values) set in also.

No problem there.

The problem is that ever more computer languages and database systems have become dead stupid about how to handle a flat file with a fixed format. IMHO for no other reason than they were no longer ‘trendy’. They are still one of THE most useful ways to store and move data. LOTS of data has a comma in it, and that makes the CSV format fundamentally broken for large swathes of data. From what I can tell, SQL doesn’t do well with fixed format flat file input. It wants CSV files. That is terribly limiting.

So you end up writing “glue programs”. Right off the bat, you get thrown back into using a language suited to fixed format flat files (like, oh, FORTRAN…) just to read the data and put it into the database. But… Once you are already in a language, and have read in the data, why not just “do what you wanted to do and be done”… so it’s an easy slide to just NOT bother with the PITA DB and keep on writing FORTRAN.
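To make the "glue program" idea concrete, here is a minimal sketch in Python (standing in for the FORTRAN described above, with sqlite3 standing in for MySQL so it is self-contained). The column layout is invented for illustration; it is NOT the real GHCN record format:

```python
# A minimal "glue program" sketch: fields are defined by character
# position, not delimiters.  Python stands in for the FORTRAN discussed
# above, and sqlite3 stands in for MySQL so this runs anywhere.
# NOTE: the column layout below is invented for illustration; it is NOT
# the real GHCN record format.
import sqlite3

# Fake fixed-width records: station id (11), year (4), month (2), value (6)
raw_lines = [
    "10160355000196801   -21",
    "10160355000196902    54",
]

def parse_line(line):
    # Slice by position -- no delimiter anywhere in the record.
    return (line[0:11], int(line[11:15]), int(line[15:17]), int(line[17:23]))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (station_id TEXT, year INTEGER, "
             "month INTEGER, value INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?)",
                 (parse_line(l) for l in raw_lines))

for row in conn.execute("SELECT * FROM records ORDER BY year"):
    print(row)
```

Once a reader like this exists, the temptation described above is real: you already have the parsed values in a language that understands positions, so "just do the work here" is one line away.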

But I felt guilty.

I knew if I had it in a DB that a lot of other things become much easier. Especially if I had a DB of all three versions. V1, V2, V3 compares could be much easier. Eventually this wore me down and I decided “now is the time to load at least ONE and practice the language”.

The Internet Search Is Your Friend.

Not wanting to do work that was already done, I thought maybe someone had a script up or a program for making the database. A quick search showed a couple of them. After downloading one of the “scripts”, it turned out to be amazingly large. A bit of looking at it showed it contained all of the GHCN V3.2.2 data embedded in it and was all ready to load as is. Neatly bypassing that whole conversion of data formatting problem. (I am still left with ‘adapt the process to V1 and V2’ and a ‘load raw daily data’ conversion for when I get to those steps.)

What this means is that I can have at least ONE instance of GHCN data up and running in a MySQL database on CentOS linux. And it took about 15 minutes. Now I can easily ‘slide’ into actually using SQL and the database and getting fast and comfortable with it. The ‘center of gravity of easy’ moves.

This version of the data, in this downloaded script, will be the post-homogenizing pasteurized processed data-food-product and not the organic vegan holistic raw data uncontaminated by adjuncts ;-) … but it is a start.

The Sources

Dmitry A. Grechka

The Global Historical Climatology Network-Monthly (GHCN-M) dataset by NCDC is a particularly important data set if your research deals with climate data. It is widely accepted. Its major advantages are quality control and a variety of data sources combined together. I used it several times as reference data for validation of calculated climate surfaces. It is also great for uncertainty assessment of climate interpolation methods.

But it is distributed as text files of specific format only. And you will have to write a parser to fetch the data.

It was that “write a parser” part that had me unmotivated…

This week I decided to load GHCNv3 into MySQL to make fetching the data flexible. I can fetch different subsets of the data into CSV files just by composing a proper select query. That significantly sped up my experiments with interpolation techniques.

I share these SQL scripts to enable other researchers to load GHCN v3 into their own SQL servers. You can restore GHCN at your server and perform requests to it. Just download the script and execute it, and you are able to get the data you need. Fast :)

GHCNv3_ansi.sql (compatible with PostgreSQL, DB2, Oracle, MsSQL)
GHCNv3_mysql.sql (MySQL specific syntax)

The scripts do not contain CREATE DATABASE statements. Thus create an empty database by hand and then execute the proper script.
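The create-then-execute flow is just two steps. As an illustration only (Python's bundled sqlite3 standing in for MySQL, and a made-up two-line script; the real GHCNv3_mysql.sql is ~0.5 GB of the same idea: CREATE TABLE statements followed by INSERTs):

```python
# Illustration of the "create an empty database, then execute the
# script" flow, using Python's bundled sqlite3 in place of MySQL and a
# made-up miniature script.
import sqlite3

script = """
CREATE TABLE Stations (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO Stations VALUES (80099906001, 'SHIP I');
"""

conn = sqlite3.connect(":memory:")  # step 1: the empty database
conn.executescript(script)          # step 2: execute the script

print(conn.execute("SELECT name FROM Stations").fetchone()[0])
```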

All well and good, but now it was ‘on me’ to make sure I had SQL of some sort running and to actually DO something. I chose to use the CentOS box just because the R.Pi was already fetching a load of data and likely to hit zero free space ‘soon’; and because CentOS almost certainly would have SQL of some sort already installed. It did, MySQL.

There are minor differences in the flavors of SQL, so it is kind of important to know that, and know what you are using. This had me using the second script, not the first one. The “free world” flavor not the proprietary ones.

I also learned that just having it installed is not enough. The ‘service’ has to be started and running before you can run the database… In yet another overly complicated byzantine moment, the designers split things so that you need a back end mysql server running then you can use the mysql command to get a mysql ‘shell’ that lets you use it. Not the way I would do it, but hey, I didn’t have to write it.

So in case anyone else needs to get over that hump, this page helps:

It has a LOT of stuff on setting up a couple of levels of security (so you log into the system, then log into mysql with another set of login and passwords and then…) that I just skipped. If you do not set up security, then you can just use it. In my case I ‘just used it’ as root. In a stand alone disconnected server, that isn’t too much of an issue. The security config they describe is suited for use as a networked server with important data subject to attack. Yet another layer of complications for later…

This article describes a basic installation of a MySQL database server on CentOS Linux, just enough to get you started. Remember that you might need to install other packages to let applications use MySQL, like extensions for PHP. Check your application documentation for details.
Install MySQL

Install the MySQL server through the CentOS package manager by running the following commands at a command prompt:

sudo yum install mysql-server
sudo /sbin/service mysqld start

Then, run the following command:

sudo /usr/bin/mysql_secure_installation

I skipped the secure_installation stuff for now. This was just a first cut get the feet wet install, not production.

Launch MySQL

Now that MySQL is installed, you can verify that it’s running by trying to launch it:

sudo /sbin/service mysqld start

If MySQL is already running, you will receive a message to that effect.

Launch at restart

To ensure that the MySQL server will launch when the machine is restarted, run the following command:

sudo chkconfig mysqld on

That makes sure your machine will launch the MySQL server when it reboots.

I did an “su root” and then issued the commands, so “service mysqld start”. Same effect, but less total typing as you don’t need to keep doing “sudo”; besides, I’ve not put myself in the ‘sudoers file’ on the CentOS box yet ;-)

After that is done, you can type “mysql” and launch a mysql ‘shell’ where you can enter the mysql commands to make, load, read, process, etc. a database.

They have it all set up with logins and passwords and use the full path name. Since I skipped the ‘security’ level, and have a sane command search path, ‘mysql’ was all I needed to type.

The mysql shell

There is more than one way to work with a MySQL server, but this article focuses on the most basic and compatible approach: The mysql shell. At the command prompt, run the following command to launch the mysql shell and enter it as the root user:

/usr/bin/mysql -u root -p

When you’re prompted for a password, enter the one that you set at installation or, if you haven’t set one, just press enter to submit no password. The following mysql shell prompt should appear:

mysql>
They then go into a bunch about setting passwords and such and selecting a database. I just skipped down to that ‘create a database’ the prior link said was not done in the script.

Create a database

There is a difference between a database server and an actual database, even though those terms are often used interchangeably. MySQL is a database server, meaning that it keeps track of databases and controls access to them. An actual database is where all the data is stored, and it is the database that applications are trying to access when they interact with MySQL.

Some applications create a database as part of their setup process, but others require you to create a database and tell the application about it. Fortunately, creating a database is simple.

To create a database, log in to the mysql shell and run the following command, replacing demodb with the name of the database that you want to create:

create database demodb;
The database is created. You can verify its creation by running a query to list all databases. The following example shows the query and example output:

show databases;

| Database           |
| information_schema |
| demodb             |
| mysql              |
3 rows in set (0.00 sec)

Though I did a “create database ghcnv3;” instead. (It isn’t case sensitive; any case will do on commands.)

So now I had mysql installed, running (if not security locked down), and with a database name created (though empty) ready for loading data and shaping; and was able to launch a ‘mysql shell’ and look around. Not bad for about 10 minutes.

There’s a lot more in that link about how to set things up for various needs. Adding users, permissions, privs, and more. “Someday” I’ll get to those things.

So what next?

Get the data into the database.

So at this point, we’re back to that downloaded “script with data”. I created a directory named “DB” and had put both copies of the compressed download in it. Then I needed to unpack one of them. I’d also figured I’d have the database in this directory (thus the name), but that didn’t work as expected… more on that below.

[chiefio@CentosBox ~]$ cd DB

[chiefio@CentosBox DB]$ ls

GHCNv3_mssql_db2_oracle_postgre.sql.bz2  GHCNv3_mysql.sql.bz2

[chiefio@CentosBox DB]$ bzip2 -d GHCNv3_mysql.sql.bz2 

[chiefio@CentosBox DB]$ ls

GHCNv3_mssql_db2_oracle_postgre.sql.bz2  GHCNv3_mysql.sql

Here you can see that the bz2 ending got removed and it is now just a .sql ending file ready to go.

I wanted to know if the file was all ‘text’ so did:

[chiefio@CentosBox DB]$ file GHCNv3_mysql.sql 
GHCNv3_mysql.sql: data

Hmmm… file contents of ‘data’ not ‘text’… I decided to ‘risk it’ and look at a dump of a little of it ( 10 lines by default ) to my screen:

[chiefio@CentosBox DB]$ head GHCNv3_mysql.sql 
-- The script loads GHCNv3 data (Unadjusted and adjusted GHCN-Monthly TAVG, TMAX, TMIN and inventory data) in a SQL database.

-- Contains all data included in the raw files as on release v3.2.2.20141108
-- The data origin: J. H. Lawrimore, M. J. Menne, B. E. Gleason, C. N. Williams, D. B. Wuertz, R. S. Vose, and J. Rennie (2011), An overview of the Global Historical Climatology Network monthly mean temperature data set, version 3, J. Geophys. Res., 116, D19121, doi:10.1029/2011JD016187.
-- Raw data was downloaded from ftp://

-- The data is converted into SQL by Dmitry A. Grechka on 14 November 2014

Ah, contains both script text and embedded data… Got it. No wonder it unpacks so big:

[chiefio@CentosBox DB]$ ls -l
total 544104
-rw-rw-r--. 1 chiefio chiefio  35036549 Aug 30 23:10 GHCNv3_mssql_db2_oracle_postgre.sql.bz2
-rw-rw-r--. 1 chiefio chiefio 522116261 Aug 30 23:11 GHCNv3_mysql.sql

I’d thought 35 MB was a huge script, but at over half a GB it was way out of line. Unless it contains the data… which it does.

After a bit of looking around, I found out how to load the data via script execution:

mysql> show databases
    -> ;
| Database           |
| information_schema |
| ghcnv1             |
| ghcnv2             |
| ghcnv3             |
| mysql              |
| test               |
6 rows in set (0.12 sec)

Couple of things here. First off, notice that I’ve made “databases” (really just names at this point) for all three major versions of GHCN. Then loaded the v3 one. I think you can see where I’m heading… Then, note that almost all MySQL commands end with a semicolon. I forgot to type that after the “show databases” but no worries, just type it on the next line and all is good.

FWIW, typing ‘q’ will not exit. You need to type exit or quit.

Here’s a bit more of a look at the results of the load (that took about 5 minutes on this box, I think. I launched it and wandered off for a while…)

Note that you must tell it which database to use first, then you can do things with the tables in it. Also notice that unlike everywhere else so far, table names ARE case sensitive.

mysql> use ghcnv3
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables
    -> ;
| Tables_in_ghcnv3 |
| Records          |
| Stations         |
2 rows in set (0.00 sec)

mysql> describe stations;
ERROR 1146 (42S02): Table 'ghcnv3.stations' doesn't exist
mysql> describe Stations;
| Field     | Type        | Null | Key | Default | Extra |
| id        | bigint(20)  | NO   | PRI | NULL    |       |
| latitude  | double      | YES  |     | NULL    |       |
| longitude | double      | YES  |     | NULL    |       |
| elevation | double      | YES  |     | NULL    |       |
| name      | char(30)    | YES  |     | NULL    |       |
| grelev    | smallint(6) | YES  |     | NULL    |       |
| popcls    | char(1)     | YES  |     | NULL    |       |
| popsiz    | smallint(6) | YES  |     | NULL    |       |
| topo      | char(2)     | YES  |     | NULL    |       |
| veg       | char(2)     | YES  |     | NULL    |       |
| loc       | char(2)     | YES  |     | NULL    |       |
| ocndis    | tinyint(4)  | YES  |     | NULL    |       |
| airstn    | bit(1)      | YES  |     | NULL    |       |
| towndis   | tinyint(4)  | YES  |     | NULL    |       |
| grveg     | char(16)    | YES  |     | NULL    |       |
| popcss    | char(1)     | YES  |     | NULL    |       |
16 rows in set (0.01 sec)


mysql> describe Records;
| Field       | Type        | Null | Key | Default | Extra |
| station_id  | bigint(20)  | NO   | PRI | NULL    |       |
| year        | smallint(6) | NO   | PRI | NULL    |       |
| month       | tinyint(4)  | NO   | PRI | NULL    |       |
| is_adjusted | bit(1)      | NO   | PRI | NULL    |       |
| element     | char(4)     | NO   |     | NULL    |       |
| value       | smallint(6) | NO   |     | NULL    |       |
| dmflag      | char(1)     | NO   |     | NULL    |       |
| qcflag      | char(1)     | NO   |     | NULL    |       |
| dsflag      | char(1)     | NO   |     | NULL    |       |
9 rows in set (0.06 sec)

Just as a bit more documentation, here’s the ‘help’ page for online help:

mysql> help

For information about MySQL products and services, visit:
For developer information, including the MySQL Reference Manual, visit:
To buy MySQL Enterprise support, training, or other products, visit:

List of all MySQL commands:
Note that all text commands must be first on line and end with ';'
?         (\?) Synonym for `help'.
clear     (\c) Clear the current input statement.
connect   (\r) Reconnect to the server. Optional arguments are db and host.
delimiter (\d) Set statement delimiter.
edit      (\e) Edit command with $EDITOR.
ego       (\G) Send command to mysql server, display result vertically.
exit      (\q) Exit mysql. Same as quit.
go        (\g) Send command to mysql server.
help      (\h) Display this help.
nopager   (\n) Disable pager, print to stdout.
notee     (\t) Don't write into outfile.
pager     (\P) Set PAGER [to_pager]. Print the query results via PAGER.
print     (\p) Print current command.
prompt    (\R) Change your mysql prompt.
quit      (\q) Quit mysql.
rehash    (\#) Rebuild completion hash.
source    (\.) Execute an SQL script file. Takes a file name as an argument.
status    (\s) Get status information from the server.
system    (\!) Execute a system shell command.
tee       (\T) Set outfile [to_outfile]. Append everything into given outfile.
use       (\u) Use another database. Takes database name as argument.
charset   (\C) Switch to another charset. Might be needed for processing binlog with multi-byte charsets.
warnings  (\W) Show warnings after every statement.
nowarning (\w) Don't show warnings after every statement.

For server side help, type 'help contents'

And note again that the world is divided into ‘server side’ and not… so you have a different command for ‘server side help’.

Here is an example query:

mysql> select * from Stations where name='SHIP I';
| id          | latitude | longitude | elevation | name   | grelev | popcls | popsiz | topo | veg  | loc  | ocndis | airstn | towndis | grveg | popcss |
| 80099906001 |       59 |       -19 |      -999 | SHIP I |      0 | r      |     -9 | FL   | NU   | CO   |      1 |        |      -9 | WATER | r      |
1 row in set (0.25 sec)


I know, it’s a silly one, but it is only going to give me one line and with an easy to type selection key of “SHIP I”. But it shows the data is in there and accessible.
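Once the data is in, more useful slices are one query away. Here is a sketch of a per-year average for one station, modeled on the Records layout shown above but run against a toy sqlite3 copy with invented values (GHCN-M stores temperatures in hundredths of a degree C):

```python
# A sketch of slicing the loaded data: average TAVG per year for one
# station.  sqlite3 stands in for MySQL; the twelve monthly values are
# invented, not real GHCN data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Records (
    station_id INTEGER, year INTEGER, month INTEGER,
    element TEXT, value INTEGER)""")

# Twelve invented monthly TAVG values for one station-year.
rows = [(80099906001, 1968, m, "TAVG", 900 + 10 * m) for m in range(1, 13)]
conn.executemany("INSERT INTO Records VALUES (?, ?, ?, ?, ?)", rows)

query = """SELECT year, AVG(value) / 100.0 AS tavg_c
           FROM Records
           WHERE station_id = ? AND element = 'TAVG'
           GROUP BY year"""
for year, tavg_c in conn.execute(query, (80099906001,)):
    print(year, round(tavg_c, 2))
```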

And, with that, I have a GHCN V3 database up, installed, loaded, and running. Maybe it’s time for morning coffee now ;-)

I also found this fairly readable, if slow paced, tutorial page:

The Annoyance Issue

I’d mentioned I thought my DB would be created in the directory where I issued the ‘create database’ command… turns out MySQL “knows better” and stuffed it off in /var. Which was OK, I guess, as I had enough free space there. But really… what if I’d not? In any case, be aware of that.

I am using CentOS with cPanel. On my server, all MySQL databases save at /var/lib/mysql. Now /var is 100% full and MySQL has stopped working. How can I move the databases to a new directory like /home/mysql especially considering that this server is managed with cPanel?
If you want to change the MySQL data directory, you should edit the MySQL config file and change the datadir value.

datadir = /home/user/data

Your new data directory MUST be owned by mysql and have the proper security context.

chown -R mysql.mysql /home/user/data
chcon -R -t mysqld_db_t /home/user/data

restart mysql server

/etc/init.d/mysqld restart

They restart it with an init.d form of command, but CentOS uses the ‘service’ form. That’s just another gratuitous variation between Linux flavors…

[root@CentosBox mysql]# pwd
/var/lib/mysql
[root@CentosBox mysql]# ls -l
total 487448
drwx------. 2 mysql mysql      4096 Aug 30 23:19 ghcnv1
drwx------. 2 mysql mysql      4096 Aug 30 23:19 ghcnv2
drwx------. 2 mysql mysql      4096 Aug 31 01:48 ghcnv3
-rw-rw----. 1 mysql mysql 488636416 Aug 31 01:59 ibdata1
-rw-rw----. 1 mysql mysql   5242880 Aug 31 01:59 ib_logfile0
-rw-rw----. 1 mysql mysql   5242880 Aug 31 01:59 ib_logfile1
drwx------. 2 mysql mysql      4096 Aug 30 23:15 mysql
srwxrwxrwx. 1 mysql mysql         0 Aug 30 23:15 mysql.sock
drwx------. 2 mysql mysql      4096 Aug 30 23:15 test
[root@CentosBox mysql]# 

So there it is. 488 MB of “surprise” stuffed off in /var/lib for god only knows what reason… One of the reasons I don’t get all wound around the axle on setting up everything ‘just so’ the first time through. Now I know that when I’m ready for the ‘production install’ and take on that security setup load, I’m also going to change the default database install location.

And, with that, I’m off for a snack and coffee while I decide what to do next. ;-)



About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present “hot buttons” are the mythology of Climate Change and ancient metrology; but things change...
This entry was posted in NCDC - GHCN Issues. Bookmark the permalink.

19 Responses to GHCN in MYSQL quick and easy

  1. skylimey says:

    MySQL, just like the other SQL databases, isn’t really aimed at single user applications, which is why you had to go through the steps of setting up daemons, logins etc. MySQL is the “M” in the LAMP stack (Linux, Apache, MySQL, PHP/Python) so most websites out there are running on the same software you just set up. When one takes into account the recovery/transaction/clustering capabilities of these systems, it’s actually pretty amazing to me they can be up and running so easily.

    I taught SQL for Oracle in 1985 when “relational” was a new technology that no one (other than the XYZ government agencies) had any real clue how to use. It’s come a long way since then.

  2. E.M.Smith says:

    Up for a day and no comments? So is that because nobody is interested, because the article said all that was needed – i.e. not much to say by those looking to ‘DIY’, because folks would rather focus on the financial crash WIP, because political things are more fun, because…

    I’m planning a fairly long, if slow, project of loading all the major 3 variations of GHCN and the 2 of USHCN and maybe a couple of others into an SQL (said: Esss Kew El – there’s a political reason I emphasize that… see below*) database set and then doing various ‘slices’ through them for comparisons. So things like “have the individual stations changed data values; or is it the collective that shifts?” that are easier once that data structure is in place. ( Programming consists of the art of structuring data then surrounding it by processes… so the first step is to look at how the data are structured and change it to something better if needed.)

    If folks only want to hear about the end result, when one shows up, I can use my posting time for other stuff. If folks want to ‘follow along at home’, I can spend a couple of hours every week or so making a “status of the work” and “how to DIY” posting like this one. So if anyone wants more like this, let me know. Or if it isn’t of interest, let me know that, too.

    * Names:

    In the beginning, this SQL database was named SEQUEL for Structured English QUEry Language. IBM found out via a law suit that someone else owned that name, so changed it to SQL but continued to promote pronouncing it as “sequel”. I was at an IBM shop ‘in the day’ and had folks “correct” me from SQL to ‘sequel’… and looking into it was offended at the ruse. So went out of my way to pronounce it Ess Kew El and when “corrected” explain my distaste for supporting the crime of intellectual property theft by devious means even AFTER losing the court case. I’m very happy now to see the Wiki saying to pronounce it as the three letters…


    SQL was initially developed at IBM by Donald D. Chamberlin and Raymond F. Boyce in the early 1970s. This version, initially called SEQUEL (Structured English QUEry Language), was designed to manipulate and retrieve data stored in IBM’s original quasi-relational database management system, System R, which a group at IBM San Jose Research Laboratory had developed during the 1970s. The acronym SEQUEL was later changed to SQL because “SEQUEL” was a trademark of the UK-based Hawker Siddeley aircraft company.

    In the late 1970s, Relational Software, Inc. (now Oracle Corporation) saw the potential of the concepts described by Codd, Chamberlin, and Boyce, and developed their own SQL-based RDBMS with aspirations of selling it to the U.S. Navy, Central Intelligence Agency, and other U.S. government agencies. In June 1979, Relational Software, Inc. introduced the first commercially available implementation of SQL, Oracle V2 (Version 2) for VAX computers.

    After testing SQL at customer test sites to determine the usefulness and practicality of the system, IBM began developing commercial products based on their System R prototype including System/38, SQL/DS, and DB2, which were commercially available in 1979, 1981, and 1983, respectively.

    So that’s why I think pronouncing it as three letters matters.

    BTW, that Wiki has some nice code examples in it.

  3. skylimey says:

    My comment is/was still awaiting moderation….

    [Reply: The first time a person (email / IPnumber) makes a posting, it goes to moderation. This is so that I can make sure you’all are not being spammed and things are not turning into a swear word swamp. Once I’ve looked it over, the person is approved and their subsequent postings go up immediately. IFF someone becomes a PITA with a tendency to be too rude, vile, or “gotcha games” prone – i.e. not being very polite – I’ll take the time and effort to put in place a special “hold ’em” for that person. At present, after several years, there are about 4 or 5 folks with such an entry; so it kinda takes a LOT to get onto the “moderation” list. Basically because I have better things to do than play moderator. So “no worries” and it wasn’t about you. It was just that you had your first posting and that gets held until I take a look at the moderation queue, which is usually after the news of the day, a posting, and a cruise through comments; though sometimes it happens before. Welcome, and enjoy your ‘approved’ status ;-) -E.M.Smith ]

  4. p.g.sharrow says:

    @EMSmith; The results of your investigations and your opinions are of great interest. The means, not so much. Very few, maybe none :-( would want to do the work to follow you down that rabbit hole to check your work. I trust you, Hell! I’m impressed!…pg.

  5. Larry Ledwick says:

    Interested, just too busy to comment.
    We use mysql on CentOS here at work. Seems to be a good stable platform. I am a newby to mysql, so mostly a watcher on this topic. I just use it to run queries to see what is going on with certain jobs, nothing complicated and no actual setup involved. We have used a couple different flavors and do occasionally trip over slight differences in how they handle certain obscure things, like memory management, but when you get all that sorted out it is much better than MySQL under Solaris.

  6. E.M.Smith says:


    Welcome! Sorry I had the “brain fart” of complaining about lack of comments prior to checking the moderation queue. My bad. I know better, but it was first thing in the morning and coffee hadn’t soaked in yet and {list of typical excuses for doing something a bit stupid – pick one or two.}

    I was doing HP Image / Query in about ’79 ? or so. Then by 1982 was in the land of IBM Mainframes doing RAMIS II (that few folks will ever know about or remember) and the similar product FOCUS (that was written by the same guy after he left Mathematica Products Group causing rumors of a tape given the speed of bring-up…). So when SQL came along, I was already “tracked” into other products as a specialist “Senior Consultant” and teaching classes in them. Looks like Information Builders and Focus are still around:

    Maybe I ought to dust off the FOCUS manual and see if they want a slightly out of date “expert” in the product… It really is a remarkably convenient way to glue on ‘relations’ and the ‘relational calculus’ to darned near any data structure, including fixed format flat files. I could take a motley collection of old databases, new databases, and flat files and glue them together into one view, then make a variety of reports from it; all in a couple of hours. (Depending on how you did it, FOCUS and RAMIS II could suck a machine to its knees… One contract had me go to the State Of California who had a large IBM Mainframe at about 98% utilization. As each consultant at M.P.Group was to have a specialty, I made mine ‘efficiency’ and collected a set of 26 particulars to ‘check and tune’. When I was done, the machine was doing the same job, but at 94% idle… They really liked me… saved them a few $Million for a new machine… took about a week IIRC.)

    So never quite found the time to explore SQL. Now I am.

    If you have any pointers on where to best get info on sucking in a column oriented flat file WITHOUT CSV, I’d really appreciate it. The approach I’m most likely to take is to use FORTRAN to suck it in (as I have FORTRAN that does that already) and then figure out how to put it into the DB with calls from inside FORTRAN. I think it can do it, but haven’t checked for sure; and have no idea if that is optimal.

    Looks like RAMIS II bit the dust some time after FOCUS started rocking…

    “Mathematica Products Group was acquired for a brief time by Martin Marietta Corporation. In 1987 RAMIS was sold to Online Software International until it was acquired by its current owners, Computer Associates, Long Island, New York.”

    Looks like it is now named CA-RAMIS but given the paucity of links found on a web search, I’d say it is likely “in use” but not much… There is some info on the CA site, but darned hard to find. Here’s a particularly ugly example of gluing some datasets together with JCL included and making reports. Completely obscuring the basic simplicity and power of the approach / language.

    So you could make a definition that pointed to a flat file of salesman, region, paygrade, sales quota and another one that pointed to a database of product, region, sales, profit, and once glued together type something like:

    Table file sales
    print sales by salesman by region where
    profit > 1000000

    and get a nice report of your top producers… (I’ve been away from it long enough that I am not sure now if you can use > or if some other token is needed… it’s only been 30 years ;-)

    AND you could make fixed data format statements (called ‘masters’) to describe a flat file and then just treat it like any other database… so something like:

    Table file GHCNv1
    sum January by year over Station

    becomes near trivial.

    Oh Well. Both owned by mega-corps so not going to be available free anytime soon… Thus my using SQL. Probably best for me to learn it anyway…

    But I really really miss the ability to just put a layout description in place on a flat file and be done.

    Oh Well….

  7. Larry Ledwick says:

    Per one of my co-workers who often has to do file loads:
    If using mysql and the file is delimited (comma separated or some other character), the command he’ll want to look at is “LOAD DATA INFILE”

  8. E.M.Smith says:


    Thanks for the response, but the problem is exactly that “the file is delimited”… it isn’t. It is a flat file with fields defined by position, not by any delimiter. I.e. standard FORTRAN, COBOL, and similar era languages’ “flat file” fixed length records. It seems that when whatever overcame the world of computing in about 1980 happened, they just purged from their mind the idea that a file might be fixed format, fixed length records, with fields defined by position rather than delimiter. Causes no end of grief dealing with fixed format flat files….

    One of my major gripes about just about every computer language post 1975. Including, BTW, C, which does handle fixed format flat files, but in a painful way. (Read a record and then parse it yourself…)

    So one ends up writing ‘parsers’ because the “modern” languages and databases are just too damn stupid to understand fixed format records…

  9. Larry Ledwick says:

    If it is fixed format, you could use basic unix commands like “cut”, “awk” and “sed” to insert delimiters and parse it so you could load it, I would think?
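
    As a sketch of that idea (the 12-character-then-4-character split here is just an assumed layout for illustration, not anything from the real files), sed can inject the delimiters at fixed positions:

    ```shell
    # Insert commas after fixed column positions: an assumed layout of a
    # 12-char id followed by a 4-char year, with the rest left as one field.
    # (Note: GNU cut merges adjacent -c ranges, so it cannot insert a
    # delimiter between touching fields -- sed or awk can.)
    printf '%s\n' '1016035500001967  100  112' |
      sed -E 's/^(.{12})(.{4})/\1,\2,/'
    # -> 101603550000,1967,  100  112
    ```
    
    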

    I’ve never really looked at the files themselves, but yes that is right out of 80 column punch card file structure.

    Big pain for sure, I will ask around and see how we handle data files which are fixed format and convert them to files usable in a data base. If nothing else a useful learning experience.

  10. Larry Ledwick says:

    Is this the sort of file you are working with:

    STATION           STATION_NAME                                       ELEVATION  LATITUDE   LONGITUDE  DATE     TMAX     TMIN     PRCP     
    ----------------- -------------------------------------------------- ---------- ---------- ---------- -------- -------- -------- -------- 
    GHCND:USC00327027 PETERSBURG 2 N ND US                               466.3      48.0355    -98.01     20100101     -178     -311        0 
    GHCND:USC00327027 PETERSBURG 2 N ND US                               466.3      48.0355    -98.01     20100102     -244     -322        0 
    GHCND:USC00327027 PETERSBURG 2 N ND US                               466.3      48.0355    -98.01     20100103     -194     -289        0 
    GHCND:USC00327027 PETERSBURG 2 N ND US                               466.3      48.0355    -98.01     20100104     -167     -200       15 
    GHCND:USC00327027 PETERSBURG 2 N ND US                               466.3      48.0355    -98.01     20100105     -133     -167     9999 
    GHCND:USC00327027 PETERSBURG 2 N ND US                               466.3      48.0355    -98.01     20100106     -133     -172     9999 
    GHCND:USC00327027 PETERSBURG 2 N ND US                               466.3      48.0355    -98.01     20100107     -150     -278        0 

  11. Eric Barnes says:

    Hi EM,
    Sorry it’s taken me so long to get to this! I was pretty interested in this stuff (checks file timestamps) about 3 years ago. Life intervened and continues to intervene in the form of a 5- and 6-year-old. My main goal was to create an audit of ghcnd to see how much the daily values changed given a baseline. Interestingly, the only changes I noticed added data to missing dates from 80+ years ago. I’d assume perfectly innocent stuff. It’d be interesting to see how the data has changed 3 years later (I have archived copies of the data). The code is in C along with bash and sql scripts. A short summary…

    The database creation script is in prog/test .

    The daily data is downloaded using the script get_ghcn_daily.bsh.

    The file is unzipped and all the files are placed into a single directory. This makes running an ‘ls’ command very hazardous, as there are tens of thousands of files.

    Daily files downloaded from noaa are converted into csv. I chose to do this to use the mysql load command, which moves data very quickly into mysql. I suppose a C procedure could have been used, but I figured this would be faster in terms of my time. The bash script that does the loading is load_ghcn_csv.bsh.

    Once this is done, you should be able to query ghcnd. :) There are a number of queries that were probably a rather poor start at some analysis in the sql_test directory.

    There is a “zipper” program that does its own summation of the daily files. Truncating (or moving) the daily table and reloading with daily data from a different date will allow you to run the zipper program, which identifies where changes have been made to historical records (which shouldn’t happen too frequently, one would think :) ). As it is now, it only identifies months and years where there is a difference. You have to do your own investigating with sql to pinpoint what changed. Automating this would be nice.

    A note for all who use the code/scripts: please notify noaa in advance if you find anything interesting or otherwise. I know most here think climate change is a political issue (and it is), but I’m sure that there are many fine people at noaa who work very hard and should be given proper credit for maintaining this large data set, and IMO a large dose of courtesy and respect is in order if any discrepancies are found.

    Thanks again for your excellent blog EM! :)

    The link is here and I’ll leave it up for a while.

    PS. I’m really more interested in ISH the hourly dataset at the moment. IMO, increasing CO2 weakens convection and I think it would be interesting to do a statistical study of means/maxes/etc. of 50 years ago vs. today. Of course I have time for none of this so the point is moot. :)

  12. Eric Barnes says:

    Hi EM, Just read about you not wanting csv and I think a good route would be prepared statements. I’m pretty sure that’s how the load command does it and it should be very fast.

    Here’s the mysql reference. I’ve never done that for mysql, but I have for db2/oracle/sql server. Should be very fast.
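
    In MySQL, a server-side prepared statement looks roughly like this (the table and column names are invented for illustration):

    ```sql
    -- Invented table and columns; MySQL server-side prepared-statement syntax.
    PREPARE ins FROM 'INSERT INTO ghcn (station, yr, jan) VALUES (?, ?, ?)';
    SET @s = '101603550000', @y = 1966, @t = 133;
    EXECUTE ins USING @s, @y, @t;
    DEALLOCATE PREPARE ins;
    ```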

    Converting to csv and then using load command was reasonably fast for me (I don’t think it took more than a couple of hours to process an entire dataset).

  13. E.M.Smith says:


    Yes. Here’s a sample:

    [chiefio@CentosBox GHCN.Dec09]$ ls
    EMS.Dec2009.CP  Merge.log  v2.mean  v2.temperature.inv  x.Tarball
    [chiefio@CentosBox GHCN.Dec09]$ head v2.mean
    1016035500001966-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999  133  110
    1016035500001967  100  112  129  143  181  193  239  255  225  201  166  107
    1016035500001968  109  123  130  160  181  207  247  241  223  191  162  128
    1016035500001969  117  114  134  149  182  194  223  241  220  185  161  108
    1016035500001970  129  110  122  138  165  211  232  249  233-9999  148  114
    1016035500001971  116  104  105  155  183  202  238  271  226  186  133  112
    1016035500001972  101  128  141-9999  160  204  220  228  210  199  154  119
    1016035500001973  112  101  106  136  183  216  242  246  235  187  140  121
    1016035500001974  122  118  137  134  181  216-9999  240  223  170  141-9999
    1016035500001975  115  109  119  136  167  192  240  247  235-9999  139  118
    [chiefio@CentosBox GHCN.Dec09]$ 

    Station ID is in the first 12 characters (composed of a continent, region, then WMO number), with the year slammed right up against it, then 12 monthly temperatures, with -9999 being ‘null’ or empty.

    So for files like this, I can parse and insert commas (though would rather avoid that) but in the station names file, it is possible for a stray comma to be there in the name… so a generic solution avoids a custom cut for each file after inspecting for the presence of commas…

    And yes, I can do a lot with things like awk, sed, etc. But that’s just more development time and frankly ‘kludge’ that a nice FORMAT statement bypasses. (Said statement already in my existing FORTRAN programs so zero work…)

    I just don’t like doing things in inefficient kludgey ways if there is a better cleaner alternative; and feel compelled to bitch about it for a few weeks before just giving in and doing it. 8-|
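
    For the record, the parse-and-insert-commas route for the layout above can be sketched in awk (the file names are mine, and I’m mapping the -9999 sentinel to \N, which MySQL’s load treats as NULL):

    ```shell
    # Sketch: convert v2.mean fixed-width records (12-char station id, 4-char
    # year, then 12 x 5-char monthly values) to CSV, turning the -9999
    # sentinel into \N so MySQL's LOAD DATA treats it as NULL.
    awk '{
      out = substr($0, 1, 12) "," substr($0, 13, 4)
      for (i = 0; i < 12; i++) {
        v = substr($0, 17 + i * 5, 5)
        gsub(/ /, "", v)                       # strip the leading blanks
        out = out "," (v == "-9999" ? "\\N" : v)
      }
      print out
    }' v2.mean > v2.mean.csv
    ```
    
    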

    @Eric Barnes:

    Ah, the 5 to 6 year age… You get to sleep now (unlike the first 6 months…) but start to realize that the total energy expended is rising exponentially to keep up with them… But “no worries” in about 10 years you will be Wide Awake at night wondering what the Hell they are doing, but not having all that energy suck chasing after them anymore… having given up ;-)

    (Son, now 27 or so, expecting first grand kid… I’m sooo looking forward to ‘helping him’ via stories about what he did as a child ;-)

    Downloaded your tarball.. will get to looking at it in a day or three. Thx!!

    IMHO, it isn’t the “raw data” that changes. I’m pretty sure folks have shown proper stewardship of it. It’s the method that is flawed, IMHO. From TOBS applied everywhere even when not appropriate, to “homogenizing” that just screams “Hidden splice artifact by proxy” and such.

    Unfortunately, while I firmly believe in “Never attribute to malice that which is adequately explained by stupidity”, I also recognize that intelligent people are way more prone to “believing their own bull shit” than the not-so-bright. I can spin a great yarn that is way wrong and must guard against believing it myself…

    But really, does it matter if it is malice or stupidity? It’s still quite wrong.

    FWIW, I’d happily take a job at NOAA / NASA / Hadley and be one of the most pleasant of folks to work with. (Of all things, I’m NASA certified that way as I’m one of the folks in the model on which all astronaut selection is based… it’s my personality profile… a long story). I could easily do ‘maintenance work’ on code I thought was bogus without interjecting my bias. It’s just “what I do”… I suspect there are a lot of programmer folks like that working on these codes.

    And then the True Believers have built such massive “Castles in the Sky” that at this point they can’t back off and examine the foundations. I think most of them are “not evil”, just firmly convinced of an error. People are like that.

    Part of why I generally suppress swearing here and eschew name calling. Even if gently at times and not always enforcing the “rules”.

    It is very easy to be Tribal, far harder to think “maybe he is a great guy but mis-taught” and very hard to think “Perhaps it IS me…”. So effort must be applied in the opposite order.

    IFF I had to choose, right now, I’d tick the box that said: “Honest hard working folks with a bit of confusion about how things really work, but a heartfelt belief in their subtle errors.”

    Kind of like all of us…

    In short: It’s the adjustments, not the raw data, and errors of assumption, not malice. IMHO.

    But I’d like to confirm that bias on my part via inspecting the various actual data sets in minute detail… thus, the project.

    And, IF I’m wrong, I’m wrong. So be it.

    Oh, and glad you like the blog. I try. ;-)

  14. Paul Hanlon says:

    Years ago, I also thought about doing this (It was right after the New Zealand Weather Service shenanigans). I downloaded the daily data, and wrote out a PHP script to parse it. Four hours later, and only about 10% into the data, it gave up. R and Javascript were similarly challenged.

    I ended up writing a C program to parse each file into a single array, which I then converted into a monthly dataset for each station and saved to a csv file. It managed the whole process in about two hours on a laptop. I was quite proud of that little script; I do remember parsing the -9999s being particularly challenging (at least for me). I could look it over and send it on to you if you wish. You wouldn’t need the saving-into-monthly-values bit, just saving to a csv file and doing a load data local infile into the DB.

    There is a way to get load data local to read fixed width columns, it’s the fourth comment down in this link (no permalinks), but it expects that there will be a corresponding field in the database for a given column, which wouldn’t be of use to you.

  15. E.M.Smith says:

    @Paul Hanlon:

    Any code examples always appreciated. Even if not directly used, I learn things…

    From the link, it looks like fixed format load is supported, but via implied width based on field sizes in the DB. Not ideal, but workable… Once loaded into some table, things can be easily converted into another…

    If the FIELDS TERMINATED BY and FIELDS ENCLOSED BY values are both empty (”), a fixed-row (nondelimited) format is used. With fixed-row format, no delimiters are used between fields (but you can still have a line terminator). Instead, column values are read and written using a field width wide enough to hold all values in the field. For TINYINT, SMALLINT, MEDIUMINT, INT, and BIGINT, the field widths are 4, 6, 8, 11, and 20, respectively, no matter what the declared display width is.

    So I just have to make a table with field sizes that match the input, load, then convert things if needed… I can work with that, but a FORMAT statement would be a lot easier…
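
    A sketch of that approach for v2.mean (column names invented; CHAR widths chosen to match the byte positions, since — if I read the quoted manual text right — the integer types get fixed field widths that would not line up):

    ```sql
    -- Invented names; CHAR widths match the fixed record layout, so the
    -- fixed-row (nondelimited) load lines up with the byte positions.
    CREATE TABLE v2mean_raw (
      stn CHAR(12), yr CHAR(4),
      m01 CHAR(5), m02 CHAR(5), m03 CHAR(5), m04 CHAR(5),
      m05 CHAR(5), m06 CHAR(5), m07 CHAR(5), m08 CHAR(5),
      m09 CHAR(5), m10 CHAR(5), m11 CHAR(5), m12 CHAR(5)
    );

    LOAD DATA LOCAL INFILE '/tmp/v2.mean'
    INTO TABLE v2mean_raw
    FIELDS TERMINATED BY '' ENCLOSED BY ''
    LINES TERMINATED BY '\n';
    ```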

  16. Graeme No.3 says:

    I am completely out of my depth here, but I recall that in the early nineties there was a commercial UNIX program that “pulled data” out of old data bases and inserted it into your new DB. That is, you entered the position where each field was, and where it had to go. It could be used to extract data from non-Unix sources. I think it was called FreeForm.
    Not being computer literate, I lost interest when it was suggested by the IT people as a means of saving the old databases (non Unix) and placing them in a series of MS Access files. I was a sceptic even then.

  17. p.g.sharrow says:

    Graeme No.3 says:”I am completely out of my depth here”
    LOL ! ! !
    When I grew up, a computer was someone that operated a 10-row adding machine and a word processor was the person that worked at a manual typewriter.
    In 85 I was told that I just had to get a computer to do my business paperwork, so I did. Jumped into the deep end, way over my head, with no lifeguard or instructor. Been floundering around ever since. The field grows faster than I can catch up. My son accuses me of being a technophobe. I built him his first computer and set up a BBS for him to operate, before the internet! And now my grandson can’t believe grandpa isn’t up on the latest stuff, works off an old XP machine and doesn’t have a cellphone, 8-( (no service)
    I barely understand all the stuff Mike talks about doing, but, I’m learning……………;-) pg

  18. kneel63 says:

    MySQL is certainly reasonably reliable, scalable etc – plenty of ISPs use MySQL back-ended DNS servers, because it’s easy to set up, even fairly easy to get DB replication working (so in this case, secondary DNS servers don’t use DNS zone updates, rather SQL DB replication).

  19. E.M.Smith says:


    I can easily see the utility and the industrial strength. My only complaint is about ‘style’.

    Some things have had the finish just polished to where everything is “just right”. There is a clear theme or pattern to it and it fits with a clear sense of esthetics. To me, BSD Unix is like that. Other things have a more “thrown together and cluttered” aspect. Things that are more complicated or inefficient than they ought to be. Slapdash in spots, confused in others. Garish or mixed conflicting styles… Stripes with plaid shirt sort of thing. Microsoft Windows is very much like that (and each couple of years the ‘style’ changes to something else equally bad or worse…)

    In between are Red Hat and Ubuntu, where some of the Microsoft bloat and confused style has crept in. So Red Hat is pushing ‘systemd’ as a big fat binary blob that writes inscrutable binary files for logs and has fingers in everything – violating the ethic and style of Unix. And Ubuntu is following them, but also suffering code bloat, and with conflicting styles showing up in various “windowing systems” and administrative methods.

    So I came from a DBA background with a great, simple, consistently themed tool that was designed from the start to let any English speaker just know what was likely to be right. Yeah, it had its ugly parts too, but not many and mostly just where it had to interact with IBM. And here’s SQL and it has that same IBM Ugly feel to the words chosen and how things work. Like the difference between smoothly flowing poetry and a military manual. Both are English, but the esthetics are very different…

    And that does matter.

    Languages that are a joy just settle in and take up residence. (French was like that… a very comfortable and highly useful language). Languages that are forced and not pretty are hard to absorb. The brain just says “Please Sir, can I have less?”… Some of the ‘constructed languages’ are like that. Esperanto, for example, despite the attempt to be like natural languages, has ‘flow’ issues.

    So I’m not knocking the usefulness of SQL. I’m just griping that my brain is saying “less please” when I spend time learning the syntax and grammar to build and operate a database. (Maybe I’ll make my own dictionary and have a preprocessor convert it… I’ve done that before with not-so-poetic things ;-)

Comments are closed.