For about 3 years I’ve thought I ought to load GHCN data into a database instead of screwing around with FORTRAN. Don’t get me wrong, I actually like FORTRAN. It was my first computer language. I’m pretty good with it. I don’t mind a “GO TO” statement at all and I’m as good with them as with BEGIN; END; pairs (that have now crept into FORTRAN anyway…) but, well, I spent a decade or so as a DBA (Data Base Administrator) and teaching classes in how to use them and how great they are…
So I felt a bit guilty.
But the database ‘products’ (they are free, so kind of hard to think of it as a product, that usually means bought…) on Linux tend to the SQL (Structured Query Language was the original meaning) family and that’s not one I’d used before. Also, IMHO, it was a bit primitive compared to the tools I really wanted to use (in the FOCUS / RAMIS II family, I could be done in about 10 minutes all up.)
So I delayed.
They say “All things come to he who waits”. And it did.
But I didn’t wait doing nothing. While in Florida I’d gotten a couple of SQL books and learned the basic structure of the language (fairly primitive, IMHO) and got a sense of the capabilities. Frankly, that was part of what caused me to take so long.
You see, a ‘watershed’ of sorts happened about 1980 in Computer Programming. Prior to that point, flat files and fixed format data read / write were the norm. Then folks got all enamored of more structured data and the era of databases took off. About the same time, the spreadsheet driven demand for CSV (Comma Separated Values) set in also.
No problem there.
The problem is that ever more computer languages and database systems have become dead stupid about how to handle a flat file with a fixed format. IMHO for no other reason than they were no longer ‘trendy’. They are still one of THE most useful ways to store and move data. LOTS of data has a comma in it, and that makes the CSV format fundamentally broken for large swathes of data. From what I can tell, SQL doesn’t do well with fixed format flat file input. It wants CSV files. That is terribly limiting.
So you end up writing “glue programs”. Right off the bat, you get thrown back into using a language suited to fixed format flat files (like, oh, FORTRAN…) just to read the data and put it into the database. But… Once you are already in a language, and have read in the data, why not just “do what you wanted to do and be done”… so it’s an easy slide to just NOT bother with the PITA DB and keep on writing FORTRAN.
But I felt guilty.
I knew if I had it in a DB that a lot of other things become much easier. Especially if I had a DB of all three versions. V1, V2, V3 compares could be much easier. Eventually this wore me down and I decided “now is the time to load at least ONE and practice the language”.
The Internet Search Is Your Friend.
Not wanting to do work that was already done, I thought maybe someone had a script up or a program for making the database. A quick search showed a couple of them. After downloading one of the “scripts”, it was of amazing size. A bit of looking at it showed it contained all of GHCN V3.2.2 data embedded in it and was all ready to load as is. Neatly bypassing that whole conversion of data formatting problem. (I am still left with ‘adapt the process to V1 and V2 and a ‘load raw daily data’ conversion for when I get to those steps).
What this means is that I can have at least ONE instance of GHCN data up and running in a MySQL database on CentOS linux. And it took about 15 minutes. Now I can easily ‘slide’ into actually using SQL and the database and getting fast and comfortable with it. The ‘center of gravity of easy’ moves.
This version of the data, in this downloaded script, will be the post-homogenizing pasteurized processed data-food-product and not the organic vegan holistic raw data uncontaminated by adjuncts ;-) … but it is a start.
Dmitry A. Grechka
The Global Historical Climatology Network-Monthly (GHCN-M) dataset by NCDC is particularly important data set if your research deals with climate data. It is widely accepted. Its major advantage is quality control and a variety of data sources combined together. I used it several times as reference data for validation of calculated climate surfaces. It is also great for uncertainty assessment of climate interpolation methods.
But it is distributed as text files of specific format only. And you will have to write a parser to fetch the data.
It was that “write a parser” part that had me unmotivated…
This week I decided to load the GHCNv3 into MySQL to make it flexible for fetching. I can fetch different subsets of the data into CSV files just with composing a proper select query. That made a significant speed up in experiments with interpolation techniques.
I share these SQL scripts to enable others researchers to load GHCN v3 into their own SQL servers. You can restore GHCN at your server and perform requests to it. Just download the script, execute it. And you are able to get the data you need. Fast :)
The scripts do not contain CREATE DATABASE statements. Thus create an empty database by hand and then execute the proper script.
All well and good, but now it was ‘on me’ to make sure I had SQL of some sort running and to actually DO something. I chose to use the CentOS box just because the R.Pi was already fetching a load of data and likely to hit zero free space ‘soon’; and because CentOS almost certainly would have SQL of some sort already installed. It did, MySQL.
There are minor differences in the flavors of SQL, so it is kind of important to know that, and know what you are using. This had me using the second script, not the first one. The “free world” flavor not the proprietary ones.
I also learned that just having it installed is not enough. The ‘service’ has to be started and running before you can run the database… In yet another overly complicated byzantine moment, the designers split things so that you need a back end mysql server running then you can use the mysql command to get a mysql ‘shell’ that lets you use it. Not the way I would do it, but hey, I didn’t have to write it.
So in case anyone else needs to get over that hump, this page helps:
It has a LOT of stuff on setting up a couple of levels of security (so you log into the system, then log into mysql with another set of login and passwords and then…) that I just skipped. If you do not set up security, then you can just use it. In my case I ‘just used it’ as root. In a stand alone disconnected server, that isn’t too much of an issue. The security config they describe is suited for use as a networked server with important data subject to attack. Yet another layer of complications for later…
This article describes a basic installation of a MySQL database server on CentOS Linux, just enough to get you started. Remember that you might need to install other packages to let applications use MySQL, like extensions for PHP. Check your application documentation for details.
Install the MySQL server through the CentOS package manager by running the following commands at a command prompt:
sudo yum install mysql-server
sudo /sbin/service mysqld start
Then, run the following command:
I skipped the secure_installation stuff for now. This was just a first cut get the feet wet install, not production.
Now that MySQL is installed, you can verify that it’s running by trying to launch it:
sudo /sbin/service mysqld start
If MySQL is already running, you will receive a message to that effect.
Launch at restart
To ensure that the MySQL server will launch when the machine is restarted, run the following command:
sudo chkconfig mysqld on
That makes sure your machine will launch the MySQL server when it reboots.
I did an “su root” and then issued the commands, so “service mysqld start”. Same effect, but less total typing as you don’t need to keep doing “sudo”; besides, I’ve not put myself in the ‘sudoers file’ on the CentOS box yet ;-)
After that is done, you can type “mysql” and launch a mysql ‘shell’ where you can enter the mysql commands to make, load, read, process, etc. a database.
They have it all set up with logins and passwords and use the full path name. Since I skipped the ‘security’ level, and have a sane command search path, ‘mysql’ was all I needed to type.
The mysql shell
There is more than one way to work with a MySQL server, but this article focuses on the most basic and compatible approach: The mysql shell. At the command prompt, run the following command to launch the mysql shell and enter it as the root user:
/usr/bin/mysql -u root -p
When you’re prompted for a password, enter the one that you set at installation or, if you haven’t set one, just press enter to submit no password. The following mysql shell prompt should appear:
They then go into a bunch about setting passwords and such and selecting a database. I just skipped down to that ‘create a database’ the prior link said was not done in the script.
Create a database There is a difference between database server and an actual database, even though those terms are often used interchangeably. MySQL is a database server, meaning that it keeps track of databases and controls access to them. An actual database is where all the data goes is stored, and it is the database that applications are trying to access when they interact with MySQL. Some applications create a database as part of their setup process, but others require you to create a database and tell the application about it. Fortunately, creating a database is simple. To create a database, log in to the mysql shell and run the following command, replacing demodb with the name of the database that you want to create: CREATE DATABASE demodb; The database is created. You can verify its creation by running a query to list all databases. The following example shows the query and example output: SHOW DATABASES; +--------------------+ | Database | +--------------------+ | information_schema | | demodb | | mysql | +--------------------+ 3 rows in set (0.00 sec)
Though I did a “create database ghcnv3;” instead. ( It isn’t case sensitive. Any case will do on commands).
So now I had mysql installed, running (if not security locked down), and with a database name created (though empty) ready for loading data and shaping; and was able to launch a ‘mysql shell’ and look around. Not bad for about 10 minutes.
There’s a lot more in that link about how to set things up for various needs. Adding users, permissions, privs, and more. “Someday” I’ll get to those things.
So what next?
Get the data into the database.
So at this point, we’re back to that downloaded “script with data”. I created a directory named “DB” and had put both copies of the compressed download in it. Then needed to unpack one of them. I’d also figured I have the database in this directory (thus the name) but that didn’t work as expected… more on that below).
[chiefio@CentosBox ~]$ cd DB [chiefio@CentosBox DB]$ ls GHCNv3_mssql_db2_oracle_postgre.sql.bz2 GHCNv3_mysql.sql.bz2 [chiefio@CentosBox DB]$ bzip2 -d GHCNv3_mysql.sql.bz2 [chiefio@CentosBox DB]$ ls GHCNv3_mssql_db2_oracle_postgre.sql.bz2 GHCNv3_mysql.sql
Here you can see that the bz2 ending got removed and it is now just a .sql ending file ready to go.
I wanted to know if the file was all ‘text’ so did:
[chiefio@CentosBox DB]$ file GHCNv3_mysql.sql GHCNv3_mysql.sql: data
Hmmm… file contents of ‘data’ not ‘text’… I decided to ‘risk it’ and look at a dump of a little of it ( 10 lines by default ) to my screen:
[chiefio@CentosBox DB]$ head GHCNv3_mysql.sql -- The script loads GHCNv3 data (Unadjusted and adjusted GHCN-Monthly TAVG, TMAX, TMIN and inventory data) in a SQL database. -- Contains all data included in the raw files as on release v220.127.116.1141108 -- The data origin: J. H. Lawrimore, M. J. Menne, B. E. Gleason, C. N. Williams, D. B. Wuertz, R. S. Vose, and J. Rennie (2011), An overview of the Global Historical Climatology Network monthly mean temperature data set, version 3, J. Geophys. Res., 116, D19121, doi:10.1029/2011JD016187. -- Raw data was downloaded from ftp://ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v3/ -- The data is converted into SQL by Dmitry A. Grechka on 14 November 2014
Ah, contains both script text and embedded data… Got it. No wonder it unpacks so big:
[chiefio@CentosBox DB]$ ls -l total 544104 -rw-rw-r--. 1 chiefio chiefio 35036549 Aug 30 23:10 GHCNv3_mssql_db2_oracle_postgre.sql.bz2 -rw-rw-r--. 1 chiefio chiefio 522116261 Aug 30 23:11 GHCNv3_mysql.sql
I’d thought 35 MB was a huge script, but at 1/5 GB it was way out of line. Unless it contains the data… which it does.
After a bit of looking around, I found out how to load the data via script execution:
mysql> show databases -> ; +--------------------+ | Database | +--------------------+ | information_schema | | ghcnv1 | | ghcnv2 | | ghcnv3 | | mysql | | test | +--------------------+ 6 rows in set (0.12 sec)
Couple of things here. First off, notice that I’ve made “databases” (really just names at this point) for all three major versions of GHCN. Then loaded the v3 one. I think you can see where I’m heading… Then, note that allmost all MySQL commands end with a semicolon. I forgot to type that after the “show databases” but no worries, just type it on the next line and all is good.
FWIW, typing ‘q’ will not exit. You need to type exit or quit.
Here’s a bit more of a look at the results of the load (that took about 5 minutes on this box, I think. I launched it and wandered off for a while…)
Note that you must tell it which database to use first, then you can do things with the tables in it. Also notice that unlike everywhere else so far, table names ARE case sensitive.
mysql> use ghcnv3 Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Database changed mysql> show tables -> ; +------------------+ | Tables_in_ghcnv3 | +------------------+ | Records | | Stations | +------------------+ 2 rows in set (0.00 sec) mysql> describe stations; ERROR 1146 (42S02): Table 'ghcnv3.stations' doesn't exist mysql> describe Stations; +-----------+-------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-----------+-------------+------+-----+---------+-------+ | id | bigint(20) | NO | PRI | NULL | | | latitude | double | YES | | NULL | | | longitude | double | YES | | NULL | | | elevation | double | YES | | NULL | | | name | char(30) | YES | | NULL | | | grelev | smallint(6) | YES | | NULL | | | popcls | char(1) | YES | | NULL | | | popsiz | smallint(6) | YES | | NULL | | | topo | char(2) | YES | | NULL | | | veg | char(2) | YES | | NULL | | | loc | char(2) | YES | | NULL | | | ocndis | tinyint(4) | YES | | NULL | | | airstn | bit(1) | YES | | NULL | | | towndis | tinyint(4) | YES | | NULL | | | grveg | char(16) | YES | | NULL | | | popcss | char(1) | YES | | NULL | | +-----------+-------------+------+-----+---------+-------+ 16 rows in set (0.01 sec) mysql> mysql> describe Records; +-------------+-------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------------+-------------+------+-----+---------+-------+ | station_id | bigint(20) | NO | PRI | NULL | | | year | smallint(6) | NO | PRI | NULL | | | month | tinyint(4) | NO | PRI | NULL | | | is_adjusted | bit(1) | NO | PRI | NULL | | | element | char(4) | NO | | NULL | | | value | smallint(6) | NO | | NULL | | | dmflag | char(1) | NO | | NULL | | | qcflag | char(1) | NO | | NULL | | | dsflag | char(1) | NO | | NULL | | +-------------+-------------+------+-----+---------+-------+ 9 rows in set (0.06 sec)
Just as a bit more documentation, here’s the ‘help’ page for online help:
mysql> help For information about MySQL products and services, visit: http://www.mysql.com/ For developer information, including the MySQL Reference Manual, visit: http://dev.mysql.com/ To buy MySQL Enterprise support, training, or other products, visit: https://shop.mysql.com/ List of all MySQL commands: Note that all text commands must be first on line and end with ';' ? (\?) Synonym for `help'. clear (\c) Clear the current input statement. connect (\r) Reconnect to the server. Optional arguments are db and host. delimiter (\d) Set statement delimiter. edit (\e) Edit command with $EDITOR. ego (\G) Send command to mysql server, display result vertically. exit (\q) Exit mysql. Same as quit. go (\g) Send command to mysql server. help (\h) Display this help. nopager (\n) Disable pager, print to stdout. notee (\t) Don't write into outfile. pager (\P) Set PAGER [to_pager]. Print the query results via PAGER. print (\p) Print current command. prompt (\R) Change your mysql prompt. quit (\q) Quit mysql. rehash (\#) Rebuild completion hash. source (\.) Execute an SQL script file. Takes a file name as an argument. status (\s) Get status information from the server. system (\!) Execute a system shell command. tee (\T) Set outfile [to_outfile]. Append everything into given outfile. use (\u) Use another database. Takes database name as argument. charset (\C) Switch to another charset. Might be needed for processing binlog with multi-byte charsets. warnings (\W) Show warnings after every statement. nowarning (\w) Don't show warnings after every statement. For server side help, type 'help contents'
And note again that the world is divided into ‘server side’ and not… so you have a different command for ‘server side help’.
Here is an example query:
mysql> select * from Stations where name='SHIP I'; +-------------+----------+-----------+-----------+--------+--------+--------+--------+------+------+------+--------+--------+---------+-------+--------+ | id | latitude | longitude | elevation | name | grelev | popcls | popsiz | topo | veg | loc | ocndis | airstn | towndis | grveg | popcss | +-------------+----------+-----------+-----------+--------+--------+--------+--------+------+------+------+--------+--------+---------+-------+--------+ | 80099906001 | 59 | -19 | -999 | SHIP I | 0 | r | -9 | FL | NU | CO | 1 | | -9 | WATER | r | +-------------+----------+-----------+-----------+--------+--------+--------+--------+------+------+------+--------+--------+---------+-------+--------+ 1 row in set (0.25 sec) mysql>
I know, it’s a silly one, but it is only going to give me one line and with an easy to type selection key of “SHIP I”. But it shows the data is in there and accessible.
And, with that, I have a GHCN V3 database up, installed, loaded, and running. Maybe it’s time for morning coffee now ;-)
I also found this fairly readable, if slow paced, tutorial page:
The Annoyance Issue
I’d mentioned I thought my DB would be created in the directory where I issued the ‘create database’ command… turns out MySQL “knows better” and stuffed it off in /var. Which was OK I guess as I had enough free space there. But really… what if I’d not? In any case, realize that.
I am using CentOS with cPanel. On my server, all MySQL databases save at /var/lib/mysql. Now /var is 100% full and MySQL has stopped working. How can I move the databases to a new directory like /home/mysql especially considering that this server is managed with cPanel?
if you want to change mysql data directory, you should do edit the mysql config file and change datadir value.
datadir = /home/user/data
Your new data directory MUST be owned by mysql and has proper security context.
chown -R mysql.mysql /home/user/data
chcon -R -t mysqld_db_t /home/user/data
restart mysql server
They restart it with an init.d form of command, but CentOS uses the ‘service’ form. That’s just another gratuitous variation between Linux flavors…
[root@CentosBox mysql]# pwd /var/lib/mysql [root@CentosBox mysql]# ls -l total 487448 drwx------. 2 mysql mysql 4096 Aug 30 23:19 ghcnv1 drwx------. 2 mysql mysql 4096 Aug 30 23:19 ghcnv2 drwx------. 2 mysql mysql 4096 Aug 31 01:48 ghcnv3 -rw-rw----. 1 mysql mysql 488636416 Aug 31 01:59 ibdata1 -rw-rw----. 1 mysql mysql 5242880 Aug 31 01:59 ib_logfile0 -rw-rw----. 1 mysql mysql 5242880 Aug 31 01:59 ib_logfile1 drwx------. 2 mysql mysql 4096 Aug 30 23:15 mysql srwxrwxrwx. 1 mysql mysql 0 Aug 30 23:15 mysql.sock drwx------. 2 mysql mysql 4096 Aug 30 23:15 test [root@CentosBox mysql]#
So there it is. 488 MB of “surprise” stuffed off in /var/lib for god only knows what reason… One of the reasons I don’t get all wound around the axle on setting up everything ‘just so’ the first time through. Now I known when I’m ready for the ‘production install’ and take on that security setup load, I’m also going to be changing the default database install location.
And, with that, I’m off for a snack and coffee while I decide what to do next. ;-)