All I wanted to do was make a couple of simple nearly trivial graphs…
After working out how to get the GHCN files loaded into SQL, and doing a couple of ad-hoc queries (proving it was worth the effort), I was ready to move on to “make a graph” at the leading edge of things (while still puttering around with indexes and key fields and efficiency and such on MySQL, as the low priority backend tasks).
A fair amount of searching for MySQL interface libraries for various languages showed they exist for some languages (notably Go has one) but they tended to be a bit of “assemble it yourself”. I didn’t find one for FORTRAN (it might exist, I just did a very quick web search), did find one on Sourceforge for DIY with Go, and there were hints of maybe something for R (but I really didn’t search much for it…). And, at every turn, oodles and oodles of links to stuff about Python.
OK, I can take a hint. When 80%+ of the stuff you turn up says “We use FOO!” (or in Python code “We use SPAM!” they like using Monty Python references so use “SPAM and Eggs” instead of FOO and BAR) eventually you start to think “Maybe I ought to look at FOO?”. So I did.
I don’t hide my (mild) distaste for Python. It’s a fine language and all, but does some things I just don’t like. Due to dynamic typing a variable isn’t really a variable but is a pointer to some data somewhere that DOES have a type. So: a = 1; or b = “SPAM” really just puts a pointer in a or b to those data items “somewhere”. Well, if you are used to thinking of an assignment as actually doing an assignment and pointers as being pointers it is a bit annoying to need to keep in mind that these aren’t. MOST of the time you can ignore it. Yes, it is a nice touch that lets you do things like not bother to assign a data type; but… there can be the odd case where it bites you so you must know “this is different”. Similarly, I don’t like position dependent syntax. Change the white space, change the program… Having white space as part of the reserved word space you need to track is a bother…all to enforce their ideas about good pretty-printing practice for “structured programming”. OK, apply straight jacket and shut up…
So despite Python being the “3rd most popular” language among whoever bothered to be in the survey, (the other two are Java and C) I’ve avoided using it unless there wasn’t much of an alternative. (Like doing maintenance on existing code).
But there it was: Just about every example was using Python… So OK, I’ve spent a couple of days “loading Python” into my brain again, and learning a little of how to use Pandas to graph things. I’m about to do my first “load stuff into Pandas and graph it”, but thought I would post up a couple of notes first. Any success at graphing will be added as an update.
One bit of irk was wandering for an hour or so down the list of Pythons. RPython, Cython, Jython (or some such), PyPy, and more. Why? Well, if you are going to use Python, you MUST PICK ONE, which means you must know enough to pick one, which means at least a minimal familiarity. This is one of my more general complaints about “trendy” things: They end up metastasizing into so many forms it is more work to choose one that to just use something else.
Jython is a Python that makes Java byte code for the JVM. OK, scratch that one.
Cython is a super-set that then spits out C or C++ code, mostly used for making other library like stuff. Scratch.
RPython is Restricted Python, used as a subset to make self-hosting Python. Scratch it, too.
The bottom line is that the Python you get by default on Linux is likely to be the one you want. Maybe.
Then we get 2.x vs 3.x Python. The 3.x Major Release is not entirely compatible with the 2.x one. OK, use python3 (that you may get by default, or not, depending on a lot of things, or you might need to call it out as python3).
It looks like I have both installed already:
root@odroidxu4:/SG2/ext/chiefio/SQL/Table.Schemas# apt-get install python Reading package lists... Done Building dependency tree Reading state information... Done python is already the newest version. python set to manually installed. The following packages were automatically installed and are no longer required: libjsoncpp0 libuuid-perl Use 'apt-get autoremove' to remove them. 0 upgraded, 0 newly installed, 0 to remove and 5 not upgraded. root@odroidxu4:/SG2/ext/chiefio/SQL/Table.Schemas# apt-get install python3 Reading package lists... Done Building dependency tree Reading state information... Done python3 is already the newest version. python3 set to manually installed. The following packages were automatically installed and are no longer required: libjsoncpp0 libuuid-perl Use 'apt-get autoremove' to remove them. 0 upgraded, 0 newly installed, 0 to remove and 5 not upgraded.
But I’ll need to explicitly say “python3” to get the newer dialect.
Then, while researching how big a Dataframe (i.e. what EVERYONE else calls a table…) can be, found that 32 bit Python will barf on about 1.8 GB of data (as memory peaks about 3.x GB during the load and that exceeds what a 32 bit can address) so you will want 64 bit Python then. How one gets 64 vs 32 bit left unclear… though implied it was disjoint from machine word size. I did a quick check of the size of GHCN and either one will do:
chiefio@odroidxu4:~/SQL/v3$ ls -l total 704796 [...] -rw-r--r-- 1 root root 786240 Jan 23 21:14 inventory.in -rw-r--r-- 1 root root 1048320 Jan 23 21:43 inv.out [...] -rw-r--r-- 1 chiefio chiefio 53690600 Jan 20 20:39 temps.in -rw-r--r-- 1 chiefio chiefio 344360400 Jan 20 20:40 temps.u.in
So the Inventory File even as a tab column list is only 1 MB, and the entire GHCNv3 temperature data file similarly expanded is 344 MB. They can both be sucked into a Pandas Dataframe of either word size Python. (And fit nicely in memory of my 2 GB XU4…)
At least I can now ignore what “size” Python to run…
CSV, Dataframes, and more
The examples of how to do this generally just call an “object” that is in the Pandas library to do the loading. They universally give lip service to some OTHER format load choices, then show only the CSV (Comma Separated Values) load. Why the world so loves CSV is beyond me. Text and numbers are often FULL of commas that cause grief for that mode. So I’m going to end up “discovering” what those other modes of data loading are and working it out myself. But for my first attempt, I’m going to go ahead and do CSV. On the “inventory” file as it is the smaller one.
A ‘quick search’ showed it very comma infested:
root@odroidxu4:/SG2/ext/chiefio/SQL/v3# grep "," inventory.in 10266460001 -17.1000 15.7000 1150.0 PEREIRA D,ECA ANGO 1095R -9FLxxno-9A-9SUCCULENT THORNSA 10468026000 -18.3700 21.8500 1000.0 SHAKAWE 920R -9FLDEno-9A-9MARSH, SWAMP C 10764910000 4.0000 9.7300 9.0 DOUALA OBS. 12U 458FLxxCO 5A 1MARSH, SWAMP C 12061816000 10.9300 -14.3200 69.0 BOKE 41R -9HIFOno-9x-9MARSH, SWAMP A 12161769000 11.5800 -15.4800 20.0 BOLAMA 3S 10FLxxCO 3A 3MARSH, SWAMP A [...] 20550983000 45.7700 132.9700 103.0 HULIN 76S 20FLxxno-9x-9BOGS, BOG WOODS C 20551334000 44.6200 82.9000 321.0 JINGHE 441R -9MVDEno-9x-9MARSH, SWAMP B 20551716000 39.8000 78.5700 1117.0 BACHU 1331R -9FLDEno-9x-9MARSH, SWAMP A 20551730000 40.5000 81.0500 1013.0 ALAR 1070R -9FLDEno-9x-9MARSH, SWAMP A[...] 21047678001 30.5000 140.3000 83.0 TORISHIMA,JAPAN 0R -9HIxxCO 1x-9WATER A 21544212001 49.8000 92.1000 934.0 MONGOLIAN STATION,BAYAN-OL 1203S 20MVxxno-9x-9COOL DESERT A 21544213001 49.7000 96.4000 -999.0 BAYAN UUL, DZAVHAN 1680R -9MVxxno-9x-9COOL DESERT A 21544214001 48.3000 89.5000 -999.0 ALTAY, BAYAN-OLIGY 2624R -9MVxxno-9x-9COOL GRASS/SHRUBA 21544218001 47.1000 92.8000 -999.0 MONGOLIAN STATION, HOVD 1390R -9MVDEno-9x-9WARM GRASS/SHRUBA 21544237001 48.2000 99.9000 -999.0 MONGOLIAN STATION, N. HANG 2529R -9MVxxno-9x-9COOL CONIFER A 21544239001 49.0000 104.1000 -999.0 ERDENET, BULGAN 1403R -9HIxxno-9A-9COOL GRASS/SHRUBB 21544241000 48.9000 106.1000 807.0 BAYAN-GOL, SELENGE 914R -9HIxxno-9x-9COOL GRASS/SHRUBA 21544241002 48.9000 106.9000 -999.0 MONGOLIAN STATION, SELENGE 1311R -9MVxxno-9x-9WARM CROPS A 21544241003 49.2000 105.4000 -999.0 ORBON, SELENGE 964R -9HIxxno-9x-9COOL GRASS/SHRUBA 21544241004 49.8000 106.7000 -999.0 YOROO, SELENGE 958R -9FLMAno-9x-9COOL GRASS/SHRUBA 21544241005 50.1000 106.2000 -999.0 SHAAM, SELENGE 731R -9HIxxno-9x-9COOL GRASS/SHRUBA [...]
So I get to “de-comma” it… As I learned vi back in the dark ages and it now resides in my brain stem so the magic key patterns just happen without my knowing what they are anymore… I use the vi editor (renamed “vim” in Linux but they have an alias to vi). In vi, you can use “Regular Expressions” (just as in “grep” – global regular expression print). In vi, the “globally replace” command is, at the : prompt
You may hate the excessive terseness of hard core *Nix, or love it, but for me being able to replace all “,” with “;” in 10 CHAR is a good thing. To decode that line, the “:” gets you a command prompt in vi, then “g” is “globally”, meaning “do all lines” (you can also do line ranges) the “/,/” says “search for comma”, the “s” is “substitute” “//” for that thing you found (the comma) “;/” “a semicolon”, “g” globally on each line – i.e. not just the first one on the line but anywhere and everywhere in that row. All that in 10 CHAR.
Why use a “;”? Because there were none in the file:
root@odroidxu4:/SG2/ext/chiefio/SQL/v3# grep ";" inventory.in root@odroidxu4:/SG2/ext/chiefio/SQL/v3#
So there will not be a collision with any other meaning.
Now, with it stripped of the dreaded comma, I’ve modified my “Tab inserting FORTRAN program” to insert a comma between fields. I did a similar “global substitution” of “COM” for “TAB” and set COM=”,”. Also changed the input / output file names just to keep things clear.
root@odroidxu4:/SG2/ext/chiefio/SQL/v3# head semiinven.in 10160355000 36.9300 6.9500 7.0 SKIKDA 18U 107HIxxCO 1x-9WARM DECIDUOUS C 10160360000 36.8300 7.8200 4.0 ANNABA 33U 256FLxxCO 1A 7WARM CROPS C 10160390000 36.7200 3.2500 25.0 DAR-EL-BEIDA 34U 1365FLxxCO10A 6WARM CROPS C 10160395001 36.5200 4.1800 942.0 FT. NATIONAL 805R -9MVDEno-9x-9WARM CROPS A 10160400001 36.8000 5.1000 230.0 CAP CARBON 28R -9HIxxCO 1x-9WATER A 10160402000 36.7200 5.0700 2.0 BEJAIA 121U 90HIxxCO 1A 3WATER B 10160403000 36.4700 7.4700 227.0 GUELMA 287S 47HIxxno-9x-9WARM CROPS C 10160419000 36.2800 6.6200 694.0 CONSTANTINE 563U 335MVxxno-9A 7WARM FOR./FIELD B 10160425000 36.2200 1.3300 143.0 CHLEF 242U 106HIxxno-9A 3WARM CROPS C 10160425001 36.1700 1.5000 112.0 ORLEANSVILLE 219R -9HIDEno-9x-9WARM CROPS A root@odroidxu4:/SG2/ext/chiefio/SQL/v3# ls *.f ccodesv3.f csvinvn.f inven.f stationdat.f
So that csvinvn.f is the FORTRAN to put this input file (where I’ve swapped commas to ;) in CSV condition so Pandas can load it “in the usual way”. Someday later I’ll work out how to use some non-CSV Dataframe load; but this gets me over the hump to “doing something in Pandas” with about 2 minutes of effort. Far less than it took to do this writeup about it ;-)
This input file is semiinven.in as it is full of semi-colons ;-)
root@odroidxu4:/SG2/ext/chiefio/SQL/v3# gfortran csvinvn.f root@odroidxu4:/SG2/ext/chiefio/SQL/v3# ./a.out > invent.csv [...] root@odroidxu4:/SG2/ext/chiefio/SQL/v3# head invent.csv Gv3 ,7Sept2015 ,QCU ,1,101,60355,000, 36.9300, 6.9500, 7.0,SKIKDA , 18,U, 107,HI,xx,CO, 1,x,-9,WARM DECIDUOUS ,C Gv3 ,7Sept2015 ,QCU ,1,101,60360,000, 36.8300, 7.8200, 4.0,ANNABA , 33,U, 256,FL,xx,CO, 1,A, 7,WARM CROPS ,C Gv3 ,7Sept2015 ,QCU ,1,101,60390,000, 36.7200, 3.2500, 25.0,DAR-EL-BEIDA , 34,U, 1365,FL,xx,CO,10,A, 6,WARM CROPS ,C Gv3 ,7Sept2015 ,QCU ,1,101,60395,001, 36.5200, 4.1800, 942.0,FT. NATIONAL , 805,R, -9,MV,DE,no,-9,x,-9,WARM CROPS ,A Gv3 ,7Sept2015 ,QCU ,1,101,60400,001, 36.8000, 5.1000, 230.0,CAP CARBON , 28,R, -9,HI,xx,CO, 1,x,-9,WATER ,A Gv3 ,7Sept2015 ,QCU ,1,101,60402,000, 36.7200, 5.0700, 2.0,BEJAIA , 121,U, 90,HI,xx,CO, 1,A, 3,WATER ,B Gv3 ,7Sept2015 ,QCU ,1,101,60403,000, 36.4700, 7.4700, 227.0,GUELMA , 287,S, 47,HI,xx,no,-9,x,-9,WARM CROPS ,C Gv3 ,7Sept2015 ,QCU ,1,101,60419,000, 36.2800, 6.6200, 694.0,CONSTANTINE , 563,U, 335,MV,xx,no,-9,A, 7,WARM FOR./FIELD ,B Gv3 ,7Sept2015 ,QCU ,1,101,60425,000, 36.2200, 1.3300, 143.0,CHLEF , 242,U, 106,HI,xx,no,-9,A, 3,WARM CROPS ,C Gv3 ,7Sept2015 ,QCU ,1,101,60425,001, 36.1700, 1.5000, 112.0,ORLEANSVILLE , 219,R, -9,HI,DE,no,-9,x,-9,WARM CROPS ,A root@odroidxu4:/SG2/ext/chiefio/SQL/v3#
And “Bob’s Your Uncle!” it’s done. A CSV file for input.
For completion, this is a listing of the csvinvn.f program, though it is only trivially different from the TAB one.
root@odroidxu4:/SG2/ext/chiefio/SQL/v3# cat csvinvn.f C FORTRAN to read the inventory files v3 GHCN file and insert COMMA C in the output. Also divides "country" into Continent and Country C C Variable declarations... C CHARACTER * 1 COM CHARACTER * 5 VERSION CHARACTER * 10 ASCEN CHARACTER * 4 TYPE, GRIDELEV CHARACTER * 1 CONT,POPC,AIR,POPNL CHARACTER * 2 COUNTRY,TOPOT,VEG,PROXW,DISTW,DISTAU CHARACTER * 5 WMO,PSIZE CHARACTER * 3 NEAR CHARACTER * 6 STNELEV CHARACTER * 8 LATITUDE CHARACTER * 9 LONGITUDE CHARACTER * 16 VEGGRID CHARACTER * 30 NAME C C Set the COM character COM="," C C Set some constants VERSION="Gv3" ASCEN="7Sept2015" TYPE="QCU" C C Read in one line of data... C 9 OPEN(1, FILE='semiinven.in', STATUS='OLD', ACTION='READ') 10 READ (1, 11, END=99) CONT, COUNTRY, WMO, NEAR,LATITUDE, & &LONGITUDE,STNELEV,NAME,GRIDELEV,POPC,PSIZE,TOPOT,VEG,PROXW, & &DISTW,AIR,DISTAU,VEGGRID,POPNL C 11 FORMAT (A1,A2,A5,A3,X,A8,X,A9,X,A6,X,A30,X,A4,A1,A5,A2,A2,A2, & &A2,A1,A2,A16,A1) C C Convert CHAR to Float C READ (T(I),*,END=20) F(I) C C Write out one line of data with COM between fields C C WRITE (6, 6) VERSION,ASCEN,TYPE,CONT,COUNTRY,WMO,NEAR,LATITUDE, & C &LONGITUDE,STNELEV,NAME,GRIDELEV,POPC,PSIZE,TOPOT,VEG,PROXW,DISTW, & C &AIR,DISTAU,VEGGRID,POPNL C 6 FORMAT (A5,A10,A4,A1,A2,A5,A3,A8,A9,A6,A30,A4,A1,A5,A2,A2,A2, & C &A2,A1,A2,A16,A1) WRITE (6, 7) VERSION,COM,ASCEN,COM,TYPE,COM,CONT,COM,CONT,COUNTRY,& &COM,WMO,COM,NEAR,COM,LATITUDE,COM,LONGITUDE,COM,STNELEV,COM,NAME, & &COM,GRIDELEV,COM,POPC,COM,PSIZE,COM,TOPOT,COM,VEG,COM,PROXW,COM, & &DISTW,COM,AIR,COM,DISTAU,COM,VEGGRID,COM,POPNL 7 FORMAT (A5,A1,A10,A1,A4,A1,A1,A1,A1,A2,A1,A5,A1,A3,A1,A8,A1,A9,A1,& &A6,A1,A30,A1,A4,A1,A1,A1,A5,A1,A2,A1,A2,A1,A2,A1, & &A2,A1,A1,A1,A2,A1,A16,A1,A1) C Retrieve another line of data... C GO TO 10 C C If end of file, then stop. C 99 STOP END root@odroidxu4:/SG2/ext/chiefio/SQL/v3#
I’m now going to spend some unknown number of hours attempting to suck that CSV file into a Pandas Dataframe and graph something. Perhaps number of WMO by latitude… or a scatter chart of that…
Whem I’m done, this posting will get an UPDATE here.
Though first I think I’m going to go get breakfast and fresh coffee… Yes, this was done before breakfast ;-)
After a lot of wandering in the forest (documented in comments below) I made my first plot. It is LATitude and LONgitude of stations on a graph. No idea if it is right, or not, or what. But here it is.
This will just be the latitude and longitude for stations plotted against the order of records. It has some quasi meaning as records are arranged by continent, so sort of in physical blobs. Yeah, mostly meaningless. Doesn’t matter. I’ve now done the whole thing from data format to load to graph. So from here on out it is just polishing and adding incremental skills.