The GIStemp Data-Go-Round

As I go through each step, documenting where data come from and go to, this page will be updated with more details. It is as much a place where I can put my notes and keep track of things as it is a place where you can marvel at the mindless complexity of the data flow.

What really ought to happen is that GIStemp ought to have a simple database structure under it. Then almost all of this complicated “Data-Go-Round” would be reduced to data pre-processing in the database load scripts, as the different data formats were matched to the database (which could be a single flat-file relation… nothing fancy.)

In this description, I am basing program and script names on what I have named them in my “cleaned up” version. In the original, scripts and FORTRAN programs often had the same name. Confusing. For the version that I have compiled, I’ve made all FORTRAN source foo.f and all compiled executables from FORTRAN foo.exe while all scripts are named foo.sh (so when you see that “foo” does something you don’t need to wonder if it’s the foo script or the foo program…)

A side note on names: names of files in your present directory, wherever you are, start with the characters “./” (this is Unix syntax meaning “starting right where you are – the ‘.’ is the current directory and the ‘/’ is the path separator – look for a file named {whatever follows}”). I do this to make it absolutely clear that a file is just written or read from wherever you run the program or script. Proper coding practice would use what is called a “fully qualified file name”, something like “/GIStemp/STEP0/input_files/foo”, but the code is a bit lax in that regard…

For things in a specific directory, I don’t bother to put the ./ in front. It would be more precise to do so, but it makes long file names a bit hard to read.

Step Minus One:

It is left as an exercise for the user to gather together the data sets that GIStemp uses. There is a bit of a pointer to the sites that have the data, but a lot of it is left to the imagination. See the “STEP0” description for a bit more detail on where the data comes from. Or see the GIStemp tab at the top for a higher level introduction.

I’ve written an “ftp wrapper script” that gets a bunch of the data, but I’ve not yet got it done enough to get all the Antarctic data. There is also a bit of a ‘hand edit job’ that gets done to the data anyway, to make sure it’s ready to feed into GIStemp.

I’ll come back to this section a bit later. For now, just Be Advised that the data set you see used in following steps is not exactly handed over in final form and ready to go in all cases…

Step Zero:

For purely hysterical reasons, GIStemp formally starts with a step named STEP0. Yes, it has all the indicia of having been added on “after the fact”, as a way to get the Antarctic data into the system and to put in a “special” version of the data for Hohenpeissenberg.

We will start there.

Each STEP has a master controlling script that is run to orchestrate the process. This script sometimes has some processing in it as well. I generally talk about these as the “do_” scripts. For this step, the name is:

do_comb_step0.sh

inputs: input_files/v2.mean (from ftp)

As we go through what it does, I say “Calls:” when it calls another executable (script or FORTRAN) or mention a specific unix command if it does something interesting to the data (as opposed to just removing some scratch file it created…)

Calls: antarc_to_v2.sh

inputs: input_files/antarc1.txt
inputs: input_files/antarc2.txt
inputs: input_files/antarc3.txt

output: ./v2_antarct.dat

scratch: ./v2_antarct.datt

(there is a circular sed, sed, sort from .dat into .datt to .dat)
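The actual sed expressions live in antarc_to_v2.sh; as a hedged sketch of the pattern only (the sed expressions and the sample data below are invented, the file names are the real ones), the .dat → .datt → .dat shuffle has this shape:

```shell
# Hypothetical sketch of the .dat -> .datt -> .dat shuffle in antarc_to_v2.sh.
# The real sed expressions differ; this only shows the shape of the pipeline.
printf 'zzz 1990-12\naaa 1990 -12\n' > v2_antarct.dat    # tiny stand-in for the real data

sed -e 's/-/ -/g'  v2_antarct.dat  > v2_antarct.datt     # pass 1 (expression invented)
sed -e 's/  */ /g' v2_antarct.datt > v2_antarct.dat      # pass 2 (expression invented)
sort v2_antarct.dat > v2_antarct.datt                    # sort into the scratch file
mv v2_antarct.datt v2_antarct.dat                        # ...and back to the real name

cat v2_antarct.dat
```

Note that each pass needs its own output file name (you can’t sed or sort a file onto itself), which is why the .datt scratch file keeps appearing and disappearing.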

Calls: antarc_comb.exe

inputs: ./v2_antarct.dat
inputs: input_files/v2.mean

output: ./v2.meanx

Calls: dumpold.exe (parameter of 1880 as cutoff date)

inputs: ./v2.meanx

output: ./v2.meany

Calls: get_USHCN.sh

The comment says:

“replacing USHCN station data in $1 by USHCN_noFIL data
(Tobs+maxmin adj+SHAPadj+noFIL) reformat USHCN to v2.mean format”

inputs: input_files/hcn_doe_mean_data
inputs: input_files/ushcn.tbl

output: ./hcn_doe_mean_data_fil
output: ID_US_G

Note that ID_US_G is just a numerically sorted (sort -n) version of input_files/ushcn.tbl.

This whole get_USHCN.sh script is so small, I wonder why it isn’t pulled ‘in line’ into the do_ script. All I can figure is that I’ve taken the f77 compile out of the script and it’s an artifact of the compile / run / delete behaviour, which does not make sense when compiles are done with a Makefile. I’ll quote the whole thing here “as is” from my “cleaned up” version:

BEGIN QUOTE:

echo "extracting FILIN data"
grep -e " 3A" < input_files/hcn_doe_mean_data > hcn_doe_mean_data_fil

echo "getting inventory data for v2-IDs"
sort -n input_files/ushcn.tbl > ID_US_G

./bin/USHCN2v2.exe

END QUOTE.

That’s it. Three active lines and two advisory notices.

The get_USHCN.sh script Calls: USHCN2v2.exe

inputs: ./ID_US_G
inputs: ./hcn_doe_mean_data_fil

output: ./USHCN.v2.mean_noFIL
output: ./USHCN.v2.mean_FIL
output: ./USHCN.last_year (a place to store the year with last data)

The do_ script then reads in the last_year value from ./USHCN.last_year.
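A sketch of how a Bourne-style do_ script typically reads one value back from a file like this. The shell variable name is my invention; the file name is the real one from STEP0 (the echo line just stands in for USHCN2v2.exe writing the file):

```shell
# Stand-in for USHCN2v2.exe writing out the last year with data:
echo 2009 > USHCN.last_year

# The usual shell idiom for reading one value back from a file;
# "last_year" is my guess at the variable name.
read last_year < USHCN.last_year
echo "last year with data: $last_year"
```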

grep is then used by the do_ script

inputs: ./v2.meany

output: ./ghcn_us
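A hedged guess at what this grep does: GHCN v2 station IDs begin with a 3-digit country code, and 425 is the United States, so pulling the US stations out of v2.meany most likely looks like the line below (the exact pattern is my assumption, and the two records are fakes just long enough to show the idea):

```shell
# Two fake records: one with US country code 425, one with 101 (Algeria).
printf '4257221990...\n1016031990...\n' > v2.meany

# Keep only lines whose station ID starts with country code 425 (US).
# That the do_ script greps on "^425" is my assumption, not a quote.
grep '^425' v2.meany > ghcn_us
wc -l < ghcn_us
```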

Calls: dumpold.exe (parameter of 1980 for cutoff date)

inputs: ./ghcn_us

output: ./ghcn_us_end

grep is then used by the do_ script

inputs: ./USHCN.v2.mean_noFIL

output: ./xxx (Honest! That’s what they chose as a name…)

Calls: dumpold.exe (parameter of 1880 for cutoff date)

inputs: ./xxx

output: ./yyy

The do_ script uses sort

inputs: ./yyy

output: ./USHCN.v2.mean_noFIL (overwrites old version…)
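The overwrite is only safe because the input (./yyy) and the output are different files; a plain “sort file > file” would truncate the file before sort ever read it. A minimal demonstration of the safe form (sample data invented):

```shell
# Two out-of-order records standing in for ./yyy:
printf 'bbb\naaa\n' > yyy

# Sorting into a *different* file name is safe; "sort yyy > yyy"
# would empty yyy before sort could read it.
sort yyy > USHCN.v2.mean_noFIL    # replaces the earlier version wholesale
head -n 1 USHCN.v2.mean_noFIL
```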

Calls: dif.ushcn.ghcn.exe ( parameter of last_year )

inputs: ./USHCN.v2.mean_noFIL
inputs: ./ghcn_us_end
inputs: ./USHCN.last_year (as passed parm via script read)

output: ./ushcn-ghcn_offset_noFIL

Calls: cmb2.ushcn.v2.exe

inputs: ./v2.meany
inputs: ./USHCN.v2.mean_noFIL
inputs: ./ushcn-ghcn_offset_noFIL

output: ./v2.meanz
output: ./ghcn.last_year
output: ./ushcn.log (as redirected output of run in script)

Next comes the Hohenpeissenberg special case processing.

The do_ script uses tail +100 (which skips the first 99 records, then prints from record 100 onward).

inputs: input_files/t_hohenpeissenberg_200306_as_recieved_July17_2003

output: ./t_hohenpeissenberg
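“tail +100” is the old syntax; POSIX spells the same thing “tail -n +100”. Either way it prints from line 100 onward, i.e. drops the first 99 header records. A quick check with a stand-in file (the file names below are invented):

```shell
# Stand-in for the 100+ line Hohenpeissenberg input file:
seq 1 105 > hohen_sample

# POSIX spelling of "tail +100": print from line 100 to the end,
# which drops the first 99 lines and keeps lines 100..105 here.
tail -n +100 hohen_sample > hohen_out
wc -l < hohen_out
```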

Calls: hohp_to_v2.exe

inputs: ./t_hohenpeissenberg
output: ./v2.t_hohenpeissenberg

Calls: cmb.hohenp.v2.exe

inputs: ./v2.t_hohenpeissenberg
inputs: ./v2.meanz
output: ./v2.mean

The do_ script then makes the directories to_next_step and work_files, with a forced option and with any error messages redirected to /dev/null. Not the way I would do it…
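What that most likely looks like (the exact flags are my guess): -p is the “forced” part, in that it makes parents as needed and doesn’t complain when the directory already exists, and the redirect throws away any error it does emit:

```shell
# Make both directories; -p suppresses "already exists" failures, and
# the redirect silently discards any remaining error message.
mkdir -p to_next_step work_files 2> /dev/null

# A defensive alternative would check and fail loudly instead:
[ -d to_next_step ] && [ -d work_files ] && echo "directories ready"
```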

Notice that the output file ./v2.mean has the same name as the input file input_files/v2.mean, but is a significantly different file with much different stuff in it. This file is then moved into the directory to_next_step, and renamed:

The do_ script uses mv to move the dataset (rename it)

inputs: ./v2.mean

output: ./to_next_step/v2.mean_comb

so the dataset is then named: STEP0/to_next_step/v2.mean_comb.

The script then advises you to move this file by hand into the STEP1 directory:

STEP1/to_next_step.

It logically ought to be in STEP1/input_files, but I’m not going to change anything just yet. At some point, though, the idea that a STEP ought to look in its “to_next_step” directory for input is just broken.

So here the exact same data file has had three names inside a half dozen lines of code, with nothing at all done to it in the process of the Data_Go_Round.

They then tell you to run: do_comb_step1.sh v2.mean_comb

Step One:

This is as far as I’ve gotten. More “soon”…

About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present "hot buttons" are the mythology of Climate Change and ancient metrology; but things change...
This entry was posted in GISStemp Technical and Source Code. Bookmark the permalink.

11 Responses to GIStemp_Data_Go_Round

  1. Gary says:

    Any chance of a flowchart for us visually-oriented types?

  2. E.M.Smith says:

    All in good time, Gary!

    It was hard enough just extracting all the guzintas and guzoutas in the first place! I decided to put up the basic list (since that was what I had) and take a break.

    I also need to figure out how to make flowcharts that can be published on WordPress… (I’m both “paper and pencil” and MS Project style, but have no idea how to make a simple GIF that WoodPress would like. I’m sure it’s not hard, it just takes time to figure out… and right now that time is going elsewhere…)

  3. Ellie in Belfast says:

Could you start with cropped screen shots of an MS Project flowsheet (if you’ve done one)? MS Excel does flowsheets easily and might be more flexible.

  4. Tony Hansen says:

    Thank you E.M

  5. E.M.Smith says:

    @Tony: you are most welcome.

  6. drj11 says:

    @Gary: See slides 10 and 11 of the PDF you can find on this page:

    http://clearclimatecode.org/doc/2008-09-11/pyconuk/

  7. E.M.Smith says:

    drj11
    http://clearclimatecode.org/doc/2008-09-11/pyconuk/

    Thank You!

    BTW, if you would like to “host” a tarball of the ported FORTRAN version of GIStemp, I have one ready to go, but no place to put it.

    It could be useful to you in two ways:

1) It lets you A/B compare your code product with fixed datasets to demonstrate a match of product (i.e. you can prove correctness and reproduction of GIStemp product with fixed data series).

    2) It lets you make GIStemp comparison products that are done with exactly the same dataset you run in your code. Without it, you have the risk that the published GIStemp graphs and charts diverge due to the data being different and folks attributing that to your port / code rather than GIStemp having a slightly different input data set.

  8. Jeff Alberts says:

    I can host whatever you need, and make it publicly available or password protected.

  9. E.M.Smith says:

    @Jeff

    Thanks! It will take about 1.7 MB all told for both tarballs. One is a “source code only you download all the data sets even the little ones” of 65KB, the other includes the misc. small data sets and only leaves out a couple of the really big ones. It’s about 1.65 MB.

    Please contact me via the email address in the about tab up top as to how I place these on your server. I don’t expect them to change much since they run now as is. And despite my hope that thousands of folks will step up and join the effort / process, the reality is that there will likely not be much download traffic…

    E.M.Smith

  10. Jeff Alberts says:

    EM, just emailed you at the pub4all address.

  11. E.M.Smith says:

    Thanks Jeff, I’ll get to it tonight or tomorrow afternoon.

Comments are closed.