GIStemp STEP0 – The Process

Where Data Becomes Breakfast

(It is my belief that as a non-commercial activity this is a 'fair use' image for purposes of political satire and education.)

 

Step0 was added after the fact (thus the "0") and shows elements of a 'glue data together by hand' mindset. It looks to me like a "let's add in the Antarctic and a few bits" step. It also deletes data older than 1880, but no reason is given for why. Could it be because it is a "cherry pick"?

 

Sizes and Listings

 

Let's look inside STEP0 to see how big things are. Files ending in ".f" are FORTRAN program source code (which, when compiled into something the computer can run, gets the ".exe" executable binary suffix) while those ending in ".sh" are Unix / Linux shell scripts. Oh, and "*" is the "wildcard character" that says "match anything", so *smith would match "Goldsmith" and "Tinsmith" and… So just taking a look at the line count numbers in the text files (cd STEP0 ; wc -l *):

      31 input_files/sorts.f
      47 USHCN2v2.f
      87 antarc_comb.f
      10 antarc_comb.sh
      38 antarc_to_v2.sh
      40 cmb.hohenp.v2.f
      65 cmb2.ushcn.v2.f
      70 dif.ushcn.ghcn.f
      65 do_comb_step0.sh
      22 dump_old.f
       9 get_USHCN
      17 hohp_to_v2.f
      20 step0_README.txt
     521 total

 

This is typical of all the steps. For STEP0 there are about 521 lines of code in total.

STEP0 gathers together data from several sources, puts it into compatible formats, and merges it together into a single usable dataset. It also deletes old data, manipulates GHCN vs USHCN in an odd way, and deletes GHCN records for the US older than the USHCN record.

 

Scripts:

 

Note – all FORTRAN is compiled to runnable binaries in-line in the scripts, executed, and then the binaries are deleted. You would expect the human-readable source code to live in a distinct archive, be compiled into a distinct program library once, and be deleted only when a newer version had passed a Quality Assurance test.

One comment on style. I always learned that you created and used your work files in a work file directory; this code creates and uses them in the same directory as the program source code, then when done, moves them into a work directory. Not what you would expect.

antarc_to_v2.sh

creates v2_antarct.dat by merging the antarc*.txt files, sorting, and swapping the '-' missing-data marker to '-999.9'

antarc_comb.sh

uses antarc_comb.f to merge v2_antarct.dat with v2.mean, giving v2.meanx

get_USHCN

finds '3A' data in hcn_doe_mean_data and copies them to hcn_doe_mean_data_fil, then sorts these lines numerically (by ID#) to produce ID_US_G (I think this is specifically selecting the Filnet data from the whole mass.)

do_comb_step0.sh

This is the start. It runs antarc_to_v2.sh, compiles and runs antarc_comb.f (which adds the Antarctic data to GHCN), runs dump_old.f on the combined data, then get_USHCN, then USHCN2v2.f to replace GHCN dups with USHCN and change the format to be like v2. So far this makes sense, other than the rationale for GHCN vs USHCN. What makes one better?

The script then runs dump_old.f on the GHCN US-only data (to create a 1980-and-newer subset for use in calculating an 'offset' between GHCN and USHCN), then dif.ushcn.ghcn.f (which creates that odd 'offset' between GHCN and USHCN), then cmb2.ushcn.v2.f (to 'find the offset caused by adjustment' in the original data and remove it via the 'offset' from the prior step), and then hohp_to_v2.f and cmb.hohenp.v2.f to add in a better version of the data for Hohenpeissenberg (from input_files). It then creates the directories to_next_step and work_files, moves files into them, and finishes.

That whole offset un-offsetting process just seems strange to me. Why not just use the non-adjusted data series OR use the already adjusted series? Why blend 1/3 of one with 1/3 of the other with 1/3 being unoffsetterized? This just looks like a great chance to introduce into the series that which is not in the original data sets.

 

FORTRAN source:

 

USHCN2v2.f

Converts USHCN data into a more v2 (GHCN) like format

antarc_comb.f

Combines GHCN data with antarctic data

cmb.hohenp.v2.f

Combines the Hohenpeissenberg data into the v2 data

cmb2.ushcn.v2.f

Combines the USHCN data with the merged Antarctic and GHCN data

dif.ushcn.ghcn.f

Finds overlapping records in the USHCN and GHCN data sets, makes a ‘dif’ value. The rationale of this step needs serious scrutiny.

dump_old.f

Removes records older than a selected date (1880, passed as an argument from the scripts)

hohp_to_v2.f

Converts the Hohenpeissenberg data to be more like v2 (GHCN) format

 

A Deeper Look at the Top Level Script

 

We will take a brief look at the top level controlling script do_comb_step0.sh and after the listing of it I’ll describe what it does.

Begin listing of “do_comb_step0.sh”:

#!/bin/ksh

if [[ $# -ne 1 ]] ; then echo "Usage: $0 v2.mean" ; exit ; fi

infile=input_files/$1
if [[ ! -s $infile ]] ; then echo "$infile not found" ; exit ; fi

fortran_compile=$FC
if [[ $FC = '' ]]
then echo "set an environment variable FC to the fortran_compile_command like f90"
echo "or do all compilation first and comment the compilation lines"
exit
fi

echo "Bringing Antarctic tables closer to ${infile} format"
./antarc_to_v2.sh

echo "adding extra Antarctica station data to ${infile}"
${fortran_compile} antarc_comb.f -o antarc_comb.exe
antarc_comb.exe v2_antarct.dat ${infile}
echo "created v2.meanx from v2_antarct.dat and ${infile}"
rm -f v2_antarct.dat antarc_comb.exe

# echo "removing pre-1880 data:"
echo ; echo "GHCN data:"
${fortran_compile} dump_old.f -o dump_old.exe
dump_old.exe v2.meanx v2.meany 1880 ; echo "created v2.meany from v2.meanx"
rm -f v2.meanx

echo " "
echo "replacing USHCN station data in $1 by USHCN_noFIL data (Tobs+maxmin adj+SHAPadj+noFIL)"
echo " reformat USHCN to v2.mean format"
${fortran_compile} USHCN2v2.f -o USHCN2v2.exe ; ./get_USHCN
read last_year < USHCN.last_year

echo "finding offset caused by adjustments" ; echo "extracting US data from GHCN set"
grep ^425 v2.meany > ghcn_us
dump_old.exe ghcn_us ghcn_us_end 1980
echo "getting USHCN data:"
grep -v -e -9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999 USHCN.v2.mean_noFIL > xxx
dump_old.exe xxx yyy 1880 ; rm -f xxx
sort -n yyy > USHCN.v2.mean_noFIL ; rm -f yyy
${fortran_compile} dif.ushcn.ghcn.f -o dif.ushcn.ghcn.exe ; dif.ushcn.ghcn.exe ${last_year}
echo "created ushcn-ghcn_offset_noFIL "
${fortran_compile} cmb2.ushcn.v2.f -o cmb2.ushcn.v2.exe ; cmb2.ushcn.v2.exe > ushcn.log
echo "created v2.meanz"

echo "replacing Hohenspeissenberg data in $1 by more complete data (priv.comm.)"
echo "disregard pre-1880 data:"
tail +100 input_files/t_hohenpeissenberg_200306.txt_as_received_July17_2003 > t_hohenpeissenberg
${fortran_compile} hohp_to_v2.f -o hohp_to_v2.exe ; hohp_to_v2.exe
${fortran_compile} cmb.hohenp.v2.f -o cmb.hohenp.v2.exe
cmb.hohenp.v2.exe

# cleanup
mkdir to_next_step 2> /dev/null
mkdir work_files 2> /dev/null
mv v2.mean to_next_step/v2.mean_comb
mv *exe v2.mean* ghcn* ushcn* hcn* USHCN.v2.mean_noFIL ID_US_G *hohenpeiss* work_files/.

echo ; echo "created v2.mean_comb"
echo "move this file from to_next_step/. to ../STEP1/to_next_step/. "
echo "and execute in the STEP1 directory the command:"
echo " do_comb_step1.sh v2.mean_comb"

End of script.

First up, we see that it expects the input file v2.mean in the input_files directory. We know from inspection that it is not there as unpacked, but is called out in the directions as a file to be downloaded and unpacked. Good enough. It is the GHCN data.

The script next checks that you have a FORTRAN compiler and that it is defined in your environment variables. Why all the FORTRAN compilation is not done in a nice 'make' file, with the source archive separated from the binaries and execution workspace, is unclear. But let's roll with it. Maybe they want to treat FORTRAN like an interpreted language such as shell scripting or BASIC and find this a convenient way to allow 'tinkering' without worrying about production structures like source archives and make files (and version control, and …)
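Just to illustrate what that would look like, here is a minimal "compile once, up front" sketch of my own (not part of GIStemp; the bin/ directory name is my invention): build every FORTRAN source in the step into a binaries directory once, and let the driver script just call the binaries without recompiling or deleting them.

# hypothetical one-time build pass, run from the STEP0 directory
mkdir -p bin
for src in *.f ; do
  ${FC:-f90} $src -o bin/${src%.f}.exe     # compile each .f once, keep the binary
done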

So next we want to make the formats of the input files closer to "v2.mean"-like. We run the script antarc_to_v2.sh, which does things like turn the "missing data markers" in antarc*.txt into the same kind used by the v2.mean file and puts the result into the output file v2_antarct.dat, giving us one combined Antarctic data set.
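For flavor, here is a rough stand-in for that kind of clean-up. It is not the actual antarc_to_v2.sh, and the exact form of the missing-data marker (a bare "-" field) is my assumption:

# sketch only: merge the Antarctic text files, turn bare "-" fields into -999.9, sort
cat input_files/antarc*.txt |
awk '{ for (i = 1; i <= NF; i++) if ($i == "-") $i = "-999.9" ; print }' |
sort > v2_antarct.dat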

A sidebar on antarc_comb: At this point it gets just a bit messy. The script antarc_comb.sh seems to do the same thing as the next block of the script does (compile and run antarc_comb.f to create v2.meanx from v2_antarct.dat and v2.mean); but a grep shows that nothing else contains the text "antarc_comb.sh". So I am left to assume antarc_comb.sh is a 'hand tool' to play with this step without doing an actual run.

I would have made it a working script and then just called it from do_comb_step0.sh, but as it stands, antarc_comb.sh explicitly calls the FORTRAN compiler f77 (ignoring the environment variable setting, raising the risk of compiler variations in the behaviour of the two paths) and takes explicit arguments for the antarctic data set and the v2.mean data set. It also links them to two temp input files named fort.1 and fort.2, with the output in fort.12, which it then moves to the v2.meanx file. At any rate, the product out is v2.meanx, which is a combined v2 and Antarctic data set.
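The fort.NN business is the old FORTRAN logical-unit convention; in general form the pattern looks something like this (a sketch of the pattern, not the exact script):

ln -s v2_antarct.dat fort.1   # unit 1: Antarctic data
ln -s v2.mean        fort.2   # unit 2: GHCN v2.mean data
./antarc_comb.exe             # reads units 1 and 2, writes unit 12
mv fort.12 v2.meanx           # keep the result under a real name
rm -f fort.1 fort.2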

The program dump_old.f is then compiled and removes data from prior to 1880 from v2.meanx to yield v2.meany. Be prepared for this constant mutating of the names of files. "Data data, who's got the data!" is the name of the game. At every stage in every STEP the data get slightly transformed and moved through a couple of new temporary file names on their way to yet another output file name.
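A hypothetical one-line stand-in for what dump_old.f does (my assumption here is that the 4-digit year begins in column 13 of a v2.mean style record – check the format before trusting it):

awk -v cutoff=1880 'substr($0, 13, 4) + 0 >= cutoff' v2.meanx > v2.meany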

Hold onto your hat for the next step. It is a bit convoluted, but just remember that at the end of the day the whole step is to pull the '3A' type records out of hcn_doe_mean_data and find the value of "the most recent year" containing data. Ready? Here we go…

Next the USHCN2v2.f program is compiled and the script get_USHCN is executed. So what does IT do? It sucks in the hcn_doe_mean_data file and makes a file hcn_doe_mean_data_fil with a copy of the type "3A" records in it. Then it sucks in the file input_files/ushcn.tbl and sorts it (sort -n) with the output in ./ID_US_G (both of these are used by USHCN2v2, when it is run, as its two input files). At the end, the script executes the program USHCN2v2, which produces the output files USHCN.v2.mean_noFIL and USHCN.last_year (the latter is then read to load the value of the last year found into the variable last_year in this script). Got all that?!?
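Boiled down, and taking my description above at face value (the real get_USHCN may differ in detail), the sequence is roughly:

grep '3A' hcn_doe_mean_data > hcn_doe_mean_data_fil   # pull out the "3A" records
sort -n input_files/ushcn.tbl > ID_US_G               # numerically sorted station table
./USHCN2v2.exe    # reads both, writes USHCN.v2.mean_noFIL and USHCN.last_year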

The data of interest are now in the USHCN.v2.mean_noFIL file.

A "grep" text search is used to find the records in v2.meany whose station IDs start with 425 (the US country code) and put copies, as an extract, in ghcn_us.

Our old friend dump_old is called again, but this time with the cutoff year being 1980 and the data being the extracted US-only GHCN records. The input is ghcn_us and the output is ghcn_us_end (which ought to contain only 1980 to present). Why 1980? "Why? Don't ask why, down that path lies insanity and ruin. -emsmith" I do find it curious that the cutoff date matches the end of the baseline used for comparisons elsewhere. So we use GHCN / USHCN adjusted by some odd 'offset' based only on recent time periods outside the 'baseline'… Pay close attention when we get to that 'adjustment' phase. We'll see…

Next comes the USHCN data. A "grep" text search removes lines full of non-data from USHCN.v2.mean_noFIL and places the output into a file named 'xxx', which then has dump_old run on it to remove anything prior to 1880 and leave the residue in a file named 'yyy', which is handed to sort (in numerical order) and put back into USHCN.v2.mean_noFIL where it started.

So what did we just do?

US data in GHCN was excerpted from 1980 to present and a copy placed into ghcn_us_end, and then we removed dead records from the USHCN.v2.mean_noFIL file and sorted the file by station ID. There's got to be an easier way…

 

Let The Games Begin


The next process is to compile and run dif.ushcn.ghcn.f feeding it the last_year parameter created above and creating as output the ushcn-ghcn_offset_noFIL file.

What dif.ushcn.ghcn.f does is to read in the files USHCN.v2.mean_noFIL (US with ‘3A’ removed) and ghcn_us_end (GHCN, only US, from 1980 to now) and produce the ushcn-ghcn_offset_noFIL file as output. It compares the USHCN record with the GHCN record (only back to 1980) looking for up to 10 years worth of common records.

When it finds 10 years, or finds that fewer than 10 are available since 1980, it takes the difference between "the monthly average temperatures of those years in GHCN and USHCN" and averages them together. If you only have one common year, you would compute DIFFERENCE = GHCN(year, eachmonth) – USHCN(year, eachmonth); for several years, it is the average of DIFFERENCE over the first (up to 10) years counting backwards from the present to 1980, giving DIFF(eachmonth) for the 12 months of a year. This is the "offset" used later to change the actual temperature records into…, er, um, something else.
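To make the arithmetic concrete, here is a small awk sketch of that offset calculation as I read it. It is illustrative only: it assumes simplified whitespace-separated records of the form "station year m1 … m12" (the real files are fixed-width), it ignores missing-value handling, it caps the overlap at 10 years without strictly counting backwards from the most recent year, and the "ushcn-ghcn_offset.sketch" file name is mine:

awk '
  NR == FNR {                  # first file: USHCN records, keyed by station+year
    key = $1 SUBSEP $2
    for (m = 1; m <= 12; m++) ushcn[key, m] = $(m + 2)
    next
  }
  $2 >= 1980 {                 # second file: US-only GHCN records, 1980 onward
    key = $1 SUBSEP $2
    if ((key, 1) in ushcn && years[$1] < 10) {
      years[$1]++
      for (m = 1; m <= 12; m++) {
        diff[$1, m] += $(m + 2) - ushcn[key, m]   # GHCN minus USHCN, per month
        n[$1, m]++
      }
    }
  }
  END {                        # one average offset per station per month
    for (s in years)
      for (m = 1; m <= 12; m++)
        if (n[s, m] > 0) printf "%s %d %.2f\n", s, m, diff[s, m] / n[s, m]
  }
' USHCN.v2.mean_noFIL ghcn_us_end > ushcn-ghcn_offset.sketch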

The only reason for this seems to be to bring the two curves into alignment when you glue GHCN onto USHCN. I don't see where this makes sense. One end or the other ought to be 'more right', or you are taking the risk of an error from gluing together two disjoint series and finding a bogus "trend". This deserves exploration by someone more familiar with the data series. When applied, it is applied to all past records. I fail to see why an equipment change in 2001 ought to change the record from 1888, but maybe that's just me…

We are nearing the end, so hang in there. Only 2 more steps to go and one is easy.

We then compile and run cmb2.ushcn.v2.f (producing the log file ushcn.log), with v2.meanz as the output file. What does it do? The input files are v2.meany, USHCN.v2.mean_noFIL and ushcn-ghcn_offset_noFIL. This guy copies all the non-425 (non-US) stations directly to the output (maintaining the sort order). It also copies to the output any USHCN stations that do not have a corresponding GHCN record for that year. The magic sauce is what happens to the records with both a USHCN and a GHCN entry.

In that case, skipping 'the early years' (those being any not in the USHCN record), the USHCN record is used, but it is decremented by the DIFFERENCE value found in the last step (that GHCN/USHCN difference or "offset" map we just computed: DIFF(eachmonth) above). It does not matter how far back in time that record came from; the 'difference' by month, calculated on at most 10 years (from 1980 at the oldest) to present, is subtracted from that record.
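Again as a sketch only (same simplified whitespace-separated record layout as before, and reading the offset file produced by the previous sketch – this is not the real cmb2.ushcn.v2.f), the substitution amounts to something like:

awk '
  NR == FNR { off[$1, $2] = $3 ; next }   # offset table: station, month, offset
  {                                       # USHCN records: station, year, 12 months
    printf "%s %s", $1, $2
    for (m = 1; m <= 12; m++)
      printf " %.1f", $(m + 2) - off[$1, m]   # missing offsets act as 0, value unchanged
    printf "\n"
  }
' ushcn-ghcn_offset.sketch USHCN.v2.mean_noFIL > v2.meanz.sketch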

I can see little rationale for this. Why is the USHCN record "better" than the GHCN sometimes? Why is subtracting a delta value more 'correct' than just using one record or the other? The stated rationale is to un-adjust the replacement record, yet the other GHCN records are left as-is. This is just mixing two different kinds of records willy-nilly, with a smoothing step so the disconnect doesn't show up; as near as I can tell.

The program then writes out v2.meanz as this composite modified data set and writes the maximum year found into the file GHCN.last_year for future use. It would be very interesting to compare trends in v2.meanz vs. the GHCN and USHCN data series. This would show if it approximates one, the other, or diverges from both.

 

The Wrap-Up


That's the worst of it. The last step just merges in a 'complete' record of data from Hohenpeissenberg (where for some unknown reason the original has gaps). Why that site deserves special treatment is not clear to me. So much other data is simply tossed in the trash, so why is this one so important? Is there something special about this one station?

At any rate, the first 99 lines are skipped, if I understand tail +100 correctly (output starts at line 100; I think this is intended to remove data older than 1880). The original data file is:

input_files/t_hohenpeissenberg_200306.txt_as_received_July17_2003

while the output goes into: t_hohenpeissenberg

We then compile and run hohp_to_v2.f (which reads in t_hohenpeissenberg, converts the 'real' type floating point temperature numbers to 'integer' type values with the last digit being the 1/10 C place – i.e. an integer count of tenths of degrees C – and produces the output file v2.hohenpeissenberg) and then cmb.hohenp.v2.f (which takes in the v2.hohenpeissenberg file from the prior step and v2.meanz, merges them, and produces v2.mean as its output file).
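That real-to-tenths conversion is simple enough to show in one hypothetical line (not the actual hohp_to_v2.f): 9.87 C becomes the integer 99.

echo "9.87" | awk '{ printf "%d\n", ($1 * 10) + ($1 >= 0 ? 0.5 : -0.5) }'   # prints 99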

We now have v2.mean to send to the next step. All that is left is ‘cleanup’.

Cleanup: the two output directories (to_next_step and work_files) are created whether they exist already or not (that does work, but is considered "poor form"; less sloppy is to check for their existence and only execute the creation command if they are not already there). The file v2.mean is moved into to_next_step and renamed v2.mean_comb ("Why? Don't ask why, down that path lies insanity and ruin. -emsmith"). All the misc. dregs are moved into the directory "work_files": that is, the binaries and all the other bits in: *exe v2.mean* ghcn* ushcn* hcn* USHCN.v2.mean_noFIL ID_US_G *hohenpeiss*
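For the record, the less sloppy form I have in mind is just a guard around the mkdir (or a single mkdir -p), along these lines:

for d in to_next_step work_files ; do
  if [[ ! -d $d ]] ; then mkdir $d ; fi   # only create it when it is not already there
done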

And now we are told to, by hand, move
to_next_step/v2.mean_comb into
../STEP1/to_next_step/v2.mean_comb

Notice that we just moved a file from STEP0 “to_next_step” into STEP1 “to_next_step” and not into STEP1 “input_files” (and there is no STEP1 “from_last_step”…)  Tacky.  Just tacky.  So when we get to STEP1, we will be looking for our input file in “to_next_step”.  Sheesh.  So much cruft, so little time…

So at the end of the day we’ve merged all this data, thrown out anything older than 1880, tossed any GHCN records earlier than the first USHCN for each site, kept any USHCN records not in GHCN, and for those that are in both, we’ve corrupted our history based on the recent past differences between GHCN and USHCN.

Again I ask: this is better than straight USHCN, or GHCN plus Antarctic, how?

The STEP0 “readme” file is reproduced below:

The following files have to be downloaded and added to the input_files/. directory :
v2.mean (from GHCN)
hcn_doe_mean_data (from USHCN)
antarc1.txt, antarc3.txt (from SCAR, manual updates)
antarc2.txt (from SCAR if any changes occurred – not likely !)

To simplify comparisons to older antarc*txt files, you may execute the “do_sort”
script; this will create duplicates of these files in which the stations are ordered
alphabetically called antarc*.sort.txt – all while in the input_files/. directory .
(The original antarc2.txt and antarc3.txt were not ordered; I ordered and moved them
into the _old subdirectory)

An environment variable FC has to be set to a fortran90 compilation command, e.g.
f90 , ifort , “xlf90 -qfixed” , …

Then the command “do_comb_step0.sh v2.mean” should do the rest and put the file
v2.mean_comb into the to_next_step/. directory . All temporary and log files are
collected in the work_files/. directory .

The next step would be: do_comb_step1.sh v2.mean_comb



4 Responses to GIStemp STEP0 – The Process

  1. John Galt says:

    If we learn nothing else from GISSTemp, I think we can without a doubt conclude that although somebody may be expert in their own technical or scientific field, that does not make them qualified to write computer code.

  2. Martin says:

    re Hohenpeissenberg

    it is Meteorological Observatory Hohenpeissenberg
    in Bavaria – Germany

    see this link: http://tinyurl.com/bjjo3m
    Google translation from German Wiki

    it is mentioned that it has been in operation since 1 January 1781, and that almost uninterrupted meteorological observations have been carried out until today. ??

  3. E.M.Smith says:

    @John Galt: It also says that you can make code work without it being “professional quality”; but that does bring into question the quality of the product the code makes…

    @Martin: Thanks! Now I know why it’s special. Unfortunately, GIStemp throws away the 100 years of data from 1781 to 1880.

  4. Nick Barnes says:

    FWIW, our step0.py produces byte-for-byte identical output to the GISTEMP STEP0 code.
