STEP1 Overview and Sizes

Children's Python


Step1 has two distinguishing features.  First, and most important, it is professionally done.  The author even signs his name to his code in a couple of places (pride of authorship) and clearly has a tidy mind.  Kudos to you, sir!

Second, it is written in Python.  Unfortunately, I have never learned to write Python.  Fortunately it looks in many ways a lot like many other programming languages, so I think I can follow what it’s doing.


First up, what does gistemp.txt say:


And what does the gistemp.txt file have to say about STEP1?
Step 1 : Simplifications, elimination of dubious records, 2 adjustments (
The various sources at a single location are combined into one record, if
possible, using a version of the reference station method. The adjustments
are determined in this case using series of estimated annual means.

Non-overlapping records are viewed as a single record, unless this would
result introducing a discontinuity; in the documented case of St.Helena
the discontinuity is eliminated by adding 1C to the early part.

After noticing an unusual warming trend in Hawaii, closer investigation
showed its origin to be in the Lihue record; it had a discontinuity around
1950 not present in any neighboring station. Based on those data, we added
0.8C to the part before the discontinuity.

Some unphysical looking segments were eliminated after manual inspection of
unusual looking annual mean graphs and comparing them to the corresponding
graphs of all neighboring stations.


Result: Ts.txt

OK then. Not sure exactly what “unphysical looking” is as an objective criterion, but I’m not sure that an “annual mean graph” is very physical either. The net of this is that there was some hand tweaking of the data going on. Gee, what a surprise…

Here we also get another variation on “The Reference Station Method” wherein a station “way over there” knows more about what’s happening here than a station right here. We’ll get into that more down in the code itself. But first we have to make some Python slither…


The PYTHON_README.txt file:


The *.py scripts in this directory use the Python programming language.
Each of these scripts begins with a line indicating the location of
Python on your system. At GISS, this happens to be /usr/bin/python .
You may need to alter this line to specify the location of Python on
your system, say /usr/freeware/bin/python or /usr/local/bin/python.

These Python scripts also make use of two custom C extension modules.
These extensions must be compiled and then placed in the site-packages
subdirectory of your computer’s Python library. To do so, unarchive the
EXTENSIONS.tar.gz file located here. Then cd into the EXTENSIONS directory.
You will find there the following subdirectories:


Go into each of these subdirectories and make the C extension: the common
file “” has to be edited to fit your system; the script
“make_shared” does the rest; but it may be safer to do the 3 commands separately.

The resulting * files should end up in the Python site-packages
directory to look something like:
(assuming your Python library is at /usr/lib/python)
or you may simply move them to the STEP1/. directory.

“make_clean” may be used to remove the files created by “make_shared”.

Ah, the joys of a well written README.  We know exactly what to do. Unfortunately, that’s a bunch of work to set up Python, compile some special libraries, put them in special places, and most likely modify the scripts and other commands that are run to look in those ‘special places’.  OK, I’m not going to belabor that.  If you are a “Python Guy” you probably know how to do this already.  If not, you either learn it or find someone to help.
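
For what it’s worth, a current Python will tell you where it lives and where its site-packages directory is. A minimal sketch, assuming a modern Python 3 (the GIStemp scripts date from an older Python, so treat this as orientation only, not as the install procedure):

```python
import sys
import sysconfig


def python_locations():
    """Return the interpreter path and the site-packages directory."""
    return {
        "interpreter": sys.executable,
        "site_packages": sysconfig.get_paths()["purelib"],
    }


for name, path in python_locations().items():
    print(f"{name}: {path}")
```

The interpreter path is what would go on the first line of each script; the site-packages path is where the README wants the compiled extensions to end up.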


Sizes and Listings


Let’s look inside STEP1 for how big things are. Files ending in “.py” are Python program source code while those ending in “.sh” are Unix / Linux shell scripts. Oh, and “*” is the “wildcard character” that says “match anything” so *smith would match “Goldsmith” and “Tinsmith” and…
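
The wildcard idea can be tried out in Python itself. A small sketch using the standard fnmatch module; the file names step1.py and run.sh are made up for the illustration and are not files from GIStemp:

```python
from fnmatch import fnmatchcase

# Made-up names: two "smiths", one near miss, and two hypothetical files.
names = ["Goldsmith", "Tinsmith", "Smithson", "step1.py", "run.sh"]

# "*" matches any run of characters, so "*smith" means "ends in smith".
smiths = [n for n in names if fnmatchcase(n, "*smith")]
py_files = [n for n in names if fnmatchcase(n, "*.py")]

print(smiths)
print(py_files)
```

Note that “Smithson” does not match, since the pattern has nothing after “smith” to soak up the “son”.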


Strangely, STEP1/input_files contains the same copies of the mcdw.tbl, sumofday.tbl, Ts.discont.RS.alter.IN, Ts.strange.RSU.list.IN, ushcn.tbl, and v2.inv files as STEP0/input_files. The one exception is a single line in v2.inv that diff kicked out: the station name is spelled “WITHEHORSE” in the STEP0 copy and “WHITEHORSE” in the STEP1 copy.

 diff STEP0/input_files/v2.inv STEP1/input_files/v2.inv
< 40371964000 WITHEHORSE, Y                   60.72 -135.07  703  947S   15MVxxno-9x-9TUNDRA          C   60
> 40371964000 WHITEHORSE, Y                   60.72 -135.07  703  947S   15MVxxno-9x-9TUNDRA          C   60
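
To pin down where two nearly identical lines differ, a column-by-column comparison helps. A minimal sketch, using only the leading portion of the two lines reported by diff (the helper name differing_columns is mine):

```python
# Leading portion of the two v2.inv lines shown by diff.
old = "40371964000 WITHEHORSE, Y"
new = "40371964000 WHITEHORSE, Y"


def differing_columns(a, b):
    """Return the 0-based positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("lines differ in length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


cols = differing_columns(old, new)
print(cols)
print(old[cols[0]:cols[-1] + 1], "->", new[cols[0]:cols[-1] + 1])
```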

So just how big is STEP1? This is the result of (cd STEP1; du *):


884	input_files
0	work_files
0	to_next_step


There is only one shell script, the file which runs the Python programs in order. It first links all the input files to similar names in the STEP1 directory and, at the end, moves the work files and output files from the STEP1 directory into their respective directories. Odd.

The work_files and to_next_step directories are empty, but we have a new directory, EXTENSIONS and a PYTHON_README.txt that are important bits of ‘setup’ coding stuck into this step. We looked at the PYTHON_README.txt file above.

OK, so you need to do some work to make Python go and there are some C programs with ‘magic sauce’ to compile and install. What are they?

wc monthlydata/*


     307    1343   10359 monthlydata/
       2       3      41 monthlydata/
       4      20     169 monthlydata/make_clean
       8      19     122 monthlydata/make_shared
     958    2866   22304 monthlydata/monthlydatamodule.c
    1279    4251   32995 total


wc stationstring/*


     307    1343   10359 stationstring/
       2       3      45 stationstring/
       4      20     169 stationstring/make_clean
       8      19     122 stationstring/make_shared
     640    1906   16655 stationstring/stationstringmodule.c
     961    3291   27350 total


Here we have wc * for all the other program files at the STEP1 level:


   Lines   Words   Bytes File Name

      27     194    1346 PYTHON_README.txt
      71     227    2084
      30      85     744
     293    1125   10544
     253     944    8266
      45     232    1633
     125     372    3502
     102     449    2518
     104     295    2690
    1050    3923   33327 total
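
As a rough yardstick, the line counts in the listings above can simply be added up. The sketch below copies the numbers from the wc output; it assumes that line count is a fair proxy for where the work happens, which it may well not be:

```python
# Line counts copied from the first column of the wc listings.
c_module_lines = {"monthlydatamodule.c": 958, "stationstringmodule.c": 640}
step1_total_lines = 1050   # "total" row of the STEP1-level listing
readme_lines = 27          # PYTHON_README.txt is text, not program code

c_lines = sum(c_module_lines.values())
# Everything at the STEP1 level except the README.
script_lines = step1_total_lines - readme_lines
share_c = c_lines / (c_lines + script_lines)

print(c_lines, script_lines, round(100 * share_c))
```

On these numbers the two C modules hold more lines than everything at the STEP1 level put together.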


So it looks like a significant part of the processing is hidden in the ‘magic sauce’ C programs with the bulk after that being done by and the programs. From what I’ve seen so far, ‘comb’ typically means ‘combine’. Inspection of shows what appears to be a statistics library definition that makes functions such as mean, anom, sigma,


Taking a look at the top level script



if [[ $# -ne 1 ]] ; then echo "Usage: $0 v2_raw_filename" ; exit ; fi

if [[ ! -s to_next_step/$1 ]]
then echo "file to_next_step/$1 not found"
exit ; fi
ln -s to_next_step/$1 .

# the input files
for x in Ts.discont.RS.alter.IN Ts.strange.RSU.list.IN mcdw.tbl sumofday.tbl ushcn.tbl v2.inv
do if [[ ! -s $x ]] ; then ln input_files/$x . ; fi
done

echo "Creating $1.bdb" ; $1
if [[ ! -s $1.bdb ]] ; then echo " failed" ; exit ; fi

echo "Combining overlapping records for the same location:" $1 > comb.log
if [[ ! -s $1.combined.bdb ]]
then echo " failed; look at comb.log" ; exit ; fi

echo "Fixing St.Helena & Combining non-overlapping records for the same location:" $1.combined > piece.log ; # try to combine
if [[ ! -s $1.combined.pieces.bdb ]] ; # non-overlapping records
then echo " failed" ; exit ; fi

echo "Dropping strange data - then altering Lihue,Hawaii" $1.combined.pieces
if [[ ! -s $1.combined.pieces.strange.bdb ]]
then echo " failed" ; exit ; fi $1.combined.pieces.strange
if [[ ! -s $1.combined.pieces.strange.alter.bdb ]]
then echo " failed" ; exit ; fi $1.combined.pieces.strange.alter.bdb

rm -f $1
mv *bdb *.log work_files/.
mv ${1}*.txt to_next_step/Ts.txt

echo ; echo "created Ts.txt"
echo "move this file from STEP1/to_next_step to STEP2/to_next_step "
echo "and execute in the STEP2 directory the command:"
echo " last_year_with_data"

End of Script
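
The pattern that repeats through the script is “run a stage, then bail out if its output file is missing or empty”, which is what the shell test [[ ! -s file ]] checks. Here is a hedged Python rendering of just that check. The names nonempty, require_output and example.bdb are mine, not anything from GIStemp:

```python
import os
import tempfile


def nonempty(path):
    """Rough counterpart of the shell test [[ -s path ]]."""
    return os.path.exists(path) and os.path.getsize(path) > 0


def require_output(path, stage):
    """Stop the run if a stage left no usable output file."""
    if not nonempty(path):
        raise SystemExit(f"{stage} failed: {path} is missing or empty")


with tempfile.TemporaryDirectory() as workdir:
    out = os.path.join(workdir, "example.bdb")  # hypothetical output name
    open(out, "w").close()                      # create a zero-length file
    empty_ok = nonempty(out)
    with open(out, "w") as fh:
        fh.write("data\n")
    filled_ok = nonempty(out)

print(empty_ok, filled_ok)
```

A zero-length file counts as a failure, just as it does in the script.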


Other than that the script seems to look for its input file ($1), v2.mean.comb, in the directory “to_next_step”, we see what looks like a fairly straightforward process. Why is an input file in the “to_next_step” directory? “Why, don’t ask why, down that path lies insanity and ruin. -emsmith.”

The program takes the ‘raw’ v2 file and makes it into v2.mean.comb.bdb, which looks to me like some kind of Python database structure with hash keys (binary database?). v2.mean.comb.bdb then has ‘overlapping records’ combined via (with log file comb.log) and output to the v2.mean.comb.combined.bdb file. It looks like the combining does a weighting process of some sort. The rank order preference is: MCDW, USHCN, SUMOFDAY, UNKNOWN, which seems to correspond with the files of similar names in input_files (modulo that there is no ‘unknown’ file). What this weighting system is, and why one record is weighted over another, are not clear to me.
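
To make the idea of a rank order preference concrete, here is a small sketch of sorting records by source so that the preferred source comes first. This is only my illustration of a rank order, using the MCDW spelling from the mcdw.tbl file name. The record IDs are invented, and nothing here is taken from the GIStemp code:

```python
# Lower number means more preferred; order taken from the text above.
SOURCE_RANK = {"MCDW": 0, "USHCN": 1, "SUMOFDAY": 2, "UNKNOWN": 3}


def order_by_source(records):
    """Sort (source, record_id) pairs so preferred sources come first.

    Sources not in the table are treated like UNKNOWN.
    """
    return sorted(
        records,
        key=lambda rec: SOURCE_RANK.get(rec[0], SOURCE_RANK["UNKNOWN"]),
    )


# Invented record IDs, for illustration only.
records = [("SUMOFDAY", "rec-c"), ("UNKNOWN", "rec-d"),
           ("MCDW", "rec-a"), ("USHCN", "rec-b")]
ordered = order_by_source(records)
print([source for source, _ in ordered])
```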

We then have some kind of ‘fix’ applied to St.Helena via (spitting out a ‘piece.log’ file in the process) and producing the v2.mean.comb.combined.pieces.bdb output file. The file contains only one record:

147619010000 147619010002 1976 8 1.0

which looks to me like a change of station ID number and the date when it happened (as a guess). The code seems to imply that the two sets of data are being combined into one record (but a real Python programmer ought to check that!)
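
Under that guess, the record can be read as two IDs, a year, a month and an offset in degrees C, with the offset added to the part of the series before that date (gistemp.txt speaks of “adding 1C to the early part”). A sketch of that reading follows. The field names, the made-up temperatures, and the choice to leave the break month itself unshifted are all my assumptions:

```python
from typing import NamedTuple


class Adjustment(NamedTuple):
    first_id: str    # hypothetical field names, guessed from the layout
    second_id: str
    year: int
    month: int
    offset_c: float


def parse_adjustment(line):
    """Split one whitespace-separated record into typed fields."""
    a, b, year, month, offset = line.split()
    return Adjustment(a, b, int(year), int(month), float(offset))


def shift_early_part(series, adj):
    """Add the offset to every (year, month) value before the break date."""
    return {
        (y, m): (v + adj.offset_c if (y, m) < (adj.year, adj.month) else v)
        for (y, m), v in series.items()
    }


adj = parse_adjustment("147619010000 147619010002 1976 8 1.0")
series = {(1976, 7): 16.0, (1976, 8): 17.0}   # made-up temperatures
print(shift_early_part(series, adj))
```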

This same code ‘combines non-overlapping records’ via what looks like a variant of the ‘reference station method’. It looks to me like the code searches for ‘nearby’ stations that have data for any place where the present station has a gap, then computes some kind of weighting factor based on anomalies and uses that to ‘fill in’ the missing data in the gap (i.e. create data where there are none based on the notion that a nearby station can tell you what this station ought to have been…) Again, a real Python programmer ought to look at this part. It has a variable (rad) that has an upper bound of BUCKET_RADIUS=10 but the units of this bucket radius are not clear. It looks to me like it is in degrees; but a real Python programmer needs to verify / disprove that.
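
To show the general flavor of filling a gap from a neighbor, here is a deliberately simple sketch: work out the average difference between the two stations where both have data, then use the neighbor plus that difference where this station has none. This is NOT the GIStemp code. It has a single neighbor, no distance weighting and no BUCKET_RADIUS, and the function name and numbers are invented:

```python
def fill_gaps(target, neighbor):
    """Fill None entries in target from neighbor, shifted by the mean offset.

    The offset is the average of (target - neighbor) over the entries
    where both series have a value.
    """
    pairs = [(t, n) for t, n in zip(target, neighbor)
             if t is not None and n is not None]
    if not pairs:
        return list(target)  # no overlap, so nothing to base a fill on
    offset = sum(t - n for t, n in pairs) / len(pairs)
    return [
        t if t is not None else (n + offset if n is not None else None)
        for t, n in zip(target, neighbor)
    ]


# Invented annual means; the middle value of the target is missing.
target = [10.0, None, 12.0]
neighbor = [8.0, 9.0, 10.0]
print(fill_gaps(target, neighbor))
```

The filled value is a computed number, not a measurement, which is the point of the complaint above.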

This is possibly the explanation for the random places in GIStemp data where a station has a data point jump up or down for no reason in the middle of the series.

“Strange Data” is then “dropped” via producing v2.mean.comb.combined.pieces.strange.bdb and altered with program with output to the v2.mean.comb.combined.pieces.strange.alter.bdb file. (Thank God there are not more steps making giant file names via agglutination!) That file is handed off to that one presumes turns its not so human friendly database into readable text.
Both ‘strange’ and ‘alter’ look like straightforward cut / paste jobs.

I believe the output file will be ./v2.mean.comb.combined.pieces.strange.alter.txt but I could be wrong since I’m guessing what the Python function in does.

Housekeeping follows, with the bdb and log files moved into work_files and any v2*.txt files moved into to_next_step as Ts.txt. This once again leaves me wondering why work files are created in the same place that the source code lives and then, when the run is finished, moved into the work_files directory. I would create and use them in their own isolated directory, sparing the source code the risk of being trampled upon. But I guess that’s just me.
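
For the curious, this is the sort of thing I mean by an isolated directory, sketched with Python’s standard tempfile module. The stage function and the file name intermediate.bdb are hypothetical, and a real run would want to keep the work files rather than let them be deleted on exit:

```python
import os
import tempfile


def run_in_scratch(stage):
    """Call stage(workdir) inside a fresh scratch directory; return its result."""
    with tempfile.TemporaryDirectory(prefix="step1_work_") as workdir:
        return stage(workdir)


def demo_stage(workdir):
    # Hypothetical stage: write one intermediate file, then list the directory.
    path = os.path.join(workdir, "intermediate.bdb")
    with open(path, "w") as fh:
        fh.write("placeholder\n")
    return os.listdir(workdir)


print(run_in_scratch(demo_stage))
```

Nothing is written next to the source code, so there is nothing there to trample.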

Again we have the output file, Ts.txt, moved into ‘to_next_step’ of the next step. (One wonders why it isn’t put into input_files, or, if that is reserved for static input files, why there is no from_last_step directory; or even just why to_next_step isn’t named “interstep_shared_files”. But I guess that’s just me too…)

A brief inspection of the code shows it to be generally well written, and well structured. I see no particular reason to suspect that it does anything other than what it claims to do so I’m going to skip on to the next step for more detailed examination. The only places that I see potential ‘issues’ are the and with the potential for another ‘reference station method’ data fabrication happening.

So at the end of the day the entire work output of this section is the file: Ts.txt


About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present "hot buttons" are the mythology of Climate Change and ancient metrology; but things change...