Inside GIStemp, an Overview

GIStemp is not a climate model. It is a ‘gather temperature data from many sources and change it’ process. Unfortunately, that is an exact statement of function and not a political comment.

By the time GISS temperature data reaches the models it is already fatally flawed and not in touch with reality. The models don’t have a chance…

The code in the first few steps is fairly trivial. Most of it is minor format changes (change ‘data missing flag’ from “-” to “-9999”) and file concatenations / deletions. There are about 6000 lines of code in GIStemp, of which I would estimate about 1000 are truly ‘functional’. It consists of 6 coded “steps” in 5 directories plus a couple that are not coded (manual data download, for example). These are numbered STEP0, STEP1, STEP2, STEP3, and STEP4_5 (plus the un-numbered steps of manual data download and subroutine compilation / installation …).
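To give a feel for how small these format changes are, here is a minimal Python sketch of the missing-flag swap. It assumes whitespace separated fields, which is an assumption made for illustration and not the actual GIStemp record layout; the function name is made up as well.

```python
def normalize_missing(line, old_flag="-", new_flag="-9999"):
    """Replace fields that consist solely of old_flag with new_flag.

    Assumes whitespace separated fields (an illustrative assumption,
    not the real GIStemp file format).
    """
    fields = line.split()
    # Compare whole fields so that negative values such as "-3.2" survive.
    return " ".join(new_flag if field == old_flag else field for field in fields)


print(normalize_missing("1976 -3.2 - 14.1 -"))
```

Comparing whole fields rather than doing a blind string replace matters here, since a plain replace of “-” would also mangle every negative temperature.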

The code in STEP1 is Python (with two function libraries in “C” that Python calls). All the other STEPs are FORTRAN.

It ought to run on any system with an f77 or similar compiler, Python, and “C”. Unix or Linux ought to be your best bet. So far I’ve seen nothing in the code that is tied to a particular architecture. I have seen a lot of ‘crufty practices’, such as writing scratch files in the same place where the source code is ‘archived’ and treating FORTRAN like an interpreted language (compile in line in scripts, run the binary, delete the binary), which is one example of why so many lines are ‘non-functional’.

(Apologies to anyone not a programmer. “Cruft” is clearly understood by programmers to mean “crud, not good, junk accumulated over time and never swept up, junky style” as an approximation; but seems to be a word that is not known to standard English. I’ve used it for about 40 years professionally and to this day don’t know where I learned it… Isn’t jargon fun?)

WattsUpWithThat has an interesting write-up of part of the reason that historical computed anomalies change each time a run of GIStemp happens. (That is in addition to the rewriting of the actual temperature data that happens prior to the anomaly steps.) It’s worth reading.

General Style

Most sections have a top level script that runs the show. These are typically SH or KSH. There are also frequently ‘sub scripts’ that simply have a compile wrapper wrapped around a FORTRAN program to compile it, relink some input files, run it, delete it, then rename the output files to something else.
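The ‘sub script’ pattern described above can be sketched as a list of commands. This is a dry run in Python that only builds the command list and executes nothing. The f77 compiler name and antarc_comb.f come from this article; the fort.N link names, the .exe suffix, and the output file name are assumptions picked for illustration, not taken from the GIStemp scripts.

```python
import os


def wrapper_steps(source, inputs, outputs, compiler="f77"):
    """Return the commands a compile / link / run / delete / rename wrapper would issue.

    inputs  : list of (real_file, link_name) pairs to link before the run
    outputs : list of (produced_name, final_name) pairs to rename afterwards
    All file names passed in are hypothetical examples.
    """
    exe = os.path.splitext(source)[0] + ".exe"
    steps = [[compiler, source, "-o", exe]]           # compile
    for real_file, link_name in inputs:
        steps.append(["ln", "-sf", real_file, link_name])  # relink inputs
    steps.append(["./" + exe])                        # run the binary
    steps.append(["rm", "-f", exe])                   # delete the binary
    for produced, final in outputs:
        steps.append(["mv", produced, final])         # rename outputs
    return steps


for step in wrapper_steps(
    "antarc_comb.f",
    inputs=[("input_files/antarc1.txt", "fort.1")],
    outputs=[("fort.10", "work_files/combined.txt")],
):
    print(" ".join(step))
```

Only the first and third commands do any real work; the rest is housekeeping, which is roughly why so few of the lines count as ‘functional’.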

There is a great deal of renaming files and pointless recompilations that accomplish nothing but to obscure the function. A good place to start figuring out the code would be to make a data diagram and track where the data come in, get changed, and go out; then unwind where ‘the music goes round and round, ohoh OH oh Ohoh!’
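One way to start such a data diagram is to write down, for each program, which files it reads and which it writes, then walk backwards from an output. A minimal Python sketch follows; every program and file name in it is a placeholder, since the real entries would have to be filled in by reading each script.

```python
# Each entry: (program, files read, files written). All names are placeholders.
FLOWS = [
    ("prog_a", ["raw_1.txt", "raw_2.txt"], ["merged.tmp"]),
    ("prog_b", ["merged.tmp"], ["final.out"]),
]


def lineage(target, flows):
    """Return (program, inputs) pairs that lead to target, nearest producer first."""
    made_by = {out: prog for prog, _ins, outs in flows for out in outs}
    inputs_of = {prog: ins for prog, ins, _outs in flows}
    chain, todo, seen = [], [target], set()
    while todo:
        name = todo.pop(0)
        prog = made_by.get(name)
        if prog is None or prog in seen:
            continue  # an original input, or a program already visited
        seen.add(prog)
        chain.append((prog, inputs_of[prog]))
        todo.extend(inputs_of[prog])
    return chain


for prog, ins in lineage("final.out", FLOWS):
    print(prog, "<-", ", ".join(ins))
```

Renamed files would simply show up as extra entries whose ‘program’ is a mv, which makes the pointless hops easy to spot.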

The download also includes a fairly large number of data files (all the Antarctic data, for example) so the actual size of the code is far smaller than the download.

Getting, unpacking & inspecting the software package, directory structure:

To download the GIStemp source code go to: NASA GISS and click on the download link.

Unpack the archive, and read gistemp.txt

As downloaded, the package decompresses to about 4.5 MB, but DON’T PANIC! This is almost entirely data files scattered through the very small portion that is source code. Most of the steps are coded in FORTRAN with sh and ksh scripts. One step (STEP1) is in Python. The Python step includes 2 ‘C’ programs to be compiled and installed for use by the Python programs. In later sections, many of the programs are duplicates.

Name of step Lines of code

STEP0             521
STEP1           1050 ( Python & C )
STEP2           1319
STEP3           1560
STEP4_5       1631

total              6081
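A tally like the one above can be reproduced with a short Python sketch. Which files count as ‘code’ is a judgment call: the suffix list below is an assumption made for illustration, and it misses scripts that have no suffix (get_USHCN and do_sort, for example), so the numbers it prints need not match the table.

```python
import os

# Assumed suffixes for "code"; scripts without a suffix are not counted.
CODE_SUFFIXES = (".f", ".py", ".c", ".sh")


def count_code_lines(step_dir, suffixes=CODE_SUFFIXES):
    """Count lines in every file under step_dir whose name ends in one of suffixes."""
    total = 0
    for root, _dirs, files in os.walk(step_dir):
        for name in files:
            if name.endswith(suffixes):
                with open(os.path.join(root, name), "rb") as handle:
                    # Iterating a binary file yields one chunk per line; a final
                    # line without a trailing newline is still counted.
                    total += sum(1 for _ in handle)
    return total


for step in ("STEP0", "STEP1", "STEP2", "STEP3", "STEP4_5"):
    if os.path.isdir(step):
        print(step, count_code_lines(step))
```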

The directory structure is simple: 

STEP0, STEP1, STEP2, STEP3, STEP4_5 and the ‘readme file’: gistemp.txt

In addition, most STEPs have directories for input files, work files, and output files, though the names for these can vary. Unless stated otherwise, they are likely to be input_files, work_files, and to_next_step.

Sizes in 1k blocks as unpacked:

1916    STEP0
1052    STEP1
  100    STEP2
    84    STEP3
1352    STEP4_5
    12    gistemp.txt

So why STEP0? It was added later in life as a pre-process step…

Inside STEP0, for example

Why so large? Let’s look inside STEP0 first; sizes in 1k blocks (result of ‘du -ks’):

1852 input_files
       4 step0_README.txt
       4 hohp_to_v2.f
       4 get_USHCN
       4 dump_old.f
       4 dif.ushcn.ghcn.f
       4 cmb2.ushcn.v2.f
       4 cmb.hohenp.v2.f
       4 antarc_comb.f
       4 USHCN2v2.f
       0 work_files
       0 to_next_step

The directories work_files and to_next_step are empty since the programs have not been run yet. Among other things, the Antarctic data series are already in the input directory: input_files. Unexpectedly, so is some executable code (31 lines) that lets you sort the Antarctic data should you download a new copy, along with the shell script to run it (result of: cd input_files; wc *sort* ):

Lines Words   Bytes File Name

24        100     694    do_sort
31          55     813    sorts.f
55        155   1507    total

You can probably already guess that “do_sort” is the compile and run script wrapper for the sorts.f FORTRAN program.

The block sizes above are a bit misleading, since in unix / linux / mach land the “blocks” used by a file are assigned in large lumps. A single character file will take a whole block, often 1k to 4k bytes. So just taking a look at the line count numbers in the text files (cd STEP0, wc *) :

Lines   Words     Bytes File Name

  47     109       1362 USHCN2v2.f
  87     303       3017 antarc_comb.f
  10       46         287
  38     306       1867
  40     134       1520 cmb.hohenp.v2.f
  65     221       2590 cmb2.ushcn.v2.f
  70     226       2394 dif.ushcn.ghcn.f
  65     314        2501
  22       37          504 dump_old.f
    9       25          207 get_USHCN
  17       31          416 hohp_to_v2.f
  20     151        1054 step0_README.txt
490   1903       17719 total
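The gap between these byte counts and the earlier block counts is just rounding. As a minimal sketch, assuming space is handed out in whole 4096-byte units (an assumption that is consistent with every file showing as 4 in the du listing, but filesystems differ), the arithmetic looks like this:

```python
def du_kblocks(size_bytes, alloc_unit=4096):
    """1k blocks charged to a file if space is allocated in whole alloc_unit chunks.

    The 4096-byte default is an assumption; some filesystems use other unit
    sizes or store very small files without allocating a data block at all.
    """
    if size_bytes == 0:
        return 0
    units = -(-size_bytes // alloc_unit)  # ceiling division
    return units * alloc_unit // 1024


# Byte counts taken from the wc listing above.
for name, size in [("USHCN2v2.f", 1362), ("antarc_comb.f", 3017), ("get_USHCN", 207)]:
    print(name, size, "bytes ->", du_kblocks(size), "k")
```

Under that assumption a 207-byte script and a 3017-byte program are both charged 4k, which is why du is a poor guide to how much code there is.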

Adding in do_sort and sorts.f we get these totals:

Lines   Words   Bytes
545     2058    19226 TOTALS

This is typical of all the steps. There are lots of embedded or sometimes leftover data files in the several steps. So for STEP0 there are about 545 lines of code in total. The script is the top level controlling script (and that naming convention tends to hold through the other steps, just change the digit). When you see a program and a script of nearly the same name, the script is typically the wrapper to turn compiled FORTRAN into something more like interpreted BASIC (which gives you an idea what coding style to expect going ‘forward’…). For example: is the wrapper for the antarc_comb.f program, so you will find called where the FORTRAN is expected to be run.

I will document each individual step in a manner similar to this in an article named with the STEPx name.

How about the files:

Lines       Words       Chars
        1             4            26 Ts.discont.RS.alter.IN
      65         487        4829 Ts.strange.RSU.list.IN
      47         282        3337 antarc1.list
 2204      26597    212171 antarc1.txt
      35         233        2485 antarc2.list
  2462     26988    214373 antarc2.txt
     66          396        4686 antarc3.list
 1567      17312    142484 antarc3.txt
        1             5             37
     24          100           694 do_sort
 1502        4506       40554 mcdw.tbl
        8           37            448 preliminary_manual_steps.txt
      31           55            813 sorts.f
    371       1113        10017 sumofday.tbl
    224       4023        18995 t_hohenpeissenberg_200306.txt_as_received_July17_2003
  1221       3663        32967 ushcn.tbl
  7364     78766      787948 v2.inv
17193   164567    1476864 total

From the sub-directory _old:

Lines   Words   Chars
2040  24557  196085 antarc1.txt
1534  16850  139004 antarc3.txt
3574  41407  335089 total

From this we can see that the GHCN data in v2.inv are the largest part, followed closely by the Antarctic data. We can also see that they have conveniently left some old copies lying about for us to look at too.

The file contains one line that states

147619010000 147619010002 1976 8 1.0

I think this gives 2 station IDs for Helena and a cutover date, but that needs to be verified down in the code somewhere.
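Until that is verified, the line can at least be split into typed fields. In the Python sketch below the field names are guesses that follow the reading above (two IDs, a year, a month, and a trailing number whose meaning is not established here); none of them come from the GIStemp source.

```python
from typing import NamedTuple


class AlterRecord(NamedTuple):
    # All field names are guesses for illustration only.
    id_a: str     # first 12-digit identifier, kept as text
    id_b: str     # second 12-digit identifier, kept as text
    year: int     # presumed cutover year
    month: int    # presumed cutover month
    value: float  # trailing number; meaning not yet verified


def parse_alter_line(line):
    """Split one whitespace separated line into an AlterRecord."""
    id_a, id_b, year, month, value = line.split()
    return AlterRecord(id_a, id_b, int(year), int(month), float(value))


record = parse_alter_line("147619010000 147619010002 1976 8 1.0")
print(record)
```

The line is 5 words and, counting the newline, 37 characters, which matches the 1 / 5 / 37 row in the listing above.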

while the file preliminary_manual_steps.txt contains the 8 lines:

antarc1.txt was downloaded from
antarc2.txt was downloaded from
antarc3.txt was downloaded from

some typos in antarc2.txt were manually corrected


Station information files were manually created combining information from
the above files and GHCN’s v2.temperature.inv

This is a bit less than helpful. Exactly what are ‘some typos’ and what is ‘manually created’ as a process? And which files are the ‘station information’ files? And while we are at it, where is the v2.temperature.inv file from which they were made? Yes, I’m sure I can ‘work it out’ but a word or two in the docs or the directory would have been helpful… As we’ve seen in an earlier posting, looking into the GHCN download directory is more helpful. That is where you find v2.temperature.inv documented.

In STEP4_5 a couple more bits of data are downloaded. gistemp.txt lists these as: HadISST1: 1870-present; cmb/sst/oimonth_v2 Reynolds 11/1981-present

One of these (oimonth_v2) is a Sea Surface Temperature anomaly map. Some folks have asserted this means that GIStemp uses satellite data since this anomaly map is used. Yes, the map is made from a combination of surface records and satellite data, but by the time it gets here it is just a grid of 1 degree cells (lat long) with a single anomaly number per month. Not exactly what I’d call “satellite data”. More like a “Satellite derived anomaly map” product.
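To make ‘a grid of 1 degree cells’ concrete: a global 1 degree grid has 180 rows by 360 columns, or 64,800 cells, each holding one number per month. The Python sketch below maps a latitude / longitude to a cell. The origin (south-west corner at 90S, 180W) and the row and column ordering are assumptions for illustration, not the actual oimonth_v2 layout.

```python
import math

N_ROWS, N_COLS = 180, 360      # 1 degree cells covering the globe
N_CELLS = N_ROWS * N_COLS      # 64800 anomaly values per month


def cell_index(lat, lon):
    """Return (row, col) of the 1 degree cell containing lat, lon.

    Assumes row 0 starts at 90S and col 0 starts at 180W; that ordering is
    an illustrative choice, not taken from the data file.
    """
    row = min(int(math.floor(lat + 90.0)), N_ROWS - 1)  # put the north pole in the top row
    col = int(math.floor(lon + 180.0)) % N_COLS         # wrap 180E back onto 180W
    return row, col


print(N_CELLS, cell_index(0.5, 0.5))
```

However the satellite data went in, what comes out is one anomaly value for each of those cells per month.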

OK, that gives you a basic idea what you’re looking at. For the preliminary steps, and each of the STEPx blocks, I’ll be adding postings in more detail.


About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present “hot buttons” are the mythology of Climate Change and ancient metrology; but things change...

2 Responses to Inside GIStemp, an Overview

  1. JLKrueger says:

    Maybe “Cruft” is a regional programmer dialect. The word I’ve heard most is “kludge” which pretty much means the same thing.

    Yeah, I love jargon…it gets even more fun when two parties use the same word, but it means completely different things to each.

  2. E.M.Smith says:

    At least out here, a “kludge” is a very badly put together piece of code that may or may not work right. “Cruft” is a milder issue, more in keeping with untidiness or sloppy and having less sense of poorly functioning. So for example one would say “That backup program is a real kludge, it always hangs on February 28th.” as compared to “That backup program is full of cruft; it looks like 2 people had a style collision, there is no indenting, and there are 2 subroutines that are never called.”
