GIStemp is not a climate model. It is a ‘gather temperature data from many sources and change it’ process. Unfortunately, that is an exact statement of function and not a political comment.
By the time GISS temperature data reaches the models it is already fatally flawed and not in touch with reality. The models don’t have a chance…
The code in the first few steps is fairly trivial. Most of it is minor format changes (change ‘data missing flag’ from “-” to “-9999”) and file concatenations / deletions. There are about 6000 lines of code in GIStemp, of which I would estimate about 1000 are truly ‘functional’. It consists of 6 coded “steps” in 5 directories plus a couple that are not coded (manual data download, for example). These are numbered STEP0, STEP1, STEP2, STEP3 and STEP4_5 (plus the un-numbered steps of manual data download, and subroutine compilation / installation …)
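To make the ‘data missing flag’ change concrete, here is a minimal sketch in Python. It is not code from the GIStemp package: the function name and the whitespace-separated record layout are assumptions for illustration only.

```python
def normalize_missing(line, old_flag="-", new_flag="-9999"):
    """Replace stand-alone missing-data flags in a whitespace-separated record.

    Only fields that are exactly the old flag are changed, so a negative
    reading such as -12.3 is left alone.
    """
    fields = line.split()
    return " ".join(new_flag if field == old_flag else field for field in fields)


# Hypothetical record: a year followed by four readings, two of them missing.
print(normalize_missing("1976 -12.3 - 4.5 -"))
```

The point of matching whole fields rather than doing a blind text substitution is that a bare replace of “-” would also mangle every negative temperature.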
The code in STEP1 is Python (with two function libraries in “C” that Python calls). All the other STEPs are FORTRAN.
It ought to run on any system with an f77 or similar compiler, Python, and “C”. Unix or Linux ought to be your best bet. So far I’ve seen nothing in the code that is tied to a particular architecture. I have seen a lot of ‘crufty practices’, such as writing scratch files in the same place where the source code is ‘archived’ and treating FORTRAN like an interpreted language (compile inline in scripts, run the binary, delete the binary), which is one reason so many lines are ‘non-functional’.
(Apologies to anyone not a programmer. “Cruft” is clearly understood by programmers to mean “crud, not good, junk accumulated over time and never swept up, junky style” as an approximation; but seems to be a word that is not known to standard English. I’ve used it for about 40 years professionally and to this day don’t know where I learned it… Isn’t jargon fun?)
WattsUpWithThat has an interesting write up of part of the reason that historical computed anomalies change each time a run of GIStemp happens. (That is in addition to the rewriting of the actual temperature data that happens prior to the anomaly steps.) It’s worth reading.
General Style
Most sections have a top level script that runs the show. These are typically SH or KSH. There are also frequently ‘sub scripts’ that are simply wrappers around a FORTRAN program: they compile it, relink some input files, run it, delete the binary, then rename the output files to something else.
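The wrapper pattern is easier to see written out. The sketch below only builds the list of shell commands such a wrapper would issue; it does not run anything. It is an illustration under assumptions, not a copy of any GIStemp script: the function name, the “.exe” suffix, and the file names in the usage example are all hypothetical.

```python
def wrapper_plan(source, inputs, outputs, compiler="f77"):
    """Return the shell commands a compile-run-delete wrapper would issue.

    inputs:  mapping of real input file -> name the program expects to read
    outputs: mapping of name the program writes -> final file name
    """
    exe = source.rsplit(".", 1)[0] + ".exe"
    plan = [f"{compiler} {source} -o {exe}"]                 # compile
    plan += [f"ln -sf {real} {expected}"                     # relink inputs
             for real, expected in inputs.items()]
    plan.append(f"./{exe}")                                  # run
    plan.append(f"rm -f {exe}")                              # delete the binary
    plan += [f"mv {written} {final}"                         # rename outputs
             for written, final in outputs.items()]
    return plan


# Hypothetical file names, chosen only to show the shape of the plan.
for command in wrapper_plan("prog.f",
                            {"input_files/data.txt": "input.dat"},
                            {"output.dat": "work_files/data.out"}):
    print(command)
```

Of the five kinds of command, only the run step does any work on the data, which is why so few of the script lines count as ‘functional’.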
There is a great deal of renaming files and pointless recompilations that accomplish nothing but to obscure the function. A good place to start figuring out the code would be to make a data diagram and track where the data come in, get changed, and go out; then unwind where ‘the music goes round and round, ohoh OH oh Ohoh!’
The download also includes a fairly large number of data files (all the Antarctic data, for example) so the actual size of the code is far smaller than the download.
Getting, unpacking & inspecting the software package, directory structure:
To download the GIStemp source code, go to NASA GISS and click on the download link.
Unpack the archive, and read gistemp.txt
As downloaded, the package decompresses to about 4.5 MB, but DON’T PANIC! This is almost entirely data files scattered through the very small portion that is source code. Most of the steps are coded in FORTRAN with sh and ksh scripts. One step (STEP1) is in Python. The Python step includes 2 ‘C’ programs to be compiled and installed for use by the Python programs. In later sections, many of the programs are duplicates.
Name of step   Lines of code
STEP0            521
STEP1           1050  (Python & C)
STEP2           1319
STEP3           1560
STEP4_5         1631
total           6081
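As a quick arithmetic check on the table above, this short Python sketch re-adds the per-step counts and prints each step’s share of the whole. The numbers are the ones listed; the variable names are only illustrative.

```python
step_lines = {
    "STEP0": 521,
    "STEP1": 1050,  # Python & C
    "STEP2": 1319,
    "STEP3": 1560,
    "STEP4_5": 1631,
}

total = sum(step_lines.values())
print("total", total)

# Share of the code base in each step, as a percentage of the total.
for name, count in step_lines.items():
    print(f"{name:8s} {count:5d} {100 * count / total:5.1f}%")
```

The sum comes out to the 6081 shown in the table.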
The directory structure is simple:
STEP0, STEP1, STEP2, STEP3, STEP4_5 and the ‘readme file’: gistemp.txt
In addition, most STEPs have directories for input files, work files, and output files, though the names for these can vary. Unless stated otherwise, they are likely to be input_files, work_files, and to_next_step.
Sizes in 1k blocks as unpacked:
1916 STEP0
1052 STEP1
100 STEP2
84 STEP3
1352 STEP4_5
12 gistemp.txt
So why STEP0? It was added later in life as a pre-process step…
Inside STEP0, for example
Why so large? Let’s look inside STEP0 first; sizes in 1k blocks (result of ‘du -ks’):
1852 input_files
4 step0_README.txt
4 hohp_to_v2.f
4 get_USHCN
4 dump_old.f
4 do_comb_step0.sh
4 dif.ushcn.ghcn.f
4 cmb2.ushcn.v2.f
4 cmb.hohenp.v2.f
4 antarc_to_v2.sh
4 antarc_comb.sh
4 antarc_comb.f
4 USHCN2v2.f
0 work_files
0 to_next_step
The directories work_files and to_next_step are empty since the programs have not been run yet. Among other things, the Antarctic data series are already in the input directory: input_files. Unexpectedly, so is some executable code (31 lines) that lets you sort the Antarctic data should you download a new copy, along with the shell script to run it (result of: cd input_files; wc *sort* ):
Lines Words Bytes File Name
24 100 694 do_sort
31 55 813 sorts.f
55 155 1507 total
You can probably already guess that “do_sort” is the compile and run script wrapper for the sorts.f FORTRAN program.
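For anyone without wc to hand, the three columns can be reproduced with a few lines of Python. This is a minimal sketch of the usual counting rules (newline characters, whitespace-separated words, raw bytes), not a replacement for the real tool; the function name is just a convenience.

```python
def wc(data: bytes):
    """Return (lines, words, bytes) for a block of file contents."""
    lines = data.count(b"\n")   # number of newline characters
    words = len(data.split())   # runs of non-whitespace characters
    return lines, words, len(data)


# Small in-memory example; for a real file, pass open(path, "rb").read().
print(wc(b"a b\nc\n"))
```

Reading the file in binary mode matters here, since the byte column should not depend on any text decoding.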
The block sizes above are a bit misleading, since in unix / linux / mach land the “blocks” used by a file are assigned in large lumps. A single-character file will take a whole block, often 1k to 4k bytes. So here are the line, word, and byte counts of the text files instead (cd STEP0, wc *) :
Lines Words Bytes File Name
47 109 1362 USHCN2v2.f
87 303 3017 antarc_comb.f
10 46 287 antarc_comb.sh
38 306 1867 antarc_to_v2.sh
40 134 1520 cmb.hohenp.v2.f
65 221 2590 cmb2.ushcn.v2.f
70 226 2394 dif.ushcn.ghcn.f
65 314 2501 do_comb_step0.sh
22 37 504 dump_old.f
9 25 207 get_USHCN
17 31 416 hohp_to_v2.f
20 151 1054 step0_README.txt
490 1903 17719 total
Adding in do_sort and sorts.f we get these totals:
Lines Words Bytes
545 2058 19226 TOTALS
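The gap between the du figures (4 blocks for every small file) and the byte counts is just allocation rounding. Here is a minimal sketch, assuming space is handed out in 4096-byte lumps. That allocation size is an assumption that happens to match the listing above, not something stated in the package.

```python
def du_kblocks(size_bytes, alloc_unit=4096):
    """Disk usage in 1k blocks when space is handed out in alloc_unit lumps."""
    if size_bytes == 0:
        return 0
    units = -(-size_bytes // alloc_unit)  # ceiling division
    return units * alloc_unit // 1024


# Byte counts taken from the wc listing above.
step0_bytes = {"USHCN2v2.f": 1362, "antarc_comb.f": 3017, "get_USHCN": 207}
for name, size in step0_bytes.items():
    print(name, size, du_kblocks(size))
```

Under that assumption a 207-byte script and a 3017-byte program both show up as 4 blocks, which is why the block listing says so little about how much code there is.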
This is typical of all the steps. There are lots of embedded or sometimes leftover data files in the several steps. So for STEP0 there are about 545 lines of code in total. The script do_comb_step0.sh is the top level controlling script (and that naming convention tends to hold through the other steps, just change the digit). When you see a program and a script of nearly the same name, the script is typically the wrapper to turn compiled FORTRAN into something more like interpreted BASIC (which gives you an idea what coding style to expect going ‘forward’…). For example: antarc_comb.sh is the wrapper for the antarc_comb.f program, so you will find antarc_comb.sh called where the FORTRAN is expected to be run.
I will document each individual step in a manner similar to this in an article named with the STEPx name.
How about the files in input_files:
Lines Words Chars
1 4 26 Ts.discont.RS.alter.IN
65 487 4829 Ts.strange.RSU.list.IN
47 282 3337 antarc1.list
2204 26597 212171 antarc1.txt
35 233 2485 antarc2.list
2462 26988 214373 antarc2.txt
66 396 4686 antarc3.list
1567 17312 142484 antarc3.txt
1 5 37 combine_pieces_helena.in
24 100 694 do_sort
1502 4506 40554 mcdw.tbl
8 37 448 preliminary_manual_steps.txt
31 55 813 sorts.f
371 1113 10017 sumofday.tbl
224 4023 18995 t_hohenpeissenberg_200306.txt_as_received_July17_2003
1221 3663 32967 ushcn.tbl
7364 78766 787948 v2.inv
17193 164567 1476864 total
From the sub-directory _old:
Lines Words Chars
2040 24557 196085 antarc1.txt
1534 16850 139004 antarc3.txt
3574 41407 335089 total
From this we can see that the GHCN data in v2.inv are the largest part, followed closely by the Antarctic data. We can also see that they have conveniently left some old copies lying about for us to look at too.
The file combine_pieces_helena.in contains one line that states
147619010000 147619010002 1976 8 1.0
I think this gives 2 station IDs for Helena and a cutover date, but that needs to be verified down in the code somewhere.
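To show what that reading of the line would look like in practice, here is a minimal parsing sketch. The field names follow the guess above (two station IDs, then a year and month) and are assumptions; the meaning of the trailing 1.0 is not known from the file itself, so it is kept as an unnamed value.

```python
from typing import NamedTuple


class CombineRule(NamedTuple):
    station_a: str   # assumed: first station ID
    station_b: str   # assumed: second station ID
    year: int        # assumed: cutover year
    month: int       # assumed: cutover month
    value: float     # meaning unknown until checked against the code


def parse_combine_line(line):
    """Split the five whitespace-separated fields into a CombineRule."""
    a, b, year, month, value = line.split()
    return CombineRule(a, b, int(year), int(month), float(value))


rule = parse_combine_line("147619010000 147619010002 1976 8 1.0")
print(rule)
```

The station IDs are kept as strings so that leading digits and the full 12-character width survive intact.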
The file preliminary_manual_steps.txt contains these 8 lines:
antarc1.txt was downloaded from http://www.antarctica.ac.uk/met/READER/surface/stationpt.html
antarc2.txt was downloaded from http://www.antarctica.ac.uk/met/READER/temperature.html
antarc3.txt was downloaded from http://www.antarctica.ac.uk/met/READER/aws/awspt.html
some typos in antarc2.txt were manually corrected
Station information files were manually created combining information from
the above files and GHCN’s v2.temperature.inv
This is a bit less than helpful. Exactly what are ‘some typos’ and what is ‘manually created’ as a process? And which files are the ‘station information’ files? And while we are at it, where is the v2.temperature.inv file from which they were made? Yes, I’m sure I can ‘work it out’ but a word or two in the docs or the directory would have been helpful… As we’ve seen in an earlier posting, looking into the GHCN download directory is more helpful. That is where you find v2.temperature.inv documented.
In STEP4_5 there are a couple of more bits of data downloaded. gistemp.txt lists these as:
http://www.hadobs.org HadISST1: 1870-present
http://ftp.emc.ncep.noaa.gov cmb/sst/oimonth_v2 Reynolds 11/1981-present
One of these (oimonth_v2) is a Sea Surface Temperature anomaly map. Some folks have asserted this means that GIStemp uses satellite data since this anomaly map is used. Yes, the map is made from a combination of surface records and satellite data, but by the time it gets here it is just a grid of 1 degree cells (latitude by longitude) with a single anomaly number per month. Not exactly what I’d call “satellite data”. More like a “Satellite derived anomaly map” product.
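To give a feel for how little is left in a 1 degree grid, here is a minimal sketch of mapping a latitude and longitude onto such a grid. The row and column ordering (rows from the South Pole, columns eastward from 0 degrees) is an assumption for illustration, not the layout of the actual oimonth_v2 file.

```python
import math

N_ROWS, N_COLS = 180, 360      # 1 degree cells covering the globe
N_CELLS = N_ROWS * N_COLS      # 64800 cells, one anomaly each per month


def cell_index(lat, lon):
    """Map latitude [-90, 90] and any longitude to a (row, col) cell."""
    row = min(int(math.floor(lat + 90.0)), N_ROWS - 1)  # the pole joins the top row
    col = int(math.floor(lon % 360.0)) % N_COLS         # wrap longitude into 0..359
    return row, col


print(N_CELLS)
print(cell_index(0.5, -0.5))
```

Whatever went into building the map, each cell carries only that one number per month by the time GIStemp sees it.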
OK, that gives you a basic idea what you’re looking at. For the preliminary steps, and each of the STEPx blocks, I’ll be adding postings in more detail.
Maybe “Cruft” is a regional programmer dialect. The word I’ve heard most is “kludge” which pretty much means the same thing.
Yeah, I love jargon…it gets even more fun when two parties use the same word, but it means completely different things to each.
At least out here, a “kludge” is a very badly put together piece of code that may or may not work right. “Cruft” is a milder issue, more in keeping with untidiness or sloppy and having less sense of poorly functioning. So for example one would say “That backup program is a real kludge, it always hangs on February 28th.” as compared to “That backup program is full of cruft; it looks like 2 people had a style collision, there is no indenting, and there are 2 subroutines that are never called.”