GIStemp, as its first act, takes several other data sets and glues them together. Here we will take a look at the most important one: GHCN, the Global Historical Climatology Network. The others are either a near subset (USHCN, the U.S. Historical Climatology Network) or bits and pieces from odd places, like the data from the Antarctic. In most contexts the code calls the GHCN data “v2”, for “version 2”.
The “HCN” part stands for “Historical Climatology Network”, but in practice it means the land-based weather stations, as opposed to the satellites. “US” means inside the U.S.A. and “G” means the whole globe. Both GHCN and USHCN come from NOAA, but they have slightly different ‘correction’ histories. You may choose to download minimally corrected data or data with more corrections for things like “TOB” (Time of Observation Bias) or “UHI” (Urban Heat Island effect).
So, to understand GIStemp, we have to take a look at GHCN data. Here is the “README” file from the GIStemp download:
This is a very brief description of GHCN version 2 temperature data and
metadata (inventory) files, providing details, such as formats, not
available in www.ncdc.noaa.gov/ghcn/ghcn.html.
New monthly data are added to GHCN a few days after the end of
the month. Please note that sometimes these new data are later
replaced with data with different values due to, for example,
occasional corrections to the transmitted data that countries
will send over the Global Telecommunications System.
All files except this one were compressed with a standard UNIX compression.
To uncompress the files, most operating systems will respond to:
“uncompress filename.Z”, after which, the file is larger and the .Z ending is
removed. Because the compressed files are binary, the file transfer
protocol may have to be set to binary prior to downloading (in ftp, type bin).
The three raw data files are:
The versions of these data sets that have data which we adjusted
to account for various non-climatic inhomogeneities are:
Each line of the data file has:
station number which has three parts:
country code (3 digits)
nearest WMO station number (5 digits)
modifier (3 digits) (this is usually 000 if it is that WMO station)
one digit (0-9). The duplicate order is based on length of data.
Maximum and minimum temperature files have duplicate numbers but only one
time series (because there is only one way to calculate the mean monthly
maximum temperature). The duplicate numbers in max/min refer back to the
mean temperature duplicate time series created by (Max+Min)/2.
four digit year
12 monthly values each as a 5 digit integer. To convert to
degrees Celsius they must be divided by 10.
Missing monthly values are given as -9999.
If there are no data available for that station for a year, that year
is not included in the data base.
A short FORTRAN program that can read and subset GHCN v2 data has been
Station inventory and metadata:
All stations with data in max/min OR mean temperature data files are
listed in the inventory file: v2.inv. The available metadata
are too involved to describe here. To understand them, please refer
to: www.ncdc.noaa.gov/ghcn/ghcn.html and to the simple FORTRAN
program read.inv.f. The comments in this program describe the various
metadata fields. There are no flags in the inventory file to indicate
whether the available data are mean only or mean and max/min.
The file v2.country.codes lists the countries of the world and
GHCN’s numerical country code.
Data that have failed Quality Control:
We’ve run a Quality Control system on GHCN data and removed
data points that we determined are probably erroneous. However, there
are some cases where additional knowledge provides adequate justification
for classifying some of these data as valid. For example, if an isolated
station in 1880 was extremely cold in the month of March, we may have to
classify it as suspect. However, a researcher with an 1880 newspaper article
describing the first ever March snowfall in that area may use that special
information to reclassify the extremely cold data point as good. Therefore,
we are providing a file of the data points that our QC flagged as probably
bad. We do not recommend that they be used without special scrutiny. And
we ask that if you have corroborating evidence that any of the “bad” data
points should be reclassified as good, please send us that information
so we can make the appropriate changes in the GHCN data files. The
data points that failed QC are in the files v2.m*.failed.qc. Each line
in these files contains station number, duplicate number, year, month,
and the value (again the value needs to be divided by 10 to get
degrees C). A detailed description of GHCN’s Quality Control can be
found through www.ncdc.noaa.gov/ghcn/ghcn.html.
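To make the record layout in that README concrete, here is a minimal Python sketch of a parser for one line of a data file. It is not part of GIStemp or of the GHCN distribution. It assumes the fields are packed as contiguous fixed-width columns in the order the README lists them (3 + 5 + 3 + 1 + 4 characters, then twelve 5-character values), which the README implies but does not state outright. The names `V2Record` and `parse_v2_line` are made up for illustration.

```python
from typing import List, NamedTuple, Optional

MISSING = -9999  # README: missing monthly values are given as -9999


class V2Record(NamedTuple):
    country: str    # 3-digit country code
    wmo: str        # 5-digit nearest WMO station number
    modifier: str   # 3-digit modifier, usually 000
    duplicate: int  # 1-digit duplicate number (0-9)
    year: int       # 4-digit year
    temps_c: List[Optional[float]]  # 12 monthly values in degrees C


def parse_v2_line(line: str) -> V2Record:
    """Parse one data line, ASSUMING contiguous fixed-width columns."""
    line = line.rstrip("\r\n")
    if len(line) < 76:  # 16 characters of header + 12 * 5 of values
        raise ValueError("line too short for a v2 data record")
    temps: List[Optional[float]] = []
    for month in range(12):
        start = 16 + 5 * month
        raw = int(line[start:start + 5])
        # Values are stored in tenths of a degree Celsius.
        temps.append(None if raw == MISSING else raw / 10.0)
    return V2Record(
        country=line[0:3],
        wmo=line[3:8],
        modifier=line[8:11],
        duplicate=int(line[11]),
        year=int(line[12:16]),
        temps_c=temps,
    )
```

Missing months come back as `None` rather than as a magic number, so a later average cannot silently swallow a -999.9.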
So, there you go. Some pretty good pointers to where to get bits and what they mean. But what about these “read.inv.f” and “read.data.f” programs it mentions? Well, I didn’t see them. But I did see one named “v2.read.data.f” that seems to do the same thing.
The comment block from down in the guts of that program does a nice job of telling you what the fields are:
c ic=3 digit country code; the first digit represents WMO region/continent
c iwmo=5 digit WMO station number
c imod=3 digit modifier; 000 means the station is probably the WMO
c station; 001, etc. mean the station is near that WMO station
c name=30 character station name
c rlat=latitude in degrees.hundredths of degrees, negative = South of Eq.
c rlong=longitude in degrees.hundredths of degrees, - = West
c ielevs=station elevation in meters, missing is -999
c ielevg=station elevation interpolated from TerrainBase gridded data set
c pop=1 character population assessment: R = rural (not associated
c with a town of >10,000 population), S = associated with a small
c town (10,000-50,000), U = associated with an urban area (>50,000)
c ipop=population of the small town or urban area (needs to be multiplied
c by 1,000). If rural, no analysis: -9.
c topo=general topography around the station: FL flat; HI hilly,
c MT mountain top; MV mountainous valley or at least not on the top
c of a mountain.
c stveg=general vegetation near the station based on Operational
c Navigation Charts; MA marsh; FO forested; IC ice; DE desert;
c CL clear or open;
c not all stations have this information in which case: xx.
c stloc=station location based on 3 specific criteria:
c Is the station on an island smaller than 100 km**2 or
c narrower than 10 km in width at the point of the
c station? IS;
c Is the station is within 30 km from the coast? CO;
c Is the station is next to a large (> 25 km**2) lake? LA;
c A station may be all three but only labeled with one with
c the priority IS, CO, then LA. If none of the above: no.
c iloc=if the station is CO, iloc is the distance in km to the coast.
c If station is not coastal: -9.
c airstn=A if the station is at an airport; otherwise x
c itowndis=the distance in km from the airport to its associated
c small town or urban center (not relevant for rural airports
c or non airport stations in which case: -9)
c grveg=gridded vegetation for the 0.5×0.5 degree grid point closest
c to the station from a gridded vegetation data base. 16 characters.
c A more complete description of these metadata are available in
c other documentation
Unfortunately, it does not tell you just what that ‘other documentation’ might be nor where to find it…
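The code tables in that comment block are small enough to capture as lookup tables. The following Python sketch decodes a few of the fields once they have been read from an inventory line. It deliberately does not parse the line itself, because the comment block gives field meanings but not column positions. The function and table names are made up for illustration, and the case-insensitive lookup is an assumption, since the comments show “xx” and “no” in lower case.

```python
from typing import Optional, Tuple

# Code tables transcribed from the v2.read.data.f comment block.
POP_CLASS = {
    "R": "rural (not associated with a town of >10,000)",
    "S": "small town (10,000-50,000)",
    "U": "urban area (>50,000)",
}
TOPO = {"FL": "flat", "HI": "hilly", "MT": "mountain top",
        "MV": "mountainous valley"}
VEGETATION = {"MA": "marsh", "FO": "forested", "IC": "ice",
              "DE": "desert", "CL": "clear or open", "XX": "no information"}
LOCATION = {"IS": "small island", "CO": "within 30 km of the coast",
            "LA": "next to a large lake", "NO": "none of the above"}


def lookup(table: dict, code: str) -> str:
    """Case-insensitive lookup (an assumption about the file's casing)."""
    return table.get(code.strip().upper(), "unknown code")


def decode_population(pop: str, ipop: int) -> Tuple[str, Optional[int]]:
    """ipop is in thousands; -9 means no analysis (rural stations)."""
    persons = None if ipop == -9 else ipop * 1000
    return lookup(POP_CLASS, pop), persons


def decode_elevation(ielevs: int) -> Optional[int]:
    """Station elevation in meters; -999 marks a missing value."""
    return None if ielevs == -999 else ielevs


def coast_distance_km(stloc: str, iloc: int) -> Optional[int]:
    """Distance to the coast, only meaningful for stations coded CO."""
    if stloc.strip().upper() == "CO" and iloc != -9:
        return iloc
    return None
```

Turning the -9 and -999 sentinels into `None` at the point of reading keeps them from leaking into later arithmetic.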
The station data are in a file named “v2.temperature.inv”, which has things like a station ID number, a name, latitude, longitude, kind of ground cover, etc. A significant part of GIStemp STEP0 is devoted to gluing these station data together with the temperature history (which is stored by ID number only).
In my opinion, it would be far better to load all the temperature and station data into a simple relational database rather than jump through all the hoops that GIStemp does. That would eliminate much of the confusion and greatly simplify the code.
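As a rough illustration of the relational idea (and only that; GIStemp contains nothing like it), here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names, and the helpers `build_db` and `annual_means` are all hypothetical. The point is that joining station metadata to temperature history becomes a single `JOIN` on the station ID.

```python
import sqlite3


def build_db(stations, readings):
    """Load station metadata and monthly temperatures into an in-memory DB.

    stations: iterable of (station_id, name, lat, lon)
    readings: iterable of (station_id, duplicate, year, month, temp_c),
              with temp_c set to None where the source had -9999.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE station ("
        " station_id TEXT PRIMARY KEY, name TEXT, lat REAL, lon REAL)"
    )
    conn.execute(
        "CREATE TABLE monthly_temp ("
        " station_id TEXT REFERENCES station(station_id),"
        " duplicate INTEGER, year INTEGER, month INTEGER, temp_c REAL,"
        " PRIMARY KEY (station_id, duplicate, year, month))"
    )
    conn.executemany("INSERT INTO station VALUES (?, ?, ?, ?)", stations)
    conn.executemany(
        "INSERT INTO monthly_temp VALUES (?, ?, ?, ?, ?)", readings
    )
    return conn


def annual_means(conn):
    """Join metadata to temperatures; AVG and COUNT skip NULL months."""
    return conn.execute(
        "SELECT s.name, t.year, AVG(t.temp_c), COUNT(t.temp_c)"
        " FROM monthly_temp t"
        " JOIN station s ON s.station_id = t.station_id"
        " GROUP BY s.station_id, t.duplicate, t.year"
        " ORDER BY s.name, t.year"
    ).fetchall()
```

Storing missing months as NULL means the database's own aggregate functions handle them, and the count of valid months comes back alongside each mean, so thin years are easy to spot.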
Update For The Future
The GHCN data show a massive die-off of thermometers around 2007.
The USHCN data had a conversion of format to the USHCN.v2 format in 2007.
The net effect of these two changes was that only 136 records were used for the entire USA in GIStemp from 2007 until November of 2009.
At that time, NASA GISS changed their code to pull the USHCN.v2 data into GIStemp. See:
Or the equivalent ftp link:
The USHCN.v2 data are far more heavily “adjusted”, so there is more “warming of the historical trend” from rewriting the past to be colder, but at least now more of the USA thermometers are being used. Now all they need to do is put back the 85% or so of the thermometers in the rest of the world that were deleted from the record from 1990 or so to date (yet left in the baseline periods…).
I have heard a report that the GHCN data set will be “improved” with the same “adjustment” method that was put into USHCN.v2, so we will have to wait and see whether this also increases the warming slope for the rest of the world…