GCM ModelE – Elephants Playing Mozart

I’ve done a first pass through the GCM (Global Circulation Model) Model E code.

If “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk” then Model E is structured so that the Elephant can play Mozart while on parade…

(History of the saying in a short posting here by Steven Goddard.)

Never The Same Code Twice

First off, it follows in the tradition of GIStemp where every time you run the program it is recompiled. This means it is never the same code twice.

Compilers are very complicated beasts. They make decisions. LOTS of decisions. Sometimes they have bugs in them. Sometimes they do things in different order. They pull in bits of code from various libraries and link it all together, so if the libraries change, the product changes.

Historically, libraries were ‘static linked’ so that your program, when run, was a constant and unchanging product. You can still set static linking if desired. In the last few decades, dynamic linking has become more popular. The program only loads the bits of the libraries it needs when it runs. This makes programs smaller in memory footprint, and it means an update to a library doesn’t require recompiling all the programs that use it (unless the change is a big one and changes the calling sequence or parameters). The downside is that changes to the library can cause changes to, or failures of, the programs. In the Microsoft world those are the DLL (Dynamic Link Library) failures some may have run into.

While dynamic libraries are a benefit, they do add a little bit of risk of change. Generally manageable since libraries are strongly QA tested and any bugs usually surface fast as they are also widely used. So generally folks are OK with that. For those who are not, you can static link to the old libraries if desired.

Model E goes way past that.

Not only does it give you lots of potential changes from dynamic libraries and compiler runs, it lets you tune and prune compiler flags, linker flags, and more. Turning on or off various sections of the code, adding, changing, or removing bits of it as you like it. Then your “run” is compiled and executed.

I’d originally expected there to be a “run deck” of parameters for any given run, and some relatively static input files of historical data. I’d not expected a lot of “fiddling with the model code” on every run. Essentially the whole model is parameters. When someone says “I ran Model E” and got FOO results, what they really ought to be saying is “I tuned Model E to get FOO results”.

This may well be “normal” in what passes for “Climate Science”, but it is not the way to get stable and reliable results via modeling actual natural processes as defined by observations of nature.

Model, which Model?

You can get a copy of the Model E code from NASA/GISS via this link (and the link in it) that describes the model.

https://www.giss.nasa.gov/tools/modelE/

Your first clue that this is a “Moving River” (as in “You can never cross the same river twice.”) comes in that text. Bolding mine.

The current incarnation of the GISS series of coupled atmosphere-ocean models is available here. Called ModelE,
[…]
Model versions used for the CMIP5 and CMIP6 simulations are available via the nightly snapshots of the current code repository, including the frozen ‘AR5_branch’. However, users should be aware that these snapshots are presented ‘as is’ and are not necessarily suitable for publication-quality experiments.

I note in passing the use of “experiments” to describe computer model fantasies… I have nothing against models as tools. They can be very useful. But they are NOT experiments. They are a computerized fantasy that is best used to “inform our ignorance”. Compare where the model diverges from reality to see where you have it wrong. That’s where you learn something.

Sidebar: I ran a supercomputer center for 7+ years mostly doing plastic modeling for making dies. The model was right about 90% of the time saving lots of money on dies. ONE fluid at ONE temperature in ONE fixed space, 10% wrong. That from a well tested widely used program designed for exactly that use, not something hacked up on each run with code changes. At $1/4 Million per set of dies, we were VERY motivated to get 100% right.

The link to the “AR5 Branch” is here:

https://simplex.giss.nasa.gov/snapshots/modelE2_AR5_branch.2017.01.18_08.50.02.tgz

I note in passing that the one I downloaded to do this examination was named:
modelE2_AR5_branch.2016.12.28_08.50.01.tgz

so it looks like that frozen AR5 branch changes daily or weekly…

Or maybe it’s one of these…

GISS Model E Source Code Snapshots

The following versions of Model E source code are available for public download:

modelE2_AR5_branch.2017.01.18_08.50.02.tgz – the version of the code used for AR5 simulations.
modelE2_AR5_v2_branch.2017.01.18_08.50.02.tgz – updated and improved version of AR5 code (still under development).
modelE1_patches.2017.01.18_08.50.02.tgz – the version of the code used for AR4 simulations.
modelE2.2017.01.18_08.50.02.tgz – the latest code from the develpment branch. This code is always under development by definition. It may be unstable or not even compile at all. Use at your own risk.
modelE2_planet_1.0.2017.01.18_08.50.02.tgz – version 1.0 of ROCKE3D model as it is currently used in our planetary simulations. (Still under development, functionality may change at later time.)

So their “AR5 Branch” of archived code is actually 2 versions, one under development, and a bunch of patches related to AR4, maybe, and they have a “development branch” too… Decisions decisions… (And I left out their ‘spectral binning’ code from further down the page…)

GISS Model E Tools

The following tools are available for public download:

Spectral_Binning_GISSGCM.2017.01.18_08.50.02.tgz – Scripts to generate ModelE radiation input files from a raw stellar spectrum.

So you can automate the creation of the stellar parameter fudge… Wonder if it handles the recent cool-down of the sun, with a shift of 10% or so in the UV vs IR levels?…

That’s just for starters…

Down In The Weeds

When you unpack the code, you get this directory structure:

chiefio@Headend:/Climate/modelE2_AR5_branch$ ls
Makefile  aux   config  doc   init_cond  templates
README    cmor  decks   exec  model      tests

At some future time I’m going to describe the technical side of it more, but now I’m just looking at the control knobs.

Typically you start with reading the README file, and a look at the Makefile (that builds any given bit of software in the *Nix world). It looks fairly clean at this point with only one Makefile… but when you dig into it, that changes. First, though, some sizes.

chiefio@Headend:/Climate/modelE2_AR5_branch$ du -ks *
4	Makefile
4	README
1076	aux
92	cmor
132	config
36	decks
440	doc
224	exec
272	init_cond
15996	model
1360	templates
136	tests

Most of the code is in the “model” directory. Yet “aux” and “templates” are fairly large too. 1 MB (1076 kB) for AUX and 1.3 MB for templates. Those will be doing something big. Then init_cond will be setting some initial state too.

What does the README say? It says essentially don’t trust the README… and here are some untrustworthy clues:

NOTE: The full documentation on how to run the model is in doc/HOWTO.html.
This summary is not complete, and not necessarily up-to-date either.

PLEASE READ THE FULL DOCUMENTATION – IT REALLY WILL MAKE YOUR LIFE EASIER!

The directory tree of the modelE has the following structure:

  modelE
        |
        |-/model   (the source code for GCM model)
        |
        |-/aux     (auxiliary programs such as pre- and post-processing)
        |
        |-/exec    (various scripts needed to compile and setup the model)
        |
        |-/doc     (directory for documentation)
        |
        |-/decks   (directory for rundecks)
                |
                |-<RunID>.R     (rundeck for the run <RunID>)
                |
                |-/<RunID>_bin  (directory for binaries for <RunID>)
                |
                |-/<RunID>      (link to directory where you setup
                |                     and run <RunID>)
                |-<RunID2>.R
                ................

Configuring the model on your local system

Intended working directory is directory modelE/decks. In this
directory type “gmake config”. This will create a default ~/.modelErc
file in your home directory. This should be edited so that run output,
rundeck libraries etc. can be properly directed, and so that the
compile options (multiple processing, compiler name , NetCDF libraries
etc.) can be set appropriately.

Compiling and running the model.

All rundecks should be created inside this directory and all “make”
commands should be run from there. The following is a typical example
of how to compile and setup a run with the name “my_run”:

cd decks # go to directory decks
gmake rundeck RUN=my_run # create rundeck for “my_run”

You will need to edit the rundeck in order to choose a configuration
that is appropriate. Once that is done…

gmake gcm RUN=my_run # compile the model for “my_run”
gmake setup RUN=my_run # run setup script for “my_run”

Make sure that you create the rundeck with “gmake rundeck …” before
running any other commands for this run, otherwise the Makefile will
not understand you. You can skip “gmake gcm …” and just do “gmake setup…”
in which case gcm will be compiled automatically.
Another command you want to run (after creating the rundeck) is:

gmake aux RUN=my_run

This will compile auxiliary programs in /aux. All the binaries (both for
model and for auxiliary programs) are put into /decks/my_run.bin .

The following is a list of targets currently supported by Makefile:

config – copy the default .modelErc setup to your home directory.
rundeck – create new rundeck
depend – create dependencies for specified rundeck
gcm – compile object files and build executable for specified rundeck
aux – compile standard auxiliary programs
auxqflux- compile auxiliary programs for computing qflux
auxdeep – compile auxiliary programs for setting deep ocean
setup – do setup for specified rundeck
clean – remove listings and error files
vclean – remove object files, .mod files and dependencies
newstart- remove all files in the run directory
exe – compile gcm and put executable into RUN directory
htmldoc – create web-based documentation for this RUN

If you run “gmake” without arguments it will print a short help.

The use of ‘gmake’ to run your model is a big whack with the Clue Stick that this thing does a rebuild on each run. Toward the bottom I’ve bolded a few bits where that is confirmed. For every rundeck, a new model is built. We also see that ‘exec’ is where the real making of the program is done. So what’s in exec?

chiefio@Headend:/Climate/modelE2_AR5_branch$  ls exec
CommandEntry.pm         modelErc                runpmE
CommandPool.pm          pdE                     setup_e.pl
RegressionTools.pm      pproc_dep.pl            setup_e_new.pl
clone                   protect_ccp_options.pl  sfmakedepend
comp_mkdep.pl           r_to_mk.pl              split72.pl
compare_module_file.pl  regression.py           sswE
convertToFreeFormat.py  regressionTests.pl      sumacc_nc
editRundeck.sh          regression_tests.sh     unlock
gcmdoc.pl               runE
get_input_data          runE_new

A mix of Perl scripts (.pl) and shell scripts (.sh) and Python (.py) and more. These folks can’t even keep a standard for what language to use. I know, I know, right tool for the job and all that. BUT it makes for a maintenance nightmare, as you must be fluent in 4 languages right out the gate. (The good news is that they are not that different. The bad news is that they have pernicious differences in some details…)

I don’t know what a .pm file type is, so I looked inside the RegressionTools one. It looks like a bunch of parameter setting. (Is pm short for parameters?)

Here’s a sample from the top of that file:

#package RegressionTools;
use CommandEntry;

my $standardFlags = "SETUP_FLAGS=-wait";
my $HYCOM = "E1fzhyc";

my $extraFlags;
$extraFlags{SERIAL}   = "";
$extraFlags{MPI}      = "ESMF=YES NPES=\$npes";
$extraFlags{OPENMP}   = "EXTRA_FFLAGS=-mp MP=YES NPROCS=\$npes";
$extraFlags{SERIALMP} = "EXTRA_FFLAGS=-mp"; # Intel kludge for matching serial and OpenMP.

$extraFlags{E1M20} ="";
$extraFlags{E1F20} ="";
$extraFlags{E1oM20} ="";
$extraFlags{E001tr} ="";
$extraFlags{$HYCOM} ="EXTRA_FFLAGS+=-DCHECK_OCEAN_HYCOM";

my $numLinesCMP;
$numLinesCMP{E1M20}  = 120;
$numLinesCMP{E1F20}  = 120;
$numLinesCMP{E1oM20} = 110;
$numLinesCMP{E001tr} = 123;
$numLinesCMP{$HYCOM} = 138;

sub setModuleEnvironment {
#    require perlModule;
    require "$ENV{MODULESHOME}/init/perl";
    module (purge);
    module (load, "comp/intel-10.1.023", "lib/mkl-10.1.2.024", "mpi/impi-3.2.1.009");

Now you can see why I had that preamble about libraries. This thing is setting the library environment as a parameter…

How much of this stuff is there?

chiefio@Headend:/Climate/modelE2_AR5_branch/exec$ wc -l *.pm
  157 CommandEntry.pm
  148 CommandPool.pm
  221 RegressionTools.pm
  526 total

526 lines. IF only 1/2 of them actually set parameters, that’s over 250. If 5 wiggles the trunk, what does 5 x 5 x 5 x 2 do?

But that’s a crude guess at best. For one thing, the CommandEntry.pm file isn’t all parameter setting. It runs some scripts inside it that do a variety of things, some obscure at this stage. Some snips from it:

First a bit of Perl that looks to be reading in command line arguments (parameters):

#!/usr/bin/perl
package CommandEntry;

sub new {
  my $proto = shift;
  my ($args) = @_;

  my $class = ref($proto) || $proto;
  my $self  = {};

  $self -> {COMMAND} = " ";
  $self -> {NUM_PROCS} = 1;
  $self -> {QUEUE} = LOCAL; # or BATCH
  $self -> {DEPENDENCY} = ""; # place holder for list of references to other commands
  $self -> {STATUS} = PENDING;
  $self -> {STDOUT_LOG_FILE} = "LOG";
  $self -> {STDERR_LOG_FILE} = "ERR";

  # override defaults with arguments
  while ( my ($key, $value) = each(%$args) ) {
    $self -> {$key} = $value;
  }

  bless ($self, $class);
  return $self;
}

But further down it has an embedded ‘here document’ of shell (bash) script:

  my $script = <<EOF;
#!/bin/bash
#PBS -l select=$nodes:ncpus=$ncpus:proc=neha
#PBS -l walltime=$walltime
#PBS -W group_list=a940a
#PBS -N regression
#PBS -j oe
##PBS -o foo
#PBS $queueString

cd \$PBS_O_WORKDIR
. /usr/share/modules/init/bash
module purge
module load comp/intel-10.1.023 lib/mkl-10.1.2.024 mpi/impi-3.2.1.009

$commandString
EOF

and not only is it working off of a command string, but it fiddles with the libraries too.

Yet even that isn’t the end of it. Looking in regressionTests.pl we find a bunch more environment setting. First off, it calls those other bits above, so we know where they get executed:

use Env;
use CommandPool;
use CommandEntry;
use RegressionTools;

Then for some reason it makes an http reference when it is opening a log file…

open(LOG,">nightlyTests.log");
# Make the LOG hot: <http://www.thinkingsecure.com/docs/tpj/issues/vol3_3/tpj0303-0002.html>

{
    my $ofh = select LOG;
    $| = 1;
    select $ofh;
}

Then it gets down to setting a bunch of environment variables (think of them as parameters that change where the program gets parts of its run-time context…)

$ENV{CVS_RSH}="ssh";
$env{CVSROOT}="simplex.giss.nasa.gov:/giss/cvsroot";
my $scratchDir = $ENV{NOBACKUP}."/regression_scratch";
$env{SCRATCH_DIRECTORY}=$scratchDir;
$env{REFERENCE_DIRECTORY}="$env{SCRATCH_DIRECTORY}/modelE";
$env{BASELINE_DIRECTORY}="$ENV{NOBACKUP}/modelE_baseline";
$env{RESULTS_DIRECTORY} = $ENV{NOBACKUP}."/regression_results";

@modelErcVariables = (DECKS_REPOSITORY, MP, CMRUNDIR, EXECDIR, OVERWRITE,
                      OUTPUT_TO_FILES, VERBOSE_OUTPUT, SAVEDISK, GCMSEARCHPATH,
                      COMPILER, COMPILER_VERSION, BASELIBDIR, ESMF_BOPT, MPIDISTR, NETCDFHOME);

Now from my POV, the PITA (Pain In The Ass…) is that they have things like “simplex.giss.nasa.gov” coded into this thing. To run it elsewhere I will likely need to find and change those things. OK, nice that it is set in a parameter setting bit, but PITA to have it set at all. Typically you give a program a working directory path and leave it alone.
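Just to make that point concrete, here is a toy stand-alone sketch (my own illustration, not how ModelE itself reads its settings) of how an environment variable like GCMSEARCHPATH is simply one more parameter a program picks up from whatever context it happens to run in:

program envdemo
! Toy sketch only: an environment variable is just another input
! parameter supplied by the context the program runs in.
! The './input_files' fallback is hypothetical.
  implicit none
  character(len=256) :: path
  integer :: stat
  call get_environment_variable('GCMSEARCHPATH', path, status=stat)
  if (stat /= 0) then
     path = './input_files'
     print *, 'GCMSEARCHPATH not set, falling back to ', trim(path)
  end if
  print *, 'searching for input files under: ', trim(path)
end program envdemo

Change the variable and you change the behaviour, with no source edit and no record in the code itself. That is exactly why all this environment setting amounts to still more knobs.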

Now the good thing is it looks like they are actually using a CVS repository to capture changes of state. (If only that would do that for actual temperature data…) but the bad thing is they have it coded into the setup programs.

$env{DECKS_REPOSITORY}="$scratchDir/decks_repository";
$env{MP}="no";
$env{CMRUNDIR}="$scratchDir/cmrun";
$env{EXECDIR}="$scratchDir/exec";
$env{OVERWRITE}="YES";
$env{OUTPUT_TO_FILES}="YES";
$env{VERBOSE_OUTPUT}="YES";
$env{SAVEDISK}="$scratchDir/savedisk";
$env{GCMSEARCHPATH}="/discover/nobackup/projects/giss/prod_input_files";
$env{BASELIBDIR}="/usr/local/other/baselibs/ESMF222rp3_NetCDF362b6_10.1.017_intelmpi/Linux";

And we have yet more library fiddle… Just what are those libraries and where do I get a copy? Who knows… Yet Another Dig Here in getting this set up and run. Hopefully the docs cover it.

There’s a whole lot more of that kind of thing in that particular directory, but I’m going to move on to another bit as these weeds are already too deep.

AUX making Inputs

In the “aux” directory there are a lot of files, but the ‘make’ ones caught my eye first.

chiefio@Headend:/Climate/modelE2_AR5_branch$ ls aux/make*
aux/make_surface_transient_Alkenes_C90.sh
aux/make_surface_transient_BCB.sh
aux/make_surface_transient_BCII.sh
aux/make_surface_transient_CH4_C90.sh
aux/make_surface_transient_CO_C90.sh
aux/make_surface_transient_NH3.sh
aux/make_surface_transient_NOx_C90.sh
aux/make_surface_transient_OCB.sh
aux/make_surface_transient_OCII.sh
aux/make_surface_transient_Paraffin_C90.sh
aux/make_surface_transient_SO2.sh

So we’ve got 11 different scripts “make”-ing transient surface emissions inputs for specific chemical species. Lots of knobs here.

Looking in the last one (SO2) a sample:

#!/bin/bash

module avail tool/idl
module load tool/idl-6.4
ulimit -s 6000000
ulimit -v unlimited

res='C90'
spec='SO2'

for year in 1850
do
  cd /discover/nobackup/dgueyffi/modelE/aux

   dmget /archive/g08/dmkoch/ftp/AR5/IPCC_emissions_${spec}_anthropogenic_${year}_0.5x0.5*nc
   cp /archive/g08/dmkoch/ftp/AR5/IPCC_emissions_${spec}_anthropogenic_${year}_0.5x0.5*nc .

   dmget /archive/g08/dmkoch/ftp/AR5/IPCC_emissions_${spec}_ships_${year}_0.5x0.5*nc
   cp /archive/g08/dmkoch/ftp/AR5/IPCC_emissions_${spec}_ships_${year}_0.5x0.5*nc .

  ./remap.pl -par ncregrid-ijl.par -in IPCC_emissions_${spec}_anthropogenic_${year}_0.5x0.5*nc -out IPCC_emissions_${spec}_anthropogenic_${year}_C90_Dec_2009.nc

  ./remap.pl -par ncregrid-ijl.par -in IPCC_emissions_${spec}_ships_${year}_0.5x0.5*nc -out IPCC_emissions_${spec}_ships_${year}_C90_Dec_2009.nc

  cd /gpfsm/dnb53/gfaluveg/AR5_emissions/v5_anthro
  rm -f AR5.bat
  ./make_AR5_program_${res}.ksh ${spec} ${year} anthropogenic C90_Dec_2009 dom ind wst awb tra
  echo ".com convert_${spec}.pro" >> ./AR5.bat
  echo ".run convert_${spec}.pro" >> ./AR5.bat
#now run the idl batch file:
[...]

Hoo boy… so it grabs a bunch of stuff from “/archive”. But we don’t have a “/archive”. Looks like their “usual dodge” of releasing the model but not the input files driving it; at least at first glance. “We’ll see” as I dig into it in the coming weeks (months? years?).

But wait, there’s More!

Looking at aux/regrid.par (the .par got me wondering if it was more parameters…) we find:

filedir=/discover/nobackup/projects/giss/prod_input_files/
regridfile=remap288-180C90-90.nc
imsource=288
jmsource=180
ntilessource=1
imtarget=90
jmtarget=90
ntilestarget=6
format=ij
nfields=1
title=yes

Yup. MORE parameters. For one it points to “Yet Another Missing Directory” in that /discover top level qualifier. Then sets about setting more parameters. How many .par files are there?

chiefio@Headend:/Climate/modelE2_AR5_branch$ ls aux/*.par
aux/ICOxref.par       aux/ncregrid.par          aux/regridNOxnatural.par
aux/ICgastracers.par  aux/regrid.par            aux/regridc1x1.par
aux/ICsulfate.par     aux/regrid1x1Q.par        aux/regride2x2.5.par
aux/ncregrid-ijl.par  aux/regridCH4natural.par

Hopefully they are only used in the post-processing… yet even that determines the results as presented to the world.
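As an aside, for anyone wondering what a regrid step like that actually does: conceptually, each coarse target cell is built as a weighted average of the fine source cells under it. The real tool presumably uses the weights in remap288-180C90-90.nc to take the 288 x 180 lat-lon source onto the 90 x 90, 6-tile (cubed sphere) target; the toy sketch below (my own code, not theirs) just block-averages a small square grid with equal weights to show the idea.

program toyregrid
! Toy illustration of regridding by block averaging.  The real ModelE
! tools do proper area-weighted remapping onto the cubed sphere; this
! only shows the idea of "many fine cells -> one coarse cell".
  implicit none
  integer, parameter :: nf = 4, nc = 2, r = nf/nc
  real :: fine(nf,nf), coarse(nc,nc)
  integer :: i, j
  do j = 1, nf                          ! fill the fine grid with dummy data
     do i = 1, nf
        fine(i,j) = real(i + j)
     end do
  end do
  do j = 1, nc                          ! each coarse cell = mean of an r x r block
     do i = 1, nc
        coarse(i,j) = sum(fine((i-1)*r+1:i*r, (j-1)*r+1:j*r)) / real(r*r)
     end do
  end do
  print *, 'coarse grid values:', coarse
end program toyregrid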

About That Makefile

Seems that when you look into it, there’s a lot more to do than that top level Makefile. In the config directory, we find:

chiefio@Headend:/Climate/modelE2_AR5_branch$ ls config
ESMF.default.mk     compiler.gfortran.mk  machine.IRIX.mk    mpi.mpich2.mk
base.mk             compiler.intel.mk     machine.IRIX64.mk  mpi.mpt.mk
compiler.IRIX.mk    compiler.intel_7.mk   machine.Linux.mk   mpi.mvapich.mk
compiler.IRIX64.mk  compiler.nag.mk       machine.OSF1.mk    mpi.mvapich2.mk
compiler.Lahey.mk   compiler.pgi.mk       mpi.IRIX64.mk      mpi.openmpi.mk
compiler.OSF1.mk    compiler.xlf.mk       mpi.LAM.mk         rules.mk
compiler.absoft.mk  machine.AIX.mk        mpi.SCALI.mk
compiler.g95.mk     machine.Darwin.mk     mpi.intel.mk

Now the really good thing is that this ought to have a custom .mk make file for most any kind of machine you want to run this on. But it does show that the “make” is going to be modestly complicated.

Looking in the compiler.g95.mk file as one example:

F90 = g95
FMAKEDEP = $(SCRIPTS_DIR)/sfmakedepend
CPPFLAGS += -DCOMPILER_G95
FFLAGS = -cpp -fno-second-underscore -O2 -fendian=big
F90FLAGS = -cpp -fno-second-underscore -O2 -ffree-form -fendian=big
LFLAGS =
# uncomment next two lines for extensive debugging
# the following switch adds extra debugging
ifeq ($(COMPILE_WITH_TRAPS),YES)
FFLAGS += -fbounds-check -ftrace=full -freal=nan -fpointer=invalid
LFLAGS += #-lefence
endif

We’ve got compiler flags set to adjust the compiler behaviour. Normally this would only interest the developer of a code, but since every run is a compile, these can adjust how the code behaves when run.
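As one small concrete example of what a flag like that means at run time, take the -fbounds-check switch from that COMPILE_WITH_TRAPS block. A toy sketch (my code, not ModelE source):

program oob
! Toy illustration of a run-time check flag.  Compile two ways:
!   gfortran oob.f90                   (typically runs on silently)
!   gfortran -fbounds-check oob.f90    (stops with a bounds error)
! Writing past the end of an array is a bug either way; the flag only
! changes whether you get told about it.
  implicit none
  real :: a(10)
  integer :: i
  i = 12
  a(i) = 1.0
  print *, 'wrote to element', i, 'of a 10 element array'
end program oob

Same source, two different run-time behaviours, chosen at compile time. Now multiply that by the whole list of flags above.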

Similarly, the machine.Linux.mk file sets things for that environment:

# Linux - specific options

CPP = /lib/cpp -P -traditional
CPPFLAGS = -DMACHINE_Linux
ECHO_FLAGS = -e

cpp is the C Pre-Processor and it has flags to change how it works.
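To make that concrete, here is a toy sketch (my file, not ModelE source) of how a cpp -D flag decides which lines of Fortran even exist in the executable. I’ve borrowed the TRACERS_ON name from the rundeck options quoted further down; any flag name works the same way.

! toy.F90 -- compile with:  gfortran -cpp -DTRACERS_ON toy.F90
! or leave the -D off and the tracer branch never gets compiled at all.
program toy
  implicit none
#ifdef TRACERS_ON
  print *, 'tracer code compiled in'
#else
  print *, 'tracer code compiled out'
#endif
end program toy

That is the mechanism behind the long lists of #define options further down: they don’t tune a running model, they decide what code is in the model you run.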

Then you get to the MPI (Message Passing Interface) setting blocks, where you can choose and tune which kind of MPI you are using.

Here’s the one for OpenMPI:

ifneq (${MPIDIR},)
FFLAGS += -I${MPIDIR}/include
F90FLAGS += -I${MPIDIR}/include
LIBS += -L${MPIDIR}/lib
endif

# try to work around memory leak
CPPFLAGS += -DMPITYPE_LOOKUP_HACK

#LIBS += -lcprts -limf -lm -lcxa -lunwind -lrt -ldl \
#-lfmpi -lmpi -lstdc++ -threads

##LIBS += -lcprts -limf -lm -lcxa -lunwind -lrt -ldl \
##-lmpi_f90 -lmpi_f77 -lmpi -lstdc++ -threads

LIBS += -lmpi_f77 -lmpi -lmpi_cxx -lstdc++
ifneq ($(shell uname),Darwin)
  LIBS += -lrt
endif

#LIBS +=  -lmpi -lmpigc3 -lmpigc4 -lmpigf -lmpigi -lmpiic4 -lmpiic -lmpiif -threads

That “try to work around memory leak” isn’t encouraging…

In any case, this is relatively normal for setting up an environment to port and compile a program. What is unusual about it is that it is needed for every run.

But that isn’t enough… they have a certain degree of ‘cross linking’ between the various make files. You find that in rules.mk and base.mk:

base.mk:

# standard makefile to be use in each "component" subdir

.PHONY: FORCE

#get defaults from ~/.modelErc
HOMEDIR = $(wildcard ~)
MODELERC ?= $(HOMEDIR)/.modelErc
sinclude $(MODELERC)

#some of modelE definitions
#SCRIPTS_DIR = $(GISSCLIM_DIR)/scripts
#MOD_DIR = $(GISSCLIM_DIR)/mod
#LIB_DIR = $(GISSCLIM_DIR)/lib

#hack : make sure that ESMF_Interface is set
#ESMF_Interface = .
[...]

and rules.mk:

#
# this file contains rules shared by Makefiles in "model" and "aux"
#

.PHONY:
ifdef MOD_DIR
VPATH = $(MOD_DIR)
endif

#VPATH = $(MODULES_CMPLR_OPT)
#VPATH = /usr/local/unsupported/esmf/1.0.3-551/include
######  Some user customizable settings:   ########

# EXTRA_FFLAGS specifies some extra flags you want to pass
# to Fortarn compiler, like
# -g        - include debugging information
# -listing  - create listings (.L)
EXTRA_FFLAGS =

# EXTRA_LFLAGS specifies some extra flags you want to pass
# to linker. Currently needed as a hack to compile hybrid MPI/OpenMP
# code to pass "-openmp" to linker
EXTRA_LFLAGS =

# hack to force compilation errors to be written to ERR files rather than scren
ifeq ($(OUTPUT_TO_FILES),YES)
  COMP_OUTPUT = > $*.ERR 2>&1 || { r=$$? ; cat $*.ERR ; exit $$r ; }
  LINK_OUTPUT = > $(RUN).ERR 2>&1 || { r=$$? ; cat $(RUN).ERR ; exit $$r ; }
else
  COMP_OUTPUT =
  LINK_OUTPUT =
endif
[...]

Oh, it says there’s another batch of make files in ‘model’… Oh Joy…

chiefio@Headend:/Climate/modelE2_AR5_branch/model$ ls tracers/
Makefile  Tracers_mod.F90
chiefio@Headend:/Climate/modelE2_AR5_branch/model$ ls solvers
Makefile  TRIDIAG.f
chiefio@Headend:/Climate/modelE2_AR5_branch/model$ ls shared
CONST.f                 Makefile                   StringUtilities_mod.F90
Dictionary_mod.F90      PARAM.f                    SystemTimers_mod.F90
FileManager_mod.F90     PARSER.f                   UTILDBL.f
GenericType_mod.F90     Parser_mod.F90             array_bundle.f
Geometry_mod.F90        RunTimeControls_mod.F90    cubic_eq.f
JulianCalendar_mod.F90  RunTimeControls_mod.m4F90  stop_model.F90
KeyValuePair_mod.F90    SYSTEM.f
chiefio@Headend:/Climate/modelE2_AR5_branch/model$ ls -l Makefile
-rw-r--r-- 1 chiefio chiefio 3400 Dec 30  2015 Makefile
chiefio@Headend:/Climate/modelE2_AR5_branch/model$ ls profiler
Makefile               TimeFormatUtilities_mod.F90  Timer_mod.F90
ProfileReport_mod.F90  TimerList_mod.F90
ReportColumn_mod.F90   TimerPackage_mod.F90
[...]

and several other directories with their own Makefile in them. Yet even that isn’t enough complexity. Seems you can choose what make files get used sometimes.

Looking in ‘doc’, we have this in Ent.txt

Working with Ent DGTEM (Dynamic Global Terrestrial Ecosystem Model)
===================================================================

This document describes how to checkout, compile and run modelE
coupled to Ent DGTEM.
[...]
Compiling the code
==================

Since in this new setup some of the code is located in subdirectories
the Makefile had to be modified to be able to work with
subdirectories. Basically a completely different makefile was written
for this purpose. To switch to this new makefile you have to specify

  NEW_MAKEFILE_FORMAT=YES

in your ~/.modelErc file. This new makefile expects two additional
sections to be present inside the rundeck. Section "Components:" lists
the names of the subdirectories (from now called "Components")
containig the code which has to be compiled with the model. The
section "Component Options:" contains the flags which have to be passed
to each component (i.e. the flags which have to be set when local
makefile is executed inside corresponding subdirectory). Each
component (subdirectory) contains a makefile which in most cases is
trivial: it contains the list of source files and includes standard
template makefiles from config/ directory (see model/shared/Makefile
for an example). If rundeck should be able to modify the list of
source files in the component that can be done by passing
corresponding "Component Options:" (see how it is done inside
model/Ent/Makefile). Current template rundeck for modelE coupled to
Ent is E1M20dv1.R (see it as an example of how to use "Components:"
and "Component Options:"). The commands supported by the new makefile
are basically the same with some small differences. To compile and set
up a typical modelE+Ent run you would do:

# remove old object files if necessary (old "vclean")
  make clean_all

# create a rundeck
  make rundeck RUN=my_run RUNSRC=E1M20dv1

# build dependencies
  make depend RUN=my_run ESMF=YES NPES=1

# compile and do setup
  make setup RUN=my_run ESMF=YES NPES=1

So all this environment fiddling ends up happening on each ‘compile and run’ of a job.

Interestingly, they left in some ‘local notes’ in the ‘local’ file: local_info.txt

I’ve bolded a bit where it shows why I talk about libraries and compile flags and such.

Running on Discover (Reto Ruedy)
+++++++++++++++++++
News:
====
[...]
Unresolved bugs:
===============
- occasionally, jobs tend to "hang" for extended period of times,
  usually during i/o processes (creation of PRT files)
     hint: for long runs use "kdiag=13*9," not "kdiag=13*0,"

- serial and parallel versions do not always give the same result
     hint: see next section (Consistency ...)

- intelmpi: may be used on all current and future nodes

- scali mpi: can only run on the relatively few old 4-cpu nodes

- openmpi: may be used on all nodes but is not recommended (use intelmpi)
   - compile with: EXTRA_FFLAGS="-DMPITYPE_LOOKUP_HACK" to avoid memory leak
   - Bundling openmpi jobs is currently impossible

Compiler issues:
===============
For off-line fortran compilations use:
   ifort -convert big_endian x.f -o x.exe

Before doing so, load the appropriate modules, currently:
   /etc/profile.d/modules.sh             ; # bash or ksh or:
   source /etc/profile.d/modules.csh     ; # tcsh or csh
and either
   module load comp/intel-9.1.042
   module load mpi/scali-5               ; # for scali mpi (on borga-e...)
or
   module load comp/intel-9.1.052        ; # for openmpi
   module load mpi/openmpi-1.2.5/intel-9
or
   module load comp/intel-10.1.017       ; # for intel mpi
   module load mpi/impi-3.2.011

This is to be put into your startup script (e.g. .profile if your login shell
is ksh), so these modules also get loaded when running in batch mode.
"runpbs (=run)" was modified to load the appropriate modules for mpi batch jobs
even if those commands are missing in the login script. But you still need to set
them appropriately during the "setup" step.

Sheesh… different contexts / compiles give different results, and we swap around MPI libraries at compile time based on what we like or what works today or…

All that BEFORE you get to the massive parameter setting inside the model!

Consistency between serial and parallelized executables:
-------------------------------------------------------
openMP and serial runs give the same results ONLY if using the
   option -mp (machine precision) in both cases, i.e. use
   gmake setup|exe RUN=.... EXTRA_FFLAGS=-mp EXTRA_LFLAGS=-mp
Warning: The current version of modelE no longer supports openMP

mpi and serial runs (without the -mp options) give identical results
   (but different from serial runs and openMP runs with -mp)

Um, maybe you ought to figure out why the results vary with different MPI context and just fix the code?… Just sayin’…

Having code that is significantly ‘context of run’ or ‘context of build’ sensitive is just not making my “Oooh that’s reliable” warm fuzzy feeling happen…
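For anyone wondering how a result can legitimately depend on serial vs parallel at all: floating point addition is not associative, so the order of a sum changes the answer slightly, and a parallel run is in effect a different summation order. A toy stand-alone sketch (mine, nothing to do with ModelE’s actual decomposition):

program sumorder
! Toy illustration: the same million numbers summed in two different
! orders give two slightly different single precision answers.
  implicit none
  integer, parameter :: n = 1000000
  real, allocatable :: x(:)
  real :: fwd, rev
  integer :: i
  allocate(x(n))
  do i = 1, n
     x(i) = 1.0 / real(i)               ! values spanning many magnitudes
  end do
  fwd = 0.0
  do i = 1, n                           ! one order ("serial")
     fwd = fwd + x(i)
  end do
  rev = 0.0
  do i = n, 1, -1                       ! another order (a stand-in for a
     rev = rev + x(i)                   ! different parallel reduction)
  end do
  print *, 'forward sum :', fwd
  print *, 'reverse sum :', rev
  print *, 'difference  :', fwd - rev
end program sumorder

An iterative model then amplifies that tiny difference step after step, which is how “serial and parallel versions do not always give the same result” comes about. It doesn’t make either answer right; it just tells you the arithmetic is being pushed near its precision limits.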

There’s lots more interesting in that file (and in other files) but this posting is too long already, so most of it will need to wait for “another day”. I’ll just put this interesting snip here as a parting tease:

For serial runs, no options are needed; however if you want results identical
to that produced by an openMP executable, add EXTRA_FFLAGS=-mp EXTRA_LFLAGS=-mp

NOTE: When changing options, make sure to use     "gmake clean vclean"
      to get rid of old object modules and avoid incompatibilities.

Yeah, if you skip a ‘make clean vclean’ you can have incompatibilities…

There are lots of other parameter settings described in other .txt files. Here’s just a sample from the ocean biology one. Note: this is NOT the ocean model itself, just the ocean biology notes, obio.txt:

Prescribed and Interactive Ocean Biology in modelE
(by Natassa Romanou, Watson Gregg and Gavin Schmidt)
=========================================

This document describes the steps and methodology for
running (non)interactive chlorophyll and interactive carbon
cycle etc.

step 1: Check out the model from CVS as usual
cvs checkout modelE

step 2: Choose which ocean model to use if any at all

step 3: Choose whether you want to do
        i) chlorophyll-radiation feedbacks
       ii) real or prescribed chlorophyll and ocean biology
      iii) CO2 gas exchange with the atmosphere
       iv) prognostic alkalinity

Then in the decks directory use the appropriate rundeck.
=========================================

Description of CPPs to be used in the rundecks

#define CHECK_OCEAN_HYCOM           ! HYCOM ocean: needed for CMPE002 only
#define HYCOM_RESOLUTION_2deg       ! HYCOM ocean: use 2deg resolution for HYCOM ocean

#define CHECK_OCEAN                 ! RUSSELL ocean: needed to compile aux/file CMPE002

#define TRACERS_OCEAN               ! RUSSELL ocean: ocean tracers activated
#define TRACERS_OCEAN_INDEP         ! RUSSELL ocean: independently defined ocn tracers

#define OBIO_ON_GARYocean           ! RUSSELL ocean: prognostic ocean biology
#define CHL_from_SeaWIFs            ! RUSSELL ocean: read in SeaWIFs

#define TRACERS_OceanBiology        ! ANY ocean: Watson Gregg's ocean bio-geo-chem model

#define OBIO_RAD_coupling           ! ANY ocean: radiation -- ocean biology coupling

#define pCO2_ONLINE                 ! ANY ocean: pCO2_seawater computed online

#define TRACERS_ON                  ! ANY ocean: ATMOS tracers: include tracers code

#define TRACERS_GASEXCH_ocean       ! ANY ocean: special tracers to be passed to ocean
#define TRACERS_GASEXCH_ocean_CO2   ! ANY ocean: special tracers to be passed to ocean

#define CHL_from_OBIO               ! ANY ocean: interactive CHL

#define TRACERS_GASEXCH_CFC_Natassa ! ANY ocean: CFC-11 or CFC-12

#ifdef TRACERS_Alkalinity           ! ANY ocean: prognostic alkalinity

In Conclusion

What are in those rundecks? Rundeck.txt tells us:

Rundeck
===================================================================

A rundeck (a file with an extension .R) is a file which contains
a complete description of a particular model run, including
the description of model code used and run-time parameters.
Directory modelE/templates contains typical rundecks which
can be used as examples.


Rundeck structure
================

Rundeck consists of a number of sections describing different
aspects of the run. Each section starts with certain keywords
(like "Object modules:") and terminates either with "End ..."
statement or with the start of a new section. The sections
are supposed to follow in a pre-defined order and can't be
interchanged, though some unneeded sections can be skipped.
The character ! starts a comment. Everything to the right of !
until the end of the line is ignored. Here is the list
of rundeck sections in proper order:

 * Run name and comment
 * Preprocessor Options
 * Run Options
 * Object modules
 * Components
 * Component Options
 * Data input files
 * Label and Namelist
 * Run-time parameters
 * Restart and timing parameters

Yup. From preprocessor options to compile flags and choosing which object modules to load, and more… Decisions decisions. Just what tune to play, and how do I want this elephant dancing while playing the trombone?…

* Components
============

Starts with a line

Components:

This section lists all "Components" (the subdirectories of
modelE/model) which have to be compiled with the executable. Keep in
mind that each Component will be compiled into a library which will be
then linked with the main executable. As a result in a case of a name
conflict (more than one subroutine with the same name is present) the
preference will be given to the one located in modelE/model
directory.


* Component Options
===================

Starts with a line

Component Options:

This section lists specific options which have to be passed to certain
components. For each Component the options are listed on a single line
as follows

OPTS_<ComponentName> = <Opt1>=<X> <Opt2>=<Y>

where Opt1, Opt2 are variables to be set and X, Y are corresponding
values. The instruction above is equivalent to setting Opt1=X, Opt2=Y
at the start of the <ComponentName> Makefile.

There is a special debugging option OVERWRITE_FSRCS which can be
passed to any component (which uses base.mk). If this option is set to
a list of source files, these files will be used when compiling the
component (instead of the ones specified inside the component
Makefile). The use of this option is discouraged for the production
runs though (which should use the source files specified by the
component itself). Similar option OVERWRITE_F90SRCS can be used to
specify F90 files.

Every run is a unique mix of libraries, compiler options, parameters set, input data choices, model parts chosen to run, and even, if you like, special versions of any given bit of Fortran you might want to swap in…

The choices are endless. The whole thing from soup to nuts is a parameter driven process.

And that’s the whole problem.


42 Responses to GCM ModelE – Elephants Playing Mozart

  1. Svend Ferdinandsen says:

    I am not good at programming and running computers, but it seems the chaotic nature of weather is built into the code too, so that you can have chaos in the second degree.
    I just wonder how any scientist can have any confidence in the output from such a program.
    Would they ever know what they asked the program to do?
    Maybe the GCMs really come with unprecedented outputs every time.
    This drawing gets more and more actual: https://wattsupwiththat.files.wordpress.com/2013/09/josh-knobs.jpg?w=720

  2. philjourdan says:

    The “profession” of programming is as much DOCUMENTATION as it is writing code (I hated the former, but saw the value in it).

    Trillions being wasted on what amounts to a worthless hunk of letters and numbers.

  3. tom0mason says:

    How about a simple flow chart, something like

    |Garbage in| –>|Climate_Kludgifier| –> |Garbage Out| ——->> To MSM.
    ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑
    ———————————————————
    | Adjustable Parameters + Module Variables |
    —————–Fudge Modules———————

    Leading to

  4. cdquarles says:

    Ouch. Yet TPTB want to impose a tyranny on us based on a poorly specified model derived from an incorrectly specified question. UGH

  5. Glenn999 says:

    Svend
    Love that cartoon. I think the best part is the guy gazing down the side of the machine with the controls for the sun. Priceless.
    tom’s printout is funny too. Snowflakes are experiencing kernel panic.

  6. Soronel Haetir says:

    I thought .pm in code (especially code that also has regular perl files) is “perl module”.

  7. E.M.Smith says:

    @Soronel:

    Might well be. I stated I don’t know and was guessing. I’ve done some minor installing of Perl, and scanned a book on it a couple of decades ago, but would not claim skill, or even familiarity with most of it. Any clue appreciated.

    @Svend:

    That cartoon fairly accurately depicts how ModelE is built and run, IMHO.

  8. E.M.
    This is still an iterative finite element model, suitable for a finite number of lumped constants, never for compressible fluid dynamics. NOAA at least uses a fluid dynamics model, suitable for an incompressible fluid (ocean). Have you found the stated limitations on parameters that may hint at possible convergence for the GISS model? Even the fourth order Runge-Kutta method for solution of Laplace transforms has severe limitations for convergence.
    My opinion is that Jimmie Hanson knew full well that the GISS model could never work ‘as he promoted to the US Congress’, but could easily be ‘tuned’ to give ‘his’ desired result. Gavin Schmidt knowingly continues to promote such deception within the US government!
    All the best! -will-

  9. Douglas R Fix says:

    Some decades ago, I watched a grad student run a hydrology simulation via a punch card deck (a real run deck) that ran the output back into the input reader until the simulation was completed. His run deck was equally incomprehensible. His explanation was that using the reader was not charged to his account and he wasn’t authorized to use permanent storage. Creating solutions one problem at a time, mother of invention, and all that.

  10. kneel63 says:


    I thought .pm in code (especially code that also has regular perl files) is “perl module”.

    Yes, definitely true.

  11. Larry Ledwick says:

    Consistency between serial and parallelized executables:
    ——————————————————-
    openMP and serial runs give the same results ONLY if using the
    option -mp (machine precision) in both cases

    Hmmm sounds like they discovered the Machine epsilon problem. With the sort of problems they are working with, the absolute precision of calculations and constants matters; if the compilers use different rounding or calculation precision, or constant value precision differs between two machines, iterative processes will slowly diverge as those rounding/precision errors add up on each loop through the calculation.

    I learned about this when I was in about 7th grade when I discovered I got different answers to the same problem on two different calculators. Turned out the constant Pi used by the two calculators differed slightly with one using a few less decimal places in the value of Pi it used when you hit the Pi constant key.

    https://en.wikipedia.org/wiki/Machine_epsilon
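
    For illustration, here is the same effect in Fortran: the “same” constant carried at two precisions gives two different values.

    program pidigits
    ! Quick illustration: the "same" constant at two different precisions.
      implicit none
      real(4) :: pi4
      real(8) :: pi8
      pi4 = 4.0 * atan(1.0)              ! single precision pi
      pi8 = 4.0d0 * atan(1.0d0)          ! double precision pi
      print *, 'single precision pi:', pi4
      print *, 'double precision pi:', pi8
    end program pidigits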

  12. Larry Ledwick says: 19 January 2017 at 2:01 am

    ” Consistency between serial and parallelized executables:… option -mp (machine precision) in both cases .Hmmm sounds like they discovered the Machine epsilon problem.”
    The 1970’s ‘calculators’ used all kinds of ‘corrections’ to trigonometric functions to give ‘best estimate to that number of digits’! Oh woah are we..!!

  13. Larry Ledwick says:

    Given your comment on compilers, it would be interesting to run the compile on the same source code several times and see if the resulting code hashes to the same value each time.

  14. Larry Ledwick says: 19 January 2017 at 2:39 am
    “Given your comment on compilers, it would be interesting to run the compile on the same source code several times and see if the resulting code hashes to the same value each time.”

    Larry,
    The machine computation of the same exact thing can lead to the same result. Machine ‘compilers’ do not compute the same exact thing; they can only ‘try’ to interpret the ignorance and dyslexia of earthlings incessantly banging away at keyboards. Benevolent monarchs do that well. All other forms of government fail completely! -will-

  15. gareth says:

    Hi EM – you wrote “Then for some reason it makes an http reference when it is opening a log file…” but I think the “#” means that it’s commented out?
    The actual link goes to an FAQ about buffering of Perl i/o or somesuch…

    Not being much of a programmer – I’m an electronics chap – I’m just reading your post and looking for loose ends that might make the thing unravel, on the basis that if I can’t find any it’s more likely to be accurate.

    (One of the things about the Climategate data that convinced me that the skeptics were right was the comments in code and that readme file)

    I regret I was not paying proper attention to your previous posts so I missed where this model fits with the stuff that they are currently running – is it representative?

  16. E.M.Smith says:

    @Gareth:

    I usually do a very rapid first pass scan, not following all the rabbit holes, but just noting things in passing. Then I’ll go back and “Dig Here!” on things that were most interesting.

    The Model II code is ‘way old’ and mostly run by students as a learning toy at this point (since it runs on one PC…). This code is the Model E code. It is THE current work product of NASA GISS for climate modeling (though in several slowly mutating forms, from their page description of it). So yes, very representative. In fact, that AR5 tag is saying this is the code used for the AR5 report runs.

    My general impression of it is that it is “Good enough for government work” and will, in fact, run reasonably well if you are careful in the setup. I just grouse about the poor technique in it because it would not be acceptable in my shop. I’d have told the relevant programmers in staff meeting (i.e. in front of peers) that they needed to tighten it up so it ran the same way in all environments.

    Now how much do things like libraries and compiler variations matter?

    For most codes, hardly at all to none. But models are different. As Larry pointed out, when they iterate over the same code again and again, even the tiniest of errors can accumulate to gigantic proportions. Yet even for more ‘traditional’ code, it can tickle pernicious bugs.

    That is why it was “kind of a big deal” when the world went to dynamic libraries (in the ’80s?) and why it is still a big deal on any new OS release that the exact library release level is specified that is known to work well with the rest of system. Look at a Linux release spec, and you find library release numbers listed along with kernel and package release levels that are “compatible”. That means “didn’t make bugs dance in the QA suite like the other libraries did”…

    For Model E, it looks like mostly they are just using the particular libraries needed on divergent hardware (as it runs on many platforms in their ‘cluster’) and / or the particular libraries needed for a particular version of MPI chosen. No problem with that. PROVIDED the code was QA checked and shown to produce the same results in all those environments. Yet their messages say it doesn’t. Ooops.

    Similarly, most of the time a compiler produces the same code out for the same program in. Yet not always. The compiler makes optimizing decisions based on various flags set and various environment variables and the machine context. So by submitting this to compile and run on divergent systems with divergent environment and divergent flags, you will get different executables. Have they ALL been regression tested? Have they ALL passed a QA check? Heck, even a rough “Sanity Check”? I doubt it… The code is structured to run as a mutating toy world where you run it, see if you like the result, and if not, change things and toss it in again. There is no way to regression test or adequately QA test that.

    OK, they are not making an accounting package nor doing a drug trial system. No need to satisfy Government Oversight nor Legal Liability. Got it. But it is still just sloppy compared to how the commercial product world does things. (Yes, I know, science is a seat of the pants sloppy thing… but once you claim to know the control knobs of the world, best to do it right and rigorous…)

    So far my general assessment is that the mistakes will be in the assumptions about what is actually the controlling system and the data fed in, with only minor coding errors. They ASSUME CO2 is the master knob, so find it is the master knob. They ASSUME the solar constant is, well, constant and they ignore the differential allotment of UV vs IR, so miss those effects. Etc. etc. That’s why they diverge from reality.

    But that ‘not the same unless FOO flag set’ is a worry. That says they have mathematical instabilities in their code. FORTRAN can have very subtle bugs from ‘edge cases’. Underflow and overflow. Fixed limits to precision at the small and large ends. Zero that isn’t really zero… I’ve written FORTRAN a while now and it was my first Computer Language. I was fortunate enough to have it taught to me in the Engineering Department where they were very concerned that we know these things as the codes would be used for things like bridge design and they did NOT want deaths from their graduates. The whole problem set was carefully crafted to guide us into just those traps. Burned fingers not easily forgotten… (After the 2nd problem set, I caught on and started looking for what trap was being set – got A+ after that… That was the point where my Fine Toothed Comb was born for code reviews…)

    So do I think there’s a math error or coding error in their code? Certainly. Probably several. It is manifestly sensitive to the environment and does not always produce stable results. “Captain, there be Bugs here!”… but do they matter? Probably not as much as the general structure of what the code is designed to do, which is find CO2 causal. Yet at this point I can’t rule out that their whole exponential growth of temperature output isn’t just an accumulation of very very very tiny errors due to a poor handling of math in some limit case. (Like ‘if zero exit loop’ when the code never hits EXACTLY zero… better is if the value is within 0.00000001 of zero, exit loop. That, BTW, was one of the problem sets…)

    Anyway, it is a giant block of code, so a long slog to get through it. Made worse by what looks like lots of ‘hidden dependencies’ on “stuff” from other machines and other directories (those absolute path names starting with / as in /archives) that are not available to the unwashed public. But as usual: We’ll see…

  17. John in Oz says:

    E.M. SMITH says “No need to satisfy Government Oversight nor Legal Liability.”

    $Billions are being spent on the output of these models – there should be criminal charges for ‘scientists’ who hold their work as accurate and important but turn out to be YMMV.

    Perhaps this is why we hear so much about ‘It’s worse than we thought’ as a simple change of parameters can make the Earth fall into the sun.

  18. Svend Ferdinandsen says:

    Could be I am ignorant, but how would a climate scientist perform a certain run of a climate model to investigate some aspect of climate?
    I find it hard to see him fiddling with all those libraries and settings, so how does he order a run, and would he know what else was set by the “runner”?

  19. Gail Combs says:

    Uninstalling Obummer…

  20. Glenn999 says:

    Perhaps you can toss out all of the slimy bits and put a functional model together that would actually show the actual weather? Did they make it ugly just to hide the fact that it is CO2 that they are correlating to temps?

  21. E.M.Smith says:

    @Glenn999:

    That is my general (high level and maybe over ambitious) goal.

    @Gail:

    Think you meant the TT page… but yeah… just minutes to go… Pence now V.P.

    @Svend:

    They make the ‘run deck’ and some of them will do the set up. Others might ask an I.T. helper to do the general set up for machine, libraries, etc… then just do the run deck.

  22. p.g.sharrow says:

    I’m not sure that this thing is worth saving. When you consider the nature and concepts that it was created under, it might be better to examine it, learn from it, and then set it aside and start over. Sun and water physics need to be properly allowed for and then modified by secondary factors. The data base of records needs to be solid.

    Normally, the embedded formulas in a spreadsheet are set and the data inputs change. In this case the data needs to be set and some of the embedded formulas tweaked to learn of their effect on the output. The ability of the users to find, understand and adjust the control knobs is of importance. The widespread ability to utilize software and hardware to examine cause and effect in weather/climate, along with the use of the Internet, will allow thousands of minds to work on this problem, and do it inexpensively and fairly quickly.

    When Hanson and Jones started this effort only a government could afford the cost of it, and they were nearly ignorant of computer possibilities, believing that super computers could instruct or tell them how weather/climate worked. We know that computers are very simple minded. They only will tell you the answer that you can find for yourself. They just will do it quickly; or if garbage goes in, garbage comes out, and you have to know if you are getting garbage.

    I see this as a spread sheet, data base problem, similar to the ones that I created and used to examine and do various problems both in longhand and with the help of desk top computers…pg

  23. E.M.Smith says:

    @P.G.:

    Essentially so.
    One good thing from the first blush code scan: Since THEY swap modules, turn source on and off at will, even overlay source as desired, then I can do the same and point any complaints back at ModelE design structure.

    At this point, my goal is to find the minimal subset that runs and see what happens. Basically, prune out modules and kick start them, then glue bits back together. Build it back from those stubs needed to make the core “go”.

    Not swallowing the whole whale at once, but one chunk at a time…

  24. Bobl says:

    For your readers,
    I am an engineer; we must understand the nature of precision.

    Part of the precision problem is this: all numbers in a computer have limited precision. Say I wanted to calculate 2/6*3. If we assume limited precision (integers) then we calculate 2/6 = 0, 0 x 3 = 0. But compilers might be clever; they will know that the intermediate division will be imprecise and will reorder the calculation to be 2*3=6, 6/6=1. So in one case, where the compiler doesn’t optimise, you get 0; in the case where it does reordering optimisation you get an answer of 1. This occurs at EVERY precision. To get around this engineers often force the evaluation order, writing (2*3)/6, because the brackets are always evaluated before * and /. This isn’t the only problem: in floating point, precision is lost for very small numbers. That is, the difference between X and X + delta X is a much larger proportion of X for small values of X.

    Engineers go to a lot of trouble to manually reorder calculations so divisions by large numbers NEVER OCCUR.

    Now think about this, using gcc I can set an ENVIRONMENT variable called CFLAGS, in CFLAGS I can set -g to get symbols in the code, BUT if I recall correctly -g also disables all optimisation including reordering, which means that merely setting debug mode changes my answer from 1 to zero. This has actually happened to me back in university when writing genetic algorithms, in that case the code worked fine in debug mode but failed with debug turned off.

    A big clue here is that they say the results change on multiprocessor machines, this says to me that there ARE precision sensitivities like this in the code, that they are unhandled, and that they MATTER.

  25. Bobl says:

    Oh, another issue for the readers who aren’t engineers.

    Let’s say I have 5 digits of precision. I write

    100*10 + 10/1000 – OK seems reasonable

    100 x 10 = 1000.0 to five digit precision
    10/1000 = 0.0100 to five digit precision
    Adding we get
    1000.0 + 0.010 which is 1000.01 – but that’s 6 digits of precision, at 5 digit precision the answer is 1000.0

    Now let’s say the code repeats over and over, adding 0.01 to 1000 a million times, which should equal 11000, but due to precision the answer given is 1000.0 – this is what we call divergence. It’s bad; unless you ensure you don’t add small numbers to big numbers, divergence can be a real problem in computers. Typically simulations or “Models” are of the form X2 = X1 + delta(f(X1)) where f(X1) is some function of X1 and usually a small increment, so iterating means X can get larger while f(X) stays small or even becomes smaller; eventually you’ll diverge.
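
    A minimal stand-alone Fortran sketch of that accumulation effect (illustration only; the exact drift depends on compiler and hardware, but single precision will not land on 11000 exactly):

    program drift
    ! Minimal sketch of the accumulation example above: add 0.01 to 1000.0
    ! a million times in single and double precision and compare with the
    ! exact answer of 11000.
      implicit none
      real(4) :: s4
      real(8) :: s8
      integer :: i
      s4 = 1000.0
      s8 = 1000.0d0
      do i = 1, 1000000
         s4 = s4 + 0.01                  ! rounded to single precision every pass
         s8 = s8 + 0.01d0                ! rounded to double precision every pass
      end do
      print *, 'exact          : 11000.0'
      print *, 'single (real*4):', s4
      print *, 'double (real*8):', s8
    end program drift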

  26. Larry Ledwick says:

    @Bobl
    Great simple examples of computational errors using computers. You have to understand the computational methods used by the code at a very deep level to avoid (or minimize) these sorts of issues. By the nature of the math, you cannot avoid it completely any more than you can exactly represent the square root of 2.

    Sometimes you can express the problem symbolically so that the actual numerical conversion only happens one time at the final output stage, if you can work out a general solution that applies at all scales, but when doing iterative math loops where you go over and over the same calculation each pass feeding a solution to the next pass through the loop you accumulate those small errors until they become big errors.

    It would be interesting to see a high level computer math person do an error analysis of what they are doing and figure out the maximum number of passes through the iteration you can run before the errors run away with you.

  27. E.M.Smith says:

    @Bobl:

    Yup! Wonderful examples. Especially loved the ‘with debug’ vs not. Priceless. It is the accumulation of a few thousand of those that makes an experienced programmer valuable.

    I once had a program where the NAME of the function call changed the results. Drove me nuts for a week trying to ‘fix it’. Called the vendor: Compiler bug… not me at all… Short term fix? Change that function name to something else until it works… so for a month or so while writing it, every time it failed, I’d swap all my FUNC W to FUNC Z or back. It worked fine every time in the interpreter… (A language that was both interpreted and compiled… so I’d develop in the interpreter fine, but compile and Oh Oh… change all the FUNC W to FUNC Z again…) The compiler used the function name as ‘salt’ in the hash function so sometimes landed it on code space where it ought not to be… but change the salt, no problem…

    One of my other favorite ‘gotchas’ is testing a FLOAT for .EQ. zero or 1 or some other INT value. Rarely does a FLOAT exactly equal zero or 1 or “whatever”. One of the problem sets that bit most folks was just such a thing. The better way is something like:

    IF ( ABS(VAR) .LT. 0.000001 ) GO TO ….

    Where the exact value of 0.000001 is set at your error limit choice, assessed after thinking about just what is happening down in the maths… I remember helping other students where the test problem had gone from (limit of precision like) 0.0000000000001 to -0.0000000000001 never touching 0.0 and them wondering why their code was wrong… Making it a DOUBLE instead of a FLOAT doesn’t fix the problem, just moves it into longer run times….
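
    The same trap in C form (the FORTRAN .EQ. version behaves the same way); fabs() here plays the role of the ABS() in the band test above. Purely an illustration:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* 0.1 has no exact binary representation, so ten of them do not */
        /* add up to exactly 1.0; the == (or .EQ.) test fails.           */
        double x = 0.0;
        for (int i = 0; i < 10; i++) x += 0.1;

        printf("x == 1.0       : %s\n", (x == 1.0) ? "true" : "false");   /* false */

        /* The band test, with the tolerance chosen for the problem.     */
        double eps = 1.0e-9;
        printf("|x - 1.0| < eps: %s\n", (fabs(x - 1.0) < eps) ? "true" : "false");
        return 0;
    }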

    Yet what do we find in the “Model II” code? Lines like

    Pjal0C9.f:      IF(DEN.EQ.0.) GO TO 3219                                          5213.2 
    

    Now one hopes this is only after having set it to '0.' in a prior line for some cases, so it can be exactly zero; but every one of those cases must be checked.

    Now most of their comparisons are INTEGER, where that’s OK to do (a variable starting with I through N compared to a value with no trailing decimal point – so 1, not 1. or 1.0, which would flag it as a FLOATing point constant). Yet scattered through multiple programs are the odd lines that compare a FLOAT to a fixed value…

    B11pdC9.f:      IF (SKIPSE.EQ.1.) GO TO 540
    

    Now I’ve not “done the math” to assess the risk in these. This is just a first pass “turn over the rocks” and see what squirms… BUT the existence of lines like that makes me squirm. It seriously needs thinking about to see if this would be better:

    IF (ABS(SKIPSE-1.).LT. 0.0000000000001) GO TO 540
    

    or better yet

    IF (ABS(SKIPSE-1.).LE. ERRORBAND) GO TO 540
    

    Where ERRORBAND is set for each particular machine and word size used…
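
    One way to build that ERRORBAND from the machine’s own precision, rather than a hard-coded constant, is to scale the machine epsilon by the size of the values being compared. A C sketch of the idea (the helper name nearly_equal is mine, not anything from Model E):

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    /* Compare within a band scaled to the machine's precision (DBL_EPSILON) */
    /* and to the magnitude of the values, instead of a hard-coded constant  */
    /* that may be wrong on another machine or word size.                    */
    static int nearly_equal(double a, double b, double ulps)
    {
        double scale = fmax(fabs(a), fabs(b));
        return fabs(a - b) <= ulps * DBL_EPSILON * scale;
    }

    int main(void)
    {
        double skipse = 0.0;
        for (int i = 0; i < 10; i++) skipse += 0.1;   /* "should" be 1.0 */

        printf("exact .EQ. style: %s\n", (skipse == 1.0)                ? "yes" : "no");
        printf("banded          : %s\n", nearly_equal(skipse, 1.0, 4.0) ? "yes" : "no");
        return 0;
    }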

    FWIW, on a first pass my assessment is that the programmers of Model II were very workmanlike, but a bit incautious at times, like above. They know the language and the rules well, avoiding most “issues”; but taking risks where I’m not sure they have mitigated them (or even realized them) in the odd lines like those two above. (Do note: It is perfectly FINE to do something like enter a loop and on the first pass have set the value of a variable to “0.” and then test for “0.” to not process in the same way on the first pass. The odds of subsequent passes ever hitting exactly “0.” are very small. What bites is doing any math then testing for “0.” or “1.” to do anything…)

    This potential flaw does not get smaller in the Model E code, it gets much larger, with more complicated bits like:

    mxkprf.f:      elseif(ifbg_theta_interp.EQ.1) then
    mxkprf.f:     &         (t_back(k).EQ.0.).or.
    mxkprf.f:     &         (s_back(k).EQ.0.)    )) then
    

    (where mxkprf.f is the program name spit out by ‘grep’ the Global Regular Expression Print pattern matcher used to search for .EQ. lines…)

    and:

    obio_carbon.f:          IF(XL .EQ. DRTSAFE)RETURN
    obio_carbon.f:          IF(TEMP .EQ. DRTSAFE)RETURN
    

    Now I don’t know that XL and TEMP are FLOATs (need to check their declarations), but so far I’ve not seen them using INT-range variables as floats nor float-range variables as INTs, so I think it a fair assumption at this point. This is exactly the kind of code that gets you into deep doo. I really, really hope that “DRTSAFE” is something like a Decimal Range Test Safe error band; that would go a long way toward bringing comfort…

    Sea Ice has this same class of issue:

    SEAICE.f:      IF (ACEFO .EQ. 0.) THEN
    

    so one hopes it is only “if newly entering the code, having set the value to ‘0.’ just prior to entry, do initial set up; otherwise do something else”, which is valid if a tiny bit sloppy.

    More a worry is:

    MODEL_COM.f: IF (STR.EQ.TIMESTR(N)) THEN

    Comparing two values for equivalence without error bounding it (one hopes they are declared CHARACTER/string types, as the STR implies, in which case an exact .EQ. is fine – but if they are FLOATs it is the same trap again).
    Then season with math like:

    MODEL_COM.f:          AM(LM+1:LM+LM_REQ) = REQ_FAC_D(1:LM_REQ)*PMTOP*1d2*BYGRAV
    MODEL_COM.f:        PEDN(LM+2:LM+LM_REQ) = REQ_FAC(1:LM_REQ-1)*PEDN(LM+1)
    MODEL_COM.f:        PEDN(LM+LM_REQ+1)=0.    ! 1d-5  ! why not zero?
    

    Where you can accumulate those minor errors described above and / or underflow and overflow of FLOAT ranges and / or… and then stuff arrays into IF tests like:

    OCEAN.f:          IF (Z1O(I,J).EQ.Z1OOLD(I,J)) CYCLE
    

    and you are just begging for one of those trivial hard to even see error accumulators and error bar breaking IF test faults to really screw your pooch.

    (No offense intended to pooches…)
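
    To see how that bites in an IF test, a toy C sketch (I’ve borrowed the Z1O names only; the real arrays and math are of course quite different): the “same” quantity arrived at by two routes quietly stops matching.

    #include <stdio.h>

    int main(void)
    {
        /* The "same" layer depth computed two ways: accumulate in a loop,  */
        /* or do one multiply.  The .EQ.-style test quietly stops matching. */
        double z1o    = 0.0;
        double z1oold = 0.1 * 10.0;                /* one multiply: exactly 1.0 */

        for (int i = 0; i < 10; i++) z1o += 0.1;   /* ten adds: not exactly 1.0 */

        if (z1o == z1oold)
            printf("CYCLE taken\n");
        else
            printf("CYCLE skipped: %.17g vs %.17g\n", z1o, z1oold);
        return 0;
    }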

    Every single bit of the math in this thing needs checking for just those kinds of subtle Aw Shits that happen when doing computer math and logic tests with typed data (or even untyped data in many cases… the computer still has those issues even if the language designer tried to work around them…)

    BUT what really frosted my shorts was the note that you get different results if you set MP vs. not, so just make sure to do it… instead of saying “and this is bad”, or “and this means we’ve got precision-sensitive code in there, John, start digging”, or… Just “Hey, look, cool. Watch for it and just know the runs will be different, not that it matters…”

    IMHO one of the major risks in “Climate Science” is folks with Ph.D.s in the sciences who took one class in FORTRAN grammar and think they are programmers now. They may have learned the syntax and grammar, but they don’t have the scars yet… and they don’t have anyone doing code reviews, QA regression testing, bug reports, etc. etc. to accumulate them… so they just keep ploughing on ‘Deeper and Higher’…

  28. Gail Combs says:

    Bobl says….

    Yes, very good examples. Even someone like me, a ‘computer challenged’ non-engineer, could follow it. Although you still need some science, which unfortunately is not taught in the USA school systems.

  29. Bobl says:

    @E.M.Smith
    Agreed. I find it hard to believe that the equivalence problem is not dealt with properly. Many compilers deal with the comparisons automagically, but that just makes it worse because the programmers get sloppy. It is insidious though.

    Even ( In C)
    float a
    double b

    a=22.0/7.0
    b=22.0/7.0

    if ( a == b) …

    Can fail, because when a is widened to 64 (or 80) bits for the comparison, the value already rounded to float precision doesn’t match b.

    What’s worse, some compilers choose when to use software floating point and when to use the FPU. While the results of both calculations are stored as 64 bits, the intermediate results on an FPU are held at (IIRC) 80 bits. So two instances of the SAME calculation at the same declared precision can differ depending on whether the compiler decided to use the FPU or not.

    It doesn’t even stop at equivalence. Consider:

    float a
    double b

    a=22.0/7.0000000001
    b=22.0/7.0

    if ( a < b ) …

    There is every possibility that the low precision of a, the widening to 64 bits, or the intermediate-result precision could cause a to come out greater than b, even though the code “defines” that a is less than b. I can’t think of a way for a compiler to get this right every time.
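
    For anyone who wants to try it, here is a runnable version of the first sketch (semicolons and all). On the setups I’d expect, the comparison comes out “not equal”, since a was rounded to float precision and widening it back to double doesn’t restore the lost digits:

    #include <stdio.h>

    int main(void)
    {
        float  a = 22.0 / 7.0;   /* rounded to roughly 7 significant digits  */
        double b = 22.0 / 7.0;   /* the same arithmetic kept at ~16 digits   */

        /* For the comparison, a is widened back to double, but the digits   */
        /* lost when it was rounded to float do not come back.               */
        if (a == b)
            printf("equal\n");
        else
            printf("not equal: a=%.17g  b=%.17g\n", (double)a, b);
        return 0;
    }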

    I used to teach realtime C and Modula 2 to university students this was always one of the first things I taught.

    Engineers are taught this stuff. The computer science people I studied with sometimes had no idea about it. Self-taught “Climate Scientists” have less than no idea. Maybe things are better now, but I doubt it.

  30. Bobl says:

    (Oh I omitted semicolons, it’s kinda pseudocode – don’t try running this)

  31. Bobl says:

    @Gail Combs

    I just thought I’d explain to non-technical readers what happens in real-life programming – no science really, just a little math. In real-time programming things get even worse, because values can change as you read them, or faster than you can read them from memory. The time at which you do a calculation, or – worse – the time a calculation takes to do, can change the result. I remember chastising my university real-time computing lecturer; he gave us an assignment including code. It dealt with a buffer that put each character from a serial port onto the beginning of a list (a singly linked list, for those in the know). To take a character off the list one had to move through all the characters in the list, from beginning to end, following the links from one to the next.

    ( Gail, think of this as taking a chain, and then following the chain from where the chain is bolted to the gate to find the last link so you can apply a lock.)

    Now think about this: moving from one link to the next takes time, so traversing the full chain takes around n x T(1 link), where n is the number of links in the chain. The delay between characters arriving from the serial port, though, is a constant time. So as you add characters, at some point it takes longer to traverse the list than the period between characters. From then on, removing characters from the list can never be done as fast as characters are put on it, so the length of the list grows without end until you run out of memory and BANG, the 747 falls out of the sky.
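
    A bare-bones C sketch of that buffer (my own illustration, not the professor’s code): pushing a character onto the head of the chain is constant time, but popping one off the far end means walking every link, so the cost of removal grows with the backlog.

    #include <stdio.h>
    #include <stdlib.h>

    /* A singly linked list used as a serial-port buffer: O(1) to push a   */
    /* character at the head, O(n) to pop from the tail, because we must   */
    /* walk every link to find the end.                                    */
    struct node {
        char c;
        struct node *next;
    };

    static struct node *head = NULL;

    static void push(char c)              /* producer side: constant time   */
    {
        struct node *n = malloc(sizeof *n);
        if (!n) abort();
        n->c = c;
        n->next = head;
        head = n;
    }

    static int pop(char *out)             /* consumer side: walks the chain */
    {
        struct node *prev = NULL, *cur = head;
        if (!cur) return 0;
        while (cur->next) {               /* cost grows with the backlog    */
            prev = cur;
            cur = cur->next;
        }
        *out = cur->c;
        if (prev) prev->next = NULL; else head = NULL;
        free(cur);
        return 1;
    }

    /* Once one call to pop() takes longer than the gap between incoming    */
    /* characters, the consumer can never catch up: the list only grows,    */
    /* until memory runs out.                                               */
    int main(void)
    {
        char c;
        push('a'); push('b'); push('c');
        while (pop(&c))
            printf("%c ", c);             /* drains in arrival order: a b c */
        printf("\n");
        return 0;
    }

    The usual fix is to keep a tail pointer (or use a circular buffer) so removal is constant time too – which was rather the point of the complaint.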

    Ah, the joys of realtime programming.

    Note that the professor giving the lecture ( Head of School in IT – No Less) was not very impressed with my observation when I pointed it out in class.

  32. E.M.Smith says:

    @Bobl:

    As I noted before: I was very fortunate to learn FORTRAN and ALGOL in the Engineering Department… In later years I saw what Computer Science was doing and found it somewhat sad. LOTS of glitz and new ideas and complexity hiding… and complexity ignoring, which leads to indeterminate code and outcomes…

    One really MUST know what is going on to have a clue what happened…

    FWIW, I’d not been exposed to the GPU / FPU issue in school… Ran into it in a discussion of ARM chips, where the NEON SIMD unit does non-IEEE-compliant math, but you can set the build to soft float for IEEE compliance, or hard float for “close enough”… so even the particular chip set determines your results (NEON vs. other FPU / vector units), as do the compiler flags you set (soft vs. hard float) AND what the OS builder did (set it to run hard float or soft…).

    How many folks dumping a FORTRAN run into the new box they got know what the chip set is, what the OS build is set to use of that chip set, and what the default compiler flags do with the chip set? And what all that means to the results of their math and how that interacts with their branching and calculation steps?

    As one wag said:

    “We give these people computers and we expect them to know how to use them?!?”

    BTW, I’ve not yet run into even ONE non-Engineer who wants to think about such things…

    The general attitude seems to be “Hey, it compiled, so it must be right!”… Or “This time for sure!”…

    IIRC our Burroughs B6700 had an ALU (Arithmetic Logic Unit) but no FPU. (Also an odd word size and interesting protection bits…) so I can forgive them not teaching FPU issues, I guess… This link has lots of interesting bits of worry in it, but also has an interesting B6700 story:

    http://catless.ncl.ac.uk/Risks/24.52.html

    Re: Trig error checking (Ewing, RISKS-24.51)

    Mon, 18 Dec 2006 21:27:26 +1100

    Martin Ewing wrote of a VAX floating point unit that exhibited intermittent
    faults (RISKS-24.51). Computers’
    arithmetic-logic units (ALUs) don’t seem well protected against
    intermittent faults. In the early 1970s I worked with a Burroughs B6700
    computer that occasionally, when compiling a copy of its operating system,
    failed the computation with a spurious fault. (The core operating system was
    written in an Algol dialect and took about 20 minutes to compile.)

    After much investigation we found that the cause was an occasional bit
    dropped by the RDIV operator (which calculated the remainder from an integer
    division). This rarely used operator was used by the operating system in
    calculating disk segment addresses in the computer’s virtual memory paging
    file. When a bit was dropped in the segment address, the compiler would be
    fed a chunk of garbage from the wrong page file segment and flag a syntax
    error. After some detective work I wrote a program that did repeated RDIVs
    and checked the results, highlighting the problem. The fault was rare, less
    than once in 1000.

    Had the problem occurred with a more commonly used operator the result could
    have been nasty.

    Perhaps ironically this was one of the first B6700s that was delivered with
    main memory that included Hamming code single bit error correction and
    double bit error detection. But nothing detected a faulty ALU.

    Mike Martin […]
    Sydney

    And folks wonder why I write code like:

    If A do FOO; GoToEnd;
    elseif B do BAR; GoToEnd;
    endif;
    if NOT A OR NOT B GoToError;

    Print(You Can’t Get Here!);
    exit(1);

    I’ve had code like that print out “you can’t get here!”…
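
    In C that defensive pattern looks roughly like this (a sketch of the idea, not the original code):

    #include <stdio.h>
    #include <stdlib.h>

    /* Defensive dispatch: handle every case you know about, trap the case  */
    /* you expect to be an error, and still leave a tripwire on the path    */
    /* that "can't happen" -- because sometimes it happens anyway.          */
    static void dispatch(int a, int b)
    {
        if (a) {
            puts("FOO");                    /* case A handled */
            return;
        } else if (b) {
            puts("BAR");                    /* case B handled */
            return;
        }

        if (!a && !b) {                     /* the expected "neither" case */
            fprintf(stderr, "error: neither A nor B\n");
            exit(1);
        }

        /* If control reaches here, the machine, the compiler, or our own   */
        /* logic is broken -- exactly what the tripwire is there to catch.  */
        fprintf(stderr, "You Can't Get Here!\n");
        exit(1);
    }

    int main(void)
    {
        dispatch(1, 0);   /* FOO        */
        dispatch(0, 1);   /* BAR        */
        dispatch(0, 0);   /* error exit */
        return 0;
    }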

  33. I never had a need to learn FORTRAN; most of my programming was in Databus (a very efficient language on the Datapoint machines) and versions of Basic on various boxes, but mainly assembler. These gotchas in Fortran, though, may well explain why the code doesn’t do what the programmers intended. I’ve looked at C and C++, but didn’t have a need to actually write programs in them (assembler got me to running code a lot quicker, and bug-free), and I found the definition of the languages somewhat sloppy. Since C variants are supposed to be self-documenting, few people put useful comments into the code to say what they intend it to do, making understanding the intentions (and finding the bugs) a somewhat longer task than it should be.

    In writing code, I first write the comments saying what the subroutine is intended to do, then implement it. This way round, it helps in keeping the logic straight. Once I’m out of the programming fugue, it also makes it somewhat easier to check for bugs – I don’t have to remember what I intended that chunk to do.

    With assembler, there weren’t many flags to set (if any) and the machine did what I told it to, so if it gave the wrong answer it was my fault. Not so true nowadays, with out-of-order processing and the compiler assigning the core and thread. There are likely ways around that, but these days I no longer write code so it’s Somebody Else’s Problem. It does, however, seem to me that there are far more opportunities for the running code to not reflect what the programmer intended, with all the compiler flags and the optimisation passes that delete sections of code the compiler thinks are unnecessary. I’m sure some optimisation runs would delete that bit of code “Print(You Can’t Get Here!)” because it should never run, which would defeat the reason for putting it there.

    It seems that the high-level languages automagically convert one data type to another when needed, rather than complaining about it. I’d rather we were told, and that any change to a floating-point number’s length or precision were flagged. More words are needed if you have to explicitly change the type before using it in a new way, but at least you know it’s happened and what difference it will make. There are a lot of gotchas, and it’s currently looking like the GCM is falling foul of them in a lot of places.

    I won’t be much of a help in debugging the code – too much Fortran experience needed, and it would take quite a while to understand all the possible unintended errors. You only get that experience by hard work for several years, and there’s a lot to learn. At the moment, it looks like you will get this code to run, but it’s so badly-designed that each run will return a different answer. Butterflies are programmed-in….

    The world itself has a lot of vortices, and as far as I can see this code doesn’t really model those. This is likely a problem with a cell-based approach anyway, where the development of a vortex inside a cell just won’t be modelled and only the extant vortices that stretch across multiple cells can be considered. If you get the initial conditions a bit wrong, then the model will diverge from reality. It could be a bit difficult to get that right, though, except by using much smaller cells. At ground level, we know that hills produce thermal vortices, so having a “hilliness” number for a large cell won’t help that much – we need to model what each hill does, and also each high-rise building.

    With most of the climate models being proprietary and thus the code unpublished, we can’t check any others for bugs or physics errors. We know that they all diverge, so most of them (and probably all of them) must be wrong. Not a good state of affairs when we’re trying to make multi-billion dollar decisions based on the results.

  34. E.M.Smith says:

    @Simon: Very interesting comments…

    FWIW the “can’t get here” experience was not in FORTRAN.

    Yes, optimizers would likely remove it, but one can set optimizer level. Usually little or none during devo, then high before production and final regression test.

    I, too, found hand assembler easy and clean…

    Folks who write C just think it self documents. It doesn’t. IMHO, less so than Pascal, ALGOL, ADA, or COBOL.

    Many of the GOTCHAs in FORTRAN exist just as much in other languages: precision, typing, underflow and overflow, EQ testing floats. It’s just that Engineers and FORTRAN folks think about them more… Some things, like variables starting with I through N self-declaring as INTs, are a feature… I like FORTRAN more than most languages for numeric tasks.

    Auto type conversion varies by language. C is loose, so it’s good for low-level and OS writing, with casting when desired and just bit stuffing when needed. Pascal is a straitjacket… Others are in between.

    That vortex point is one I’ve pondered… watch earth.nullschool and pretty quick you realize this place is all circular motion in 3D … and the models do rectilinear or hexagonal tiles without circular motion. Fundamentally broken at that first design point. The world is made of fractals and vortexes that are not in the model structure. How’s that gonna work…

  35. wyzelli says:

    Just…. wow! I just downloaded and had a look at a couple of the Perl scripts….
    There is some very convoluted thinking going on in there… very bad code.
    Normally the first thing to do when looking at Perl code is to make sure that warnings and the ‘strict’ pragma are enabled, so that basic errors are caught (declaring variables and then not using them, declaring the same variable name twice, etc.). Errors and warnings out of the box…
    Then there is the logic flow which is a whole nother realm.. :/

  36. E.M.Smith says:

    @wyzelli:

    Interesting experience, isn’t it? :-)

    There is a very different ‘style’ as compared to commercial programming standards. More like an erector set with a model gizmo stuck together with odd bits, and some other optional modules laying in the box…

  37. cdquarles says:

    FORmula TRANslator, and yes I am fond of it and BASIC, an ALGOL derivative. Way back when, before the rise of Computer Science, programming was either in the Mathematics Department or the Engineering Department. When I took my programming classes at the University, that was the situation. [Yes, I have been one of those student prototyping programmers. It was fun ;)] And yes, the good practices of the time were drilled into our heads. You’d diagram your code (one way to generate a document to cross-check coded comments) before keypunching a deck or go play with the interactive dumb terminal so your generally small modules were as bug-free as possible (100 to 1000 lines). The first Assembly code I did was 6502 assembly. I’d do the same kind of diagramming for it. Hand assembly, when done well, used to be more efficient than the compiler or interpreter’s generated stuff. I’m sure that present day compilers are fed test assembly script to see if the code generators are working as well as they can.

  38. I like a fast FORTH threaded interpreter! With such you can do ‘anything’ including DRONK, STONED, or STUPID! These should be fundamental FORTH words! Da faster it goes, da faster you can fix da inevitable AW s**t!!
    All the best! -will-

  39. E.M.Smith says:

    @CDQuarles:

    There was also, in the early ’70s, a small presence in the Econ departments as they were discovering econometric modeling. But at my school (U.Cal.) you got a Math degree or an Engineering degree for computers. I was an “odd duck” in that I took Engineering classes with lots of computers while being an Econ major… but that whole econometric modeling had caught my attention…

    Also there is something about keypunching a deck and the finality of a hole in a card that made a fella want to have it right before booking punch time and buying cards… I miss the old 029 keypunch…

    @Will Janoschka

    I ran into FORTH in about 1983? and really liked it. On my shelf is a book on creating Threaded Interpretive Languages. At the time I was working on mainframes and they didn’t want to let me install a low level interpreter on the hardware ;-) so I didn’t get to do much with it.

    Instead I fell into Unix quasi by accident. Most scripting languages are threaded and interpreted, so I found outlet there. Since then I’ve done a LOT of scripting, often in the threaded interpretive model. One trivial example:

    I have a set of commands to make more commands.

    Word    Definition
    allow:  chmod +x $1
    cmd:    cd ~/bin; vi $1; allow $1
    lsb:    cd ~/bin; ls $*
    

    Note how “cmd” calls “allow”? Yes, trivial example, yet it works nicely…

    Now if I’m going to do just about anything, like, say, move directories to an archive, I’ll type:

    cmd archive

    be tossed into an editor where I can type:

    cp -r $1 /My/Archive_dir
    ls -l /My/Archive_dir/$1 $1

    and then I have the word “archive” in my set of words, and after that, to archive a thing I can just type:

    archive FOO

    and it is done.

    (The actual form wads it up into a compressed tar ball, but I left that out so the not-so-*Nix folks don’t glaze too much… “tar cf - $1 | (cd /MyArchive; gzip > $1.targz)” comes close, though many systems now support tar with a compress flag built in. Oh, and I usually do: archive FOO& so it runs as a background task.)

    So even if not FORTH, my early exposure to the concepts of FORTH drive my daily activity. Just yesterday I added a word to inspect the status of my newly built raid array:

    raid: mdadm --detail /dev/md0; echo; cat /proc/mdstat

    I’ve saved myself endless hours of typing by compressing commonly used commands and options into single words in a very FORTH like manner…

    Were I going to bit twiddle down at the hardware level, I’d want to use FORTH or something like it. (But most likely would end up using C as it is The Way on *Nix boxes…)

    It is interesting to note that scripting drives the basic functions of most things *Nix for exactly those kinds of FORTH like property reasons. (Part of why I don’t like SystemD, it violates all that.) Just being able to look at (and potentially change) the “words” of *Nix administration is a huge benefit, lost in the SystemD “binary blob” approach.

  40. Can you imagine a machine with a code region that has distinct addressable locations of “machine code” but no other addresses whatsoever. All addresses are on a ‘stack’ of pointers to an encrypted base of pointers back to the code region for what to do with the values on the other stack? I tried that once; much like going insane! 2+2 = 4 and this is close enough for government work. I believe I will have another beer!

  41. E.M.Smith says:

    No, I can’t…

    ;-)

    What purpose to encryption inside the machine to then just decrypt…. just pot stirring?

    Did once work with an OS with a ‘single level store’ design. Disks, memory, you name it, all just one big address space… and yes, there were, um, “issues”… (what to do when swap space ended up on that disk pack on the shelf over there… or a tape shipped offsite… and just how do you make backup copies of critical data sets?… that might be anywhere in the total store space…)

  42. E.M.Smith says: 5 February 2017 at 1:19 am
    “No, I can’t… ;-)
    What purpose to encryption inside the machine to then just decrypt…. just pot stirring?”
    Yes and no! This was done in a Verifone keyloader for point of sale terminals. DES and 3DES so no one could snoop and find out! I did anyhow!
