A FAMOUS/HadCM3 Parallel Computing Paper That Has Saved Me Years

I’ve been pondering computer climate models for a while now. In particular, how to make one run in a more parallel way on a cluster of smaller computers (instead of using a giant $Millions money sucking monster made of Unobtanium and beyond the reach of Just Regular Folks).

Why? Well, because it’s what I do… I’ve been immersed in the world of HPC (High Performance Computing) in one way or another for a decade or two and my specialty for some of that time was “efficiency reviews”. Looking at some poor chunk of Big Iron near melt down from some aspect of code “not caring” about it, and finding ways to give it a significant tune up.

Sidebar on Databases:

HPC is usually focused just on the “compute” side. Engineering and Scientific compute heavy loads. My start in the area was on IBM Big Iron running database programs. What today might be called “Big Data”. Massive farms of disks full of data, and the big Mainframe sifting through it.

One particularly satisfying contract had me at the California State Architects Office. They had a mainframe that was 100% full and at times bogging down enough to be painful to someone. This machine was largely running the FOCUS DBMS, my product specialty at the time. Over about 1 week I took my (personally developed) 26 point checklist of items and began tuning.

Now they were faced with the expectation that they would need to buy ANOTHER $4 Million computer. They were quasi-accepting of this until the data center informed them they had no place to put it and it would need to wait a year or two… The idea of a year or two of pain and delay bothering even State employees, I was considered a small price to pay for seeing if I could get them a bit more lifetime. Hopefully below 90% load for the next year.

When I was done, the machine was running at 4% load and everyone was quite happy; some were a bit astonished. (Me among them… I didn’t know code could be that bad ;-) Well, needless to say, the results from tending to efficiency questions have been important to me ever since.

How was this “miracle” worked? Largely by a bunch of simple and obvious things, once you bother to think about it. Select records prior to sorting them. Put index fields on frequently selected and sorted items to make finding them faster. 26 of those were what I’d identified over my years of working with the product. Each Senior Consultant was expected to develop a “specialty” and I’d decided mine would be “efficiency”, so that’s what I collected. Any time I saw an efficiency “trick”, I’d add it to my list.

Turns out that selecting 1000 records out of a Million or two THEN sorting, can save a lot of time over sorting a Million records ;-) Especially if you added an index key on that frequently selected field so you don’t have to read the whole database even to find the 1000 records…
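A minimal sketch of that select-then-sort idea, in Python rather than FOCUS. The record layout and the field names `county` and `amount` are made up for illustration, and the dictionary "index" only stands in for what a real DBMS index does:

```python
import random

def sort_then_select(records, county):
    """Slow path: sort every record, then keep the few we want."""
    ordered = sorted(records, key=lambda r: r["amount"])
    return [r for r in ordered if r["county"] == county]

def select_then_sort(records, county):
    """Fast path: keep the few records we want, then sort only those."""
    wanted = [r for r in records if r["county"] == county]
    return sorted(wanted, key=lambda r: r["amount"])

def build_index(records, field):
    """Hypothetical index: map each field value to the records holding it."""
    index = {}
    for r in records:
        index.setdefault(r[field], []).append(r)
    return index

random.seed(1)
records = [{"id": i, "county": random.randrange(1000), "amount": random.random()}
           for i in range(100_000)]

slow = sort_then_select(records, 42)          # sorts 100,000 records
fast = select_then_sort(records, 42)          # sorts roughly 100 records
by_county = build_index(records, "county")    # built once, reused per query
indexed = sorted(by_county.get(42, []), key=lambda r: r["amount"])
```

All three give the same answer; the difference is only in how many records get sorted (or even read) along the way.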

In short: Knowing what the machine is doing and how it works matters. Knowing how to be smart in what you ask it to do matters too.

That is the essence of HPC. Know your hardware. Know what you want done. Tell the hardware to do it in a hardware friendly smart way.

It is the tendency for folks like Microsoft and, really, most vendors these days to ignore exactly that; which is the source of my resentment toward them. I’ve spent a very large part of my professional life cleaning up after that kind of thinking. It had become trendy in the 90s to think that “hardware is cheap” and not care about efficiency outside the area of HPC. IMHO, that’s “exactly wrong”. Now that Moore’s Law is topping out, there is a large pool of horridly bad code just waiting for the efficiency miners to go to work and make it good code. I know it can give huge returns…

Climate Models

The climate model folks do seem to put some effort into making their models run with reasonable efficiency. However, it is also clear that they are more than happy to just toss more $Millions at even bigger computers and “improve” results by adding more cycles to loops with more steps of the same old computes, but smaller grid cells. They get Bragging Rights for the bigger hot box, newer tech, bigger budget. So why not?

Even in the world of HPC the tendency has been to lean on the Moore’s Law crutch too much. Just wait a year or two, buy a bigger box, run the same old same old program with double the grid cells and publish “new” results… Easy peasy.

But “times move on”. Even in HPC we’re running out of faster machines. In the last decade or two it has all been a move to “Massively Parallel”. My Cray in the late 80s / early 90s was a 4 processor box. That was the end of the Big Processor era as the field moved into ever more processors. Most of the Neat Tricks of Cray moved into things like high end Intel / HP / PowerPC cores. Heavily pipelined instructions (executing several instructions in steps, with the steps overlapping) and using a sort of a Vector Unit via things like SSE instructions. So folks just started sticking hundreds of them in a box. Then thousands. Then tens of thousands. We’re now up in the 100,000 cores in a box range.

Yet Amdahl’s Law still applies. (I worked at Amdahl Corp. for some years. That’s where I first started using Unix and my whole life path shifted from Database stuff to All Things *Nix and Big Iron). In reality, it is very hard to do most kinds of computer stuff in parallel. Over the years, the tools to do it have improved a lot, but it is still the case that most of the time for most kinds of work, at about 4 cores on a given job you have gotten most of the parallel bits serviced and the sequential bits are now dominant.

For work suited to parallel processing, at about 4000 cores you start to hit a wall. (That’s why most US Supercomputers top out at 4000 to 8000 cores. More isn’t buying you much, or anything. The Chinese have been making these monster boxes packed with 100,000 cores mostly so they can brag on them. It would be more effective to let their 100 researchers running 100 jobs run on a cluster of 25 machines of 4000 cores each, but “whatever”…)

There is a very very small set of jobs that are “embarrassingly parallel”. (That’s the official jargon! Honest!) Only those really benefit from more than 1000 cores. For weather and climate models, the maximum parallel that can readily be reached is roughly proportional to number of cells (but even then, only if you write your code in such a way that each cell “does the math” separately from the others…). In most cases, even the separate “cells” of a job have dependencies on their neighbor cells such that you can’t compute them ALL at once, but in groups or waves.
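To make the "groups or waves" point concrete, here is a small sketch. It assumes a toy dependency pattern (each cell needs its upper and left neighbours first), which is my illustration and not how any particular climate model is wired. Cells on the same anti-diagonal do not depend on each other, so each diagonal is one wave:

```python
def wavefronts(rows, cols):
    """Group grid cells into waves that can each be computed in parallel.

    Assumed dependency (illustrative only): cell (i, j) needs (i-1, j)
    and (i, j-1) to be finished first, so all cells with the same i + j
    are independent of one another.
    """
    waves = []
    for k in range(rows + cols - 1):
        wave = [(i, k - i) for i in range(rows) if 0 <= k - i < cols]
        waves.append(wave)
    return waves

for n, wave in enumerate(wavefronts(3, 4)):
    print(f"wave {n}: {len(wave)} cells in parallel -> {wave}")
```

Note the widest wave is only as big as the short side of the grid, so even a perfectly written job of this shape never keeps more than that many cores busy at once.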

Eventually you have made parallel what can be made parallel and the job is once again dominated by what can not be made parallel. Amdahl’s Law.
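Amdahl’s Law is short enough to write down as code. The 75% parallel fraction below is purely an assumed number for illustration, not a measurement of any real job:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's Law: speed-up when only part of a job can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Assumed example: a job that is 75% parallelisable.
for cores in (1, 4, 16, 64, 4000, 100_000):
    print(f"{cores:>7} cores -> {amdahl_speedup(0.75, cores):.3f}x")
```

However many cores you throw at that assumed job, the speed-up never reaches 4x, because the serial quarter still takes just as long as it always did.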

So the HPC folks are still out there, slugging along finding opportunities to tune code to be faster, to match it to a given hardware, and to make things run in parallel where possible. It was that “problem” that got me interested when looking at climate models (that started as just a ‘what is it doing and is that rational?’ question).

So here I am, sitting in my home office, typing on a Raspberry Pi Model 3, compiling model code on it, and pondering running a model on a cluster of such Dinky Iron. What am I thinking!?

Well, I’m thinking that there’s lots of opportunity to tune things, and that these codes were first invented about 30 years ago. That’s a lot of “doubling times” (18 months to 2 years) of hardware speed. For $40 I can now buy more processing power than my $40,000,000 Cray provided.
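The doubling arithmetic, as a quick sketch. It simply assumes clean doublings at a fixed interval, which real hardware never quite delivers:

```python
import math

def growth_factor(years, doubling_time_years):
    """Bang-per-buck multiplier after `years` of steady doubling (idealised)."""
    return 2.0 ** (years / doubling_time_years)

slow_case = growth_factor(30, 2.0)   # 15 doublings
fast_case = growth_factor(30, 1.5)   # 20 doublings
price_ratio = 40_000_000 / 40        # Cray price vs. Pi price, from the text

print(f"30 years at 2.0 yr/doubling: {slow_case:,.0f}x")
print(f"30 years at 1.5 yr/doubling: {fast_case:,.0f}x")
print(f"price ratio is about 2^{math.log2(price_ratio):.1f}")
```

So a million-to-one price drop for similar compute is right at the optimistic end of 30 years of doublings.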

In the Model II code, many bits have a header saying:

c ** Based on GCMII code for IBM RS/6000 computers created at GISS

They were Damn Fast back in the ’90s. Now not so much… We’re talking 50 MHz to 500 MHz and mostly single digit GB of memory. Or slower than what’s on my desktop with less memory in many cases. At present, the Model II code is stated to run well on Intel based PCs. Using a collection of a few “Pi Type” boards to run something suited to an RS/6000 and that works well on PC CPUs is not a hard leap.

BUT, it would require that I learn some parallel coding techniques. Something I’m rather interested in doing.

Here are some benchmarks, just to drive the point home:


Step By Step (30 Years)

FORTRAN code is the same for all platforms (except for the time functions)
Download the console program for Win32 (solver of linear system Ax= f) Test.zip (36KB)
or the source Test_source.zip (2KB) and compare your CPU’s horsepower with the Intel 386, RISC Intel 860 or legendary NeXT station.

Copyright Vladimir Galouchko Home page 3dfmaps.com

Hardware (Software)                                       Sec
Intel i7 6700K 4.00GHz (Intel Fortran XE 2015 x64)       0.06
Intel i7 6700K 4.00GHz (Intel Fortran XE 2015 x86)       0.06
Intel i7 2700K 3.7GHz (Intel Fortran XE 2011 x64)        0.10
Intel i7 2700K 3.7GHz (Intel Fortran XE 2011 x86)        0.11
IBM RISC/6000-55                                        14.00
SPARCstation 20 superSPARC/50                           14.50
IBM RISC/6000-550/40                                    14.74
IBM 3090J (vec)                                         14.78

IBM RS/6000 250 PowerPC/66 (IBM AIX XLF FORTRAN v2.3)   15.10
IBM RS/6000 250 PowerPC/66 (IBM AIX XLF FORTRAN v2.2)   15.76

Essentially, I’ve got about the same power in just one of my SBCs (Single Board Computer). Perhaps more (depending on how I use it and the GPU).

Which brings us to this paper and the FAMOUS model that runs on PC class hardware (though, it would seem, not fast enough ;-) It is described as a coarse, fast version of HadCM3, though I’ve not found where to download a copy of the source code.

FAMOUS / HadCM3 and Parallel Conversion

When contemplating trying something, especially when it may take months to test, I like to take time up front to find out if someone else has already done that work so I don’t get stuck with it. In this case, someone has already done the code profiling and conversion to parallel and the testing. They also used a model (FAMOUS) that now runs on desktops (i.e. coarser steps) and they have profiled the results. It’s impressive as a bit of work. Work I now need not do, and where just reading a paper for an hour or so covers it. Thanks for that!

They try to be fancy at the first link and give you an interactive experience via some stuff glued on to the side. On the Android tablet that caused a strange “jumping” behaviour when paging down. I just downloaded the PDF and read it instead. Oddly, it wants me to sign in to get the article to display on the Pi M3 (perhaps because the Tablet has a Google Account on it…) but the second link worked without that:



Go figure…

The author basically introduces the FAMOUS model and then proceeds to profile it, find the majority of the computes go into the atmospheric radiative processes, and finds ways to make that run much faster in parallel on several different kinds of hardware. Including the PowerPC in the Sony PlayStation hardware and on OpenCL machines and using GPUs in an Intel box. Nice, that. It took them about a week to make a parallel version, yet elsewhere they say “2 1/2 man years”, so I’m figuring this paper is worth about that much of my life NOT spent redoing any of it. Thanks for that!

Geosci. Model Dev., 4, 835–844, 2011
www .geosci-model-dev.net/4/835/2011/
© Author(s) 2011. This work is distributed under
the Creative Commons Attribution 3.0 License.

Model Development

FAMOUS, faster: using parallel computing techniques to accelerate the FAMOUS/HadCM3 climate model with a focus on the radiative transfer algorithm

P. Hanappe1, A. Beurive1, F. Laguzet1,*, L. Steels1, N. Bellouin2, O. Boucher2,**, Y. H. Yamazaki3,***, T. Aina3, and M. Allen3
1 Sony Computer Science Laboratory, Paris, France 2 Met Office, Exeter, UK 3 University of Oxford, Oxford, UK
*now at: Laboratoire de Recherche en Informatique, Orsay, France
**now at: Laboratoire de Météorologie Dynamique, IPSL, CNRS/UPMC, Paris, France
***now at: School of Geography, Politics and Sociology, Newcastle University, Newcastle, UK
Received: 10 May 2011 – Published in Geosci. Model Dev. Discuss.: 17 June 2011
Revised: 10 September 2011 – Accepted: 12 September 2011 – Published: 27 September 2011

Abstract. We have optimised the atmospheric radiation algorithm of the FAMOUS climate model on several hardware platforms. The optimisation involved translating the Fortran code to C and restructuring the algorithm around the computation of a single air column. Instead of the existing MPI-based domain decomposition, we used a task queue and a thread pool to schedule the computation of individual columns on the available processors. Finally, four air columns are packed together in a single data structure and computed simultaneously using Single Instruction Multiple Data operations.

They give this reason for the conversion to C: “Because no Fortran compiler existed for the SPEs, we were compelled to translate the radiation code to C.” Given that FORTRAN and C are both about the same speed, provided the needed parallel facilities can be used, it ought not matter what the surrounding language is, if it is one of the efficient ones. I suspect Julia would work just as effectively and likely with a bit less trouble than C (which is less user friendly though closer to the hardware). Since I have FORTRAN on the computers here, I can also just leave that be if converted to a parallel form. Bolding by me.

The modified algorithm runs more than 50 times faster on the CELL’s Synergistic Processing Elements than on its main PowerPC processing element. On Intel-compatible processors, the new radiation code runs 4 times faster. On the tested graphics processor, using OpenCL, we find a speed-up of more than 2.5 times as compared to the original code on the main CPU. Because the radiation code takes more than 60% of the total CPU time, FAMOUS executes more than twice as fast. Our version of the algorithm returns bit-wise identical results, which demonstrates the robustness of our approach. We estimate that this project required around two and a half man-years of work.

A “50 times” speed-up is a Very Big Deal. Saving 2.5 years is another one ;-) Even the 4x they get on Intel-compatible processors, and the 2.5x from OpenCL on the graphics processor, are “nice to have”. Then the comparative speedup with multi-threads vs OpenCL vs GPU is very good information for planning. The “Cell” processor is in the Sony PlayStation. It is a main CPU with what amounts to 6 user addressable RISC cores glued on. A bit more than GPU cores, but less than a full processor. Those are why the PlayStation was such an attractive beast for DIY parallel clusters (and why, IMHO, the DOD leaned on Sony to block access to them and then move on to a different processor… put too much compute power in the hands of us home gamers ;-) But, no worries, there are other ways … ;-)

So you can think of the CELL processor as being, basically, a 7 core machine (really 9 but 2 are not user accessible). Well, my Odroid XU4 has 8 cores, all of them full RISC processors. Sure, the A15 cores are not as “fancy” as the main PowerPC core, but I’ve got 4 of them AND 4 auxiliary A7 cores to work with. I suspect it is “quite enough”, and that a stack of 4 to 8 of them is “way more than enough”. Probably enough to allow upping the model precision some. After all, it’s 10 years after the citation of the CELL processor, so that’s about 5 doublings of Bang/$ … or 32 times.

Section 4 describes the changes we have made to the radiation algorithm of FAMOUS to exploit parallel computing techniques. Our revised code yields very large performance improvements on the CELL processor. The modifications are beneficial for other computing platforms as well, including general purpose CPUs with vector instructions, multi-core platforms, and Graphics Processing Units (GPUs). Details of the performance we achieved are given in Sect. 5

That they took the time to compare and contrast CPUs with “media extensions” (vector instructions), GPUs and the CELL processor lets you know where the best gains can be found.

That the major resource suckage was in radiative atmospheric portions also is encouraging to me. I’m interested in a model that looks a bit more to the natural physics and is less radiative obsessed, so this tells me I can get by with a lot less computes for that portion.

Section 2:


FAMOUS (FAst Met Office/UK Universities Simulator) is a low-resolution version of the better known HadCM3, one of the coupled atmosphere-ocean general circulation models used to prepare the IPCC Third Assessment Report, and is a particular configuration of the U.K. Met Office’s Unified Model, which is used for both weather prediction and climate simulation. FAMOUS is designed as a fast test bed for evaluating new hypotheses quickly or for running a large ensemble of long simulations. It has been calibrated to produce the same climate statistics as the higher resolution HadCM3. FAMOUS uses a rectangular longitude/latitude grid. The resolution of the atmospheric component is 48×36 (7.5° longitude × 5° latitude or roughly 830 km × 550 km at the equator) with 11 vertical levels. It has a 1-h time-step for the atmosphere dynamics and a 3-h time-step for the radiation. The resolution of the ocean component is 98×72 (3.75° longitude × 2.5° latitude) with 20 vertical levels and a 12-h time-step. FAMOUS contains legacy code that has been optimised for previous hardware platform and that has been adapted continuously. It consists of about 475,000 lines of Fortran 77 with some extensions of Fortran 90.

So largely the same code, just run more coarse on the steps. Nearly 1/2 million lines of Fortran (Ooof!). One wonders how much of it is really needed to model heat flow from a sphere into space… Personally, I suspect a simple engineering oriented model using a spherical heat pipe with water as the working fluid and “air contamination” and some spots of the ‘hot end’ drying out would be quite enough. Might need to add some fancy bit to cover the zero pressure at the top and gravity, but not much more than that I think… Radiative ought to only matter above the tropopause (as the reason there is a tropopause is that convection is doing the work that radiation can not…) and there more CO2 means MORE radiation to space. But back to the paper…

Subroutine                   Computation  Computation
                                time (s)     time (%)
Ocean sub-model                   142.66        10.04
Atmosphere sub-model             1278.43        89.96
  ↳ Atmosphere physics           1120.47        78.85
      ↳ Radiation                 950.34        66.87
          ↳ Long-wave radiation   572.84        40.31
          ↳ Short-wave radiation  314.76        22.15
      ↳ Convection                 46.01         3.24
      ↳ Boundary layers            38.86         2.73
  ↳ Atmosphere dynamics           109.84         7.73
      ↳ Adjustment                 49.10         3.46
      ↳ Advection                  29.08         2.05
      ↳ Diffusion                  10.52         0.74

Basically, only 1/3 of the model time is spent on the non-radiative part. This implies getting that part to run on small iron ought to be fairly easy.
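Plugging the table’s own numbers into Amdahl’s Law gives a rough cross-check on the abstract’s “more than twice as fast”. This is only a sketch: it assumes the 4x radiation speed-up quoted for Intel-compatible processors applies to the same run that was profiled above, which the excerpts here don’t confirm.

```python
# Timings (seconds) copied from the profile table above.
OCEAN = 142.66
ATMOSPHERE = 1278.43
RADIATION = 950.34
TOTAL = OCEAN + ATMOSPHERE

def overall_speedup(radiation_speedup):
    """Whole-model speed-up if only the radiation code gets faster."""
    everything_else = TOTAL - RADIATION
    return TOTAL / (everything_else + RADIATION / radiation_speedup)

radiation_share = RADIATION / TOTAL
ceiling = TOTAL / (TOTAL - RADIATION)   # radiation made infinitely fast

print(f"radiation share of run time: {radiation_share:.2%}")
print(f"4x faster radiation   -> {overall_speedup(4.0):.2f}x overall")
print(f"2.5x faster radiation -> {overall_speedup(2.5):.2f}x overall")
print(f"ceiling               -> {ceiling:.2f}x overall")
```

The ceiling is the interesting bit: with the other 1/3 untouched, no amount of radiation tuning gets the whole model much past 3x.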

Then the authors do a very nice job of attacking that 2/3 with parallel code. I’ll leave the details of that for folks who care to go read the paper. The bit that interested me most (aside from the discussion of the code originally being designed for Cray Vector processing and that STILL being in the code design ;-) was their shift from a wide layers array approach to a narrow columns approach. I’d been pondering making things cell or column oriented so as to make more modular parts that could be more easily distributed. They’ve already done it, and it worked well. Nice to know.
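Here is a rough sketch of the column-at-a-time structure, using a thread pool fed from a queue of columns as the abstract describes. Only the grid size (48×36 columns, 11 levels) comes from the paper; `toy_column_radiation` and the made-up temperature profile are placeholders of mine and have nothing to do with the real FAMOUS radiation scheme.

```python
from concurrent.futures import ThreadPoolExecutor

N_LON, N_LAT, N_LEVELS = 48, 36, 11   # FAMOUS atmosphere grid (Section 2)

def toy_column_radiation(column):
    """Placeholder physics, NOT the FAMOUS scheme: one pass up a column."""
    flux = 0.0
    heating = []
    for temperature in column:
        flux = 0.9 * flux + 1e-9 * temperature ** 4
        heating.append(flux)
    return heating

def make_columns():
    """Invented temperature profiles, one independent list per air column."""
    return [[288.0 - 6.5 * level + 0.01 * (lon + lat)
             for level in range(N_LEVELS)]
            for lat in range(N_LAT) for lon in range(N_LON)]

def run_serial(columns):
    return [toy_column_radiation(c) for c in columns]

def run_pool(columns, workers=4):
    # pool.map queues one task per column and returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(toy_column_radiation, columns))

columns = make_columns()
serial = run_serial(columns)
pooled = run_pool(columns)
print(len(columns), "columns; identical results:", serial == pooled)
```

In plain Python the threads won’t actually run this arithmetic faster (the interpreter lock serialises it); the point is the shape of the work. Each column is a self-contained task, so the same layout maps onto real threads, SIMD lanes, or separate boards in a cluster.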

I found this snippet interesting:

During this conversion process, we deleted unused code sections and a fair number of if-then-else statements in low-level computation routines that select which version of the algorithm is used. This results only in a minor loss in the flexibility of the radiation code because FAMOUS’ configuration is not expected to be changed.

So the original also has a lot of “tuning” available… Just sayin’…

Figure 4 is also interesting. Discussing the effect of “rounding errors” on the final results, we see the trend stays more or less the same, but individual computations can be a few 1/10 C different. (The graph is in K) Since that’s essentially the scale of “Global Warming”, one is left to wonder just how much of the warming really might just come down to rounding variations…

I’m not going to cut / paste the graph (as it passes through GIMP for me) but just cite the text:

Fig. 4. Graphs showing the effects of rounding-errors on the decadal means of a 120yr simulation using three different implementations of FAMOUS using: the Intel standard floating-point unit (original version), the Intel SSE extensions and libsimdmath (sse version), the CELL SPEs (spe version).

When variations in your findings are nearly the same scale AS your findings but originate with the particular hardware on which your code is run, IMHO, that indicates some basic issues in the approach used in the code… Bolding again by me.

5.2 The effects of rounding errors on the SPEs

The single-precision floating point calculations on the SPEs are not fully compliant with the IEEE 754 standard. In particular, the rounding mode of floating-point operations is always truncation, while CPUs typically round the intermediate results to the nearest value. To evaluate the effects of the truncation on the stability of the climate model, a 120yr simulation was performed and the result compared to a reference run. This simulation was forced by historical changes in greenhouse gas concentrations, solar forcing, volcanic aerosols, and a time-varying climatology of sulphate aerosols.

As can be seen in Fig. 4, the decadal mean of the global average surface temperature computed by the spe version (blue line) evolves differently than the output of the reference simulation (red line). However, the results did not show any instability or bias and the statistical differences between the versions are comparable to running the unmodified model on different platforms or with different compiler configurations (see also Knight et al. (2007) for a discussion on how the hardware variation effects the model behavior). The green line in the figure shows the results obtained with the simd version using Intel’s SSE

That it’s the same variation with changes of hardware or compiler configurations does NOT comfort me as much as it does them…
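To see how truncation and round-to-nearest drift apart, here is a toy single-precision accumulation. It is only an illustration of the two rounding modes the paper mentions; the `struct`-based emulation is my own sketch, and a running sum of 0.1 is nothing like the model’s actual arithmetic.

```python
import struct

def f32_nearest(x):
    """Round a Python float to single precision (round-to-nearest)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def f32_truncate(x):
    """Round a Python float to single precision toward zero (truncation)."""
    r = f32_nearest(x)
    if abs(r) > abs(x):
        # Nearest rounding went away from zero: step back one float32 value.
        bits = struct.unpack("<I", struct.pack("<f", r))[0]
        r = struct.unpack("<f", struct.pack("<I", bits - 1))[0]
    return r

def accumulate(step, count, rounding):
    """Add `step` to a running total `count` times, rounding after each add."""
    step32 = f32_nearest(step)
    total = 0.0
    for _ in range(count):
        total = rounding(total + step32)
    return total

nearest_total = accumulate(0.1, 100_000, f32_nearest)
truncated_total = accumulate(0.1, 100_000, f32_truncate)
print("round-to-nearest:", nearest_total)
print("truncation:      ", truncated_total)
print("difference:      ", nearest_total - truncated_total)
```

Same inputs, same number of additions, different answers, purely from how the last bit gets handled. Truncation always errs in one direction for positive numbers, so its errors pile up instead of partly cancelling.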

In Conclusion

I’ll leave the rest of the technical bits for folks to read in the paper itself. For me, the key bits are that the non-radiative code is fairly fast, the radiative is where it sucks cycles, changes in that run-context have significant impact on how a model run “evolves” and they had a lot of tuning available there. Parallel via a column (or by extension grid-cell) centered compute model works. The code is not just old FORTRAN, but old and crusty FORTRAN. It is very sensitive to run time environment (and that’s a very bad thing…) but folks in the field are “OK with that”… for some reason…

All that, and the fact that I’m now about 2 to 3 years ahead of where I was a few days ago, makes me one very happy camper!

Now I think I can finally get to bed ;-)


About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present "hot buttons" are the mythology of Climate Change and ancient metrology; but things change...

26 Responses to A FAMOUS/HadCM3 Parallel Computing Paper That Has Saved Me Years

  1. Lionell Griffith says:

    EM: I didn’t know code could be that bad

    Oh, it can get much much worse. I suspect you already know that.

    Yes, it boils down to know your hardware, know your OS, know your compilers/assemblers, know your algorithms, and then use every trick known to man plus some new ones to optimize the hell out of the code. Then avoid what I call sub-optimization in which you spend a lot of time on code that, even as bad as it is, it doesn’t take much time.

    I started with computers when there was no choice but to do that in order to get your results soon enough to matter ca 1965. The computer I had to work with had a 4 MILLISECOND cycle time and it was supposed to be a real time computer that was to be used to run an entire industrial process or laboratory. Every instruction that was not necessary to execute had to be eliminated and the remainder had to execute in the fewest passes possible.

    I learned a lot the hard way from direct hands on experience. It is surprising how much the writing of good HPC code can become automatic when you really care how much time it takes to execute the process. Then spend a half a life time doing it.

  2. Lionell Griffith says:

    EM: one is left to wonder just how much of the warming really might just come down to rounding variations…

    There is no mystery about it, when you are looking for largely irrelevant differences between large numbers to panic about, ignored rounding errors can scare you to death. Which, I am convinced, is why rounding errors are so often ignored in the so called climate models.

    There is a huge difference between software that really works and software that simply has behavior. That difference offers a serious possibility that it will destroy technological civilization and force humans into extinction simply because it is so much ignored. AI I am not worried about; rounding errors are the stuff of nightmares. Add to that sloppy coding and inattention to the important small stuff, and the results can be fatal.

  3. Power Grab says:

    FOCUS? Really? I still run my Focus jobs every day. They’re so efficient to run. I love that system! I launch the whole series of extracts with one command, then wait around 10 minutes. Then I use a DOS batch file to download 80-100 files. The system was supposed to vanish from our landscape about 14 months ago, but it’s still there. The migration has not been without bumps. As long as the system is still there, I’m going to keep using it. There are some pieces of data I can’t find in the new system. They apparently didn’t migrate everything to the new system.

    The rest of the stuff I use is on Windows-based systems where I have to touch every file by hand every time I download it. It sure seems like a huge step backwards.

  4. E.M.Smith says:

    Yup. FOCUS, from before micro-Focus existed.

    I really like the system and the language. I wish there were something like it on Linux in the Libre space, but there isn’t. Frankly, mySQL looks primitive in comparison, to me.

    The ability to extract, manipulate, combine, and report; with substantially simple English like commands, was / is extraordinary. I taught classes in it for years and did “internal consulting” at Amdahl to their programming staff. Prior to that I was an on-the-road consultant in RAMIS-II that is (was) a strong look alike / work alike that existed first. (The Ramis guy left and formed Information Builders way too fast… rumors of a tape of source code abounding… )

    Oh Well…

  5. Steven Fraser says:

    EM: Had an ‘aha’ while reading this. The analysis of ‘parallelizable’ tasks, and completion dependencies, is PMBOK-style project management planning. If the running program is thought of as a ‘project’, the time to completion is critical path analysis, which can be modelled in something as familiar as MS Project.

    I smile. This ‘aha’ is fun.

  6. E.M.Smith says:

    That’s an interesting insight.

    Yes, there is a management aspect of parallel computing. Size tasks for each processor. Assign and manage them. Integrate results for the final project to complete…

    Per Non-Grid approach:

    I was thinking about “a better way” and that this paper found “columns” worked well for radiation. Could that be extended?

    Say we look at the problem as a set of columns, each one large (like that planetary map of the earth on a dodecahedron, so movement of things between columns doesn’t matter much). Then each can be viewed as a large vertical heat pipe, and the parameters are minimal, IMHO.

    There’s solar variation and average angle of a column to the sun. That’s the input power.

    Then there’s transmissivity of the air. LW and SW to the ground surface. This is partly driven by degree of clouds and water vapor (that will change over the life of the model, but in negative feedback mostly).

    The input energy that makes it to the ground is absorbed or reflected, so we have surface albedo and absorptivity. Both can likely be covered by “surface composition” initially just plants, dirt, ice, or water (as percentages for columns with mixed types). Now you have how much energy goes into the surface and air. Treat the dirt as a capacitor with a resistor of the surface type (i.e. plants pretty much insulate it, concrete / rock not so much). The rest ends up in the air.

    The Troposphere is substantially LW opaque, so it’s treated as a convective / evaporative / condensing heat pipe structure. Water dominating all else. At the top of the tropopause, we transition to a radiative mode at the stratosphere and above, with CO2 radiating outward (inward is absorbed in the cloud tops and tropopause and just goes back up…)

    So essentially for each column, the surface / troposphere needs to be characterized as to the “wet performance” of the local heat pipe.

    That gives you most everything of what you need to figure the heat flow. I think. ;-)

    Input, transmissivity, absorptivity, (side storage in dirt in the long seasonal cycle, about a 3 month lag), heat piping, radiation out.

    Ought to be enough to establish general balances and show cool ocean air, hot humid tropics, hotter dry deserts, etc.

  7. Larry Ledwick says:

    Actually you might have to just define two heat pipes.
    One centered over land
    One centered over water.
    Add parameters for angle of sun and average albedo of the bottom of the column and clarity of the atmosphere.

    Sum all the land based columns with similar albedo and emissivity at various wave lengths.
    Sum all water based columns with a factor for the clarity of the water (i.e. its turbidity and hence albedo and absorption/reflection of various incoming wave lengths).

    If done that way, you have a massively parallel computation.

    The key question is how many different groups of columns do you need?
    Average reflectivity of earth scenes is 18%, but snow and white sand might be 98%, dark basalt might be 5%. How many steps in the albedo values are needed?
    Black and white film can only record 11 different densities which are sufficiently different to be seen as different gray tones to a human.

    Might that be sufficient to give representative characteristics to the columns?

    73% of your columns would be water based
    27% of your columns would be land based

    Do like rainbow tables and pre-calculate the important characteristics of all possible water based and land based columns, for each step of the various parameters, then just assign each column to a bin.
    I bet it would be a lot faster to look up the characteristics of a specific column type than to calculate it each time for each column based on the parameters for that specific column.

  8. pouncer says:

    “Actually you might have to just define two heat pipes.
    One centered over land
    One centered over water.
    Add parameters for angle of sun and average albedo of the bottom of the column and clarity of the atmosphere.”

    Coming back to my Dodecahedron model … When we get a model reduced to the point I can run it in MS Excel I may actually become useful in this project.

    Steve Mosher suggests his research indicates (note weasel words, mine not his) that a grid cell average temperature can be completely predicted from altitude and latitude. All ocean cells of course have zero altitude. Cells in the Rockies mirror those in the Andes — just southern latitude instead of northern. They also may mirror many in Central Asia.

    It would be interesting to check that claim, first. See if the US cells do in fact mirror Chinese and Chilean cells… If so, then the problem of having too many cells to compute is much reduced.

  9. p.g.sharrow says:

    @Larry; interesting concept, I need to examine ways to pixelate a grid cell to carry the values needed. I have a friend that has been working with people that aerially survey farm fields to determine exact cultural needs in the field down to exact location of individual weeds. Precalculating surface values sounds to be a good way to speed secondary examinations of changing conditions or different concepts.
    Changing vegetation, changing snow coverage, changing water conditions. etc !.
    yuk ! the more I think about the variables just involved with surface conditions :-(… pg.

  10. Larry Ledwick says:

    I thought after making that post above, you could build a multidimensional array for the columns but as you mention that geometric progression of possible values gets really big really fast if you do it all in a single lump.

    In the old punch card days, we set things up to block sort the punch cards on the first digit of the data field then sub sort each group, reducing the sort workload by a factor of 10. I once got the “privilege” of sorting 186,000 punch cards by hand – took me the better part of the day but it was manageable.

    By similarly breaking the arrays down by “types”, for example a set of arrays for water based columns, a set of arrays for land based columns under 300 ft elevation, 300 – 1000 ft elevation, etc., you could make the array sizes a bit more manageable perhaps, but as mentioned the first task would be to figure out what parameters to use for the column specifications and array values for each dimension.

    From the sound of Steve Mosher’s comments, you would want a first order determination of land or water based column, next latitude, then third altitude of the location, and that would be sufficient to define an array type that was appropriate for that cell/column (perhaps with a value for average albedo for the time of the year derived from analysis of satellite earth images over a year cycle).
    Also an array value for cloud cover but how to order those factors would take some serious think time to consider order of calculation etc. to reduce the total amount of data to a manageable quantity and still be able to look up results values faster than you could calculate the same value from the raw input data.

    Mosher’s observations make sense within certain limitations because for a given altitude and humidity the lapse rate would be the same pretty much world wide since the atmosphere is pretty uniform except for humidity. That said, average humidity at say 3000 ft elevation will be much different in British Columbia and the eastern plains of Colorado, so the difference in dry adiabatic lapse rate and saturated adiabatic lapse rate would mean that the two locations would have different average temperatures for a given elevation.

    For example Kirkwood Mountain Resort in California with a base elevation of 7800 ft got 804 inches of snow during the 2005-2006 ski season. It is at 38.6 deg North latitude; this is about the same elevation and latitude as Cheyenne Mountain just south of Colorado Springs, which gets less snow than Denver. Evergreen Colorado just outside Denver at a similar altitude and just 60 some miles north only gets about 80 inches of snow each year due to its mid-continent alpine climate versus the much wetter climate in the Sierras just miles from the Pacific ocean.

    I think it would be obvious that there would also have to be some differences of seasonal climate averages for a given altitude and latitude, depending on the local climate region and availability of moisture.
    It is going to take a lot longer to melt off 800 inches of wet snow than 80 some inches of dry powder snow. Likewise those snow packs are going to represent much different latent heat of phase change both as they form and melt.

  11. p.g.sharrow says:

    Every time I visualize atmospheric conditions I see the Tropopause at 35,000ft at the equator and nearly sea level at the poles. This results in various altitude levels of conditions migrating with latitude and season. Snow levels at Kirkwood and Vail are nearly the same but 100 inches of Sierra Cement are a great deal different to 100 inches of powder in their energy and water content. This also has a great deal to do with the tenacity of the snow fields in the face of the warming sun. Glaciers are created by heavy snow falls not cold weather. I once lived in glacier country in Alaska. It was heavy warm snow that built them up and not cold temperatures that made the difference in glacial build or retreat…pg

  12. E.M.Smith says:


    I think the ice based columns would need a separate treatment. Likely near zero heat piping…

    I’d missed a trick on the Rainbow Tables idea. As I’m partial to pw cracking, that one ought to have popped for me… usually I cross connect knowing-domains better than that… One can pre-compute many parts, like heat pipe performance and evaporative surface humidity, then they just become lookups. Even the existing radiative codes could be done that way. Then it is just a CPU vs IO race to decide which method to use… Put big SSDs on each node and even huge lookup tables can be fast… Hmmmm….

    BTW, IIRC, the range for BW print paper is only 7 zones…. so even less granularity yields very good images. I doubt we need more resolution than an Ansel Adams print….

    And yes: accounting for water flows (humidity, vapor, cloud, precipitation, snow, snowpack, …) is critical as it is THE major driver. Get the water right, the rest ought to work out ok. I see a Water World model in columns as the general structure, the rest fine tuning.


    Hmmm… I’d expect similar, but not identical. Similar enough? Likely. Think of snowy mountain tops world wide. Yet I would expect a mountain in the tropics to need more altitude to snow line given the greater tropopause height…. Cross check would be easy. Look up snow line and treeline for major mountains globally, list by latitude. Select outliers to contemplate. That sounds like an Excel problem to me! 8-)

    I find it amusing that it’s a Mosher point… as it has no room for CO2 in it.

    Might be interesting to express snowline as a fraction of tropopause altitude and see if it is a constant. Tropopause can be ground level in the arctic areas, above jetliner cruising altitude in the tropics. Varies with season though, so likely winter value needed.


    I was contemplating that yesterday… cold wind and outside watching dogs doing midday duty… barefoot. On the cement walkway that had been mostly shaded, cold feet and cold torso. Step a foot over on sunny grass, warm feet insulated from the night heat dump of the cement. Move 20 feet to the rubber mat on the entry in the full sun, but blocked from wind by house and garage (warm corner) it was suddenly a nice still air early summer day… all of me warm. All in 20 feet. Yet we call it one temperature for the whole town.

    At some point you must stop chasing fractals and declare a coastline length “for all practical purposes”… I would do that via energy in, fraction to reach ground, fraction retained, transport lag. Let the details sort themselves out.

  13. pouncer says:

    “I’d expect similar, but not identical. Similar enough? Likely. That sounds like an Excel problem to me! 8-) ”

    Yeah, and it seems like much of the data collection has already been done. BUT, for example,
    cities by latitude (example, Chicago) on Wikipedia shows

    Chicago  Illinois 41°53′N 87°38′W

    while the same Wikipedian pile of random knowledge with the nugget of cities by latitude shows Chicago as

    Chicago Americas N41.8369 W87.6847

    That the formatting for the lat (and long) is different, is not enough. The values are different. At the very least with different precision. So making a “match” falls to the text field on city name rather than the latitude field. At least until I fix the format. And enable “similar, not identical” matches. And since only about 180 world cities are documented in the Wiki with altitude while over 1000 are documented by latitude the initial cut at the problem throws away 90% of the handy data…

    One really has to love this sort of thing to look past this sort of disappointment.
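A minimal Python sketch of the "fix the format, then allow similar, not identical, matches" step described above. The two input formats are the ones quoted for Chicago. The function names and the 0.1 degree tolerance are assumptions chosen for illustration, and real tables would need more formats handled (seconds, missing hemisphere letters, and so on).

```python
import re

def dms_to_decimal(text):
    """Convert a degrees/minutes string such as '41°53′N' to signed decimal degrees."""
    m = re.fullmatch(r"(\d+)°(\d+)[′']([NSEW])", text.strip())
    if m is None:
        raise ValueError(f"unrecognized degrees/minutes coordinate: {text!r}")
    value = int(m.group(1)) + int(m.group(2)) / 60.0
    return -value if m.group(3) in "SW" else value

def prefixed_to_decimal(text):
    """Convert a hemisphere-prefixed string such as 'N41.8369' to signed decimal degrees."""
    m = re.fullmatch(r"([NSEW])(\d+(?:\.\d+)?)", text.strip())
    if m is None:
        raise ValueError(f"unrecognized prefixed coordinate: {text!r}")
    value = float(m.group(2))
    return -value if m.group(1) in "SW" else value

def same_place(lat_a, lon_a, lat_b, lon_b, tol_deg=0.1):
    """Treat two positions as the same city if both coordinates agree within tol_deg."""
    return abs(lat_a - lat_b) <= tol_deg and abs(lon_a - lon_b) <= tol_deg

if __name__ == "__main__":
    by_latitude = (dms_to_decimal("41°53′N"), dms_to_decimal("87°38′W"))
    by_altitude = (prefixed_to_decimal("N41.8369"), prefixed_to_decimal("W87.6847"))
    print(by_latitude, by_altitude, same_place(*by_latitude, *by_altitude))
```

With both lists reduced to signed decimal degrees, the join can fall back on position whenever the city-name text differs between the two tables.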

  14. pouncer says:

    Oops, second pile of Wiki data should be identified as “cities by ALTITUDE”. Disappointing only in that, although the army of volunteers collected it, they did so without regard to data format standards.

  15. E.M.Smith says:

    Hmmm… looking again here:


    The list of subroutines is missing something…

          CALL CONDSE                                                        141.   
          CALL PRECIP  
    C**** RADIATION, SOLAR AND THERMAL                                       149.   
          CALL RADIA                                                         150.   
    C**** SURFACE INTERACTION AND GROUND CALCULATION                         157.   
      400 CALL SURFCE                                                        158.   
          CALL GROUND                                                        159.   
          CALL DRYCNV                                                        160.   
    C**** STRATOSPHERIC MOMENTUM DRAG                                        166.   
          CALL SDRAG                                                         167.  
    C**** SEA LEVEL PRESSURE FILTER                                          177.   
          CALL FILTER                                                        181.  
    c  ** This code calls the ocean diffusion in the deep ocean mode 
            CALL ODIFS
          CALL OSTRUC                                                        206.   

    They have precipitation, and dry convection… but where is moist convection?

    I don’t think I cut that out… but I need to go back and search the code again. Make sure I didn’t miss it somewhere…

    Since IMHO, moist convection is what drives the whole heat pipe, if it is given short shrift, that could leave a lot up to the CO2 knob that really is water…

  16. E.M.Smith says:

    I think I found it… maybe… in Mjal2cpdC9.f


          SUBROUTINE ADVECQ (PA,QT,PB,Q,DT1)                                3501.
    C****                                                                   3502.
    C**** CURRENT AIR MASS FLUXES          
    C**** END OF HORIZONTAL ADVECTION LAYER LOOP                            3581.
    C****                                                                   3582.
    C**** VERTICAL ADVECTION OF MOISTURE                                    3583.
    C****                                                                   3584.
          DO 1715 L=1,LM-1                                                  3585.
          IMAX=1                                                            3586.
          LP1=L+1                                                           3587.
          DO 1715 J=1,JM                                                    3588.
          IF(J.EQ.JM) IMAX=1                                                3589.
          DO 1710 I=1,IMAX                                                  3590.
          FLUX=DT1*SD(I,J,L)*2.*Q(I,J,L)*Q(I,J,LP1)/(Q(I,J,L)+              3591.
         *  Q(I,J,LP1)+1.E-20)                                              3592.
          IF(FLUX.GT..5*QT(I,J,LP1)*DSIG(LP1)) FLUX=.5*QT(I,J,LP1)*         3593.
         *  DSIG(LP1)                                                       3594.
          IF(FLUX.LT.-.5*QT(I,J,L)*DSIG(L)) FLUX=-.5*QT(I,J,L)*DSIG(L)      3595.
          QT(I,J,L)=QT(I,J,L)+FLUX/DSIG(L)                                  3596.
     1710 QT(I,J,LP1)=QT(I,J,LP1)-FLUX/DSIG(LP1)                            3597.
     1715 IMAX=IM                                                           3598.

    Now it looks to me like they are moving the air mass to move the moisture, when, IMHO, it is the moisture as vapor that moves the air mass… it ought not be advecting, it ought to be convecting mostly.
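For readers who do not read fixed-form Fortran, here is a hedged Python sketch of what the vertical advection loop quoted above does for a single column. The variable meanings are my reading of the names, not confirmed by the source: `sd` as the vertical mass flux at each layer interface, `dsig` as the layer thickness in sigma coordinates, `q` as the humidity used to build the flux and `qt` as the humidity being updated. The function name and the sample values are made up.

```python
def advect_moisture_column(q, qt, sd, dsig, dt1):
    """Move moisture between adjacent layers of one column; returns the updated qt.

    q    -- humidity per layer used to form the flux
    qt   -- humidity per layer being updated (a copy is modified)
    sd   -- assumed vertical mass flux at the interface above each layer (len(q) - 1)
    dsig -- assumed layer thickness per layer
    dt1  -- time step
    """
    qt = list(qt)
    for l in range(len(q) - 1):
        lp1 = l + 1
        # Interface humidity is twice the product over the sum of the two
        # layers, i.e. their harmonic mean; 1e-20 guards against 0/0.
        flux = dt1 * sd[l] * 2.0 * q[l] * q[lp1] / (q[l] + q[lp1] + 1e-20)
        # Limiter: never remove more than half of the donor layer's moisture.
        flux = min(flux, 0.5 * qt[lp1] * dsig[lp1])
        flux = max(flux, -0.5 * qt[l] * dsig[l])
        # What one layer gains the other loses, weighted by thickness.
        qt[l] += flux / dsig[l]
        qt[lp1] -= flux / dsig[lp1]
    return qt

if __name__ == "__main__":
    q = [0.010, 0.006, 0.002]      # made-up humidity profile
    dsig = [0.3, 0.4, 0.3]         # made-up layer thicknesses
    sd = [0.5, -0.2]               # made-up interface mass fluxes
    print(advect_moisture_column(q, q, sd, dsig, dt1=1.0))
```

Seen this way, the routine is pure bookkeeping: the sign and size of the transfer come entirely from the air mass flux `sd`, and the moisture only sets how much rides along.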


    There’s also this bit higher up:

    C     IF(MOD(NSTEP,NCNDS).NE.0) GO TO 400                                140.5
          CALL CONDSE                                                        141.
          CALL CHECKT (3)                                                    141.5
          CALL PRECIP                                                        142.
          CALL CHECKT (4)                                                    142.5
          CALL CLOCKS (MNOW)                                                 143.
          MINC=MLAST-MNOW                                                    144.
          MCNDS=MCNDS+MLAST-MNOW                                             145.
          MLAST=MNOW                                                         146.
             IF(MODD5S.EQ.0) CALL DIAG5A (9,NCNDS)                           147.
             IF(MODD5S.EQ.0) CALL DIAG9A (3)                                 148.
    C**** RADIATION, SOLAR AND THERMAL                                       149.

    Where the comment clearly acknowledges moist convection, but then just does CONDSE and PRECIP….

    Looks like a more detailed code analysis is needed, but I’m not seeing good handling of moist convection as causal of the other processes.

  17. E.M.Smith says:

    Looks like Model II handles moist convection as part of the condensation routines. I think I’ve found it in Pjal0C9.f

          SUBROUTINE CONDSE                                                 2001.
    C****                                                                   2002.
    C**** AND SUPER SATURATION CLOUDS.                                      2005.
    C**** MOISTURE (SPECIFIC HUMIDITY)                                      2178.
          QL(L)=Q(I,J,L)                                                    2179.
          QM(L)=Q(I,J,L)*AIRM(L)                                            2180.
          QMOLD(L)=QM(L)                                                    2181.
          COND(L)=0.                                                        2182.
          CDHEAT(L)=0.                                                      2183.
          DM(L)=0.                                                          2184.
          DSM(L)=0.                                                         2185.
    C**** MOIST CONVECTION                                                  2202.
    C****                                                                   2203.
    C**** UNSTABLE LEVEL.                                                   2208.

    Still looks a bit after the fact and only as a precipitation effect to me, where IMHO reality is more humidity driven even on days with no rain…

  18. Soronel Haetir says:

    I find the comments about rounding mode interesting, in that x86 has given control over that to the programmer for years (at least back to the original Pentium, probably much earlier in the x87 – yes 7 – life cycle). I would have expected these authors to be aware of that, I have to wonder if they deliberately glossed over that due to working for Sony.

  19. Larry Ledwick says:


    I think the ice based columns would need a separate treatment. Likely near zero heat piping…

    In the case of low humidity, high altitude and direct sun you have to account for sublimation. I have seen days here in the Denver area where several inches of snow sublimated and the temperature never got above zero deg F.

    Sometimes the effect is so strong you get significant ground fog on low albedo surfaces like pavement but air temp is well below freezing.

  20. E.M.Smith says:


    We know who they work for and it isn’t Sony.

    Sony never liked folks using the PS3 for general computes and even took steps to lock folks out of the processor. Eventually they just killed it and went to a different processor. As my paranoid muse, I wonder if that was at the request of the US Govt., but there is no evidence for that other than the prior history of the USG trying to block general access to massive computes.

    The x87 is just the math co-processor (originally discrete hardware, then instruction set). Now largely incorporated into the FPU of the CPUs.

    In ALL hardware there are rounding issues. Often they can be mitigated by some means, but they are never gone. The key point in the paper NOT being a slam on Intel CPUs (they are IEEE compliant) but pointing out GPUs (pretty much all of them; and the Intel SSE instructions are aimed at the same GPU functions since they are for graphics acceleration) are NOT IEEE; so you get different results. Same thing about the Sony SPE where precision is again a bit different.

    So no grand deception. Just investigating the use of GRAPHICS oriented hardware (GPU, SPE, and SSE instructions) for general purpose math. As this is an abuse of the original intent (fast graphics where non-IEEE math does not hurt) one must look at the impact of that graphics type of math, rounding, and precision to know if it hurts, or not.

    The purpose being to find ways to use all that much faster hardware (like 10x faster) for general computing without getting bogus results. You are, of course, free to just buy dozens more CPUs and use their IEEE compliant FPUs and ignore the rest of the hardware; IF you have the money for it… but most of us don’t.

    GPU programming has “taken off” in the last decade or so. Expect a lot more of this kind of analysis of ‘issues’ from it. For every problem currently CPU bound on regular hardware, as folks move to CUDA and OpenGL, OpenCL, and others, there MUST be this kind of analysis of the effect of non-IEEE math on the results of codes originally designed for CPUs with IEEE compliant processors.

    I hope eventually nVidia and others will make GPU like cards with IEEE math in them, but until that happens, any exploration of GPUs and GP instructions ( like SSE ) will be faced with this problem and need this kind of comparison.


    Oh, yeah…


  21. cdquarles says:

    @Larry, yeah, you’d think ‘climate scientists’ would not be so quick to avoid sublimation; but then, again, they are quick to avoid water and go for carbon dioxide.

  22. Soronel Haetir says:

    I am referring to their statement that the rounding mode is not IEEE-754 compliant. It is true that nearest-value is the default, but you can set Intel processors to use truncation instead. I don’t know what, if any, performance issues that might have.

  23. Soronel Haetir says:

    And how do you figure that something from “Sony Computer Science Laboratory” is not by folks working for Sony? I wasn’t talking about the authors of FAMOUS, only this parallelized version.

  24. E.M.Smith says:

    Sorry, I was using IEEE as shorthand for the whole long IEEE-754 ( I think there is only one current IEEE floating point standard for general use… but I could easily be wrong and just have a narrow exposure…) Checking the wiki it does seem to come in eras, though:

    “The current version, IEEE 754-2008 published in August 2008, includes nearly all of the original IEEE 754-1985 standard and the IEEE Standard for Radix-Independent Floating-Point Arithmetic (IEEE 854-1987).”

    The whole point is that NORMAL CPU based processing uses rounding nearest in compliance with IEEE-754. Pretty much everything written prior to the present move to GPUs is based on the regular CPU / FPU that is IEEE-754 compliant (at least since the mid ’80s). It is the act of porting those codes (or writing new ones) to run on GPU / SSE / SPE math that’s the issue. It is THAT which the paper is addressing. The LACK of round nearest in GPU / SSE / SPE based processing, along with the shorter bit length in most. (In the Pi it’s the 24 bit length truncate math used IIRC).

    The paper is uninterested in turning your CPU into a “GPU like Math engine”. (Near as I can tell, nobody wants to go that direction. I can’t think of a reason to do so; though I’m sure there are some odd edge cases where it might benefit someone). That the CPU can do IEEE-754 and round up, down, nearest, or truncate is just not the question being addressed. It’s the GPU / SSE / SPE limitation to only truncate and to (sometimes) shorter bit length that’s the direction of porting and thus the direction of interest and exploration.
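To put a rough size on "truncate vs round nearest" at a 24 bit significand, here is a small Python sketch that emulates both in software. It is an illustration under stated assumptions, not a model of any particular GPU or SPE: `to_f32_nearest` uses the `struct` module's 32 bit float packing, `to_f32_truncate` simply zeroes the low 29 bits of a 64 bit float (ignoring the narrower 32 bit exponent range), and the repeated addition of 0.1 is an arbitrary toy workload.

```python
import struct

LOW_29_BITS = (1 << 29) - 1
KEEP_MASK = 0xFFFFFFFFFFFFFFFF ^ LOW_29_BITS

def to_f32_nearest(x):
    """Round a 64-bit Python float to 32-bit precision (round to nearest)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_f32_truncate(x):
    """Chop a 64-bit float to a 24-bit significand by zeroing its low 29 bits."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return struct.unpack("<d", struct.pack("<Q", bits & KEEP_MASK))[0]

def accumulate(step, count, rounder):
    """Add step to a running total count times, rounding after every operation."""
    step = rounder(step)
    total = rounder(0.0)
    for _ in range(count):
        total = rounder(total + step)
    return total

if __name__ == "__main__":
    nearest_sum = accumulate(0.1, 10_000, to_f32_nearest)
    truncated_sum = accumulate(0.1, 10_000, to_f32_truncate)
    print("round to nearest:", nearest_sum)
    print("truncate        :", truncated_sum)
```

Truncation can only ever err downward for positive values, so its error accumulates in one direction, while round to nearest errs in both. That one-sided drift is the sort of thing a long iterative model run would be sensitive to.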

    My only reason for even mentioning it, really, was the following line where they say it had about the same impact as compiler flags and word length (platforms chosen). That puts a “size of the problem” with the code (same as 24 vs 32 bit at a minimum and truncate vs round) while demonstrating that the model code is very sensitive to compute context (which implies poorly thought out in terms of impacts of “computer math” on their calculations… i.e. not thinking about binary conversions, truncation, rounding, epsilon, etc. etc. That is, bad programming for the problem.)

    So there is no grand conspiracy to have Sony suppress the knowledge that Intel (like every other general purpose CPU maker ) does IEEE-754 math. It is only looking at the effect of using the GRAPHICS NON-IEEE-754 math on existing codes when ported to those kinds of hardware. That’s the only thing going on anyway. (Nobody is trying to port their GPU code to IEEE-754 CPUs as you lose performance and it isn’t trendy anyway…)

  25. E.M.Smith says:

    As I parse this:

    1 Sony Computer Science Laboratory, Paris, France 2 Met Office, Exeter, UK 3 University of Oxford, Oxford, UK

    *now at: Laboratoire de Recherche en Informatique, Orsay, France

    **now at: Laboratoire de Météorologie Dynamique, IPSL, CNRS/UPMC, Paris, France

    ***now at: School of Geography, Politics and Sociology, Newcastle University, Newcastle, UK

    I get that 1 is the University of Oxford employer working in a particular LAB named after Sony (and perhaps with some donation $$ so not entirely without question, but just as possibly just a free boat load of Sony equipment donated – a Dig Here.)

    Another at a French Lab in information research.
    Another at the Dynamic met Lab in Paris.
    Another at Newcastle University…

    Not seeing “Employees of Sony Corp.” in that…

    But maybe I’ve got a line break in the wrong place and 1 is Sony while 3 is Oxford?

    It’s like saying if I looked up something in the Chase Library at some school I was an employee of Chase Bank… near as I can parse things.

  26. E.M.Smith says:

    Ah, looks like I have it parsed wrong. There is a Sony Lab Paris
    that does look to be part of Sony

    So I guess it’s supposed to break at the digits?

    1 Sony Computer Science Laboratory, Paris, France
    2 Met Office, Exeter, UK
    3 University of Oxford, Oxford, UK

    In which case I’d just point out that the TEAM is not all Sony even if someone on the team might be, and still no motivation for Met Office or Oxford to promote a Sony Conspiracy to slam Intel.

Comments are closed.