I’ve been pondering computer climate models for a while now. In particular, how to make one run in a more parallel way on a cluster of smaller computers (instead of using a giant $Millions money sucking monster made of Unobtanium and beyond the reach of Just Regular Folks).
Why? Well, because it’s what I do… I’ve been immersed in the world of HPC High Performance Computing in one way or another for a decade or two and my specialty for some of that time was “efficiency reviews”. Looking at some poor chunk of Big Iron near melt down from some aspect of code “not caring” about it, and finding ways to give it a significant tune up.
Sidebar on Databases:
HPC is usually focused just on the “compute” side. Engineering and Scientific compute heavy loads. My start in the area was on IBM Big Iron running database programs. What today might be called “Big Data”. Massive farms of disks full of data, and the big Mainframe sifting through it.
One particularly satisfying contract had me at the California State Architects Office. They had a mainframe running at 100% load and at times bogging down enough to be painful. This machine was largely running the FOCUS DBMS, my product specialty at the time. Over about a week I took my (personally developed) 26 point checklist and began tuning.
Now they were faced with the expectation that they would need to buy ANOTHER $4 Million computer. They were quasi-accepting of this until the data center informed them they had no place to put it and it would need to wait a year or two… Since the idea of a year or two of pain and delay bothered even State employees, I was considered a small price to pay for seeing if I could buy them a bit more lifetime. The goal was to keep the load below 90% for the next year.
When I was done, the machine was running at 4% load and everyone was quite happy; some were a bit astonished. (Me among them… I didn’t know code could be that bad ;-) Well, needless to say, the results from tending to efficiency questions have been important to me ever since.
How was this “miracle” worked? Largely by a bunch of simple and obvious things, once you bother to think about it. Select records prior to sorting them. Put index fields on frequently selected and sorted items to make finding them faster. 26 of those were what I’d identified over my years of working with the product. Each Senior Consultant was expected to develop a “specialty” and I’d decided mine would be “efficiency”, so that’s what I collected. Any time I saw an efficiency “trick”, I’d add it to my list.
Turns out that selecting 1000 records out of a Million or two THEN sorting, can save a lot of time over sorting a Million records ;-) Especially if you added an index key on that frequently selected field so you don’t have to read the whole database even to find the 1000 records…
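Just to put rough numbers on it (my own back-of-the-envelope, assuming a typical N log N sort): sorting 1,000,000 records costs on the order of 1,000,000 × 20 ≈ 20,000,000 comparisons, while sorting 1,000 pre-selected records costs about 1,000 × 10 ≈ 10,000. Call it a factor of 2,000 just from selecting first. Put an index on the selection field so you don’t scan the whole million records to find the 1,000, and the disk reads drop by a similar sort of factor. None of it is clever; it’s just paying attention.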
In short: Knowing what the machine is doing and how it works matters. Knowing how to be smart in what you ask it to do matters too.
That is the essence of HPC. Know your hardware. Know what you want done. Tell the hardware to do it in a hardware friendly smart way.
The tendency of folks like Microsoft (and, really, most vendors these days) to ignore exactly that is the source of my resentment toward them. I’ve spent a very large part of my professional life cleaning up after that kind of thinking. It became trendy in the 90s to think that “hardware is cheap” and not care about efficiency outside the area of HPC. IMHO, that’s “exactly wrong”. Now that Moore’s Law is topping out, there is a large pool of horridly bad code just waiting for the efficiency miners to go to work and make it good code. I know it can give huge returns…
The climate model folks do seem to put some effort into making their models run with reasonable efficiency. However, it is also clear that they are more than happy to just toss more $Millions at even bigger computers and “improve” results by adding more cycles to loops with more steps of the same old computes, but smaller grid cells. They get Bragging Rights for the bigger hot box, newer tech, bigger budget. So why not?
Even in the world of HPC the tendency has been to lean on the Moore’s Law crutch too much. Just wait a year or two, buy a bigger box, run the same old same old program with double the grid cells and publish “new” results… Easy peasy.
But “times move on”. Even in HPC we’re running out of faster machines. In the last decade or two it has all been a move to “Massively Parallel”. My Cray in the late 80s / early 90s was a 4 processor box. That was the end of the Big Processor era as the field moved into ever more processors. Most of the Neat Tricks of Cray moved into things like high end Intel / HP / PowerPC cores. Heavily pipelined instructions (executing several instructions in steps, with the steps overlapping) and using a sort of a Vector Unit via things like SSE instructions. So folks just started sticking hundreds of them in a box. Then thousands. Then tens of thousands. We’re now up in the 100,000 cores in a box range.
Yet Amdahl’s Law still applies. (I worked at Amdahl Corp. for some years. That’s where I first started using Unix and my whole life path shifted from Database stuff to All Things *Nix and Big Iron). In reality, it is very hard to do most kinds of computer stuff in parallel. Over the years, the tools to do it have improved a lot, but it is still the case that most of the time for most kinds of work, at about 4 cores on a given job you have gotten most of the parallel bits serviced and the sequential bits are now dominant.
For work suited to parallel processing, at about 4000 cores you start to hit a wall. (That’s why most US Supercomputers top out at 4000 to 8000 cores. More isn’t buying you much, or anything. The Chinese have been making these monster boxes packed with 100,000 cores mostly so they can brag on them. It would be more effective to let their 100 researchers running 100 jobs run on a cluster of 25 machines of 4000 cores each, but “whatever”…)
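To put a number on that wall (my illustrative figures, not measurements from any particular machine), Amdahl’s Law says the best speedup you can get on N processors is:

Speedup(N) = 1 / ( (1 - p) + p/N )

where p is the fraction of the job that can actually run in parallel. For a typical job with p around 0.75, the ceiling as N goes to infinity is 1 / 0.25 = 4, so about 4 cores gets you most of what there is to get. Even for a very parallel-friendly job with p = 0.999, the ceiling is 1,000 times, and by 4,000 cores you’re sitting at roughly 800 times while paying for 4,000 cores. Past that point, more cores mostly just make heat.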
There is a very, very small set of jobs that are “embarrassingly parallel”. (That’s the official jargon! Honest!) Only those really benefit from more than 1000 cores. For weather and climate models, the maximum parallelism that can readily be reached is roughly proportional to the number of cells (but even then, only if you write your code in such a way that each cell “does the math” separately from the others…). In most cases, even the separate “cells” of a job have dependencies on their neighbor cells such that you can’t compute them ALL at once, but in groups or waves.
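As a toy illustration of that “groups or waves” idea (my own sketch, not anything from a real model), here’s a red-black (checkerboard) sweep in C with OpenMP. Every cell depends on its four neighbors, so you update all the “red” cells in one fully parallel wave, then all the “black” cells in the next, and no cell ever reads a neighbor that is being written in the same wave:

/* Toy 2-D update with neighbor dependencies.
   Red-black ordering: two waves per step, each wave fully parallel.
   Compile with:  gcc -O2 -fopenmp waves.c                            */
#include <stdio.h>

#define NX 48
#define NY 36

static double t[NX][NY];                 /* some field, say temperature */

static void sweep(int colour)
{
    #pragma omp parallel for collapse(2)
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            if ((i + j) % 2 == colour)   /* only this wave's cells */
                t[i][j] = 0.25 * (t[i-1][j] + t[i+1][j]
                                + t[i][j-1] + t[i][j+1]);
}

int main(void)
{
    for (int j = 0; j < NY; j++)         /* a warm edge to diffuse from */
        t[0][j] = 300.0;

    for (int step = 0; step < 1000; step++) {
        sweep(0);                        /* "red" wave   */
        sweep(1);                        /* "black" wave */
    }
    printf("centre cell: %f\n", t[NX/2][NY/2]);
    return 0;
}

The point isn’t the physics (there isn’t any); it’s that the dependency pattern forces the work into waves, and the wave size, not the core count, is what caps the useful parallelism.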
Eventually you have made parallel what can be made parallel and the job is once again dominated by what can not be made parallel. Amdahl’s Law.
So the HPC folks are still out there, slugging along finding opportunities to tune code to be faster, to match it to a given hardware, and to make things run in parallel where possible. It was that “problem” that got me interested when looking at climate models (that started as just a ‘what is it doing and is that rational?’ question).
So here I am, sitting in my home office, typing on a Raspberry Pi Model 3, compiling model code on it, and pondering running a model on a cluster of such Dinky Iron. What am I thinking!?
Well, I’m thinking that there’s lots of opportunity to tune things, and that these codes were first written about 30 years ago. That’s a lot of “doubling times” (18 months to 2 years) of hardware speed. For $40 I can now buy more processing power than my $40,000,000 Cray provided.
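Quick sanity check on that (my arithmetic): 30 years at one doubling every 1.5 to 2 years is somewhere between 15 and 20 doublings, and 2^15 is about 32,000 while 2^20 is about 1,000,000. So a factor of somewhere between tens of thousands and a million in compute per dollar, which is how a $40 board ends up in the same conversation as a $40,000,000 Cray.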
In the Model II code, many bits have a header saying:
c ** Based on GCMII code for IBM RS/6000 computers created at GISS
They were Damn Fast back in the ’90s. Now, not so much… We’re talking 50 MHz to 500 MHz and mostly single digit GB of memory. In other words, slower than what’s on my desktop, and often with less memory. At present, the Model II code is stated to run well on Intel based PCs. Using a collection of a few “Pi Type” boards to run something that was suited to an RS/6000 and works well on PC CPUs is not a hard leap.
BUT, it would require that I learn some parallel coding techniques. Something I’m rather interested in doing.
Here are some benchmarks, just to drive the point home:
Step By Step (30 Years)
FORTRAN code is the same for all platforms (except for the time functions)
Download the console program for Win32 (solver of the linear system Ax = f) Test.zip (36KB)
or the source Test_source.zip (2KB) and compare your CPU’s horsepower with the Intel 386, RISC Intel 860 or legendary NeXT station.
Copyright Vladimir Galouchko, home page: 3dfmaps.com
Hardware (Software)                                        Sec
Intel i7 6700K 4.00GHz (Intel Fortran XE 2015 x64)         0.06
Intel i7 6700K 4.00GHz (Intel Fortran XE 2015 x86)         0.06
Intel i7 2700K 3.7GHz (Intel Fortran XE 2011 x64)          0.10
Intel i7 2700K 3.7GHz (Intel Fortran XE 2011 x86)          0.11
[...]
IBM RISC/6000-55                                           14.00
SPARCstation 20 superSPARC/50                              14.50
IBM RISC/6000-550/40                                       14.74
IBM 3090J (vec)                                            14.78
IBM RS/6000 250 PowerPC/66 (IBM AIX XLF FORTRAN v2.3)      15.10
IBM RS/6000 250 PowerPC/66 (IBM AIX XLF FORTRAN v2.2)      15.76
Essentially, I’ve got about the same power in just one of my SBCs (Single Board Computer). Perhaps more (depending on how I use it and the GPU).
Which brings us to this paper and the FAMOUS model that runs on PC class hardware (though, it would seem, not fast enough ;-) It is described as a coarse, fast version of HadCM3, though I’ve not found where to download a copy of the source code.
FAMOUS / HadCM3 and Parallel Conversion
When contemplating trying something, especially when it may take months to test, I like to take time up front to find out if someone else has already done that work so I don’t get stuck with it. In this case, someone has already done the code profiling, the conversion to parallel, and the testing. They also used a model (FAMOUS) that now runs on desktops (i.e. coarser steps) and they have profiled the results. It’s an impressive bit of work. Work I now need not do, and where just reading a paper for an hour or so covers it. Thanks for that!
They try to be fancy at the first link and give you an interactive experience via some stuff glued on to the side. On the Android tablet that caused a strange “jumping” behaviour when paging down. I just downloaded the PDF and read it instead. Oddly, it wants me to sign in to get the article to display on the Pi M3 (perhaps because the Tablet has a Google Account on it…) but the second link worked without that:
The authors basically introduce the FAMOUS model and then proceed to profile it, find that the majority of the computes go into the atmospheric radiative processes, and find ways to make that run much faster in parallel on several different kinds of hardware. Including the PowerPC-based CELL processor in the Sony PlayStation hardware, and GPUs in an Intel box via OpenCL. Nice, that. It took them about a week to make a parallel version, yet elsewhere they say “2 1/2 man years”, so I’m figuring this paper is worth about that much of my life NOT spent redoing any of it. Thanks for that!
Geosci. Model Dev., 4, 835–844, 2011
© Author(s) 2011. This work is distributed under
the Creative Commons Attribution 3.0 License.
FAMOUS, faster: using parallel computing techniques to accelerate the FAMOUS/HadCM3 climate model with a focus on the radiative transfer algorithm
P. Hanappe (1), A. Beurive (1), F. Laguzet (1,*), L. Steels (1), N. Bellouin (2), O. Boucher (2,**), Y. H. Yamazaki (3,***), T. Aina (3), and M. Allen (3)
(1) Sony Computer Science Laboratory, Paris, France
(2) Met Office, Exeter, UK
(3) University of Oxford, Oxford, UK
* now at: Laboratoire de Recherche en Informatique, Orsay, France
** now at: Laboratoire de Météorologie Dynamique, IPSL, CNRS/UPMC, Paris, France
*** now at: School of Geography, Politics and Sociology, Newcastle University, Newcastle, UK
Received: 10 May 2011 – Published in Geosci. Model Dev. Discuss.: 17 June 2011
Revised: 10 September 2011 – Accepted: 12 September 2011 – Published: 27 September 2011
Abstract. We have optimised the atmospheric radiation algorithm of the FAMOUS climate model on several hardware platforms. The optimisation involved translating the Fortran code to C and restructuring the algorithm around the computation of a single air column. Instead of the existing MPI-based domain decomposition, we used a task queue and a thread pool to schedule the computation of individual columns on the available processors. Finally, four air columns are packed together in a single data structure and computed simultaneously using Single Instruction Multiple Data operations.
They give this reason for the conversion to C: “Because no Fortran compiler existed for the SPEs, we were compelled to translate the radiation code to C.” Given that FORTRAN and C are both about the same speed, and provided the needed parallel facilities can be used, it ought not matter what the surrounding language is, so long as it is one of the efficient ones. I suspect Julia would work just as effectively, and likely with a bit less trouble than C (which is less user friendly, though closer to the hardware). Since I have FORTRAN on the computers here, I can also just leave the code in FORTRAN if it is converted to a parallel form. Bolding by me.
The modified algorithm runs more than 50 times faster on the CELL’s Synergistic Processing Elements than on its main PowerPC processing element. On Intel-compatible processors, the new radiation code runs 4 times faster. On the tested graphics processor, using OpenCL, we find a speed-up of more than 2.5 times as compared to the original code on the main CPU. Because the radiation code takes more than 60% of the total CPU time, FAMOUS executes more than twice as fast. Our version of the algorithm returns bit-wise identical results, which demonstrates the robustness of our approach. We estimate that this project required around two and a half man-years of work.
A “50 times” speed-up is a Very Big Deal. Saving 2.5 years is another one ;-) Even the 4 x they get on Intel processors, and the 2.5 x via OpenCL on a GPU, are “nice to have”. Then the comparative speedup of multi-threads vs SIMD vs OpenCL on a GPU is very good information for planning. The “CELL” processor is in the Sony PlayStation. It is a main CPU with what amounts to 6 user addressable RISC cores glued on. A bit more than GPU cores, but less than a full processor. Those are why the PlayStation was such an attractive beast for DIY parallel clusters (and why, IMHO, the DOD leaned on Sony to block access to them and then move on to a different processor… put too much compute power in the hands of us home gamers ;-) But, no worries, there are other ways … ;-)
So you can think of the CELL processor as being, basically, a 7 core machine (really 9, but 2 are not user accessible). Well, my Odroid XU4 has 8 cores, all of them full RISC processors. Sure, the A15 cores are not as “fancy” as the main PowerPC core, but I’ve got 4 of them AND 4 auxiliary A7 cores to work with. I suspect it is “quite enough”, and that a stack of 4 to 8 of them is “way more than enough”. Probably enough to allow upping the model precision some. After all, it’s 10 years after the citation of the CELL processor, so that’s about 5 doublings of Bang/$ … or 32 times.
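Their scheduler is described properly in the paper; what follows is just my own much-simplified sketch of the general idea in C with pthreads. A pool of worker threads pulls column indices off a shared counter (a poor man’s task queue), and each thread computes one air column at a time. The column_compute() routine is a made-up stand-in, not their radiation code:

/* Minimal "thread pool + task queue" sketch for per-column work.
   Compile with:  gcc -O2 -pthread columns.c                         */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N_COLUMNS  1728            /* e.g. a 48 x 36 atmosphere grid */
#define N_LEVELS   11
#define N_THREADS  4               /* one per core on a small board  */

static double result[N_COLUMNS];
static atomic_int next_column;     /* the "task queue", reduced to a counter */

/* Stand-in for the per-column physics: each column is independent.  */
static double column_compute(int col)
{
    double sum = 0.0;
    for (int k = 0; k < N_LEVELS; k++)
        sum += (double)(col + k) * 0.001;
    return sum;
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int col = atomic_fetch_add(&next_column, 1);  /* grab next task */
        if (col >= N_COLUMNS)
            break;
        result[col] = column_compute(col);
    }
    return NULL;
}

int main(void)
{
    pthread_t pool[N_THREADS];

    for (int i = 0; i < N_THREADS; i++)
        pthread_create(&pool[i], NULL, worker, NULL);
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(pool[i], NULL);

    printf("column 0: %f  column %d: %f\n",
           result[0], N_COLUMNS - 1, result[N_COLUMNS - 1]);
    return 0;
}

The attraction of a task queue over a fixed decomposition is that if one column takes longer than another (cloudy vs clear sky, say), the threads just keep pulling work and nobody sits idle waiting on a pre-assigned chunk.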
Section 4 describes the changes we have made to the radiation algorithm of FAMOUS to exploit parallel computing techniques. Our revised code yields very large performance improvements on the CELL processor. The modifications are beneficial for other computing platforms as well, including general purpose CPUs with vector instructions, multi-core platforms, and Graphics Processing Units (GPUs). Details of the performance we achieved are given in Sect. 5.
That they took the time to compare and contrast CPUs with “media extensions” (vector instructions), GPUs and the CELL processor lets you know where the best gains can be found.
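And the “pack four air columns into one data structure” trick from the abstract is pretty much what SSE was built for. A toy sketch (mine, with made-up arithmetic standing in for the real radiation math): the same per-level update applied to four columns at once with 128-bit SIMD. On an ARM board like the Pi or the Odroid you’d use NEON intrinsics instead, but the idea is identical:

/* Four columns, one instruction per operation.
   Compile on an Intel box with:  gcc -O2 -msse packed.c            */
#include <stdio.h>
#include <xmmintrin.h>             /* SSE intrinsics: __m128, _mm_*  */

#define N_LEVELS 11

/* packed[k] holds level k of columns 0..3, side by side, so one
   128-bit load grabs the same level of all four columns at once.   */
static float packed[N_LEVELS][4] __attribute__((aligned(16)));

int main(void)
{
    for (int k = 0; k < N_LEVELS; k++)         /* made-up starting values */
        for (int c = 0; c < 4; c++)
            packed[k][c] = 1.0f + 0.1f * (float)c;

    __m128 atten = _mm_set1_ps(0.95f);         /* made-up attenuation     */
    __m128 acc   = _mm_setzero_ps();

    for (int k = 0; k < N_LEVELS; k++) {
        __m128 level = _mm_load_ps(packed[k]); /* 4 columns in one load   */
        acc = _mm_add_ps(_mm_mul_ps(level, atten), acc);
        _mm_store_ps(packed[k], acc);          /* 4 columns in one store  */
    }

    printf("top level, 4 columns: %f %f %f %f\n",
           packed[N_LEVELS-1][0], packed[N_LEVELS-1][1],
           packed[N_LEVELS-1][2], packed[N_LEVELS-1][3]);
    return 0;
}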
That the major resource suckage was in radiative atmospheric portions also is encouraging to me. I’m interested in a model that looks a bit more to the natural physics and is less radiative obsessed, so this tells me I can get by with a lot less computes for that portion.
FAMOUS (FAst Met Office/UK Universities Simulator) is a low-resolution version of the better known HadCM3, one of the coupled atmosphere-ocean general circulation models used to prepare the IPCC Third Assessment Report, and is a particular configuration of the U.K. Met Office’s Unified Model, which is used for both weather prediction and climate simulation. FAMOUS is designed as a fast test bed for evaluating new hypotheses quickly or for running a large ensemble of long simulations. It has been calibrated to produce the same climate statistics as the higher resolution HadCM3. FAMOUS uses a rectangular longitude/latitude grid. The resolution of the atmospheric component is 48 × 36 (7.5° longitude × 5° latitude, or roughly 830 km × 550 km at the equator) with 11 vertical levels. It has a 1-h time-step for the atmosphere dynamics and a 3-h time-step for the radiation. The resolution of the ocean component is 98 × 72 (3.75° longitude × 2.5° latitude) with 20 vertical levels and a 12-h time-step. FAMOUS contains legacy code that has been optimised for previous hardware platforms and that has been adapted continuously. It consists of about 475,000 lines of Fortran 77 with some extensions of Fortran 90.
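A quick bit of arithmetic on those numbers (mine, not theirs): 48 × 36 is 1,728 atmosphere columns and 98 × 72 is 7,056 ocean columns. So if the work is organized per column, the way they end up doing it for the radiation, there are a few thousand natural chunks of parallel work on offer. That’s a comfortable match for a handful of multi-core boards, and nowhere near needing 100,000 cores.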
So largely the same code, just run with coarser steps. Nearly 1/2 million lines of Fortran (Ooof!). One wonders how much of it is really needed to model heat flow from a sphere into space… Personally, I suspect a simple engineering oriented model using a spherical heat pipe with water as the working fluid and “air contamination” and some spots of the ‘hot end’ drying out would be quite enough. Might need to add some fancy bit to cover the zero pressure at the top and gravity, but not much more than that I think… Radiative ought to only matter above the tropopause (as the reason there is a tropopause is that convection is doing the work that radiation can not…) and there more CO2 means MORE radiation to space. But back to the paper…
Subroutine                     Computation time (s)   Computation time (%)
Ocean sub-model                      142.66                  10.04
Atmosphere sub-model                1278.43                  89.96
  Atmosphere physics                1120.47                  78.85
    Radiation                        950.34                  66.87
      Long-wave radiation            572.84                  40.31
      Short-wave radiation           314.76                  22.15
    Convection                        46.01                   3.24
    Boundary layers                   38.86                   2.73
  Atmosphere dynamics                109.84                   7.73
    Adjustment                        49.10                   3.46
    Advection                         29.08                   2.05
    Diffusion                         10.52                   0.74
Basically, only 1/3 of the model time is spent on the non-radiative part. This implies getting that part to run on small iron ought to be fairly easy.
Then the authors do a very nice job of attacking that 2/3 with parallel code. I’ll leave the details of that for folks who care to go read the paper. The bit that interested me most (aside from the discussion of the code originally being designed for Cray vector processing, and that STILL being in the code design ;-) was the shift from a wide layers-array approach to a narrow columns approach. I’d been pondering making things cell or column oriented so as to make more modular parts that could be more easily distributed. They’ve already done it, and it worked well. Nice to know.
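The data layout change is easy to picture. Here’s a sketch in C of the before/after shape of it (my guess at the general idea, not their actual structures): the old Cray-friendly layout keeps a whole horizontal level of the grid together so a vector unit gets long runs to chew on, while the column layout keeps one column’s levels together so any core, or any little board in a cluster, can be handed one tidy, cache-friendly parcel:

/* Two ways of carving up the same atmosphere.                      */
#include <stdio.h>

#define N_LON    48
#define N_LAT    36
#define N_LEVELS 11

/* Layer-oriented (vector-machine friendly): one whole horizontal
   field per level, long contiguous runs across the grid.           */
struct layers_layout {
    float temperature[N_LEVELS][N_LAT][N_LON];
    float humidity   [N_LEVELS][N_LAT][N_LON];
};

/* Column-oriented: everything one air column needs, kept together.
   Each of the 48 x 36 = 1728 columns is an independent work unit.  */
struct air_column {
    float temperature[N_LEVELS];
    float humidity   [N_LEVELS];
};

struct columns_layout {
    struct air_column column[N_LAT][N_LON];
};

int main(void)
{
    printf("one column: %zu bytes;  one layered field pair: %zu bytes\n",
           sizeof(struct air_column), sizeof(struct layers_layout));
    return 0;
}

Same data, different carving. The second form is what makes a per-column task queue (and, by extension, handing whole columns or cells to separate small machines) a natural fit.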
I found this snippet interesting:
During this conversion process, we deleted unused code sections and a fair number of if-then-else statements in low-level computation routines that select which version of the algorithm is used. This results only in a minor loss in the flexibility of the radiation code because FAMOUS’ configuration is not expected to be changed.
So the original also has a lot of “tuning” available… Just sayin’…
Figure 4 is also interesting. Discussing the effect of “rounding errors” on the final results, we see the trend stays more or less the same, but individual decadal means can differ by a few tenths of a degree C. (The graph is in K.) Since that’s essentially the scale of “Global Warming”, one is left to wonder just how much of the warming really might just come down to rounding variations…
I’m not going to cut / paste the graph (as it passes through GIMP for me) but just cite the text:
Fig. 4. Graphs showing the effects of rounding-errors on the decadal means of a 120-yr simulation using three different implementations of FAMOUS using: the Intel standard floating-point unit (original version), the Intel SSE extensions and libsimdmath (sse version), the CELL SPEs (spe version).
When variations in your findings are nearly the same scale AS your findings but originate with the particular hardware on which your code is run, IMHO, that indicates some basic issues in the approach used in the code… Bolding again by me.
5.2 The effects of rounding errors on the SPEs
The single-precision floating point calculations on the SPEs are not fully compliant with the IEEE 754 standard. In particular, the rounding mode of floating-point operations is always truncation, while CPUs typically round the intermediate results to the nearest value. To evaluate the effects of the truncation on the stability of the climate model, a 120-yr simulation was performed and the result compared to a reference run. This simulation was forced by historical changes in greenhouse gas concentrations, solar forcing, volcanic aerosols, and a time-varying climatology of sulphate aerosols.
As can be seen in Fig. 4, the decadal mean of the global average surface temperature computed by the spe version (blue line) evolves differently than the output of the reference simulation (red line). However, the results did not show any instability or bias and the statistical differences between the versions are comparable to running the unmodified model on different platforms or with different compiler configurations (see also Knight et al. (2007) for a discussion on how the hardware variation effects the model behavior). The green line in the figure shows the results obtained with the simd version using Intel’s SSE
That it’s the same variation with changes of hardware or compiler configurations does NOT comfort me as much as it does them…
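To see how tiny per-operation rounding differences snowball (this is just a generic single vs double precision demo I cooked up, not the SPE truncation mode itself, but it’s the same family of effect): add up the same numbers with two different precisions and watch the totals drift apart.

/* Rounding error accumulation demo.  Compile:  gcc -O2 drift.c     */
#include <stdio.h>

int main(void)
{
    float  sum_f = 0.0f;          /* single precision accumulator */
    double sum_d = 0.0;           /* double precision accumulator */

    /* Add the same value ten million times.  Each single-precision
       add rounds a little; the error compounds as the sum grows.   */
    for (int i = 0; i < 10000000; i++) {
        sum_f += 0.1f;
        sum_d += 0.1f;
    }

    printf("float  sum: %f\n", (double)sum_f);
    printf("double sum: %f\n", sum_d);
    printf("drift:      %f\n", sum_d - (double)sum_f);
    return 0;
}

Scale that up to a 120 year simulation’s worth of floating point operations and it doesn’t take much per-operation drift to move a decadal mean by a tenth of a degree, which is pretty much what Fig. 4 is showing.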
I’ll leave the rest of the technical bits for folks to read in the paper itself. For me, the key bits are that the non-radiative code is fairly fast, the radiative is where it sucks cycles, changes in that run-context have significant impact on how a model run “evolves” and they had a lot of tuning available there. Parallel via a column (or by extension grid-cell) centered compute model works. The code is not just old FORTRAN, but old and crusty FORTRAN. It is very sensitive to run time environment (and that’s a very bad thing…) but folks in the field are “OK with that”… for some reason…
All that, and the fact that I’m now about 2 to 3 years ahead of where I was a few days ago, makes me one very happy camper!
Now I think I can finally get to bed ;-)