Hey P.G.!

Hey, P.G.!

It worked… Yes, I finally got around to checking my email…

pg sharrow sent you $20.00 USD
Note from pg sharrow:

“this is a test. . . pg”

Thanks!

Also, in checking, I found I have enough in there to buy about 5 more R.Pi boards and support stuff (power supply, etc.) in total. That means that whenever I get the GCM Model codes running, I can increase the size of the compute farm fairly quickly, if needed.

(Thanks to all who have contributed, even if I didn’t read the email yet!)

I have ordered an Orange Pi board; at $16 plus a power supply it comes to $25. Orange Pi also makes a cheaper $10 one (that wasn’t in stock at Amazon… so I’m not getting it right now…). I’m going to ‘bite the Chinese bullet’ and evaluate it for function and security… It has the same ARMv7 instruction set as the Pi M2, so it ought to interoperate well in a cluster. So you have directly funded both an investigation of the suitability of the Orange Pi family of boards, and an expansion of my compute cluster.

Thanks!

This is the gizmo:

https://www.amazon.com/Orange-Pi-Project-Board-ARMv7/dp/B01CD48E94

Orange Pi board

IFF the test is successful, then those Orange Pi Zero boards at $10 each (all up shipped) as headless compute nodes make a relatively high performance cluster dirt cheap. They are a bit light on memory, so it will depend a little on how benchmarks run on these vs Raspberry Pi vs Intel boards. Then again, headless takes a lot less memory.

It may well be that the “computes / $” are best with a couple of multi-core Intel boards from Fry’s and a bit of tinkering. I’ve done the DIY computer build thing many times, and building up a headless board is pretty easy… I have a half dozen Intel based old boxes that could be clustered, but the reality is that the 486 and Pentium I class chips in them suck enough power per compute that it would be cheaper to run a Pi cluster of equal computes (the Pi Model 2 is more computes…). So it would likely only be new Intel chips that would have enough computes / Watt to make it valuable, and I only have 2 of them runnable right now. (Both having disk issues, but that doesn’t matter in a headless PXE boot compute node.)

But I’m getting ahead of myself…

First, make the model run.

Second, function test the port to Raspberry Pi, Orange Pi, and Intel, and see how well things parallelize.

Third, benchmark the alternatives.

Fourth, make a hardware recommendation for what would work best per $.

Fifth, buy and build…

Hey C.D. Quarles!

I got the input files for the Model! LOADS of thanks!!

I can now actually give it a spin and see what happens.

They may, or may not, need tweaking, but even if they do, that’s a heck of a lot less work than creating from scratch. Generally, it looks like “model experiments” consist of tweaking the input files anyway, so likely “Many such journeys are possible”… (TOS – Guardian of Forever).

So thanks to all, and special h/t to C.D.Quarles and P.G.!

About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present "hot buttons" are the mythology of Climate Change and ancient metrology; but things change...
This entry was posted in Human Interest, Tech Bits. Bookmark the permalink.

9 Responses to Hey P.G.!

  1. LG says:

    @ E.M.
    Just curious.
    What kind of switch are you using for that cluster ?
    Also, how do you manage I/O between devices ?

  2. E.M.Smith says:

    @LG:

    At the moment, I’m using either the Netgear WiFi router for low intensity things (as long as I’m at 4 nodes or fewer, since that many ports are built into it), or a nice Netgear 8 port switch from about a decade back for when I grow it.

    Since the Pi is only 10 / 100 Ethernet, it isn’t very picky about the switch… Should I find I/O limiting at the switch, I’d likely just get a Netgear 10 / 100 /1000 switch since I’m partial to their stuff.

    The systems tend to be self managing. Using “distcc”, as long as they can see each other, they can talk. The one where you launch the compile is the master. Same thing for MPI / MPICH etc.

    IF I make them into a Beowulf, then I’d set up a head station that would manage the group, but I’d still have it all on a private backend switch, so not a lot of network management.

    Since a switch is, by definition, just switching traffic to whatever port the destination address lives on and not much else, it takes care of the from / to mapping for you. The head station hands out a problem to a slave node by IP and that’s that. The results come back by IP. Since it is “many to many” they can all talk as desired. The only real issue is if they all want to contact the head node at once…

    “Someday” I might need to go to a 24 port or larger switch; then I’d go to the local “They Went Out Of Business and Sold Us Their Gear” shop and pick up something more commercial. We have two of them in Silicon Valley that I commonly go to. Inventory is better right after an economic downturn, but they also have things other times, as during booms mergers are common and the one that got merged usually has an effective “going out of business” for their I.T. shop as it gets moved to the buyer… Most folks shopping there buy computers, so the switches tend to be cheap.

    In my experience, most problems that parallelize well don’t have that much network load anyway. Sort of by definition… if it is heavily communication dependent, it really wants to be in one large memory space… (There are folks making custom hardware for some of that class of problem. The Parallella board, for example: the 16 core Epiphany coprocessor is interesting, but it is the mesh interconnect and local memory that’s the real big deal on it. Similarly, using an NVIDIA Jetson or similar with large memory lets lots of GPU cores work on small math bits with very fast interconnect speed as its memory…)

    So really, part of what I’ll be testing as I get the model running (for Model E) is how much MPI demands a fast interconnect. If it turns out that running on 16 ARM cores loads up my switch and things are I/O bound, well, then it means that adding more Pi boards above what I have is kind of silly, and I need something with faster I/O and a bigger switch…

    My expectation is that the Pi Model 3, being about as fast as a 64 bit Intel core from a couple of years back (I did a benchmark against my Asus / Antec box), with GISS saying that’s enough for a minimal run but you need 88 of them for a decent run with oceans, means I can use a Pi per 64 bit Intel core equivalent (modulo communications load). So a benchmark with 4 boards ought to give me enough to ‘get clue’ on the memory / CPU / IO balance of this particular problem. Then I can step back and answer a couple of key questions:

    Is 250 MB memory / core enough?
    (GISS recommends 1 GB / Intel core and this is 1 GB / Pi of 4 cores)

    Is 100 Mb Ethernet interconnect fast enough? (If not, I go to CubieTruck or similar with Gb)

    Is the ARM going to cut it? (If not, I move on to Intel based SBCs)

    Basically I’m setting up a Toy System with a cluster of 16 ARM cores and about 4 GB of memory, with aggregate 400 Mb/sec communications. Then I measure and assess.
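
    The “measure” part doesn’t need anything fancy; a rough sketch of the kind of spot checks I have in mind, using plain Linux tools (nothing model specific):

    # on a worker node, while a test run is going:
    vmstat 5            # CPU busy vs idle vs I/O wait, sampled every 5 seconds
    cat /proc/net/dev   # per interface byte counters; diff two readings to estimate Mb/sec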

    Depending on what limits, I adjust the design of what to buy. As long as it is Pi or similar, the 100 Mb will limit network issues to the Pi and not the switch. IFF it ends up being Intel SBCs with Gb, then the network will become a potential issue (depending on how chatty the programs are…)

    In any case, switches are cheap. Especially used ones…

    So far, looking at the Model II code, it looks like it has lots of math per line of code, and modest data sizes. (Even an 8k grid with 1k data items per grid cell would only be 8 Meg of data items; at 64 bits each, that would be about 64 Megabytes of data to move.) Then you chew on it for a day / month / year cycle, then communicate and repeat for the next problem (Advection of Humidity, then Advection of Mass, then…).
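
    (That back of the envelope is easy to check with a shell one liner; the 8k cells and 1k items per cell are just my round number assumptions:)

    # 8k cells x 1k items/cell x 8 bytes each, expressed in Megabytes
    echo $(( 8192 * 1024 * 8 / 1024 / 1024 ))    # prints 64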

    So my best guess at the moment is communications will not be the limit. Nor, really, memory. I suspect CPU will limit for the ARM chips and that MPI overhead may be higher than expected. The Pi behaved very badly on the parallel FORTRAN test… (Frankly, part of why I’m testing the Orange Pi – it is a different chip and I’m hoping it handles the parallel code better…) To the extent that the ARM chips just don’t do MPI well, I’ll be forced into Intel SBCs. Not horrible, really. At about 10:1 price ratio you get about 10:1 performance ratio too… with far fewer nodes to communicate between and larger blocks of problem per node… Just 4 x Pi boards are about 1/2 the cost of an Intel test board (and about 1/20 the power)… so testing first on the cheap seats…

    But again I’m speculating into the future when measuring is needed…

  3. p.g.sharrow says:

    Pleased to know the link test worked.
    The costs in funds, power and labor are more important than speed. The ability to add more and more units that actually communicate and work well together should mark the correct path. Doing climate modeling should be a good test as it is a data rich environment.
    Minor detail is getting the climate science correct ;-) We know that the present code doesn’t work.
    I’m not sure that the Raspberry Pi is the best solution, just a good place to begin…pg

  4. E.M.Smith says:

    @P.G.:

    In the HPC world, the $ matter, the $/compute matters, and the Computes/Watt matter. I’ve done those figures a few times for the Crays we bought. With a 750 kVA power feed and about a 16 x 16 foot cooling water tower, the power costs add up fast… Typically the labor cost is not the big factor.

    Now for the Pi, the $ are quite small and the power nearly nothing. (Though, oddly, the power supply costs about 1/2 the board cost for the Orange Pi Zero… so coming up with a single large power supply with multiple 5 VDC USB spigots is likely worth it if a big cluster is built). My labor is effectively free, though limited in total supply. So we end up with $/compute as the biggest thing to look at, but followed by Computes/Watt. Every few years that Computes / Watt kills an entire generation of HPC computers as the new ones have $Total/Compute < $PowerCost/Compute of the old ones.

    Speed demand is an odd duck. It must be enough (needs to finish inside the need lifetime) but beyond that is just convenience. So the Model E 'advice' talks about a reasonable speed, but from what I can tell, that is to have a run finish in less than a day… once you are willing to let it run a month, or a year, that metric changes a lot.

    Also, if you need that daily speed so you can try different parameter sets, then getting the price of a cluster down to where 1000 people can buy one "just to see", lets you run 1000 different parameter sets in parallel (sort of a macro-parallel processing ;-)

    Still, with that said, each compute problem has its own characteristics and works best on tailored hardware. Some need massive memory. Some need hard core computes. Some are chatty and saturate the communications. The IBM 360 had a lousy small CPU (10 MIPS), but had fast disk I/O channels and lots of them. It was a "mainframe" because of those I/O channels and business codes moving 1000 bytes of data, to increment one field, then move that 1000 bytes back to disk. The Cray was a "supercomputer" because it had a highly limited vector processor unit that could do simple math, but in sets of 64 (a 'stride' of 64) in parallel. So it goes.

    So before making a hardware commit, I need to characterize the GCM code.

    On first look, it looks very math intensive with small data and modest communications needs. IFF that is verified (at least when the chatty 'diagnostics' are turned off), then lots of small compute engines in parallel will be enough (Pi or Pi clone, maybe NVIDIA Jetson if highly math intense). But if the Pi bogs down on all those trig calculations, or it turns out the data volume is much larger than it looks to me as the code is dumping values on every crank of the handle, then I'll need to get hardware that does those things "well enough".

    The name "General Purpose Computer" is a shorthand for saying "in conformance with Amdahl's performance rules". Basically, a machine with about a GB of memory per GHz of speed and I/O to match (about a Gb/sec) will work well enough on most problems. By that metric, the Pi (and many Pi Clones) is 50% lite on memory and has 1/10 th the network / communications speed needed. (And the 4 core Pi is about 1/4 the memory match) But for a highly compute intensive task, that ought to be good enough. As it is, with most of the O of I/O being output on the HDMI for desktop use, it is plenty fast on I/O for that. And for memory, while I occasionally spill over to swap, it is usually not much and mostly open but inactive web pages. All that encourages me that it will be OK for the GCM use. That, and the fact GISS talks about using General Purpose Computers in the form of their Intel chip statement…

    Well, I'm getting a bit deep in the weeds on performance stuff. If tech talk causes folks to glaze, the High Performance Computing world of computes esoterica causes catatonic states ;-)

    So I'm hoping to have the time today to run some Linpack or similar code on the Pi M3 and M2, and on a 'cluster' of the two of them. Just to see if that code has the same parallel issues that OpenMP had. Hopefully by Monday I'll have done a similar test with MPI / MPICH / OpenMPI and that will tell me if those libraries and the particulars of the Pi let other parallel message passing techniques work efficiently. The OpenMP (multiple threads to multiple cores inside the chip) FORTRAN test was dismal. It took LONGER to run than a straight run… But MPI is message passing off chip and completely different codes / methods. As usual in all things computer performance, "We'll see."
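
    For anyone playing along at home, the OpenMP check is roughly this sort of thing (a minimal sketch; the source file name is just a placeholder for whatever FORTRAN test gets used):

    # build the FORTRAN test with OpenMP enabled
    gfortran -O2 -fopenmp omp_test.f90 -o omp_test

    # time it with 1 thread, then with all 4 cores of a Pi
    time OMP_NUM_THREADS=1 ./omp_test
    time OMP_NUM_THREADS=4 ./omp_test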

    Time for that morning "Wake up and get to work" coffee I think ;-)

  5. cdquarles says:

    Happy to help :). One of these days (need a round tuit :P) I may rebuild my old LAN from used PC parts.

  6. LG says:

    @ EM :
    Are you using any KVM device in your setup?

  7. p.g.sharrow says:

    @EMSmith; reading your technical explanations is good, it is the coding that makes my eyes glaze over |-( but I must say that every time I go through the code explanations a bit more light dawns. Morning pot of coffee helps :-).
    I have budgeted another large donation, so the “test” was to determine that method of transfer. Glad to see It Works!
    Sometimes the better solution is the General Purpose one and not the specialized one even if it is a bit slower. So GISS models are a good test of concept. Too bad they need fixing as well, but at this point just running them in a reasonable time and cost is a start. A thousand people poking at them might yield something that actually works…pg

  8. E.M.Smith says:

    @LG:

    Nope. KVM = Keyboard, Video, Mouse… Not needed in “headless” mode. You build a config that boots and only issues a login prompt (if that). No GUI. No nothing. Once you have the OS configured (or the chip configured…) you just boot the thing with an ethernet wire and power cord. That’s IT.
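
    (If you start from a desktop image, one way to get there is just to strip the GUI off; a rough sketch, assuming a Debian family image where lightdm is the display manager:)

    # remove the display manager so the board boots straight to a console login
    sudo apt-get remove --purge lightdm
    sudo apt-get autoremove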

    For fun, at some future time, I’m going to get a few SBCs with WiFi on them and configure them to use it by default. Then the “cluster” will be a pile of boards with power to them… and a WiFi router in the corner…

    Now if you want to do maintenance, just remote login to them from a machine that does have Keyboard, Video, Mouse on it. Like this:

    ssh -l yourlogin 192.168.1.34

    or whatever hard coded IP you assigned to it. (Or set up your DNS to assign “headless1” and “headless2” and… etc. to the worker nodes on DHCP and then ssh -l yourlogin headless2 )
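
    (A minimal sketch of that naming, assuming static addresses and made up node names – local DNS or DHCP reservations do the same job:)

    # /etc/hosts on the workstation – addresses and names are only examples
    192.168.1.34   headless1
    192.168.1.35   headless2

    ssh -l yourlogin headless2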

    At that point, you have a terminal window (launch something like LXTerminal from system tools) and can do whatever you need in terms of maintenance on it. I usually do this for each node and launch a “top” in it. That way I have a real time monitor of the performance and load of each system — up to my screen size limit ;-) (Don’t try it on 2000 node clusters!)
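
    For a quick look across the whole farm without opening a window per node, a little loop does it too (a sketch using the made up node names from above; assumes ssh keys are set up so there are no password prompts):

    # print the load average of each node in turn
    for node in headless1 headless2; do
        echo "== $node =="
        ssh -l yourlogin $node uptime
    done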

    For regular use as a compute node, the master application station knows who’s out there and sends tasks to each one via IP / MAC address. (It knows the IP; the switch learns the MAC addresses to speed things up – but you never see that…) So there are some specific configuration methods you need to learn to set up distcc or MPI. For example, for distcc (the distributed C compiler) it is in a file:

    https://chiefio.wordpress.com/2016/05/06/distcc-success-that-was-fun/

    At the bottom of my .bashrc file I added:

    #added by EMS 6May2016

    export PATH=/distcc:$PATH
    export DISTCC_HOSTS="10.16.16.253/24"
    export DISTCC_TO_TIMEOUT=3000

    So I’ve got a .bashrc file that sets the environment variable DISTCC_HOSTS to an IP range. Then when launching a distcc run, it looks around for cooperative hosts in that IP range and they form up a work group…

    I never need to login to the distcc ‘farm’ nodes as they are just sitting around waiting to be polled for some work to do, by IP number…
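
    Kicking off a distributed compile from the master is then just a matter of telling make to use distcc as the compiler; a minimal sketch (the -j count is only an example, roughly the total cores across the farm):

    # show which hosts distcc will hand work to
    distcc --show-hosts

    # build with distcc wrapping gcc, spreading compile jobs across the farm
    make -j8 CC="distcc gcc"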

    Something similar is done with MPI, though the specifics are different.
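
    (For MPI the rough equivalent is a host file plus an mpirun launch line; a sketch assuming Open MPI style syntax, my hypothetical node names, and a placeholder program name:)

    # mpi_hosts: one line per node, with how many processes (slots) each may run
    headless1 slots=4
    headless2 slots=4

    # launch 8 ranks across the two boards
    mpirun -np 8 --hostfile mpi_hosts ./my_mpi_program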

    Thus no need for KVM or even a session on the compute nodes. Just a target login account that lets you run things (i.e. grants permissions to run programs).

    FWIW, it is also possible to use X Windows to make window sessions on the remote boxes from your desktop workstation, but that’s overkill for most things, IMHO. There are also Windows based remote terminal programs (that I’ve used with a Dongle Pi) for those folks who have a Windows Desktop as their workstation. But again, once the particular compute package is configured, you need not ever log into the remote nodes other than for maintenance on them…

    @P.G.:

    Thanks again, and in advance!

    First you make the broken model run, then you fix it… No idea how long it will take, but the first steps of the journey have been taken.

    I expect it to be “not too hard”. Looks like just adjusting their humidity calculations to get the high atmosphere humidity trend right, then adding a step to the solar flux step that consults a ‘state of the sun’ table and adjusts the distribution of UV energy accordingly. At least, that’s the hope… Though I’m also thinking, after this morning, that an adjustment to water vapor forcing for “frost forcing” might be needed ;-) Thinking of a posting on that now…

    FWIW, I’m reasonably happy with Devuan. There have been one or two minor bugs on the Model 3, such as the ‘shutdown’ panel is a bit daft – you must choose ‘exit to prompt’ and then choose shutdown – but otherwise it seems fine. Even those will be ironed out over time as more usage happens. So I’m past that time suck of trying to escape systemD 8-)

    My desktop is running Devuan and the compute node (Pi M2) is running Devuan. I need to dig around and find where my other power supply went (or go to Fry’s and buy one for $8…) and clone the chip for the second Pi M2 board, then I’ll have a 12 core cluster up and running… Oh Joy. (Oh, and I need to add the distcc and MPI codes to the config… guess what I’m doing this weekend…)

    Somewhere in the next 2 weeks, the Orange Pi shows up and I see if it integrates and “plays well with others”, at which time it becomes a 16 core cluster and I try some bigger things.

    At any rate, “Now I’m on my way, don’t know where I’m going”… (apologies to Simon &G…)

  9. LG says:

    @EMSmith,
    So can one presume that, in the headless configs, on a reboot / power cycle, nodes would come back online with apps relaunching the same way a network appliance would with preconfigured protocols and settings?

Comments are closed.