You can click on the image for a bigger version.
In the picture, the top left panel is a terminal window with a “make” of the Pi Linux Kernel being run on my Pi Model 3. The bottom left tiny window is where I ran “scrot” to make the SCreen shOT. The three panels on the right are running “top”. Each shows what processes are running and some other operational aspects like memory used, swap, and such.
In order, top to bottom, are the “top” reports for the Pi-M3 workstation showing one C compile running (CC1), under it the panel for “Headless 1” showing several CC1 instances running, then under that “Headless 2” showing none running.
Now I’m not sure why none are running on “Headless 2”. It ought to have an even share of the load with Headless 1, but I might have a tuning parameter wrong or it might not be participating in the pool. That debugging is for tonight (now that I have a working test case).
The “biggie” for me is that I’m posting this on the Pi Model 3, with little notice of the build happening, since it is only running the occasional one compile in one cpu, while most of the work is farmed out to that other Pi Model 2 board. Nice, that.
So time to celebrate a tiny bit! Yay!!!
I “lost” about 3 hours today trying to find something to compile as a test case. I found a free and open source fancy BASIC, but it was written in itself… Sigh. A nice C compiler that wasn’t willing to build on ARM. The g95 Fortran that I’ve compiled before, which ceased development in about 2008 so doesn’t know what an ARMv7 target is… and more junk.
Finally I had a bit of clue that maybe, just maybe I ought to simply build the Pi kernel itself since by definition it will have no such issues. Ta Da! How To here:
Don’t expect to see any such thing for FORTRAN compiles. It doesn’t use distcc. OTOH, compiling Model II on the Pi Model 3 takes about 3 minutes, so who cares… the models are NOT big chunks of code. (Especially not compared with Linux Kernels, tools, compilers, and most large commercial codes. Heck, even a browser is much much larger.)
So what good is it?
Well, for one, it is a demonstration case that the cluster works. It also will really help with things like “rolling my own” distribution and building one from sources. (so that week to build BSD can become a day, or less…)
It is also a ‘shake down’ for the cluster. If it can compile a working Linux, it can run a compute intensive model without heat or other modes of failure.
So, with those steps out of the way, I’m on to the next bits.
1) Get “Headless 2” to participate in the compile party.
2) Build Model E and test the MPI distributed execution.
3) Run Model II and see how long it takes. Maybe try adding some MPI bits to it.
4) Add the Banana Pi 3rd headless node to the cluster and see if it “plays well with others”.
5) Finally get around to building that UPS I wanted…
Power failed today for about 1/2 hour in the storm. The Pi’s did a great job of rebooting on their own, but I’d rather not have that happen in the middle of a Linux build. I’ve got a kW UPS with a dead gel battery in it, and I’ve got a BIG Mercedes sized battery that could stand to live on a trickle charger… soo… Battery to box. Clean up UPS and remove battery. Add jumpers. Viola!… (Pronounce Vi-Ola for effect ;-) NOT an error in Voilà)
With that, I’m taking a dinner break and a movie break and a just sit and glow a bit break ;-)
Sometimes just stepping away for a minute is enough…
Headless 2 had a config error in /etc/default/distcc… the ‘LISTENER’ was left at the default of the loopback address 127.x.x.x and needed to be the address of its own Ethernet interface.
Fixed and restarted, now all three boards are compiling away!
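For anyone hitting the same thing, the relevant stanza in /etc/default/distcc (Debian/Devuan packaging) looks roughly like this — the addresses here are examples, not my actual ones:

```shell
# /etc/default/distcc (illustrative values)
STARTDISTCC="true"
ALLOWEDNETS="192.168.1.0/24"   # networks allowed to submit compile jobs
# The gotcha: LISTENER defaults to the loopback address, so remote
# jobs never arrive. Set it to this node's own Ethernet address:
LISTENER="192.168.1.12"
```

Restart the daemon after editing (e.g. `sudo service distcc restart`) or the change won’t take.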
I’m looking for a link for step-by-step creation of a headless unit. Have you written anything that I have missed?
Sounds like beer thirty! A liquid blessing is due…pg
I’m going to do that next. First you get it to work, then you do the “How To”.
I’ve taken notes ;-)
Sadly, I’m out of suds at the moment and being Sunday, the Spouse Does Not Approve…
(so tomorrow when she’s at work ;-)
I just did a full run again, with all three boards ( 12 cores ) loaded to the max. Typical idle time about 5% (or 95% loaded) so cooking all the way (literally, see below…)
How to turn 400 minutes of compute waiting into 40:
First I made sure no .o object files were around and it would need to recompile everything.
Then I launched it with “time” in front of the “make” command so at the end it would report timing statistics. Note, too, the -j18 argument. That says to launch up to 18 compile jobs in parallel. The two Headless nodes are configured to take 8 each, and while I’d configured the headend to take only 2, it seemed to often have more; so it may be that some bits are not farmed out (which I saw in the first run), or that I’ve not got the tuning as right as I think I do (it is set in a couple of places…)
In any case, much of the time the three boards were all at nearly zero idle time, and only when doing the dtbs part (or near it) was there any block of over 25% idle on the headless nodes while the headend was fully loaded, so not much tuning is needed…
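For reference, the launch was of this general shape — a sketch, with the usual Debian distcc wrapper path and Pi kernel targets, not copied from my terminal:

```shell
# Put distcc's compiler wrappers ahead of the real gcc, then build:
export PATH=/usr/lib/distcc:$PATH
cd linux                         # the Pi kernel source tree
time make -j18 zImage modules dtbs
```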
After running a while, I took the temperature. 70C is the usual marker for a hot CPU, and over that is “not good” (shortens chip life a little as dopants migrate in the junction… not an issue for short times, but days at above 90C or so can be an issue.)
So running kind of hot after 20 minutes… This is with heat sinks and in the clear plastic case with not many vent holes. I’d noticed that when installing it, but thought: “well, maybe…”. I’ll be moving it to the red case where the whole top comes off for better ventilation… The ones in the Dogbone Case have the entire periphery open and some holes between layers. I’ve not measured their temp in operation yet.
OK, after pages and pages and pages of compiles, it ends up in the link phase (only on the headend) and finally finishes:
Well, 42 minutes elapsed time.
Now, note the User and Sys times; added together they are more than the elapsed time. That’s due to 4 CPU cores… so about 140 CPU minutes. But there are two more boards similarly loaded to the gills. So 140 x 3 = 420 CPU minutes, roughly. Or about 7 hours.
So computed on a single-core machine at this clock speed, it would take about 7 hours to do this build. The cluster finished it in 42 minutes. Nice. Very Nice. About a 10x speed up (which makes sense since there are 12 total cores).
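The arithmetic, spelled out (all numbers approximate):

```shell
echo $(( 140 * 3 ))    # → 420 CPU minutes across the three 4-core boards
echo $(( 420 / 60 ))   # → 7 hours of single-core work, roughly
echo $(( 420 / 42 ))   # → 10x speedup over the 42 minute elapsed time
```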
A few minutes after finishing, the core temp has dropped a lot:
which is about where it stabilizes, a bit under 60 C.
In a few days I’ll try integrating the Orange Pi and run the test again…
Given that the boards all load up to near 100%, it can still benefit from more boards. That continues up to the point where the headend is fully loaded just handing out tasks (not doing compiles itself) and the worker boards are not kept at near 100%. That’s the max cluster size that does anything for that particular problem. I’m guessing about 2 more boards and I’ll be getting close to that…
In any case, the job is done. The system is proven. I’ve got a Build Monster. Yay!!!
Now on to making backup copies of the chips (so any future configuration AwShit doesn’t set me back several hours to recreate them) and write up the How To. Then start from a bare empty SD card and follow the directions then test to show it is right.
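The backup itself is a one-liner, sketched here with example device names; check lsblk first, since dd will cheerfully overwrite the wrong disk:

```shell
# Image the whole SD card to a file (device name is an example!):
sudo dd if=/dev/sdb of=headend-backup.img bs=4M status=progress
sync
# Restore (or clone to a fresh card) by swapping if= and of=
```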
No worries, I’ll post the detailed “how to” before I do the re-load and QA. It’s more fun that way ;-) Nothing like live “Ooops! That line was wrong in my notes, do this instead!” to give folks a bit of chuckle ;-)
Well, time to say goodnight to the spouse… I’m taking another break ;-)
Great to see the project moving along, and getting good results without too much brain damage from troubleshooting odd ball problems.
Fair play to you. Can’t wait to see the “how-to”. I have quite a number of Pis of various versions, which have now become free after a failed 360 Camera experiment using them. I’d like to do an analysis of the GHCN raw dataset, using some ideas of my own about it. Doubling that up with building a cluster and getting it running would be a real challenge.
Viola – just saw this at Prof. Claes Johnson’s: “I have to echo Viola in Twelfth Night: ‘O time, thou must untangle this, not I.’”
“Time, the fire in which we all burn…” (TNG)
So time references spanning 12th Night to TNG… I think we have a theme ;-)
Well, it would have been a lot easier if the folks at Debian (the “upstream” provider to Devuan) and Red Hat (the “upstream” to Debian) were not so busy breaking things that work to “help” with their new ideas… At least 1/2 the time spent was just finding “What have they screwed up now that worked just fine?”. Yes, I know that security is a moving river and you can “never have enough”, but sometimes you can have too much… That point is where you can only get your project to work by just shutting off all of it instead of figuring out how to configure the latest byzantine additions… You saw that in my just bypassing the whole hardware fingerprint thing for SSH. Now, “someday”, I need to go back and ‘back that out’, or just accept the ‘insecurity’ (of about the 1990’s style of SSH). I’d rather spend that time on “new stuff”, and I’m behind 2 firewalls and don’t connect out to the ROW with the headless ones, so will likely just leave it.
FWIW, “in theory” you have the whole “How To” in the litany above along with the linked prior postings. I tended to put a note here (or in the posting for those already found) every time I ran into an “issue” and worked through it. But I like to “tidy up” when the wandering in the forest is done, so will be “pulling it all together”. Often into a “scripted build”. For now I’m unlikely to make that script, but if Model E runs with good distributed computes and looks like a 16 board cluster would be enough, well, I’m more inclined to run a script 16 times than do 16 by hand builds. (OTOH, once one headless node runs, you can just clone it with ‘dd’ so why even run a script?… other than the change of IP thing…)
Also FWIW, there’s a few different distributed computing models and software to implement them. ‘distcc’ makes a giant distributed C compiler. Useful for folks doing a lot of C compiles, but useless for distributed data analysis… MPI (Message Passing Interface) lets you write your own programs that then send bits to different machines for analysis. You get your choice of a couple of flavors. OpenMPI or MPICH2 or… So what you will want is the MPI install and testing (hopefully by next Monday…) Now, if you write your analysis code in C, you can have both a distributed compile AND a distributed analysis…
As mentioned before, OpenMP (note the missing I at the end; it isn’t the same as OpenMPI) is a multiprocessor ‘by the thread’ parallel technique. It ought to let a program distribute bits between the 4 cores on a single Pi board. My test of it was dismal. It took LONGER to run than just in one core. Clearly the “set up” was costing more than the gain. Now there are 3 most likely reasons: it’s a lousy implementation of OpenMP; this chipset doesn’t do threads well; or the test case was so trivial that the overhead dominated and a more complicated threaded test would show a gain. Of those, I think it’s likely that the chip just doesn’t do threads well, with a maybe on the other two. It might be that the OpenMP was optimized for Intel, since that’s what almost everyone develops on, or that the ARM, being a RISC design at heart, doesn’t like threads. (Doesn’t handle a lot of stacks and stack pointers well.) In any case, the distcc example shows that parallel processing can be done on the ARM if you “do it right”… so for now I’m not doing “thread parallel” but “processor parallel”. MPI, being more processor parallel, ought to work well.
But I’m wandering into next weeks task when I haven’t finished this weeks yet ;-)
Pingback: Clusters and Beowulfs and C.O.W.s, Oh My! | Musings from the Chiefio
Sipping a nice Australian Chardonnay as I type ;-)
For anyone wondering about my assertion that the Climate Models were pretty weak tea compared to really complicated codes like commercial software or the Linux kernel, I did a timed test of the Model II compile time.
So first, refresher, the Linux Kernel above was 7 hours on one core, or 40 minutes on 12.
Here’s the results of a compile of Model II on one core (FORTRAN does not go parallel without a lot of work…)
Drum Roll Please… BRRRRumpata rumpta rumpata!!
Yes, you read that right. 2 MINUTES of elapsed time (plus a smidge) and about the same CPU time.
2 Whole F’ing minutes.
Compare 7 HOURS for the Linux Kernel.
See why I say the code isn’t that complicated? It’s only 1/210 the time to compile as the Linux kernel, not the whole operating system…
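The ratio, for the record:

```shell
echo $(( (7 * 60) / 2 ))   # 7 hours vs 2 minutes → 210
```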
Started working on MPICH2 and general MPI install / run and was “reminded” that the newer Linux versions can be a bit pig-headed about actually letting you use NFS… (more security-against-my-will mixed with complexity-just-grows…)
Need to do:
(Which implies you have installed them…)
On the server side to get it to actually export file systems… (There was a time when most Unix machines had NFS on by default and all you did was add the file systems to export to your /etc/exports file and do exportfs -a and were done…)
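The server side boils down to something like this — export path and subnet are examples, and the service names are the Debian-family ones:

```shell
# /etc/exports entry, e.g.:
#   /home  192.168.1.0/24(rw,sync,no_subtree_check)
sudo exportfs -a                        # (re)export everything listed
sudo service rpcbind restart            # commands of this sort were needed
sudo service nfs-kernel-server restart  # before the exports actually worked
showmount -e localhost                  # verify the export list took
```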
Despite exportfs saying the file systems were exported, they were not until more commands were issued. (See below).
I have tested that this survives a reboot, and it survived, so there’s no need to add it to rc.d somewhere.
On the client side in /etc/fstab:
That way the same account with home directory there and the GCM runnable there is available and mounted on all the nodes. It’s an MPICH thing…
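A typical client-side line, with example server name and paths:

```shell
# /etc/fstab on each node:
#   headend:/home   /home   nfs   rw,hard,intr   0   0
sudo mount -a        # mount everything listed in fstab now
df -h /home          # confirm the NFS mount is live
```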
Well, I seem to have MPICH installed and running.
when run spits out a set of different numbers as expected and I did see an SSH pop up on the other headless unit:
So I think it worked. I’ll be writing up what I did as a posting. I think I can figure out which of the Magic Sauce bits mattered and which were dead ends based on a false idea of the problem.
I was having keyring problems that were solved by launching ssh from each node to the other and having it capture a password response, even though it is all the same account, same nfs mounted file system and same keyring… There is likely some hardware identifier coded into it as “protection”…
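My guess is that the “hardware identifier” is the per-machine SSH host key: the first interactive connect caches the peer’s key in ~/.ssh/known_hosts, which would be why one manual ssh from each node to the other fixed it. A sketch, with example hostnames:

```shell
ssh headless1 true    # answer the host-key prompt once; it is then cached
ssh headless2 true
# or pre-seed the keys non-interactively:
ssh-keyscan headless1 headless2 >> ~/.ssh/known_hosts
```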
Now, with MPICH2 seemingly working, I can compile Model E and see what it does on my mini-cluster ;-)
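For the record, an MPICH launch of the sort used here looks roughly like this (hostnames and slot counts are examples):

```shell
cat > machinefile <<'EOF'
headless1:4
headless2:4
EOF
mpiexec -f machinefile -n 8 hostname   # each rank reports the node it ran on
```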
Had to put a pause into the program so it would not finish so fast I couldn’t check that it actually got to a different machine. Re-ran the test, and Yup, I’m dishing out programs to each machine in the cluster!
The changed program (that now pauses forever…)
and the processes ‘running’ (paused) on each board:
Oh, and you do a CTL-C to end them all on the originating system:
Looks like I’ve got me an MPICH cluster as well as a C Build Monster!
Time to go check out the status of the blush wine ;-)
Pingback: MPICH2 Installed on Pi Cluster | Musings from the Chiefio