Clusters and Beowulfs and C.O.W.s, Oh My!

I’ve gotten a stack of Raspberry Pi boards to act as a cluster computer for compiling C programs.
https://chiefio.wordpress.com/2017/01/09/cluster-distcc-first-fire/

In comments, LG asks a rather deeper question than it might, at first read, seem:

@EMSmith:
I’m looking for a link for step-by-step creation of a headless unit. Have you written anything that I have missed ?
Thanks

So h/t to LG for that.

Often, I’ve had the same experience in searching for something. My keyword list starts off with MY biggest interest, then tapers down and varies most at the end. So “HEADLESS Debian cluster build…”

Works great, right up until your idea of the paradigm is divergent from most folks who are doing the work. It can take a while to realize you are on a paradigm raft, adrift at sea in the internet… I’ve done it many many times.

So this question first raises the issue of the paradigm. “Headless” implies a particular structure of computing. A Master node controlling a bunch of Slave nodes. (Well, now in a more P.C. world they are sometimes called “worker” nodes, as though being a ‘wage slave’ is better… but they don’t even get wages… just fed with electric power… but I digress…) There are other paradigms, and the reason searching for a “headless” node setup for distcc comes up dry is that the fundamental paradigm of distcc is NOT Master / Slave, like a Beowulf Cluster. It is a COW – a Cluster Of Workstations.

If you look at that Beowulf “HOW TO” (somewhat dated and Red Hat specific) you will note down near the bottom it lists “ssh” and “MPI” steps. That’s the magic sauce in a Beowulf. The basic hardware setup in a Beowulf is a Master (that I like to call a ‘headend’ sometimes) and it talks to your company wide ‘inside’ ethernet. It also has a second ethernet interface that talks to a Very Private to The Cluster and Very Fast ethernet switch. Then a collection of Slave nodes (“worker nodes” or “headless” nodes) also plug into that switch. They know nothing of the greater outside world, only their entirely isolated ethernet and Master node.
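The “ssh” half of that magic sauce boils down to passwordless logins from the Master to each Slave, which is what MPI job launchers lean on. A minimal sketch (the “pi” user and node hostnames here are hypothetical examples, not from my build):

```shell
# On the Master: make a key pair once (empty passphrase for batch use)
ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ""

# Push the public key to each Slave node (hostnames are examples only)
ssh-copy-id pi@node1
ssh-copy-id pi@node2

# The Master can now run commands on a Slave with no password prompt
ssh pi@node1 uname -a
```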

For a COW, all the stations are equal. They all connect directly to the general work network. They all know about each other and can even do things like reach out through a firewall to the outside world. Each of them can originate jobs, or work on jobs another hands out (‘farming’ jobs). Distcc is like that. Picture an Engineering department with 40 folks all working on a large software development project. For 2/3 of the day, each worker is not at their terminal and their station is “doing nothing”. For most of the 1/3 when they are at work, they are editing a file, in a meeting, at the coffee pot or bathroom, servicing the email queue, etc. etc. and their high performance workstation is doing nearly nothing. Using a COW structure, when any of them wants to compile their chunk of code, it gets farmed out to ALL the workstations. So even one guy working through lunch can effectively use all 40 workstations as needed. That’s the paradigm for ‘distcc’.

What I’m building toward is a Beowulf for Climate Models. I’m starting with ‘distcc’. So my paradigm looks somewhat like both. I’m talking about the nodes like they were Master and Slave, but they are built as peers.

Which leads to the ‘cheeky’ answers on how to build a headless unit:

“Unplug the monitor”…

What makes it a headless node?

“You do”…

Both very true, and not very helpful.
“For any question there is an answer that is absolutely correct, concise, and useless. -E.M.Smith”

(Someone else may have said that too, and maybe even before me, I’m not sure)

The simple answer needs a bit more elaboration…

What I Did

There are a few things I did to shift these nodes more toward “headless” and away from “COW”. First step, as noted: I unplugged the monitor, keyboard, and mouse. Technically that’s all you really need to do to turn a COW node headless. Lop off the head. However…

You also need to be able to log in and do maintenance. I chose to use “ssh” for that as it is (maybe ‘was’ now that they made it ‘more secure’ and a PITA…) easier to set up and takes less resources than VNC. I could have installed ‘tightvnc’ (it is in my build script but commented out) and used it to ‘login’ instead. In fact, I’ve got a trivial example of a vnc driven ‘headless’ unit in the Dongle Pi setup.
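For the ssh side of that, a minimal sketch of what it looks like on a Devuan style (sysvinit, no systemd) build. The package and service names are the usual Debian ones, but check your own distribution:

```shell
# On the node about to go headless: install and enable the ssh daemon
sudo apt-get install -y openssh-server
sudo update-rc.d ssh enable     # start at boot (sysvinit style)
sudo service ssh start

# From the workstation: log in remotely (the IP is an example)
ssh pi@10.168.168.41
```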

VNC –

Purpose: A remote graphical desktop interface to the target system. It lets you have a graphical desktop environment on the Puppet Pi via a screen / keyboard / mouse on your laptop.

For my initial build of the ‘headless’ nodes, I just plugged the keyboard, mouse, and monitor into the Pi directly. Once I had “ssh” working (or you could use VNC) then I didn’t need them plugged in and went back to using my workstation to further modify the headless nodes. (only a couple of times needing to move the KVM back as I blew it on something ;-)

Once you can effectively “get in” with a session from your remote workstation, you don’t need a full GUI running on the Pi, so that’s when I shut off lightdm, the desktop manager. No desktop, no manager needed.
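Shutting off lightdm on a sysvinit system is roughly this (a sketch; it assumes lightdm is the display manager on your build, as it is on mine):

```shell
# Stop the desktop manager now, and keep it from starting at boot
sudo service lightdm stop
sudo update-rc.d lightdm disable

# To get the desktop back later:
#   sudo update-rc.d lightdm enable
#   sudo service lightdm start
```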

At that point, you have a headless node. It just isn’t doing anything…

Configuring distcc

The real ‘magic sauce’ comes in the distcc configuration file. You can set this up to treat your computers as a COW of peers, or as a Master and Slaves. I’ll include one here:

from /etc/default/distcc

root@Devuan:/Climate# cat /etc/default/distcc 
# Defaults for distcc initscript
# sourced by /etc/init.d/distcc

#
# should distcc be started on boot?
#
STARTDISTCC="true"

#STARTDISTCC="false"

#
# Which networks/hosts should be allowed to connect to the daemon?
# You can list multiple hosts/networks separated by spaces.
# Networks have to be in CIDR notation, f.e. 192.168.1.0/24
# Hosts are represented by a single IP Adress
#
# ALLOWEDNETS="127.0.0.1"

ALLOWEDNETS="10.168.168.0/24"

#
# Which interface should distccd listen on?
# You can specify a single interface, identified by it's IP address, here.
#
# LISTENER="127.0.0.1"

LISTENER="10.168.168.40"

#
# You can specify a (positive) nice level for the distcc process here
#
# NICE="10"

NICE="10"

#
# You can specify a maximum number of jobs, the server will accept concurrently
#
# JOBS=""

JOBS="3"

#
# Enable Zeroconf support?
# If enabled, distccd will register via mDNS/DNS-SD.
# It can then automatically be found by zeroconf enabled distcc clients
# without the need of a manually configured host list.
#
# ZEROCONF="true"

ZEROCONF="false"

First off, the STARTDISTCC line. For a headless node that is always “true”. You want it to come up and start, no touching needed. For your personal workstation, you might want control, and prefer to start distcc manually only when you are willing to give up having 100% of your workstation to yourself. I chose to set it to ‘true’ on my headend anyway since I’m the only one using any of this hardware.
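For that manual-start workstation case, a sketch of what it looks like with the Debian init script (set STARTDISTCC="false" in the config, then drive it by hand):

```shell
# Share the workstation only when you choose to
sudo service distcc start

# Take it back
sudo service distcc stop

# See whether the daemon is up (distccd listens on TCP 3632 by default)
pgrep -a distccd
```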

Then we have: ALLOWEDNETS="10.168.168.0/24"

Here we tell the node where to look for distcc requests or where to put requests. You could have a Master with two interfaces, one outbound to 192.168.1.x that was NOT to be part of your Beowulf and one inbound to 10.168.168.x that was in the cluster. In that case, putting only the 10. address here forces all your distcc traffic into your Beowulf. Putting in the 192.168.1.x would let you be used by the entire internal network…

In my case, I’ve got my entire internal work network on that inside address block, so I’m acting like it is a COW and “any node is the same”. That is, for now, the Master Node only has one ethernet interface. At some future time, I’ll use the hardwire interface on the Pi M3 as the Cluster interface, and the WiFi wireless interface to connect out to the Rest Of World and it will become a Beowulf Master Node in full. At that time, the dedicated switch and Slave Nodes don’t get any internet connection (unless I let them route through the Pi Model 3 by turning on routing). Right now, all 3 nodes plug directly into the hardwire interface of my Netgear WiFi router. In the future I’ll plug them into a dedicated Netgear switch and the WiFi router becomes the “outside the Beowulf” network. In this way the COW becomes a Beowulf…

Next up, where do I pick up jobs?

LISTENER="10.168.168.40"

This tells my workstation where to look for requests for a distcc job. For the worker nodes, it is their hardwire ethernet interface IP number. This necessarily implies a fixed IP address, so in /etc/network/interfaces you need to set one.

root@Devuan:/Climate# cat /etc/network/interfaces
# interfaces(5) file used by ifup(8) and ifdown(8)

# Please note that this file is written to be used with dhcpcd
# For static IP, consult /etc/dhcpcd.conf and 'man dhcpcd.conf'

# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

auto lo
iface lo inet loopback

#iface eth0 inet manual

allow-hotplug wlan0
iface wlan0 inet manual
    wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf

allow-hotplug wlan1
iface wlan1 inet manual
    wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf


auto eth0
allow-hotplug eth0
iface eth0 inet static
address 10.168.168.40
netmask 255.255.255.0
gateway 10.168.168.254
dns-domain chiefio.home
dns-nameservers 127.0.0.1 192.1.1.253 10.168.168.254 8.8.8.8

It was at this point, after I’d set that hard coded IP number and found I was still getting a DHCP address too, that I discovered someone had “improved” things by splitting the dhcp setup out to a new place, so I had to manually shut off the ‘new thing’ to get only one IP address. (What I did then, in chronological order, here)
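On a dhcpcd based image, that manual shut off can be as small as one line in /etc/dhcpcd.conf (a sketch; it assumes dhcpcd is what is handing out the second address):

```
# /etc/dhcpcd.conf
# Keep dhcpcd's hands off the statically configured interface, so
# eth0 gets only the address set in /etc/network/interfaces
denyinterfaces eth0
```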

Now this matters, as it is where you determine what network is “inside” a Beowulf and what is outside. Your Master node has two networks, your Slaves only one. Your COW nodes, only one each, but with hard coded values.

Now I believe, but have not tested, that you can set that ‘listener’ to loopback and still hand out jobs to Slave nodes, while not listening for any yourself, enforcing the Master role. As there are other, more flexible, ways to set this, I haven’t tried them all. This may be part of why my workstation was getting a few more jobs than I thought it ought to get…
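If that untested notion pans out, the daemon half of the Master’s /etc/default/distcc would look something like this (hypothetical, not something I have verified):

```
# Accept distcc connections only from loopback: the Master still hands
# jobs out via DISTCC_HOSTS, but no other node can hand jobs to it
LISTENER="127.0.0.1"
ALLOWEDNETS="127.0.0.1"
```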

Who Owns My Computes?

The ‘nice’ value sets how much a job assignment can interrupt what you are doing. Nice of 19 (the maximum) doesn’t run until absolutely everything else has run. Nice of 1 is going to push you out of the way at times to get some decent work done. This config file has:

NICE="10"

so will get some work done, but not give you very much grief on what you are doing. It will be deferential to your desktop work, but still not make the guy at the other end wait forever.

On a Master node, you could set this to, say 15 or 16, while on a Slave node, maybe a 1, or leave it out altogether as the only thing it ought to be doing is distcc work.

Then there is an absolute limit on the number of distcc jobs this node is to accept. A ‘reasonable’ value is 2 x cores (as sometimes a given job is waiting on I/O or finishes before the next one shows up). Here:

JOBS="3"

I set it to 3 on my headend node. It isn’t a pure Master, where you might make it none, nor is it a Slave node where you want it slammed. On the Slave nodes, I set this to 8 as they have 4 cores each and need to be doing this and nothing but this. The headend also does the job setup, the linking, and any other non-distributed process, so for the Master node I likely need to set it to 2 or 1, especially as the cluster grows and the setup work dominates.
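The “2 x cores” rule of thumb can be computed rather than guessed at (a sketch; nproc is part of coreutils):

```shell
# distcc job limit: twice the core count
cores=$(nproc)
echo "JOBS=\"$(( cores * 2 ))\""
# a 4 core Pi prints: JOBS="8"
```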

There is a sort of a ‘self discovery’ mode for distcc that ZEROCONF turns on. I’m not familiar with it and I didn’t feel like taking on that bit of learning work at the moment, but for a very large cluster it looks like you can skip the hardcoded IP addresses and host list. Just put them in one subnet and let them find each other. That’s more a COW behaviour than a Beowulf, so wasn’t really calling my name…

Now I’d mentioned that there were a couple of places where you can configure the workload sharing. One is the “.bashrc” file in your home directory (on whatever machine is your workstation). Here’s an excerpt from the bottom of mine (dated 6 May 2016, as that was my first distcc test). So for all you folks being impressed by my distcc skill, remember I’m only 8 months ahead of you. Oh, and remember the Consultant’s Creed:

“An expert is the guy one page ahead of you in the manual.”

#added by EMS 6May2016

export PATH=/distcc:$PATH
# The remote machines that will build things for you. Don't put the ip of the Pi unless
# you want the Pi to take part to the build process.
# The syntax is : "IP_ADDRESS/NUMBER_OF_JOBS IP_ADDRESS/NUMBER_OF_JOBS" etc...
# The documentation states that you should set the number of jobs per machine to
# its number of processors. I advise you to set it to twice as much. See why in the test paragraph.
# For example:
#export DISTCC_HOSTS="localhost/3 10.168.168.41/8 10.168.168.42/8"

export DISTCC_HOSTS="localhost/2 10.168.168.41/8 10.168.168.42/8"

# When a job fails, distcc backs off the machine that failed for some time.
# We want distcc to retry immediately
export DISTCC_BACKOFF_PERIOD=0

# Time, in seconds, before distcc throws a DISTCC_IO_TIMEOUT error and tries to build the file
# locally ( default hardcoded to 300 in version prior to 3.2 )
export DISTCC_IO_TIMEOUT=3000
#
# To prevent local try of compile, uncomment this line:
#
#export DISTCC_SKIP_LOCAL_RETRY=1
# Don't try to build the file locally when a remote job failed
export DISTCC_SKIP_LOCAL_RETRY=10

So you have some more tuning opportunities, especially on your headend machine.
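For completeness, actually using those settings is just a matter of over-subscribing make’s job count and compiling through distcc (a sketch; the -j value is the sum of the /N slots in DISTCC_HOSTS):

```shell
# Slots above: localhost/2 + 10.168.168.41/8 + 10.168.168.42/8 = 18
export DISTCC_HOSTS="localhost/2 10.168.168.41/8 10.168.168.42/8"

# In the source directory, farm the compile out over the cluster
make -j18 CC=distcc

# In another terminal, watch jobs land on each node as they run
distccmon-text 1
```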

Now I think my setting JOBS=3 for my headend and maybe having RETRY set a bit odd might be why I get the workload on the headend that I get, so I may need to tune this a bit and / or read up on just what the settings are for DISTCC_SKIP_LOCAL_RETRY and what they really mean in detail. Or perhaps JOBS=3 is having me ‘listen to myself’ on the network for 3 and listen to the loopback interface for 2 more here with “localhost/2”. Yes, that kind of thing can happen depending on how the program was written. “Tuning, here we come”.

In Conclusion

So that’s the high level look (really, it is high level…) at what makes a headless node headless and how to do it.

I’m going to write up a ‘step by step’ through the process, but I think this kind of perspective posting is important before you just get down grinding through the weeds…

Oh, and FWIW, there are other kinds of clusters too. Giant Supercomputers that are a very tight cluster of 2000+ CPU/system boards on a high speed backplane (yet under the skin the concepts are the same for much of it). SETI and similar BOINC things spread over the internet with compute nodes of all sorts of different desktops at the other far end. Called Grid Computing, it differs from a COW mostly in that the machines are different architectures and spread over a slow internet instead of a fast corporate ethernet or a dedicated switched network (or an internal backplane fabric).

Don’t let all the fancy and divergent names fool you. Under it all, they are basically the same structure. Nodes that farm out work (or sometimes limited to only one node that can farm out work). Nodes that do the work. A network to connect them. As the network speed drops, you move from Parallel Processor Supercomputer to Beowulf to COW to GRID… and as the tightness of control of the headend drops from “complete” to “little” you follow the same order.

So don’t let the terminology and the “paradigms” intimidate. It’s just artificial complexity on top of “a program on this computer hands some work to that computer over there”. Unplug the monitor, it’s a headless slave. Type on the keyboard with a monitor, it is the Master… or just a COW.


About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present “hot buttons” are the mythology of Climate Change and ancient metrology; but things change...
This entry was posted in Tech Bits. Bookmark the permalink.

7 Responses to Clusters and Beowulfs and C.O.W.s, Oh My!

  1. LG says:

    @EMSmith:
    Thanks, ChiefIo, for obliging.
    Question: What’s the expected behavior of the cluster when it encounters a power outage or member(s) crash(es)?
    Does it reconverge automatically or does it need manual intervention ?

    PS.
    Recently I came across this paper describing a MS internal hardware project where FPGAs and CPUs were integrated to accelerate a cloud-scale architecture. Different scope, same basic idea.

    https://www.microsoft.com/en-us/research/wp-content/uploads/2016/10/Cloud-Scale-Acceleration-Architecture.pdf

  2. E.M.Smith says:

    @L.G.:

    You are most welcome!

    How a cluster behaves in a power fail depends some on the nature of the individual nodes.

    What it ought to do is have the individual nodes boot up, and start their daemons, and everything is fine. What actually happens can vary…

    At boot, did ‘fsck’ finish fine, or hang? If hung, you just lost a node until you manually do an fsck and reboot it.

    Were some of the systems on a UPS and some not? Then some of the tasks completed and some didn’t. You might need to restart the job in question from original start point.

    Do you have ‘checkpoint / restart’ code in the application? (Like in Model E & IIRC Model II) Then at worst you can pick up where you left off.

    The key thing to realize about most parallel compute systems is that the folks designing them ran into most of The Bad Things and did something to mitigate it. Oh, and that distcc and MPI are not the whole of it. There’s a half dozen others in common use… each with their own behaviours.

    For distcc and MPI, when the backend nodes boot, they just check in for work. At most you lost the particular bit being worked when the system crashed, but the headend ought to have marked it as ‘checked out but not done’ and resend it to some node anyway. The headend ought to wake up and be fine too. Now the question is do you have some kind of ‘restart’ built into your headend process, or not?

    For compiles, the lack of a .o file will cause that unit to be compiled again, simply by restarting the job with a repeat of the “make” command. Prior work units that completed will have produced a .o, so don’t need recomputing, and “make” will ignore them. For other codes, it depends on how you wrote the checkpoint / restart parts.

    The actual cluster itself tends to be self healing in that work units are handed out by the headend to any backend node that checks in and asks for a work unit. So if 3 out of 5 backend nodes boot, one hangs in an fsck and the other lost a powersupply, you get a cluster of one headend and 3 backend nodes. When the fsck is run, the next one finishes the boot and joins the cluster by asking for a work unit. New powersupply and that next one boots, it joins up too…

    The key thing is to realize that protocol of “check in, get work unit, when done submit results, Headend marks that run complete and hands out next work unit.” It is very robust to dropouts and restarts and dynamic cluster config changes.

    Per the FPGA et al.:

    Due to recent slow down in the rate of Moore’s Law silicon advances (basically feature size approaching quantum limits for some aspects) the world of HPC High Performance Computing is moving ever more into using massively parallel and smaller compute nodes. Also specialized nodes.

    This is showing up dramatically in the use of GPU (Graphics Processor Unit) hardware like that from NVIDIA as a general purpose Vector Processing Unit (though with a very small stride of about 4 IIRC, so 4 computes in parallel – though expect that to grow as special purpose ones are made. The 4 is due to video being only 4 dimensional in need: 2 D of screen, then color and some other parameter). An FPGA is a reconfigurable set of gates that can be turned into a specialized Vector Unit as needed, but being non-standard takes more software effort to use effectively. Best used for testing special purpose hardware or building very small lots of special purpose hardware. (Large lots you go to fab for lower $/part). So an FPGA is nifty, but somewhat specialized and niche. (The Parallella board has one, but most folks ignore it in their excitement about the 16 ARM cores on a communications fabric…)

    The end game, IMHO, will be larger “built for purpose” Vector Units with the intent being math intensive work. Basically a GPU that has a stride longer than 4… FWIW, the old Cray had a stride of 64. You handed it a group of 64 x and 64 y and said “multiply” and got returned 64 z in one clock cycle… SIMD Single Instruction Multiple Data. (Your typical CPU is SISD Single Instruction Single Data – multiple cores let you do MIMD Multiple Instructions Multiple Data…) So we’ve already started exploring MIMD and are moving into more SIMD like the old Crays…

    Welcome to the alphabet soup of Supercomputing ;-)

  3. Larry Ledwick says:

    Given how low the power consumption is on the Pi systems, a small UPS should just about eliminate any issue with power fails in normal circumstances. Just put a power strip on the UPS battery-supported output, plug your power supplies into that strip, and you are covered for any outage as long as:
    (pi_watts_in * number_of_nodes * hours_power_out) * safety_factor <= UPS watt-hour rating

    You should not have to worry about dropping power for normal power bumps and short interruptions.

    With 8 nodes at 5 watts each, power out for 0.25 hours, and a safety factor of 2, you would need a UPS rated at 20 watt-hours.
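    That arithmetic, scripted out (a sketch; awk just does the floating point):

```shell
# UPS sizing: watt-hours = node_watts * nodes * outage_hours * safety_factor
nodes=8; watts=5; hours=0.25; safety=2
awk -v n="$nodes" -v w="$watts" -v h="$hours" -v s="$safety" \
    'BEGIN { printf "UPS rating needed: %.1f watt-hours\n", w*n*h*s }'
# prints: UPS rating needed: 20.0 watt-hours
```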

    The only problem is the UPS manufacturers make it more complicated than it needs to be to size a UPS; it appears that the VA values (effectively AC watts) used for UPS specification are the RMS voltage and RMS amperage at a 0.6 power factor for the AC output side, for one hour of operation.

    Chime in if anyone else has better info on whether that is really the standard convention.

  4. E.M.Smith says:

    @larry:

    I’m happy with a giant Diesel Car Battery as my basic power supply. I figure a few weeks of power outage at the consumption of a few pi. Heck, I figure the internal discharge rate is higher than 4 x pi draw!

  5. kneel63 says:

    “An expert is the guy one page ahead of you in the manual.”

    Indeed! I’ve spent the last few weeks (includes XMas, so not as bad as it sounds) “playing” with IPSec site-to-site tunnels. Jump into that, and you’ll find all sorts of TLAs that you need to look up because you’ve forgotten what they mean, along with plenty of people who will back away with hands raised, palm out, not wanting anything to do with it (now THAT’S experience :-( ) But like so much in our game, it looks daunting – even scary – until you find you have no choice but to sink or swim, then you realise it’s actually not as bad as you first thought; that once you have your head around it, it really does make sense; and that “industry standards” that evolve from proprietary ones are normally ugly and most definitely NOT “the way I’d do it”.
    You also realise you just created your own monster – next time something goes wrong, everyone points to you and says “don’t ask me – he’s the expert!”

    So NEVER, EVER forget that quote and NEVER, EVER bandy it about much – if you blab too much about that, people will realise you ain’t that clever after all ;-) Oh, and drag someone else into it as well – you definitely don’t want to be the only one who “gets it”, or you’ll be the one with the ringing phone at 2AM XMas morning.

  6. E.M. Smith says:

    As an O.T. aside:

    I’m typing this on my “new to me” slowest Macbook Air ever!

    The Spousal Computer had the Solid State Disk fail (from the same kind of bit wear as USB sticks, it would seem, and 5 years is about when it hits… with her constant use profile). At $700 for Apple to replace it (or $400 or so for me to DIY), and with the 2 GB of memory meaning it was not going to get future OS upgrades, it was better for her to get a newer one… and I inherited this one… which I have managed to boot from a “Cruzer” USB “stick” about the size of a chicklett, 16 GB.

    It gets really slow with long lag times on ‘disk writes’ and takes a good while on large reads like launching applications…

    OTOH, I can now post and do similar things without being tethered to a monitor.

    Who knows, in a few months I might even buy a real SSD and the install kit for the worlds tiniest torx screws ;-)

  7. E.M.Smith says:

    Oh, and a minor “Security Trick” that also can be used to make the backend of a cluster less visible to the rest of the network:

    Any given machine can see the network directly outside its ethernet interface. DHCP hands over all the stuff it needs to see the rest of the world, too. When you set the address manually, you must provide all that stuff. But you are not required to provide it if you don’t want that function…

    So to get to the Rest Of World (ROW), any given network or subnet needs a ‘gateway’. A box with 2 interfaces, one on each network, and a router active between them. Your machine must know what address to talk to to reach that router to see the rest of the world. You do this with the ‘gateway’ command in the /etc/network/interfaces file (noted above).

    gateway 10.168.168.254
    

    That says to talk to that IP address to get to the ROW. Take that line out, your machine can still see all the computers inside its local network, but can’t ‘talk out’ to the ROW.

    You can also use this ‘feature’ outside of a cluster as a security feature (if a small one). Say you have a file server set up and need it to share files to your home network of 192.168.1.0/24 but do NOT want it talking to the ROW. Just leave out the Gateway line. Now anyone hacking in can’t get traffic back from that particular machine. Hard to crack it if you can’t get it to talk to you…
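    In /etc/network/interfaces terms, the ‘mute’ file server stanza is just the static block from the post with the gateway line left out (addresses here are hypothetical):

```
# Static address on the home net, no gateway line: the box can talk
# to 192.168.1.0/24 and nothing beyond it
auto eth0
iface eth0 inet static
address 192.168.1.10
netmask 255.255.255.0
```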

    To crack it, you first have to compromise some OTHER machine, then use that machine which is on the local network as a sock puppet to crack into the file server. Now if you want to leave your file server up 24/7 and only fire up your desktop or laptop when at home using it, you have significantly reduced your attack surfaces, especially when you are not at home.

    Yes, a minor security enhancement. Yet it is a collection of dozens of these things, each one minor, that results in a wall of “Aw Shits” in front of the system cracker and slows them down enough for your IDS / IPS or even just the blinky lights on your router to give you clue. The goal is to frustrate them enough they give up, while giving you more time to discover and counter.
