I’ve mentioned a few times that it is hard to get the Raspberry Pi Model 2 fully loaded up on CPU usage in normal use. I’ve also posted about “issues” with parallel FORTRAN not being very efficient / not giving much gain.
I think this is related to “Amdahl’s Second Law” (also called “Amdahl’s Other Law”).
An interesting article about a DIY cluster computer references it here:
Designing the System
Microwulf is intended to be a small, cost-efficient, high-performance, portable cluster. With this set of somewhat conflicting goals, we set out to design our system.
Back in the late 1960s, Gene Amdahl laid out a design principle for computing systems that has come to be known as “Amdahl’s Other Law”. We can summarize this principle as follows: To be balanced, the following characteristics of a computing system should all be the same:
- the hertz of CPU speed
- the bytes of main (RAM) memory
- the bits per second of (I/O) bandwidth
The basic idea is that there are three different ways to starve a computation: deprive it of CPU; deprive it of the main memory it needs to run; and deprive it of I/O bandwidth it needs to keep running. Amdahl’s “Other Law” says that to avoid such starvation, you need to balance a system’s CPU speed, the available RAM, and the I/O bandwidth.
For Beowulf clusters running parallel computations, we can translate Amdahl’s I/O bandwidth into network bandwidth, at least for communication-intensive parallel programs. Our challenge is thus to design a cluster in which the CPUs’ speeds (GHz), the amount of RAM (GB) per core, and the network bandwidth (Gbps) per core are all fairly close to one another, with the CPU’s speeds as high as possible, while staying within our $2500 budget.
FWIW, I worked at Amdahl Corp in the early ’80s, so was exposed to various of Amdahl’s ideas early on.
Back At The Pi M2
So with the Raspberry Pi Model 2, we have 100 Mb/sec Ethernet, about 10 – 30 MB/sec of I/O to the SD card, 480 Mb/sec of USB, 1 GB of memory, and 4 cores of compute power running at 900 MHz each (or 1 GHz if overclocked, which I do). Call it 4 GHz of aggregate clock.
For the SD card, allowing some for overhead, that’s about 0.1 Gb/second to 0.3 Gb/second. (10 MegaBytes/second, at roughly 10 bits per byte once overhead bits are counted, is about 100 Megabits/second, or 0.1 Gigabits/second.)
That gives us a CPU : RAM : USB 2.0 : SD : COM ratio of:
4 : 1 : 0.48 : 0.1 – 0.3 : 0.1
The USB and Ethernet also share some of the hardware, so you can’t drive both at full speed at the same time.
Adding up the max SD and the USB/Ethernet, you get about 0.48 + 0.3 = 0.78. Call it 0.8 Gb/second.
That gives an overall CPU : RAM : I/O of about:
4 : 1 : 0.8
A fairly long way from 1 : 1 : 1 as the ideal.
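To make the arithmetic explicit, here’s a quick Python sketch of that balance check. The numbers are the ones above; lumping the best-case SD figure together with USB (which also carries the Ethernet) into one I/O total is my simplification:

```python
# Amdahl "Other Law" balance check for the Raspberry Pi Model 2.
# The ideal is CPU GHz : RAM GB : I/O Gb/s of 1 : 1 : 1.

cpu_ghz  = 4 * 1.0       # 4 cores at 1 GHz (overclocked)
ram_gb   = 1.0           # 1 GB of main memory
usb_gbps = 0.48          # USB 2.0, shared with the 100 Mb Ethernet
sd_gbps  = 0.3           # best-case SD card, ~30 MB/s at ~10 bits/byte
io_gbps  = usb_gbps + sd_gbps   # 0.78, call it 0.8

# Normalize to RAM = 1 to get the CPU : RAM : I/O ratio.
print(f"{cpu_ghz / ram_gb:.1f} : 1 : {io_gbps / ram_gb:.2f}")
# -> "4.0 : 1 : 0.78", a long way from 1 : 1 : 1
```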
In short, it looks like the original balance of one core with memory and I/O has been upset with the 4 core chip, and there was no action taken in the design to fix that. (Like, oh, having USB 3.0 or Gig-E Ethernet).
On Making Small Portable Clusters
Not quite P.G.’s “Beer can” computer, but a nice “lunch pail” computer:
It very nicely addresses another key aspect of HPC (High Performance Computing): power and heat. These are essential to address in making any HPC device effective; “computes per Watt” and heat extraction are key.
This design uses a PC/104 type board. That’s a standard “computer on a small board” format for embedded systems. The Raspberry Pi didn’t originate the idea of an SBC (Single Board Computer), it just drove the price down dramatically. From the “$100-$300” range to the $10 to $30 range…
In the following quote, “DQ” is their second machine. I think DQ stands for Drag Queen, given the article description / photo…
Miniclusters were first created by Mitch Williams of Sandia/Livermore Laboratory in 2000. Figure 1 shows a picture of his earliest cluster, Minicluster I. This cluster consisted of four Advanced Digital Logic boards, using 277MHz Pentium processors. These boards had connectors for the PC/104+ bus, which is a PC/104 bus with an extra connector for PCI.
[… skipping to page 2]
Sandia was not asleep at the time. Mitch built Minicluster II, which used much more powerful PIII processors. The packaging was very similar to Minicluster I. Once again, we ported LinuxBIOS to this newer node, and the cluster was built to have one master with one disk and three slaves. The slave nodes booted in 12 seconds on this system. In a marathon effort, we got this system going at SC 2002 about the same time the lights started going out. Nevertheless, it worked.
One trend we noticed with the PIII nodes was increased power consumption. The nodes were faster, and the technology was newer, and the power needed was still higher. The improved fabrication technology of the newer chips did not provide a corresponding reduction in power demand—quite the contrary.
It was no longer possible to build DQ with the PIII nodes—they were just too power-hungry. We went down a different path for a while, using the Advantech PCM-5823 boards as shown in Figure 5. There are four CPU boards, and the top board is a 100Mbit switch from Parvus. This switch is handy—it has five ports, so you can connect it directly to your laptop. We needed a full-size PC power supply to run this cluster, but in many ways it was very nice. We preserved instant boot with LinuxBIOS and bproc, as in the earlier systems.
As of 2004, again working with Mitch Williams of Sandia, we decided to try one more Pentium iteration of the minicluster and set our hungry eyes on the new ADL855PC from Advanced Digital Logic. This time around, things did not work out as well.
Second, the power demand of a Pentium M is astounding. We had expected these to be low-power CPUs, and they can be low power in the right circumstances, but not when they are in heavy use. When we first hooked up the ADL855PC with the supplied connector, which attaches to the hard drive power supply, it would not come up at all. It turned out we had to fabricate a connector and connect it directly to the motherboard power supply lines, not the disk power supply lines, and we had to keep the wires very short. The current inrush for this board is large enough that a longer power supply wire, coupled with the high inrush current, makes it impossible for the board to come up. We would not have believed it had we not seen it.
Instead of the 2A or so we were expecting from the Pentium M, the current needed was more on the order of 20A peak. A four-CPU minicluster would require 80A peak at 5 VDC. The power supply for such a system would dwarf the CPUs; the weight would be out of the question. We had passed a strange boundary and moved into a world where the power supply dominated the size and weight of the minicluster. The CPUs are small and light; the power supply is the mass of a bicycle.
That’s the way things usually go. More computes, more mass, much more power, and a hot heavy mass of metal needing a load of heat extraction.
The ARM chip takes a different path from the Intel folks. Small, low power, efficient, even if a bit slower per chip. But for a small portable cluster, this can be a big feature… These folks were building a Beowulf Cluster to live in a large lunch box and be highly portable, without a Giant Sucking Sound from the Turbo Fan Heat Mover…
The Pentium M was acceptable for a minicluster powered by AC, as long as we had large enough tires. It was not acceptable for our next minicluster. We at LANL had a real desire to build 16 nodes into the lunchbox and run it all on one ThinkPad power supply. PC/104 would allow it, in terms of space. The issues were heat and power.
What is the power available from a ThinkPad power supply? For the supplies we have available from recent ThinkPads, we can get about 4.5A at 16 VDC, or 72 Watts. The switches we use will need 18 Watts, so the nodes are left with about 54 Watts between them. This is only 3W per node, leaving a little headroom for power supply inefficiencies. If the node is a 5V node, common on PC/104, then we would like .5A per node or less.
This power budget pretty much rules out most Pentium-compatible processors. Even the low-power SC520 CPUs need 1.5A at 5V, or 7.5 Watts—double our budget. We had to look further afield for our boards.
We settled on the Technologic TS7200 boards for this project. The choice of a non-Pentium architecture had many implications for our software stack, as we shall see.
The TS7200, offered by Technologic Systems, is a StrongARM-based single-board computer. It is, to use a colloquialism, built like a brick outhouse. All the components are soldered on. There are no heatsinks—you can run this board in a closed box with no ventilation. It has a serial port and Ethernet port built on, requiring no external dongles or modules for these connections. It runs on 5 VDC, and requires only .375A, or roughly 2W to operate. In short, this board meets all our requirements. Figure 6 is a picture of the board. Also shown in Figure 6 is a CompactFlash plugged in to the board, although we do not use one on our lunchbox nodes.
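The power budget arithmetic in that quote is worth making explicit. A quick Python sketch, using only the figures quoted above (the rounding comments are mine):

```python
# Lunchbox cluster power budget, per the quoted article.
supply_w = 4.5 * 16       # ThinkPad supply: 4.5 A at 16 VDC = 72 W
switch_w = 18             # the Ethernet switches
nodes    = 16

per_node_w = (supply_w - switch_w) / nodes   # ~3.4 W available per node
per_node_a = 0.5          # their target at 5 VDC (2.5 W), leaving headroom

ts7200_w = 0.375 * 5      # TS7200 draw: ~1.9 W, comfortably inside the budget
sc520_w  = 1.5 * 5        # "low-power" SC520: 7.5 W, roughly double the budget

print(f"{per_node_w:.2f} W/node budget; TS7200 {ts7200_w:.2f} W; SC520 {sc520_w:.1f} W")
```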
The article goes on to the physical assembly of the stack, making a miniature Ethernet switch to fit in the lid of the package, loading the OS, etc.
All in all, a nice little portable Beowulf Cluster. (Though it looks more like a small tool box than lunch box to me…)
There are some basic truths in life. When you find one, it can be useful forever.
If you are designing your own multiprocessor system, or even just buying a multicore box, two of them are Amdahl’s Second Law, and the fact that power and heat management matter.
For the Raspberry Pi Model 2, they got the power and heat right, but the balance of I/O to CPU is way off. In practical use, this shows up as a tendency to “swap” under modest loads from memory hogs like many open browser tabs (I have swap on a USB disk so as not to ‘wear’ my SD card), and difficulty getting all 4 cores “loaded up” with work at the same time. Rarely does it go over 60% utilization.
Furthermore, when a strongly compute-intensive task is given to the card, it has trouble walking and chewing gum at the same time. Not only is there little improvement in throughput with parallel FORTRAN, there are also simple stability problems.
I loaded up 2 cores with Golomb Ruler searches. They run about 99% utilization with nearly zero I/O. Nice way to balance the total system resource use…
Except that when I came back this morning and started using the browser to enter this article, about 3 sentences in, the system hung. This is not the first time I’ve had that happen. It isn’t often (about once a month?). It only happens when I’ve put compute-intensive tasks on the cores and loaded up the system. In normal use, it doesn’t happen. But in normal use system demand is about 1/2 of what is available… This might just be a Raspbian OS issue, but until further tested it must be treated as ‘generic’. Other Linux / OS options might get past this issue.
So my conclusion from this is that the Raspberry Pi Model 2 is not suited to making a Beowulf Cluster. The individual card is imbalanced. I/O is too limited. It isn’t stable under intensive compute tasks AND other I/O-oriented work at the same time. It can, and does, hang when heavily driven; even if rarely.
Now if you JUST used it for a compute engine, no X-windows. No browser. Would it then work? Most likely. It ran fine all night with 2 cores loaded up with Golomb Ruler searches. I don’t know what part got tickled and hung. Perhaps just not using X would be “enough”. At some point I am going to “load it up” with 4 Golomb Ruler searches and NO windowing system, just to see if it “can take it”. But at the moment, it is certainly not suited to being a mixed use card with heavy compute load, and has insufficient I/O for heavy I/O loads.
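For anyone unfamiliar with the workload: a Golomb Ruler search is nearly pure integer arithmetic with almost no I/O, which is what makes it such a good pure-CPU load. A minimal Python sketch of the defining property (not the optimized search client actually used):

```python
# A set of marks is a Golomb Ruler if every pairwise distance is unique.
from itertools import combinations

def is_golomb(marks):
    """Return True if all pairwise distances between marks are distinct."""
    diffs = [b - a for a, b in combinations(sorted(marks), 2)]
    return len(diffs) == len(set(diffs))

print(is_golomb([0, 1, 4, 9, 11]))  # True: an optimal 5-mark ruler
print(is_golomb([0, 1, 2, 4]))      # False: the distance 1 repeats (0-1, 1-2)
```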
In short, it is what it was designed to be: a toy system for learning that can do basic day to day things. A Good Enough computer for dirt cheap.
For making a cluster, I’d use single-CPU boards that each have their own I/O. That gets you closer to a balanced system, and you don’t have the problem of the cores interacting “not so well” as seen on a heavily loaded Model 2.
This matters to me as I’m hoping to build just such a cluster. I’ll now likely use the Model B+ as the basic module (especially if price erosion continues ;-) since a set of 4 at about $100 is going to let me use all 4 cores fully without issues, and will have 4 x the I/O and be a more balanced system. A $50 Model 2 (with add-ons) is only going to let me keep 2 cores loaded up anyway… so the $/MIPS used is about the same. (I’d use the Zero, but have not yet figured out how to get USB-Ethernet out of it…)
This also further reinforces the idea that the CubieTruck (CubieBoard 4) has a better balance of I/O to CPUs and that even with the higher price, it is a better $ / “usable CPU” proposition for a desktop alternative.
This also points to a simple way to evaluate alternatives: compute that CPU GHz : Memory GB : I/O Gb ratio and look for 1 : 1 : 1 as the goal.
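Wrapped up as a tiny helper, that test is one line per candidate board. A sketch; the Model B+ figures here are round illustrative numbers (single 700 MHz core, 512 MB RAM, same USB 2.0 / Ethernet / SD subsystem), not measured specs:

```python
# CPU GHz : RAM GB : I/O Gb/s balance test, normalized to RAM = 1.
def balance(name, cpu_ghz, ram_gb, io_gbps):
    """Print the ratio; the ideal is 1 : 1 : 1."""
    print(f"{name}: {cpu_ghz / ram_gb:.1f} : 1 : {io_gbps / ram_gb:.1f}")

balance("Pi Model 2 (overclocked)", 4 * 1.0, 1.0, 0.8)  # -> 4.0 : 1 : 0.8
balance("Pi Model B+", 0.7, 0.5, 0.8)                   # -> ~1.4 : 1 : 1.6
```

The single-core board comes out much closer to balanced, which is the point.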