Raspberry Pi Model 2 and Amdahl

I’ve mentioned a few times that it is hard to get the Raspberry Pi Model 2 fully loaded up on CPU usage in normal use. I’ve also posted about “issues” with parallel FORTRAN not being very efficient / not giving much gain.

I think this is related to “Amdahl’s Second Law” (also called “Amdahl’s Other Law”).

An interesting article about a DIY cluster computer references it here:

http://www.clustermonkey.net/Projects/microwulf-breaking-the-100gflop-barrier.html

Designing the System

Microwulf is intended to be a small, cost-efficient, high-performance, portable cluster. With this set of somewhat conflicting goals, we set out to design our system.

Back in the late 1960s, Gene Amdahl laid out a design principle for computing systems that has come to be known as “Amdahl’s Other Law”. We can summarize this principle as follows:

    To be balanced, the following characteristics of a 
    computing system should all be the same:

        the hertz of CPU speed
        the bytes of main (RAM) memory
        the bits per second of (I/O) bandwidth

The basic idea is that there are three different ways to starve a computation: deprive it of CPU; deprive it of the main memory it needs to run; and deprive it of I/O bandwidth it needs to keep running. Amdahl’s “Other Law” says that to avoid such starvation, you need to balance a system’s CPU speed, the available RAM, and the I/O bandwidth.

For Beowulf clusters running parallel computations, we can translate Amdahl’s I/O bandwidth into network bandwidth, at least for communication-intensive parallel programs. Our challenge is thus to design a cluster in which the CPUs’ speeds (GHz), the amount of RAM (GB) per core, and the network bandwidth (Gbps) per core are all fairly close to one another, with the CPU’s speeds as high as possible, while staying within our $2500 budget.

FWIW, I worked at Amdahl Corp in the early ’80s, so was exposed to various of Amdahl’s ideas early on.

Back At The Pi M2

So with the Raspberry Pi Model 2, we have 100 Mb/sec Ethernet, about 10 – 30 MB/s of I/O to the SD card, 480 Mb/sec of USB 2.0, 1 GB of memory, and 4 cores of computer power running at 900 MHz each (or 1 GHz if overclocked, which I do). Call it 4 GHz of aggregate clock.

For the SD card, allowing some for overhead, that’s about 0.1 Gb/second to 0.3 Gb/second. ( 10 MegaBytes/second is 80 Megabits/second; add a couple of overhead bits per byte and it is about 100 Megabits/second, or 0.1 Gigabits/second )

That gives us a CPU : RAM : USB 2.0 : SD : COM (Ethernet) ratio (in GHz : GB : Gb/s : Gb/s : Gb/s) of:

4 : 1 : 0.48 : 0.1 – 0.3 : 0.1

The USB and Ethernet also share some of the hardware, so you can’t drive both at full speed at the same time.

Adding up the max SD and the USB/Ethernet, you get about 0.48 + 0.3 = 0.78. Call it 0.8 Gb/second.

That gives an overall CPU : RAM : I/O of about:

4 : 1 : 0.8

A fairly long way from 1 : 1 : 1 as the ideal.
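
For those who want the arithmetic in one place, here is a minimal Python sketch of that ratio calculation. The figures are just the rough numbers quoted above (1 GHz per overclocked core, ~30 MB/s best case SD), normalized so RAM = 1:

    # Rough "Amdahl balance" check for the Raspberry Pi Model 2.
    # All figures are the approximate values used in the text above.

    cpu_ghz  = 4 * 1.0    # 4 cores at ~1 GHz (overclocked) of aggregate clock
    ram_gb   = 1.0        # 1 GB of main memory
    usb_gbps = 0.48       # USB 2.0 at 480 Mb/s
    sd_gbps  = 0.3        # best case SD card I/O, ~30 MB/s => ~0.3 Gb/s
    eth_gbps = 0.1        # 100 Mb/s Ethernet (shares hardware with USB)

    # USB and Ethernet share hardware, so take USB plus SD as the I/O ceiling.
    io_gbps = usb_gbps + sd_gbps

    print("CPU : RAM : I/O = %.1f : %.1f : %.1f"
          % (cpu_ghz / ram_gb, ram_gb / ram_gb, io_gbps / ram_gb))
    # => CPU : RAM : I/O = 4.0 : 1.0 : 0.8  (vs. the 1 : 1 : 1 ideal)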

In short, it looks like the original balance of one core with memory and I/O has been upset with the 4 core chip, and there was no action taken in the design to fix that. (Like, oh, having USB 3.0 or Gig-E Ethernet).

On Making Small Portable Clusters

Not quite P.G.’s “Beer can” computer, but a nice “lunch pail” computer:

http://www.linuxjournal.com/article/8177

It very nicely addresses another key aspect of HPC (High Performance Computing): power and heat. Managing these is essential to making any HPC device effective: “computes per Watt” and heat extraction are key.

This design uses a PC/104 type board. That’s a standard “computer on a small board” for embedded systems. The Raspberry Pi didn’t originate the idea of an SBC (Single Board Computer), it just drove the price down dramatically. From the “$100-$300” range to the $10 to $30 range…

In the following quote, “DQ” is their second machine. I think DQ stands for Drag Queen, given the article description / photo…

Miniclusters

Miniclusters were first created by Mitch Williams of Sandia/Livermore Laboratory in 2000. Figure 1 shows a picture of his earliest cluster, Minicluster I. This cluster consisted of four Advanced Digital Logic boards, using 277MHz Pentium processors. These boards had connectors for the PC/104+ bus, which is a PC/104 bus with an extra connector for PCI.
[… skipping to page 2]
Sandia was not asleep at the time. Mitch built Minicluster II, which used much more powerful PIII processors. The packaging was very similar to Minicluster I. Once again, we ported LinuxBIOS to this newer node, and the cluster was built to have one master with one disk and three slaves. The slave nodes booted in 12 seconds on this system. In a marathon effort, we got this system going at SC 2002 about the same time the lights started going out. Nevertheless, it worked.

One trend we noticed with the PIII nodes was increased power consumption. The nodes were faster, and the technology was newer, and the power needed was still higher. The improved fabrication technology of the newer chips did not provide a corresponding reduction in power demand—quite the contrary.


It was no longer possible to build DQ with the PIII nodes—they were just too power-hungry.
We went down a different path for a while, using the Advantech PCM-5823 boards as shown in Figure 5. There are four CPU boards, and the top board is a 100Mbit switch from Parvus. This switch is handy—it has five ports, so you can connect it directly to your laptop. We needed a full-size PC power supply to run this cluster, but in many ways it was very nice. We preserved instant boot with LinuxBIOS and bproc, as in the earlier systems.
[…]
As of 2004, again working with Mitch Williams of Sandia, we decided to try one more Pentium iteration of the minicluster and set our hungry eyes on the new ADL855PC from Advanced Digital Logic. This time around, things did not work out as well.
[…]
Second, the power demand of a Pentium M is astounding. We had expected these to be low-power CPUs, and they can be low power in the right circumstances, but not when they are in heavy use. When we first hooked up the ADL855PC with the supplied connector, which attaches to the hard drive power supply, it would not come up at all. It turned out we had to fabricate a connector and connect it directly to the motherboard power supply lines, not the disk power supply lines, and we had to keep the wires very short. The current inrush for this board is large enough that a longer power supply wire, coupled with the high inrush current, makes it impossible for the board to come up. We would not have believed it had we not seen it.

Instead of the 2A or so we were expecting from the Pentium M, the current needed was more on the order of 20A peak. A four-CPU minicluster would require 80A peak at 5 VDC. The power supply for such a system would dwarf the CPUs; the weight would be out of the question. We had passed a strange boundary and moved into a world where the power supply dominated the size and weight of the minicluster. The CPUs are small and light; the power supply is the mass of a bicycle.

That’s the way things usually go. More computes, more mass, much more power, and a hot heavy mass of metal needing a load of heat extraction.

The ARM chip takes a different path from the Intel folks. Small, low power, efficient, even if a bit slower per chip. But, for a small portable cluster, this can be a big feature… These folks were building a Beowulf Cluster to live in a large lunch box, with high portability and without a Giant Sucking Sound from the Turbo Fan Heat Mover…

The Pentium M was acceptable for a minicluster powered by AC, as long as we had large enough tires. It was not acceptable for our next minicluster. We at LANL had a real desire to build 16 nodes into the lunchbox and run it all on one ThinkPad power supply. PC/104 would allow it, in terms of space. The issues were heat and power.

What is the power available from a ThinkPad power supply? For the supplies we have available from recent ThinkPads, we can get about 4.5A at 16 VDC, or 72 Watts. The switches we use will need 18 Watts, so the nodes are left with about 54 Watts between them. This is only 3W per node, leaving a little headroom for power supply inefficiencies. If the node is a 5V node, common on PC/104, then we would like .5A per node or less.

This power budget pretty much rules out most Pentium-compatible processors. Even the low-power SC520 CPUs need 1.5A at 5V, or 7.5 Watts—double our budget. We had to look further afield for our boards.

We settled on the Technologic TS7200 boards for this project. The choice of a non-Pentium architecture had many implications for our software stack, as we shall see.

The TS7200

The TS7200, offered by Technologic Systems, is a StrongARM-based single-board computer. It is, to use a colloquialism, built like a brick outhouse. All the components are soldered on. There are no heatsinks—you can run this board in a closed box with no ventilation. It has a serial port and Ethernet port built on, requiring no external dongles or modules for these connections. It runs on 5 VDC, and requires only .375A, or roughly 2W to operate. In short, this board meets all our requirements. Figure 6 is a picture of the board. Also shown in Figure 6 is a CompactFlash plugged in to the board, although we do not use one on our lunchbox nodes.
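
Just to sanity check the quoted power budget with a bit of arithmetic of my own (using the article’s figures: 72 W supply, 18 W of switches, 16 nodes, and a TS7200 drawing 0.375 A at 5 VDC):

    # Sanity check of the lunchbox cluster power budget described above.
    supply_w   = 4.5 * 16.0          # ThinkPad supply: 4.5 A at 16 VDC = 72 W
    switch_w   = 18.0                # Ethernet switches
    node_count = 16

    per_node_w = (supply_w - switch_w) / node_count
    print("Watts available per node: %.1f" % per_node_w)      # ~3.4 W

    ts7200_w = 0.375 * 5.0           # TS7200: 0.375 A at 5 VDC ~= 1.9 W
    total_w  = node_count * ts7200_w + switch_w
    print("16 x TS7200 + switches: %.0f W of %.0f W" % (total_w, supply_w))
    # => about 48 W of 72 W, so the StrongARM boards fit with headroom,
    #    while 7.5 W SC520 nodes (120 W plus switches) clearly would not.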

The article goes on to cover physical assembly of the stack, making a miniature Ethernet switch to fit in the lid of the package, loading the OS, and so on.

All in all, a nice little portable Beowulf Cluster. (Though it looks more like a small tool box than lunch box to me…)

In Conclusion

There are some basic truths in life. When you find one, it can be useful forever.

If you are designing your own multiprocessor system, or even just buying a multicore box, two of them are Amdahl’s Second Law and the fact that power and heat management matter.

For the Raspberry Pi Model 2, they got the power and heat right, but the balance of I/O to CPU is way off. In practical use, this shows up as a tendency to “swap” under modest loads from memory hogs like many open browser tabs (I keep swap on a USB disk so as not to ‘wear’ my SD card), and as difficulty getting all 4 cores “loaded up” with work at the same time. Rarely does it go over 60% utilization.

Furthermore, when a strongly compute intensive task is given to the card, it has trouble walking and chewing gum at the same time. Not only is the throughput improvement with parallel FORTRAN low, but simple stability suffers.

I loaded up 2 cores with Golomb Ruler searches. They run about 99% utilization with nearly zero I/O. Nice way to balance the total system resource use…

Except that when I came back this morning and started using the browser to enter this article, about 3 sentences in, the system hung. This is not the first time I’ve had that happen. It isn’t often (about once a month?). It only happens when I’ve put compute intensive tasks on the cores and loaded up the system. In normal use, it doesn’t happen. But in normal use, system demand is about 1/2 of what is available… This might just be a Raspbian OS issue, but until further tested it must be treated as ‘generic’. Other Linux / OS options might get past this issue.

So my conclusion from this is that the Raspberry Pi Model 2 is not suited to use in making a Beowulf Cluster. The individual card is imbalanced. I/O is too limited. It isn’t stable under intensive compute tasks AND other I/O oriented work at the same time. It can, and does, hang when heavily driven, even if rarely.

Now if you JUST used it for a compute engine, no X-windows. No browser. Would it then work? Most likely. It ran fine all night with 2 cores loaded up with Golomb Ruler searches. I don’t know what part got tickled and hung. Perhaps just not using X would be “enough”. At some point I am going to “load it up” with 4 Golomb Ruler searches and NO windowing system, just to see if it “can take it”. But at the moment, it is certainly not suited to being a mixed use card with heavy compute load, and has insufficient I/O for heavy I/O loads.

In short, it is what it was designed to be: a toy system for learning that can do basic day to day things. A Good Enough computer for dirt cheap.

For making a cluster, I’d use the single-core boards that each have their own I/O. That gets you closer to a balanced system, and you don’t have the problem of the cores interacting “not so well” as seen on a heavily loaded Model 2.

This matters to me as I’m hoping to build just such a cluster. I’ll now likely use the Model B+ as the basic module (especially if price erosion continues ;-) since a set of 4 at about $100 is going to let me use all 4 cores fully without issues, and will have 4 x the I/O and be a more balanced system. A $50 Model 2 (with add ons) is only going to let me keep 2 cores loaded up anyway… so the $ / MIPS used is about the same. (I’d use the Zero, but have not yet figured out how to get USB-Ethernet out of it…)

This also further reinforces the idea that the CubieTruck (CubieBoard 4) has a better balance of I/O to CPUs and that even with the higher price, it is a better $ / “usable CPU” proposition for a desktop alternative.

This also points to a simple way to evaluate alternatives: compute that CPU GHz : Memory GB : I/O Gb/s ratio and look for 1 : 1 : 1 as the goal.



4 Responses to Raspberry Pi Model 2 and Amdahl

  1. Larry Ledwick says:

    One other item to consider on Beowulf type clusters. For maximum efficiency you also need to match CPU speed to the number of nodes in use in the cluster. If the CPU nodes are too fast for the size of the cluster, a node finishes its work unit before the master controller comes around to check on its status, and spends a lot of time just waiting for a new work unit. The ideal relationship is that the CPU is just the right speed, so it finishes its computational task just slightly ahead of getting polled for status by the master controller, and so spends most of its time processing rather than waiting for new work tasks. That implies that as the number of nodes goes up, you actually want slightly slower CPU cycle time (or a faster master control node) which can partition out work packets at just the right speed so that all the nodes spend the majority of their time doing actual computation.

    Interesting balancing act to keep all your critical failure choke points engineered so they all reach choke flow at about the same time instead of one dominating all the others.

  2. E.M.Smith says:

    I think I may have “got clue” on the “hang” issue.

    1st Clue: It happens when computes are heavy AND communication is going on.

    2nd Clue: It seems to happen when SWAP is active, or about to be (opening another browser tab with many already open).

    3rd Clue: A “hang” also happens if SWAP is sent to a USB disk that ‘sleeps’ (such as the Toshiba IIRC) as it doesn’t cause a ‘wake up’.

    What would happen if the USB were prone to “issues” and the SWAP couldn’t complete?

    Some directed searches turned up a lot of pages about “USB Dropped Packets”, showing slow improvement over time.

    https://duckduckgo.com/?q=raspberry+Pi+drop+USB+packets

    The most helpful one is here:

    http://ludovicrousseau.blogspot.com/2014/04/usb-issues-with-raspberry-pi.html

    USB issues with a Raspberry Pi

    Some people report problems with my CCID driver and a Raspberry Pi. The problem is not with the CCID driver but with the Raspberry Pi itself.

    I don’t know if the problem is hardware, software or a combination of the two. I found a description of the problem on the excellent website yoctopuce.com. For example from the article “Cook and Hold with Raspberry Pi (video)” you can read:

    There is one caveat on the Raspberry Pi: the USB support is still somewhat perfectible, and we will need to configure it to make it work reliably. The problem is, the RasPi will occasionally drop USB packets for “full-speed” peripherals (such as keyboard, mouse, modems, as well as some audio devices) when working in standard “high-speed” mode. The problem is less acute with the most recent firmware, but it is not completely solved. The only reliable workaround for now is to force all peripherals to run in “full-speed” mode. This will have the negative side effect of limiting all peripherals (including the on-board network adapter) to 1.5 MBytes/s, but anyway, the Raspberry Pi is not designed to be a race horse…

    To force USB to run in “full-speed” mode, simply add dwc_otg.speed=1 to the /boot/cmdline.txt file, as follows:

        dwc_otg.lpm_enable=0 console=ttyAMA0,115200 kgdboc=ttyAMA0,115200
        dwc_otg.speed=1 console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4
        elevator=deadline rootwait
    

    So sometime “later” I’m going to try that and see if my “sporadic hangs at full load + I/O” end.

    I may also simply stop swapping to USB disk and let the SD card “wear” and see if that stops it too. It would be a useful confirmation of a sort.

    Isn’t it fun what ‘debugging’ forces you to learn against your will? /sarc;>

  3. E.M.Smith says:

    And on the topic of fast SD card instead of USB for swap… someone has benchmarked the various cards, so I don’t need to! Yeay!!!

    http://www.midwesternmac.com/blogs/jeff-geerling/raspberry-pi-microsd-card

    In my experience, one of the highest-impact upgrades you can perform to increase Raspberry Pi performance is to buy the fastest possible microSD card—especially for applications where you need to do a lot of random reads and writes.

    There is an order-of-magnitude difference between most cheap cards and the slightly-more-expensive ones (even if both are rated as being in the same class)—especially in small-block random I/O performance. As an example, if you use a normal, cheap microSD card for your database server, normal database operations can literally be 100x slower than if you used a standard microSD card.

    Because of this, I went and purchased over a dozen different cards and have been putting them through their paces. Here are the results of those efforts, in a nice tabular format:

    Card Make/Model           hdparm buffered read   dd write     4K rand read   4K rand write
    OWC Envoy SSD (USB)       34.13 MB/s             34.4 MB/s    7.06 MB/s      8.20 MB/s
    SanDisk Ultra Fit (USB)   31.72 MB/s             14.5 MB/s    4.99 MB/s      1.07 MB/s
    Samsung EVO+              18.45 MB/s             14.0 MB/s    8.02 MB/s      3.00 MB/s
    Samsung EVO               17.39 MB/s             10.4 MB/s    5.36 MB/s      1.05 MB/s
    SanDisk Extreme Pro       18.43 MB/s             17.6 MB/s    7.52 MB/s      1.18 MB/s
    SanDisk Extreme           18.51 MB/s             18.3 MB/s    8.10 MB/s      2.30 MB/s
    SanDisk Ultra             17.73 MB/s              7.3 MB/s    5.34 MB/s      1.52 MB/s
    Transcend Premium 300x    18.14 MB/s             10.3 MB/s    5.21 MB/s      0.84 MB/s
    PNY Turbo (C10 90MB/s)    17.46 MB/s             TODO         6.25 MB/s      0.62 MB/s
    Kingston (C10)            12.80 MB/s              7.2 MB/s    5.56 MB/s      0.17 MB/s
    No-name (C4)              13.37 MB/s             < 1 MB/s     < 0.1 MB/s     < 0.01 MB/s
    

    […]

    However, judging by performance vs. price, there are a couple clear standout cards—one is the Samsung Evo+, which is the fastest card for random access by a mile. And this year, I’ve seen it on sale for $10-20 for 32 or 64 GB, so it’s a steal. Other than that, I’d go with the SanDisk Extreme, as it can be had for about the same price, or the Samsung Evo (without the +), as it can be had for even less if you just need 8 or 16 GB.

    2015 Winner: Samsung Evo+ 32 GB (purchased for $9.99 from Best Buy)

    He then goes on to explain the rationale for each test, and more.

    So there you have it. What’s next on my Christmas List ;-)

  4. E.M.Smith says:

    I got two of those Samsung SD cards.. put one in the PiM2 with Berryboot on it…
    and got The Rainbow Screen Of Death…

    So low volts for some reason.

    Perhaps it is sucking too much power too close to the SOC? Who knows…

    I’ve also thought maybe the power supply with the kit was marginal anyway, and might explain the “4 Cores Loaded and Die” I’ve been seeing too.

    I’ll use it for other things, so not a loss. I’ll also be researching bigger power supplies.
