One of my interests is maximal system performance per $ and per Watt. This comes directly out of my time running a Cray Supercomputer site. It is a constant calculation you do. By definition, a supercomputer is at the limit of available computes. (One common definition in the industry limits it to the top few percent of performance at any one time, thus the competition to be on the Top 500 List.)
So when people talk about making a ‘supercomputer’ out of a stack of a few dozen Raspberry Pi boards they are being extraordinarily economical with the truth… in fact they are making a very very small ‘cluster computer’.
In supercomputing circles (the real ones…) it is very common to worry about heat load and power consumption. At the limit, they drive ‘what is possible’ as much as software design and methods do. It isn’t much good to build a supercomputer that melts if you run it at full speed, or that costs more to build and run than the answers are worth.
Energy usage and heat management
See also: Computer cooling and Green 500
A typical supercomputer consumes large amounts of electrical power, almost all of which is converted into heat, requiring cooling. For example, Tianhe-1A consumes 4.04 megawatts (MW) of electricity. The cost to power and cool the system can be significant, e.g. 4 MW at $0.10/kWh is $400 an hour or about $3.5 million per year.
The energy efficiency of computer systems is generally measured in terms of “FLOPS per watt”. In 2008, IBM’s Roadrunner operated at 376 MFLOPS/W. In November 2010, the Blue Gene/Q reached 1,684 MFLOPS/W. In June 2011 the top 2 spots on the Green 500 list were occupied by Blue Gene machines in New York (one achieving 2097 MFLOPS/W) with the DEGIMA cluster in Nagasaki placing third with 1375 MFLOPS/W.
Because copper wires can transfer energy into a supercomputer with much higher power densities than forced air or circulating refrigerants can remove waste heat, the ability of the cooling systems to remove waste heat is a limiting factor. As of 2015, many existing supercomputers have more infrastructure capacity than the actual peak demand of the machine – designers generally conservatively design the power and cooling infrastructure to handle more than the theoretical peak electrical power consumed by the supercomputer. Designs for future supercomputers are power-limited – the thermal design power of the supercomputer as a whole, the amount that the power and cooling infrastructure can handle, is somewhat more than the expected normal power consumption, but less than the theoretical peak power consumption of the electronic hardware.
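To make those numbers concrete: the power bill and the efficiency metric are both one-liners. A back-of-the-envelope sketch in shell (the $0.10/kWh and 4.04 MW figures come from the quote above; the 2.566 petaflops Linpack number for Tianhe-1A is my assumption of the commonly cited figure):

awk 'BEGIN {
  # Power bill: 4 MW at $0.10/kWh
  per_hour = 4000 * 0.10                # 4 MW = 4000 kW, so $400/hour
  per_year = per_hour * 24 * 365        # ~ $3.5 million/year
  printf "$%.0f/hour, $%.2f million/year\n", per_hour, per_year / 1e6
  # Efficiency: FLOPS per Watt, expressed in MFLOPS/W
  flops = 2.566e15                      # Tianhe-1A Linpack (assumed figure)
  watts = 4.04e6                        # 4.04 MW from the quote
  printf "%.0f MFLOPS/W\n", (flops / 1e6) / watts   # ~635 MFLOPS/W
}'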
So that’s why I keep looking at heat and cooling issues. And computes / Watt.
Because it matters. Even in my tiny little miniature ARM based cluster.
Sidebar: Aren’t all Supercomputers made with Intel or large custom chips?
Well, despite the Intel advertising, no. It changes over time. There’s a nice graph in that supercomputer wiki showing the changes.
Whoever wins the $/compute and computes/Watt race rises to dominance over about 3 years (that’s the lifespan of a supercomputer as it goes from first in class to out of the race for $/compute and computes/Watt…). Interesting to note, Cray is making a supercomputer out of high end ARM chips…
Cray to Deliver ARM-Powered Supercomputer to UK Consortium
Michael Feldman | January 18, 2017 04:00 CET
Cray is going to build what looks to be the world’s first ARM-based supercomputer. The system, known as “Isambard,” will be the basis of a new UK-based HPC service that will offer the machine as a platform to support scientific research and to evaluate ARM technologies for high performance computing. Installation of Isambard is scheduled to begin in March, with the system up and running before the end of the year.
Prof Simon McIntosh-Smith, leader of the project and Professor of High Performance Computing at the University of Bristol, made a presentation about the upcoming system at the Mont-Blanc ARM event taking place at the Barcelona Supercomputing Centre (BSC) this week. “I think this is really exciting for a number of reasons,” McIntosh-Smith told TOP500 News. “It’s one of, if not the first serious, large(ish)-scale ARMv8 64-bit production machines. And it’s the first time Cray has explicitly announced an ARMv8 product meant for more than just prototyping.”
Product or not, Isambard looks to be a formidable machine – probably on the order of tens of teraflops. Isambard will include over 10,000 64-bit ARMv8 cores, in addition to a smattering of x86 CPUs, Intel Knights Landing Xeon Phi processors, and NVIDIA P100 GPUs. The project’s rationale for this architectural diversity is to compare application performance across a range of processors on the same machine. From Cray’s perspective, such diversity fits neatly into its vision of a heterogeneous computing future. “Scientists have a growing choice of potential computer architectures to choose from, including new 64-bit ARM CPUs, graphics processors, and manycore CPUs from Intel,” said McIntosh-Smith.
So there are several things to note here. First, you can see why I’m looking at ARM chips instead of Intel. We are near an edge… Second, it’s clear from the last paragraph that “heterogeneous computing” is here and now. Even Cray is mixing architectures in a system box. (Or more accurately, boxes.) Finally, you can also see when your Pi Pile starts to reach real supercomputer scale… at about 10,000 cores, or 2,500 boards… so clear out the garage, get a bigger A/C, and remember you will need something with Gigabit Ethernet, not the Raspberry Pi’s 100 Mb, and a Giant Switch for those 2,500 boards to talk through…
Clearly I’m NOT making a supercomputer out of Pis… Just a cluster with about the performance of a 1980 supercomputer, matching what was in use when the GISS models were first written…
Oh, and there is hope that, in the future, the workload of any “porting” effort to get climate models onto the ARM chips will shrink a lot:
The UK’s Met Office is also a partner in the effort, since they want to evaluate Isambard’s ability to run its own weather and climate simulations. The rationale here is to see if these compute-heavy workloads can be supported on a more energy-efficient platform. These workloads are currently being run on their in-house 8-teraflop (peak) Cray XC40 supercomputer powered by x86-based Intel CPUs, specifically the 18-core Xeon E5-2695 v4 processors.
So it isn’t all that silly to be looking at ARM chips and all when the “models” run on Intel… because it is “Intel for now…”. But clearly my runs won’t finish in minutes or hours like theirs do. Mine will take days, weeks, or months. And be at reduced granularity. (Later I can look into that 2nd “garage” and the order for the other 2,496 Pi boards ;-)
Looking at Nanos
But those considerations still apply, even to my little cluster system. How many $/compute? How many computes/Watt? How much cooling and what interconnect speeds?
So to explore the low end of $/compute I looked at the Orange Pi One. I really wanted to look at the Orange Pi Zero for $9, but they were sold out… the One is essentially the same system, plus some added I/O bits and a larger memory option, so a reasonable test case. The result of that “look” was to discover that the Orange Pi One is grossly speed limited by a lousy heat removal system (i.e. none… no heat sink, and the tiny little CPU board gets way too hot trying to act as one, so the cores get their MHz downgraded to sloth as needed) and that the build quality leaves much to be desired (in particular, the HDMI system is sucky).
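If you want to watch that downgrade happen on your own board, the stock Linux sysfs knobs are enough. A minimal sketch (the cpufreq and thermal paths are the usual ones, but treat them as assumptions; they vary a bit by kernel and board, and some report temperature in whole degrees rather than millidegrees):

# Watch CPU clock and SoC temperature every 5 seconds while a load runs
while true; do
  freq=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq)  # kHz
  temp=$(cat /sys/class/thermal/thermal_zone0/temp)                  # usually millidegrees C
  echo "$(date +%T)  $(( freq / 1000 )) MHz  $(( temp / 1000 )) C"
  sleep 5
done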
So if Cheap / Compute is out, why not just buy another Raspberry Pi Model 3?
Well, that was “the plan”. Or more accurately, the default. I’ve still got one more slot in the Dogbone Case and figured I’d just buy another PiM3 and “move on”. BUT, as is my way, first you do a rapid cross check on the nearest competition to see who has good $/compute and decent computes/Watt and what thermal management is like. This was facilitated by a nice little comparison page at the DietPi site which quickly confirmed that the “tiny board” systems had consistent heat management issues. It also showed the Odroid-C2 was pretty good across the board.
Further looking about found this interesting discussion page:
That was when I was thinking maybe a NanoPi would work better than the Orange Pi while still being incredibly small (and before finding the DietPi page with heat data…).
The discussion is long, but very interesting. Particularly where they look at heat limits to performance. See, the dinky boards are marketed to the IoT folks for their Internet Of Things (or Idiot Of Things, IMHO) use. Those folks do NOT expect continuous operation. They want a short fast spike of performance, then back to idle, and don’t care about heat load that much.
Posted 07 July 2016 – 06:46 PM
I guess the hardware has some similarities with the NanoPi M1. Armbian has a working image for my M1, so maybe, if you are lucky :)
The board is very small, which is very nice, but the huge issue is the crap power design feeding the H3 SoC (the same as the M1).
Either you need a very big heatsink OR you downclock the CPU as much as possible to limit overheating.
You get for what you pay :)
edit: currently my M1 with a copper heatsink, idling @ 240MHz has a temperature of 57°C
Posted 08 July 2016 – 08:41 AM
wildcat_paris, on 08 Jul 2016 – 07:27 AM, said:
so NanoPi M1 is crap and probably NanoPI Neo as well. (“you get for what you paid”)
Using a H3 SoC and using it at 50% because of low cost voltage management is madness, isn’t it?
Sorry, but we already know that it’s not the voltage regulator that makes the real difference regarding overheating (OPi One/Lite use exactly the same as NanoPi M1/NEO). Xunlong seems to use copper layers inside the PCB on Orange Pis to improve heat dissipation away from the SoC. I read some people complain about this since other PCB components get hot too (the SD card for example). So depending on perspective, less heat dissipation through PCB design on the NanoPis can be considered both bug and feature.
Not all use cases require 100% CPU load on all cores; in fact, when trying to use a H3 device as an IoT node you might want to limit max cpufreq to something like 600 MHz (when I did some tests with the OPi PC, this setting ensured consumption never exceeded 2.5W while still being faster than most dual-core ARM SoCs when workloads are multithreaded).
Ok, we now see that the I[di]oT-targeted micro boards are made for intermittent use and the design expects to thermal limit rapidly, then cool for a long while. OK, scratch the micro-boards, and anything without a decent heat sink, from the shopping list…
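For the record, capping the clock the way that poster describes is a one-liner against the same sysfs tree (assuming the stock cpufreq interface; cpufreq-set from cpufrequtils does the same job):

# As root: cap all four cores at 600 MHz to bound heat and power draw
for cpu in /sys/devices/system/cpu/cpu[0-3]; do
  echo 600000 > "$cpu/cpufreq/scaling_max_freq"   # value is in kHz
done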
I may yet get one just to play with. They have a heat sink option where the heat sink is as big as the board. It is discussed on page three of that forum (with picture):
Posted 29 July 2016 – 08:41 AM
FriendlyArm released a NanoPI NEO specific heat sink:
I am the owner of a NanoPI NEO, using the modified NanoPI M1 Armbian image from this thread.
I am not an expert in SBC design, but this board gets very hot. The heat sink is the same size as the SBC! Does the SBC have a design problem, e.g. the power supply? Or is the CPU so powerful that you must fix the heat problem with an appropriate heat sink, or in software?
It looks like even with the heat sink, it doesn’t run cool… As I’m not that interested in a “The Fan Is Your Friend” design, my general idea is to move to more metal (larger board and larger heatsink).
Yet on page 4, with some adjusting of ‘fex’ settings, temperatures got down to acceptable levels:
Posted 06 August 2016 – 12:24 AM
this is on the nanopim1 image… I did update the nanopine.fex to the one linked to above.
If I modprobe sunxi_pwm I get a /dev/sunxi_pwm device and some sysfs interfaces but they don’t seem to match up with any documentation I can find.

root@nanopi-neo:~# ls
a.out  test.c
root@nanopi-neo:~# armbianmonitor -m
Stop monitoring using [ctrl]-[c]
Time       CPU      load %cpu %sys %usr %nice %io %irq  CPU
01:21:22: 1008MHz  0.16   0%   0%   0%   0%   0%   0%  52°C
01:21:27: 1152MHz  0.14   0%   0%   0%   0%   0%   0%  52°C
01:21:32:  240MHz  0.13   0%   0%   0%   0%   0%   0%  52°C
01:21:37:  240MHz  0.12   4%   1%   0%   0%   1%   0%  53°C
01:21:42:  240MHz  0.19   1%   1%   0%   0%   0%   0%  55°C
01:21:47: 1008MHz  0.31   1%   1%   0%   0%   0%   0%  53°C
01:21:52:  240MHz  0.28   1%   1%   0%   0%   0%   0%  53°C
01:21:58:  240MHz  0.26   4%   2%   1%   0%   0%   0%  53°C
But that is starting to look like work, and a Raspberry Pi M3 sized board with a heat sink on it isn’t that big on my desktop… (I don’t need to embed it in a coffee pot…)
FWIW, lower down page 4 is a discussion of the board picking up RF interference if no console is attached and the need to attach a resistor to one pin to fix it…
From page 5, a set of tests with heat sinks. Nice graph in the original… Note that ALL the cases thermally throttle the CPU max performance…
From left to right:
NanoPi NEO/256 w/o heatsink lying flat on a table, SoC/DRAM on the lower PCB side so no airflow possible (~480 MHz average throttling)
NanoPi NEO/256 with FriendlyARM’s own heatsink operated vertically to let convection help somehow (~840 MHz average throttling)
NanoPi NEO/256 with tkaiser’s standard H3 heatsink operated vertically (~690 MHz average throttling)
OPi Lite with tkaiser’s standard H3 heatsink operated vertically (~900 MHz average throttling)
OPi PC with tkaiser’s standard H3 heatsink operated vertically (~980 MHz average throttling)
Using FA’s own heatsink is an improvement compared to cheap heatsinks both regarding heat dissipation as well as stability (FA’s heatsink is mounted perfectly and board + heatsink are ready for heavy vibrations). But as tests with OPi Lite and OPi PC show, obviously PCB size and construction matter (copper layers inside the PCB, and the larger the size the better the heat dissipation; the Orange Pi Plus 2E for example performs better under identical conditions, most probably due to its larger PCB size).
In case you want to buy a NEO (or the NanoPi Air later; I still believe they share the form factor and heatsink) you better order FA’s heatsink too if you plan to operate the device under constant high load (which it is, IMO, not made for!). Regarding my ‘standard H3 heatsink’:
Thus my earlier statement on the Orange Pi thread about not being interested in the dinky boards anymore. It’s a heat load problem. Even with an added heat sink. FINE for very sporadic use. Pointless for “run compile for an hour” or “run model for a week”…
Further down we have:
But please don’t be surprised that performance numbers reported will be lower compared to other H3 devices. NEO uses a single bank DRAM configuration and DRAM clockspeed is way lower than on all other H3 boards. Therefore performance will be lower anyway but using cpuminer’s benchmark mode you might get the idea how different heatsink/cooling solutions ‘perform’. But to be honest: NEO is not made for performance anyway so better use the heatsink as a simple matter of precaution and forget about benchmarking this tiny board at all :)
In case anyone wants to build an HPC cluster with NEOs (weird to say the least ;) ). I prepared an archive some time ago to do reliability testing with Pine64 that contains a script to collect cpuminer benchmark numbers and feed them into an RPi-Monitor template, so the efficiency of the cooling approach in question can be measured/compared directly: […] (see the screenshot there to get the idea)
Here we see that even the memory design of the NanoPi NEO boards is compromised: single bank DRAM, clocked well below the other H3 boards.
OK, it goes on like that for many pages. The key take-away? The small (dinky) IoT targeted boards have significant thermal issues AND are designed from the ground up for intermittent use at limited performance. So “Get a bigger board” unless you have that use case.
Want a ‘distcc’ cluster that has 40 nodes and finishes a kernel build in 5 minutes so doesn’t have enough time to get hot, then sits idle for 1/2 a day? Sure, it’s fine. Want to run a model for a week? Uhhh…
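For completeness, that distcc case looks something like this from the build box (node names are purely illustrative; scale the host list and the -j count to your actual node count):

# Farm compile jobs out to cluster nodes with distcc
export DISTCC_HOSTS="node1 node2 node3 node4"   # hypothetical node names
make -j16 CC="distcc gcc"                       # roughly 4 jobs per node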
So that’s what started me looking at the bigger boards, and the Odroid family in particular. That they ship with large heat sinks tells me they saw this issue already and built for it.
I had a different benchmark page saved, but it seems to have gone walkies… so here’s this one:
It has less of the direct Odroid-C2 commentary than I’d wanted… This caught my eye, though:
Boards are likely to show similar performance in synthetic benchmarks, except ODROID-C2 which should show a significant lead. However, I could not find a benchmark for Pine A64 right now, and as we’ve seen this morning, Aarch64 improves performance significantly over Aarch32, so current benchmarks are likely to become invalid if/once Raspberry Pi 3 gets a 64-bit port. For example, Pine A64 is currently 15 times faster in the sysbench CPU benchmark (prime number computation) compared to Raspberry Pi 3, and it’s clearly not showing the true performance difference.
I’m presently running armhf on the Pi Model 3 (not arm64) so that it is binary compatible with the Pi M2 cluster members (and because the Devuan arm64 had a few more bugs on first test of it… but over time that will resolve). Especially during model runs, which use significant DOUBLE, or 64 bit, math, that difference will start to matter a lot…
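That sysbench prime test is easy enough to reproduce for an armhf vs arm64 comparison of your own. A sketch using the classic sysbench 0.4.x syntax shipped in the Debian/Devuan of that era (newer sysbench versions changed the command line):

# CPU benchmark: time to compute primes up to 20000, one thread per core
sysbench --test=cpu --cpu-max-prime=20000 --num-threads=4 run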
The bottom line is that the Odroid-C2 sure looks like better bang for the buck, and runs fast and well.
Found the other page:
Review: ODROID-C2, compared to Raspberry Pi 3 and Orange Pi Plus
March 24, 2016
tl;dr: The ODROID-C2 is a very solid competitor to the Raspberry Pi model 3 B, and is anywhere from 2-10x faster than the Pi 3, depending on the operation. The software and community support is nowhere near what you get with the Raspberry Pi, but it’s the best I’ve seen of all the Raspberry Pi clones I’ve tried.
Another primary competitor in the space is the ODROID, from Hardkernel. The original ODROID-C1 was already a decent platform, with a few more features and more RAM than the comparable Pi at the time. The ODROID-C2 was just announced in February, and for $39 (only $5 over the Pi 3 price tag) offers a few great features over the Pi 3 like:
2GHz quad core Cortex A53 processor (Pi 3 is clocked at 1.2 GHz)
Mali-450 GPU (Pi 3 has a VideoCore IV 3D GPU)
2 GB RAM (Pi 3 has 1 GB)
Gigabit Ethernet (Pi 3 has 10/100)
4K video support (Pi 3 supports HD… drivers/support are usually better for Pi though)
eMMC slot (Pi 3 doesn’t offer this option)
UHS-1 clocked microSD card slot (Pi 3 requires overclock to get this speed)
Official images for Ubuntu 16.04 MATE and Android (Pi 3 uses Raspbian, a Debian fork)
The Pi 3 added built-in Bluetooth and WiFi, which, depending on your use case, might make the price of the Pi 3 even more appealing solely based on a feature comparison.
For a desktop the WiFi and Bluetooth might matter, in a cluster as a node, not so much…
I’m not sure where he got the 2 GHz from for the CPU clock, as what I’m seeing sold is 1.5 GHz. Perhaps an overclock? Early model over-ambition?
One of the first major differences between the Pi 2/3 and the C2 is the massive heat sink that’s included with the ODROID-C2. Based on my observations with CPU temperatures on the Pi 3, the heat sink is a necessity to keep the processor cool at its fairly high 2 GHz clock. The board itself feels solid, and it feels like it was designed, assembled, and soldered slightly better than the larger Orange Pi Plus, on par with the Pi 3.
One extremely thoughtful feature is the ODROID-C2 board layout mimics the Pi B+/2/3 almost exactly; the largest components (e.g. LAN, USB, HDMI, OTG, GPIO, and even the screw holes for mounting!) are identically placed—meaning I can swap in an ODROID-C2 in most situations where I already have the proper mounts/cases for a Pi.
Here we see the heat issue addressed, and then the ‘drop in replacement’ feature physical design. For software issues it looks like you can pick an Ubuntu MATE desktop, or a more generic Debian (which implies an easy Devuan path…)
The official Ubuntu MATE environment is nice, but for better efficiency, I used the ODROBIAN Debian Jessie ARM64 image for ODROID-C2 instead. The download is only 89 MB compressed, and the expanded image is ~500 MB, making for an almost-instantaneous dd operation. There are some other images available via the community as well, but ODROBIAN seems to be the most reliable—plus it already has excellent documentation!
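That “dd operation” is the usual image write. A hedged sketch (file and device names are illustrative; check the real target with lsblk first, since a wrong of= will happily destroy a disk):

# As root: write an OS image to SD or eMMC (via a USB reader)
xz -d odrobian.img.xz                          # hypothetical file name
dd if=odrobian.img of=/dev/sdX bs=4M conv=fsync status=progress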
There follows a lot of specific benchmarks and power / compute measurements. Then the “bottom line” paragraphs:
The ODROID-C2 is a very solid competitor to the Raspberry Pi model 3 B, and is anywhere from 2-10x faster than the Pi 3, depending on the operation. Its network performance, CPU performance, and 2 GB of RAM are extremely strong selling points if you thirst for better throughput and smoother UIs when using it for desktop computing. The Mali GPU cores are fast enough to do almost anything you can do on the Raspberry Pi, and the (smaller, but dedicated) community surrounding the ODROID-C2 is quick to help if you run into issues.
The ability to easily install Android or Ubuntu MATE (or one of the community distros, like ODROBIAN) is a benefit, and instructions are readily available (more so than other clones).
So the bottom line is I bought one from their North American vendor, Ameridroid, where it cost me $42 plus some small shipping (something like $4). I added the eMMC option for $21 as that “other” evaluation said it made a big difference. This will also let me do a “how big is big” comparison of running the OS from eMMC vs SD card vs USB disk.
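When it arrives, that eMMC vs SD vs USB question is answerable with a couple of crude dd runs per medium. A minimal sketch (the mount point is an assumption; point it at each storage device in turn):

# Crude sequential write then read test; run as root for drop_caches
DIR=/mnt/target                                              # hypothetical mount point
dd if=/dev/zero of=$DIR/testfile bs=1M count=256 conv=fsync  # write speed
echo 3 > /proc/sys/vm/drop_caches                            # flush page cache first
dd if=$DIR/testfile of=/dev/null bs=1M                       # read speed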
All up, the package was $70 something. This will likely become my new Desktop Daily Driver if it performs as advertised. The 1.5 GHz clock is a significant step up from 1.2 GHz (300 MHz is 25% of 1200 MHz, so a 25% uplift just there), then the double RAM size will eliminate swap entirely for anything I’ve been doing while getting much closer to balance on “Amdahl’s Other Law”. Finally, the GigE Ethernet will be helpful “someday” when the rest of my stuff gets that speed… but for now it mostly just says the I/O subsystem can handle that speed so will be less limiting. Only real disappointment is the USB 2.0 instead of 3.0 (I have 3.0 hubs and disks already…)
So between faster OS from eMMC, double the memory, and a 25% per CPU speed boost, and with a huge heat sink to prevent CPU throttling, it is likely “worth it”. Though with the eMMC at 1/2 the board cost, it will need to prove itself to me…
This board is the same form factor as the Raspberry Pi, so fits in the same case. IFF I need any future capacity expansion as the models start to run, having the equivalent of 5 cores of PiM3 speed per board just from clock rate will be handy. Should the eMMC prove “nice but ‘feh’ for a cluster node”, the price of a stack of Odroid-C2s is about 4 x $42 = $168 plus shipping, while the Pi Model 3 is about 4 x ($39 + $5 heat sink) = $176 from Amazon. So roughly a wash on price…
Since I was going to buy another PiM3 by default for the present stack, simply moving my desktop to the stack and putting an Odroid on the desktop is essentially a wash on costs. The software support reputation for it is also fairly good.
So that’s what all I went through looking at “other boards” and why I ended up buying an Odroid-C2 as an evaluation unit.
Clearly the micro-boards don’t cut it for a personal cluster that sees any heavy use. Clearly also the $/compute and computes/Watt are better on the Odroid-C2 than the Pi M3. The only real questions are just how much trouble it will be to get Debian / Devuan working on it, and just how much faster it really is.
Oh, and I didn’t go for the higher end Odroids just because I hate fans… but that they use a fan on the next step up says they are looking closely at heat load issues.
In any real final cluster build-out with a couple of dozen boards, and a real dedicated ethernet switch connecting them, that GigE will matter a lot too. As will the higher clock rate and the double memory. Memory is used to cache disk pages so it can make up for a limited interconnect speed or slower disks, to some extent.
It ought to arrive in about a week. “Watch this space” for an update when I fire it up and see what surprises might be hiding in it…