Sometimes life gives you interesting perspectives on things.
The latest is the Raspberry Pi Zero.
They have a nice picture at the Pi link.
Today, I’m pleased to be able to announce the immediate availability of Raspberry Pi Zero, made in Wales and priced at just $5. Zero is a full-fledged member of the Raspberry Pi family, featuring:
A Broadcom BCM2835 application processor
1GHz ARM11 core (40% faster than Raspberry Pi 1)
512MB of LPDDR2 SDRAM
A micro-SD card slot
A mini-HDMI socket for 1080p60 video output
Micro-USB sockets for data and power
An unpopulated 40-pin GPIO header
Identical pinout to Model A+/B+/2B
An unpopulated composite video header
Our smallest ever form factor, at 65mm x 30mm x 5mm
A quick visit to adafruit showed it already sold out.
A quick visit to the local Barnes & Noble showed last month’s MagPi magazine, but not this new issue yet. (And at a price of $15, as they pack a specialty magazine markup on to the price.)
One more thing: because the only thing better than a $5 computer is a free computer, we are giving away a free Raspberry Pi Zero on the front of each copy of the December issue of The MagPi, which arrives in UK stores today. Russell, Rob and the team have been killing themselves putting this together, and we’re very pleased with how it’s turned out. The issue is jam-packed with everything you need to know about Zero, including a heap of project ideas, and an interview with Mike Stimson, who designed the board.
So I might end up spending $15 in a couple of weeks to get a $5 computer… and a magazine…
While wandering the bookstore, I picked up a Starbucks Venti Mocha for … $5.13 tax included.
Hope that puts it in perspective. The cost of “computes” is now less than the cost of a cup of fancy coffee. The spouse picked up a paperback Star Trek book. $6.90 or so.
How Fast Is A Pi?
One of the little “games I play” when looking at machine speeds is to compare them to machines I bought in the past. In particular, I compare against a VAX 11/780 (for many years the standard for “1 MIPS”, or 1 million instructions per second) and a Cray supercomputer. The VAX was used to run Apple.com for about 8 years, serving hundreds of terminal logins. The Cray was an X-MP/48 and cost about $40 Million. The 4 is “4 processors”, each consisting of a scalar unit and a vector unit, while the 8 is “8 megawords” of memory. As the word size was 64 bits, that’s 64 MB of memory… (but very fast memory…) Date was about 1984. Call it 30 years ago.
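The megaword arithmetic is easy to check. A minimal sketch, where the function name is just an illustration and "mega" is treated the same way on both sides of the conversion:

```python
# Memory of the Cray described above: 8 megawords of 64-bit words.
WORD_BITS = 64
MEGAWORDS = 8

def memory_megabytes(megawords, word_bits):
    """Convert a memory size in megawords to megabytes."""
    bytes_per_word = word_bits / 8  # 64 bits is 8 bytes
    return megawords * bytes_per_word

print(memory_megabytes(MEGAWORDS, WORD_BITS))  # 64.0
```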
So how does a Raspberry Pi stack up for the dollar? The benchmark results I found don’t include the Pi Zero, but they do include the Pi Model 2 (which also just happens to be a quad core machine).
In particular, one caught my eye. The use of “NEON” instructions. These let you set up Vector instructions and hand them to a special part of the processor. Almost exactly the same as the Cray making “strides” that it would hand to the vector unit. A benchmark doing just that is a nearly direct comparison to how the Cray did things and uses both scalar and vector engines. Just what I wanted to know.
SIMD is “Single Instruction Multiple Data” and it lets you do things like multiply A x B = C for some chunk of things (called a ‘stride’ on the Cray) all at once. You might load an array of A1, A2, A3…A64 and another array of B1, B2, … B64 and then just say “multiply A x B giving C” and have an array of the 64 products in C1, C2, C3…, C64 with one instruction cycle. That’s what a vector unit does. That is what your GPU or Graphics Processing Unit does. Using it for something other than video gives you a giant compute engine.
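To make the idea concrete, here is a rough sketch in plain Python. It only simulates the bookkeeping: it is not NEON code and gives no speedup. The lane count of 4 is an assumption (four 32-bit floats in one 128-bit register), and the function names are made up for illustration.

```python
LANES = 4  # assumption: four 32-bit floats fit in one 128-bit register

def scalar_multiply(a, b):
    """The ordinary way: one multiply per loop iteration."""
    c = []
    for x, y in zip(a, b):
        c.append(x * y)
    return c

def vector_multiply(a, b, lanes=LANES):
    """Handle `lanes` elements per step; each step stands in for one SIMD instruction."""
    if len(a) != len(b):
        raise ValueError("arrays must be the same length")
    c = []
    steps = 0
    for i in range(0, len(a), lanes):
        # Every lane in this chunk gets the same operation applied.
        c.extend(x * y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        steps += 1
    return c, steps

a = [float(i) for i in range(1, 65)]  # A1 .. A64
b = [2.0] * 64                        # B1 .. B64
c, steps = vector_multiply(a, b)
print(steps)  # 16 vector steps instead of 64 scalar multiplies
```

With wider vector hardware the number of steps shrinks further; the result in C is the same either way.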
The ARM® NEON™ general-purpose SIMD engine efficiently processes current and future multimedia formats, enhancing the user experience.
NEON technology can accelerate multimedia and signal processing algorithms such as video encode/decode, 2D/3D graphics, gaming, audio and speech processing, image processing, telephony, and sound synthesis by at least 3x the performance of ARMv5 and at least 2x the performance of ARMv6 SIMD.
Cleanly architected NEON technology works seamlessly with its own independent pipeline and register file.
NEON technology is a 128-bit SIMD (Single Instruction, Multiple Data) architecture extension for the ARM Cortex™-A series processors, designed to provide flexible and powerful acceleration for consumer multimedia applications, delivering a significantly enhanced user experience. It has 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide).
NEON instructions perform “Packed SIMD” processing:
Registers are considered as vectors of elements of the same data type
Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single precision floating point
Instructions perform the same operation in all lanes
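The register and data type figures in that description translate directly into lanes per instruction. A small sketch of the arithmetic, using only the numbers quoted above (the helper name is hypothetical):

```python
REGISTER_BITS = 128  # width of one NEON register in the 16-register view

def lanes(element_bits, register_bits=REGISTER_BITS):
    """How many elements of a given width fit in one register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits

for bits in (8, 16, 32, 64):
    print(f"{bits:2d}-bit elements: {lanes(bits):2d} lanes per 128-bit register")

# Both views of the register file describe the same storage.
total_bits_64_view = 32 * 64
total_bits_128_view = 16 * 128
print(total_bits_64_view == total_bits_128_view)  # True
```

So a single precision float operation works on 4 values at a time, while 8-bit integer work gets 16 at a time.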
So as long as you compile your program to use it well, you can get a huge bump in speed. (Just don’t expect it to do full motion video at the same time ;-) So back at those benchmarks…
NEON Float & Integer Benchmark – NeonSpeed
This was the first benchmark produced to measure speed using NEON instructions on ARM v7 CPUs using Android. It executes some of the code used in Memory Speed Benchmark, with additional tests recoded using NEON intrinsic functions. The benchmark and source code are included in Raspberry_Pi_Benchmarks.zip.
The compile command (for gcc 4.8) is shown below, where the -funsafe-math-optimizations option leads to the compiler generating NEON code for normal floating point statements. In this case, vfma Fused Multiply Accumulate instructions were generated, as opposed to vmla Vector Multiply Accumulate from the intrinsic functions. Then, vadd.i32 was produced for all integer tests. In this case, performance from both methods was quite similar.
An example Android results log is also provided, to show the difference where compiled NEON instructions are not provided.
[…]
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

NEON Speed Test V 1.0 Tue Mar 17 12:06:58 2015

       Vector Reading Speed in MBytes/Second
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   1914   1978   2049   2293   2341   2797 L1
      32   1897   1951   2032   2253   2310   2745
      64   1517   1543   1619   1694   1718   1915 L2
     128   1417   1435   1510   1569   1594   1791
     256   1414   1433   1499   1571   1593   1771
     512    680    578    654    600    577    604
    1024    434    403    451    414    396    409 RAM
    4096    327    328    332    324    324    330
   16384    333    334    338    345    330    337
   65536    339    336    340    172    331    338

 Max MFLOPS 479    495
Note those Mega FLOPS or Millions of Floating Point Operations per Second. That’s what we used to measure the Cray.
Over 400 for the Pi Model 2.
Our Cray was rated at about 100 MFLOPS per CPU x 4 CPUs or 400 MFLOPS…
From $40 Million to $40 and instead of using a 750 kVA power feed and water tower for cooling, it’s a 5 V 2 A supply and no fan.
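For a rough feel of the ratio, here is a back-of-envelope sketch using only the figures quoted in this post. Two loud assumptions: the 750 kVA feed and the 5 V 2 A supply are both treated as if they were actual power draw in watts (they are really upper limits), and the Cray number is a rating while the Pi number is the measured "Max MFLOPS" from the benchmark log.

```python
# Figures quoted in the text. Treating kVA as kW and a supply rating as
# actual draw are simplifying assumptions, so the power ratio is only indicative.
CRAY = {"mflops": 400.0, "dollars": 40_000_000.0, "watts": 750_000.0}
PI_2 = {"mflops": 479.0, "dollars": 40.0, "watts": 5.0 * 2.0}

def dollars_per_mflops(machine):
    return machine["dollars"] / machine["mflops"]

def watts_per_mflops(machine):
    return machine["watts"] / machine["mflops"]

cost_ratio = dollars_per_mflops(CRAY) / dollars_per_mflops(PI_2)
power_ratio = watts_per_mflops(CRAY) / watts_per_mflops(PI_2)

print(f"Cray: ${dollars_per_mflops(CRAY):,.0f} per MFLOPS")
print(f"Pi 2: ${dollars_per_mflops(PI_2):.3f} per MFLOPS")
print(f"cost ratio  about {cost_ratio:,.0f} to 1")
print(f"power ratio about {power_ratio:,.0f} to 1")
```

Under those assumptions the price per MFLOPS has dropped by a factor of over a million.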
Supplier or System / CPU and Precision / Clock MHz / MWIPS / MFLOPS / VAX MIPS / MWIPS DP / Lang / Opt / Cost $K / Intr Date
Cray
Cray 1A P4 Scalar 80 12.4 1.30 For 7000 1978
Cray 1A P4 Scalar 80 16.2 1.30 For later 7000 1978
Cray 1S P4 Scalar 80 16.1 For 1980
Cray 1S Vector 80 98.0 For 1980
X-MP1 P4 Scalar 118 30.3 11.0 34.7 2.70 For 5000 1982
X-MP1 Vector 118 313 151 175 For 5000 1982
X-MP1 90% Vector 118 162 66.4 125 For 5000 1982
Y-MP1 P4 Scalar 154 31.0 12.0 32.6 2.8 For 5000 1987
Y-MP1 Vector 154 449 195 314 For 5000 1987
Y-MP1 90% Vector 154 191 77.2 169 For 5000 1987
Cray 2/1 P4 Scalar 244 25.8 2.0 For 1984
Cray 2/1 Vector 244 425 For 1984
Now back at the Pi Zero. It isn’t a quad core, but it doesn’t cost $40 either. Four of them is $20. More I/O Bandwidth too.
So where has all that potential compute performance gone?
Largely into “code bloat” for Microsoft products; some, but not nearly as much, for Linux. For Linux and especially Apple, it has gone into “Eye Candy”. Dancing Java Craplettes. Animation and shading, transparency and reflection effects. Fades and zooms.
Shut off the eye candy and run some clean FORTRAN in a terminal window, compiled with careful NEON selections, and you too can have a Super Computer.
But we don’t. We do sloppy bad compilation with compilers that generally have no idea what to do to optimize strides for a vector unit or how to use one. We use “high level Object Oriented Languages” that load 20 MB library routines just to change one line. We have a load of Eye Candy that we ignore. We run JAVA in a virtual machine (the Java Machine) that may itself be running in a virtual machine.
In short, we waste it all on fluff.
Then again, at $40, maybe it is OK to waste some of it…
Whenever they are no longer back ordered, I’m going to get a Pi Zero, or maybe a few of them. Some will become ‘dedicated servers’ doing stupid little tasks: DNS Servers, File Servers, PXE boot servers. Who knows what all else. The Wi-Fi dongle to let them talk to the network will cost me about 2 x as much as the Pi Zero itself. About 5 x as much if I get a USB hub for it to talk to both a network and a disk. I’d love to make a Beowulf Cluster out of a dozen of them, but the limited networking will be an issue. Perhaps they can be taught to all share one USB hub as the Beowulf network and add a Pi Model B as the node that talks to the ethernet.
If I could get just one thing from the Raspberry Pi folks as a ‘next board’ it would be one designed to be a compute node in a Beowulf cluster with Gigabit Ethernet built in. I’d likely buy a dozen at $10 each. Maybe two dozen…
In all cases, just the time spent un-boxing it will cost more than it does. Computes are becoming functionally free. That has all sorts of implications…
As of now, the compute speed of these guys is pushing up against the memory and I/O speeds. I find that it’s almost impossible to load up all 4 cores of my Pi Model 2 without extra careful selection of multiple tasks. Even then, they seem to interfere with each other a bit too much. Running two Golomb Ruler searches in 2 cores makes editing a blog page in the third a bit of a pain, so the multi-core design of the SOC isn’t as good as it could be… 4 x Pi Zero in a cluster would do better. Similarly, we saw that the parallel FORTRAN used more cores but gave no faster results.
What parallel removes from User CPU, it is adding to System CPU. The necessary conclusion is that on this version of the Raspbian OS, parallelizing FORTRAN is a losing proposition.
The conclusion from this is that having a few of these running ‘headless’ with single cores is as effective as, or maybe more effective than, having a multicore chip and trying to keep it busy with parallel code.
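If you want to watch the User versus System CPU split for a task of your own, here is a minimal sketch. It is not how the FORTRAN numbers above were measured; it just uses Python’s standard `os.times()` call, and the helper names and the toy workload are made up for illustration. It only counts CPU time used by the current process, not by any child processes.

```python
import os

def cpu_split(work):
    """Run work() and return (user_seconds, system_seconds) it consumed."""
    before = os.times()
    work()
    after = os.times()
    user = after.user - before.user        # time spent in our own code
    system = after.system - before.system  # time the kernel spent on our behalf
    return user, system

def busy_loop(n=200_000):
    """Toy compute-only workload: should be nearly all user time."""
    total = 0
    for i in range(n):
        total += i * i
    return total

user, system = cpu_split(busy_loop)
print(f"user {user:.2f}s  system {system:.2f}s")
```

A task whose system time grows as its user time shrinks is showing the same pattern described in the quote.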
But then you start to have the “Balance Of System” costs being about $20 for a $5 computer…
$5 SD card, $5 power supply, $10 WiFi dongle.
Which clearly points at what needs cost attention next.
Oh, and a case at $5…
That’s my Daily coffee budget for almost a week! Just for one lousy supercomputer worth of computes ;-) /sarc;
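The tally, for anyone who wants to check my arithmetic. A quick sketch using the prices mentioned in this post; your local prices will differ.

```python
# Prices in dollars as quoted in this post.
PI_ZERO = 5.00
COFFEE = 5.13  # one Venti Mocha, tax included
balance_of_system = {
    "SD card": 5.00,
    "power supply": 5.00,
    "WiFi dongle": 10.00,
    "case": 5.00,
}

balance = sum(balance_of_system.values())
total = PI_ZERO + balance

print(f"balance of system: ${balance:.2f}")
print(f"whole computer:    ${total:.2f}")
print(f"coffee days, balance only: {balance / COFFEE:.1f}")
print(f"coffee days, whole system: {total / COFFEE:.1f}")
```

Either way it comes in under a week of coffee, and the $5 board is the cheapest line item once you add it all up.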
Perspective, use it or lose it ;-)