Perspective on Raspberry Pi Benchmarks

Sometimes life gives you interesting perspectives on things.

The latest is the Raspberry Pi Zero.

They have a nice picture at the Pi link.

Today, I’m pleased to be able to announce the immediate availability of Raspberry Pi Zero, made in Wales and priced at just $5. Zero is a full-fledged member of the Raspberry Pi family, featuring:

    A Broadcom BCM2835 application processor
        1GHz ARM11 core (40% faster than Raspberry Pi 1)
    512MB of LPDDR2 SDRAM
    A micro-SD card slot
    A mini-HDMI socket for 1080p60 video output
    Micro-USB sockets for data and power
    An unpopulated 40-pin GPIO header
        Identical pinout to Model A+/B+/2B
    An unpopulated composite video header
    Our smallest ever form factor, at 65mm x 30mm x 5mm

A quick visit to Adafruit showed it already sold out.

https://www.adafruit.com/products/2885

A quick visit to the local Barnes & Noble showed last month’s MagPi magazine, but not this new issue yet. (And at a price of $15, as they pile it on for a specialty magazine.)

One more thing: because the only thing better than a $5 computer is a free computer, we are giving away a free Raspberry Pi Zero on the front of each copy of the December issue of The MagPi, which arrives in UK stores today. Russell, Rob and the team have been killing themselves putting this together, and we’re very pleased with how it’s turned out. The issue is jam-packed with everything you need to know about Zero, including a heap of project ideas, and an interview with Mike Stimson, who designed the board.

So I might end up spending $15 in a couple of weeks to get a $5 computer… and a magazine…

While wandering the bookstore, I picked up a Starbucks Venti Mocha for … $5.13 tax included.

Hope that puts it in perspective. The cost of “computes” is now less than the cost of a cup of fancy coffee. The spouse picked up a paperback Star Trek book. $6.90 or so.

How Fast Is A Pi?

One of the little “games I play” when looking at machine speeds is to compare them to machines I bought in the past. In particular, a VAX 11/780 (that was the standard for “1 MIPS” or 1 million instructions per second for many years) and a Cray supercomputer. The VAX was used to run Apple.com for about 8 years, serving hundreds of terminal logins. The Cray was an XMP-48 and cost about $40 Million. The 4 is “4 processors”, each consisting of a scalar unit and a vector unit, while the 8 is “8 megawords” of memory. As the word size was 64 bits, that’s 64 MB of memory… (but very fast memory…) Date was about 1984. Call it 30 years ago.
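The memory arithmetic is easy to check. A minimal sketch, assuming a “megaword” means 2^20 words (the variable names are just for illustration):

```python
# 8 megawords of 64-bit words, converted to megabytes.
# Assumption: "mega" here means 2**20, as is usual for memory sizes.
WORDS = 8 * 2**20
BITS_PER_WORD = 64

total_bytes = WORDS * BITS_PER_WORD // 8
total_mb = total_bytes // 2**20

print(total_mb, "MB")
```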

So how does a Raspberry Pi stack up for the dollar? The benchmark page below doesn’t have the Pi Zero, but it does have the Pi Model 2 (which also just happens to be a quad core machine).

Benchmarks:

http://www.roylongbottom.org.uk/Raspberry%20Pi%20Benchmarks.htm

In particular, one caught my eye: the use of “NEON” instructions. These let you set up vector instructions and hand them to a special part of the processor, almost exactly the same as the Cray making “strides” that it would hand to the vector unit. A benchmark doing just that is a nearly direct comparison to how the Cray did things, and it uses both scalar and vector engines. Just what I wanted to know.

http://www.arm.com/products/processors/technologies/neon.php

SIMD is “Single Instruction Multiple Data” and it lets you do things like multiply A x B = C for some chunk of things (called a ‘stride’ on the Cray) all at once. You might load an array of A1, A2, A3…A64 and another array of B1, B2, … B64 and then just say “multiply A x B giving C” and have an array of the 64 products in C1, C2, C3…, C64 with one instruction. That’s what a vector unit does. That is what your GPU or Graphics Processing Unit does. Using it for something other than video gives you a giant compute engine.
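To make the idea concrete, here is a rough sketch in plain Python. It only simulates the bookkeeping: no real vector hardware is involved, the function names are made up for illustration, and the 64-element chunk size simply follows the A1…A64 example above. The point is the count of “instructions” issued, not the speed.

```python
def scalar_multiply(a, b):
    """Multiply element by element: one multiply 'instruction' per pair."""
    c = []
    instructions = 0
    for x, y in zip(a, b):
        c.append(x * y)
        instructions += 1
    return c, instructions


def vector_multiply(a, b, lanes=64):
    """Simulate a vector unit: each pass handles up to `lanes` pairs
    and is counted as a single 'instruction'."""
    c = []
    instructions = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        c.extend(x * y for x, y in zip(chunk_a, chunk_b))
        instructions += 1
    return c, instructions


a = list(range(1, 65))   # A1..A64
b = [2] * 64             # B1..B64

scalar_c, scalar_count = scalar_multiply(a, b)
vector_c, vector_count = vector_multiply(a, b)

print("same products:", scalar_c == vector_c)
print("scalar instructions:", scalar_count)
print("vector instructions:", vector_count)
```

Both paths give the same 64 products; the difference is that the simulated vector unit is told what to do once rather than 64 times.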

NEON

The ARM® NEON™ general-purpose SIMD engine efficiently processes current and future multimedia formats, enhancing the user experience.

NEON technology can accelerate multimedia and signal processing algorithms such as video encode/decode, 2D/3D graphics, gaming, audio and speech processing, image processing, telephony, and sound synthesis by at least 3x the performance of ARMv5 and at least 2x the performance of ARMv6 SIMD.

Cleanly architected NEON technology works seamlessly with its own independent pipeline and register file.

NEON technology is a 128-bit SIMD (Single Instruction, Multiple Data) architecture extension for the ARM Cortex™-A series processors, designed to provide flexible and powerful acceleration for consumer multimedia applications, delivering a significantly enhanced user experience. It has 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide).

NEON instructions perform “Packed SIMD” processing:

Registers are considered as vectors of elements of the same data type
Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single precision floating point
Instructions perform the same operation in all lanes
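As a small worked example of what “packed” means, the number of lanes in one 128-bit register is just the register width divided by the element width. A sketch, with type labels chosen here for illustration (they are not ARM’s names):

```python
# Lanes per 128-bit NEON register for each element width listed above.
# The dictionary keys are illustrative labels, not official type names.
REGISTER_BITS = 128

ELEMENT_BITS = {
    "8-bit integer": 8,
    "16-bit integer": 16,
    "32-bit integer": 32,
    "64-bit integer": 64,
    "single precision float": 32,
}

lanes = {name: REGISTER_BITS // bits for name, bits in ELEMENT_BITS.items()}

for name, count in lanes.items():
    print(f"{name}: {count} lanes")
```

So a single instruction touches 4 floats at a time, or 16 bytes at a time, which is where the speedup on multimedia work comes from.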

So as long as you compile your program to use it well, you can get a huge bump in speed. (Just don’t expect it to do full motion video at the same time ;-) So back at those benchmarks…

http://www.roylongbottom.org.uk/Raspberry%20Pi%20Benchmarks.htm#anchor24a

NEON Float & Integer Benchmark – NeonSpeed

This was the first benchmark produced to measure speed using NEON instructions on ARM v7 CPUs using Android. It executes some of the code used in Memory Speed Benchmark, with additional tests recoded using NEON intrinsic functions. The benchmark and source code are included in Raspberry_Pi_Benchmarks.zip.

The compile command (for gcc 4.8) is shown below, where the -funsafe-math-optimizations option leads to the compiler generating NEON code for normal floating point statements. In this case, vfma Fused Multiply Accumulate instructions were generated, as opposed to vmla Vector Multiply Accumulate from the intrinsic functions. Then, vadd.i32 was produced for all integer tests. In this case, performance from both methods was quite similar.

An example Android results log is also provided, to show the difference where compiled NEON instructions are not provided.
[…]

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

  NEON Speed Test V 1.0 Tue Mar 17 12:06:58 2015

       Vector Reading Speed in MBytes/Second
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   1914   1978   2049   2293   2341   2797 L1
      32   1897   1951   2032   2253   2310   2745
      64   1517   1543   1619   1694   1718   1915 L2
     128   1417   1435   1510   1569   1594   1791
     256   1414   1433   1499   1571   1593   1771
     512    680    578    654    600    577    604
    1024    434    403    451    414    396    409 RAM
    4096    327    328    332    324    324    330
   16384    333    334    338    345    330    337
   65536    339    336    340    172    331    338

Max MFLOPS  479    495
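One way to read the Max MFLOPS line, offered as an observation from the numbers rather than something stated in the excerpt: the two figures line up with the best Float MBytes/Second results (the 16 KB row) divided by 4. That fits an assumption of one floating point operation per 4-byte float read, which is a guess about how the benchmark counts.

```python
# Compare the reported Max MFLOPS with (best MB/s) / (bytes per float).
# Assumption: one floating point operation per 4-byte float read.
BYTES_PER_FLOAT = 4

best_mbytes_per_sec = {"Norm": 1914, "Neon": 1978}   # 16 KB row, Float columns
reported_mflops = {"Norm": 479, "Neon": 495}          # Max MFLOPS line

estimated_mflops = {
    kind: rate / BYTES_PER_FLOAT for kind, rate in best_mbytes_per_sec.items()
}

for kind in best_mbytes_per_sec:
    print(kind, "estimated:", estimated_mflops[kind],
          "reported:", reported_mflops[kind])
```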

Note those Mega FLOPS or Millions of Floating Point Operations per Second. That’s what we used to measure the Cray.

Over 400 for the Pi Model 2.

Our Cray was rated at about 100 MFLOPS per CPU x 4 CPUs or 400 MFLOPS…

From $40 Million to $40 and instead of using a 750 kVA power feed and water tower for cooling, it’s a 5 V 2 A supply and no fan.
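For a rough sense of scale, here is the same comparison as arithmetic. It is a sketch using only the figures quoted above, with two loud assumptions: the 750 kVA feed is treated as 750 kW, and the Pi is charged the full rating of its 5 V 2 A supply.

```python
# Dollars per MFLOPS and power draw, from the figures quoted in the post.
# Assumptions: kVA treated as kW; the Pi draws its full supply rating.
cray = {"cost_usd": 40_000_000, "mflops": 4 * 100, "power_w": 750_000}
pi2 = {"cost_usd": 40, "mflops": 479, "power_w": 5 * 2}


def dollars_per_mflops(machine):
    return machine["cost_usd"] / machine["mflops"]


cray_rate = dollars_per_mflops(cray)
pi2_rate = dollars_per_mflops(pi2)
power_ratio = cray["power_w"] // pi2["power_w"]

print(f"Cray: ${cray_rate:,.0f} per MFLOPS")
print(f"Pi 2: ${pi2_rate:.3f} per MFLOPS")
print(f"Price per MFLOPS ratio: about {cray_rate / pi2_rate:,.0f} to 1")
print(f"Power ratio: {power_ratio:,} to 1")
```

On those assumptions the price per MFLOPS has dropped by a factor of roughly a million, and the power by a factor of 75,000.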

Times change…

http://www.roylongbottom.org.uk/whetstone.htm

Supplier or     CPU and      Clock                    VAX    MWIPS             Cost   Intr 
System          Precision     MHz    MWIPS   MFLOPS   MIPS     DP   Lang  Opt   $K    Date 

Cray                                                                                       
Cray 1A         P4 Scalar      80     12.4                    1.30   For       7000   1978 
Cray 1A         P4 Scalar      80     16.2                    1.30   For later 7000   1978 
Cray 1S         P4 Scalar      80     16.1                           For              1980 
Cray 1S         Vector         80     98.0                           For              1980 
X-MP1           P4 Scalar     118     30.3    11.0    34.7    2.70   For       5000   1982 
X-MP1           Vector        118     313     151     175            For       5000   1982 
X-MP1           90% Vector    118     162     66.4    125            For       5000   1982 
Y-MP1           P4 Scalar     154     31.0    12.0    32.6    2.8    For       5000   1987 
Y-MP1           Vector        154     449     195     314            For       5000   1987 
Y-MP1           90% Vector    154     191     77.2    169            For       5000   1987 
Cray 2/1        P4 Scalar     244     25.8                    2.0    For              1984 
Cray 2/1        Vector        244     425                            For              1984 

Now back at the Pi Zero. It isn’t a quad core, but it doesn’t cost $40 either. Four of them cost $20. More I/O Bandwidth too.

So where has all that potential compute performance gone?

Largely into “code bloat” for Microsoft products; some, but not nearly as much, for Linux. For Linux and especially Apple, it has gone into “Eye Candy”. Dancing Java Craplettes. Animation and shading, transparency and reflection effects. Fades and zooms.

Shut off the eye candy and run some clean FORTRAN in a terminal window, compiled with careful NEON selections, and you too can have a Super Computer.

But we don’t. We do sloppy bad compilation with compilers that generally have no idea what to do to optimize strides for a vector unit or how to use one. We use “high level Object Oriented Languages” that load 20 MB library routines just to change one line. We have a load of Eye Candy that we ignore. We run JAVA in a virtual machine (the Java Machine) that may itself be running in a virtual machine.

In short, we waste it all on fluff.

Then again, at $40, maybe it is OK to waste some of it…

In Conclusion

Whenever they are no longer back ordered, I’m going to get a Pi Zero, or maybe a few of them. Some will become ‘dedicated servers’ doing stupid little tasks: DNS Servers, File Servers, PXE boot servers. Who knows what all else. It will cost me about 2 x as much for the Wi-Fi dongle to let them talk to the network as it costs for the Pi Zero. About 5 x as much if I get a USB hub for it to talk to both a network and a disk. I’d love to make a Beowulf Cluster out of a dozen of them, but the limited networking will be an issue. Perhaps they can be taught to all share one USB hub as the Beowulf network and add a Pi Model B as the node that talks to the ethernet.

If I could get just one thing from the Raspberry Pi folks as a ‘next board’ it would be one designed to be a compute node in a Beowulf cluster with GB Ethernet built in. I’d likely buy a dozen at $10 each. Maybe two dozen…

Who knows.

In all cases, just the time spent un-boxing it will cost more than it does. Computes are becoming functionally free. That has all sorts of implications…

As of now, the compute speed of these guys is pushing up against the memory and I/O speeds. I find that it’s almost impossible to load up all 4 cores of my Pi Model 2 without extra careful selection of multiple tasks. Even then, they seem to interfere with each other a bit too much. Running two Golomb Ruler searches in 2 cores makes editing a blog page in the third a bit of a pain, so the multi-core design of the SOC isn’t as good as it could be… 4 x Pi Zero in a cluster would do better. Similarly, we saw that the parallel FORTRAN used more cores but gave no faster results.

https://chiefio.wordpress.com/2015/08/16/an-amusing-parallel-pi-fortran-experiment/

What parallel removes from User CPU, it is adding to System CPU. The necessary conclusion is that on this version of the Raspbian OS, parallelizing FORTRAN is a losing proposition.

The conclusion from this being that having a few of these running ‘headless’ with single cores is as effective or maybe more so than having a multicore chip and trying to keep it busy with parallel code.
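For anyone wanting to look at the User vs System CPU split themselves, here is a minimal sketch in Python rather than FORTRAN. It is not the method used in the linked experiment; `cpu_split` and `busy` are hypothetical helper names, and the workload is just a stand-in loop.

```python
import os


def cpu_split(work):
    """Run `work` and return (user seconds, system seconds) it consumed
    in this process, as reported by os.times()."""
    before = os.times()
    work()
    after = os.times()
    return after.user - before.user, after.system - before.system


def busy():
    """Stand-in workload: pure computation, so mostly User CPU."""
    total = 0
    for i in range(200_000):
        total += i * i
    return total


user_seconds, system_seconds = cpu_split(busy)
print(f"user: {user_seconds:.3f} s  system: {system_seconds:.3f} s")
```

If the System number climbs as fast as the User number falls when a job is spread over more cores, the parallel overhead is eating the gain.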

But then you start to have the “Balance Of System” costs being about $20 for a $5 computer…

$5 SD card, $5 power supply, $10 WiFi dongle.

Which clearly points at what needs cost attention next.
Oh, and a case at $5…
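Adding it up, using only the prices mentioned in this post (a back-of-envelope sketch, nothing more):

```python
# Balance-of-system tally for one headless Pi Zero, prices from the post.
parts_usd = {
    "Pi Zero": 5,
    "SD card": 5,
    "power supply": 5,
    "WiFi dongle": 10,
    "case": 5,
}
COFFEE_USD = 5.13  # the mocha, tax included

total = sum(parts_usd.values())
balance_of_system = total - parts_usd["Pi Zero"]
coffee_days = balance_of_system / COFFEE_USD

print(f"total: ${total}")
print(f"balance of system: ${balance_of_system}")
print(f"coffee days for the extras: {coffee_days:.1f}")
```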

That’s my daily coffee budget for almost a week! Just for one lousy supercomputer worth of computes ;-) /sarc;

Perspective, use it or lose it ;-)


About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present "hot buttons" are the mythology of Climate Change and ancient metrology; but things change...
This entry was posted in Tech Bits.

5 Responses to Perspective on Raspberry Pi Benchmarks

  1. Andysaurus says:

    I started out driving a mainframe with 4 tape decks, a card reader and a paper tape reader. It had 8k (24 bit) words. Took eight of us to drive it 24X5 – and an on-site engineer to fix it when it broke!

  2. Larry Ledwick says:

    I had the same conclusion you had before I reached the end of the article: an optimized Pi-0 setup as a node in a Beowulf cluster and an RPi-2 optimized as a controller node etc. would be a huge breakthrough for universities and other groups who wanted to build a massively parallel system like a Beowulf cluster.

    In large volume production they would cost just slightly more than an equivalent weight of sand.

    Design them with a standard interface to plug into a mother board which could provide the network connectivity for 50 -100 nodes, and you have a true super computer for less than a good dinner at a fine restaurant.

    Your observation about compute cycles approaching being zero cost has mind boggling implications for all sorts of applications like AI and robotics which are compute limited if someone made the jump to develop an optimized functional building block module.
    The next step is to change the form factor to a LSIC chip connected to an interface plug which handles gigabit ethernet, power and 2x or 4x usb all in one physical plug and you have a functional module that would become as ubiquitous as flash drives and small usb applications or gps chips.

  3. p.g.sharrow says:

    @Larry; you are describing the “beer can” computer that I envisioned 40 years ago and Mr. Smith with the Raspi has now demonstrated. Next a modular interface system needs to be addressed, hardware and software. In the long run the form factor of the Pi is not ideal, too many wires. A single plug-in for power and interface with additional for special inputs would be better. I would think, broadband addressing might solve the IO talk between devices.

    Discussion on the net indicates to me that it might not be too expensive to get a purpose built board created and produced…pg.

  4. Chuckles says:

    E.M., If you haven’t seen them before, some useful Micro SD benchmark numbers, which might allow a bit of ratio tweaking –

    http://www.midwesternmac.com/blogs/jeff-geerling/raspberry-pi-microsd-card

  5. E.M.Smith says:

    Others have seen the same “issue” with the Zero that has me “less than thrilled” about buying 100 of them and making a cluster. The poor availability of “network at a price”. Yes, you CAN put a USB ethernet on the device, via a long list of parts costing more than the device… Micro-USB to USB adapter, then USB to WiFi Dongle or USB to ethernet adapter… and, should you want a disk, adding a USB Hub at about 5 x the price of the Pi Zero.

    I found this the most interesting solution:

    http://hackaday.com/2015/11/28/first-raspberry-pi-zero-hack-piggy-back-wifi/

    Skip the micro-to-USB adapter and solder a wifi dongle guts directly to the R.Pi.

    Nice.

    Now what I’d REALLY like is that WiFi chip already in place, OR an ethernet connector. The use of a simple ethernet connector would let me buy about 32 of these and make a very nice Beowulf Cluster for not very much money at all. (Connecting 32 of them with WiFi is an amusing thought, but the network bandwidth limit of one Access Point Router would restrict it to doing highly compute intensive / low communications tasks).

    We’ll see if over time a better solution comes along.

    At the moment, I’m Oh So Slowly looking at various SOM / COM / SBC (System on Module, Computer on Module – that’s a different name for the same thing, and Single Board Computer) options trying to find something that is a Dirt Cheap Dinky board with Ethernet. I don’t need a load of other ports, or break out of GPIO headers, or WiFi, or a hundred and one other things. All it takes to make a Beowulf Compute Node is a System On Chip soldered to a tiny PC board with an Ethernet connector.

    Unfortunately, nobody seems to make that (that I can find). Lots of “everything and the kitchen sink” boards (especially in PC/104) for $$$ (and no, I don’t need a serial port on it, or a set of 4 USB + USB On The Go, along with Sound, Video, etc. etc….) I did see the concept in play as a Gumstix Stagecoach board: http://electronicdesign.com/boards/cluster-gumstix

    But at $230 for the sow board, then about $150 EACH for the Gumstix piglets (with old slow SOC at that) it is just a silly pricy thing.

    It ought to be possible to make a Pi Zero analog WITH GigE for about $10 to $15 (and drop the GPIO et al.) and a board to gang them together (with Ethernet Switch on board) for about $25. Now you can dream of a 16 node “Berry Bunch” at about $200 – $300. Gang together 8 of them on an outboard switch and you have a 128 Node Beowulf for about $1600 to $2400. With 128 GB of memory and decent IO speed inside the cluster. I’d have at least one node in each bunch have a USB for attached disk in a “production” system as using SD storage would be slow in production. I suspect that the Pi-Wulf boards could be made SOM style with an edge connector and get the price down to the $5 point pretty easy. Might raise the switch board cost a bit as they would be SIM sockets instead of ethernet plugs.

    At any rate, if anyone knows of such a “minimal compute module with ethernet” for under $30, I’d love to know about it (or even under $50 if the CPU is fast enough that the $/compute wins).
