When 8 Arm64 Cores Are Just Not Enough…

So I was looking at BSD on the Pi and other Arm chip boards, with particular interest in the v8 or Arm64 chips (as the 64 bit hardware is faster for high precision math). Thinking a bit about maybe putting it on my “Octo-core” Odroid XU4… and discovered that Cavium has shipped their OMG-cores chips. Not only that, but it is the reference platform for Arm64 FreeBSD.

https://wiki.freebsd.org/arm64

Cavium ThunderX

Cavium’s ThunderX is the initial reference target platform for FreeBSD/arm64.

FreeBSD 11.0 supports the ThunderX EVB (evaluation board) and CRB (customer reference board) in SMP mode (48 CPU cores). SATA drives, PCIe expansion cards, and the on-chip network interface are fully supported.

FreeBSD is available on the 2 socket, 96 core Type 2A ThunderX systems at Packet.net.

Demo of SMP kernel on ThunderX

Yes, that’s right, a 48 core board, and a 2 socket 96 core system. For when 8 cores are just not enough… As a compute engine, this ought to scream. So what’s it cost? Remarkably affordable. This is list price, and without any searching for low cost deals / providers:

https://www.asacomputers.com/Cavium-ThunderX.html

Cavium ThunderX ARM 1U
SKU: ASA1901-48C-TX

1 x Cavium® ThunderX™ 48-core ARM processor

8 x DDR4 DIMM slots

1 x 40GbE QSFP+ LAN port

4 x 10GbE SFP+ LAN ports

4 x 3.5” hot-swappable HDD/SSD bays

400W 80 PLUS Gold single PSU

Starting configuration: $2350.00

They have a lower end 32 core model for $1726 and then there’s the desktop / towers:

Cavium ThunderX ARM Tower
SKU: ASA9104-32C-TX

Cavium® ThunderX™ family, 1 x ThunderX_CP™ processor 64bit ARMv8 architecture, 32 cores per processor, 1.8GHz BGA 2601, 28nm technology

90° Rotatable HDD Cage

Whisper-Quiet (<21dB)

Kensington Lock Support

Front I/O Ports: 2x Audio (HD/AC97) & 2x USB 3.0 & 2x USB 2.0 & 2x 1394 Firewire Ports

1x Optional Front 12cm (1850 RPM) PWM Fan

Mid-Tower Chassis Supports Micro-ATX Motherboard, Sizes – E-ATX/ATX/Micro ATX

500W Bronze Level Certified High-Efficiency Power Supply

1x Rear 12cm (1850 RPM) PWM Fan

2x 5.25" External HDD Drive Bays & 4x 3.5" Internal HDD Drive Bays

Starting configuration: $1630.00

The 48 core variation runs $2500 for the package.

I know folks who will pay $1600 range prices for high end Mac desktops.

I’m much more interested in the “Gaggle of cheap SBCs” world at the moment, and that $1630 price tag for 32 cores is about $51 / core, where you can get 4 to 8 core SBCs at about the same “$50-something” price and a similar 1.8 GHz clock speed (2.0 GHz for the 48 core model). No mention of memory size, so it is likely a configurable option. A package of 4 x XU4 (32 cores) would have 8 GB of memory, as would an 8 x 4-core SBC solution with 1 GB / board; so that’s a good comparison memory size to choose.

The big question is just what the bus speed between those cores / memory is, versus the network speed of the SBC cluster; as that’s where lots of parallel processing hits a bottleneck: on the communications speed.

Still, if you want a single tower with 32 to 48 cores of 64 bit processor running at 1.8 GHz to 2.0 GHz, it does look like a nice package. Then having a load of multi-core boxes and rack-mount bits will certainly accelerate the development of good BSD / Linux ports / support.

Whenever I’ve finally managed to fully load up my “stack of boards”, it’s nice to know there’s an easy path to a whole lot more cores in a tightly coupled package. Even if a bit expensive in comparison.

Then there’s the future upgrade path, the ThunderX2:

https://en.wikichip.org/wiki/cavium/thunderx2

ThunderX2 is a family of 64-bit multi-core ARM server microprocessors introduced by Cavium in early 2018 succeeding the original ThunderX line.

Overview

The ThunderX2 was designed to succeed the original ThunderX family. Cavium first announced the ThunderX2 on May 30, 2016, with models based on their own second-generation microarchitecture with up to 54 cores. Cavium eventually scrapped their own design and in late 2016 acquired the Vulcan design from Broadcom, which had designed a server microprocessor but gave up on the project for reasons not well understood. In early 2018, Cavium announced that their ThunderX2 processors (now based on Vulcan) had reached general availability.

CN99xx
[…]

The first parts of the ThunderX2 family, CN99xx series, that made it to general availability are based on the Vulcan microarchitecture. Those parts are different from Cavium’s original ThunderX2 design which started sampling in 2016. Originally designed by Broadcom, those parts have much higher performance and a slightly different set of features. All parts have the following features in common.
Mem: Up to 2 TiB of quad/hexa/octa- channel DDR4 2666 MT/s memory
Up to 4 TiB in dual-socket configuration
ISA: ARMv8, 128-bit NEON SIMD
I/O: x48, x56 PCIe Gen 3 Lanes
Only the 64-bit AArch64 execution state is supported. No 32-bit AArch32 support.

Two terabytes is a nice size memory ;-) but you can get to 4… Then 128 bit NEON hardware for SIMD (Single Instruction Multiple Data – basically like the old Cray “vector processor” but twice the word length, for math intensive codes). Things like computer vision and math intensive iterative models will like that. That they do not implement the 32 bit mode means two things: the first port of software will take longer (you can’t just run your existing 32 bit code / port) and ports will get the 64 bit conversion done faster (for the same reason… you can’t put it off and just run 32 bit for a while).

Up to 2.5 GHz, so nice and fast too.

This is going to push forward the state of massively parallel computing in the lower cost end of the market (i.e. not K-Core+ supercomputers).

FWIW, I was learning some Go programming language over the last week or two. Not committing to it, just checking it out. Invented by Google, so some issues there. They make available a large “free” library of routines… but you get them by having your program grab them from the Google servers when you compile it. I’m not interested in letting Google know every time I write some code. Perhaps there’s a way to snag a local copy and I’ve just not reached that point yet. But the “Guy From Google” in one of the tutorials I watched said they were constantly updating the libraries, which implies they want you to point at their current versions.

The key point to Go? It is designed to be easy to write massively parallel programs. Has intrinsics built in to spawn jobs (“goroutines”) and have them communicate. Here’s some information on it (5 pages):
https://www.ualr.edu/pxtang/papers/ACC10.pdf

It is highly similar to C with some added parallel bits, and a few things pruned out (so simpler to learn, really). As a C programmer, it is more like a dialect or extension of what you already know.

Google runs things (like their document server) on it, on thousands of processor clusters, so it is known to scale very well.

While my present path is to continue testing the performance of various parallel flavors of FORTRAN on “Pi class” SBCs, as I’ve not seen a lot of improvement (I suspect parallel FORTRAN is low on the list of priorities to make work well), I’m also going to explore some other parallel code options. In that context, Go (or in Debian package terms, “golang”) is certainly on the list for future exploration.

The most intriguing thing is that “massively parallel” hardware and languages are becoming attainable by folks in the “home gamer” class. Nice.

52 minute tutorial on Go: [embedded video]

That “Go”pher is their mascot for the language. Yes, T-shirts and all…


About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present “hot buttons” are the mythology of Climate Change and ancient metrology; but things change...
This entry was posted in Tech Bits.

9 Responses to When 8 Arm64 Cores Are Just Not Enough…

  1. Steven Fraser says:

    So, a single-board supercomputer. Hmmm. I wonder if GFSV2 would port easily to it?

    Hopefully, the fans would be non-noisy, as I know you prefer your speed to be silent.

  2. E.M.Smith says:

Don’t know that I’ll do anything with it other than a couple of test / training exercises, but golang is available for the Pi class boards:

    root@odroidxu4:/SG2/ext/chiefio/SQL/v3# apt-get install golang
    Reading package lists... Done
    Building dependency tree       
    Reading state information... Done
    The following packages were automatically installed and are no longer required:
      libjsoncpp0 libuuid-perl
    Use 'apt-get autoremove' to remove them.
    The following extra packages will be installed:
      golang-doc golang-go golang-go-linux-arm golang-src
    Recommended packages:
      golang-go.tools
    The following NEW packages will be installed:
      golang golang-doc golang-go golang-go-linux-arm golang-src
    0 upgraded, 5 newly installed, 0 to remove and 5 not upgraded.
    Need to get 18.7 MB of archives.
    After this operation, 110 MB of additional disk space will be used.
    Do you want to continue? [Y/n] y
    Get:1 http://auto.mirror.devuan.org/merged/ jessie/main golang-src armhf 2:1.3.3-1 [5,143 kB]
    Get:2 http://auto.mirror.devuan.org/merged/ jessie/main golang-go-linux-arm armhf 2:1.3.3-1 [3,361 kB]
    Get:3 http://auto.mirror.devuan.org/merged/ jessie/main golang-go armhf 2:1.3.3-1 [8,193 kB]
    Get:4 http://auto.mirror.devuan.org/merged/ jessie/main golang-doc all 2:1.3.3-1 [1,950 kB]                                                                     
    Get:5 http://auto.mirror.devuan.org/merged/ jessie/main golang all 2:1.3.3-1 [25.0 kB]                                                                          
    Fetched 18.7 MB in 13s (1,367 kB/s)                                                                                                                             
    Selecting previously unselected package golang-src.
    (Reading database ... 92680 files and directories currently installed.)
    Preparing to unpack .../golang-src_2%3a1.3.3-1_armhf.deb ...
    Unpacking golang-src (2:1.3.3-1) ...
    Selecting previously unselected package golang-go-linux-arm.
    Preparing to unpack .../golang-go-linux-arm_2%3a1.3.3-1_armhf.deb ...
    Unpacking golang-go-linux-arm (2:1.3.3-1) ...
    Selecting previously unselected package golang-go.
    Preparing to unpack .../golang-go_2%3a1.3.3-1_armhf.deb ...
    Unpacking golang-go (2:1.3.3-1) ...
    Selecting previously unselected package golang-doc.
    Preparing to unpack .../golang-doc_2%3a1.3.3-1_all.deb ...
    Unpacking golang-doc (2:1.3.3-1) ...
    Selecting previously unselected package golang.
    Preparing to unpack .../golang_2%3a1.3.3-1_all.deb ...
    Unpacking golang (2:1.3.3-1) ...
    Processing triggers for man-db (2.7.0.2-5) ...
    Setting up golang-src (2:1.3.3-1) ...
    Setting up golang-go-linux-arm (2:1.3.3-1) ...
    Setting up golang-go (2:1.3.3-1) ...
    Setting up golang-doc (2:1.3.3-1) ...
    Setting up golang (2:1.3.3-1) ...
    root@odroidxu4:/SG2/ext/chiefio/SQL/v3# 
    

    Interesting that it looks like some source packages are included

    Setting up golang-src

    There’s also a package that lets you open a MySQL database from inside Go, but I’m not going to do that unless / until I do more than just toy cases for learning…

    https://tutorialedge.net/golang/golang-mysql-tutorial/

    that depends on this:

    https://github.com/go-sql-driver/mysql

    so a bit of “some assembly required” to make it all work well together. So there’s a lot of potential there for writing fast, understandable, and highly parallel code using a database (and database locking).

    That’s all on the back burner behind a bunch of other stuff (like the actual temperature database design / load) so unlikely anything soon. These are just to save pointers to where things were found. And save other folks the digging time ;-)

    In general, Go is an interesting language and approach, proven in actual use at scale by Google. As a “New, Trendy, Cool!” language, I have some reservations about “rushing there”; but as a “mostly C” like language, it isn’t much of a workload… I could probably be productive in under a week. Minimally programming in a day.

    I wonder if the bits removed from C in making Go were any of the parts important to making an operating system? If not, then making highly parallel OS code ought to be easy… But I think it was. IIRC some of the trick pointer stuff was left out…

    Interesting paper comparing Go to C
    https://dead10ck.github.io/2014/12/15/go-vs-c.html

    One of the big pluses for Golang is just the folks who created it:

    https://en.wikipedia.org/wiki/Go_(programming_language)

    Go (often referred to as Golang) is a statically typed, compiled programming language designed at Google by Robert Griesemer, Rob Pike, and Ken Thompson. Go is syntactically similar to C, but with the added benefits of memory safety, garbage collection, structural typing, and CSP-style concurrency.

    There are two major implementations:

    Google’s self-hosting compiler toolchain targeting multiple operating systems, mobile devices, and WebAssembly.
    gccgo, a GCC frontend.

    A third compiler, GopherJS, compiles Go to JavaScript for front-end web development.

    Yeah, THAT Ken Thompson…

    Omissions

    Go deliberately omits certain features common in other languages, including (implementation) inheritance, generic programming, assertions,[e] pointer arithmetic,[d] implicit type conversions, untagged unions,[f] and tagged unions.[g] The designers added only those facilities that all three agreed on.

    Of the omitted language features, the designers explicitly argue against assertions and pointer arithmetic, while defending the choice to omit type inheritance as giving a more useful language, encouraging instead the use of interfaces to achieve dynamic dispatch[h] and composition to reuse code. Composition and delegation are in fact largely automated by struct embedding; according to researchers Schmager et al., this feature “has many of the drawbacks of inheritance: it affects the public interface of objects, it is not fine-grained (i.e, no method-level control over embedding), methods of embedded objects cannot be hidden, and it is static”, making it “not obvious” whether programmers will overuse it to the extent that programmers in other languages are reputed to overuse inheritance.[58]

    The designers express an openness to generic programming and note that built-in functions are in fact type-generic, but these are treated as special cases; Pike calls this a weakness that may at some point be changed. The Google team built at least one compiler for an experimental Go dialect with generics, but did not release it. They are also open to standardizing ways to apply code generation.

    Initially omitted, the exception-like panic/recover mechanism was eventually added, which the Go authors advise using for unrecoverable errors such as those that should halt an entire program or server request, or as a shortcut to propagate errors up the stack within a package (but not across package boundaries; there, error returns are the standard API).

It looks to me like writing an OS in it could be done, though the implementation would be a bit slower. I doubt anyone will bother. I could, perhaps, see some parts written in it. But most OS stuff doesn’t like to be spread around and broken up… that whole race condition and locking thing…

    A major open issue is how fast it will mutate (some new languages change so fast they die… ALGOL came in a few flavors in the first few years and that slowed things, still complicates retro-programming. Perl has a couple of incompatible types and that slowed adoption – I know it stopped me. FORTRAN has survived a few big revisions, but they happen about every 20 years, so slowly, and you are allowed to keep running the old specs.) Then there’s the question of how much Google will let it live free… I’d not want to be locked into them and their decisions…

    So, OK, for now an interesting thing to play with and useful for some small projects I’ve got, maybe. That the speed comparison showed FORTRAN still faster says it will stay the choice for parallel models (with Coarrays built in..) until something else is shown better; PROVIDED the implementation in Debian is actually fast…

  3. E.M.Smith says:

    @Steven:

    Well, it does run Linux & BSD, so it ought to port… also they state that the fan is a quiet one.

    Whisper-Quiet (<21dB)

    I’m fine with that… (it is just the $2k that’s an issue for me ;-)

  4. Steven Fraser says:

    @EM: Yeah, me too.

  5. Larry Ledwick says:

    Sounds like a good start to a home made super computer for all sorts of interesting ideas.

Unfortunately that includes black hat hackers making password cracking systems, foreign renegade governments building systems to model “complex physics processes”, etc.; but those prices are peanuts to folks who do that kind of stuff.

  6. jim2 says:

    Did u see this one?

    New RPi in four flavors:

    Today we bring you the latest iteration of the Raspberry Pi Compute Module series: Compute Module 3+ (CM3+). This newest version of our flexible board for industrial applications offers over ten times the ARM performance, twice the RAM capacity, and up to eight times the Flash capacity of the original Compute Module.

    https://www.raspberrypi.org/blog/compute-module-3-on-sale-now-from-25/

    I wonder when we’ll see RPi M4xxx

  7. E.M.Smith says:

    @Jim2:

    Yes, I’ve seen it. It is basically a Pi M3 minus some I/O facilities.

    I’m generally unenthused about the compute modules, and would rather pay the extra $10 to get all the interfaces, while avoiding the need for an interface adapter or motherboard. Were I their target market (OEM Manufacturers) buying dozens or hundreds, then I’d care. Or if you custom fab a board that holds 8 to 16 of them with on board networking, it makes a nice compute engine, but that takes custom fab of PC boards…

Note that the speed claims are relative to the ORIGINAL compute module with a single v6 core at 700 MHz. That’s a common technique in marketing (and “climate science”): cherry pick a baseline for comparisons.

  8. ossqss says:

Ya know, I would be interested in seeing how my 12 year old, or more, mother hen Dell 3.6 gigahertz hyper-threading desktop performed against such. Maybe we could do a test? BTW, it is an XP machine. Would I get a 3 stroke handicap for that ;-)

I have definitely watched the development (look at Intel Core i-whatever specs) in this arena focus on power consumption, not really performance, recently.

Perhaps I am off base, but my 5 year old, or more, phone runs at 2.7 GHz with 3 GB of RAM pushing somewhere around 2560×1440 resolution. Probably better than my Dell desktop! Now that would be a funny test. Hummm, I do have a spare phone……..

  9. E.M.Smith says:

    @Ossqss:

There are pretty good published benchmarks. Intel, for many decades, has had a BIG performance edge over ARM. That is changing. 3 reasons:

    1) Intel patents on a lot of key tech have expired. Anyone can do that stuff now. If it was in use in a chip in the year 2000, it is almost certainly public domain now.

    2) ARM is no longer really RISC. They have been busy shoving more advanced architecture bits into it over recent years. Even more “exotic” bits like parallel pipelines and predictive branch execution. A lot of what gave Intel a lot more “juice” for a long time.

3) SPECTRE & MELTDOWN (and a few others) patches. These exploits used just those predictive branch things that let lots of parallel (speculative) execution happen. The patches shut off some of those “features” (and some benchmarks have shown about a 20% to 30% decrease in performance).

Don’t get me wrong, a big ol’ Intel CPU still runs rings around a 2 W Arm chip (some large part of that being all the heat and electricity…), but the gap is narrowing. As software has become more “multiple thread friendly” you can also more effectively use a gaggle of 5 ¢ Arm cores to do what a $100 Intel chip does. (Though not the high end multi-$Hundreds chips, yet…)

    https://chiefio.wordpress.com/2015/10/14/64-bit-vs-raspberry-pi-model-2-a-surprising-thing/

    The command executed was this:

    [root@CentosBox TempsArc]# cat /usr/bin/sqitlocal 
    mksquashfs ${1-/tmp} ${2-/tmp/$1}.sqsh -b 65536
    

    Basically the same as the ‘sqit’ command but letting me put the output somewhere more interesting, like on the local disk as /tmp/GHCN.sqsh.

    [root@CentosBox TempsArc]# ls -l /tmp/GHCN.sqsh 
    -rwx------. 1 root root 3712598016 Oct 14 17:44 /tmp/GHCN.sqsh
    [root@CentosBox TempsArc]# du -ms GHCN
    6926	GHCN
    

    So a 6.9 GB file got reduced to 3.7 GB using about an hour of ‘wall time’ in that “real 60′ and about 44 minutes of User CPU time along with a nearly irrelevant 3 minutes of ‘system’ CPU time.
    […]
    Yes, only 45 minutes ‘wall time’. The user time is 173 minutes but you must divide by 4 as there are 4 processors.

    So on an actual wall time basis, the RPiM2 is 4/3 the speed of that AMD 64 bit CPU. Golly.

    Now you didn’t notice, but I swapped over to the RPiM2 to paste in the stats from the terminal window there. I can state without reservation that the Antek / ASUS / AMD 64 bit CPU machine has a much more responsive and “liquid” feel when editing WordPress pages. Likely due to having a whole CPU instead of 1/4 of 4 available. But once this code is made to “run parallel”, then the RPiM2 will beat it here, too.

    Not at all what I expected.

    Now that’s a very old AMD 64 bit box, but also an old Arm board the Pi M2. I’m now on an octo-core Odroid with individual CPUs much faster than that. My RockPro64 has even faster individual ARM cores…

    So at present it all comes down to how well your code will “parallelize”… Have ONE THREAD that does everything? That Intel WonderChip wins. Have a mixed load of many things, and stuff that does multi-threaded well? The gaggle of Arm chips wins.

For me, the Odroid XU4 / RockPro64 class have “enough” single core performance and the multicores take up lots of the work. THE big hold up had been web browsers, but in the last year+ they have gone to much more multi-threaded builds and that’s just not an issue anymore. The only limit now is video performance (and that largely comes down to the software – my favorite Linux builds don’t yet do all the graphics / video in the GPU as they ought to do…).

    So I can tell you right now how they will compare: It will all depend on how multi-threaded the benchmark is, that you chose to run.
