So I was looking at BSD on the Pi and other Arm boards, with particular interest in the v8 or Arm64 chips (as 64 bit arithmetic is faster for high precision math). Thinking a bit about maybe putting it on my “Octo-core” Odroid XU4… and discovered that Cavium have shipped their OMG-cores chips. Not only that, but it is the reference platform for Arm64 FreeBSD.
Cavium’s ThunderX is the initial reference target platform for FreeBSD/arm64.
FreeBSD 11.0 supports the ThunderX EVB (evaluation board) and CRB (customer reference board) in SMP mode (48 CPU cores). SATA drives, PCIe expansion cards, and the on-chip network interface are fully supported.
FreeBSD is available on the 2 socket, 96 core Type 2A ThunderX systems at Packet.net.
Demo of SMP kernel on ThunderX
Yes, that’s right, a 48 core board, and a 2 socket 96 core system. For when 8 cores are just not enough… As a compute engine, this ought to scream. So what’s it cost? Remarkably affordable. This is list price, and without any searching for low cost deals / providers:
Cavium ThunderX ARM 1U
1 x Cavium® ThunderX™ 48-core ARM processor
8 x DDR4 DIMM slots
1 x 40GbE QSFP+ LAN port
4 x 10GbE SFP+ LAN ports
4 x 3.5” hot-swappable HDD/SSD bays
400W 80 PLUS Gold single PSU
Starting configuration: $2350.00
They have a lower end 32 core model for $1726 and then there’s the desktop / towers:
Cavium ThunderX ARM Tower
Cavium® ThunderX™ family, 1 x ThunderX_CP™ processor 64bit ARMv8 architecture, 32 cores per processor, 1.8GHz BGA 2601, 28nm technology
90° Rotatable HDD Cage
Kensington Lock Support
Front I/O Ports: 2x Audio (HD/AC97) & 2x USB 3.0 & 2x USB 2.0 & 2x 1394 Firewire Ports
1x Optional Front 12cm (1850 RPM) PWM Fan
Mid-Tower Chassis Supports Micro-ATX Motherboard, Sizes – E-ATX/ATX/Micro ATX
500W Bronze Level Certified High-Efficiency Power Supply
1x Rear 12cm (1850 RPM) PWM Fan
2x 5.25" External HDD Drive Bays & 4x 3.5" Internal HDD Drive Bays
Starting configuration: $1630.00
The 48 core variation runs $2500 for the package.
I know folks who will pay $1600 range prices for high end Mac desktops.
I’m much more interested in the “Gaggle of cheap SBCs” world at the moment, and that $1630 price tag for 32 cores is about $51 / core, where you can get 4 to 8 core SBCs at about the same “$50-something” price range and a similar 1.8 GHz clock speed (2.0 GHz for the 48 core model). No mention of memory size so it is likely a configurable option. A package of 4 x XU4 (32 cores) would have 8 GB of memory as would an 8 x 4-core SBC solution with 1 GB / board; so that’s a good comparison memory size to choose.
The big question is just what is the bus speed between those cores / memory vs the network speed of the SBC cluster; as that’s where lots of parallel processing hits a bottleneck: on the communications speed.
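For anyone who wants to check the per-core arithmetic, here is a minimal sketch in Go. The prices and core counts are the list prices quoted in this post; the `costPerCore` helper and the labels are just names made up for the illustration, and memory and other options are not included in any of the figures.

```go
package main

import "fmt"

// costPerCore is a hypothetical helper: list price divided by core count.
func costPerCore(priceUSD float64, cores int) float64 {
	return priceUSD / float64(cores)
}

func main() {
	// Starting-configuration list prices as quoted in this post.
	systems := []struct {
		name     string
		priceUSD float64
		cores    int
	}{
		{"ThunderX 1U, 48 cores", 2350, 48},
		{"Lower end model, 32 cores", 1726, 32},
		{"ThunderX tower, 32 cores", 1630, 32},
		{"ThunderX tower, 48 cores", 2500, 48},
	}
	for _, s := range systems {
		fmt.Printf("%-26s $%.2f / core\n", s.name, costPerCore(s.priceUSD, s.cores))
	}
}
```

All four land in the roughly $49 to $54 per core range, which is why the comparison with “$50-something” SBCs is a fair one on price alone.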
Still, if you want a single tower with 32 to 48 cores of 64 bit processor running at 1.8 GHz to 2.0 GHz, it does look like a nice package. Then having a load of multi-core boxes and rack-mount bits will certainly accelerate the development of good BSD / Linux ports / support.
Whenever I’ve finally managed to fully load up my “stack of boards”, it’s nice to know there’s an easy path to a whole lot more cores in a tightly coupled package. Even if a bit expensive in comparison.
Then there’s the future upgrade path, the ThunderX2:
ThunderX2 is a family of 64-bit multi-core ARM server microprocessors introduced by Cavium in early 2018 succeeding the original ThunderX line.
The ThunderX2 was designed to succeed the original ThunderX family. Cavium first announced the ThunderX2 back in May 30 2016 with models based on their own second-generation microarchitecture with models up to 54 cores. Cavium eventually scrapped their own design and in late 2016 they acquired the Vulcan design from Broadcom which has designed a server microprocessor but has given up on the project for reasons not well understood. In early 2018, Cavium announced that their ThunderX2 processors (now based on Vulcan) have reached general availability.
The first parts of the ThunderX2 family, CN99xx series, that made it to general availability are based on the Vulcan microarchitecture. Those parts are different from Cavium’s original ThunderX2 design which started sampling in 2016. Originally designed by Broadcom, those parts have much higher performance and a slightly different set of features. All parts have the following features in common.
Mem: Up to 2 TiB of quad/hexa/octa- channel DDR4 2666 MT/s memory
Up to 4 TiB in dual-socket configuration
ISA: ARMv8, 128-bit NEON SIMD
I/O: x48, x56 PCIe Gen 3 Lanes
Only the 64-bit AArch64 execution state is supported. No 32-bit AArch32 support.
Two terabytes is a nice size memory ;-) but you can get to 4… Then 128 bit NEON hardware for SIMD (Single Instruction Multiple Data – basically like the old Cray “vector processor” but twice the word length, for math intensive codes). Things like computer vision and math intensive iterative models will like that. That they do not implement the 32 bit mode means two things: The first port of software will take longer (can’t just run your existing 32 bit code / port) and ports will get the 64 bit conversion done faster (for the same reason…you can’t put it off and just run 32 bit for a while).
Up to 2.5 GHz, so nice and fast too.
This is going to push forward the state of massively parallel computing in the lower cost end of the market (i.e. not K-Core+ supercomputers).
FWIW, I was learning some Go programming language over the last week or two. Not committing to it, just checking it out. Invented by Google, so some issues there. They make available a large “free” collection of library routines … but you get them by having your program grab them from the Google servers when you compile it. I’m not interested in letting Google know every time I write some code. Perhaps there’s a way to snag a local copy and I’ve just not reached that point yet. But the “Guy From Google” in one of the tutorials I watched said they were constantly updating the libraries, which implies they want you to point at their current versions.
The key point to Go? It is designed to be easy to write massively parallel programs. Has intrinsics built in to spawn jobs (“goroutines”) and have them communicate. Here’s some information on it (5 pages):
It is highly similar to C with some added parallel bits, and a few things pruned out (so simpler to learn, really). As a C programmer, it is more like a dialect or extension of what you already know.
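To give a flavor of those parallel bits, here is a minimal sketch of goroutines talking over a channel. It is my own toy example, not anything from the tutorials: the function names (`sumSquares`, `parallelSumSquares`) and the sum-of-squares workload are made up for illustration, and it uses only the standard library (`fmt` and `sync`), so nothing gets fetched from anywhere at compile time.

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares adds up i*i for i in [lo, hi).
func sumSquares(lo, hi int) int {
	total := 0
	for i := lo; i < hi; i++ {
		total += i * i
	}
	return total
}

// parallelSumSquares splits [0, n) into chunks, runs each chunk in its
// own goroutine, and collects the partial sums over a channel.
// Assumes workers > 0.
func parallelSumSquares(n, workers int) int {
	// Buffered so every worker can send without waiting for a reader.
	results := make(chan int, workers)
	var wg sync.WaitGroup
	chunk := (n + workers - 1) / workers // round up so nothing is missed

	for w := 0; w < workers; w++ {
		lo := w * chunk
		hi := lo + chunk
		if hi > n {
			hi = n
		}
		wg.Add(1)
		go func(lo, hi int) { // "go" spawns the goroutine
			defer wg.Done()
			results <- sumSquares(lo, hi) // send the partial sum
		}(lo, hi)
	}

	wg.Wait()      // all workers have sent their results
	close(results) // lets the range loop below terminate

	total := 0
	for partial := range results {
		total += partial
	}
	return total
}

func main() {
	fmt.Println(parallelSumSquares(1000, 8))
}
```

The `go` keyword and the `chan` type are the built-in pieces; the rest reads much like C. On a 48 core box you would presumably just raise the worker count, though whether this sort of thing actually scales depends on the workload and the memory bus, as noted above.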
Google runs things (like their document server) on it, on thousands of processor clusters, so it is known to scale very well.
My present path is to continue to test the performance of various parallel types of FORTRAN on “Pi class” SBCs. As I’ve not seen a lot of improvement (I suspect parallel FORTRAN is low on the list of priorities to make work well), I’m also going to explore some other parallel code options. In that context, Go (or in Debian package terms, “golang”) is certainly on the list for future exploration.
The most intriguing thing is that “massively parallel” hardware and languages are becoming attainable by folks in the “home gamer” class. Nice.
52 minute tutorial on Go:
That “Go”pher is their mascot for the language. Yes, T-shirts and all…