So I was looking at BSD on the Pi and other Arm chip boards, with particular interest in the v8 or Arm64 chips (as the 64-bit hardware is faster for high precision math). Thinking a bit about maybe putting it on my “Octo-core” Odroid XU4… and discovered that Cavium has shipped their OMG-cores chips. Not only that, but it is the reference platform for Arm64 FreeBSD.
https://wiki.freebsd.org/arm64
Cavium ThunderX
Cavium’s ThunderX is the initial reference target platform for FreeBSD/arm64.
FreeBSD 11.0 supports the ThunderX EVB (evaluation board) and CRB (customer reference board) in SMP mode (48 CPU cores). SATA drives, PCIe expansion cards, and the on-chip network interface are fully supported.
FreeBSD is available on the 2 socket, 96 core Type 2A ThunderX systems at Packet.net.
Demo of SMP kernel on ThunderX
Yes, that’s right, a 48 core board, and a 2 socket 96 core system. For when 8 cores are just not enough… As a compute engine, this ought to scream. So what’s it cost? Remarkably affordable. This is list price, and without any searching for low cost deals / providers:
https://www.asacomputers.com/Cavium-ThunderX.html
Cavium ThunderX ARM 1U
SKU: ASA1901-48C-TX
1 x Cavium® ThunderX™ 48-core ARM processor
8 x DDR4 DIMM slots
1 x 40GbE QSFP+ LAN port
4 x 10GbE SFP+ LAN ports
4 x 3.5” hot-swappable HDD/SSD bays
400W 80 PLUS Gold single PSU
Starting configuration: $2350.00
They have a lower end 32 core model for $1726 and then there’s the desktop / towers:
Cavium ThunderX ARM Tower
SKU: ASA9104-32C-TX
Cavium® ThunderX™ family, 1 x ThunderX_CP™ processor, 64-bit ARMv8 architecture, 32 cores per processor, 1.8GHz, BGA 2601, 28nm technology
90° Rotatable HDD Cage
Whisper-Quiet (<21dB)
Kensington Lock Support
Front I/O Ports: 2x Audio (HD/AC97) & 2x USB 3.0 & 2x USB 2.0 & 2x 1394 Firewire Ports
1x Optional Front 12cm (1850 RPM) PWM Fan
Mid-Tower Chassis Supports Micro-ATX Motherboard, Sizes – E-ATX/ATX/Micro ATX
500W Bronze Level Certified High-Efficiency Power Supply
1x Rear 12cm (1850 RPM) PWM Fan
2x 5.25" External HDD Drive Bays & 4x 3.5" Internal HDD Drive Bays
Starting configuration: $1630.00
Where the 48 core variation runs $2500 for the package.
I know folks who will pay $1600 range prices for high end Mac desktops.
I’m much more interested in the “Gaggle of cheap SBCs” world at the moment. That $1630 price tag for 32 cores is about $51 / core, where you can get 4 to 8 core SBCs in about the same “$50-something” price range and at a similar 1.8 GHz clock speed (2.0 GHz for the 48 core model). No mention of memory size, so it is likely a configurable option. A package of 4 x XU4 (32 cores) would have 8 GB of memory, as would an 8 x 4-core SBC solution with 1 GB / board; so that’s a good comparison memory size to choose.
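For what it’s worth, the per-core arithmetic is trivial to check (a throwaway sketch; the prices are the list prices quoted above):

```go
package main

import "fmt"

// perCore computes the list price per CPU core.
func perCore(priceUSD float64, cores int) float64 {
	return priceUSD / float64(cores)
}

func main() {
	fmt.Printf("32-core tower: $%.2f/core\n", perCore(1630, 32)) // ≈ $50.94
	fmt.Printf("48-core 1U:    $%.2f/core\n", perCore(2350, 48)) // ≈ $48.96
}
```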
The big question is just what the bus speed between those cores / memory is vs the network speed of the SBC cluster; as that’s where lots of parallel processing hits a bottleneck: on the communications speed.
Still, if you want a single tower with 32 to 48 cores of 64 bit processor running at 1.8 GHz to 2.0 GHz, it does look like a nice package. Then having a load of multi-core boxes and rack-mount bits will certainly accelerate the development of good BSD / Linux ports / support.
Whenever I’ve finally managed to fully load up my “stack of boards”, it’s nice to know there’s an easy path to a whole lot more cores in a tightly coupled package. Even if a bit expensive in comparison.
Then there’s the future upgrade path, the ThunderX2:
https://en.wikichip.org/wiki/cavium/thunderx2
ThunderX2 is a family of 64-bit multi-core ARM server microprocessors introduced by Cavium in early 2018 succeeding the original ThunderX line.
Overview
The ThunderX2 was designed to succeed the original ThunderX family. Cavium first announced the ThunderX2 on May 30, 2016, with models based on their own second-generation microarchitecture, with up to 54 cores. Cavium eventually scrapped their own design, and in late 2016 they acquired the Vulcan design from Broadcom, which had designed a server microprocessor but gave up on the project for reasons not well understood. In early 2018, Cavium announced that their ThunderX2 processors (now based on Vulcan) had reached general availability.
CN99xx
[…] The first parts of the ThunderX2 family, the CN99xx series, that made it to general availability are based on the Vulcan microarchitecture. Those parts are different from Cavium’s original ThunderX2 design, which started sampling in 2016. Originally designed by Broadcom, those parts have much higher performance and a slightly different set of features. All parts have the following features in common:
Mem: Up to 2 TiB of quad/hexa/octa- channel DDR4 2666 MT/s memory
Up to 4 TiB in dual-socket configuration
ISA: ARMv8, 128-bit NEON SIMD
I/O: x48, x56 PCIe Gen 3 Lanes
Only the 64-bit AArch64 execution state is supported. No 32-bit AArch32 support.
Two terabytes is a nice size memory ;-) but you can get to 4… Then 128 bit NEON hardware for SIMD (Single Instruction Multiple Data – basically like the old Cray “vector processor” but twice the word length, for math intensive codes). Things like computer vision and math intensive iterative models will like that. That they do not implement the 32 bit mode means two things: the first port of software will take longer (you can’t just run your existing 32 bit code while porting), and ports will get the 64 bit conversion done faster (for the same reason… you can’t put it off and just run 32 bit for a while).
Up to 2.5 GHz, so nice and fast too.
This is going to push forward the state of massively parallel computing in the lower cost end of the market (i.e. not K-Core+ supercomputers).
FWIW, I was learning some Go programming language over the last week or two. Not committing to it, just checking it out. Invented by Google, so some issues there. They make available a large “free” collection of library routines… but you get them by having your program grab them from the Google servers when you compile it. I’m not interested in letting Google know every time I write some code. Perhaps there’s a way to snag a local copy and I’ve just not reached that point yet. But the “Guy From Google” in one of the tutorials I watched said they were constantly updating the libraries, which implies they want you to point at their current versions.
The key point to Go? It is designed to make it easy to write massively parallel programs. It has primitives built in to spawn concurrent jobs (“goroutines”) and have them communicate (over “channels”). Here’s some information on it (5 pages):
https://www.ualr.edu/pxtang/papers/ACC10.pdf
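A minimal sketch of that goroutine / channel model (my own toy example, not from the paper above): two goroutines each sum half of a slice, then report their partial totals back over a channel.

```go
package main

import "fmt"

// partialSum adds up the numbers in its slice and sends the total on ch.
func partialSum(nums []int, ch chan<- int) {
	total := 0
	for _, n := range nums {
		total += n
	}
	ch <- total
}

func main() {
	nums := []int{1, 2, 3, 4, 5, 6, 7, 8}
	ch := make(chan int)

	// Spawn two goroutines, each summing half the slice.
	go partialSum(nums[:len(nums)/2], ch)
	go partialSum(nums[len(nums)/2:], ch)

	// Channel receives block until each goroutine has sent its result.
	sum := <-ch + <-ch
	fmt.Println(sum) // 36
}
```

That’s the whole concurrency story in a dozen lines: no thread library, no explicit locks, just `go` and a channel.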
It is highly similar to C with some added parallel bits, and a few things pruned out (so simpler to learn, really). As a C programmer, it is more like a dialect or extension of what you already know.
Google runs things (like their document server) on it, on thousands of processor clusters, so it is known to scale very well.
While my present path is to continue testing the performance of various parallel types of FORTRAN on “Pi class” SBCs: as I’ve not seen a lot of improvement (I suspect parallel FORTRAN is low on the list of priorities to make work well), I’m also going to explore some other parallel code options. In that context, Go (or in Debian package terms, “golang”) is certainly on the list for future exploration.
The most intriguing thing is that “massively parallel” hardware and languages are becoming attainable by folks in the “home gamer” class. Nice.
52 minute tutorial on Go:
That “Go”pher is their mascot for the language. Yes, T-shirts and all…
So, a single-board supercomputer. Hmmm. I wonder if GFSV2 would port easily to it?
Hopefully, the fans would be non-noisy, as I know you prefer your speed to be silent.
Don’t know that I’ll do anything with it other than a couple of test / training exercises, but golang is available for the Pi class boards:
Interesting that it looks like some source packages are included
There’s also a package that lets you open a MySQL database from inside Go, but I’m not going to do that unless / until I do more than just toy cases for learning…
https://tutorialedge.net/golang/golang-mysql-tutorial/
that depends on this:
https://github.com/go-sql-driver/mysql
so a bit of “some assembly required” to make it all work well together. There’s a lot of potential there for writing fast, understandable, and highly parallel code using a database (and database locking).
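The usual Go pattern for that kind of parallel work is a worker pool: fan jobs out over a channel to a fixed number of goroutines and collect the results. A sketch of the pattern (the `square` function is a hypothetical stand-in for real per-record work, e.g. a database row update; nothing here actually touches MySQL):

```go
package main

import (
	"fmt"
	"sync"
)

// square stands in for real per-record work (e.g. a database update).
func square(n int) int { return n * n }

// runPool fans jobs out to nWorkers goroutines and collects the results.
func runPool(jobs []int, nWorkers int) []int {
	in := make(chan int)
	out := make(chan int)
	var wg sync.WaitGroup

	// Start the workers; each drains the input channel until it closes.
	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range in {
				out <- square(n)
			}
		}()
	}

	// Feed the jobs, then close the input channel so the workers exit.
	go func() {
		for _, j := range jobs {
			in <- j
		}
		close(in)
	}()

	// Close the output channel once all workers have finished.
	go func() {
		wg.Wait()
		close(out)
	}()

	var results []int
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	fmt.Println(runPool([]int{1, 2, 3, 4}, 2)) // squares, in arrival order
}
```

Note the results arrive in whatever order the workers finish; if order matters you tag each job with an index.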
That’s all on the back burner behind a bunch of other stuff (like the actual temperature database design / load) so unlikely anything soon. These are just to save pointers to where things were found. And save other folks the digging time ;-)
In general, Go is an interesting language and approach, proven in actual use at scale by Google. As a “New, Trendy, Cool!” language, I have some reservations about “rushing there”; but as a “mostly C” like language, it isn’t much of a workload… I could probably be productive in under a week. Minimally programming in a day.
I wonder if the bits removed from C in making Go were any of the parts important to making an operating system? If not, then making highly parallel OS code ought to be easy… But I think it was. IIRC some of the trick pointer stuff was left out…
Interesting paper comparing Go to C
https://dead10ck.github.io/2014/12/15/go-vs-c.html
One of the big pluses for Golang is just the folks who created it:
https://en.wikipedia.org/wiki/Go_(programming_language)
Yeah, THAT Ken Thompson…
It looks to me like an OS written in it would be a bit slower, but it could be done. I doubt anyone will bother. I could, perhaps, see some parts written in it. But most OS stuff doesn’t like to be spread around and broken up… that whole race condition and locking thing…
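That “race condition and locking thing” is exactly what Go’s standard `sync` package covers. A minimal sketch (the `Counter` type is my own illustration): a mutex guards the read-modify-write so a thousand concurrent goroutines can’t stomp on each other.

```go
package main

import (
	"fmt"
	"sync"
)

// Counter guards its count with a mutex so concurrent goroutines can't
// interleave the read-modify-write — the classic race condition.
type Counter struct {
	mu sync.Mutex
	n  int
}

func (c *Counter) Inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
}

func (c *Counter) Value() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

func main() {
	var c Counter
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Inc()
		}()
	}
	wg.Wait()
	fmt.Println(c.Value()) // 1000 — without the mutex this can come up short
}
```

The Go toolchain will even catch the unlocked version for you: build with `go run -race` and it reports the data race at runtime.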
A major open issue is how fast it will mutate (some new languages change so fast they die… ALGOL came in a few flavors in the first few years and that slowed things, and still complicates retro-programming. Perl has a couple of incompatible versions and that slowed adoption – I know it stopped me. FORTRAN has survived a few big revisions, but they happen about every 20 years, so slowly, and you are allowed to keep running the old specs.) Then there’s the question of how much Google will let it live free… I’d not want to be locked into them and their decisions…
So, OK, for now an interesting thing to play with and useful for some small projects I’ve got, maybe. That the speed comparison showed FORTRAN still faster says it will stay the choice for parallel models (with Coarrays built in..) until something else is shown better; PROVIDED the implementation in Debian is actually fast…
@Steven:
Well, it does run Linux & BSD, so it ought to port… also they state that the fan is a quiet one.
I’m fine with that… (it is just the $2k that’s an issue for me ;-)
@EM: Yeah, me too.
Sounds like a good start to a home made super computer for all sorts of interesting ideas.
Unfortunately that includes black hat hackers building password cracking systems, foreign renegade governments building systems to model “complex physics processes”, etc., but those prices are peanuts to folks who do that kind of stuff.
Did you see this one?
New RPi in four flavors:
Today we bring you the latest iteration of the Raspberry Pi Compute Module series: Compute Module 3+ (CM3+). This newest version of our flexible board for industrial applications offers over ten times the ARM performance, twice the RAM capacity, and up to eight times the Flash capacity of the original Compute Module.
https://www.raspberrypi.org/blog/compute-module-3-on-sale-now-from-25/
I wonder when we’ll see RPi M4xxx
@Jim2:
Yes, I’ve seen it. It is basically a Pi M3 minus some I/O facilities.
I’m generally unenthused about the compute modules, and would rather pay the extra $10 to get all the interfaces, while avoiding the need for an interface adapter or motherboard. Were I their target market (OEM Manufacturers) buying dozens or hundreds, then I’d care. Or if you custom fab a board that holds 8 to 16 of them with on board networking, it makes a nice compute engine, but that takes custom fab of PC boards…
Note that the speed claims are relative to the ORIGINAL compute module with a single v6 core at 700 MHz. That’s a common technique in marketing (and “climate science”): cherry-picking a baseline for comparisons.
Ya know, I would be interested in seeing how my 12-year-old (or more) mother hen Dell 3.6 gigahertz hyper-threading desktop performed against such. Maybe we could do a test? BTW, it is an XP machine. Would I get a 3 stroke handicap for that ;-)
I have definitely watched the development in this arena (look at Intel Core i-whatever specs) focus on power consumption, not really performance, recently.
Perhaps I am off base, but my 5 year, or more, phone runs at 2.7 Ghz with 3 gig of Ram pushing somewhere around 2560×1440 resolution. Probably better than my Dell desktop! Now that would be a funny test. Hummm, I do have a spare phone……..
@Ossqss:
There are pretty good published benchmarks. Intel, for many decades, has had a BIG performance edge over ARM. That is changing, for 3 reasons:
1) Intel patents on a lot of key tech have expired. Anyone can do that stuff now. If it was in use in a chip in the year 2000, it is almost certainly public domain now.
2) ARM is no longer really RISC. They have been busy shoving more advanced architecture bits into it over recent years. Even more “exotic” bits like parallel pipelines and predictive branch execution. A lot of what gave Intel a lot more “juice” for a long time.
3) SPECTRE & MELTDOWN (and a few others) patches. These exploits used just those predictive branch features that let lots of parallel (speculative) execution happen. The patches shut off some of those “features” (and some benchmarks have shown about a 20% to 30% decrease in performance).
Don’t get me wrong, a big ‘ol Intel CPU still runs rings around a 2 W Arm chip (a large part of that comes from all that heat and electricity…), but the gap is narrowing. As software has become more “multiple thread friendly” you can also more effectively use a gaggle of 5 ¢ Arm cores to do what a $100 Intel chip does (though not the high end multi-$hundreds chips, yet…).
https://chiefio.wordpress.com/2015/10/14/64-bit-vs-raspberry-pi-model-2-a-surprising-thing/
Now that’s a very old AMD 64 bit box, but also an old Arm board the Pi M2. I’m now on an octo-core Odroid with individual CPUs much faster than that. My RockPro64 has even faster individual ARM cores…
So at present it all comes down to how well your code will “parallelize”… Have ONE THREAD that does everything? That Intel WonderChip wins. Have a mixed load of many things, and stuff that does multi-threaded well? The gaggle of Arm chips wins.
For me, the Odroid XU4 / RockPro64 class have “enough” single core performance and the multicores take up lots of the work. THE big hold up had been web browsers, but in the last year+ they have gone to much more multi-threaded builds and that’s just not an issue anymore. The only limit now is video performance (and that largely comes down to the software – my favorite Linux builds don’t yet do all the graphics / video in the GPU as they ought to do…).
So I can tell you right now how they will compare: it will all depend on how multi-threaded the benchmark you choose to run is.