For the last few days I’ve been “Down The Rabbit Hole” of parallel high performance computing in the modern era.
(“Why? Don’t ask why. Down that path lies insanity and ruin. -E.M.Smith”. Exploring why: Well, it’s something I do. That whole HPC thing has been in my blood since my first Cray in about 1984. Maybe since my Dual Processor B6700 days in the ’70s when dual processors were exotic. Recently it’s changed, so time for a refresher. Besides, I might want to build My Own Private Teraflop machine… “This life is not a dress rehearsal, take Big Bites!” and if nobody is going to offer me a ‘ride’, I might need to roll my own…)
First up, just a touch of history:
I’ve added some white space to make this easier to read, and bolded some bits:
The Discovery of Global Warming February 2015
General Circulation Models of Climate
The climate system is too complex for the human brain to grasp with simple insight. No scientist managed to devise a page of equations that explained the global atmosphere’s operations. With the coming of digital computers in the 1950s, a small American team set out to model the atmosphere as an array of thousands of numbers.
The work spread during the 1960s as computer modelers began to make decent short-range predictions of regional weather. Modeling long-term climate change for the entire planet, however, was held back by lack of computer power, ignorance of key processes such as cloud formation, inability to calculate the crucial ocean circulation, and insufficient data on the world’s actual climate.
By the mid 1970s, enough had been done to overcome these deficiencies so that Syukuro Manabe could make a quite convincing calculation. He reported that the Earth’s average temperature should rise a few degrees if the level of carbon dioxide gas in the atmosphere doubled. This was confirmed in the following decade by increasingly realistic models. Skeptics dismissed them all, pointing to dubious technical features and the failure of models to match some kinds of data.
By the late 1990s these problems were largely resolved, and most experts found the predictions of overall global warming plausible. Yet modelers could not be sure that the real climate, with features their equations still failed to represent, would not produce some big surprise.
(The history of rudimentary physical models without extensive calculations is told in a separate essay on Simple Models of Climate, and there is a supplementary essay for the Basic Radiation Calculations that became part of the technical foundation of comprehensive calculations. For a brief technical introduction to current climate modeling see Schmidt, Physics World, Feb. 2007).
Now first off, since when did “Plausible” become the standard of excellence in Science? Secondly, it might be interesting to look up “Syukuro Manabe” and see just what he actually said and did.
I doubt that those “deficiencies” were overcome in the “mid-70s” since in the “mid-90s” I was letting a Ph.D. student at Stanford finish his Ph.D. Thesis using our Cray. We were winding down operations, and he had “hit a wall” on computer time. He called up asking if we might donate some Cray time to him for “cloud formation simulations” as it was not understood well at all. “Maybe a few hours?”
Seems Stanford had a budget for a bright Ph.D. and he had run out of computes well short of said degree. As we had discontinued contracts with folks to buy time from us (“going out of business” is not compatible with “reliable supplier”…) and the machine was doing “not much” as the groups that had used it were laid off, “I gave him an offer he couldn’t refuse”: ALL the time he wanted. Free. Just credit Apple Computer Company in his thesis. A few hundred Cray Hours later (so a $Few-Hundred-Thousand of rental time…) he said “I have more than enough to finish my thesis. I don’t want to do any more runs.”… in that strange voice you get from a starved person who has just sat a banquet until unable to eat and is coming to terms with that new feeling….
But the point of THAT story is just this: In the mid-1990s a Stanford Ph.D. was awarded for “cloud formation” based on more compute time than anyone else had available and because nobody really had a clue how clouds formed. (As I understand it, he found solutions for one small part of the problem, not the whole process… Maybe someday I’ll look up the thesis and see what he really did ;-) FWIW, it was about 10 hours at 400 MFLOPS per single cloud formation. Not thunderstorm. Not cloud deck. Not high cirrus over low cumulus. ONE cloud in clear air. So I’m pretty sure that Syukuro had not “overcome these deficiencies” in the ’70s.
The other key bit to recognize is that date at the end. Published in Physics World 2007. That’s the general time scale we’re looking at for our “usable computer speed” to run the stuff they did. That’s just after the end of the XMP reign, in the time frame when ASCI Red was King Of The Hill. (No, not the fastest, but way more than most Climate Researchers could get their hands on. So it is “about right”.)
So what’s an ASCI-Red?
ASCI Red (also known as ASCI Option Red or TFLOPS) was the first computer built under the Accelerated Strategic Computing Initiative (ASCI), the supercomputing initiative of the United States government created to help the maintenance of the United States nuclear arsenal after the 1992 moratorium on nuclear testing.
ASCI Red was built by Intel and installed at Sandia National Laboratories in late 1996. The design was based on the Intel Paragon computer. The original goals to deliver a true teraflop machine by the end of 1996 that would be capable of running an ASCI application using all memory and nodes by September 1997 were met. It was used by the US government from the years of 1997 to 2005 and was the world’s fastest supercomputer until late 2000. It was the first ASCI machine that the Department of Energy acquired, and also the first supercomputer to score above one teraflops on the LINPACK benchmark, a test that measures a computer’s calculation speed. Later upgrades to ASCI Red allowed it to perform above two teraflops.
ASCI Red earned a reputation for reliability that some veterans say has never been beaten. Sandia director Bill Camp said that ASCI Red had the best reliability of any supercomputer ever built, and “was supercomputing’s high-water mark in longevity, price, and performance.”
ASCI Red was decommissioned in 2006.
First off, at that time “My Cray” was running about 400 MegaFlops, or about 40% of a GigaFlop. It was a big deal when someone did 2500x that speed. IMHO, these two speeds “bookend” the range of what is ‘reasonably needed’ to match the size and scale of computing used in any GCM or similar “Climate Model” of that era.
In reality, it is likely even less. While a Personal Teraflop is available to use to run 24 x 7 x 365, those $40 Million to $Billion scale machines were very much “time shared”. You would sign up for your time slice and get a limited run time. At that time, a Cray XMP CPU rented for about $1500 / hour (we rented spare time to some folks) and there were companies that specialized in arranging the rentals so that the machine ought never be completely idle. We were only a little bit in that market, so often did have some idle time, that let me do all sorts of fun projects ;-) But the point: To run for 1 day would cost about $36,000 of real salable CPU time. $144,000 if you used all four CPUs of an XMP-48. Over $4 Million for a month of use. Yes, those are “time share rental ‘mini-bar’ prices”; but that is what Management could get if they leased the time out. (That was part of my job then… as I was Management and had a budget… and needed to justify things to The Upper Management every 6 months or so…)
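That rental arithmetic is simple enough to sanity check. A little C sketch using the $1500 / CPU-hour rate quoted above (the function names are mine, just for illustration):

```c
/* "Mini-bar" pricing for Cray X-MP time, per the rates above:
   $1,500 per CPU-hour, rented around the clock. */
double day_cost(double rate_per_cpu_hour, int cpus)
{
    return rate_per_cpu_hour * 24.0 * cpus;           /* one day, all CPUs */
}

double month_cost(double rate_per_cpu_hour, int cpus)
{
    return day_cost(rate_per_cpu_hour, cpus) * 30.0;  /* ~30 day month */
}
```

day_cost(1500, 1) gives the $36,000 / day figure, day_cost(1500, 4) the $144,000 for all four CPUs of an XMP-48, and month_cost(1500, 4) lands at $4.32 Million, i.e. “Over $4 Million for a month of use.”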
The point? Simple enough. Any GCM of the era didn’t have “constantly running” as an option. Maybe now in this era of Climate Hype and Hog Tough Budgets they can, but it is much more reasonable to assume that a Teraflop then was used for a few days. And maybe even now “one run” on a 100 Teraflop machine might take a day. That means you can do the same in 100 days. Or that you can do what they did when first panicking over Global Warming in that same “few days”. In other words: One TeraFlop is enough “for all practical purposes”. Maybe even overkill.
So that’s where I planted my flag. What’s it take to get a Teraflop?
This is enlightening:
2012: IBM Sequoia, 16.32 PFLOPS, Lawrence Livermore National Laboratory, California, USA
2012: Cray Titan, 17.59 PFLOPS, Oak Ridge National Laboratory, Tennessee, USA
2013: NUDT Tianhe-2, 33.86 PFLOPS, Guangzhou, China
I find it interesting that their chart ends in 2013. We’ve had 2 more years of Moore’s Law, so things ought to have doubled since then… Today, the top speed is about 34,000 TeraFlops per the chart, and http://top500.org/ seems to confirm that.
That’s a whole lot of advance. That same effect will have been reflected in similarly lower costs and size for a TeraFlop. Since I’m more interested in running one of the more basic models, and perhaps making it more efficient and accurate, than running a bad model 34,000 times further into the error band, I’m OK with that…
Sidebar On CISC RISC SIMD MIMD SISD…
There’s a whole alphabet soup of acronyms in Supercomputing. This will be a very very shortened crib note on it.
Most of the computers most of you know are CISC “Complex Instruction Set Computer”. That’s the Intel Family of Pentiums et al. VERY complicated machines that go very fast but also do a lot of things at each step.
Some of the computers folks use are RISC “Reduced Instruction Set Computer”. They have a simpler CPU core and do a more limited set of things. BUT, while doing it, use a lot less silicon, a lot less power, and can get much more done with a whole lot less resources.
That’s why an Intel CPU costs $300 while an ARM CPU runs about 50¢, yet both are 32 bit (or now, more often, 64 bit) machines. Mostly the gain to the CISC CPU comes from instruction overlap in time (it starts the next instruction before the last one is even 1/2 done – called ‘pipelined instructions’) and from complex instructions that, even if rarely used, are available and complete in ‘one clock cycle’ where the RISC machine might need 4 instructions and 4 clock cycles to do the same thing. IFF you NEED that instruction, CISC matters.
Lately, even the RISC machines have been getting pipelines and more complex instruction sets as the cost of silicon continues to plunge per compute… to the point where some of the ARM chips are not what I’d call “reduced instruction set”… but not at the level of the Intel chip.
This matters: One Big CPU can often do a lot more than even a dozen small cheap CPUs. A half dozen Raspberry Pis in a Beowulf Cluster will NOT beat a newer Intel based home computer.
The others are well explained here:
SISD (Single instruction stream, single data stream)
A sequential computer which exploits no parallelism in either the instruction or data streams. Single control unit (CU) fetches single instruction stream (IS) from memory. The CU then generates appropriate control signals to direct single processing element (PE) to operate on single data stream (DS) i.e. one operation at a time.
Examples of SISD architecture are the traditional uniprocessor machines like a PC (currently manufactured PCs have multiple cores) or old mainframes.
SIMD (Single instruction stream, multiple data streams)
A computer which exploits multiple data streams against a single instruction stream to perform operations which may be naturally parallelized. For example, an array processor or GPU.
MISD (Multiple instruction streams, single data stream)
Multiple instructions operate on a single data stream. Uncommon architecture which is generally used for fault tolerance. Heterogeneous systems operate on the same data stream and must agree on the result. Examples include the Space Shuttle flight control computer.
MIMD (Multiple instruction streams, multiple data streams)
Multiple autonomous processors simultaneously executing different instructions on different data. Distributed systems are generally recognized to be MIMD architectures; either exploiting a single shared memory space or a distributed memory space. A multi-core superscalar processor is a MIMD processor.
Why does this matter? Because it determines what kind of machine can find an answer to what kind of problem the fastest and cheapest.
A SISD is the standard computer of the 1960s to 1990s. One Big CPU doing one thing at a time. Since then, we’ve layered in ever more “parallel processing” (so it can “walk and chew gum” better) as silicon speed started to hit a wall. Today, even your Intel CPU in your desktop has “multiple threads” it can run at once and is overlapping execution of instructions in pipelines. Things only “supercomputers” did in the 1960s and even into the ’70s. So SISD is also under erosion as a pure entity these days.
SIMD is what “My Cray” did that the other mainframes of the era did not. It had an attached “Vector Processor” that could take a block of 64 numbers (one full vector register’s worth), multiply it by another block of 64 numbers, and put the product back into a third block of 64 numbers, all with ONE instruction, streaming out results at one per clock cycle. Heady stuff “in the day”. Today rather tame.
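In plain C, one “vector instruction” worth of that work looks like the loop below. (The function name and VLEN constant are my illustration, not Cray syntax.)

```c
#include <stddef.h>

/* One Cray-style "vector instruction" worth of work: multiply two
   64-element blocks element by element into a third block.  On the
   X-MP this whole loop was ONE hardware instruction; today a
   vectorizing compiler (gcc -O3, for instance) turns it into SIMD
   instructions for your CPU, or you hand it to a GPU. */
#define VLEN 64   /* the X-MP vector register length */

void vmul64(const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < VLEN; i++)
        c[i] = a[i] * b[i];
}
```

The point of SIMD is that the hardware sees one instruction, not 64 of them, so the fetch/decode overhead is paid once per block instead of once per number.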
Today that is done with a GPU or Graphics Processing Unit. Originally these were dedicated Vector Processors added to PCs to make the graphics go faster than glacially slow. (That was when your screen got color, and movement ;-) Today, folks have figured out they have a LOT of Compute Power in them that is ignored or “wasted” on driving Quake and other battle games into glorious full color 24 bit deep with 60 frames / second of motion. At least, “wasted” from the point of view of someone wanting to run a climate model on his PC…
IF you don’t specially code your programs to use that GPU, it sits idle. You are only using your SISD CPU and ignoring your SIMD GPU in terms of performance. This Matters. Rather a lot.
As the quote notes, MISD is rather unimportant for modeling. So we move on to MIMD. That’s the land of the Beowulf Cluster or COW (Cluster Of Workstations) and “Grid Computing”. You can take any old collection of computers and “wire them up” with networking and spread multiple copies of programs over them to run at the same time.
In about 1994 I made one of these “just for fun” out of 8 old “White Box PCs” that I’d collected from “junk”. This is the basis for programs like SETI At Home and others. It only works well on problems that are “embarrassingly parallel” and where data communications is dinky compared to compute time. So a great technology, but limited applications. “Grid Computing” is just extending this model to a collection of compute facilities spread between many institutions; so for example, Stanford and Berkeley “Cal” can combine their compute resources and get more total computes on a problem. Add in UCLA, UC Davis, and San Jose State and you have a whole lot of “cooperative computing” ( IFF you can get them to stop arguing about Football ;-)
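The “embarrassingly parallel” idea can be sketched in a few lines of C: carve a big range of work into independent chunks, hand each chunk to a “node”, and only combine at the end. (Names here are hypothetical; a real Beowulf or grid would scatter the chunks over the network with MPI or similar, but the decomposition logic is the same.)

```c
#include <stddef.h>

/* "Embarrassingly parallel" decomposition: each node computes its
   chunk with NO communication, then results are combined once at
   the end.  Here the "nodes" are just loop iterations. */
double node_work(size_t lo, size_t hi)       /* one node's piece */
{
    double sum = 0.0;
    for (size_t i = lo; i < hi; i++)
        sum += (double)i;                    /* stand-in for real work */
    return sum;
}

double cluster_run(size_t total, size_t nodes)
{
    double grand = 0.0;
    size_t chunk = total / nodes;
    for (size_t n = 0; n < nodes; n++) {
        size_t lo = n * chunk;
        size_t hi = (n == nodes - 1) ? total : lo + chunk;
        grand += node_work(lo, hi);          /* the only "communication" */
    }
    return grand;
}
```

Notice the communication cost is one number per node, while the compute cost is the whole chunk. That ratio is exactly why SETI At Home works and why problems with heavy data exchange between chunks don’t.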
The Architecture Drives the Software that Drives the Architecture
Huge numbers of folks are constantly working to improve algorithms and methods. To find ways to make a monolithic problem solvable by decomposition. To find ways to make a single threaded task multi-threaded.
My Brother-in-law was a Ph.D. researcher in aeronautics at NASA while I was running the Supercomputer Center for Apple. “We talked”. He had a great little chart that I wish I could find. It showed the improvement in computes (log scale) from Moore’s Law as a straight line at about 40 degrees slope. Above it was another line, at about 50 degrees slope. “Improvement from improved algorithms”… Yes, software development in aerospace had improved total results faster than Moore’s Law.
(IMHO, this also shows why Microsoft Software has remained about the same level of crap the last 30 years. BAD software can consume ALL of Moore’s Law gains and then some, leaving no perceptible change to the user… Another part of why I like Linux. It doesn’t suffer such decay nearly as fast.)
Over time, the race between RISC and CISC ebbs and flows. Sometimes one is favored, sometimes another. In the ’80s it was RISC. Now it is CISC. Later perhaps RISC again. Similarly the power of each machine changes, so MIMD vs SISD and SIMD wanders, and as folks find ways to reduce data communications and make some parts more compute intensive, the optimal choice changes for given problems.
Why does this matter?
Because I’m going to show 3 different approaches to parallel computing at different MIPS / MFLOPS rankings and folks tend to think “Ooooh! I’ll take the BIG number”… but that doesn’t do you any good if you don’t have the right kind of architecture for your problem.
One real world example: When at Apple, we had simulated a ‘hot chip’ (that eventually became the PowerPC architecture too late for our particular program) and on it ran a simulated OS and on that a simulated PC. Yes, we made the Cray a 400 MFLOPs “Personal Cray”. A Mac was used for mouse encoding as we couldn’t find a place to plug it into the Cray ;-) (Yes, we put a mouse on the Cray… and a Gigabit full motion 3D display. All in about 1990.) We were several $Millions into making the next Killer Machine (think of a quad CPU box with 400 MFLOPs in it, 3D, full motion animation and an under $10,000 price point in about 1992. It would have sunk Sun, SGI, etc.)
BUT, we needed to “reduce to silicon”. That’s a scalar process, not a vector process. The Cray was absolutely lousy at doing that. We made a “pitch” to the Board for a $1 Million Scalar Engine (i.e. mainframe) to do the “reduce to silicon”. After which we could fab and build and sell. They had (IMHO) an Idiot on board (Al Alcorn) who mostly just killed things. He had rejected the Apple 1 when at H.P. and the Steves had offered it to H.P. How he ever got to be an Apple Fellow is beyond me. He was assigned to “investigate” and reported that the project ought to be killed. Thus we threw away about $100 Million worth of work because some folks could not understand why a $40 Million Vector machine was not as good as a $1 Million Scalar machine at laying out silicon.
I’m sure there was more to it than that, but the simple fact was that had we done the ‘reduce to silicon’ then, we would have been about 3 to 4 years ahead of anyone else in ‘time to market’ with a machine that blew the doors off everything else and at a price that was phenomenal. About the same as high end gaming stations you didn’t see around for another decade. So yes, in the Real World ™ this stuff matters. Rather a lot.
In short, making a Damn Fast Vector Teraflops will not do you much good if YOUR problem is a Scalar SISD one.
So when you find yourself salivating over some Amazing MFlops numbers, remember that you MUST ask if it is suited to your task.
This also bleeds over into “What Language”? Not all systems run all computer languages in all kinds of processors. So if you want to run a FORTRAN Climate Model on a system that only understands “C”, you are in for a world of hurt. More on that down below.
Just remember that “Language Choice” is a part of the overall architecture of the problem set. Not all languages work well for all kinds of problems, or for all kinds of hardware choices.
From what I’ve seen, most of the Climate Codes are written in FORTRAN. This is actually a reasonable choice. Despite all the carping from folks who learned C first and have limited exposure to other compute paradigms, and despite the folks who know a dozen languages, all of them trendy and new and shiny: FORTRAN is exceptionally good at doing math in a way that isn’t hard to learn, and at sucking in big blocks of data and spitting out big blocks of results.
I’ve sometimes spent hours just fighting C enough to read in a fixed format data file. Something that in FORTRAN takes about a minute. And no, the answer is NOT to put it all into a database and be “more modern”. Often a large flat file is THE fastest way to feed a whole lot of massive data into a compute engine and just as often a large flat file output is THE fastest way to get it back out. It does little good to have my 1 TFLOP compute engine starved as it waits for a 1 MIPS (or 100 MIPS) Scalar Database Engine to do disk seeks. The world of “Big Iron” has a very different set of needs than your typical business application doing random reads and writes and DB queries for the quarterly report. (I’ve done both, and know both worlds.)
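For a concrete taste: here’s roughly what it takes in C to pull apart one fixed-width record. (The record layout is hypothetical, invented for illustration.)

```c
#include <stdlib.h>
#include <string.h>

/* Parse one fixed-width record, the kind FORTRAN reads with a single
   FORMAT statement.  Hypothetical layout for illustration:
   cols 1-4 = year, cols 5-6 = month, cols 7-12 = temperature in
   hundredths of a degree (GIStemp-style integer scaling). */
int parse_record(const char *line, int *year, int *month, double *temp)
{
    char buf[8];

    memcpy(buf, line, 4);     buf[4] = '\0';  *year  = atoi(buf);
    memcpy(buf, line + 4, 2); buf[2] = '\0';  *month = atoi(buf);
    memcpy(buf, line + 6, 6); buf[6] = '\0';  *temp  = atoi(buf) / 100.0;
    return 0;
}
```

The FORTRAN equivalent of all that buffer-slicing is roughly one line: READ(unit, '(I4,I2,I6)') YEAR, MONTH, ITEMP. That’s the gap I’m talking about.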
Besides: “It is what it is. -Paul the Mercedes Mechanic”
The codes are a ‘done deal’, so pining for them to be rewritten in C, C++, C#, Python, Ruby, Ruby on Rails, Objective C, Perl, etc. etc. is just a waste of time and mindshare. It also will not typically speed them up or improve them. Object Oriented is a fine way to write some things, but often causes a LOT of code bloat, is often slower, is harder to optimize and has “issues” about matching it to the Vector Hardware. To use vector hardware is often to be working in the land of assembly (yes, still…) and always very close to the hardware even if using C or FORTRAN. More on that in the examples below. So it’s a LOT easier to add an OpenCL compiler directive to some existing FORTRAN than to rewrite a Million lines of FORTRAN into Haskell… even if you can find someone who knows it…
Why pick on Haskell?
The language has an open, published specification, and multiple implementations exist. The main implementation of Haskell, GHC, is both an interpreter and native-code compiler that runs on most platforms. GHC is noted for its high-performance implementation of concurrency and parallelism, and for having a rich type system incorporating recent innovations such as generalized algebraic data types and type families.
It has some cool concurrency and parallel execution bits in it… but useless if you have a decade of “porting” to get to them…
So I’m going to make comments about 2 languages while looking at ‘options’. One is “How well does it handle FORTRAN?” and the other is “What’s it got for C?”. The first, as that’s likely the most important for the existing body of old model code. The second, as that is much more ‘general purpose’ and likely of interest to most folks for “new things”.
The choices I’m going to look at, in order, are:
Raspberry Pi Beowulf (with honorable mention of using the Video Processor as a compute engine).
Parallella Board – a 16-core Epiphany coprocessor with an ARM head end.
NVIDIA Jetson and related.
But NOT just quoting how many MFLOPS each gives you. I’m also going to look at the software limitations that come into play and what class of problem these things are “good for”. The good, the bad, and the ugly. Oh, and a mention of price. ALL of these are under $200 for the basic guts. A Beowulf Cluster can grow to any number of nodes you want, so there is no upper bound on price, it’s all about $/compute. It is also very reasonable to mix and match these things, so you can make a Beowulf Cluster out of NVIDIA Jetson boards if you like… and if it suits your problem set.
The Raspberry Pi
Cost / unit: From recent price cut announcements, list:
    Pi Model B+: $25
    Pi Model Zero: $5

Memory:
    Pi Model B+: 500 MB
    Pi Model Zero: 1000 MB

Watts / unit: From the wiki:
    Pi Model B+: 3 W
    Pi Model Zero: 0.8 W
    (1.8 A @ 5 Vdc = 9 Watts, max., to attached devices.)

Cost / usable unit:
    Pi Model B+: $35
    Pi Model Zero: $20
    Allowing about $5 for power supply and $5 for the SD Card, plus about $5 for a “micro USB to ethernet dongle” for the Pi Zero.

GFLOPS / unit: From here on, it’s only for the Pi B+. Why? The Zero is “unobtainable” at the moment. It has no ethernet, so you get into the land of add-on dongles. It’s easy enough to “do the math” as the chip is 10/7 as fast and the cost is 20/35 as much, so just uprate the price / performance ratios by about 2.5. The software issues down below sort of kill it anyway. But were I going to make a small play cluster, and IFF the Zero is ever in stock, 8 of them would be the fun way to go. (Also from the Wiki:)

“While operating at 700 MHz by default, the first generation Raspberry Pi provided a real world performance roughly equivalent to 0.041 GFLOPS. [...] The GPU provides 1 Gpixel/s or 1.5 Gtexel/s of graphics processing or 24 GFLOPS of general purpose computing performance. [...] The LINPACK single node compute benchmark results in a mean single precision performance of 0.065 GFLOPS and a mean double precision performance of 0.041 GFLOPS for one Raspberry Pi Model-B board.”

So you’ve got about (generously assuming not double precision) 0.065 GFLOPS per Pi B+ board. There “are issues” using that video performance for general purpose computing, so I give 2 “performance / $” numbers, one with it and one without. The “without” is the real number for anything other than playing at the moment.

Cost / GFLOP:
    $35 / 65 MFLOPS = $0.54 / MFLOP, or $538 / GFLOP, without the GPU.
    $35 / 24.065 GFLOPS = $1.45 / GFLOP (if only the GPU could be effectively used).

GFLOPS / Watt:
    0.065 / 3 = 0.0216
    24.065 / 3 = 8.02

Cost for a 1 TFLOP system: 1 TFLOP = 1000 GFLOPS, so:
    1000 / 0.065 = 15,385 boards / TFLOP. $35 x 15,385 = $538,475. Clearly only good for ‘toy’ systems.
    1000 / 24.065 = 42 boards. $35 x 42 = $1470. Very approachable, IF you can do GPGPU processing.

Best problem sets: Unfortunately, since the ethernet is only 100 Mb/second and the GPU is a Royal Pain to use, it’s mostly only usable as a toy system for learning about how to do parallel processing.

Language and Other Issues: Aye, now there’s the rub. It’s nearly impossible to get to the GPU, and it isn’t available to things like FORTRAN.
This link has a very nice exposition on it.
Step Zero: Build An Assembler
Herman Hermitage has done some excellent reverse engineering work (before the documentation was released) and has written a QPU (the name of a “core” in the GPU) assembler that you can get from github:
git clone https://github.com/hermanhermitage/videocoreiv-qpu
Unfortunately, I was unaware of this when I began, so I wrote my own and, for better or for worse, that’s what the code is written in. The assembler is pretty rough, has lots of quirks and bugs and supports only what I needed to implement this algorithm, but it has the advantage (only to me, I suppose) that I know exactly how it works and what code it will produce. If you’d prefer to use Herman’s assembler which is probably more sane and friendly, you can assemble the code with mine, then disassemble it with Herman’s disassembler which (should?) allow you to reassemble it with his assembler.
In addition, when we get to the loop unrolling and register substitutions, we’ll start using m4 macros in the assembly source. M4 is pretty simple while being quite powerful, easy to read and write and it comes installed on pretty much all Linux systems. Instead of trying to introduce all the syntax for the assembler and the macros right here (and risk losing all my readers), I’ll try to introduce the syntax as we need it.
In the case of GPGPU programming (especially without a host library like OpenCL), this can be much more involved as there may be a chunk of host-side code as well that needs to be written to initialize the GPU, map memory, configure the parameters, etc … In the case of our QPUs, it’s even worse because to actually see anything come back from the QPU, we have to dive into the VPM and VCD and DMA-ing things back to the host. Oh well, such is life. Let’s get started.
Yeah… write an assembler… or maybe download and port one. THEN you can start writing the assembly language to build your environment to actually do something…
Basically, with the Raspberry Pi you have a single threaded RISC CPU and a nearly unavailable GPU (at least for General Purpose use as a GP-GPU compute engine).
Currently, I am having a hard time to discover what the problem with my multithreading C program on the RPi is. I have written an application relying on two pthreads, one of them reading data from a gps device and writing it to a text file and the second one is doing exactly the same but with a temperature sensor. On my laptop (Intel® Core™ i3-380M, 2.53GHz) I am have the program nicely working and writing to my files up to the frequencies at which both of the devices send information (10 Hz and 500 Hz respectively).
The real problem emerges when I compile and execute my C program to run on the RPi; The performance of my program running on RPi considerably decreases, having my GPS log file written with a frequency of 3 Hz and the temperature log file at a frequency of 17 Hz (17 measurements written per second)..
I do not really know why I am getting those performance problems with my code running on the PI. Is it because of the RPi has only a 700 MHz ARM Processor and it can not process such a Multithreaded application? Or is it because my two threads routines are perturbing the nice work normally carried out by the PI? Thanks a lot in advance Guys….!!!
Here my code. I am posting just one thread function because I tested the performance with just one thread and it is still writing at a very low frequency (~4 Hz). At first, the main function:
Just doing “threads” on the R.Pi “has issues”. As it is a single thread processor, not a big surprise. You take ‘context switches’ for no real benefit.
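For reference, the shape of such a two-thread program is simple enough. A minimal pthreads sketch (my own toy example, not the poster’s code), with each thread doing an independent chore:

```c
#include <pthread.h>

/* Minimal two-pthread skeleton, same shape as the poster's logger:
   each thread does its own independent chore, main() joins both.
   On a single-core Pi the two threads just take turns on the one
   CPU -- you pay for the context switches and gain no throughput. */
typedef struct { long n; long sum; } job_t;

static void *worker(void *arg)
{
    job_t *j = (job_t *)arg;
    for (long i = 0; i < j->n; i++)
        j->sum += i;            /* stand-in for "read sensor, log it" */
    return NULL;
}

long run_two_threads(long n1, long n2)
{
    pthread_t t1, t2;
    job_t a = { n1, 0 }, b = { n2, 0 };

    pthread_create(&t1, NULL, worker, &a);
    pthread_create(&t2, NULL, worker, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return a.sum + b.sum;
}
```

On a multi-core Intel box the two workers really do run at once; on the single-core Pi B+ the scheduler just interleaves them, which is exactly the “no real benefit” the forum poster ran into.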
So while it’s a very cheap toy for learning, it is NOT suitable for any “real work” as the effort to port things to it is going to kill you and any attempt to use the GPU for GPGPU is going to suck you dry writing assembler. Maybe in a few years IFF someone else gets an OpenCL environment running or “whatever”…
OpenCL on the Raspberry PI 2
OpenCL can be enabled on the Raspberry PI 2! However, you’ll be disappointed to know that I’m referring to the utilization of its CPU, not GPU. Nevertheless, running OpenCL on the PI could be useful for development and experimentation on an embedded platform.
You’ll need the POCL implementation (Portable OpenCL) which relies on the LLVM. I used the just released v0.12 of POCL and the Raspbian Jessie supplied LLVM v.3.5.
After compiling and installing POCL with the natural procedure (you might need to install some libraries from the raspbian repositories, e.g. libhwloc-dev, libclang-dev or mesa-common-dev) you’ll be able to compile OpenCL programs on the PI. I tested the clpeak benchmark program but the compute results were rather poor:
So now you know why I went through all the lead in stuff about Vector Units and Scalar Units and GPUs and SIMD vs SISD and all. Without that, you could read that OpenCL was running on the R.PiM2 and think “Great! 24 GFLOPS here I come!”. With it, you realize that this is just a port that lets you syntax check on the SISD CPU and doesn’t actually get to those 24 GFLOPS on the GPU.
The software available limits the problem set that can be addressed and further limits how much of the hardware can be applied to it.
Still, for non-compute intensive things where you want to play with distributed COW / GRID configuration and / or making several monitors work together as One Big Screen, you might find some “fun” here. Just don’t expect it to do anything with Climate Codes beyond GIStemp. (GIStemp is old and stupid enough that it doesn’t take much CPU at all to make it go and I’ve ported it to the Raspberry Pi where it has plenty of computer power…)
The Parallella Board
Cost / unit: $99

Watts / unit: Finding this took some looking. From https://www.linux.com/news/enterprise/systems-management/692990-introducing-the-99-linux-supercomputer

“The 16-core Epiphany chip delivers 26 GFLOPS of performance and with the entire Parallella computer consuming only 5 watts, making it possible to prototype compute-intensive applications with mobile device power budgets or equally to construct energy-efficient HPC clusters.”

So we’ve got 5 Watts.

Cost / usable unit: I’d make this about $120, assuming some shipping, a power supply, and likely a couple of other “bits”. It might be as low as $105. I’ll use $110 as a reasonable guess.

GFLOPS / unit: 26 GFLOPS

Cost / GFLOP: $110 / 26 = $4.23. Not bad at all.

GFLOPS / Watt: 26 / 5 = 5.2 GFLOPS / Watt. Very nice.

Cost for a 1 TFLOP system: 1000 / 26 = 39 units. 39 x $110 = $4290. A very attainable price, if a bit pricey for “home gamers”. More “pricey” than the R.Pi IFF you could actually get to the GPU on the R.Pi in any sort of usable way.

Best problem sets: It has OpenCL, so generally usable for parallel problems. The “cores” have a matrix network (“grid”) on the chip and not a lot of memory each, so things that move a lot of data will tend to limit on memory bandwidth. http://elinux.org/Parallella indicates good software availability.

Language and Other Issues: However… for those FORTRAN Climate Codes: https://parallella.org/forums/viewtopic.php?f=36&t=485
Looks like FORTRAN is a “roll your own” from the repositories.
Re: Building GCC for Fortran too..
Postby madtom1999 » Wed Jul 31, 2013 7:37 am
Cheers – its line 139 in mine – I gitted yesterday!
I apologise – I’m using new tools and didnt find that by my own search. These modern new fangled computers….
Re: Building GCC for Fortran too..
Postby theover » Wed Jul 31, 2013 11:13 am
I did’t yet built anything, because on my general use machine I needed to compile 3 math libraries and I didn’t feel like that yet, but I’d like to express my interest in Fortran, too, but for a different purpose: Maxima.
I hope XMaxima can be put on the ARM, and would be interested in running Fortran (which can be created from Maxima’s symbolic formulas) compiled parallel programs on the Epiphany, for instance for rendering Audio waves.
Re: Building GCC for Fortran too..
Postby ysapir » Wed Jul 31, 2013 11:31 am
One note, though – if using the e-lib in your Epiphany code, remember that the library is a C library and thus has C linkage. You may need to tell FORTRAN to take care of that in your program’s header files.
Re: Building GCC for Fortran too..
Postby dar » Sat Dec 14, 2013 12:43 pm
Just a comment on Fortran. You can target Epiphany with the STDCL API which has Fortran bindings. At present your kernels would still need to be written in C, but at least integration with your host code would be simplified this way.
The “kernels” are the bits that get handed over to the array of cores to execute. So any chunks of code you want to spit out to that array of cores is going to be in C. Period.
Happy porting. See you in a decade or two. /sarc;>
So a very nice solution for folks wanting to work in C, not so good for folks wanting to port FORTRAN Climate Codes.
So far my “dive” has shown me that making a small R.Pi cluster could be fun, but isn’t going to do anything for a personal Climate Code Engine, and that the Parallella is a nice bit of kit for folks doing C; me, not so much. I figure I’ve saved about $3000 so far in ‘exploration costs’… and likely weeks of failed porting / hacking attempts. (Hey, it’s what Managers do… look at these things BEFORE approving the P.O. ;-)
The NVIDIA Jetson
There are two of these. The older K1 and the newer X1. The X1 is $600, more or less. I’m mostly going to give info for the K1 as it is more in my price range and more comparable to the other “solutions”, but with some minor comments on the X1. These are “development boards” that have a ‘head end’ processor on them, and then a bunch of ‘cores’ for the actual math. About the same performance for the head end as for the other boards above; it’s the back end that’s the fun bit ;-) One major note: the X1 does 1 TeraFLOP all on its own. So that’s the basic benchmark we’re looking at. $600 for your own personal TeraFLOP.
NVIDIA Jetson TX1 Development Kit Proprietary DDR4 Motherboards 945-82371-0000-000
Price: $599.99 & FREE Shipping.
Temporarily out of stock.
Order now and we’ll deliver when available. We’ll e-mail you with an estimated delivery date as soon as we have more information. Your account will only be charged when we ship the item.
Ships from and sold by Amazon.com.
256 CUDA Cores
1 TFLOPs (FP16) Peak Perf
Complete Jetson SDK available from developer.nvidia.com/embedded-computing
So “there you have it”. Now if the software available is adequate…
Cost / unit: About $195 for the K1 and about $600 for the X1. Here’s a Spec PDF.

Watts / unit: From this nice review of the different models, K1: 11 Watts.

Cost / usable unit: I’d guess about $200 with PSU.

GFLOPS / unit: K1: 300 (depending on how measured, 290 to ~333)

Cost / GFLOP: (Range for the K1) $200 / 290 = 69 cents / GFLOP (!) $200 / 333 = 60 cents / GFLOP (!!)

GFLOPS / Watt: (Range for the K1) 290 / 11 = 26. 333 / 11 = 30.

Cost for a 1 TFLOP system: 1000 / 290 = 3.45. 1000 / 333 = 3.003. So between 3 and 4 units. $200 x 3.45 = $690. $200 x 3.003 = $600.60 (or about the same as the X1). So I’d rather buy 3 of these than 1 of the X1, as I’m doing “variable stuff” and it would let me start out smaller and add as needed. For heavy duty crunch, the “on one board” performance will be better without the inter-board latency if working on Just One Problem…

Best problem sets: Generally looks well suited to just about any “math heavy” problem with parallel execution opportunities. Things with many large “do loops” without one result being dependent on the prior result. In climate models, many “grid / boxes” of iteration where each box depends on the PRIOR state of the box or nearby boxes, but not the presently changed state of the box next to it. (So “compute the globe” then “iterate the globe”, not things like “compute cell one, then use that to compute cell two, then use that to compute cell three”… Though there ought to be ways to ‘unwind’ that with some work.)

Language and Other Issues: Only that CUDA is what is used for programming and it is all C oriented. But see below for more about “options”.
A forum thread has an interesting observation about the board, and potential ‘overclock’ performance, or perhaps a different board:
The last 2 pics show Jetson.. with a heatsink and a small fan.
The size of that heatsink.. K1 is probably running at least 20-30W+
I guess Tegra K1 when clocked conservatively will fit into a tablet, but in order to do all this automotive stuff, It is being clocked higher..
Nvidia states 326 GFLOP’s, but if I remember well, the GTC Audi guy said around 500 GFLOP’s were required for vehicle and pedestrian detection.
The device on the left looks to be Jetson Pro, not Jetson TK1. Jetson Pro is a Tegra 3 board specifically designed for use with a discrete GPU and intended for automotive markets.
This is Jetson TK1: http://images.anandtech.com/doci/7905/Jetson_TK1.jpg
Actually, that’s more of an artistic mockup which leaves out the cooler. The Jetson K1 looks like this:
Well yeah. My point is that the real TK1 is a complete board in and of itself. It’s not a card like in the original picture.
The card was a G-Sync board, not Tegra And the dev-platform in that article was actually the same one used to demo the K1 at CES: http://semiaccurate.com/2014/01/20/w…eally-perform/
Perhaps the difference in cooler size is down to the difference between K1-with-A15s and K1-with-Denver? (Either way, the board in the article isn’t the same as the Jetson K1 they just launched. It may just be different because it’s a prototype.)
The point being that there are more options to explore here. But we’ve set our “Baseline” at $600 / TeraFLOP. Not bad. Not bad at all.
On to programming the beast.
Now the Tesla is NOT the same as the Jetson, but it will be programmed similarly, as CUDA is the method NVIDIA uses for their GPGPU programming. It is a “card” that you slot into a chassis. Some of the configurations listed in The Wiki have a TeraFLOP rating on some of the benchmarks. Others are nearer 345 GFLOPS, but then you get into “modules” packaging a lot of them and it’s 8 TFLOP time. Clearly NVIDIA is going big time into the Massive Computes space.
Their C1060 “compute module” lists as 933 GFLOPS (single precision) in the wiki, and Amazon has it for about $450. A lower price per compute than the Development board, but you need to BYOC – Bring Your Own Computer / Chassis…
Tesla Gpu C1060 Retail Box for End Customers
Enables the transition to energy efficient parallel computing power
Brings the performance of a small cluster to a workstation.
Dedicated computing resource at their desk-side
Much faster and more energy-efficient
Based on the massively parallel, many-core Tesla processor
1 refurbished from $450.00
Though one wonders how this is “refurbished”. For $19 more you can get a new one “by HP” (yet sold by “Network Systems Resale”…):
Nvidia Tesla C1060 Compute Processor
Price: $468.75 & FREE Shipping
Only 3 left in stock.
Ships from and sold by Network Systems Resale.
But back at the Mines and Programming thing. A couple of key quotes show that it is relatively easy to program, at least as compared to the others.
Ⅱ. Basic concepts of NVIDIA GPU and CUDA programming
CUDA is a high level language. NVIDIA is committed to supporting CUDA as hardware changes. Hardware is projected to change radically in the future. Primarily, the processor count may go from hundreds to tens of thousands. Program algorithm, architecture and source code can remain largely unchanged. Increase problem size to use more processors. Increase a 3D grid by a factor of 5 to go from hundreds to tens of thousands of processors.
CUDA is C with a few straight forward extensions. The extensions to the C programming language are four-fold: Function type qualifiers to specify whether a function executes on the host or on the device and whether it is callable from the host or from the device __global__, __device__, __host__. Variable type qualifiers to specify the memory location of a variable __device__, __shared__. A new directive to specify how a kernel is executed on the device from the host.
Four built-in variables that specify the grid and block dimensions and the block and thread indices – gridDim, blockIdx, blockDim, threadIdx. NVIDIA promises to support CUDA for the foreseeable future. CUDA encapsulates the hardware model, so you don’t have to worry about hardware model changes, with all the conveniences of C vs assembly. Learning the hardware and developing parallel algorithms is still difficult. But the infrastructure for writing, developing, debugging and maintaining source code is straightforward and similar to conventional serial programming.
Steps in a CUDA code
Host Code (xx.cpp and xx.cu):
Initialize/acquire device (GPU)
Allocate memory on GPU
Copy data from host to GPU
Execute kernel on GPU
Copy data from GPU to host
Deallocate memory on GPU
Run “Gold” version on host
Kernel Code (xx_kernel.cu):
A kernel is a function callable from the host and executed on the CUDA device — simultaneously by many threads in parallel. How to call a kernel involves specifying the name of the kernel plus an execution configuration. An execution configuration just means defining the number of parallel threads in a group and the number of groups to use when running the kernel for the CUDA device.
Nvidia CUDA Programming Basics:
The Programming model
The Memory model
CUDA API basics
CUDA Programming Model:
The GPU is seen as a compute device to execute a portion of an application, a function for example, that:
Has to be executed many times;
Can be isolated as a function;
Works independently on different data.
Such a function can be compiled to run on the device. The resulting program is called a Kernel. The batch of threads that executes a kernel is organized as a grid of thread blocks.
All well and good, but about that whole FORTRAN thing…
Waaayyy down the page, after a LOT of using C, we find FORTRAN:
The CUBLAS Library
CUBLAS is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA CUDA (compute unified device architecture) driver. It allows access to the computational resources of NVIDIA GPUs. The library is self-contained at the API level, that is, no direct interaction with the CUDA driver is necessary. CUBLAS attaches to a single GPU and does not auto-parallelize across multiple GPUs.
The basic model by which applications use the CUBLAS library is to create matrix and vector objects in GPU memory space, fill them with data, call a sequence of CUBLAS functions, and, finally, upload the results from GPU memory space back to the host. To accomplish this, CUBLAS provides helper functions for creating and destroying objects in GPU space, and for writing data to and retrieving data from these objects.
For maximum compatibility with existing Fortran environments, CUBLAS uses column-major storage and 1-based indexing. Since C and C++ use row-major storage, applications cannot use the native array semantics for two-dimensional arrays. Instead, macros or inline functions should be defined to implement matrices on top of one-dimensional arrays.
Header and library files
CUBLAS consists of these files below. An emulation library file is used for developing on non-TESLA capable environment by emulation. Do not use it when you want real computing by TESLA on TSUBAME.
* C header file: cublas.h
* CUBLAS library: libcublas.so / libcublasemu.so (for emulation)
An example is shown in the last CUDA sample implementing Matrix-Vector Multiplication.
CULA contains a “CULAPACK” interface that is comprised of over 150 mathematical routines from the industry standard for computational linear algebra, LAPACK. Our CULA library includes many popular routines including system solvers, least squares solvers, orthogonal factorizations, eigenvalue routines, and singular value decompositions.
This link shows the complete list of routines CULA contains and their performance.
CULA offers performance up to an order of magnitude faster than optimized CPU-based linear algebra solvers.
CULA is available in a variety of different interfaces to integrate directly into your existing code. Programmers can easily call GPU-accelerated CULA from their C/C++, FORTRAN, MATLAB, or Python codes. This can all be done with no GPU programming experience.
So what’s a CUBLAS?
The NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library is a GPU-accelerated version of the complete standard BLAS library that delivers 6x to 17x faster performance than the latest MKL BLAS.
cuBLAS-XT is a set of routines which further accelerates Level 3 BLAS calls by spreading work across multiple GPUs connected to the same motherboard, with near-perfect scaling as more GPUs are added. By using a streaming design, cuBLAS-XT efficiently manages transfers across the PCI-Express bus automatically, which allows input and output data to be stored on the host’s system memory. This provides out-of-core operation – the size of operand data is only limited by system memory size, not by GPU on-board memory size. cuBLAS-XT is included with CUDA 7 Toolkit and no additional license is required.
So the vendor has realized the utility of making tools available to do all those things FORTRAN folks like to do without them needing to do it all over themselves…
So that’s what I’ve been up to for the last few days. Aren’t you glad you can get it in an hour instead of days? ;-)
So the cost of a TeraFlop is about $450 “naked” or $600 as a “development board” with the Head End processor already in place. It can be had with nice tools for FORTRAN, and great ones for C programming. It doesn’t take much space, or much power.
I don’t need to buy 64 Raspberry Pi boards, even the “Zero”, to make a home cluster, as that would run about $1280, minimum, and for that, I can get 2 TeraFLOPS of NVIDIA boards, with a mature development environment, and have $280 left over for coffee and doughnuts…
The K1, at $200 (ish) is not too much more expensive than the Parallella, and gives a WHOLE LOT more bang for those bucks. It doesn’t take much more space or power either. (Though the “cores” are more limited in what they do – basic math and not the general RISC cores of the Parallella, so for non-math problems the Parallella will be better).
For “My Own Private TeraFLOP” computer, the NVIDIA solution is the better one, by far, with vastly less work to get productive on it.
I can start with a single K1 board for “cheap”, and if it doesn’t do what I want, not a big loss.
I need to start paying a LOT more attention to NVIDIA stock ;-)