Parallel, Scripts, Clusters, & Easy Use

So I’ve got this cluster of boards. I’ve got “distcc” set up on 3 of them. Fine. So I can do compiles in parallel. How often do most folks compile big things?…

I’ve got MPICH installed. I can embed calls to parallel message passing IFF I rewrite any code I want to run in a peculiar way… Oh, and it didn’t speed things up much (or at all…) on the Pi. It does offer significant speed up for some classes of problems on some other hardware, though.

But again, what good is it for anything “day to day”?

Most of what I do is systems maintenance in a scripted way, or bulk data movements, file compression or expansion, and the occasional analysis of things with “Unix Tools”. Commands glued together with pipes and such. The ‘find’ command sending a list of file names to some other step. It isn’t “compiling”, and I’m not going to re-write Linux Tools to use MPICH or similar.

So what can be done to make that cluster useful for more generic things?

Well, I was looking into Climate Models that were already written to use parallel computing methods, and ran into something else. GNU ‘parallel’. A simple Unix-like command that works on Linux tools and shell scripts to run them in parallel. It integrates into a pipe-connected set of regular commands, and sends various parts of the work off to other cores or to other computers for execution.
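For instance (a sketch of my own; the .log glob and gzip are just placeholder examples), a shell loop over files becomes one line, with one job per core:

```shell
# Compress every .log file in the current directory, one gzip job per core:
ls *.log | parallel gzip

# Arguments can also be supplied inline with the ::: notation:
parallel echo "processing job {}" ::: 1 2 3
```

By default it runs one job per CPU core and queues the rest, so nothing gets overloaded.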

As near as I can tell, it’s been around about a decade. Guess I was doing other things and didn’t notice it show up. I checked on the Devuan / Debian system to see if the repository knew about this package, and it rapidly installed:

root@odroidxu4:/# apt-get install parallel
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  parallel
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 194 kB of archives.
After this operation, 639 kB of additional disk space will be used.
Install these packages? [y/N] y
Get:1 jessie/main parallel all 20130922-1 [194 kB]
Fetched 194 kB in 0s (231 kB/s)
Selecting previously unselected package parallel.
(Reading database ... 89307 files and directories currently installed.)
Preparing to unpack .../parallel_20130922-1_all.deb ...
Adding 'diversion of /usr/bin/parallel to /usr/bin/parallel.moreutils by parallel'
Adding 'diversion of /usr/share/man/man1/parallel.1.gz to /usr/share/man/man1/parallel.moreutils.1.gz by parallel'
Unpacking parallel (20130922-1) ...
Processing triggers for man-db ...
Setting up parallel (20130922-1) ...
root@odroidxu4:/# which parallel

So there it is.

I ran into this in some YouTube videos. The guy has a bit of an accent and talks very fast, so the pause button was my friend. There are a few of these, but I’m just going to embed the first one. If you are interested, I’m sure you can find the rest.

It comes with a man page:

PARALLEL(1)                                      parallel                                      PARALLEL(1)

       parallel - build and execute shell command lines from standard input in parallel

       parallel [options] [command [arguments]] < list_of_arguments

       parallel [options] [command [arguments]] ( ::: arguments | :::: argfile(s) ) ...

       parallel --semaphore [options] command

       #!/usr/bin/parallel --shebang [options] [command [arguments]]

       GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can
       be a single command or a small script that has to be run for each of the lines in the input. The
       typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of
       tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input
       into blocks and pipe a block into each command in parallel.

       If you use xargs and tee today you will find GNU parallel very easy to use as GNU parallel is
       written to have the same options as xargs. If you write loops in shell, you will find GNU parallel
       may be able to replace most of the loops and make them run faster by running several jobs in
       parallel.

       GNU parallel makes sure output from the commands is the same output as you would get had you run
       the commands sequentially. This makes it possible to use output from GNU parallel as input for
       other programs.

       For each line of input GNU parallel will execute command with the line as arguments. If no command
       is given, the line of input is executed. Several lines will be run in parallel. GNU parallel can
       often be used as a substitute for xargs or cat | bash.

   Reader's guide
       Before looking at the options you may want to check out the EXAMPLEs after the list of options.
       That will give you an idea of what GNU parallel is capable of.

       You can also watch the intro video for a quick introduction:
       http ://

It has a long list of options and arguments, so there’s a lot of time to be spent reading the whole man page.

In Conclusion

So now I’m motivated to have my cluster powered up and online all the time. Being able to send a listing of all the files in a directory into a compression program, all in parallel, and get the compressed versions back, all with a one line command; that’s interesting to me! Or taking a huge file of temperature data and searching different chunks of it for a particular entry (parallel lets you ‘chunk’ a file into segments and send the processing for each one to a different CPU / SBC).
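That chunked search looks something like this (a sketch; temperature.dat and the search pattern are hypothetical):

```shell
# Split the input into ~10 MB blocks and run a separate grep on each
# block in parallel; adding --sshlogin node1,node2,: would spread the
# blocks over cluster nodes as well as local cores.
cat temperature.dat | parallel --pipe --block 10M grep '1998-07'
```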

In short, it brings parallel processing to all those mundane scripts and housekeeping and data munging tasks that make up 90% of the Systems Admin day.

I haven’t done any comparative performance testing yet, so it might well turn out that with slow shared ethernet, shipping chunks of data off somewhere else for a text search might “cost” more time than just doing it locally. Or perhaps latency of writes to SD cards might be an issue. Or maybe some other quirk of very small systems. I’ll find out.

I’m also certain that, given the command syntax and options, I’ll be putting some scripts-as-commands into my own script command directory just so I don’t have to remember all those options. So a command like “squashem” might list all the files in a directory that are not already compressed then farm out compress jobs to all the known CPUs in the cluster.
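A minimal sketch of what such a “squashem” could look like (all of this is my guess at it, including the list of already-compressed suffixes):

```shell
#!/bin/sh
# squashem - compress every not-already-compressed file in a directory,
# farming one gzip job out per core. (A sketch; an --sshlogin list of
# cluster hosts could spread the jobs over the other boards too.)
dir="${1:-.}"
find "$dir" -maxdepth 1 -type f \
     ! -name '*.gz' ! -name '*.bz2' ! -name '*.xz' \
  | parallel gzip -9 {}
```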

But the simple fact that it installs with just an ‘apt-get’ and is just sitting there, with typically one long command line to launch a load of stuff; that means I’m going to use it. Which means I’m going to use the rest of the machines as a cluster a lot more.
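Spreading that over the cluster really is one long command line (a sketch; pi1 and pi2 are hypothetical node names reachable by passwordless ssh, and ‘:’ means run jobs locally too):

```shell
# Two jobs at a time on each of pi1 and pi2, plus the local machine.
# --trc transfers each input file to the remote node, returns the .gz,
# and cleans up the remote copies afterward.
ls *.txt | parallel --sshlogin 2/pi1,2/pi2,: --trc {}.gz gzip {}
```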



About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present “hot buttons” are the mythology of Climate Change and ancient metrology; but things change...
This entry was posted in Tech Bits.

12 Responses to Parallel, Scripts, Clusters, & Easy Use

  1. p.g.sharrow says:

    @EMSmith; Now that you have settled on a small computer type and OS to build an extended system on, getting a paralleling software to work on them would seem to be the next step in getting a band of small computers to do massive jobs when needed.
    The use of big chips or many cores on one board is a real energy and heat management problem. Really fast, but generally a waste of resources and a headache for heat management. The use of onboard wifi or bluetooth to reduce wiring complications will in time solve the wiring problem. While these small boards are cheap, wiring them together can cost more than the boards.
    There seems to be quite a large number of people working on this same problem so finding them and their works may be more than half of the job. Maybe it is nearing time for a university to be created…pg

  2. E.M.Smith says:


    The problem with a WiFi cluster is bandwidth. A switched hardwire network gives 1 Gb for GigE, per port. A stack of 8 cards has 8 Gb bandwidth for distributing work and data. For harder problems with more data flow, you can add parallel ethernet ports to double or quadruple that. Compare WiFi with one shared channel of 54 Mb for the common gear, and using more exotic stuff about 1 Gb.

    So it works ok for problems with long computes on not much shared data, but has issues on big shared data problems (like compressing files). One of THE big development points for clusters is increasing connection speeds… and improved switch fabrics.

    That said, I’m going to use WiFi in my cluster. Simply because I’ve run out of ports on my router! Eventually I’ll dig out my 8 port 100 Mb switch, then I’ll add the hardwired connections to those on WiFi. The cheap Pi boards are only 100 Mb anyway. I’d need to buy gigE boards for any significant use (and a gigE switch).

    It is one of the truths of HPC (High Performance Computing) that different kinds of problems need different kinds of computers or clusters. You must know the nature of the problem you are working to know what architecture suits it. Some need one huge scalar engine. Others a tight high bandwidth cluster, others (called “embarrassingly parallel”) can run distributed on a collection of home computers over the internet (think SETI or BOINC).

    Per a university:

    HPC methods are researched and taught at universities all over the planet. $ Billions are spent on it. R&D at the NSA alone is huge. It is vital to code breaking and all sorts of modeling. We used our Cray for plastic molding simulations at Apple, saving $ millions on die costs. While it seems rare and exotic to folks outside the field, it is really a common need throughout engineering, big manufacturing, big analytics, and governments; along with all the more common weather prediction and modeling as R&D.

    Many conferences are held each year. Whole university departments work on it. I’m not sure what could be added.

    Where I’m playing now is in the application of those known rules, methods, and software to the incredibly dinky cheap end of the spectrum. That’s where it looks more scattered and chaotic. Mostly because the players mostly don’t have a real HPC background. You see folks claiming to build a supercomputer out of 64 Pi boards, for example. They really mean “dinky cluster”. By definition, a real supercomputer must be in the top few percent of performance globally. But they don’t know that. So yeah, those folks could use a bit more educating, but it’s a full time job for hundreds… not me.

    The Top500 is a fun site that ranks the current state of the art in big iron (i.e. who are the 500 fastest computers) and posts the latest list and ranking changes.

    Note that the first page is all measured in petaflops and has core counts in hundred thousands to 10 million…

    So were I still employed doing daily work in HPC, I’d have kept up on things and known about GNU parallel a couple of years sooner. So in a way, if a university were needed, it would be for me to attend some recent seminars and classes to get current…

    Sidebar on CUDA:

    NVIDIA is known for video boards. Graphics cards are really vector units in old Cray Speak. You have a bunch of very small and fast cores that can do simple math. Lots of problems just need simple math. So a few folks started hand coding programs to use video cards as compute engines. Now, years later, Nvidia has dedicated compute engines based on that tech and a standard language to use them: CUDA.

    One can buy a Jetson board with hundreds of cores for cheap and have a high speed connection fabric between those cores… For real home HPC, that’s the better way to go. It is being used in self driving cars for things like vision… I’d have already bought one, but the fan is reputed to be loud :-)

    $200 for 192 cuda cores and a scalar ARM chip front end.

  3. p.g.sharrow says:

    that Jetson K-1 looks like a real toaster. Back to big power supply and fans.
    Guess it depends on the trade offs you are willing to make to get the job done.
    I barely understand enough to follow your postings on this subject. My 21 year old grandson thinks I’m helplessly incompetent, but then after he leaves I get to fix his work if it fails.
    I have been watching this field progress since the early 1970s. Learning enough to fulfill my needs, but not really competent. For me the computer is a tool, not the tool. I would prefer to work with a general purpose tool rather than a specialty one. My son and grandson are more into gaming and video graphics so they lean to BIG engines…pg

  4. E.M.Smith says:


    Well, for me computers are a tool, and a fun toy. So I buy stuff to play with that I don’t really need ;-)

    The world of computing can be ‘cut’ many ways. One way is as “general purpose” vs “special purpose” where “supercomputer” is a subset of “special purpose”. Pretty much everything sold to the public in stores is a “general purpose” machine. The usual CPU + Memory + programs to do regular old things like web browsers and spreadsheets.

    That is changing. It started with “gamers”.

    Among the special purpose machines are ones that do a lot of simple math in a hurry. Turns out video display, especially high res animation with things like reflections and textures, needs a LOT of that. So “Graphics Processing Units” GPUs got added to computers to make pretty windows displays. Gamers pushed those to the limit and then some, so ever fancier GPUs started to be built into systems. A common one (on the R. Pi IIRC) has 4 cores that can do a fast set of simple math. You give it 4 sets of 2 inputs and it gives back 4 results with a simple math function ( + – x / though some don’t do x and / ) applied to them as sets of 2. (In Cray terms this would be called a ‘stride’ of 4; the Cray had a stride of 64. Give it 2 arrays of 64 numbers each and in one clock it would give back 64 products of them…)

    So at this point you ought to see where this is going… Bigger GPUs with more cores giving bigger ‘strides’ for a lot more math / clock. Gamers loved the higher resolution and more complex images (more shadows, reflections, transparency, textures, etc.) with higher frame rates for better motion.

    Then the Science Nerds and Engineers realized they could get a decent “stride” out of a PC by loading it up with 4 or 6 Nvidia high end gamer GPUs but using that math for work instead of images… So one of the “home hot machines” you can make / buy is a regular PC with some high end GPUs slotted into it. (That’s why the ads ooh and aah over what GPU is in a box… the rest of us don’t care as long as Firefox works ;-)

    What the Jetson does is put a giant GPU (192 cores worth) on the board along with a medium fast ARM chip to act as the “desktop computer” and user interface. They also added a nice language (CUDA) to let you use them without doing obscure and esoteric GPU assembly coding – a giant step forward…

    So yeah, the Jetson is a toaster. All those gates make heat as they run… Many GPUs have their own dedicated fan on them. You can have a PC with a big CPU with its fan, and then 4 GPUs with their own fans, all in the box with the PSU with a fan. 6 fans and we aren’t even ventilating the case yet ;-)

    BUT, if you want a job doing “self driving car maintenance” and / or programming robots, that’s the kit to buy for learning CUDA and GPU programming. Also being used in parts of AI and for various engineering tasks. Some real supercomputers are starting to include banks of GPUs too, since the guys with a dozen PCs loaded with 6 NVIDIA GPUs each were starting to eat into their business ;-)

    Now, will that make your browser faster or your spreadsheet finish sooner? Nope, not at all. Those run on the GP General Purpose CPU. It’s only codes written with CUDA (or similar methods) that use the GPU as a compute engine, and only for very math intensive codes with lots of opportunities for parallel processing.

    Want to make your kids happy? Buy them a large NVIDIA GPU on a plug in for their existing PC (make sure it supports CUDA) and a CUDA programming guide… If they like to tinker outside of games, and are interested in robotics, the Jetson will have them employable in the field in about 6 months of playing with it (assuming they can already program in a language like C or Python).

    In General:

    The field of GP computing covers 90%+ of everything regular folks will ever care about. It is slowly changing as robotics and robotic vision become common outside of labs.

    Special Purpose computers are by definition suited to particular tasks. If you don’t have one of those tasks, they don’t do much for you. In particular, a Jetson without programs written to use CUDA is going to be about as fast as a Pi M3 and slower than an Odroid XU4. But properly programmed, it can drive robots and process vastly more data… The key there being you get to do the “proper programming” ;-)

    There are other kinds of “special purpose” computers, but that’s a long exposition. From “micro-controllers” at the very small end (used to manage a robot elbow, for instance) to VLIW machines (Very Long Instruction Word), to VLSI ASICs and FPGA special purpose compute engines for a dedicated task. (Very Large Scale Integration, Application Specific Integrated Circuit, Field Programmable Gate Array) And many more. BUT, unless you know your problem needs one of them, you don’t need one!

    So take a cluster of workstations (a COW). Unless you have a highly parallel problem that works OK with longer latency on memory transfers, but lots of computes; you just don’t need a COW.

    For me, as my “problem” is “keep me entertained and in step with industry changes”, a Beowulf using SBCs Single Board Computers is my solution to “my problem”. That it also will let me do Linux From Scratch faster via distcc compiles, and now, with GNU parallel, do some administrative tasks faster; that’s just gravy on top of the play and learn part. Spending the same money on a fast PC with fast disks would be faster using a GP machine… IF I could accept a recent Intel processor… which I can’t any more due to ME Management Engine issues.

    So, for all things computing, start with asking “What am I trying to accomplish?” Then find the hardware / architecture / software that lets you do that. Most often it’s a GP box with fast disk.

    Sidebar on Pi and GPU:

    In the R. Pi cores and some other ARM chips, the GPU uses the NEON instruction set. There are now folks doing math using NEON on the GPU. One of the differences between young and mature Linux ports is that at first, the GPU is ignored, then it gets used for simple video. Eventually it gets enough attention to do most of the video really well. (Thus my XU4 young port having video compositing sloth and jitter as the CPU is doing it, not the GPU as my best guess). Now we are starting to have the GPU used for more general compute needs in addition to just video.

    So take a R. Pi and run it “headless”, the GPU is doing nothing. Use that for math with NEON, and you get a many times faster result… So watch for some distributions of Linux to start bragging about the NEON compute option…

    On my “someday list” is to figure out how to incorporate using NEON in my cluster headless units as part of the compute facilities. It can give something like a 25 x faster compute platform…. but only if you can program it and your codes use it and you need that much simple math in a hurry…

  5. Steven Fraser says:

    @EMSmith: I was thinking a 5-ton chiller should handle the heat from your entire data center.

    Or, alternatively, a wine cooler cabinet. No unitaskers allowed!

  6. E.M.Smith says:


    Well, the SBC boards run about 5 W each full load… so even 20 of them would still be 1/2 the 200W monitor and about the same as my light bulbs… (2 halogen).

    Don’t need AC, even in summer, as the 60 C to 70 C CPU temp is higher than the 35 C typical MAX… but on really hot days, I turn off the light bulbs ;-)

    IFF it ever gets too hot and I need the window AC, I’ll make a wine cooler (the kind you sip) and sit in the shade with the dogs…. until sundown cooling… then use the computers at night…

    Factoid: One of these quad core boards has about the compute power of my old Cray. The Cray also had 4 CPUs, cost about $40 Million, had a 750 KVA power feed, and used a 16 x 16 water tower fed by a 4 inch pipe for cooling…. my how times change ;-)

  7. E.M.Smith says:

    Interesting article on HPC using ARM boards and GPUs. Specifically calls out the Odroid family.

    “I’m interested in a cheap device that’s mass manufactured that’s reliable and is high-performance,” says Khanna, “And I think that naturally brings you to two things: video gaming cards like NVIDIA GeForce or the AMD Radeons and mobile chips such as ARM.”

    Khanna studied the SBC (single board computing) space including the Raspberry Pi for a suitable platform. “While it’s true that the Pis sip power,” he said, “being in the few hundred megaflops range each, you would have to have so many of them with so many power supplies, cables, network cards and switches to get some substantial performance that it’s just not worth the hassle.”

    A couple years ago he began experimenting with using AMD Radeon cards to crunch some of his astrophysics codes. With the assistance of students, he had already created OpenCL versions and CUDA versions of his codes, and the Radeon of course supports OpenCL. He reports being impressed by the performance and reliability of the cards. Of the roughly two dozen cards crunching scientific work for the last two years, basically non-stop 24/7, there’s only been two or three that failed, he says. The latest version cards he’s acquired are the Radeon R9 Fury X, which provide 8.6 teraflops of single-precision floating point computing power and 512 GB/s of memory bandwidth for about $460.

    The problem with the Pi being that the GPU is odd, has a binary blob in the way, and is a royal PITA to use. Then compare that Radeon with 8.6 teraflops for under $500. Now that’s a great flops/watt and flops/$.

    But can you do better? The Odroid GPU is not used as oddly as the Pi’s, is supported by OpenCL, and has good flops/watt.

    First up, the Nvidia:

    While he was experimenting with the Radeon gaming boards, Khanna also wanted to implement a cluster with a mobile platform. He was looking for something sufficiently powerful yet energy-efficient with support for either CUDA or OpenCL. “If you keep those constraints in line, you find there are really two nice viable platforms – one is the NVIDIA Tegra, which of course supports CUDA, and the other is ODROID boards, developed by Hardkernel, a South Korean purveyor of open-source hardware. The boards use Samsung processors and an ARM Mali GPU that supports OpenCL.”

    Khanna ended up going with the Tegra X1 series SoC from NVIDIA, in part because he had several colleagues whose codes were better suited for the CUDA framework. He was also impressed with a stated peak performance (single-precision) of 512 gigaflops per card.

    In May, UMass Dartmouth’s Center for Scientific Computing & Visualization Research (CSCVR) purchased 32 of these cards at roughly a 50 percent discount from NVIDIA. The total performance of the new cluster, dubbed “Elroy,” is a little over 16 teraflops and it draws only 300 watts of power.

    So 512 Gflops for a relatively cheap card. But he mentions the Odroid…

    Although he’s very pleased with the performance and energy-efficiency metrics of the Tegra-based “Elroy,” Khanna doesn’t think the mobile device experiment, as he refers to it, was that advantageous from a cost perspective. Although he received a nice discount, the boards have a full sticker price of around $600 each while the ODROID boards are $60 and offer about one-fifth the FP32 performance, so potentially a 2X performance per dollar savings. Of course, peak floating point performance does not tell the whole story, but Khanna is optimistic about the ODROID prospects.

    “I think if we had done the ODROIDs instead, that would have been more attractive from a cost perspective, and in fact I think we are going to build a cluster with those ODROID Samsung boards as well for comparison’s sake,” he shares.

    Khanna maintains that with the right components he can achieve a factor of five or better on performance per watt and performance per dollar over more traditional server silicon. “All we’re doing is misusing these platforms to do constructive science,” he says.

    I wrote an OpenCL test case and was disappointed in the Pi running it. I think I need to test it on the Odroid… One fifth of 512 Gflops is about 100 Gflops. I could be happy with 100 Gflops for model runs… or really 300 Gflops, as I have 3 Odroids, if I could set them up to share the work… (3 different CPU types and different GPU models…)

    IFF I get this working with real performance near that, it would be a heck of a bang/$ …

  8. Paul Hanlon says:

    Hi ChiefIO,

    Thanks for the tip about NEON. It looks very interesting. Seems to apply to the four cores of the CPU, which means OpenGL would be needed for the GPU. Even so, it is nice to know that the four cores can be manipulated. Interesting about the Odroid. I have a C2 as my web server, and I’m very pleased with it. Very zippy.

    I believe it comes with the MALI460 GPU, which is, as you say, OpenCL capable. As far as I know, OpenCL can also use the four cores, as well as running the GPU. Sounds like a great project.

  9. E.M.Smith says:

    OpenCL runs on the regular cores and, with some significant effort, on the GPU cores as well ( 6 of them in the Mali). I’ve been looking at what it takes and it isn’t trivial. Not way hard either. Enough that I’m not going to do it soon, though. Likely take a weekend to get to first fire running job.

    POCL looks like the key bit (but then has dependency on CLANG and LLVM…)

    Portable Computing Language (pocl) aims to become a MIT-licensed open source implementation of the OpenCL standard which can be easily adapted for new targets and devices, both for homogeneous CPU and heterogeneous GPUs/accelerators.

    pocl uses Clang as an OpenCL C frontend and LLVM for kernel compiler implementation, and as a portability layer. Thus, if your desired target has an LLVM backend, it should be able to get OpenCL support easily by using pocl.

    pocl currently has backends supporting many CPUs, ASIPs (TCE/TTA), NVIDIA GPUs (via CUDA), HSA-supported GPUs and multiple private off-tree targets.

    In addition to providing an open source implementation of OpenCL for various platforms, an additional purpose of the project is to serve as a research platform for issues in parallel programming of heterogeneous platforms.

    Then you get the support package for your GPU and install it then stitch things together then… ;-)

    But still in the early stage of figuring it out so you well might find an easier way.

    Then these folks have a different approach, using NEON:

    You could instead use pocl, an open source implementation of OpenCL that runs on many different platforms, including ARM CPUs with NEON.
    There exist some third party research work on this topic

    they created a framework for OpenCL to make use of NEON instructions

    check here:

    So even the Pi gets in on it…

    In any case “Some Assembly Required”… as you assemble your system to do GPU math… THEN you get to convert your program to parallel orientation with OpenCL calls in it…

  10. Paul Hanlon says:

    Thanks ChiefIO,

    The POCL link led me to the LLVM and CLang sites. Both of these look very interesting in their own right, even if they weren’t needed for POCL.

    Also, from the LLVM site:

    The OpenMP subproject provides an OpenMP runtime for use with the OpenMP implementation in Clang.

    You posted about using OpenMP on your distcc? project? I wonder does this give CLang access to lower level calls to OpenMP. Might possibly lead to a useful speed up in message passing between the various ‘chips’. My whistle is definitely whetted, on this one.

  11. E.M.Smith says:

    CLang and LLVM are slated to replace gcc, so worth getting familiar.

    OpenMP is disjoint from distcc. One is a language extension (API). The other a distributed compiler tool.

    I built an OpenMP test case, then compiled it like any other program. Tested on only one board, but using 4 cores. On the R.Pi there was no speed improvement, so I’m exploring other choices and methods. So yeah, testing it with CLang is a good idea since I used gcc.

    There is also a Python implementation for parallel codes too.

    Folks forget that a LOT of what made the Cray fast was the compiler work. It knew how to make FORTRAN parallel on a stride-64 vector processor. What is needed now is better parallel language extensions, and compilers that understand them, like Cray wrote.

  12. Pingback: Pi Cluster Parallel Script First Fire | Musings from the Chiefio

Comments are closed.