So I’ve got this cluster of boards. I’ve got “distcc” set up on 3 of them. Fine. So I can do compiles in parallel. How often do most folks compile big things?…
I’ve got MPICH installed. I can embed calls to parallel message passing IFF I rewrite, in a peculiar way, any code I want to run… Oh, and it didn’t speed things up much (or at all…) on the Pi. It does offer significant speedup for some classes of problems on some other hardware, though.
But again, what good is it for anything “day to day”?
Most of what I do is systems maintenance in a scripted way, or bulk data movements, file compression or expansion, and the occasional analysis of things with “Unix Tools”. Commands glued together with pipes and such. The ‘find’ command sending a list of file names to some other step. It isn’t “compiling”, and I’m not going to re-write Linux Tools to use MPICH or similar.
So what can be done to make that cluster useful for more generic things?
Well, I was looking into Climate Models that were already written to use parallel computer methods, and ran into something else. GNU ‘parallel’. A simple Unix-Like command that works on Linux Tools and Shell Scripts to run them in parallel. It integrates into a pipe connected set of regular commands, and sends various parts of it off to other cores or to other computers for execution.
As near as I can tell, it’s been around about a decade. Guess I was doing other things and didn’t notice it show up. I checked whether Devuan / Debian had a package for it, and it installed quickly:
root@odroidxu4:/# apt-get install parallel
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
  parallel
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 194 kB of archives.
After this operation, 639 kB of additional disk space will be used.
  parallel
Install these packages? [y/N] y
Get:1 http://auto.mirror.devuan.org/merged/ jessie/main parallel all 20130922-1 [194 kB]
Fetched 194 kB in 0s (231 kB/s)
Selecting previously unselected package parallel.
(Reading database ... 89307 files and directories currently installed.)
Preparing to unpack .../parallel_20130922-1_all.deb ...
Adding 'diversion of /usr/bin/parallel to /usr/bin/parallel.moreutils by parallel'
Adding 'diversion of /usr/share/man/man1/parallel.1.gz to /usr/share/man/man1/parallel.moreutils.1.gz by parallel'
Unpacking parallel (20130922-1) ...
Processing triggers for man-db (18.104.22.168-5) ...
Setting up parallel (20130922-1) ...
root@odroidxu4:/# which parallel
/usr/bin/parallel
So there it is.
I ran into this in some YouTube videos. The guy has a bit of an accent and talks very fast, so the pause button was my friend. There are a few of these, but I’m just going to embed the first one. If you are interested, I’m sure you can find the rest.
It comes with a man page:
PARALLEL(1)                        parallel                        PARALLEL(1)

NAME
    parallel - build and execute shell command lines from standard input in parallel

SYNOPSIS
    parallel [options] [command [arguments]] < list_of_arguments
    parallel [options] [command [arguments]] ( ::: arguments | :::: argfile(s) ) ...
    parallel --semaphore [options] command
    #!/usr/bin/parallel --shebang [options] [command [arguments]]

DESCRIPTION
    GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input into blocks and pipe a block into each command in parallel.

    If you use xargs and tee today you will find GNU parallel very easy to use as GNU parallel is written to have the same options as xargs. If you write loops in shell, you will find GNU parallel may be able to replace most of the loops and make them run faster by running several jobs in parallel.

    GNU parallel makes sure output from the commands is the same output as you would get had you run the commands sequentially. This makes it possible to use output from GNU parallel as input for other programs.

    For each line of input GNU parallel will execute command with the line as arguments. If no command is given, the line of input is executed. Several lines will be run in parallel. GNU parallel can often be used as a substitute for xargs or cat | bash.

  Reader's guide
    Before looking at the options you may want to check out the EXAMPLEs after the list of options. That will give you an idea of what GNU parallel is capable of.

    You can also watch the intro video for a quick introduction: http ://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
It has a long list of options and arguments, so reading the whole man page is going to take a while.
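To get a feel for the two input styles in the SYNOPSIS before wading through the options, something like the following can be tried. This is a minimal sketch, assuming GNU parallel (not the moreutils program of the same name) is the one on the PATH; `demo_inputs` is just a made-up function name for illustration.

```shell
# Hypothetical helper showing the two basic ways to hand work to GNU parallel.
demo_inputs() {
    # Style 1: arguments after ':::' become one job each.
    # {} is replaced by the argument; -k keeps output in input order.
    parallel -k echo "job for" {} ::: alpha beta gamma

    # Style 2: the same argument list, read from standard input instead.
    printf '%s\n' alpha beta gamma | parallel -k echo "job for" {}
}
```

Calling `demo_inputs` should print the three "job for ..." lines twice, once per input style. Without `-k` the lines can come back in whatever order the jobs happen to finish.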
So now I’m motivated to have my cluster powered up and online all the time. Things like being able to send a listing of all files in a directory into a compression program, all in parallel, and get the compressed versions back, all with a one line command, that’s interesting to me! So is taking a huge file of temperature data and searching different chunks of it for a particular entry (parallel lets you ‘chunk’ a file into segments and send the processing for each one to a different CPU / SBC).
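Those two jobs might look roughly like this on a single machine. It is a sketch under assumptions, not a tested recipe for the cluster: the function names, the 10M block size, and the file layout are all made up for illustration, and everything here runs on local cores only.

```shell
# Compress every not-yet-gzipped regular file directly inside a directory.
# One gzip job per file; by default parallel runs one job per CPU core.
compress_dir() {
    find "$1" -maxdepth 1 -type f ! -name '*.gz' | parallel gzip -9 {}
}

# Search a big text file in chunks. --pipe splits stdin on line boundaries
# into blocks of roughly --block size and feeds each block to its own grep.
# Caveats: the pattern is passed through the shell unquoted, so keep it simple,
# and the exit status is non-zero if any chunk has no match.
search_chunks() {
    pattern=$1
    file=$2
    parallel --pipe --block 10M -k grep -- "$pattern" < "$file"
}
```

Usage would be along the lines of `compress_dir /some/dir` and `search_chunks 1881 temps.dat`. Sending the same jobs to other boards is a matter of adding `-S` (`--sshlogin`) with a host list, which needs ssh access to those hosts.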
In short, it brings parallel processing to all those mundane scripts and housekeeping and data munging tasks that make up 90% of the Systems Admin day.
I haven’t done any comparative performance testing yet, so it might well turn out that with slow shared ethernet, shipping chunks of data off somewhere else for a text search might “cost” more time than just doing it locally. Or perhaps latency of writes to SD cards might be an issue. Or maybe some other quirk of very small systems. I’ll find out.
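A crude way to run that comparison is to time the same job both ways. The helper below is only a sketch: `elapsed` is a made-up name, it measures whole seconds of wall-clock time, and it says nothing about where the time went.

```shell
# Run a command, discard its output, and print the elapsed whole seconds.
elapsed() {
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Example comparison (file name and pattern are placeholders):
#   elapsed grep 1881 temps.dat
#   elapsed sh -c 'parallel --pipe --block 10M grep 1881 < temps.dat'
```

For small inputs the plain grep will most likely win, since starting the extra processes has its own cost.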
I’m also certain that, given the command syntax and options, I’ll be putting some scripts-as-commands into my own script command directory just so I don’t have to remember all those options. So a command like “squashem” might list all the files in a directory that are not already compressed then farm out compress jobs to all the known CPUs in the cluster.
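A first cut at such a “squashem” might look like the sketch below. The assumptions are loud ones: `SQUASH_NODES` is a made-up variable holding an `-S` host list (the default `:` means local machine only), the remote boards would need passwordless ssh plus rsync and GNU parallel installed, and the list of “already compressed” suffixes is just a guess at what matters.

```shell
# Hypothetical host list, e.g. SQUASH_NODES=":,pi2,pi3" (pi2, pi3 are placeholders).
SQUASH_NODES="${SQUASH_NODES:-:}"

# List regular files in a directory that lack a common compressed suffix.
list_uncompressed() {
    find "${1:-.}" -maxdepth 1 -type f \
        ! -name '*.gz' ! -name '*.bz2' ! -name '*.xz' ! -name '*.zip'
}

# Farm out one gzip job per file to the hosts in SQUASH_NODES.
# --trc {}.gz is shorthand for --transfer --return {}.gz --cleanup:
# copy the file to the worker, bring {}.gz back, remove the remote copies.
# Originals are left in place; deleting them would be a separate step.
squashem() {
    list_uncompressed "${1:-.}" |
        parallel -S "$SQUASH_NODES" --trc {}.gz 'gzip -9 < {} > {}.gz'
}
```

Dropped into a file in the script directory, this becomes `squashem /some/dir` and the options never need remembering again.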
But the simple fact that it installs with just an ‘apt-get’ and is just sitting there, with typically one long command line to launch a load of stuff; that means I’m going to use it. Which means I’m going to use the rest of the machines as a cluster a lot more.