An Amusing Parallel Pi FORTRAN Experiment

Yes, FORTRAN. No, I’m not on board with the “90s Thing” of spelling it Fortran. It was drummed into me to spell it all CAPS in my FORTRAN IV class. Like all things I learned in that (excellent) class, it stuck. Taught by Engineers, for Engineers, it was my first programming class ever. That I was near the top of the class informed my future in many ways… though I didn’t realize quite how much then…

I was thinking (always a dangerous thing…) that the Raspberry Pi Model 2 has 4 cores, but testing GIStemp on it only used one. Might I be able to make bits of it more “parallel”? The compiler directives ought to be in there somewhere..

A quick search turned up this nice example:

http://jblevins.org/log/openmp

Parallel Computing in Fortran with OpenMP

JANUARY 27, 2009

Both the GNU and Intel Fortran compilers have native support for OpenMP. All you need to do to create parallel programs is to add some OpenMP directives, specially formatted comments, to your code and compile the program with a special flag to enable OpenMP. For gfortran, compile the program with -fopenmp and for ifort, the flag is -openmp.

The trick to getting the most out of parallel processing is to run sections of code in parallel such that the additional overhead of starting new threads is outweighed by time required to perform the task. That is, it’s ideal to split the code into distinct tasks that are each sufficiently hard to compute on their own–ones that might take more than a couple of seconds to compute–rather than simple tasks that execute very fast.

In many cases you will have sections of code that are embarrassingly parallel. That is, a group of tasks need to be carried out multiple times and each task is independent of the others. Monte Carlo experiments are good examples of this. Each trial is independent of the rest. At the end, all of the trials simply need to be collected in any order so that summary statistics such as the mean and standard deviation can be calculated.

A Parallel Monte Carlo Experiment

Here is a short example program, mc1, to get started. Suppose we have a function called monte_carlo which we need to run several times with different data. Each run returns some result and we would like to compute the mean of these results. In the example below, monte_carlo(n) draws n Uniform(0,1) random numbers and returns the sample mean.
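(Just to make “specially formatted comments” concrete for anyone who has never seen one: here is about the smallest OpenMP sketch I can think of. It is my own illustration, not code from the article. The omp_lib module and omp_get_thread_num / omp_get_num_threads are standard OpenMP, and without -fopenmp the !$OMP lines are treated as ordinary comments.)

      ! hello_omp.f90 -- compile: gfortran -fopenmp hello_omp.f90
      program hello_omp
        use omp_lib
        implicit none
        !$OMP PARALLEL
        ! each thread announces itself; output order is whatever it is
        print *, 'Hello from thread', omp_get_thread_num(), &
                 'of', omp_get_num_threads()
        !$OMP END PARALLEL
      end program hello_omp

Run with OMP_NUM_THREADS=4 set, it should print four lines, one per thread, in whatever order the threads get there.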

The article is written in just that “gotcha!” style I remembered from oh so long ago…

(about 42? years ago now?… my how time flies… time flies like an arrow, fruit flies like a banana… from a linguistics class of the same era… but I digress…)

He does a test case with a ‘call random’ in it (“call random” also used in that FORTRAN IV class long long ago where the question of getting actual random was explored in great detail… something about Engineers and Math and random not being really random just really sticks in their craw… but again, I digress…) So he uses an example, then shows the parallel form, then shows you just how you got stuck and trapped by assumptions about “random” and how to fix it. My whole FORTRAN IV class was like that. Problem sets ALL set up as traps. After the first two I caught on to the trick and started getting 100% right… since then I have always looked for the traps… and I think that was the whole point.

OK, I ported the code to the ANTEK / Asus box under CentOS. Discovered in the process that their vi editor has been taught to understand FORTRAN and does a kind of key word color marking and syntax checking when you type in a FORTRAN program (any file ending in .f such as foo.f).

Nice. VERY nice. Clearly they expect folks doing science to be using their OS and have adapted it accordingly. Might even be enough to get me over their “we are in control; you are a peon” general attitude toward the folks using the box. Yes, it is more or less “the right thing to do” in large companies and Government Agencies… but still, it is a bit much for my Man Cave…

So I dutifully typed in the programs (one “straight” and one with the parallelisms), plus some compilation scripts and some ‘notes to self’ about setting ENVironment variables and… tested.

Now you might wonder why one would test parallelism on a single core machine? Well, mostly as the HDMI monitor was tied up with a 3 GB download of a Kali Linux (Penetration Testing) and I would otherwise be idle for a couple of hours. But partly, too, as I figured their FORTRAN would do a pretty good job of nagging me on the inevitable typos. It did. I fixed. (A “,” looks a lot like a “.” when you do not have your readers on… Oddly, back in about 1973 in my ALGOL class a swap of a ; for : coupled with a line printer with moving paper smearing the lower dot into a , coupled with an O29 key punch that was missing a dot so ; printed like : on the card had me decode the card via reading the punched HOLES to find the mistake… but I digress yet again…) Knowing that the program ran, I used the “home dir on USB” to move it to the R.PiM2 (after the download was done and I could put the monitor on it).

Confident and full of certainty, I compiled. It gave errors… The gfortran on the R.Pi does not like comments that are just fine on the CentOS version… I fixed. Mostly by removing a couple of comments….

And it compiled. And ran.

And only used one core.

What? But, but bu… sputter…

That old familiar feeling.

OOOOOooooooommmmmmmmmm………..

Centered, and remembering “It isn’t about ME…”

What was different?

Cutting to the chase, I’d named the program paratest.f and gfortran liked it fine, compiler directives and all. It just ignored them. Once I changed the name to paratest.f90, it compiled, paid attention to the compiler directives, and made parallel code. Sigh…

Deep breath. About that wine cup, where did I leave it?…

The Results

But wait, there’s more

While “it ran”, the resultant program performance profiling has me wondering a bit. I’m not yet to the point of “how to make my random come out to the same random as the other pseudo-random”, I’m back at “It did what?!”

First, the code:

Here’s the list of files:

pi@RaPiM2 ~/Desktop/fort $ ls
a.out  doit  doit2  paratest.f  paratest.f90  pt  setthreads
pi@RaPiM2 ~/Desktop/fort $ 

a.out is the default place a compile puts an executable. It used to mean “assembler output”, but that was long ago.

doit and doit2 are two small scriptlettes I made so I didn’t have to type the same thing right twice in a row.

paratest.f was the original program that worked on CentOS. Paratest.f90 is what worked in parallel on the R.PiM2.

pt is the executable that worked. It comes from paratest.f90 (while a.out comes from paratest.f.)

setthreads is a ‘scriptlette’ but also a reminder about how to manually set the environment variables so it works.

The details:

First, setthreads. It is configured as a script, but mostly is there to remind me what to type:

pi@RaPiM2 ~/Desktop/fort $ cat setthreads
export OMP_NUM_THREADS=${1-4}
pi@RaPiM2 ~/Desktop/fort $ 

This says to set the environment variable “OMP_NUM_THREADS” to whatever parameter is passed, defaulting to “4” if none is given. It is what I’d put at the top of a script running a parallel program. In reality, I just typed:

export OMP_NUM_THREADS=4

and then ran the program.
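(A quick way to double check that the variable actually took, from inside a program rather than trusting the shell: a small sketch of my own, not part of the experiment. omp_get_max_threads() reports how many threads OpenMP is prepared to hand out, which follows OMP_NUM_THREADS or defaults to the number of cores.)

      ! threadcheck.f90 -- compile: gfortran -fopenmp threadcheck.f90
      program threadcheck
        use omp_lib
        implicit none
        ! report the thread limit the runtime will actually use
        print *, 'OpenMP will use up to', omp_get_max_threads(), 'threads'
      end program threadcheck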

What about “doit” and “doit2”?

pi@RaPiM2 ~/Desktop/fort $ cat doit
gfortran paratest.f
pi@RaPiM2 ~/Desktop/fort $ cat doit2
gfortran -O3 -fopenmp -o pt paratest.f90
pi@RaPiM2 ~/Desktop/fort $ 

One compiles paratest.f to yield a.out. The other says to name the output executable (“-o”) pt, optimize (“-O3”), turn on OpenMP (“-fopenmp”), and use the .f90 version.

paratest.f has a couple of ‘old style’ comments that start with C and worked fine in the compiler, until I changed the ending to .f90, at which time it complained. OK… somewhere between 1977 with f77 and 1990 with f90 the “C in column 1” comment gave way to “!”, at least in the free-form source that a .f90 ending asks for… but they worked under g95 on the other compiler… Sigh… Folks who break backward compatibility will regret it in their later years when someone else does it to them…

As paratest.f it had compiled and ran fine, but only in one core. What The… Double and triple checking the syntax, spacing, variable names, every comma really a comma, and about those semi-colons… Eventually I noticed that the only difference between mine and the ‘model’ in the link was the file name and ending. When I changed it to .f90 it suddenly used all 4 cores. IMHO, I’d count that as a bug. IF the code has a parallelizing compiler directive in it, use it. But I’m not the compiler guy nor on the standards committee.
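(For what it is worth, my reading of why the suffix mattered: gfortran treats a .f file as fixed-form source and a .f90 file as free-form source, and the OpenMP conventions differ between the two. In fixed form the !$OMP (or C$OMP) sentinel has to start in column 1, while in free form it may be indented; and in either form the directive lines stay plain comments unless -fopenmp is on the compile line, which the bare “gfortran paratest.f” in doit does not pass. A sketch of the fixed-form conventions, not my actual paratest.f:)

C     colform.f -- fixed-form (.f) sketch, not my actual paratest.f:
C     a 'C' in column 1 is a comment, statements begin in column 7,
C     and an OpenMP sentinel must start in column 1.
C     Compile: gfortran -fopenmp colform.f
      program colform
      implicit none
      integer j
!$OMP PARALLEL DO
      do j = 1, 4
         print *, 'iteration', j
      end do
!$OMP END PARALLEL DO
      end program colform

Rename that same file colform.f90 and the compiler expects free form, where only “!” comments are legal; which is about where my comment complaints came from.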

So what happened when it ran?

Here’s some of the ‘straight’ form output:

pi@RaPiM2 ~/Desktop/fort $ ./a.out
 Experiment           1
 Experiment           2
 Experiment           3
 Experiment           4
 Experiment           5
 Experiment           6
 Experiment           7
 Experiment           8
 Experiment           9
 Experiment          10
 Experiment          11
 Experiment          12
 Experiment          13
 Experiment          14
[...]
 Experiment          96
 Experiment          97
 Experiment          98
 Experiment          99
 Experiment         100
 mean  0.499993414    
 stdev   2.69877473E-05

real	0m9.118s
user	0m9.060s
sys	0m0.000s
pi@RaPiM2 ~/Desktop/fort $ 

I made the loop 10 x longer so things would take long enough to measure, which is why the “stdev” is not at all like the example in the link. But you can see that each loop count is one more than the prior and it took about 9 elapsed seconds and 9 user CPU seconds to run.

How about the parallel version?

pi@RaPiM2 ~/Desktop/fort $ time ./pt
 Experiment           1
 Experiment          26
 Experiment          51
 Experiment          76
 Experiment           2
 Experiment          77
 Experiment          52
 Experiment          27
 Experiment           3
 Experiment          78
 Experiment          53
 Experiment          28
 Experiment          79
 Experiment           4
 Experiment          54
 Experiment          29
 Experiment          80
 Experiment          55

OK, it “unrolled” the DO loop of 100 cycles and starts each core at a 25 offset. 1-25 in one core, then 26-50 in the next, then 51-75, and the last core getting 76-100. Nice. Just what we wanted (though I would have been happier with strides that had 1, 5, 9, and 2, 6, 10 and 3, 7, 11 and 4, 8, 12 since things would come out a bit closer to the original, but that’s more esthetics than technical).
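(If that interleaved 1, 5, 9… hand-out were actually wanted, my understanding is the directive takes a SCHEDULE clause for exactly that: SCHEDULE(STATIC, 1) deals the iterations out round-robin, one at a time, so with 4 threads the first core gets 1, 5, 9, …, the second gets 2, 6, 10, … and so on. A sketch, not what paratest.f90 actually does:)

      ! stride.f90 -- sketch of SCHEDULE(STATIC,1); not part of paratest.f90
      ! Compile: gfortran -fopenmp stride.f90
      program stride
        use omp_lib
        implicit none
        integer :: j
        !$OMP PARALLEL DO SCHEDULE(STATIC, 1)
        do j = 1, 12
           print *, 'Experiment', j, 'on thread', omp_get_thread_num()
        end do
        !$OMP END PARALLEL DO
      end program stride

The 1, 26, 51, 76 pattern above is just what the default hand-out looks like here: the 100 iterations carved into four contiguous blocks of 25.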

And the “top” program showed the execution in more than one core, though not as I expected (but as it likely ought):

top - 02:38:55 up  1:30,  4 users,  load average: 0.89, 0.84, 0.69
Tasks: 140 total,   2 running, 138 sleeping,   0 stopped,   0 zombie
%Cpu(s): 31.3 us, 42.3 sy,  0.0 ni, 25.8 id,  0.0 wa,  0.0 hi,  0.6 si,  0.0 st
KiB Mem:    949408 total,   672460 used,   276948 free,    33928 buffers
KiB Swap:   102396 total,        0 used,   102396 free,   264788 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND           
 4986 pi        20   0 27812  668  576 R 264.9  0.1   0:12.15 pt                
 3250 root      20   0  109m  61m  25m S   7.9  6.6  10:13.83 Xorg              
 3438 pi        20   0  105m  19m  16m S   4.0  2.1   0:30.79 lxterminal        
 3944 pi        20   0  517m 201m  47m S   4.0 21.7  33:55.68 iceape-bin        
 3372 pi        20   0  131m  28m  23m S   1.0  3.0   0:35.27 lxpanel           
 3930 pi        20   0  4672 2456 2084 R   0.7  0.3   0:37.82 top               
 2164 debian-t  20   0 33148 3668 3184 S   0.3  0.4   0:03.27 transmission-da   
 2717 mysql     20   0  308m  36m 8952 S   0.3  3.9   0:10.65 mysqld            
 3641 pi        20   0  391m 120m  69m S   0.3 13.0   0:34.48 epiphany-browse   
    1 root      20   0  2148 1360 1256 S   0.0  0.1   0:02.05 init              
    2 root      20   0     0    0    0 S   0.0  0.0   0:00.01 kthreadd 

The “pt” program is shown using 264% of ‘the cpu’… I can live with that. While I’d expected 4 instances of pt, with usage of each, in reality this is more accurate. There is only one program, and it is using that much CPU.

But the real surprise came at the end:

 Experiment          21
 Experiment          72
 Experiment          46
 Experiment          97
 Experiment          22
 Experiment          47
 Experiment          73
 Experiment          98
 Experiment          23
 Experiment          74
 Experiment          48
 Experiment          24
 Experiment          99
 Experiment          49
 Experiment          25
 Experiment          75
 Experiment         100
 Experiment          50
 mean  0.499992907    
 stdev   2.81007051E-05

real	0m47.735s
user	0m49.710s
sys	0m49.040s
pi@RaPiM2 ~/Desktop/fort $ 

Yes, the stdev is a little different. We expected that as we are not yet doing the correction for starting 4 ‘call random’ events at the same seed as opposed to one that changes constantly. No, the real surprise is that 47 seconds of real time and 49 seconds of User CPU with another 49 seconds of System CPU.
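(For the record, the sort of correction I have in mind looks something like the sketch below: seed each thread from the clock plus its own thread number so the four streams start in different places. This is my own illustration, not yet wired into paratest.f90, and it assumes the gfortran at hand keeps a separate random_number state per thread, which newer versions do; if the state is shared it does not help.)

      ! seedfix.f90 -- a sketch only; assumes per-thread random_number state.
      ! Compile: gfortran -fopenmp seedfix.f90
      program seedfix
        use omp_lib
        implicit none
        real :: u

        !$OMP PARALLEL PRIVATE(u)
        call seed_this_thread()        ! each thread gets its own seed
        call random_number(u)
        print *, 'thread', omp_get_thread_num(), 'first draw', u
        !$OMP END PARALLEL

      contains

        subroutine seed_this_thread()
          integer :: i, n, clock
          integer, allocatable :: seed(:)
          call random_seed(size = n)   ! how many integers this compiler wants
          allocate(seed(n))
          call system_clock(count = clock)
          ! the clock makes runs differ; the thread number makes threads differ
          seed = clock + 37 * (/ (i, i = 1, n) /) &
                       + 1009 * omp_get_thread_num()
          call random_seed(put = seed)
          deallocate(seed)
        end subroutine seed_this_thread

      end program seedfix

Wired into paratest.f90, the seeding call would sit inside an !$OMP PARALLEL block ahead of the loop that does the draws.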

Doing ‘wall clock’ comparisons, it really is about a 1 : 5 speed ratio (roughly 9 seconds against 48). The overhead of the 4 core problem split is way more than the gain from using 4 cores. Just amazing.

The article DID warn of this potential, but I’d figured it was much farther away than this. I was wrong.

Depending on your computer, because of the simplicity of this experiment you will probably need to set n very high in order to see the difference in execution speed. In fact, for small values of n (even 100000 in my case) the parallel code is actually slower. This illustrates the earlier point about the overhead of spawning new threads. The chunks of code that are parallelized should be sufficiently hard to keep the computational throughput high.

From my code:

      Program mc1
10    implicit none

        integer, parameter :: nmc = 100
        integer, parameter :: n   = 1000000
        real, dimension(nmc) :: mu
        real :: mean, stdev
        integer :: j

I added the line number of “10” just out of spite. The idea of a FORTRAN program without line numbers is “just wrong”! ;-)

But even setting the value of n to 1000000 was not enough to make it a gain. 10 x his setting.

Clearly One Big Honking CPU has a major advantage over a collection of lesser chips and a context switch is very expensive.
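(One way to see where the wall time goes without leaning on the shell’s `time` is to wrap just the parallel loop in OpenMP’s own clock. A sketch of my own, not part of paratest.f90; omp_get_wtime() is the standard OpenMP wall-clock call.)

      ! timewrap.f90 -- sketch: time only the parallel region.
      ! Compile: gfortran -O3 -fopenmp timewrap.f90
      program timewrap
        use omp_lib
        implicit none
        integer :: j
        real :: u, total
        double precision :: t0, t1

        total = 0.0
        t0 = omp_get_wtime()
        !$OMP PARALLEL DO PRIVATE(u) REDUCTION(+:total)
        do j = 1, 1000000
           call random_number(u)       ! stand-in for the real work
           total = total + u
        end do
        !$OMP END PARALLEL DO
        t1 = omp_get_wtime()

        print *, 'parallel loop wall time:', t1 - t0, 'seconds'
        print *, 'sum of draws:', total
      end program timewrap

If most of the elapsed time shows up outside that bracket, the cost is in thread start-up and whatever the system is doing, not in the loop itself.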

I upped it another 10… to 10 Million…

 Experiment          90
 Experiment          91
 Experiment          92
 Experiment          93
 Experiment          94
 Experiment          95
 Experiment          96
 Experiment          97
 Experiment          98
 Experiment          99
 Experiment         100
 mean  0.500015676    
 stdev   9.86549912E-06

real	1m29.750s
user	1m29.420s
sys	0m0.030s
pi@RaPiM2 ~/Desktop/fort $ 
 Experiment          22
 Experiment          72
 Experiment          48
 Experiment          98
 Experiment          23
 Experiment          73
 Experiment          99
 Experiment          49
 Experiment          24
 Experiment          74
 Experiment         100
 Experiment          50
 Experiment          25
 Experiment          75
 mean  0.500015855    
 stdev   9.25805216E-06

real	8m2.981s
user	8m10.170s
sys	11m32.710s
pi@RaPiM2 ~/Desktop/fort $ 

So about 5 x longer elapsed and 5 x more user CPU (plus another eleven and a half minutes of system CPU) using all the cores than using one…

Since the CPU rarely went over 280 to 320 or so percent, what happens if we cut threads back to 2 and leave free cores for the browser et al. to run on, instead of forcing context switches?

pi@RaPiM2 ~/Desktop/fort $ export OMP_NUM_THREADS=2
pi@RaPiM2 ~/Desktop/fort $ time ./pt 
 Experiment           1
 Experiment          51
 Experiment           2
 Experiment          52

And we can see it using 2 cores, though with about 2 idle…

top - 03:06:50 up  1:58,  4 users,  load average: 1.64, 1.94, 1.77
Tasks: 137 total,   2 running, 135 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.0 us, 25.1 sy,  0.0 ni, 49.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:    949408 total,   683300 used,   266108 free,    39520 buffers
KiB Swap:   102396 total,        0 used,   102396 free,   336424 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND           
 5871 pi        20   0 11428  656  564 R 189.6  0.1   2:19.83 pt                
 3438 pi        20   0  106m  20m  16m S   4.6  2.2   0:46.66 lxterminal        
 3944 pi        20   0  512m 191m  47m S   4.3 20.6  45:41.59 iceape-bin   

So it’s using about 2 CPUs and we’ve got some idle time. Any “issues” are not due to Iceape sucking up a CPU.

 Experiment          46
 Experiment          96
 Experiment          47
 Experiment          97
 Experiment          48
 Experiment          98
 Experiment          49
 Experiment          99
 Experiment          50
 Experiment         100
 mean  0.500016987    
 stdev   9.67524738E-06

real	9m20.000s
user	8m23.740s
sys	8m51.080s
pi@RaPiM2 ~/Desktop/fort $ 

9 minutes elapsed, 8 min 23 seconds of user CPU. 8 min 51 seconds of system CPU. Remember the “system CPU” of the one CPU case?

sys	0m0.000s

What parallel removes from User CPU, it is adding to System CPU. The necessary conclusion is that on this version of the Raspbian OS, parallelizing FORTRAN is a losing proposition.

And with that, I’ve just saved myself a few dozen days of coding making GIStemp run parallel on the R.PiM2 4 cores.

Why we test and profile…

For The Terminally Curious

Here is the listing of paratest.f90. The first comment is a ‘ruler’ that I put in all FORTRAN programs I write to maintain column discipline. I had to change the first character to a ! to make it OK; a C in char 1 is a comment in fixed-form source, but free-form .f90 source only takes ! comments.

pi@RaPiM2 ~/Desktop/fort $ cat paratest.f90 
!23456012345678901234567890123456789012345678901234567890123456789012372

      Program mc1
10    implicit none

        integer, parameter :: nmc = 100 
        integer, parameter :: n   = 10000000 
        real, dimension(nmc) :: mu
        real :: mean, stdev
        integer :: j 

        !$OMP PARALLEL DO
        do j = 1 , nmc
           print *, 'Experiment', j
           mu(j) = monte_carlo(n)
        end do
        !$OMP END PARALLEL DO

        mean = sum(mu) / dble(nmc)
        stdev = sqrt ( sum ( (mu-mean)**2 ) ) / dble(nmc)

        print *, 'mean', mean
        print *, 'stdev', stdev


       Contains

         function monte_carlo(n) result(y)
           integer, intent(in) :: n
           integer :: i
           real :: y , u

           y= 0.d0
           do i = 1, n
              call random_number(u)
              y=y+u
           end do

              y=y / dble(n)
       
         end function monte_carlo
      end program mc1
pi@RaPiM2 ~/Desktop/fort $ 

And yes, I know it declares mc1 as the program name and I’ve named the file paratest.f90 but if you can’t torment a compiler with logical inconsistencies, who can you torment?…


15 Responses to An Amusing Parallel Pi FORTRAN Experiment

  1. M Simon says:

    FYI: The STM32 F2xx and STM32F4xx microcontrollers embed a True random number
    generator.

    I’m using the STM32F205 in a project.

  2. E.M.Smith says:

    @M.Simon:

    Nice to know. We actually spent some significant amount of time on the question of how to write your own really random number generator in class way back when. I think I remember how to do it ;-) (IIRC it was to use the low end bits of the call to time as the seed, then bit shift it one way or the other so it didn’t cyclically increment and use that as the seed to the actual compute of a value based on some large multiplies that made the low order bits flakey and then pluck those out as the random return… something like that.)

    Wonder how they get a “True” random… maybe jitter off the heat sensor?…

  3. M Simon says:

    These are the ruler lines I use in Forth:

    ( 1 2 3 4 5 6 7 8
    12345678901234567890123456789012345678901234567890123456789012345678901234567890
    )

    Well obviously the spacing is wrong. “1” goes over the first “0”. — “2” goes over the 2nd zero. etc.

  4. M Simon says:

    It is explained here: http://www.st.com/web/en/resource/technical/document/application_note/DM00073853.pdf – an analog noise generator feeding an LFSR. And some other manipulations.

  5. E.M.Smith says:

    Ah, a couple of analog oscillators XORed then shifted. Nice. I think what we did ‘way back when’ was the same idea, done less well due to less available stuff to work with. Since each run had an essentially random seconds field in the time, that was the analog of the XOR and then the ‘multiply and bit shift’ made it into a full bit field. IIRC we did some kind of circular shift… Used a double precision, but chopped it down to just the low order int of bits and shifted them… something like that. Now a little chip makes that sort of thing irrelevant I guess…

  6. E.M.Smith says:

    Golly…. looks like folks are still doing things with the clock bits:

    http://stackoverflow.com/questions/18754438/generating-random-numbers-in-a-fortran-module

    c initialize a random seed from the system clock at every run (fortran 95 code)

    subroutine init_random_seed()

    INTEGER :: i, n, clock
    INTEGER, DIMENSION(:), ALLOCATABLE :: seed

    CALL RANDOM_SEED(size = n)
    ALLOCATE(seed(n))

    CALL SYSTEM_CLOCK(COUNT=clock)

    seed = clock + 37 * (/ (i - 1, i = 1, n) /)
    CALL RANDOM_SEED(PUT = seed)

    DEALLOCATE(seed)
    end

    @M. Simon:

    Per Forth, I’ve been pondering something and think you can likely answer without the long ponder time I need…

    In GCMs and similar codes (even the later stages of GIStemp) there are grid cells that look to their neighbors to change their values. It seems to me that this just screams “parallel processors” and that a grid of small chips cranking away ought to work well. In particular, little stack machines running Forth. So:

    How hard is it to make a massively parallel program in Forth with a giant shared data space (the grid of data) but where each Forth instance would have the local cell data local to that instance?

    I’ve never done large data structures in Forth nor interprocess communications in Forth, so those are black box bits to me. ( I’ve only ever done small class type problems using Forth).

    Something about the character of the problem just keeps nagging me that Forth looks like a more efficient solution than FORTRAN… Small tight core program, reach out for last changed {temp, pressure, wind, humidity, insolation, TOD} from the neighbor cell, compute your round of changes, share to next cells, repeat.

    Then it would just be a matter of adding cores to the point of matching your grid scale and you have a monster weather simulator… With 100+ on a board, then 60 boards give you a 6000 cell engine and that’s about what the old GIStemp used (and more cells than we have temperature data points in any one year with which to initialize or compare them…). A dozen boards to a bay and it’s only 5 bays high. At 4 U, that’s about 20 inches. Call it 2 feet with spacers. Thing would fit on a desktop and, IMHO, really crank!

    But I’m not sure if the language handles that kind of problem well, and it would take me a few weeks to get good enough to know. So can you tell me the pro’s and con’s here?

  7. Another Ian says:

    E.M.

    I haven’t used FORTRAN in quite a while but liked the idea of IFIX

    As what you used after you’d invoked IF#*K

  8. M Simon says:

    Look up “GreenArrays”. It is already being done. 144 cores IIRC. But the cores are a little light (that is an understatement).

    If each core was a grid cell getting data from adjacent cells is trivial.

    But of course the whole idea is stupid if you can’t get clouds right. And use much smaller cells.

    1800*3600 cores (6,480,000) would be a good place to start. Unless you need to go 3D. Then X10 on the number of cores. So 1800*3600*10 – to start. But that is quite a large machine. And you also have to deal with the edges 360deg = 0deg.

    Suppose each core averaged 1W – for the 2D machine that is 7 or 8 MW. Even 1/10th that is a fair amount of power.

  9. M Simon says:

    Pros and cons?

    Forth uses a LOT less verbiage to get something done. The only unusual symbols are @ and !. Fetch and store. NOT = NOT. AND = AND. etc. Less (and clearer) verbiage makes maintenance easier, clarifies thinking, and reduces errors.

    And there is the fact that you always know (if you want) what the assy lang. produced will be. No more “the compiler screwed my program”.

    You tend to write in the correct way – lots of subroutine calls. Use small subroutines that are easily tested. That is because unlike “C” the call overhead is small.

    Cons? Floating point is not well handled by the core routines. But you could write your own. I have. In fact I have designed a 3 stack machine for that. Data stack. Return stack. FP Stack. Knowing how FP works is quite handy. It does get a little tricky when you leave off the leading “1” (you don’t need it because it is always there – FP numbers are between .5000… and .999999…)

    Where Forth really shines is in hardware control. Since I’m mainly a hardware guy….

    It is also good where you want to write your own language to solve a problem.

    The ARM stuff re:stacks is an abomination. You would think that there is a prejudice against stacks in the CPU world. (I miss the 68000). ARM gives you 1 1/2 stacks (fine for C). A two stack extension would be trivial hardware wise and software wise. But “it just isn’t done”.

  10. M Simon says:

    And I might add that every single compiler is a two stack machine. – it is almost like there is minimal communication between the higher level software guys and the hardware designers.

  11. Pingback: What is enough ARM CPU speed? | Musings from the Chiefio

  12. Pingback: RaTails – Draft High Level Steps | Musings from the Chiefio

  13. Pingback: Perspective on Raspberry Pi Benchmarks | Musings from the Chiefio

  14. Pingback: Raspberry Pi Model 2 and Amdahl | Musings from the Chiefio

  15. Pingback: Personal TeraFlop Climate Simulator for How Much? | Musings from the Chiefio
