This is mostly pointers to things I’m looking at / reading as I get some skill with coarray FORTRAN.

It’s fairly simple in concept. Extend arrays to cross processor and machine boundaries with relatively few syntax changes. In practice, there’s much hair on the dog that needs some attention to get best results.
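To give a flavor of how few syntax changes are involved, here is a minimal sketch of my own (not from any of the references below): a scalar declared with `[*]` becomes a coarray with one copy per image, and square brackets reach across images. With gfortran it needs OpenCoarrays, or the single-image mode via `gfortran -fcoarray=single`.

```fortran
program coarray_hello
  implicit none
  integer :: total[*]     ! a scalar coarray: one copy on every image
  integer :: i

  total = this_image()    ! each image stores its own image number
  sync all                ! wait until every image has written its copy

  if (this_image() == 1) then
    ! image 1 reads the copies held on all the images, itself included
    do i = 1, num_images()
      print *, 'image', i, 'holds', total[i]
    end do
  end if
end program coarray_hello
```

Everything else is ordinary Fortran; the only "parallel" parts are the `[*]` declaration, the `[i]` reference, and `sync all`.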

### Some Intro Slides & a Speed Comparison

This is a nice simple low level first introduction to what it looks like and how to use it.

http://www.training.prace-ri.eu/uploads/tx_pracetmo/coarrayvideo1.pdf

I get the impression it is a set of slides that are supposed to accompany a class and class handouts / exercises.

Parallel Programming with Coarray Fortran

PRACE Autumn School, October 29th 2010

David Henty, Alan Simpson (EPCC)

Harvey Richardson, Bill Long, Nathan Wichmann (Cray)

Motivation

• Fortran now supports parallelism as a full first‐class feature of the language

• Changes are minimal

• Performance is maintained

• Flexibility in expressing communication patterns

Programming models for HPC

• The challenge is to efficiently map a problem to the architecture we have

– Take advantage of all computational resources

– Manage distributed memories etc.

– Optimal use of any communication networks

• The HPC industry has long experience in parallel programming

– Vector, threading, data‐parallel, message‐passing etc.

• We would like to have models or combinations that are

– efficient

– safe

– easy to learn and use

It does a nice simple overview comparison of the different ways to do parallel programming, then shows a sample of how to do it in Coarrays that’s almost simple. (There seems to be a call to a subroutine that finds primes but is never defined in the example… so only a partial program is provided. Not enough to test-run.)

This one starts out with the same boilerplate, then diverges. Different “author”, so I think there’s a larger Cray corporate library of slides that individual presenters dip into to make their presentations. This one has more complex coverage further down the pages.

https://fs.hlrs.de/projects/par/events/2011/parallel_prog_2011/2011XE6-1/09.1-coarrays.pdf

This next one is an interesting comparison of the major FORTRAN compiler choices for use with Coarrays: Intel, Cray, and GNU Fortran (gfortran).

http://www.opencoarrays.org/uploads/6/9/7/4/69747895/pgas14_submission_7-2.pdf

Strangely, I could not find a date in it for when it was written. It claims to be a product of the 2014 “Summer of Code” and the latest citation at the bottom is 2014, so I’d guess that or 2015. It generally finds gfortran to be just dandy. It is a technical study of the ‘transport layer’ that’s so critical to shared systems use, so it doesn’t look at things like how complete the syntax support is, other than in support of their paper.

OpenCoarrays: Open-source Transport Layers Supporting Coarray Fortran Compilers

Alessandro Fanfarillo, University of Rome

Tobias Burnus, Munich, Germany

Valeria Cardellini, University of Rome

Salvatore Filippone, University of Rome

Dan Nagle, National Center for Atmospheric Research, Boulder, Colorado

Damian Rouson, Sourcery Inc., Oakland, California

ABSTRACT

Coarray Fortran is a set of features of the Fortran 2008 standard that make Fortran a PGAS parallel programming language. Two commercial compilers currently support coarrays: Cray and Intel. Here we present two coarray transport layers provided by the new OpenCoarrays project: one library based on MPI and the other on GASNet. We link the GNU Fortran (GFortran) compiler to either of the two OpenCoarrays implementations and present performance comparisons between executables produced by GFortran and the Cray and Intel compilers. The comparison includes synthetic benchmarks, application prototypes, and an application kernel. In our tests, Intel outperforms GFortran only on intra-node small transfers (in particular, scalars). GFortran outperforms Intel on intra-node array transfers and in all settings that require inter-node transfers. The Cray comparisons are mixed, with either GFortran or Cray being faster depending on the chosen hardware platform, network, and transport layer.

### Primarily Primes

Contemplating a problem to code in Coarray FORTRAN, I thought of primes. Looking for something already done came up empty. Either my search-foo is off, or nobody has done it yet and advertised the fact. I did run into this interesting RosettaCode page listing the Sieve of Eratosthenes in every language:

https://rosettacode.org/wiki/Sieve_of_Eratosthenes

Some, like Julia, are small and elegant. I’m going to try it too ;-) Others are surprisingly long and complicated. All, in theory, doing a fairly simple thing.

program sieve
  implicit none
  integer, parameter :: i_max = 100
  integer :: i
  logical, dimension (i_max) :: is_prime
  is_prime = .true.
  is_prime (1) = .false.
  do i = 2, int (sqrt (real (i_max)))
    if (is_prime (i)) is_prime (i * i : i_max : i) = .false.
  end do
  do i = 1, i_max
    if (is_prime (i)) write (*, '(i0, 1x)', advance = 'no') i
  end do
  write (*, *)
end program sieve

It also lists a more optimized version using a “wheel” of 2. That is, you know all the even numbers (other than 2) are not prime, so skip them…

program sieve_wheel_2
  implicit none
  integer, parameter :: i_max = 100
  integer :: i
  logical, dimension (i_max) :: is_prime
  is_prime = .true.
  is_prime (1) = .false.
  is_prime (4 : i_max : 2) = .false.
  do i = 3, int (sqrt (real (i_max))), 2
    if (is_prime (i)) is_prime (i * i : i_max : 2 * i) = .false.
  end do
  do i = 1, i_max
    if (is_prime (i)) write (*, '(i0, 1x)', advance = 'no') i
  end do
  write (*, *)
end program sieve_wheel_2

These folks have a version that is for parallel clusters, but using a different method: MPI.

To download the tarball:

https://osdn.net/projects/wolfchen/downloads/28118/EratosthenesFortran-0.00.05.tar.gz/

I’ve downloaded it and looked at it. Does a LOT of environmental housekeeping with things like asking how many CPUs and how much memory and sizing itself. The actual Sieve code is fairly small and the parallel part just a few lines. I might take a crack at rewriting it to be Coarray (and of fixed size so skipping all the housekeeping…)
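As a rough sketch of what that coarray rewrite could look like (my own guess at a layout, not code from the tarball): fix the size up front, give each image a block of the range to strike multiples out of, and have image 1 gather and print. Something like:

```fortran
program casieve
  implicit none
  integer, parameter :: n = 1000
  integer :: nimg, me, lo, hi, chunk, i, p
  logical :: is_prime(n)[*]   ! each image holds a full flag array (simple, if wasteful)

  me    = this_image()
  nimg  = num_images()
  chunk = (n + nimg - 1) / nimg
  lo    = (me - 1) * chunk + 1
  hi    = min(me * chunk, n)

  is_prime = .true.
  if (lo == 1) is_prime(1) = .false.

  ! every image strikes out the multiples that fall inside its own [lo, hi] block
  do p = 2, int(sqrt(real(n)))
    i = max(p * p, ((lo + p - 1) / p) * p)   ! first multiple of p at or above lo
    if (i <= hi) is_prime(i : hi : p) = .false.
  end do
  sync all

  ! image 1 pulls each block back from the other images, then prints
  if (me == 1) then
    do i = 2, nimg
      lo = (i - 1) * chunk + 1
      hi = min(i * chunk, n)
      if (lo <= n) is_prime(lo : hi) = is_prime(lo : hi)[i]
    end do
    do i = 1, n
      if (is_prime(i)) write (*, '(i0, 1x)', advance = 'no') i
    end do
    write (*, *)
  end if
end program casieve
```

Note it sieves by every p up to sqrt(n), prime or not; marking multiples of composites is redundant work but still correct, and it saves communicating the small primes between images.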

Or I might just pick some other problem as my test case.

If one chooses not to sieve, then the Miller-Rabin primality test would let you take number ranges and assign them as blocks to each processor. Fairly obvious parallel processing is possible. This code implements it. As the top of the page says, it is based on pseudo code from the Wiki page about the method. They also note the overflow problem one is likely to encounter: clearly, as soon as you hit the top of 64 bit numbers you have “issues”. I’m sure a lot of brain time could be sunk into cleaning up that abstruse math question…

https://rosettacode.org/wiki/Miller%E2%80%93Rabin_primality_test#With_some_avoidance_of_overflow

This code incorporates a call to another bit of code, referenced in a link there, that finds all the prime factors of a given number. So a complicated program, but it isn’t limited up front to a given array size as is the Sieve. (Note how those codes keep saying things like “All primes up to 1000” or “up to 200”… less useful for “find all primes”…)
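The block-per-processor idea is easy to express in coarrays even without the full Miller-Rabin machinery. A sketch of my own, with plain trial division standing in for the primality test (the range split and the gather are the point, not the test):

```fortran
program count_primes_blocks
  implicit none
  integer, parameter :: n = 10000
  integer :: me, nimg, chunk, lo, hi, k, i
  integer :: count[*]          ! per-image tally, gathered by image 1 at the end

  me    = this_image()
  nimg  = num_images()
  chunk = (n + nimg - 1) / nimg
  lo    = (me - 1) * chunk + 1
  hi    = min(me * chunk, n)

  ! each image tests only its own block of the number range
  count = 0
  do k = lo, hi
    if (is_prime(k)) count = count + 1
  end do
  sync all

  if (me == 1) then
    do i = 2, nimg
      count = count + count[i]
    end do
    print *, 'primes up to', n, ':', count   ! 1229 for n = 10000
  end if

contains

  logical function is_prime(k)   ! trial division standing in for Miller-Rabin
    integer, intent(in) :: k
    integer :: d
    is_prime = .false.
    if (k < 2) return
    is_prime = .true.
    do d = 2, int(sqrt(real(k)))
      if (mod(k, d) == 0) then
        is_prime = .false.
        return
      end if
    end do
  end function is_prime

end program count_primes_blocks
```

Swapping the contained function for a real Miller-Rabin routine would leave the parallel structure untouched, which is rather the charm of the block approach.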

I’m not real thrilled at the idea of just whacking together some mindless do-nothing code (like fill an array of 2000 with random numbers and then find the square root of them all). I’d rather write some bit of code that does something at least plausibly useful. Then again, I have a tape (somewhere) with the first 1/2 million prime numbers on it so I guess it is sort of useless to re-do that…

Long ago, when programming on an HP3000, the Image database used the DB size to compute hash values. This was made more efficient if the DB size was a prime number. So I wrote a program to find the nearest prime larger than some value. Want a 20 MB DB? Ask for the next prime larger than 20 MB and BINGO! it gave it to you. I also printed out a 6 inch thick binder of the first 1/2 million primes so others could just open the page and pick one. It was actually useful as we were a database shop and doing this every day. Now not so much…

### Comment On Array Manipulation

That’s where I’ve gotten to today. Learned some newer FORTRAN syntax on arrays (that whole a : b : c thing of start : end : stride ) from this line:

if (is_prime (i)) is_prime (i * i : i_max : 2 * i) = .false.

in the find primes code above. Took me a while to figure out that the “guts” of the process being done was those colons…

https://courses.physics.illinois.edu/phys466/sp2013/comp_info/array.html

In general, a section of an array is specified by v(start:end:stride) A bare colon (:) specifies the entire dimension, as shown in the examples above.

Obtaining the diagonal of a matrix requires converting the matrix to an array, and then using a stride that is one greater than the dimension.
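A tiny example of my own to make that quoted rule concrete: flatten a 3×3 matrix (Fortran stores it column-major) and take a stride of 4, one more than the dimension, to pick out the diagonal.

```fortran
program sections
  implicit none
  integer :: m(3, 3), v(9), i

  m = reshape([(i, i = 1, 9)], [3, 3])   ! columns are 1 2 3 / 4 5 6 / 7 8 9
  v = reshape(m, [9])                    ! flatten the matrix to a rank-1 array
  print *, v(1 : 9 : 4)                  ! stride dim+1 = 4 picks m(1,1), m(2,2), m(3,3)
end program sections
```

For this matrix that prints 1, 5, 9: the diagonal, no loop required.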

There’s some interesting things FORTRAN lets you do with arrays now. Back when I learned it, there was much less of that to learn ;-)

It’s the kind of stuff that makes it so useful for Math and Engineering oriented work. I remind myself of that while trying to recall the Matrix Algebra I learned in High School and never used since ;-)

### In Conclusion

And, for completion, I do have to mention there’s a Wiki:

https://en.wikipedia.org/wiki/Coarray_Fortran

It has an example “Hello World!” program in it. Easy, not useful.

Finally, there is this intro / tutorial that looks fairly good on a quick scan. It is also fairly long and has a lot of example code, thus the ‘only a scan’ so far.

http://coarrays.sourceforge.net/doc.html

If anyone has any pointers to other guides, tutorials, or interesting example codes to run; please post a link. I’m OK with the whole “do it by the seat of your pants” thing, IF I have to, but prefer starting from a higher base camp via reading up on a thing, if that’s an option.

Image processing would be a natural. Divide the image up into equal areas. Run the kernel (filter) over the areas, one per processor. Then run the kernel down the dividing lines. Voilà!
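In one dimension that dividing-line fix-up is the classic “halo exchange”, and coarrays make it a one-liner per neighbor. A sketch of my own (dummy data, a 3-point average standing in for a real filter kernel):

```fortran
program halo_blur
  implicit none
  integer, parameter :: nloc = 4    ! data points held per image
  real :: u(0 : nloc + 1)[*]        ! local block plus one halo cell on each side
  real :: s(nloc)
  integer :: me, nimg, i

  me   = this_image()
  nimg = num_images()
  u    = real(me)                   ! dummy "pixel" data, constant per image
  sync all

  ! exchange halo cells with the neighboring images: the "dividing lines"
  if (me > 1)    u(0)        = u(nloc)[me - 1]
  if (me < nimg) u(nloc + 1) = u(1)[me + 1]
  sync all

  ! 3-point average over the local block, halos covering the seams
  do i = 1, nloc
    s(i) = (u(i - 1) + u(i) + u(i + 1)) / 3.0
  end do
  print *, 'image', me, 'smoothed:', s
end program halo_blur
```

A 2-D image is the same idea with row (or tile) blocks and whole halo rows copied between neighbors.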