GIStemp STEP0 antarc_to_v2


This tasty little script takes three datasets (Antarctic 1, 2, and 3), changes some missing data markers and merges the three into a single output data set and puts it in sorted order.

That one script is all we will be looking at in this section. The person who wrote this step realized you don’t need to drop into a FORTRAN program just to do some basic line manipulations and some text substitutions.

It is nicely structured, fairly easy to follow, and written by someone familiar with what you can do with Korn shell scripting and wildcard characters. It does exploit nested “if” structures quite a bit and you must watch your exits carefully in that context (so you don’t get distracted by the radio and miss your exit!).

The purpose of the script is clear, the execution is clean, and there isn’t much here to screw up, so unless you are really interested in the program or style, there will not be much “action” in this dissection.

The script:

First a listing of the script:


m=8 ; # m=8: surface station, 9: automatic weather station
for x in antarc1.txt antarc3.txt
do if [[ m -eq 8 ]] ; then echo 'collecting surface station data' ; fi
   if [[ m -eq 9 ]] ; then echo '... and autom. weather stn data' ; fi
   while read a
   do if [[ $a = '' ]] ; then continue ; fi
      if [[ $a = 'Get '* ]] ; then continue ; fi
      if [[ $a != [12A-Z]* ]] ; then continue ; fi
      if [[ $a = *' temperature'* ]]                      ; # header line
      then name=${a%% *}                                  ; # get name to
        b=$( grep " ${name} " input_files/${x%.txt}.list ) ; # find ID
        id=${b%% *}                                       ; # from ...list
      elif [[ $a != *'.'* ]] ; then continue              ; # skip no-data-lines
      elif [[ $a = [12]* ]] ; then  echo "${id}${m}$a " >> v2_antarct.dat
      fi
   done < input_files/$x
   (( m = m+1 ))
done
         echo "... and australian data"
m=7  ; # marker for australian stations
while read a
do if [[ $a = '' ]] ; then continue ; fi
   if [[ $a = *':'* ]] ; then continue ; fi
   name=${a%%'  '*} ; b=${a#${name}}
   if [[ $b = *'E' || $b = *'E '* ]]                 ; # header line
   then b=$( grep " ${name} " input_files/antarc2.list ) ; id=${b%% *} ;
        # get ID
   elif [[ $a != [12]* ]] ; then continue            ; # skip non-data-lines
   elif [[ $a != *'.'* ]] ; then continue            ; # skip no-data-lines
   else echo "${id}${m}$a " >> v2_antarct.dat
   fi
done < input_files/antarc2.txt
echo "replacing '-' by -999.9, blanks are left alone at this stage"
sed 's/       - /  -999.9 /g' < v2_antarct.dat > v2_antarct.datt
sed 's/       - /  -999.9 /g' < v2_antarct.datt > v2_antarct.dat
sort v2_antarct.dat > v2_antarct.datt
mv -f v2_antarct.datt v2_antarct.dat


That’s it. The whole thing.

So what does it do?

I often like to start at the bottom and work up, or start at both ends and work in. This usually simplifies the decoding to a “core” bit and gets you familiar with where you are headed and where you started at the same time. We’ll do that here.

At the bottom, we end with the “missing data” substitution step followed by a merry-go-round of sorting. We even get a message telling us so (the echo “replacing '-' by -999.9” line; though we were not brave enough to go all the way and do the blanks too).

The command “sed” is run to edit the stream of data. The directive substitutes “-999.9” for each lone “-” (one preceded by seven blanks and followed by one blank), globally. It is run twice in a merry-go-round: first from v2_antarct.dat into the temp file v2_antarct.datt, then back again, so we end up where we started in v2_antarct.dat.
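Why run the identical sed twice? The script does not say, but one plausible reason is that matches cannot overlap: the pattern swallows the blank after the dash, so when two missing values sit side by side the second one is left with only six blanks in front of it and gets missed on the first pass. Here is a minimal sketch of that effect, assuming 8-character fixed-width columns. The sample record is made up, not taken from the real files.

```shell
# Hypothetical record: a 4-character year, three 8-character fields, one blank.
line=$(printf '%4s%8s%8s%8s ' 1957 - - -55.1)

# Same substitution as the script, applied once and then a second time.
pass1=$(printf '%s\n' "$line"  | sed 's/       - /  -999.9 /g')
pass2=$(printf '%s\n' "$pass1" | sed 's/       - /  -999.9 /g')

# Brackets make the trailing blank visible.
printf 'pass 1: [%s]\n' "$pass1"
printf 'pass 2: [%s]\n' "$pass2"
```

After the first pass only the first dash has been replaced; the second pass catches its neighbor because the replacement text ends in a blank that restores the seven-blank run.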

The “sort” command puts it all in ascending order, then we move it with force back over to its old home as v2_antarct.dat and we’re done.

The only thing I’d comment on here is the choice of temp file name. I would have used work_files/[something] so the temp output did not land in the same space where the live data and program source code lived. A minor point, but just ask yourself what happens in 10 years when some “newbie” decides to add another dataset and calls it “v2_antarct.datt” (maybe for Tasmanian or whatever) not knowing that it’s going to get nuked at execution time? Or worse, no one notices that one set of data got replaced by another at execution time?
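To make that suggestion concrete, here is a rough sketch of the sort step with its scratch file kept out of the data directory. This is my variation, not what GIStemp does: the two directories are throwaway stand-ins created with mktemp, and the sample rows are invented.

```shell
# Stand-ins for illustration only: a "live data" directory and a separate work area.
data_dir=$(mktemp -d)
work_dir=$(mktemp -d)
out="$data_dir/v2_antarct.dat"

# Two made-up records, deliberately out of order.
printf '%s\n' '700890090008 1958' '700890020008 1957' > "$out"

# mktemp picks a unique name, so the scratch file cannot clobber a real dataset.
tmp=$(mktemp "$work_dir/sort.XXXXXX")
sort "$out" > "$tmp" && mv -f "$tmp" "$out"

first=$(head -n 1 "$out")
echo "first record after sort: $first"

rm -rf "$data_dir" "$work_dir"
```

The "&&" means the move only happens if the sort succeeded, so a failed sort leaves the original file alone.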

At this point, we’ve got a handle on the “bottom bits”. We know where we are headed. Scanning back up the script we see a couple of places where data are concatenated onto the output dataset v2_antarct.dat (the double greater than signs) but no places where it is created or overwritten with an erase and start fresh write (a single greater than sign). OK, there is a small risk that if this step is run a couple of times in a row it will just keep concatenating duplicates onto the output file. I’d have started it with a “wipe clean my output file” step to avoid that, but maybe that’s just me. I worry as much about what a program will do when run “wrong” as getting it to do what I want when it is run “right”. Then again, I’ve spent many sleepless nights finding where someone wrote code that worked just fine, right up until something went “bump” in the night…
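A “wipe clean” step can be as small as one line. The sketch below is only an illustration of the idea (the build function and its two records are hypothetical): it truncates the output file before appending, so running it twice leaves two lines rather than four.

```shell
out=$(mktemp)   # stand-in for v2_antarct.dat

build() {
    : > "$out"                    # single '>' truncates: start from an empty file
    echo 'record one' >> "$out"   # double '>>' appends, as in the script
    echo 'record two' >> "$out"
}

build
build   # a second run does not pile up duplicates

lines=$(wc -l < "$out")
lines=$((lines + 0))   # strip any padding some wc versions add
echo "lines in output: $lines"

rm -f "$out"
```

Take out the truncating line and the same two runs leave four lines behind, which is the risk described above.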

OK, back to the top. We set a control filter “m” to the value “8” and the programmer was professional enough to leave us a comment as to what it means: “surface station”. A “for x” starts the major control loop (put a finger next to the matching “done” at the left edge, that’s where you end up.) Looking down a bit from the “for x” we see that there are a couple of “if” statements that use “m” to print out what stations we are doing and if we skip to the bottom of that “do” loop we find the place where 8 gets turned into 9 (the m=m+1 ) and notice right above it where our input data come from: input_files/[each value of the loop “for x” above.] So we loop through twice on two files.
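Stripped of its contents, the outer loop looks roughly like this. It is a bare skeleton for orientation, not the script itself: the inner read loop is replaced by an echo, and I use the portable $(( )) form in place of the Korn shell (( m = m+1 )).

```shell
m=8   # 8: surface station, 9: automatic weather station
for x in antarc1.txt antarc3.txt
do
    list=${x%.txt}.list   # antarc1.txt -> antarc1.list (strip the suffix, add a new one)
    echo "flag $m: data from input_files/$x, IDs from input_files/$list"
    m=$((m + 1))          # 8 becomes 9 for the second file
done
```

The ${x%.txt} expansion is the same trick the script uses to find the station list that goes with each data file.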

What happens inside? We read a line into a text buffer “a”. If it is blank or starts with “Get ” we skip it. If it starts with one of the right characters (that 12A-Z range: a 1, a 2, or a capital letter) we use it. (Or, as the code actually says it: if it does not, we skip it and “continue” with the next line.)
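The three skip tests can be tried out on their own. The sketch below is a portable rewrite using “case” rather than the Korn shell [[ ]] tests the script uses, and the sample lines are invented, so treat it as an illustration of the filtering logic only.

```shell
# Returns success if a line would get past the three "continue" tests.
survives() {
    case $1 in
        '')        return 1 ;;   # blank line
        'Get '*)   return 1 ;;   # line starting with "Get "
        [12A-Z]*)  return 0 ;;   # starts with 1, 2 or a capital letter
        *)         return 1 ;;   # anything else is skipped
    esac
}

kept=0
for sample in '' 'Get made-up banner' 'Halley temperature' '1957  -27.8' '  indented note'
do
    if survives "$sample"; then kept=$((kept + 1)); fi
done
echo "$kept of 5 sample lines survive"
```

Order matters: “Get ” starts with a capital letter, so it has to be thrown out before the 12A-Z test gets a look at it, just as in the script.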

Now it gets a bit cute. If the line has a blank followed by the word “temperature” we know it’s a header line so we pick out the station name. Then we go fish in the input_files directory in the affiliated station list file to pick out the station ID number using an embedded call to “grep”, the global regular expression print command that finds lines of matching text in files. The text buffer is ‘b’ and the ID is then fished out of that. We then hit two matching “elif” commands (Else, If) before we reach the associated “fi” that ends the “temperature” if nested set. One tosses out lines with no data in them (no “.” anywhere), the other is the “money line” that checks that “a” starts with a 1 or a 2 and if it does, prints our product to the output file.

That product is a line of text with the station ID, the type flag “m”, and the “a” buffer.
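Here is the header-to-ID lookup and the resulting output line in miniature. The list entries, station IDs and data values are all hypothetical (I am only assuming the ID comes first on each list line, which is what the ${b%% *} expansion implies); the expansions and the grep are the ones from the script.

```shell
list=$(mktemp)   # stand-in for input_files/antarc1.list
# Made-up entries: ID, then the station name set off by blanks.
printf '%s\n' '70089009000 Amundsen_Scott  -90.0' '70089002000 Halley  -75.5' > "$list"

m=8
a='Halley temperature'          # made-up header line
name=${a%% *}                   # drop everything from the first blank on
b=$(grep " ${name} " "$list")   # find the list line for that station
id=${b%% *}                     # first word of that line is the ID

a='1957  -18.6  -20.1'          # made-up data line
record="${id}${m}$a "           # ID, type flag, then the raw line
echo "$record"

rm -f "$list"
```

Note what happens at the edges: if grep finds nothing, “id” is empty and the data lines go out with no station ID in front; if it finds more than one line, the first ID wins.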

If redesigning this, I would just have a load script that loads this data into a consistent database. Define your database. Each load script matches its associated raw data input file. End of story. This is a nice well written script that works well and shows mastery of the craft; but it’s doing a somewhat silly thing, IMHO. (Though in fairness, it needs to be done at some point – bringing all the data sources into a single format. It’s just the scattered sequential way it’s done in GIStemp that’s a bit daft.)

So, next we go “down under” to Australia.

This is a short loop and you ought to be getting used to it now. We start off with reading a line into a text buffer “a”. If it’s blank or has a colon we skip it and go read another (“continue”). Next the name field gets picked out (everything before the first double blank), and if the rest of the line has an “E” at the end (or an “E” followed by a blank) it’s a header, so we use the name as a key to search for the station ID (just like above). Otherwise, if the line does NOT start with a 1 or 2, or does NOT contain a “.”, we skip it. The survivors are written to the output file just like before, this time with the type flag “m” set to 7.
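The name-splitting and header test can be sketched like this. The two sample lines are invented (I am guessing the trailing “E” is an east longitude, which the script does not confirm), and “case” stands in for the script's [[ ]] test.

```shell
classify() {
    a=$1
    name=${a%%'  '*}   # everything before the first double blank
    b=${a#${name}}     # the rest of the line, with the name stripped off the front
    case $b in
        *E|*'E '*) echo "header:$name" ;;   # ends in E, or has E followed by a blank
        *)         echo "other:$name" ;;
    esac
}

h=$(classify 'Casey              66.3S  110.5E')   # made-up header line
d=$(classify '1957    -1.2    -3.4')                # made-up data line
echo "$h"
echo "$d"
```

Splitting on a double blank rather than a single one lets a station name contain a single blank and still be picked up whole.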

Well, that’s it.

We cleaned up some missing data flags, merged three data sets, stuck the station ID and type flag on the front of each data line, and sorted the output. Oh, and any differences in file formats were ironed over. Not exactly rocket science…

About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present "hot buttons" are the mythology of Climate Change and ancient metrology; but things change...