Cloning USHCN Daily Data

This is just a quick little posting on how to mirror or rapidly clone all the data in a particular subdirectory of a web site (and a smaller sidebar on how not to…)

First, you need (well, really really ought to have ;-) a Unix or Linux machine for this to work. It isn’t all that hard to get one or build your own. The one I’m using for this exercise cost me $60 “all up” and took me about 6 hours to assemble and get working (and that was without really trying and with a lot of gratuitous playing along the way – a ‘production set up’ would likely take 1/2 that if the play and beverage and photo breaks were left out…)

Directions for the DIY low cost Linux Machine set up are here:

https://chiefio.wordpress.com/2015/07/18/raspberry-pi-m2-unboxing-and-setup/

https://chiefio.wordpress.com/2015/07/22/raspberry-pi-software-setup/

This script is pretty simple. Really just one line of the “wget” or ‘web get’ command. I’ve put a load of explanatory comments in front of it (lines starting with #) but you really can ignore all of them and just use the one line.

I have the basic command running fine. At the time of the original posting I had not tested this specific example of the script, for the simple reason that the machine had been busy and the last ‘enhancement’ (also known as a ‘one line bug fix’) had not yet been run, but I was pretty sure it was all correct. I’ve now tested it and it is working as expected.

With that, here’s the “script”. It is in a file named “getushcn” and with execute permissions set by doing a “chmod +x getushcn” (changes the file mode to add the eXecutable bit). I then print it out with the ‘cat getushcn’ command (concatenate a list of files and print out the contents – but with only one file, just print it).

cat getushcn 

# Fetch a mirrored copy of the CDIAC USHCN Daily temperature data.
#
# wget is the command that does the fetching.  
# It can be fed an http: address or an ftp: address.
#
# The -w or --wait option specifies a number of seconds to pause 
# between file fetches.  This helps to prevent over pestering a 
# server by nailing a connection constantly; while the 
# --limit-rate={size} option caps the average download speed by 
# pausing between reads.  Over time the average settles at about 
# the stated rate, but on a gaggle of small files it can take a 
# while to stabilize, thus the use of both.
#
# Since CDIAC uses a "parent" link that points "up one", you need 
# to not follow those or you will end up duplicating the whole 
# structure (I know... don't ask...); thus the -np or 
# --no-parent option.
#
# The -m or --mirror option sets a bunch of other flags in effect
# (it is shorthand for -r -N -l inf --no-remove-listing) so as to 
# recursively copy the entire subdirectory of the target 
# given in the address.  Fine, unless they use 'parent' a lot...
#
# Then you list the http://name.site.domain/directory or 
# ftp://ftp.site.domain/directory to clone
#
# Long form looks like:
#
# wget --wait 10 --limit-rate=100k --no-parent --mirror http://cdiac.ornl.gov/ftp/ushcn_daily
#
# but I think the --commands look silly and are for people who can't
# keep a table of 4000 things that -c does in 3074 Unix / Linux 
# commands, all of them different, in their head at all times ;-) 
# so I use the short forms.  Eventually not typing all those wasted
# letters will give me decades more time to spend on useful things,
# like comparing the merits of salami vs. prosciutto... 

wget -w 10 --limit-rate=100k -np -m http://cdiac.ornl.gov/ftp/ushcn_daily
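
To set it up on your own machine, the whole dance is just “make the file, make it executable, run it”. A minimal sketch, assuming you have saved the one-liner (with or without the comments) as getushcn in your current directory:

chmod +x getushcn
./getushcn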

And yes, you could just type

wget -np -m http://cdiac.ornl.gov/ftp/ushcn_daily

when in the desired directory and be done… which is what I did last night, but without the -np… but where’s the fun in that? Putting things in a file and calling it a script means never having to make the same typo twice ;-)

I launched this last night (minus the -np and the -w and the --limit-rate) and it ran for about 12 hours cloning away, having copied about 11 GB so far of everything both below the “ushcn_daily” directory and above, thanks to them having a “parent” link in their directories… and me not having the --no-parent set. At this point I’m curious about just how long it will take, so I was letting it finish. On the one hand, that’s a bit rude as I’m sucking up resources for not much reason. OTOH, it is rate limited by my slow pipe, so not a big impact on them, and once the mirror is done, I need never pester their site again as I have a local cached copy; even if I do an update on the whole thing, only the changed files would be re-sent.

So in the interest of showing everyone just how long it will take if you screw up like I did and leave out the --no-parent setting, I was going to let it complete and then ‘fess up’ (and / or tell the horror story… instead of just junking it incomplete and having no value at all).

UPDATE: Due to a power dropout of about 5 minutes, the ‘let it run’ has ended and I see no reason to restart it. Just accept that it takes fractional days to days if you forget that ‘no parent’ flag… END UPDATE.

Most of the time it was using my full speed of about 230 KB/s without the bandwidth limiting flags, but the protocol is polite, so it doesn’t seem to bother normal browser use much. Meaning those rate limit and wait settings are a nice idea, but likely not really needed. They are the polite thing to do, though. With those flags set, usage can be tailored to any degree desired. At ‘less than 1/2 the capacity’ it isn’t noticed much at all.

From my watching it last night, cloning just the daily data portion took fairly little time. About an hour, I think, but hard to tell for sure.

pi@RaPiM2 ~/ushcnd/cdiac.ornl.gov/ftp/ushcn_daily $ du -ks
1045568	.

It has about 1 GB of data in it, so divide that by your data rate and that’s how long it will take. (At my roughly 230 KB/s that works out to about 1,045,568 KB / 230 KB/s ≈ 75 minutes, which squares with the ‘about an hour’ above.)

Once mirrored, future runs compare date stamps and only re-copy a file when it has changed. (For most files; some small ones just get resent.) IFF you want a rolling archive, you will need to take snapshots prior to re-running the script and updating.
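
One way to take such a snapshot, as a sketch only (it assumes the mirror was made in the current directory, so the tree lives under ./cdiac.ornl.gov, and that there is disk to spare for a second copy):

# Dated copy of the mirrored tree, taken before re-running getushcn.
# cp -a preserves time stamps and permissions; a dated tarball would do as well.
cp -a cdiac.ornl.gov cdiac.ornl.gov.$(date +%Y%m%d)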

As my R.Pi is built on a 64 GB SD card, using 10 GB, or even 20 GB for a full archive of their whole site (however big it ends up being) costs about $1 / GB. So for a very modest sum you can clone the data archive at a point in time. About $1.05 for the USHCN daily directory and that has two copies of the data in it (by State and in a lump).

Note, too, that this will work for any site that uses the FTP or HTTP file transfer protocols. Just change the link

http://cdiac.ornl.gov/ftp/ushcn_daily

to point to a different sub-directory or site. It is my intent to make one of these for each of the major temperature data archives and then assess the operational work required to keep a current copy and selected archive copies.
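
If you would rather have one script that serves for any archive, a minimal sketch is to hand the URL to the script on the command line instead of editing it each time (the getmirror name here is just an illustration, not something I have settled on):

# getmirror -- the same one-liner, but with the target URL passed as an argument.
# Usage:  ./getmirror http://cdiac.ornl.gov/ftp/ushcn_daily
wget -w 10 --limit-rate=100k -np -m "$1"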

As posted, the script will mirror into the current working directory, so make sure you have picked a place you like before running it. In the final production version, I will have a line at the top of the script to assure things go where desired. For me, that will be a directory named /MIRRORS and the first line in the script will be:

cd /MIRRORS

That way, whenever I run it, the results will always end up in that same target directory.
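
So the ‘production’ flavor would look something like this (a sketch of the intent, not the final script; the || exit 1 guard is just there so a missing /MIRRORS directory stops the run rather than mirroring into the wrong place):

# Change to the mirror directory first, then fetch / refresh the mirror.
cd /MIRRORS || exit 1
wget -w 10 --limit-rate=100k -np -m http://cdiac.ornl.gov/ftp/ushcn_daily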

I’ve left it out of the posted version so that you can use this more as an interactive tool to make a copy wherever you like. Once the whole process is set up, I will have an automated script launch (sometime in off hours… dead of night or one weekend a year) that will do the mirror into the ‘usual place’ and make any archival backups desired. The future versions for other data sets and the ‘background job’ will also be posted so folks know how to set that up if desired.
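
When that day comes, one common way to do a ‘dead of night’ launch on a Linux box is a cron entry. As a sketch only (the 3 AM Sunday timing and the /MIRRORS/getushcn path are placeholders, not a committed schedule):

# crontab entry: run the mirror script at 3 AM every Sunday, logging the output
0 3 * * 0  /MIRRORS/getushcn >> /MIRRORS/getushcn.log 2>&1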

But for now, it’s a ‘one off’ script you can run on your own and in any directory.

The wget command has dozens of other options, so whatever you want it to do, there is likely an option for that. Those wanting to do more can type “man wget” at the command prompt in a terminal window on their Unix / Linux machine.

Happy Cloning!



11 Responses to Cloning USHCN Daily Data

  1. Larry Ledwick says:

    There is also a wget for windows for those so inclined. I have used it once to archive a soon to be defunct web site.

    http://gnuwin32.sourceforge.net/packages/wget.htm

  2. E.M.Smith says:

    @Larry Ledwick:

    Well, that’s convenient! Maybe we can get the whole Unix / Linux Userland ported on top of the Windows kernel and have a reasonable environment! ;-)

    (Then just swap the kernel when no one is looking…. 9-}

    But nice to know that one can clone in Windows land too. Just remember that some file systems (like FAT, FAT-32, and the like) have a broken idea of time stamps and a broken idea of ownership and permissions. NTFS is somewhat better, but not quite the full suite of EXT type … So if cloning, best to do it to an NTFS file system (or something better if it exists).

  3. E.M.Smith says:

    Interesting what you can find when you rummage around in the attic…

    The GHCN V1 that has been disappeared from the NOAA site:

    http://cdiac.ornl.gov/ftp/ndp041/

    And some copies of the FSU / USSR data:

    http://cdiac.ornl.gov/ftp/ndp048/
    http://cdiac.ornl.gov/ftp/ndp048r0/
    http://cdiac.ornl.gov/ftp/ndp048r1/

    Probably a bunch of other bits closer to ‘the source’ than the finished adjusted homogenized data food product of V3.3.x.y.z …

    I think this might benefit from ‘many hands’ looking… As the names are ‘unhelpful’ and there doesn’t seem to be any obvious index, either the ‘directions’ need to be found or each directory looked into and the contents figured out.

  4. Larry Ledwick says:

    ;)
    Windows at the command line is pretty much a poorly executed Unix clone; many of the same commands exist or have close cousins, e.g. traceroute (Unix) vs tracert (Windows), or netstat, and lots of the essential commands are very similar.

  5. Wow! That is amazing. I just typed that command into the “Terminal” prompt and all kinds of stuff started to download.

    Even with my 20 Mbps (ATM) download speed it will take a while. It is past my bed time so I will take a look at things in the morning.

  6. E.M.Smith says:

    @Galloping Camel:

    I have found on a rerun that the -w 10 really slows down the “skipping” of already copied files… so I would leave it out in final production. The skip is about 1 sec per file if not ‘wait’ flagged, so 10 seconds is 10 times slower for no real benefit (since data is not downloaded for the unchanged, skipped files).
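
    So for reruns, something like this (the same command, just without the wait) should do:

    wget --limit-rate=100k -np -m http://cdiac.ornl.gov/ftp/ushcn_daily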

    Hope you have a big disk…

  7. Chiefio,
    It reminds me of Dukas (The Sorcerer’s Apprentice). Already 3 GB downloaded. I will let it run one more day.

  8. LG says:

    A good Linux/Unix command reference for those less acquainted with Unix-like OSes:

    http://www.computerhope.com/unix.htm
    http://www.computerhope.com/unix/overview.htm

  9. E.M.Smith says:

    @LG:

    Good idea!

    @GallopingCamel:

    Well, I’m at about 35 GB downloaded so far (from 3 sites) and still going… but I’m sucking down the whole thing.

    But the good news is that as long as you have the -np set, it will only take from that point on down in the tree; and that means I can likely do a “du -s” on that subtree and tell you how much data it contains… One of my “complaints” about FTP sites is the lack of any “it’s this big” notices. I would do a ‘du -ks .’ in major directories and put that up as README.SIZE.txt files so folks could decide what made sense and when “I’ll grab that directory” means a new disk…
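
    Something like this in each top level directory would do it (just a sketch; README.SIZE.txt is whatever name one picks):

    du -ks . > README.SIZE.txt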

  10. Pingback: When Big Disks Meets Slow Process… | Musings from the Chiefio

  11. Pingback: Well That Was Fun, sort of… | Musings from the Chiefio
