Scraping NOAA and CDIAC

Given the various “rumors” about a large defunding of various “Climate Programs”, that might well include some of the data archives at NOAA or CDIAC, I thought I’d ‘remind’ folks how to scrape a site for data.

This is particularly important for CDIAC (Carbon Dioxide Information Analysis Center) as they are having a “Going Out Of Business” sale on data at the moment… Per their website (linked above):

NOTICE: CDIAC as currently configured and hosted by ORNL will cease operations on September 30, 2017. Data will continue to be available through this portal until that time. Data transition plans are being developed with DOE to ensure preservation and availability beyond 2017.

Well, IMHO it doesn’t matter if you are a True Global Warming Believer, or a Dyed-In-The-Wool Skeptic:

Data ought to be preserved. Period. FULL stop.

So be you skeptic or AGW Activist, IF you want to assure the ‘transition plan’ doesn’t drop things, well, you can always get your own copy.

I covered that to some extent in earlier postings.

But right now I think a ‘quick review’ might be in order.

As you can see, I have a 2015 copy from each of those. I’ve since updated most of them. But, should any of you want your own copy, here’s how to do it. I would advise using the “polite” options that rate limit, especially if you share a line with the rest of the family, OR if you are on a very very fast line. If a dozen folks on very fast links run scrapes wide-open, the admins will end up quashing it and nobody wins. I only ran mine wide open as 1) It was only me. 2) I was only updating a substantially complete copy. 3) I’m on an Orange Pi $16 computer on a slow link so can’t speed slam anybody if I tried…

This script is an example of how you can collect the contents of a web site. (Several examples, really, as each scrape has a couple of commented-out variants too.) Just “uncomment”, by removing the leading “#”, any particular site you would like to ‘scrape’ with the options you choose. (Mostly it is nice to use -w to wait a few seconds between file pulls, or --limit-rate to make sure you don’t saturate your internet connection and have the spouse or boss breathing down your neck…)

You will want to change the “cd ${1-/Archives}” line to have your destination directory instead of /Archives. This can be passed in at execution time as parameter 1, so you could say “syncnoaa /My/Directory/On/BIG_disk” and have it send stuff there.
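The default-if-missing behavior comes from the shell’s ${1-default} parameter expansion. A minimal sketch of the idiom (the function name here is just for illustration):

```shell
# Use the first argument as the destination if given, else /Archives.
pick_dest() {
  printf '%s\n' "${1-/Archives}"
}

pick_dest              # prints /Archives
pick_dest /mnt/big     # prints /mnt/big
```

The script then does `cd` into whatever that expansion yields, so everything wget fetches lands under your chosen directory.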

I’ve uncommented the more “rude” example of “just make a mirror pronto” versions of the command. As explained above I was a “man in a hurry” on a slow link with a $16 computer and $59 disk drive so wasn’t likely to slam anybody with rapid data demands… Tune your wget line as appropriate to be a good net citizen AND preserve copies of the data. (Yes, all up, including the $25 hub and $7 power supply, the needed hardware and software to build an archive from scratch is about $107 plus shipping. Call it $125 max. I got free shipping on most of mine.)
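For reference, a “polite” mirror line looks like the one below. The helper just prints the command instead of running it (so nothing is fetched here), and the URL is a stand-in, not one of the script’s actual targets:

```shell
# Print (don't run) a rate-limited, parent-safe wget mirror command.
polite_mirror() {
  echo wget -w 10 --limit-rate=100k -np -m "$1"
}

polite_mirror "ftp://example.org/pub/data/"
```

Dropping the leading `echo` turns the dry run into the real thing: -w 10 sleeps ten seconds between fetches, --limit-rate=100k caps the average throughput, -np refuses to climb above the starting directory, and -m turns on the full mirroring behavior.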

chiefio@orangepione:~$ cat bin/syncnoaa 
# Fetch a mirrored copy of the NOAA, GHCN Daily temperature data,
# or cdiac web sites.
# wget is the command that does the fetching.  
# It can be fed an http: address or an ftp: address.
# The -w or --wait command specifies a number of seconds to pause 
# between file fetches.  This helps to prevent over pestering a 
# server by nailing a connection constantly; while the 
# --limit-rate={size} caps the average download rate by inserting 
# pauses during transfers.  Over time this is about the rate of 
# bandwidth used, but on a gaggle of small files it can take a while 
# to stabilize, thus the use of both.
# Since CDIAC uses a "parent" link that points "up one" you need 
# to not follow those or you will end up duplicating the whole 
# structure ( I know... don't ask...) thus the -np or 
# --no-parent option.
# The -m or --mirror option sets a bunch of other flags (in effect)
# so as to recursively copy the entire subdirectory of the target 
# given in the address.  Fine, unless they use 'parent' a lot...
# Then you list the http: or ftp: address of the site or directory
# to clone.
# Long form looks like:
# wget --wait 10 --limit-rate=100k --no-parent --mirror
# but I think the --commands look silly and are for people who can't
# keep a table of 4000 things that -c does in 3074 Unix / Linux 
# commands, all of them different, in their head at all times ;-) 
# so I use the short forms.  Eventually not typing all those wasted
# letters will give me decades more time to spend on useful things,
# like comparing the merits of salami vs. prosciutto... 

#Flags:  -w n {wait n seconds}  --limit-rate=fook {limit rate to foo in k}
#        -np  {do not follow parent links upward}  
#	-nc  --no-clobber:  Complex, but leave old copies. Default for -r or -p
#	can not be specified with -N as -N says keep the newer time stamp.
#	-N  Use time stamping
#        -r  recursively follow the directory structure to 5 deep.
#	-l n  Go n deep on recursion instead of the default 5.
#	--no-remove-listing  keep the listing of what is on the server
#  	-m
#       --mirror
#           Turn on options suitable for mirroring.  This option turns on
#           recursion and time-stamping, sets infinite recursion depth and keeps
#           FTP directory listings.  It is currently equivalent to -r -N -l inf
#           --no-remove-listing.

cd ${1-/Archives}

# USHCN Daily

echo Doing USHCN Daily

wget -m -np

#wget -m -np -w 10

#wget -w 10 --limit-rate=100k -np -m

#wget -r -N -l inf --no-remove-listing -w 10 --limit-rate=100k -np

echo Doing World Weather Records

wget -np -m

#wget -np -m -w 20

#wget --limit-rate=100k -np -m

#wget --limit-rate=100k -nc -np -r -l inf

echo Doing World War II Data

wget -np -m

#wget -np -m -w 20

#wget --limit-rate=100k -np -m

#wget --limit-rate=100k -nc -np -r -l inf

echo Doing NOAA set

wget -np -m

#wget -np -m  -w 10

#wget --limit-rate=100k -np -m

#wget -nc -np -r -l inf

echo Doing Global Data Bank set

wget  -np -m

#wget -w 10 --limit-rate=100k -np -m

#wget  -np -m -w 10

echo Doing GHCN

wget  -np -m

#wget -w 10 --limit-rate=100k -np -m

#wget  -np -m -w 10

echo  ALL DONE!!! 

Note that this takes a LOT of disk and a LOT of time. On the order of a week on a moderately fast home network, and about 3/4 of a TB of disk would be a good place to start. (A 1 TB disk is better, as you might want to unpack some things and look them over.)
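Because of that runtime, it pays to launch the scrape detached from the terminal and capture a log you can check later. A sketch, with a harmless echo standing in for the real syncnoaa script:

```shell
# Launch the job immune to hangups, capture all output to a log file,
# and keep working; 'tail -f scrape.log' lets you peek at progress.
nohup sh -c 'echo Doing USHCN Daily' > scrape.log 2>&1 &
wait                      # here we just wait for the stand-in to finish
grep 'Doing' scrape.log   # the script's echo lines make handy progress markers
```

Substitute `syncnoaa /Archives` for the `sh -c '…'` stand-in, and skip the `wait` so you get your prompt back while it runs.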

You can use this basic syntax on just about any ftp: or http: website, with a little tuning sometimes.



About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present "hot buttons" are the mythology of Climate Change and ancient metrology; but things change...
This entry was posted in AGW and GIStemp Issues, Tech Bits. Bookmark the permalink.

10 Responses to Scraping NOAA and CDIAC

  1. pearce m. schaudies says:

    Hi Chief. Greetings from the Big Mango (BKK). Off topic a little, heh.
    You may have seen this before. In the beginning it talks a lot about how climate models work, and how they had been messed up by the government. So don’t be surprised if yours acts strange also.

    How’s that coming along? I’ve been following the hardware updates. Do you have a target date for the first sim run?

    Pearce M. Schaudies.
    Minister of Future

  2. E.M.Smith says:


    Saw that earlier tonight… it is about right, IMHO.

    Per “Status update”: Well, I had planned to be further along… forgot about an “Obligatory Birthday Dinner” a few hundred miles of driving away… (2 Sisters with birthdays about now, added to the spouse and her twin, and me, and the grandson just one month prior and…) Let’s just say that there’s about 7 or 8 Birthdays between Thanksgiving and now… so you can see how I’d be able to forget one event.

    So that sucked up most of yesterday. 5 hours of driving and about the same of party.

    Then I was moving a load of data and started to get an odd disk error. The “from” disk would just suddenly be mounted read only. ( I suspect a result of some safety ‘feature’ on a disk error state). So most of today has been diagnostics and finishing the data move. (Doing hash code error checks on 1/2 TB of data on two disks and comparing them takes time, even if automated.)

    At this point I think it is just that if you drive the Pi Model 3 I/O subsystem “to the wall” for 12+ hours straight, eventually it errors on you. Moving the data off in a more managed way is working Just Fine. I had put my home directory, swap, and a bit more on that one disk, and then launched a couple of “move a few hundred GB” jobs to a different disk, all in parallel. I think it just drove the poor dear into a wait state that caused something to throw an error code on a timeout, and then the remount-read-only result. (All the data copied so far is testing clean, but the ‘delete’ at the end of the move failed due to the RO mount status…)

    Well, as you might guess, that got in the way of running the model…

    So where AM I at?

    I’ve got Model II compiled and ready to go, and input data sets in hand. I “just” need to sort out what goes where, what to call the input data sets, and what to measure (and how) about the run. Probably a week from now given all the Honey Do-s and other obligatory bits this week.

    It only runs single system at the moment, so will just run it on the Pi Model 3 or the Odroid-C2 to try the effect of 64 bit system on it.

    Model E is more a mess. Not even tried a compile yet, but have reviewed the code (first pass). Don’t know what input data or where to get it, nor how to structure a run. It is the preferred one for distributed operation on the Pi Stack, but will be a PITA to get running. I expect it will take me a month to get anywhere of interest with it, and that will be a ‘cut-out’ core of it as a demo run (i.e. not intended to be useful for anything R&D like).

    It is already configured for MPI, so I’ll try running it on the Pi Stack. Failing that, I’ll drop back to single board while I figure out the config / patch needs. The Pi Stack is “good enough” build status ( I can do 12 cores of 32 bit computes with the Pi M3 in 32 bit armhf mode, and could fairly easily do 16 cores by adding the Orange Pi cores to the mix – the H3 chip is a 32 bit quad core).

    Or I can do 8 cores of 64 bit arm64 if I put an arm64 build into the Pi M3 and match it with the Odroid-C2. That’s 8 cores of 64 bit wide computes which ought to speed up DOUBLE math significantly. The timing from that vs 12 cores of 32 bit would be Very Interesting, and would drive future purchase / OS build choices.

    The 2 x Pi M2 boards are A7 cores and can only do 32 bit OS builds. The Orange Pi too. The Pi Model 3 and the Odroid-C2 are both A53 cores and can do 32 bit OS builds or 64 bit (and are also ‘faster per clock’ in having superscalar and pipelines and more…) So if 32 bit is ‘fast enough” then the Pi M3 and Odroid-C2 can be added to an all 32 bit cluster / stack. IFF the 64 Bit DOUBLE speed is a huge gain, then the 2 x Pi M2 just stay as they are as the Build Monster for doing software compiles and I add a couple of more A53 core boards in a new dog bone stack as the Model Cluster. (About $100 and I already have enough left in donations to fund that, if it is the right thing to do). But that is a decision for a couple of weeks from now…

    Oh, and at some point when I have think time I’ll need to look at doing “heterogeneous model cluster runs”. MPI has the facility. So I can likely use both 32 bit and 64 bit builds together in one cluster; and even add in Intel chip based boxes if desirable. That’s likely a few months away.

    Oh, and then I discovered that CDIAC is saying “Get it while you can” on their data, so I’ve also had a priority task to get the site scraper set up again…

    Will any of it really matter?

    I doubt it.

    The models are already repudiated widely. The Trump / Brexit / Frexit etc. process is already underway and “Climate Scientists” are about to be a dime a dozen with at least a 4 year, and more likely an 8 year “Never Mind” posted on their worksite doors…

    So not a giant priority to me to port the models and make them run. Important for personal interest reasons, and as a way to finish putting a few nails in the climate code coffin, maybe, but far less important than things like, oh, CDIAC closing up shop… or BREXIT moving Britain away from an EU Agenda on all things Climate.

    So that’s where I’m at.

  3. pearce m. schaudies says:

    Hi Chief. Greetings from the Big Mango (BKK). Thanx for reply. Here’s another wild idea, heh.

    Run a half- vast earth model. From equator north, California east to China, leaving out pacific. Might give faster, half accurate results for atlantic rim dirt, heh.

    Pearce M. Schaudies.
    Minister of Future

  4. pearce m. schaudies says:

    Hi Chief. There’s also this recent post by Wim Rost showing net energy balance maintained by ocean mixing and wind. So even when there is a ten or hundred year warming cooling spell in the air, it will come back to a neutral set point via the ocean energy.

    Pearce M. Schaudies.
    Minister of Future

  5. John Silver says:

    Chiefio, slightly OT
    Remember abandoning YouTube for a Bittorrent solution?
    That was quick, it’s right here:

  6. omanuel says:

    A consensus is building that the UN is Big Brother in George Orwell’s futuristic novel, Nineteen Eighty-Four.

    1. The last two allied atomic bombs destroyed Hiroshima & Nagasaki on 6 & 9 AUG 1945.

    2. Stalin’s USSR troops overran and captured the world’s remaining inventory of atomic bombs in Japan’s plant at Konan, Korea, a week or two later, in AUG 1945.

    3. Nations and national academies of sciences were united under the UN on 24 OCT 1945, with Stalin in control.

    4. Although George Orwell was dying of TB, he moved from London to the Scottish Isle of Jura to start writing Nineteen Eighty-Four in 1946.

  7. E.M.Smith says:

    I think I may have refined the disk failure issue…

    After moving about 1.5 TB from the Seagate disk, I reformatted it to test if it was failing. I then divided it into partitions for /usr /var /tmp /lib and copied those filesystem trees from the SD chip onto those disk partitions. At that point, I mounted the disk partitions over the SD chip directories. (This also makes a faster system and reduces SD chip ‘wear’ while speeding up disk writes, so I’d wanted to try it anyway…)

    By doing this I can use the disk (until it errors, if it errors) then just unmounting that partition restores the original SD filesystem to active use.
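One way to make that mount-over-the-SD-card arrangement persistent is via /etc/fstab. A sketch under assumed device names (the /dev/sda partitions are placeholders; check yours with lsblk before copying anything):

```
# /etc/fstab additions: disk partitions mounted over the SD-card
# directories.  /dev/sda1..4 are assumptions for your USB disk.
/dev/sda1  /usr  ext4  defaults,noatime  0  2
/dev/sda2  /var  ext4  defaults,noatime  0  2
/dev/sda3  /lib  ext4  defaults,noatime  0  2
/dev/sda4  /tmp  ext4  defaults,noatime  0  2
```

Unmounting any of these (umount /var, and so on) drops you back to the copy still on the SD card, which is exactly the recovery property described above.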

    Doing this, there have been no errors…

    Then, on one of the PiM2 systems, I was moving a few hundred GB to a new disk, and it did the same “go to read only” error. Hmmmm…. what was done during that time?

    The “headend” system was rebooted, and it was serving an NFS file system to the one that errored.

    I think the first error happened when an NFS server was rebooted while that filesystem was mounted to the system doing the high disk traffic, too.

    At this point, it is NOT looking like actual disk issues, but as likely to be a bad interaction of an NFS server reboot while a file system is mounted to the one doing high disk traffic (even just between two local disks and not using NFS disks actively). A rare usage in Pi land, but more common industrially…

    OK, easy fix, just unmount NFS mounts prior to rebooting any NFS server… good practice anyway.

    I’ll be following that path until such time as the error happens again anyway, or I get time and inclination to do deliberate testing of the proposed cause.

  8. Pingback: My, What Big Datasets You Have, Grandma NASA | Musings from the Chiefio

  9. E.M.Smith says:

    Well, after a couple of weeks of running my OS off of the 2 TB Seagate (that had given the surprise flip to read-only on the NFS toggle) it is having NO issues. At this point, I’m certain it isn’t the disk. Regular fsck at boot time finds no issues. I’ve got /var /usr /lib /tmp and some others on it (so it gets regular use when running). They are just copies of what was on the SD card, then mounted on top of the SD card, so ‘recovery’ would just be to boot without the disk mount. But there is just nothing wrong with the disk.

    On a second note: I’ve ordered another 4 TB external disk for the Opi site scraper. I pick it up tomorrow. Then I just need to glue it together with 4 more TB in an LVM group… I figured at the present rate of download, I run out of disk on the present one in a few days to a week (it goes fairly quickly at night…) and the grand total disk space needed will be about 8 TB.

    The daily “superGHCNd” has about a 10 GB to 11 GB file for each day. It starts in Sept 2015 and goes to the present. Roughly 4.7 TB just by itself. Add a couple of TB for other things (like older copies of GHCN and the cdiac set), and allow that a 4 TB disk is really about 3.5 TB when formatted. So I figure 8 TB in an LVM (Logical Volume Manager) group will be all I really need, but I can also grow it by adding a disk if needed / when desired (i.e. as the year progresses…)
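For anyone replicating the LVM growth step, the sequence is roughly as below. The helper prints the commands rather than running them (they need root and real devices), and the names used (archive_vg, archive_lv, /dev/sdb1) are illustrative assumptions, not the ones actually in use here:

```shell
# Print (not run) the steps for growing an LVM volume group by one
# new disk partition; run the printed lines as root once the device
# and volume names match your own system.
grow_vg_steps() {
  vg="$1"; lv="$2"; dev="$3"
  echo "pvcreate $dev"                          # label the partition as an LVM physical volume
  echo "vgextend $vg $dev"                      # add it to the existing volume group
  echo "lvextend -l +100%FREE /dev/$vg/$lv"     # grow the logical volume into the new space
  echo "resize2fs /dev/$vg/$lv"                 # grow the ext4 filesystem to match
}

grow_vg_steps archive_vg archive_lv /dev/sdb1
```

The point of LVM here is exactly what the comment says: when the scrape outgrows the current disks, you bolt on another drive and extend the volume without reformatting or moving anything.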

    Once again, thank you to the donors who made this possible. I ought to have the scrape moved over onto the LVM sometime toward the start of next week.

  10. E.M.Smith says:

    FWIW, I’m now running on an LVM volume group, so I can just dynamically glue on added disk as needed. I’m up to February 15 of 2016, so only one more year of SuperGHCNd Daily to go! Or about 3 more TB of raw disk.
