This is partly just an “aggregator” of things already discussed. Some in specific articles, some in “tips” as I was just making some notes as I went along. I’m putting this up for some added information and so that finding the other bits is easier in the future.
First off, what is “scraping” a site, and why do it?
Scraping is in essence just making a full copy of a site for later use, as an archive or as an offline copy. You do it to preserve what is there, either at a point in time or as protection from loss.
For reasons beyond my ken, some site operators don’t like that. Partially, I can see it if they are being hit hard by a bunch of site scrapers, all of them wide open on fast links. That can saturate their internet connection and is a sort of ‘denial of service’ to others. For those of us on slow home links, this isn’t an issue, but we tend to be whacked by the same “protective” measures used against the others. Oh Well.
There are fairly trivial ways to bypass that kind of block, and for starters one can just use polite settings in a site-scraping script. Most such ‘scripts’ are really just a one-line command, but I put them in an executable file anyway, so it is a trivial kind of script.
The preferred command is “wget” (at least, it is my preferred command). It stands for “Web Get”, as that is what it does: it goes out on the web and gets stuff. There are many parameters you can set; most of them can be ignored. But if you run into issues, RTFM on wget. Read The (um) “Friendly” Manual.
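To make that concrete, here is a minimal sketch of the sort of one-liner-in-a-file I mean. The target URL is just a placeholder; the flags are the polite-mirroring ones that show up again further below:

#!/bin/sh
# Minimal polite site mirror (placeholder URL; adjust to taste).
#  -m                 mirror: recursive, timestamping, infinite depth
#  -np                no parent: don't wander above the starting directory
#  -w 10              wait 10 seconds between fetches
#  --limit-rate=100k  cap the bandwidth so you aren't a hog
wget -m -np -w 10 --limit-rate=100k http://example.com/data/

Make it executable (chmod +x), launch it, and wget puts everything under a directory named after the site.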
Prior postings have looked specifically at doing a site scrape of the NOAA/NCDC (now renamed to protect the guilty to NCEI though the links / paths have the old name) data and site, along with the CDIAC site (Carbon Dioxide Information Analysis Center). Since CDIAC has posted a “Going Off Line Real Soon Now” notice on their site, I figured it would be a “very good thing” to capture and preserve what I could since it is unclear where, or if, it will come back on line.
NOTICE (August 2016): CDIAC as currently configured and hosted by ORNL will cease operations on September 30, 2017. Data will continue to be available through this portal until that time. Data transition plans are being developed with DOE to ensure preservation and availability beyond 2017.
So it says it will be preserved and available, but… So I snagged a copy of what was publicly available. This also means that, over time, I don’t need to whack their site just to look at a particular bit of data nor do I need to take the network traffic load. All good things. My take on it is here:
So how big is this bundle? I have a little command named DU that tots up disk usage, sorts it, and prints out a nice summary in a dated file. It looks like this:
root@odroid32:/WD4/ext/7Feb2017_Scrape# cat ~chiefio/bin/DU
du -BMB -s * .[a-z,A-Z]* | sort -rn > 1DU_`date +%Y%b%d` &
#du -ks * .[a-z]* .[A-Z]* | sort -rn > 1DU_`date +%Y%b%d` &
The “-BMB” causes the Macintosh to barf, so you can use “-ms” instead of “-BMB -s” and it is fine. One gives decimal megabytes (1,000,000 bytes each) and the other gives binary megabytes (1,048,576 bytes each), so most folks will not care which. I also have a commented-out “-ks” form that gives the KB count for things too small for MB to be informative… All that .[a-z] .[A-Z] stuff is to catch the hidden files in your home directory that you don’t normally see: the ones starting with a “.”, so not normally displayed.
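For what it’s worth, a Mac-friendly variant of that same little script would look about like this (a sketch only, since I run mine on Linux; same idea, just the “-ms” form):

#!/bin/sh
# DU variant: '-ms' (megabyte totals) instead of '-BMB -s', for a du
# that doesn't accept -B. Output goes to a dated file, sorted biggest first.
du -ms * .[a-z]* .[A-Z]* | sort -rn > 1DU_`date +%Y%b%d` &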
root@odroid32:/WD4/ext/7Feb2017_Scrape# cat 1DU_2017Feb17
163382MB Temps
142051MB cdiac.ornl.gov
15875MB  GHCN_Daily_NOAA_NCDC
2413MB   Old_Logs
1MB      lost+found
1MB      1DU_2017Jan22
So the scrape of NOAA / NCDC was all of 15.8 GB, and that of CDIAC was 142 GB. A lot, but quite manageable. The commands used were a mixed set over time. (wget is smart and doesn’t download a new copy of things that have not changed.) I’ve commented out various iterations as I’d at times used flags to slow total bandwidth, or to be simpler. All of them worked, though in slightly different ways. I broke up the fetches into chunks, so I could get any given bit updated just by commenting out or uncommenting various lines. Note that the only active line at present is the first one, which lacks the “-np” flag. By leaving off that “no parent” flag, it fetches all of USHCN Daily first, then wanders up the parent directory and back down again, collecting most everything not blocked. That would normally be an “error” (which is why the others have “-np”), but as I wanted to preserve the whole site, I let it walk the whole tree, parent directories included.
# cdiac.ornl.gov USHCN Daily
echo
echo Doing cdiac.ornl.gov USHCN Daily
echo
wget -m http://cdiac.ornl.gov/ftp/ushcn_daily
#wget -m -np http://cdiac.ornl.gov/ftp/ushcn_daily
#wget -m -np -w 10 http://cdiac.ornl.gov/ftp/ushcn_daily
#wget -w 10 --limit-rate=100k -np -m http://cdiac.ornl.gov/ftp/ushcn_daily
#wget -r -N -l inf --no-remove-listing -w 10 --limit-rate=100k -np http://cdiac.ornl.gov/ftp/ushcn_daily

echo
echo Doing World Weather Records
echo
#wget -np -m ftp://ftp.ncdc.noaa.gov/pub/data/wwr/
#wget -np -m -w 20 ftp://ftp.ncdc.noaa.gov/pub/data/wwr/
#wget --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/wwr/
#wget --limit-rate=100k -nc -np -r -l inf ftp://ftp.ncdc.noaa.gov/pub/data/wwr/

echo
echo Doing World War II Data
echo
#wget -np -m ftp://ftp.ncdc.noaa.gov/pub/data/ww-ii-data/
#wget -np -m -w 20 ftp://ftp.ncdc.noaa.gov/pub/data/ww-ii-data/
#wget --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/ww-ii-data/
#wget --limit-rate=100k -nc -np -r -l inf ftp://ftp.ncdc.noaa.gov/pub/data/ww-ii-data/
Of all the directories and files that are grabbed, only a portion exceed one MB in size:
root@odroid32:/WD4/ext/7Feb2017_Scrape/cdiac.ornl.gov# cat 1DU_mb_out
125576 ftp
574    oceans
167    epubs
74     trends
70     SOCCR
25     programs
22     carbonmanagement
19     newsletr
16     images
11     wwwstat.html
4      science-meeting
3      ndps
2      datasets
All the rest are 1 MB or smaller. Here’s the listing:
root@odroid32:/WD4/ext/7Feb2017_Scrape/cdiac.ornl.gov# ls
1DU_mb_out                     ftp.2
about                          ftpdir
aerosol_parameters.html        GCP
aerosol_particle_types.html    glossary.html
aerosols.html                  halons.html
authors                        hcfc.html
backgrnds                      hfcs.html
by_new                         home.html
carbon_cycle_data.html         hydrogen.html
carbon_cycle.html              ice_core_no.html
carbonisotopes.html            ice_cores_aerosols.html
carbonmanagement               icons
carbonmanagement.1             images
carbonmanagement.10            includes
carbonmanagement.11            index.html
carbonmanagement.12            js
carbonmanagement.13            land_use.html
carbonmanagement.14            library
carbonmanagement.2             methane.html
carbonmanagement.3             methylchloride.html
carbonmanagement.4             methylchloroform.html
carbonmanagement.5             mission.html
carbonmanagement.6             modern_aerosols.html
carbonmanagement.7             modern_halogens.html
carbonmanagement.8             modern_no.html
carbonmanagement.9             ndps
cdiac                          new
cdiac_welcome.au               newsletr
cfcs.html                      newsletter.html
chcl3.html                     no.html
climate                        oceans
CO2_Emission                   oceans.1
CO2_Emission.1                 oceans.10
CO2_Emission.10                oceans.2
CO2_Emission.11                oceans.3
CO2_Emission.12                oceans.4
CO2_Emission.13                oceans.5
CO2_Emission.14                oceans.6
CO2_Emission.15                oceans.7
CO2_Emission.16                oceans.8
CO2_Emission.2                 oceans.9
CO2_Emission.3                 oxygenisotopes.html
CO2_Emission.4                 ozone.html
CO2_Emission.5                 permission.html
CO2_Emission.6                 pns
CO2_Emission.7                 programs
CO2_Emission.8                 recent_publications.html
CO2_Emission.9                 science-meeting
comments.html                  search.html
css                            sfsix.html
data                           shutdown-notice.css
data_catalog.html              SOCCR
datasets                       staff.html
datasubmission.html            tetrachloroethene.html
deuterium.html                 trace_gas_emissions.html
disclaimers.html               tracegases.html
epubs                          trends
factsdata.html                 vegetation.html
faq.html                       wdca
frequent_data_products.html    wdcinfo.html
ftp                            whatsnew.html
ftp.1                          wwwstat.html
You can see that a lot of it is just the html files that make the site go.
Most of the actual volume is the ftp site, as you would expect.
OK, that’s how you can grab a copy of CDIAC before the world changes…
NOAA NCDC / NCEI
The NOAA/NCDC scrape was a similar command. You will note that in this listing all of it is commented out except the last bit, which is getting “superghcnd”. That was added after this first scrape, and it is HUGE, so it is not in the above size information (it isn’t done yet). As I had just finished the other bits, I commented them out. Now it only chews on a chunk of superghcnd when I launch it:
echo
echo Doing NOAA set
echo
#wget -np -m ftp://ftp.ncdc.noaa.gov/pub/data/noaa/
#wget -np -m -w 10 ftp://ftp.ncdc.noaa.gov/pub/data/noaa/
#wget --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/noaa/
#wget -nc -np -r -l inf ftp://ftp.ncdc.noaa.gov/pub/data/noaa/

echo
echo Doing Global Data Bank set
echo
#wget -np -m ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/
#wget -w 10 --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/
#wget -np -m -w 10 ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/

echo
echo Doing GHCN
echo
#wget -np -m ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/
#wget -w 10 --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/
#wget -np -m -w 10 ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/

echo
echo Doing GHCN -daily- SuperGHCNd
echo
wget -np -m -w 10 ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/superghcnd/superghcnd_full_20170204.csv.gz
#wget -np -m ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/
SO FAR I’m at 2.5 TB or so of that daily data. It runs about 10 GB / day of new snapshots, and there is about 1.5 years’ worth.
root@odroid32:/LVM/ftp.ncdc.noaa.gov/pub/data/ghcn/daily/superghcnd# du -ms .
2573927 .
I’m figuring on about 4 TB when it is done, so be advised…
I also grabbed a locally readable mirror of my site. This lets me look at it with a browser offline. Nice for checking old articles without creating web traffic, like when on a slow link (let it scrape all night, then browse lightning fast during the day). It is a ‘snapshot’, so not useful for things like recent comments and / or interaction. Some images may get downloaded; other things remain live links to the outside world (like video from YouTube), so it isn’t 100% network free. (Tuning parameters to wget can grab more of the stuff that off-site links point at, but I’ve not done that yet. It is tricky to not end up scraping the entire world… set it to follow all links to unlimited depth and you end up putting the whole internet on your disk drive…)
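For reference, the knobs involved are host-spanning with a domain whitelist and a depth limit. Something like this sketch ought to keep it fenced in (untested by me for this purpose; the extra domain listed is just an example of where WordPress parks images, so check your own site’s links first):

# Sketch: follow links off the starting site, but only into the listed domains,
# and only 5 levels deep, so the whole internet doesn't land on your disk.
#  -H  span hosts    -D  comma-separated domain whitelist    -l  depth limit
wget -r -N -l 5 -k -E -p -H -D chiefio.wordpress.com,files.wordpress.com https://chiefio.wordpress.com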
What command did I use?
wget -U Mozilla -mkEpnp https://chiefio.wordpress.com
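For those who don’t enjoy decoding bundled single-letter flags, the same command spelled out with wget’s long options is:

# Same command, long-option form:
#   --mirror            recursive fetch with timestamping, infinite depth
#   --convert-links     rewrite links so the local copy browses offline
#   --adjust-extension  save pages with .html extensions where needed
#   --page-requisites   also grab the images/CSS each page needs to render
#   --no-parent         don't climb above the starting directory
wget -U Mozilla --mirror --convert-links --adjust-extension --page-requisites --no-parent https://chiefio.wordpress.com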
I was testing the “-U Mozilla” prior to doing GISS, and didn’t want a syntax error to lock me out for a day again… (GISS is picky about scraping, and gave me a one-day lockout on my first scrape attempt.)
How much disk did that take?
root@odroid32:/LVM/chiefiowp# du -ms chiefio.wordpress.com/
1373 chiefio.wordpress.com/
1.3 GB. Not bad, but I can see I need to check where the “free” limit on disk is located on WordPress ;-)
NASA GISS
The GISS site was more problematic. With the news being that President Trump would be refocusing NASA on space, and out of the politicized field of Climate, I figured a nice thing to do would be to preserve a copy. A couple of folks “tipped” this, but this is the link I can find at the moment. From P.G. here:
Trump scrapping NASA climate research division in crackdown on ‘politicized science’
February 19, 2017
Donald Trump is poised to eliminate all climate change research conducted by Nasa as part of a crackdown on “politicized science”, his senior adviser on issues relating to the space agency has said.
Nasa’s Earth science division is set to be stripped of funding in favor of exploration of deep space, with the president-elect having set a goal during the campaign to explore the entire solar system by the end of the century.
This would mean the elimination of Nasa’s world-renowned research into temperature, ice, clouds and other climate phenomena. Nasa’s network of satellites provide a wealth of information on climate change, with the Earth science division’s budget set to grow to $2bn next year. By comparison, space exploration has been scaled back somewhat, with a proposed budget of $2.8bn in 2017.
Bob Walker, a senior Trump campaign adviser, said there was no need for Nasa to do what he has previously described as “politically correct environmental monitoring”.
“We see Nasa in an exploration role, in deep space research,” Walker told the Guardian. “Earth-centric science is better placed at other agencies where it is their prime mission.
“My guess is that it would be difficult to stop all ongoing Nasa programs but future programs should definitely be placed with other agencies. I believe that climate research is necessary but it has been heavily politicized, which has undermined a lot of the work that researchers have been doing. Mr Trump’s decisions will be based upon solid science, not politicized science.”
Well, to me, that sure sounded like GISS climate work, and GIStemp, were likely to get the boot. So being responsible for backups and archives at companies for much of my professional life, I naturally thought: “Make a Golden Master Archive” of what you can.
Well, my first attempt was immediately slapped down by a bot assassin. Details in comments here:
The bottom line of all that is that NASA GISS has anti-site-scraper settings in their robots.txt file. I did get the scrape to work, after waiting a day or two for the block to expire. The command that worked is:
wget -U Mozilla --wait=10 --limit-rate=50K -mkEpnp https://data.giss.nasa.gov
Most likely one could leave out the “--limit-rate” and even the “--wait” options, but as I’m still working off the “superghcnd” TB wad, I didn’t want to slow that down. The “wait” says to pause that many seconds between fetches (so it looks like someone clicked a key) and the “limit-rate” makes it polite about being a bandwidth hog. The “-U Mozilla” says to tell the site, when asked, that I’m really the Mozilla browser. You can put many different browser types in that spot, as you like it.
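If a bare “Mozilla” ever stops being convincing enough, a fuller browser-style string can go in that spot too. The exact string below is only an example of the general shape, quoted because it has spaces in it:

# Example only: a longer user-agent string in place of plain "Mozilla".
wget -U "Mozilla/5.0 (X11; Linux armv7l; rv:45.0) Gecko/20100101 Firefox/45.0" \
     --wait=10 --limit-rate=50K -mkEpnp https://data.giss.nasa.gov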
As of now (all of a few hours of running, waiting, and rate-limiting) I’ve already got some data on downloads. Here’s what I’ve got so far:
root@odroid32:/LVM/GISS/data.giss.nasa.gov# ls
cassini     dust_tegen  impacts     mineralfrac  precip_cru  sageii
ch4_fung    efficacy    index.html  modelE       precip_dai  seawifs
co2_fung    gistemp     landuse     modelforce   robots.txt  stormtracks
csci        imbalance   mcrates     o18data      rsp_air     swing2
root@odroid32:/LVM/GISS/data.giss.nasa.gov# du -ms *
2    cassini
22   ch4_fung
1    co2_fung
8    csci
21   dust_tegen
7    efficacy
1    gistemp
1    imbalance
130  impacts
1    index.html
1    landuse
3    mcrates
49   mineralfrac
259  modelE
2    modelforce
1    o18data
1    precip_cru
1    precip_dai
1    robots.txt
2    rsp_air
1    sageii
7    seawifs
5    stormtracks
349  swing2
So there are 14 out of 22 directories either done or in progress (one of them is actively downloading at the moment; I can see in another window that it is the modelE directory).
That leaves only 8 more directories to go. (Two of the items listed are not directories at all: the file ‘index.html’ and the robots.txt file, which is how 24 entries comes out to 22 directories.) A total of 879 MB so far. Unless something is very, very large in the other directories, not a big scrape load, really. We’ll see when it completes.
Now, about that robots file… Sites can publish a file that tells your code, basically: “If you are not a human, but are a computer robot doing a task for a human, don’t do this list of things.” Here’s the robots file from GISS:
root@odroid32:/LVM/GISS/data.giss.nasa.gov# cat robots.txt
User-agent: *
Disallow: /cgi-bin/
Disallow: /gistemp/graphs/
Disallow: /gfx/
Disallow: /modelE/transient/
Disallow: /outgoing/
Disallow: /pub/
Disallow: /tmp/

User-agent: msnbot
Crawl-delay: 480
Disallow: /cgi-bin/
Disallow: /gfx/
Disallow: /modelE/transient/
Disallow: /tmp/

User-agent: Slurp
Crawl-delay: 480
Disallow: /cgi-bin/
Disallow: /gfx/
Disallow: /modelE/transient/
Disallow: /tmp/

User-agent: Scooter
Crawl-delay: 480
Disallow: /cgi-bin/
Disallow: /gfx/
Disallow: /modelE/transient/
Disallow: /tmp/

User-agent: discobot
Disallow: /
Now I don’t really care about a robots.txt file; I just “flow around it” by spoofing and saying I’m not a robot. So I’ve never really learned how to read one. To me, it looks like “IF your ‘user-agent’ text is FOO, forbid / Disallow these directories”. Looks like “discobot” gets screwed, with nothing allowed, while “msnbot”, “Scooter” and “Slurp” get a speed limit (that Crawl-delay) and a few transitory things blocked, all else OK. Everyone else gets even more blocked (though not everything, like discobot). That “*” is a wild card that usually says “match everything”.
I’m not sure if being Mozilla gets me past that, or not. We’ll see when this scrape is done, if those directories are all missing, or not. (I may need to spoof a different user-agent string in a future scrape). Re-runs of scrapes only pick up what has changed or has been added (IF you set the flags right), so a rerun on a mostly static site can go very very fast. It does not hurt to re-run a scrape in those conditions.
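If it turns out the robots rules did block some directories, there is a separate knob worth knowing about (independent of the user-agent spoof): wget honors robots.txt on its own during recursive fetches, and it has an explicit switch to stop doing that. Something like:

# Tell wget to ignore robots.txt for this run; keep the waits and rate limit
# so the scrape stays polite even if it isn't obedient.
wget -e robots=off -U Mozilla --wait=10 --limit-rate=50K -mkEpnp https://data.giss.nasa.gov

Whether to flip that switch is between you and your conscience (and their terms of service).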
So there you have it. How to snag huge chunks of data and such from various climate related sites.
You could do similar things for just about any site out there (depending on how tight they are on robots.txt, how creative you are getting past it, and how much disk you have).
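Pulling the pieces together, a generic “polite but thorough” template ends up looking something like this (the URL and numbers are placeholders; tune the waits and rate caps, and add “-e robots=off” only if you need it and are comfortable with it):

#!/bin/sh
# Generic site-scrape template: mirror, fix links for offline browsing,
# grab page requisites, stay below the start directory, and go gently.
wget -U Mozilla \
     --wait=10 --limit-rate=100k \
     -m -k -E -p -np \
     https://example.com/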
I can now point my browser at that local file set and read the pages from my own disk, if desired. This is an example URL from my browser title bar:
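(Illustrative, since the exact path depends on where the mirror lives on your disk; this one assumes the /LVM/GISS location used above.)

file:///LVM/GISS/data.giss.nasa.gov/index.html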
And I’m looking at the top page of the data.giss.nasa.gov site as of the time I scraped it.