This is partly just an “aggregator” of things already discussed. Some in specific articles, some in “tips” as I was just making some notes as I went along. I’m putting this up for some added information and so that finding the other bits is easier in the future.
First off, what is “scraping” a site, and why do it?
Scraping is in essence just making a full copy of it for use later as an archive, or as an offline copy. You do it to preserve what is there either at a point in time or as protection from loss.
For reasons beyond my ken, some site operators don't like that. Partially, I can see it if they are being hit hard by a bunch of site scrapers, all of them wide open on fast links. That can saturate their internet connection and becomes a sort of 'denial of service' to others. For those of us on slow home links this isn't an issue, but we tend to get whacked by the same "protective" measures used against the others. Oh well.
There are fairly trivial ways to bypass that kind of block, and for starters one can just use polite settings in a site scraping script. Most such 'scripts' are really just a one line command, but I put them in an executable file anyway, so it is a trivial kind of script.
The preferred command is "wget" (at least, it is my preferred command). It stands for "Web Get", which is just what it does: it goes out on the web and gets stuff. There are many parameters you can set, and most of them can be ignored. But if you run into issues, RTFM on wget. Read The (um) "Friendly" Manual.
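To give the flavor before the real examples below, here is a minimal sketch of the sort of one-liner I mean (the URL is just a placeholder, not one of the sites discussed in this post):

# polite mirror of one directory tree (placeholder URL)
# -m mirror, -np don't climb to the parent directory,
# -w 5 wait 5 seconds between fetches, --limit-rate caps bandwidth
wget -m -np -w 5 --limit-rate=200k http://example.com/some/directory/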
Prior postings have looked specifically at doing a site scrape of the NOAA/NCDC (now renamed to protect the guilty to NCEI though the links / paths have the old name) data and site, along with the CDIAC site (Carbon Dioxide Information Analysis Center). Since CDIAC has posted a “Going Off Line Real Soon Now” notice on their site, I figured it would be a “very good thing” to capture and preserve what I could since it is unclear where, or if, it will come back on line.
NOTICE (August 2016): CDIAC as currently configured and hosted by ORNL will cease operations on September 30, 2017. Data will continue to be available through this portal until that time. Data transition plans are being developed with DOE to ensure preservation and availability beyond 2017.
So it says it will be preserved and available, but… So I snagged a copy of what was publicly available. This also means that, over time, I don’t need to whack their site just to look at a particular bit of data nor do I need to take the network traffic load. All good things. My take on it is here:
https://chiefio.wordpress.com/2017/01/30/scraping-noaa-and-cdiac/
So how big is this bundle? I have a little command named DU that tots up disk usage, sorts it, and prints out a nice summary in a dated file. It looks like this:
root@odroid32:/WD4/ext/7Feb2017_Scrape# cat ~chiefio/bin/DU
du -BMB -s * .[a-z,A-Z]* | sort -rn > 1DU_`date +%Y%b%d` &
#du -ks * .[a-z]* .[A-Z]* | sort -rn > 1DU_`date +%Y%b%d` &
The -BMB causes the Macintosh to barf, so you can use -ms instead of "-BMB -s" and it is fine. The -BMB form counts in decimal megabytes (1,000,000 bytes) while -m counts in binary mebibytes (1,048,576 bytes), so most folks will not care about the difference. I also have a commented out "-ks" form that gives the KB count for things too small for MB to be informative… All that .[a-z] .[A-Z] stuff is to catch the hidden files in your home directory that you don't normally see, the ones starting with a "." and so not normally displayed.
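For what it's worth, the Mac-friendly variant of that same DU script is just the one substitution (same sort, same dated output file):

du -ms * .[a-z,A-Z]* | sort -rn > 1DU_`date +%Y%b%d` &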
root@odroid32:/WD4/ext/7Feb2017_Scrape# cat 1DU_2017Feb17
163382MB  Temps
142051MB  cdiac.ornl.gov
 15875MB  GHCN_Daily_NOAA_NCDC
  2413MB  Old_Logs
     1MB  lost+found
     1MB  1DU_2017Jan22
So the scrape of NOAA / NCDC was all of 15.8 GB, and that of CDIAC was 142 GB. A lot, but quite manageable. The commands used were a mixed set over time. (wget is smart and doesn't download a new copy of things that have not changed.) I've commented out various iterations where I'd at times used flags to slow total bandwidth, or to be simpler. All of them worked, though in slightly different ways. I broke up the fetches into chunks so I could update any given bit just by commenting out or uncommenting various lines. Note that the only active line at present is the first one, which lacks the "-np" flag. By leaving off that "no parent" flag, it fetches all of USHCN Daily first, then wanders up the parent directory and back down again, collecting most everything not blocked. That would normally be an "error" (which is why the others have "-np"), but as I wanted to preserve the site, I let it walk the whole tree, parent directories included. The script is below, followed by a quick glossary of the flags.
# cdiac.ornl.gov USHCN Daily
echo
echo Doing cdiac.ornl.gov USHCN Daily
echo
wget -m http://cdiac.ornl.gov/ftp/ushcn_daily
#wget -m -np http://cdiac.ornl.gov/ftp/ushcn_daily
#wget -m -np -w 10 http://cdiac.ornl.gov/ftp/ushcn_daily
#wget -w 10 --limit-rate=100k -np -m http://cdiac.ornl.gov/ftp/ushcn_daily
#wget -r -N -l inf --no-remove-listing -w 10 --limit-rate=100k -np http://cdiac.ornl.gov/ftp/ushcn_daily

echo
echo Doing World Weather Records
echo
#wget -np -m ftp://ftp.ncdc.noaa.gov/pub/data/wwr/
#wget -np -m -w 20 ftp://ftp.ncdc.noaa.gov/pub/data/wwr/
#wget --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/wwr/
#wget --limit-rate=100k -nc -np -r -l inf ftp://ftp.ncdc.noaa.gov/pub/data/wwr/

echo
echo Doing World War II Data
echo
#wget -np -m ftp://ftp.ncdc.noaa.gov/pub/data/ww-ii-data/
#wget -np -m -w 20 ftp://ftp.ncdc.noaa.gov/pub/data/ww-ii-data/
#wget --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/ww-ii-data/
#wget --limit-rate=100k -nc -np -r -l inf ftp://ftp.ncdc.noaa.gov/pub/data/ww-ii-data/
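For reference, here is a rough glossary of the wget flags that appear in this script and the ones further down (standard wget options, summarized from its manual):

# -m                   mirror: shorthand for -r -N -l inf --no-remove-listing
# -r                   recursive retrieval
# -N                   timestamping: skip files that have not changed since the last run
# -l inf               no limit on recursion depth
# --no-remove-listing  keep the FTP .listing files
# -np                  "no parent": don't climb above the starting directory
# -nc                  "no clobber": don't re-fetch files already on disk
# -w 10                wait 10 seconds between fetches
# --limit-rate=100k    cap the download rate at about 100 kB/s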
Of all the directories and files that are grabbed, only a handful exceed one MB in size:
root@odroid32:/WD4/ext/7Feb2017_Scrape/cdiac.ornl.gov# cat 1DU_mb_out
125576  ftp
   574  oceans
   167  epubs
    74  trends
    70  SOCCR
    25  programs
    22  carbonmanagement
    19  newsletr
    16  images
    11  wwwstat.html
     4  science-meeting
     3  ndps
     2  datasets
All the rest are 1 MB or smaller. Here’s the listing:
root@odroid32:/WD4/ext/7Feb2017_Scrape/cdiac.ornl.gov# ls
1DU_mb_out ftp.2 about ftpdir aerosol_parameters.html GCP aerosol_particle_types.html glossary.html aerosols.html halons.html authors hcfc.html backgrnds hfcs.html by_new home.html carbon_cycle_data.html hydrogen.html carbon_cycle.html ice_core_no.html carbonisotopes.html ice_cores_aerosols.html carbonmanagement icons carbonmanagement.1 images carbonmanagement.10 includes carbonmanagement.11 index.html carbonmanagement.12 js carbonmanagement.13 land_use.html carbonmanagement.14 library carbonmanagement.2 methane.html carbonmanagement.3 methylchloride.html carbonmanagement.4 methylchloroform.html carbonmanagement.5 mission.html carbonmanagement.6 modern_aerosols.html carbonmanagement.7 modern_halogens.html carbonmanagement.8 modern_no.html carbonmanagement.9 ndps cdiac new cdiac_welcome.au newsletr cfcs.html newsletter.html chcl3.html no.html climate oceans CO2_Emission oceans.1 CO2_Emission.1 oceans.10 CO2_Emission.10 oceans.2 CO2_Emission.11 oceans.3 CO2_Emission.12 oceans.4 CO2_Emission.13 oceans.5 CO2_Emission.14 oceans.6 CO2_Emission.15 oceans.7 CO2_Emission.16 oceans.8 CO2_Emission.2 oceans.9 CO2_Emission.3 oxygenisotopes.html CO2_Emission.4 ozone.html CO2_Emission.5 permission.html CO2_Emission.6 pns CO2_Emission.7 programs CO2_Emission.8 recent_publications.html CO2_Emission.9 science-meeting comments.html search.html css sfsix.html data shutdown-notice.css data_catalog.html SOCCR datasets staff.html datasubmission.html tetrachloroethene.html deuterium.html trace_gas_emissions.html disclaimers.html tracegases.html epubs trends factsdata.html vegetation.html faq.html wdca frequent_data_products.html wdcinfo.html ftp whatsnew.html ftp.1 wwwstat.html
You can see that a lot of it is just the html files that make the site go.
Most of the actual volume is the ftp site, as you would expect.
OK, that’s how you can grab a copy of CDIAC before the world changes…
NOAA NCDC / NCEI
The NOAA/NCDC scrape was a similar command. You will note in this listing that all of it is commented out except the last bit, which is getting "superghcnd". That was added after the first scrape, and it is HUGE, so it is not in the size information above (it isn't done yet). As I had just finished the other bits, I commented them out. Now it only chews on a chunk of superghcnd when I launch it:
echo
echo Doing NOAA set
echo
#wget -np -m ftp://ftp.ncdc.noaa.gov/pub/data/noaa/
#wget -np -m -w 10 ftp://ftp.ncdc.noaa.gov/pub/data/noaa/
#wget --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/noaa/
#wget -nc -np -r -l inf ftp://ftp.ncdc.noaa.gov/pub/data/noaa/

echo
echo Doing Global Data Bank set
echo
#wget -np -m ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/
#wget -w 10 --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/
#wget -np -m -w 10 ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/

echo
echo Doing GHCN
echo
#wget -np -m ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/
#wget -w 10 --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/
#wget -np -m -w 10 ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/

echo
echo Doing GHCN -daily- SuperGHCNd
echo
wget -np -m -w 10 ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/superghcnd/superghcnd_full_20170204.csv.gz
#wget -np -m ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/
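One aside on that superghcnd line: since it points at a single giant .csv.gz file rather than a directory tree, an interrupted transfer can also be resumed in place with wget's -c (continue) flag rather than starting over. Something like:

wget -c ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/superghcnd/superghcnd_full_20170204.csv.gz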
SO FAR I'm at 2.5 TB or so of that superghcnd daily data. Each daily file is about 10 GB, and there is about 1.5 years' worth.
root@odroid32:/LVM/ftp.ncdc.noaa.gov/pub/data/ghcn/daily/superghcnd# du -ms .
2573927 .
I’m figuring on about 4 TB when it is done, so be advised…
My Site
I also grabbed a locally readable mirror of my site. This lets me look at it with a browser offline. Nice for checking old articles without creating web traffic. Like when on a slow link (let it scrape all night, then browse lightning fast during the day). It is a ‘snapshot’ so not useful for things like recent comments and / or interaction. Some images may get downloaded, other things remain live links to the outside world (like video from youtube) so it isn’t 100% network free. (Tuning parameters to wget can grab more stuff outside the original site on links, but I’ve not done that yet. It is tricky to not end up scraping the entire world… so set the depth to capture all links, and all depths, and you end up putting the whole internet on your disk drive…)
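If I ever do chase those off-site bits, the usual way to keep it from turning into "scrape the world" is to allow host spanning but whitelist the domains and cap the depth. A sketch only; the domain list and depth here are illustrative, not tested:

# -H span hosts, -D limit spanning to the listed domains, -l 5 cap recursion depth
wget -m -k -E -p -H -D wordpress.com,files.wordpress.com -l 5 https://chiefio.wordpress.com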
What command did I use?
wget -U Mozilla -mkEpnp https://chiefio.wordpress.com
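Unpacking that terse bundle of flags:

# -U Mozilla   send "Mozilla" as the User-agent string
# -m           mirror the site (recursive, with timestamping)
# -k           convert links in the saved pages so they work when browsed locally
# -E           add .html extensions where the server did not use them
# -p           fetch page requisites (images, CSS) needed to render each page
# -np          do not ascend to the parent directory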
I was testing the "-U Mozilla" prior to doing GISS and didn't want a syntax error to lock me out for a day again… (GISS is picky about scraping, and gave me a one day lockout on my first scrape attempt.)
How much disk did that take?
root@odroid32:/LVM/chiefiowp# du -ms chiefio.wordpress.com/
1373    chiefio.wordpress.com/
About 1.4 GB. Not bad, but I can see I need to check where the "free" disk limit is on WordPress ;-)
GISS
This one was more problematic. With the news being that President Trump would be refocusing NASA on space, and out of the politicized field of Climate, I’d figured a nice thing to do would be to preserve a copy. A couple of folks “tipped” this, but this is the link I can find at the moment. From P.G. here:
https://chiefio.wordpress.com/2017/02/01/tips-february-2017/#comment-79793
Trump scrapping NASA climate research division in crackdown on ‘politicized science’
February 19, 2017
Donald Trump is poised to eliminate all climate change research conducted by Nasa as part of a crackdown on “politicized science”, his senior adviser on issues relating to the space agency has said.

Nasa’s Earth science division is set to be stripped of funding in favor of exploration of deep space, with the president-elect having set a goal during the campaign to explore the entire solar system by the end of the century. This would mean the elimination of Nasa’s world-renowned research into temperature, ice, clouds and other climate phenomena. Nasa’s network of satellites provide a wealth of information on climate change, with the Earth science division’s budget set to grow to $2bn next year. By comparison, space exploration has been scaled back somewhat, with a proposed budget of $2.8bn in 2017.

Bob Walker, a senior Trump campaign adviser, said there was no need for Nasa to do what he has previously described as “politically correct environmental monitoring”. “We see Nasa in an exploration role, in deep space research,” Walker told the Guardian. “Earth-centric science is better placed at other agencies where it is their prime mission.

“My guess is that it would be difficult to stop all ongoing Nasa programs but future programs should definitely be placed with other agencies. I believe that climate research is necessary but it has been heavily politicized, which has undermined a lot of the work that researchers have been doing. Mr Trump’s decisions will be based upon solid science, not politicized science.”
Well, to me, that sure sounded like GISS climate work, and GIStemp, were likely to get the boot. Having been responsible for backups and archives at companies for much of my professional life, I naturally thought: "Make a Golden Master Archive" of what you can.
Well, my first attempt was immediately slapped down by a bot assassin. Details in comments here:
https://chiefio.wordpress.com/2017/02/01/tips-february-2017/#comment-79749
https://chiefio.wordpress.com/2017/02/01/tips-february-2017/#comment-79767
https://chiefio.wordpress.com/2017/02/01/tips-february-2017/#comment-79820
The bottom line of all that is that NASA GISS has anti-scraper settings in their robots.txt file. I did get the scrape to work, after waiting a day or two for the block to expire. The command that worked is:
wget -U Mozilla --wait=10 --limit-rate=50K -mkEpnp https://data.giss.nasa.gov
Most likely one could leave out the "--limit-rate" and even the "--wait" options, but as I'm still working off the "superghcnd" TB wad, I didn't want to slow it down. The "--wait" says to pause that many seconds between fetches (so it looks like someone clicked a key) and the "--limit-rate" makes it polite about being a bandwidth hog. The "-U Mozilla" says to tell the site, when asked, that I'm really the Mozilla browser. You can put many different browser types in that spot, as you like.
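Since a scrape like this can run for days, I put the one-liner in a little executable file and launch it with nohup so it keeps running after I log out; that is where the nohup.out log files mentioned further down come from. The script name here is just illustrative:

nohup ./scrape_giss &
tail -f nohup.out    # peek at progress now and then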
As of now (all of a few hours of running, waiting and rate-limiting) I've already got some data on downloads. Here's what I've got so far:
root@odroid32:/LVM/GISS/data.giss.nasa.gov# ls
cassini   dust_tegen  impacts     mineralfrac  precip_cru  sageii
ch4_fung  efficacy    index.html  modelE       precip_dai  seawifs
co2_fung  gistemp     landuse     modelforce   robots.txt  stormtracks
csci      imbalance   mcrates     o18data      rsp_air     swing2
root@odroid32:/LVM/GISS/data.giss.nasa.gov# du -ms *
2    cassini
22   ch4_fung
1    co2_fung
8    csci
21   dust_tegen
7    efficacy
1    gistemp
1    imbalance
130  impacts
1    index.html
1    landuse
3    mcrates
49   mineralfrac
259  modelE
2    modelforce
1    o18data
1    precip_cru
1    precip_dai
1    robots.txt
2    rsp_air
1    sageii
7    seawifs
5    stormtracks
349  swing2
So 14 of the 22 directories are either done or in progress (one is actively downloading at the moment; I can see in another window that it is the modelE directory).

That leaves only 8 more directories to go. Of the 24 items listed, one is the file 'index.html' and another is robots.txt, so not directories. A total of 879 MB so far. Unless something is very, very large in the other directories, not a big scrape load, really. We'll see when it completes.
Now, about that robots file… Sites can publish a file that tells your code, basically, "If you are not a human, but are a computer robot doing a task for a human, don't do this list of things." Here's the robots file from GISS:
root@odroid32:/LVM/GISS/data.giss.nasa.gov# cat robots.txt
User-agent: *
Disallow: /cgi-bin/
Disallow: /gistemp/graphs/
Disallow: /gfx/
Disallow: /modelE/transient/
Disallow: /outgoing/
Disallow: /pub/
Disallow: /tmp/

User-agent: msnbot
Crawl-delay: 480
Disallow: /cgi-bin/
Disallow: /gfx/
Disallow: /modelE/transient/
Disallow: /tmp/

User-agent: Slurp
Crawl-delay: 480
Disallow: /cgi-bin/
Disallow: /gfx/
Disallow: /modelE/transient/
Disallow: /tmp/

User-agent: Scooter
Crawl-delay: 480
Disallow: /cgi-bin/
Disallow: /gfx/
Disallow: /modelE/transient/
Disallow: /tmp/

User-agent: discobot
Disallow: /
Now I don't really care about a robots.txt file; I just "flow around it" by spoofing and saying I'm not a robot. So I've never really learned how to read one. To me, it looks like "IF your 'user agent' text is FOO, forbid / Disallow these directories". Looks like "discobot" gets screwed with nothing allowed, while "msnbot", "Scooter" and "Slurp" get a speed limit and some various transitory things blocked, all else OK. Everyone else gets a bit more blocked (but not everything, like discobot). That "*" is a wild card that usually says "match everything".
I’m not sure if being Mozilla gets me past that, or not. We’ll see when this scrape is done, if those directories are all missing, or not. (I may need to spoof a different user-agent string in a future scrape). Re-runs of scrapes only pick up what has changed or has been added (IF you set the flags right), so a rerun on a mostly static site can go very very fast. It does not hurt to re-run a scrape in those conditions.
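One detail worth knowing here: stock wget honors robots.txt on recursive fetches by default, and that behavior is separate from the User-agent string, so the spoof alone may not open those Disallowed directories. Whether the Mozilla string changes which rule set gets applied I have not verified, but if those directories do turn out to be missing, the documented knob is the robots setting, something like:

wget -e robots=off -U Mozilla --wait=10 --limit-rate=50K -mkEpnp https://data.giss.nasa.gov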
In Conclusion
So there you have it. How to snag huge chunks of data and such from various climate related sites.
You could do similar things for just about any site out there (depending on how tight they are on robots.txt, how creative you are getting past it, and how much disk you have).
I can now point my browser at that local file set and read the pages from my own disk, if desired. This is an example URL from my browser title bar:
file:///LVM/GISS/data.giss.nasa.gov/index.html
And I’m looking at the top page of the data.giss.nasa.gov site as of the time I scraped it.
Nice, eh?
From my memory of the robots.txt, it is really just an honour system that tells well behaved bots where not to go.
I thought you meant "Scrapping" GISS etc. That made much more sense than trying to download fiddled data.
Looks like GISS has divided their world into data.giss…, www.giss…, isccp.giss…, gcss-dime.giss…, pubs.giss…, and a few more.
I also noticed that download of the gistemp tarball didn’t happen.
This implies I either need to add -H (which says follow links to other hosts, i.e. from data.giss to www.giss…) or add --follow-ftp (which says to follow ftp links, off by default, go figure).
so, as usual, some fiddling required to get the whole package but not the whole world…
@Scottish Sceptic:
A site scrape does not know if it will be used as a data archive or as evidence…
BTW, in one draft I had typoed “scrap data.giss…” Freud would be proud ;-)
@wyzelli:
“yes, but” One needs to figure out if they are honoring it or not, what defaults are set, how to change them if needed….
By ‘honouring’ – it is your bot that chooses to do the honouring or not, not the server, so essentially what I am saying is that you have the choice to be a bad bot and crawl those folders or be a good bot and not go to those folders.
http://www.robotstxt.org/robotstxt.html
@Wyzelli:
I got it the first time… “One needs to figure out if they” is identical in meaning to “I need to figure out if I” but in the generic …
Re-scraping just data.giss.nasa.gov/gistemp with --follow-ftp added has already increased the size of that save directory from 656k to 25820k, and it isn't done yet.
It has not yet downloaded the tarball, but it has downloaded some other stuff that was clearly missed before, including ftp-based links such as a pdf:
as just one example. So it looks like whenever the first pass completes, I'll need to go back and re-run it with that option added to get more of the site. An open question is: will this also get the source code tarball that has so far not been retrieved by this scrape? (I manually downloaded it already just to be assured of having a most recent copy.) Then also, with their multiple high level qualifiers, do I need the -H flag, or will that wander off too far…
Golly, whole directories showing up… Wonder if the first scrape (still running) just had not gotten to all of this directory yet and this more focused one has? I’d presumed it would work all of a directory before moving on to the next one, but that doesn’t seem to be the case (looks more like a tree-walk from the file names I see going by). In any case, the directory listing of gistemp before the re-scrape:
And in the middle of the re-scrape, so may well grow even beyond this:
Well, I think I need to let the original run complete before I make too many assumptions about what added flags are needed for the wget. That at least some pdf files are ftp says that adding the --follow-ftp flag is desired; but given the non-directory-centric name space walk, it is too early to say if tarballs are being skipped or are just later in the tree-walk.
Well, the re-scrape picked up the tarball:
So at a minimum just adding –follow-ftp is enough, and in a best case the original scrape will eventually come back to that directory and look for it too.
In all cases, I don't need that -H flag (since the re-scrape got that tar.gz without it), and we have some clear ftp files that do need the ftp flag. So it looks to me like "add the --follow-ftp flag and skip the -H".
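So, pending the final verdict when the first pass finishes, the touched-up GISS command would presumably be the same one as before with the ftp-following added, along these lines:

wget -U Mozilla --wait=10 --limit-rate=50K --follow-ftp -mkEpnp https://data.giss.nasa.gov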
I admire your effort. I have a tiny fear that should we change administrations again, you might be banished to the Ecuadorian embassy, as climate data is a variable truth on one side of our befuddled aisle. Were I the exec, I’d appoint a nice neutral (like J Currey) with the charge to make all climate data Easily accessible — un-homogenized and uncorrected, exactly as collected, with surface documentation attached. It is publicly funded public data, after all. I suspect the sunshine would terminate CAGW (pun intended). Watch out for the black helicopters.
@Richard Ilfeld:
I doubt there would be any interest in me for collecting the published data. Now if I did something notorious with it, that might change. Since my personal goal is simply to be an available archive for everyone (any side) should the sites go POOF! I do not expect a problem.
But yes, I have other things I’d rather be doing than this. A formal and clean public archive would improve my life and free up my equipment.
I’m only doing this due to the “Going Out Of Business” potential of the CDIAC notice and the NASA news. Too many years of being a data preservationist and the habit doesn’t stop easily. “The data just are” and you must make sure there is a clean backup of it…
Oh, and by publishing how to do it, I’m hoping there will be a few other folks doing the same thing so I can hide a bit in the flow… and if I am fingered, someone else would have a copy too.
Well good on you, sir. This seems to me to have the potential of being one of those really important things that looked like no big deal at the time. Like you, I was raised in an era when a 'scientific' report that got a result by changing historic data would have had a very high bar to climb. I worry now, as you do, that what will eventually be presented to us as the official archive will be in fact falsified data. Your set, and that of the others you may inspire, may be the apocryphal books, but I more suspect they will be the Dead Sea Scrolls.
Playing off your metaphor: or perhaps more like the Nag Hammadi texts. Something some unknown guy stuck in a jar in the desert as his personal library, which has now confirmed much of the Dead Sea Scrolls and our present biblical texts, but includes some apparently "lost" materials.

The Dead Sea Scrolls were likely stored by devout scholars of their era. Our present Bible is a "homogenized text" created from scraps of originals and copies of copies (Masoretic, Vulgate, etc.) by the received wisdom of biblical scholars. The Nag Hammadi set looks to just be "some guy" ;-)
Yay!!!
After a month… I’ve finally finished all of 2015 ( 4 months) and 2016 (12 months) and have started on the 2017 data for the superghcnd block.
As of now, I’ve used up 5 TB of my 7 TB LVM group.
The rest ought to fit in the available space, but I’m about $20 short of another 4 TB disk in the “donations” kitty… so if anyone wants to put a bit more in, I can scrape a few more places. (Still have GISS to do, along with some others).
With about 60 days to go, and 11 GB / day, that’s 660 GB to complete the current set. That ought to leave about 1 TB of empty space for other sites, but I don’t know how big they are. Also, a few months more from superghcnd and that fills it…
Well, after a month? or maybe more, I’ve finally had the superghcnd finish.
Of all things, I am “up to date”.
Golly.
It is a bit of a mix, what with the other NCDC stuff and some GISS stuff, but at the moment I have 5.9 TB (yes, terabytes) of stuff on my disk.
And yes, I know I’ve “fingered” myself via the time stamp of completion. This IS the Trump administration after all. (I’d have delayed a couple of days and done some “other things” were it still Obama Town…)
At this point I have all of the NCDC "stuff" downloaded into a small bucket of "only" 6-ish TB.
I'd like to also do a few other places, and keep this one up to date, but at the moment I have a little under 1 TB free. As superghcnd takes about 11 GB a day, that's about 90 days of superghcnd (assuming nothing else of interest shows up) and then I'm "full up".
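A quick way to keep an eye on that runway is just to watch the free space on the volume and divide by the daily growth. A rough sketch:

df -h /LVM                      # how much space is left on the LVM file system
echo "$(( 1000 / 11 )) days"    # ~1 TB free at ~11 GB/day is roughly 90 days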
I’m thinking that at this point maybe I ought to just do one last “touch up” scrape, then shut down the disks and declare it an unchanging archive. Opinions are solicited…
@EMSmith; yes, you may need to cool your jets for a few days and do clean up. Sometimes the faster I go, the behinder I get…pg
OK, now that it has completed and sat for a day doing nothing, I'm thinking maybe it is time to look at some size information. First up, the LVM and how much is used in total:
Notice that this is done with the -m flag for Megabytes. So it is a 7 TB system, and 5.7 TB are used. A tiny bit under 1 TB is still free. As this is the collection of a 4 TB, 2 TB, and 1.5 TB disks and they were added in that order, the first two are full and the last one is about 1/2 TB full.
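For anyone wanting to duplicate the storage side, the LVM group itself is just the stock Linux LVM tool chain. A minimal sketch, with the device, volume group, and logical volume names being placeholders for whatever your disks and naming turn out to be:

pvcreate /dev/sda1 /dev/sdb1 /dev/sdc1            # mark each disk partition for LVM use
vgcreate scrape_vg /dev/sda1 /dev/sdb1 /dev/sdc1  # pool them into one volume group
lvcreate -l 100%FREE -n scrape_lv scrape_vg       # one big logical volume across the pool
mkfs.ext4 /dev/scrape_vg/scrape_lv                # put a file system on it
mount /dev/scrape_vg/scrape_lv /LVM               # mount it where the scrapes live
# adding another disk later: vgextend scrape_vg /dev/sdd1
#                            lvextend -r -l +100%FREE /dev/scrape_vg/scrape_lv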
That’s a lot of data.
So what is it?
First off, there are some random small bits at the top level. May as well mention them. They are largely irrelevant to the large data totals, but need to be mentioned so the bits all add up. I'll cover the directories in another section. This is edited down to just the files lying around at the top level.
Three of these are just "du -ms" outputs counting up how much disk was used at various times: the two at the top that start with 1DU, and the one named GISS_part-DU where I did an intermediate measure during the scrape. I can dump that one now as I need a new one. Clearly the 1DU with today's date stamp is "in progress" as I type ;-)
Then there are 3 “nohup.out” files. One from my own scrape of this site, one from my first sample of the John Daily site, and then the one for the final pass of the “superGHCNd” blob. It finished on Saturday about this time, but had been launched on the 7th, so needed another run to catch it up to the 10th. Yes, that’s 52 MB of log file just for the ‘touch up’ of 3 days data.
Those log files will eventually be moved into the Logs directory (and then once looked over, deleted). What’s in it now?
Yeah, about 7.5 GB of various logs… I think I’ll prune those later today…
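The pruning itself is nothing fancy; something along the lines of compressing anything more than a month old and deleting what has already been looked over (the path here is illustrative):

find /LVM/Logs -name 'nohup*' -mtime +30 -exec gzip {} \;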
So about those directories… What size are they?
Clearly the top line, ftp.ncdc.noaa.gov, is almost all of it. 6 TB. (Of that, most of it is the superghcnd block that I'll measure down below).
I’ve done a partial download of GISS in that data.giss.nasa.gov block, but they have a byzantine naming system where that first level qualifier changes a lot. I need to sort out the flags better to just get the rest of it, or make dedicated scrapers for each of the high level qualifiers. So, for example, when I use the browser to pull up the top level of the GISS site:
It includes a link to pick up clouds:
which has the URL:
Note that links out of my files and on to their web site. Useful for some purposes, not so useful if the goal is preservation of all their stuff in case of budget removal… So “some thinking required” on the exact scrape flags and / or how many scrape URLs to explicitly state.
But you can figure that the GISS blob WILL grow if rerun to get the rest of it, just how much is unclear.
We've covered the logs already. Next up is this site, chiefio.wordpress.com, at a paltry 1.4 GB ;-) It looks like everything is there, but many of the links still point back to the internet from the index page. Is it a flags issue (not set right to modify links) or a scope issue (not picking up things outside the specified URL scope)? Don't know yet. But a bit of QA and a re-run are needed at some point.
Then we have the John Daily site. I know it is being “kept up” as a kind of memorial, but there is a lot of good stuff there that I’d rather not have evaporate if folks lose interest in memorials. This is just from a VERY short test / sample run. Again, more work on flags and such needed.
For “ems” and “From_Mac” – these are just small attempts at running a login via an NFS mounted home directory on this volume. It works, sort of, but when being heavily thrashed with TB downloads and through a modestly slow router, well, sometimes I had NFS dropouts. So I’ve abandoned that effort for now. I’m going to retest it with more ‘idle download’ activity and see if it is better. It could be an OK way to do some things and schedule the “touch up” scrapes for midnight on Sunday or something… Or I might just delete them.
All of 30 MB, so not exactly a pressing issue. Most of it is just browser cache crap from my testing and some generic tool sets I like to have around. In any case, it is a halfway house to my final goal anyway: a dedicated locked-down SBC running a secured NFS server from encrypted disks. Pull the power and it's invisible… and locked down. Whenever I get around to doing the rest of that, this becomes irrelevant. But it is a good test bed for "Do I want to use LVM on that server too?". And the answer is "Only if it is lightly loaded with network stuff other than NFS"…
Finally, the last two bits. They showed up during testing of some “follow links to other hosts” stuff. Basically left over junk at this point. (At one time I had a full screen of them as one attempt began to scrape the world… watch those flags!)
Finally, about that superghcnd blob:
Again, note the -m flag so megabytes. So this is 5.3 TB of something… I think it is hourly data for selected stations, but have not gone digging yet. It is current as of yesterday, but grows by 11 GB/day so I can likely keep up doing one scrape / weekend for 10 to 12 hours. That Sunday Scrape is going to be busy ;-)
“Someday” I'm going to do some comparisons between the files and figure out a cheaper way to store this. (That is, do I now have 400 copies of the same first 99.9% and really just need to save the much smaller 'diff' files from each day going backward? Or is there utility in the full files?) It is about a $140 question (the cost of that much disk) so may not be worth the time to answer.
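If I do go digging, the first question is just how similar two consecutive days really are. A crude first check, without unpacking anything to disk (the second file name follows the pattern of the first, so treat both as illustrative):

zcat superghcnd_full_20170204.csv.gz | head -n 1000000 | md5sum
zcat superghcnd_full_20170205.csv.gz | head -n 1000000 | md5sum
# matching sums on big leading chunks would argue for one full file plus small daily diffs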
So there you have it. The “heavy lifting” part is done. Now I need to do a bit more finish work to tidy up, and tune up the other scrapes to make sure I’m getting all that I want (and not more than that). Then a touch up rerun. Finally, put in a “one a week” touchup. That 1-a-week will take some flag twiddling too, as I do NOT want to overwrite this base set, but want to keep incremental changes noted. There’s a flag for that, but I need to get comfortable with what it does (and how it interacts with the other flags… this thing has interactions by design… I hate commands with flag interactions…)
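The "once a week" part, at least, is easy; a crontab entry along these lines (script name and log path are illustrative) would fire the touch-up off at midnight each Sunday:

# m h dom mon dow  command
0 0 * * 0 /home/chiefio/bin/scrape_ncdc >> /LVM/Logs/scrape_ncdc.log 2>&1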
Oh, and it has been very nice this last day not fighting a massive download for web access ;-) Having to suck 6 TB through a soda straw 24 x 7 for a month can be a real PITA at times. I did notice that in the last couple of weeks it 'sped up'. Don't know if something changed at NASA, the Telco decided to open the spigot to get me out of the way of other things, or all the "regulars" downloading ftp.ncdc.noaa.gov (who would have gotten slammed by the same sudden arrival of another 5 TB of data) finally finished their scrapes, leaving more bandwidth for me.
Here’s a bit from the recent log file: (scroll to the right to see speeds for each segment)
Then some from an earlier log file:
Clearly “something happened”, but who knows what. (Maybe the CIA, NSA, FBI, FSB, etc. etc. all collectively decided to turn off their feed of my downloads as they didn’t have the disk ready for it ;-) and the improved router efficiency kicked in ;-)
Well this is interesting…
Mounting /LVM/ems as the home directory even with the scrape idle “has issues”. The connection has interrupts or some such. Works fine from the R.Pi and the Odroid and from several other individual disks, but not from the Orange Pi on LVM.
I’ll try it from a dedicated disk on the Orange Pi just to sort it between the LVM group and the board / OS level, but at this point it is pretty clear that it isn’t going to work as a remote NFS home directory.
FWIW I am using the Mac-with-no-SSD running from an SD Card as target for the mount. This machine takes significant pauses at times as the SD is way slower than the proper SSD and the MacOS is highly chatty to “disk”. (cache for the web browser is big and active). It is highly likely nobody has tested NFS with something this slow / obscure / prone to pauses.
I could likely do a lot of ‘tuning’ and make it better, but it is unlikely to be worth it when the Odroid make a fine NFS server and I really want isolation between the web facing scraper and the interior facing NFS server. (i.e. this was more exploring / playing with the tech than infrastructure build).
But at least I’ve now eliminated “scraping load” as the cause of the “issues”.
Well, don’t know what to make of this…
Added a TB dedicated disk. Moved /LVM/ems onto it. Exported (and mounted) it.
Same problem with “NFS mount interrupted”.
So something about the Orange Pi, its configuration, or the Mac interaction with it "has issues" where the Odroid does not. Is it the hardware? Likely not, as both use chips that are widely used. Debian vs. Armbian? I'd suspect that the most, since implementation issues on new ports are common, especially in edge cases like a Mac with long timeouts.
Mounts on the Mac done the same way, so server side is where there is variation.
Oddly, the issue only seems to show up when launching a browser (perhaps due to the high cache load) while things like the ls command have no issue.
Well, it goes to the “someday” list. I’ll try using the scrape products via NFS and see how that goes (copy compressed wads, decompressing, etc) and move the NFS home dir stuff to another board…
@EMSmith; the eagle flies, finally, hope you can find more storage space for your treasures…pg