Scraping GISS, CDIAC, NCDC / NCEI, and Me

This is partly just an “aggregator” of things already discussed. Some in specific articles, some in “tips” as I was just making some notes as I went along. I’m putting this up for some added information and so that finding the other bits is easier in the future.

First off, what is “scraping” a site, and why do it?

Scraping a site is, in essence, just making a full copy of it for later use as an archive, or as an offline copy. You do it to preserve what is there, either at a point in time or as protection against loss.

For reasons beyond my ken, some site operators don’t like that. Partly, I can see it if they are being hit hard by a bunch of site scrapers, all of them wide open on fast links. That can saturate their internet connection and is a sort of ‘denial of service’ to others. For those of us on slow home links this isn’t an issue, but we tend to get whacked by the same “protective” measures used against the others. Oh Well.

There are fairly trivial ways to bypass that kind of block, and for starters one can just use polite settings in a site scraping script. Most such ‘scripts’ are really just a one line command, but I put them in an executable file anyway, so it is a trivial kind of script.

The preferred command is “wget” (at least, it is my preferred command), which stands for “Web Get”, as that is what it does. It goes out on the web and gets stuff. There are many parameters you can set; most of them can be ignored. But if you run into issues, RTFM on wget. Read The (um) “Friendly” Manual.
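
As a rough example of what I mean, the “script” is usually just a wrapper file along these lines (the URL here is only a placeholder; the real commands I used are shown further down):

#!/bin/sh
# polite site scrape: mirror the directory, don't climb to the parent,
# wait 10 seconds between fetches, and cap the bandwidth used
wget -m -np -w 10 --limit-rate=100k http://example.com/some/directory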

Prior postings have looked specifically at doing a site scrape of the NOAA/NCDC data and site (NCDC now renamed, to protect the guilty, to NCEI, though the links / paths keep the old name), along with the CDIAC site (Carbon Dioxide Information Analysis Center). Since CDIAC has posted a “Going Off Line Real Soon Now” notice on their site, I figured it would be a “very good thing” to capture and preserve what I could, as it is unclear where, or if, it will come back on line.

NOTICE (August 2016): CDIAC as currently configured and hosted by ORNL will cease operations on September 30, 2017. Data will continue to be available through this portal until that time. Data transition plans are being developed with DOE to ensure preservation and availability beyond 2017.

So it says it will be preserved and available, but… I snagged a copy of what was publicly available anyway. This also means that, over time, I don’t need to whack their site just to look at a particular bit of data, nor do I need to take the network traffic load. All good things. My take on it is here:

https://chiefio.wordpress.com/2017/01/30/scraping-noaa-and-cdiac/

So how big is this bundle? I have a little command named DU that tots up disk usage, sorts it, and prints out a nice summary in a dated file. It looks like this:

root@odroid32:/WD4/ext/7Feb2017_Scrape# cat ~chiefio/bin/DU

du -BMB -s * .[a-z,A-Z]* | sort -rn > 1DU_`date +%Y%b%d` &

#du -ks * .[a-z]* .[A-Z]* | sort -rn > 1DU_`date +%Y%b%d` &

The -BMB causes the Macintosh to barf, so you can use -ms instead of “-BMB -s” and it is fine. One gives you megabytes in binary (1024 kB each) while the other gives them in base 10 (1000 kB each), so most folks will not care. I also have a commented out “-ks” form that gives the KB count for things too small for MB to be informative… All that .[a-z] .[A-Z] stuff is there to catch the hidden files in your home directory that you don’t normally see: the ones starting with a “.”, so not normally displayed.
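
For the Mac, then, the same line would look something like this (just -ms swapped in for the “-BMB -s” part, otherwise identical):

du -ms * .[a-z,A-Z]* | sort -rn > 1DU_`date +%Y%b%d` &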

root@odroid32:/WD4/ext/7Feb2017_Scrape# cat 1DU_2017Feb17 
163382MB	Temps
142051MB	cdiac.ornl.gov
15875MB	GHCN_Daily_NOAA_NCDC
2413MB	Old_Logs
1MB	lost+found
1MB	1DU_2017Jan22

So the scrape of NOAA / NCDC was all of 15.9 GB, and that of CDIAC was 142 GB. A lot, but quite manageable. The commands used were a mixed set over time. (wget is smart and doesn’t download a new copy of things that have not changed.) I’ve commented out various iterations; at times I’d used flags to limit total bandwidth, or to keep things simpler. All of them worked, though in slightly different ways. I broke up the fetches into chunks, so I could get any given bit updated by just commenting out, or uncommenting, various bits. Note that the only line presently active is the first one, which lacks the “-np” flag. By leaving off that “no parent” flag, it fetches all of USHCN Daily first, then wanders up the parent directory and back down again, collecting most everything not blocked. That would normally be an “error” (which is why the others have “-np”), but as I wanted to preserve the site, I let it walk the whole tree, parent directories included.

# cdiac.ornl.gov USHCN Daily

echo
echo Doing cdiac.ornl.gov USHCN Daily
echo

wget -m http://cdiac.ornl.gov/ftp/ushcn_daily

#wget -m -np http://cdiac.ornl.gov/ftp/ushcn_daily
#wget -m -np -w 10 http://cdiac.ornl.gov/ftp/ushcn_daily

#wget -w 10 --limit-rate=100k -np -m http://cdiac.ornl.gov/ftp/ushcn_daily
#wget -r -N -l inf --no-remove-listing -w 10 --limit-rate=100k -np http://cdiac.ornl.gov/ftp/ushcn_daily

echo
echo Doing World Weather Records
echo

#wget -np -m ftp://ftp.ncdc.noaa.gov/pub/data/wwr/
#wget -np -m -w 20 ftp://ftp.ncdc.noaa.gov/pub/data/wwr/

#wget --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/wwr/

#wget --limit-rate=100k -nc -np -r -l inf ftp://ftp.ncdc.noaa.gov/pub/data/wwr/

echo
echo Doing World War II Data
echo

#wget -np -m ftp://ftp.ncdc.noaa.gov/pub/data/ww-ii-data/

#wget -np -m -w 20  ftp://ftp.ncdc.noaa.gov/pub/data/ww-ii-data/

#wget --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/ww-ii-data/

#wget --limit-rate=100k -nc -np -r -l inf ftp://ftp.ncdc.noaa.gov/pub/data/ww-ii-data/

Of all the directories and files that are grabbed, only a handful exceed one MB in size:

root@odroid32:/WD4/ext/7Feb2017_Scrape/cdiac.ornl.gov# cat 1DU_mb_out 
125576	ftp
574	oceans
167	epubs
74	trends
70	SOCCR
25	programs
22	carbonmanagement
19	newsletr
16	images
11	wwwstat.html
4	science-meeting
3	ndps
2	datasets

All the rest are 1 MB or smaller. Here’s the listing:

root@odroid32:/WD4/ext/7Feb2017_Scrape/cdiac.ornl.gov# ls
1DU_mb_out		     ftp.2
about			     ftpdir
aerosol_parameters.html      GCP
aerosol_particle_types.html  glossary.html
aerosols.html		     halons.html
authors			     hcfc.html
backgrnds		     hfcs.html
by_new			     home.html
carbon_cycle_data.html	     hydrogen.html
carbon_cycle.html	     ice_core_no.html
carbonisotopes.html	     ice_cores_aerosols.html
carbonmanagement	     icons
carbonmanagement.1	     images
carbonmanagement.10	     includes
carbonmanagement.11	     index.html
carbonmanagement.12	     js
carbonmanagement.13	     land_use.html
carbonmanagement.14	     library
carbonmanagement.2	     methane.html
carbonmanagement.3	     methylchloride.html
carbonmanagement.4	     methylchloroform.html
carbonmanagement.5	     mission.html
carbonmanagement.6	     modern_aerosols.html
carbonmanagement.7	     modern_halogens.html
carbonmanagement.8	     modern_no.html
carbonmanagement.9	     ndps
cdiac			     new
cdiac_welcome.au	     newsletr
cfcs.html		     newsletter.html
chcl3.html		     no.html
climate			     oceans
CO2_Emission		     oceans.1
CO2_Emission.1		     oceans.10
CO2_Emission.10		     oceans.2
CO2_Emission.11		     oceans.3
CO2_Emission.12		     oceans.4
CO2_Emission.13		     oceans.5
CO2_Emission.14		     oceans.6
CO2_Emission.15		     oceans.7
CO2_Emission.16		     oceans.8
CO2_Emission.2		     oceans.9
CO2_Emission.3		     oxygenisotopes.html
CO2_Emission.4		     ozone.html
CO2_Emission.5		     permission.html
CO2_Emission.6		     pns
CO2_Emission.7		     programs
CO2_Emission.8		     recent_publications.html
CO2_Emission.9		     science-meeting
comments.html		     search.html
css			     sfsix.html
data			     shutdown-notice.css
data_catalog.html	     SOCCR
datasets		     staff.html
datasubmission.html	     tetrachloroethene.html
deuterium.html		     trace_gas_emissions.html
disclaimers.html	     tracegases.html
epubs			     trends
factsdata.html		     vegetation.html
faq.html		     wdca
frequent_data_products.html  wdcinfo.html
ftp			     whatsnew.html
ftp.1			     wwwstat.html

You can see that a lot of it is just the html files that make the site go.

Most of the actual volume is the ftp site, as you would expect.

OK, that’s how you can grab a copy of CDIAC before the world changes…

NOAA NCDC / NCEI

The NOAA/NCDC scrape was a similar command. You will note that in this listing all of it is commented out except the last bit, which is getting “superghcnd”. That was added after this first scrape, and it is HUGE. So it is not in the above size information (it isn’t done yet). As I had just finished the other bits, I commented them out. Now it only chews on a chunk of superghcnd when I launch it:

echo
echo Doing NOAA set
echo

#wget -np -m  ftp://ftp.ncdc.noaa.gov/pub/data/noaa/

#wget -np -m  -w 10 ftp://ftp.ncdc.noaa.gov/pub/data/noaa/

#wget --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/noaa/

#wget -nc -np -r -l inf ftp://ftp.ncdc.noaa.gov/pub/data/noaa/

echo
echo Doing Global Data Bank set
echo

#wget  -np -m ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/

#wget -w 10 --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/

#wget  -np -m -w 10 ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/

echo
echo Doing GHCN
echo

#wget  -np -m ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/

#wget -w 10 --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/

#wget  -np -m -w 10 ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/

echo
echo Doing GHCN -daily-   SuperGHCNd
echo

wget -np -m -w 10 ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/superghcnd/superghcnd_full_20170204.csv.gz

#wget  -np -m ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/

SO FAR I’m at 2.5 TB or so of that superghcnd daily data. It is about 10 GB / day and about 1.5 years’ worth.

root@odroid32:/LVM/ftp.ncdc.noaa.gov/pub/data/ghcn/daily/superghcnd# du -ms .
2573927	.

I’m figuring on about 4 TB when it is done, so be advised…

My Site

I also grabbed a locally readable mirror of my own site. This lets me look at it with a browser offline. Nice for checking old articles without creating web traffic, like when on a slow link (let it scrape all night, then browse lightning fast during the day). It is a ‘snapshot’, so not useful for things like recent comments and / or interaction. Some images get downloaded; other things remain live links to the outside world (like video from youtube), so it isn’t 100% network free. (Tuning parameters to wget can grab more of the stuff that links outside the original site, but I’ve not done that yet. It is tricky to not end up scraping the entire world… set the depth to capture all links, at all depths, and you end up putting the whole internet on your disk drive…)

What command did I use?

wget -U Mozilla -mkEpnp https://chiefio.wordpress.com

I was testing the “-U Mozilla” prior to doing GISS and didn’t want a syntax error to lock me out for a day again… (GISS is picky about scraping, and gave me a one-day lockout on my first scrape attempt.)
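
For reference, here is my reading of what that pile of single-letter flags does (check the wget man page for your version; defaults do change):

# wget -U Mozilla -mkEpnp https://chiefio.wordpress.com
#  -U Mozilla : send "Mozilla" as the user-agent string
#  -m  : mirror; turns on recursion, timestamping, and infinite depth
#  -k  : convert links in the saved pages so they work when browsed locally
#  -E  : adjust extensions (add .html) so a local browser knows what to do with the files
#  -p  : also grab page requisites (images, css) needed to display each page
#  -np : no parent; don't climb above the starting directory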

How much disk did that take?

root@odroid32:/LVM/chiefiowp# du -ms chiefio.wordpress.com/
1373	chiefio.wordpress.com/

1.3 GB. Not bad, but I can see I need to check where the “free” limit on disk is located on WordPress ;-)

GISS

This one was more problematic. With the news being that President Trump would be refocusing NASA on space, and out of the politicized field of Climate, I’d figured a nice thing to do would be to preserve a copy. A couple of folks “tipped” this, but this is the link I can find at the moment. From P.G. here:

https://chiefio.wordpress.com/2017/02/01/tips-february-2017/#comment-79793

https://www.europebreakingnews.net/2017/02/trump-scrapping-nasa-climate-research-division-in-crackdown-on-politicized-science/

Trump scrapping NASA climate research division in crackdown on ‘politicized science’

February 19, 2017
Donald Trump is poised to eliminate all climate change research conducted by Nasa as part of a crackdown on “politicized science”, his senior adviser on issues relating to the space agency has said. Nasa’s Earth science division is set to be stripped of funding in favor of exploration of deep space, with the president-elect having set a goal during the campaign to explore the entire solar system by the end of the century. This would mean the elimination of Nasa’s world-renowned research into temperature, ice, clouds and other climate phenomena. Nasa’s network of satellites provide a wealth of information on climate change, with the Earth science division’s budget set to grow to $2bn next year. By comparison, space exploration has been scaled back somewhat, with a proposed budget of $2.8bn in 2017. Bob Walker, a senior Trump campaign adviser, said there was no need for Nasa to do what he has previously described as “politically correct environmental monitoring”. “We see Nasa in an exploration role, in deep space research,” Walker told the Guardian. “Earth-centric science is better placed at other agencies where it is their prime mission. “My guess is that it would be difficult to stop all ongoing Nasa programs but future programs should definitely be placed with other agencies. I believe that climate research is necessary but it has been heavily politicized, which has undermined a lot of the work that researchers have been doing. Mr Trump’s decisions will be based upon solid science, not politicized science.”

Well, to me, that sure sounded like GISS climate work, and GIStemp, were likely to get the boot. Having been responsible for backups and archives at companies for much of my professional life, I naturally thought: “Make a Golden Master Archive” of what you can.

Well, my first attempt was immediately slapped down by a bot assassin. Details in comments here:

https://chiefio.wordpress.com/2017/02/01/tips-february-2017/#comment-79749

https://chiefio.wordpress.com/2017/02/01/tips-february-2017/#comment-79767

https://chiefio.wordpress.com/2017/02/01/tips-february-2017/#comment-79820

The bottom line of all that is that NASA GISS has anti-site-scraper settings in their robots.txt file. I did get the scrape to work, after waiting a day or two for the block to expire. The command that worked is:

wget -U Mozilla --wait=10 --limit-rate=50K -mkEpnp https://data.giss.nasa.gov

Most likely one could leave out the “--limit-rate” and even the “--wait” flags, but as I’m still working off the “superghcnd” TB wad, I didn’t want to slow that down. The “--wait” says to pause that many seconds between fetches (so it looks like someone clicked a key) and the “--limit-rate” makes it polite about being a bandwidth hog. The “-U Mozilla” says to tell the site, when asked, that I’m really a Mozilla browser. You can put many different browser types in that spot, as you like.

As of now (all of a few hours of running, waiting and rate-limiting) I’ve already got some data on downloads. Here’s what I’ve got so far:

root@odroid32:/LVM/GISS/data.giss.nasa.gov# ls
cassini   dust_tegen  impacts	  mineralfrac  precip_cru  sageii
ch4_fung  efficacy    index.html  modelE       precip_dai  seawifs
co2_fung  gistemp     landuse	  modelforce   robots.txt  stormtracks
csci	  imbalance   mcrates	  o18data      rsp_air	   swing2
root@odroid32:/LVM/GISS/data.giss.nasa.gov# du -ms *
2	cassini
22	ch4_fung
1	co2_fung
8	csci
21	dust_tegen
7	efficacy
1	gistemp
1	imbalance
130	impacts
1	index.html
1	landuse
3	mcrates
49	mineralfrac
259	modelE
2	modelforce
1	o18data
1	precip_cru
1	precip_dai
1	robots.txt
2	rsp_air
1	sageii
7	seawifs
5	stormtracks
349	swing2

So there are 14 out of 22 directories either done or in progress (one of them is actively downloading at the moment; I can see in another window that it is the modelE directory).

That leaves only 8 more directories to go. (One of the items listed is the file ‘index.html’ and another is the robots.txt file, so those are not directories.) A total of 879 MB so far. Unless something is very, very large in the other directories, not a big scrape load, really. We’ll see when it completes.

Now, about that robots file… A site can publish a file that tells your code, basically, “If you are not a human, but are a computer robot doing a task for a human, don’t do this list of things.” Here’s the robots file from GISS:

root@odroid32:/LVM/GISS/data.giss.nasa.gov# cat robots.txt 
User-agent: *
Disallow: /cgi-bin/
Disallow: /gistemp/graphs/
Disallow: /gfx/
Disallow: /modelE/transient/
Disallow: /outgoing/
Disallow: /pub/
Disallow: /tmp/

User-agent: msnbot
Crawl-delay: 480
Disallow: /cgi-bin/
Disallow: /gfx/
Disallow: /modelE/transient/
Disallow: /tmp/

User-agent: Slurp
Crawl-delay: 480
Disallow: /cgi-bin/
Disallow: /gfx/
Disallow: /modelE/transient/
Disallow: /tmp/

User-agent: Scooter
Crawl-delay: 480
Disallow: /cgi-bin/
Disallow: /gfx/
Disallow: /modelE/transient/
Disallow: /tmp/

User-agent: discobot
Disallow: /

Now I don’t really care about a robots.txt file; I just “flow around it” by spoofing and saying I’m not a robot. So I’ve never really learned how to read one. To me, it looks like “IF your ‘user agent’ text is FOO, forbid / Disallow these directories”. Looks like “discobot” gets screwed, with nothing allowed, while “msnbot, Scooter and Slurp” get a speed limit and some various transitory things blocked, all else OK. Everyone else gets even more blocked (though not blocked from everything the way discobot is). That “*” is a wild card that usually says “match everything”.

I’m not sure if being Mozilla gets me past that or not. We’ll see, when this scrape is done, whether those directories are all missing or not. (I may need to spoof a different user-agent string in a future scrape.) Re-runs of scrapes only pick up what has changed or been added (IF you set the flags right), so a rerun on a mostly static site can go very, very fast. It does not hurt to re-run a scrape in those conditions.
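
If it turns out those Disallowed directories got skipped, wget does have an explicit knob for ignoring robots.txt entirely (how polite that is being a separate question). Something like this ought to do it, though I have not tested it against GISS:

wget -e robots=off -U Mozilla --wait=10 --limit-rate=50K -mkEpnp https://data.giss.nasa.gov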

In Conclusion

So there you have it. How to snag huge chunks of data and such from various climate related sites.

You could do similar things for just about any site out there (depending on how tight they are on robots.txt, how creative you are getting past it, and how much disk you have).

I can now point my browser at that local file set and read the pages from my own disk, if desired. This is an example URL from my browser title bar:

file:///LVM/GISS/data.giss.nasa.gov/index.html

And I’m looking at the top page of the data.giss.nasa.gov site as of the time I scraped it.

Nice, eh?


20 Responses to Scraping GISS, CDIAC, NCDC / NCEI, and Me

  1. wyzelli says:

    From my memory of the robots.txt, it is really just an honour system that tells well behaved bots where not to go.

  2. I thought you meant “Scrapping” GISS etc. That made much more sense than trying to download fiddled data.

  3. E.M.Smith says:

    looks like GISS has divided their world into data.giss… “www.giss…” and isccp.giss… and gcss-dime.giss… and pubs.giss… and a few more.

    I also noticed that download of the gistemp tarball didn’t happen.

    This implies I either need to add -H (that says follow links to other hosts, i.e. from data.giss to www.giss…) or add --follow-ftp (that says to follow ftp links, which is off by default, go figure)

    so, as usual, some fiddling required to get the whole package but not the whole world…

    @Scottish Sceptic:

    A site scrape does not know if it will be used as a data archive or as evidence…

    BTW, in one draft I had typoed “scrap data.giss…” Freud would be proud ;-)

    @wyzelli:

    “yes, but” One needs to figure out if they are honoring it or not, what defaults are set, how to change them if needed….

  4. wyzelli says:

    By ‘honouring’ – it is your bot that chooses to do the honouring or not, not the server, so essentially what I am saying is that you have the choice to be a bad bot and crawl those folders or be a good bot and not go to those folders.

    http://www.robotstxt.org/robotstxt.html

  5. E.M.Smith says:

    @Wyzelli:

    I got it the first time… “One needs to figure out if they” is identical in meaning to “I need to figure out if I” but in the generic …

  6. E.M.Smith says:

    Re-scraping just data.giss.nasa.gov/gistemp with just adding --follow-ftp has already increased the size of that save directory from 656k to 25820k and it isn’t done yet.

    It has not yet downloaded the tarball, but it has downloaded some other stuff that was clearly missed before, including a clear download of ftp based links including a pdf:

    ftp.ncdc.noaa.gov/pub/data/ghcn/v3/techreports/Technical Report GHCNM No15-01.pdf’
    

    as just one example. So it looks like whenever the first pass completes, I’ll need to go back and re-run it with that option added to get more of the site. An open question is: will this also get the source code tarball that has so far not been retrieved by this scrape? (I manually downloaded it already, just to be assured of having a most recent copy.) Then also, with their multiple high level qualifiers, do I need the -H flag, or will that wander off too far…

  7. E.M.Smith says:

    Golly, whole directories showing up… Wonder if the first scrape (still running) just had not gotten to all of this directory yet and this more focused one has? I’d presumed it would work all of a directory before moving on to the next one, but that doesn’t seem to be the case (looks more like a tree-walk from the file names I see going by). In any case, the directory listing of gistemp before the re-scrape:

    root@odroid32:/LVM/GISS/data.giss.nasa.gov/gistemp# ls
    animations    link_animations.jpg  link_SC.gif	      references.html  tabledata_v3
    faq	      link_graphs.png	   link_stations.gif  seas_cycle.html  time_series.html
    history.html  link_LT.gif	   maps		      sources_v3       updates_v3
    index.html    link_maps.png	   news		      stdata
    

    And in the middle of the re-scrape, so may well grow even beyond this:

    root@odroid32:/LVM/GISS/data.giss.nasa.gov/gistemp# ls
    2005	      2010summer  history.html	       link_maps.png	  references.html	  stdata
    2007	      2011	  index.html	       link_SC.gif	  seas_cycle.html	  tabledata_v3
    2008	      animations  link_animations.jpg  link_stations.gif  sources_v3		  time_series.html
    2010july      faq	  link_graphs.png      maps		  station_data_v2	  updates
    2010november  FAQ.html	  link_LT.gif	       news		  station_data_v2.1.html  updates_v3
    

    Well, I think I need to let the original run complete before I make too many assumptions about what added flags are needed for the wget. That at least some pdf files are ftp says that adding the --follow-ftp flag is desired; but given the non-directory-centric name space walk, it is too early to say if tarballs are being skipped or just come later in the tree-walk.

  8. E.M.Smith says:

    Well, the re-scrape picked up the tarball:

    --2017-02-23 06:27:28--  https://data.giss.nasa.gov/gistemp/sources_v3/gistemp1.0.tar.gz
    Connecting to data.giss.nasa.gov (data.giss.nasa.gov)|128.183.4.33|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 223839 (219K) [application/x-gzip]
    Saving to: ‘data.giss.nasa.gov/gistemp/sources_v3/gistemp1.0.tar.gz’
    data.giss.nasa.gov/gistemp/sour 100%[=========================================================>] 218.59K  50.0KB/s   in 4.4s   
    2017-02-23 06:27:33 (50.0 KB/s) - ‘data.giss.nasa.gov/gistemp/sources_v3/gistemp1.0.tar.gz’ saved [223839/223839]
    

    So at a minimum just adding --follow-ftp is enough, and in a best case the original scrape will eventually come back to that directory and look for it too.

    In all cases, I don’t need that -H flag (since the re-scrape got that tar.gz without it), and we have some clear ftp files that do need the ftp flag. So it looks to me like “add the --follow-ftp flag and skip the -H”.
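
    So the next full pass over data.giss.nasa.gov presumably looks like this (the earlier command with --follow-ftp added and no -H; I haven’t run this full version yet):

    wget -U Mozilla --wait=10 --limit-rate=50K --follow-ftp -mkEpnp https://data.giss.nasa.gov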

  9. Richard Ilfeld says:

    I admire your effort. I have a tiny fear that should we change administrations again, you might be banished to the Ecuadorian embassy, as climate data is a variable truth on one side of our befuddled aisle. Were I the exec, I’d appoint a nice neutral (like J Currey) with the charge to make all climate data Easily accessible — un-homogenized and uncorrected, exactly as collected, with surface documentation attached. It is publicly funded public data, after all. I suspect the sunshine would terminate CAGW (pun intended). Watch out for the black helicopters.

  10. E.M.Smith says:

    @Richard Ilfeld:

    I doubt there would be any interest in me for collecting the published data. Now if I did something notorious with it, that might change. Since my personal goal is simply to be an available archive for everyone (any side) should the sites go POOF! I do not expect a problem.

    But yes, I have other things I’d rather be doing than this. A formal and clean public archive would improve my life and free up my equipment.

    I’m only doing this due to the “Going Out Of Business” potential of the CDIAC notice and the NASA news. Too many years of being a data preservationist and the habit doesn’t stop easily. “The data just are” and you must make sure there is a clean backup of it…

    Oh, and by publishing how to do it, I’m hoping there will be a few other folks doing the same thing so I can hide a bit in the flow… and if I am fingered, someone else would have a copy too.

  11. Richard Ilfeld says:

    Well good on you, sir. This seems to me to have the potential of being one of those really important things that looked like no big deal at the time. Like you, I was raised in an era when a ‘scientific’ report that got a result by changing historic data would have had a very high bar to climb. I worry now, as you do, that what will eventually be presented to us as the official archive will be in fact falsified data. Your set, and that of the others you may inspire, may be the apocryphal books, but I more suspect they will be the dead sea scrolls.

  12. E.M.Smith says:

    Playing off your metaphor: or perhaps more like the Nag Hamadi Texts. Something some unknown guy stuck in a jar in the desert as his personal library that has now confirmed much of the Dead Sea Scrolls and our present biblical texts, but includes some apparently “lost” materials.

    The Dead Sea Scrolls were likely stored by devout scholars of their era. Our present Bible is a “homogenized text” created from scraps of originals and copies of copies (Masoritic, Vulgate, etc.) by the received wisdom of biblical scholars. The Nag Hamadi set looks to just be “some guy” ;-)

  13. E.M.Smith says:

    Yay!!!

    After a month… I’ve finally finished all of 2015 ( 4 months) and 2016 (12 months) and have started on the 2017 data for the superghcnd block.

    As of now, I’ve used up 5 TB of my 7 TB LVM group.

    The rest ought to fit in the available space, but I’m about $20 short of another 4 TB disk in the “donations” kitty… so if anyone wants to put a bit more in, I can scrape a few more places. (Still have GISS to do, along with some others).

    With about 60 days to go, and 11 GB / day, that’s 660 GB to complete the current set. That ought to leave about 1 TB of empty space for other sites, but I don’t know how big they are. Also, a few months more from superghcnd and that fills it…

  14. E.M.Smith says:

    Well, after a month? or maybe more, I’ve finally had the superghcnd finish.

    Of all things, I am “up to date”.

    Golly.

    it is a bit of a mix, what with the other NCDC stuff and some GISS stuff, but at the moment, I have 5.9 TB (yes, terra-bytes) of stuff on my disk.

    /dev/mapper/TemperatureData-NoaaCdiacData 7207579544 5895225440 982737372  86% /LVM
    

    And yes, I know I’ve “fingered” myself via the time stamp of completion. This IS the Trump administration after all. (I’d have delayed a couple of days and done some “other things” were it still Obama Town…)

    At this point I have all of NCDC “stuff” downloaded into a small bucket of “only” 6-ish TB.

    I’d like to also do a few other places, and keep this one up to date, but at the moment I have a little under 1 TB free. As superghcnd takes about 11 GB a day, that’s about 90 days of superghcnd (assuming nothing else of interest shows up) and them I’m “Full up”.

    I’m thinking that at this point maybe I ought to just do one last “touch up” scrape, then shut down the disks and declare it an unchanging archive. Opinions are solicited…

  15. p.g.sharrow says:

    @EMSmith; yes, you may need to cool your jets for a few days and do clean up. Sometimes the faster I go, the behinder I get…pg

  16. E.M.Smith says:

    OK, now that it has completed, and set for a day doing nothing, I’m thinking maybe it is time to look at some size information. First up, the LVM and how much is used in total:

    chiefio@orangepione:/LVM$ df -m
    Filesystem                                1M-blocks    Used Available Use% Mounted on
    [...]
    /dev/mapper/TemperatureData-NoaaCdiacData   7038652 5757057    959705  86% /LVM
    

    Notice that this is done with the -m flag for Megabytes. So it is a 7 TB system, and 5.7 TB are used. A tiny bit under 1 TB is still free. As this is the collection of a 4 TB, a 2 TB, and a 1.5 TB disk, added in that order, the first two are full and the last one is about 1/2 TB full.

    That’s a lot of data.

    So what is it?

    First off, there are some random small bits at the top level. May as well mention them. They are largely irrelevant to the large data totals, but do need to be mentioned so the bits all add up. I’ll cover the directories in another section. This is edited down to just the files lying around at the top level.

    chiefio@orangepione:/LVM$ ls -l
    total 52992
    -rw-r--r--  1 root    root         159 Feb 14 15:27 1DU_2017Feb14
    -rw-rw-r--  1 chiefio chiefio        0 Mar 11 19:13 1DU_2017Mar11
    -rw-r--r--  1 root    root      387692 Feb 28 23:08 Chief.nohup.out
    -rw-r--r--  1 root    root         431 Feb 24 16:55 GISS_part-DU
    -rw-r--r--  1 root    root     1018673 Feb 28 23:08 JD_nohup.out
    -rw-rw-r--  1 chiefio chiefio 52795264 Mar 10 22:49 Sghcn.nohup.out
    

    So three of these are just “du -ms” things counting up how much disk was used at various times: the two at the top that start with 1DU, and the one named GISS_part-DU where I did an intermediate measure during the scrape. I can dump that one now as I need a new one. Clearly the 1DU with today’s date stamp is “in progress” as I type ;-)

    Then there are 3 “nohup.out” files. One from my own scrape of this site, one from my first sample of the John Daly site, and then the one for the final pass of the “superGHCNd” blob. It finished on Saturday about this time, but had been launched on the 7th, so needed another run to catch it up to the 10th. Yes, that’s 52 MB of log file just for the ‘touch up’ of 3 days of data.

    Those log files will eventually be moved into the Logs directory (and then once looked over, deleted). What’s in it now?

    root@orangepione:/LVM# ls -l Logs
    total 7386284
    -rw------- 1 chiefio chiefio    6230016 Feb 12 00:32 19Feb_nohup
    -rw------- 1 chiefio chiefio 3044074654 Mar  2 11:27 2Mar_ncdc_nohup.out
    -rw------- 1 root    root      32636425 Feb 20 11:36 chiefio.nohup.out
    -rw------- 1 chiefio chiefio 2089393918 Feb 18 22:09 Feb18_nohup
    -rw-rw-r-- 1 chiefio chiefio 1086263799 Mar 10 17:33 Sghcn.nohup.out_10Mar2017
    -rw-rw-r-- 1 chiefio chiefio 1304920080 Mar  7 03:23 Sghcn.nohup.out_7Mar2017
    

    Yeah, about 7.5 GB of various logs… I think I’ll prune those later today…

    So about those directories… What size are they?

    root@orangepione:/LVM# cat 1DU_2017Mar11 
    6010636MB	ftp.ncdc.noaa.gov
    16730MB	data.giss.nasa.gov
    7564MB	Logs
    1444MB	chiefio.wordpress.com
    68MB	www.john-daly.com
    26MB	ems
    4MB	From_Mac
    1MB	ftp.soest.hawaii.edu
    1MB	cdiac.esd.ornl.gov
    

    Clearly the top line, ftp. ncdc. noaa. gov, is almost all of it. 6 TB. (Of that, most of it is the superghcnd block that I’ll measure down below).

    I’ve done a partial download of GISS in that data.giss.nasa.gov block, but they have a byzantine naming system where that first level qualifier changes a lot. I need to sort out the flags better to just get the rest of it, or make dedicated scrapers for each of the high level qualifiers. So, for example, when I use the browser to pull up the top level of the GISS site:

    file:///LVM/data.giss.nasa.gov/index.html
    

    It includes a link to pick up clouds:

    Clouds
    
        International Satellite Cloud Climatology Project (ISCCP)
    

    which has the URL:

    https://isccp.giss.nasa.gov/products/onlineData.html
    

    Note that that link goes out of my files and on to their web site. Useful for some purposes, not so useful if the goal is preservation of all their stuff in case of budget removal… So “some thinking required” on the exact scrape flags and / or how many scrape URLs to explicitly state.
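
    One possible shape for that, if I go the “dedicated scraper per high level qualifier” route, is just a loop over the hosts spotted so far (host list from my earlier comment, so likely incomplete, and I haven’t run this yet):

    #!/bin/sh
    # one polite mirror pass per GISS host, rather than trusting -H not to wander off
    for HOST in data.giss.nasa.gov www.giss.nasa.gov isccp.giss.nasa.gov gcss-dime.giss.nasa.gov pubs.giss.nasa.gov
    do
        wget -U Mozilla --wait=10 --limit-rate=50K --follow-ftp -mkEpnp https://$HOST
    done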

    But you can figure that the GISS blob WILL grow if rerun to get the rest of it, just how much is unclear.

    We’ve covered the logs already. Next up is this site, chiefio.wordpress.com, at a paltry 1.4 GB ;-) It looks like everything is there, but many of the links still point back to the internet from the index page. Is it a flags issue? (Not set right to modify links) or a scope issue? (not picking up things outside of the specified URL scope) Don’t know yet. But a bit of QA and a re-run needed at some point.

    Then we have the John Daly site. I know it is being “kept up” as a kind of memorial, but there is a lot of good stuff there that I’d rather not have evaporate if folks lose interest in memorials. This is just from a VERY short test / sample run. Again, more work on flags and such needed.

    For “ems” and “From_Mac” – these are just small attempts at running a login via an NFS mounted home directory on this volume. It works, sort of, but when being heavily thrashed with TB downloads and through a modestly slow router, well, sometimes I had NFS dropouts. So I’ve abandoned that effort for now. I’m going to retest it with more ‘idle download’ activity and see if it is better. It could be an OK way to do some things and schedule the “touch up” scrapes for midnight on Sunday or something… Or I might just delete them.

    All of 30 MB, so not exactly a pressing issue. Most of it is just browser cache crap from my testing and some generic tool sets I like to have around. In any case, it is a halfway house to my final goal anyway: a dedicated locked-down SBC running a secured NFS server from encrypted disks. Pull power and it’s invisible… and locked down. Whenever I get around to doing the rest of that, this becomes irrelevant. But it is a good test bed for “Do I want to use LVM on that server too?”. And the answer is “Only if it is lightly loaded with network stuff other than NFS”…

    Finally, the last two bits. They showed up during testing of some “follow links to other hosts” stuff. Basically left over junk at this point. (At one time I had a full screen of them as one attempt began to scrape the world… watch those flags!)

    Finally, about that superghcnd blob:

    root@orangepione:/LVM/ftp.ncdc.noaa.gov/pub/data/ghcn/daily# du -ms superghcnd/
    5312145	superghcnd/
    

    Again, note the -m flag so megabytes. So this is 5.3 TB of something… I think it is hourly data for selected stations, but have not gone digging yet. It is current as of yesterday, but grows by 11 GB/day so I can likely keep up doing one scrape / weekend for 10 to 12 hours. That Sunday Scrape is going to be busy ;-)

    “Someday” I’m going to do some comparisons between the files and figure out a cheaper way to store this. (That is, do I now have 400 copies of the same first 99.9% and really just need to save the much smaller ‘diff’ files from each day going backward? Or is there utility in the full file?) It is about a $140 question (cost of that much disk) so may not be worth the time to answer it…
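
    If I ever get to it, the check itself is simple enough. Something along these lines (file names just follow the superghcnd_full_YYYYMMDD.csv.gz pattern from the scrape; untested, and diff wants a lot of memory on files this size, so it may need pre-split files or a beefier box):

    # how big is the real day-to-day delta between two full dumps?
    zcat superghcnd_full_20170309.csv.gz > /tmp/day1.csv
    zcat superghcnd_full_20170310.csv.gz > /tmp/day2.csv
    diff /tmp/day1.csv /tmp/day2.csv | gzip > delta_20170310.gz
    ls -l delta_20170310.gz   # if this is tiny, keeping daily diffs beats keeping full copies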

    So there you have it. The “heavy lifting” part is done. Now I need to do a bit more finish work to tidy up, and tune up the other scrapes to make sure I’m getting all that I want (and not more than that). Then a touch up rerun. Finally, put in a “one a week” touchup. That 1-a-week will take some flag twiddling too, as I do NOT want to overwrite this base set, but want to keep incremental changes noted. There’s a flag for that, but I need to get comfortable with what it does (and how it interacts with the other flags… this thing has interactions by design… I hate commands with flag interactions…)

    Oh, and it has been very nice the last day not fighting a massive download for web access ;-) Having to suck 6 TB through a soda straw 24 x 7 for a month can be a real PITA at times. I did notice that in the last couple of weeks it ‘sped up’. Don’t know if something changed at NOAA, or the Telco decided to open the spigot to get me out of the way of other things, or all the “regulars” who were downloading ftp. ncdc. noaa. gov (and who would have gotten slammed by the same sudden arrival of another 5 TB of data) finally finished their scrapes, leaving more bandwidth to come my way.

    Here’s a bit from the recent log file: (scroll to the right to see speeds for each segment)

    root@orangepione:/LVM# tail -20 Sghcn.nohup.out 
    10873750K .......... .......... .......... .......... .......... 99% 5.26M 0s
    10873800K .......... .......... .......... .......... .......... 99% 5.75M 0s
    10873850K .......... .......... .......... .......... .......... 99% 2.57M 0s
    10873900K .......... .......... .......... .......... .......... 99% 2.68M 0s
    10873950K .......... .......... .......... .......... .......... 99% 4.35M 0s
    10874000K .......... .......... .......... .......... .......... 99% 3.11M 0s
    10874050K .......... .......... .......... .......... .......... 99% 4.53M 0s
    10874100K .......... .......... .......... .......... .......... 99% 5.88M 0s
    10874150K .......... .......... .......... .......... .......... 99% 5.92M 0s
    10874200K .......... .......... .......... .......... .......... 99% 5.01M 0s
    10874250K .......... .......                                    100% 5.24M=77m44s
    

    Then some from an earlier log file:

    7042400K .......... .......... .......... .......... .......... 64%  707K 28m12s
    7042450K .......... .......... .......... .......... .......... 64% 1.99M 28m12s
    7042500K .......... .......... .......... .......... .......... 64% 2.79M 28m12s
    7042550K .......... .......... .......... .......... .......... 64%  715K 28m12s
    7042600K .......... .......... .......... .......... .......... 64% 1.95M 28m12s
    7042650K .......... .......... .......... .......... .......... 64% 2.75M 28m12s
    7042700K .......... .......... .......... .......... .......... 64%  722K 28m12s
    7042750K .......... .......... .......... .......... .......... 64% 1.86M 28m12s
    7042800K .......... .......... .......... .......... .......... 64% 2.79M 28m12s
    7042850K .......... .......... .......... .......... .......... 64%  725K 28m12s
    7042900K .......... .......... .......... .......... .......... 64% 1.79M 28m12s
    7042950K .......... .......... .......... .......... .......... 64% 2.86M 28m12s
    

    Clearly “something happened”, but who knows what. (Maybe the CIA, NSA, FBI, FSB, etc. etc. all collectively decided to turn off their feed of my downloads as they didn’t have the disk ready for it ;-) and the improved router efficiency kicked in ;-)

  17. E.M.Smith says:

    Well this is interesting…

    Mounting /LVM/ems as the home directory even with the scrape idle “has issues”. The connection has interrupts or some such. Works fine from the R.Pi and the Odroid and from several other individual disks, but not from the Orange Pi on LVM.

    I’ll try it from a dedicated disk on the Orange Pi just to sort it between the LVM group and the board / OS level, but at this point it is pretty clear that it isn’t going to work as a remote NFS home directory.

    FWIW I am using the Mac-with-no-SSD running from an SD Card as target for the mount. This machine takes significant pauses at times as the SD is way slower than the proper SSD and the MacOS is highly chatty to “disk”. (cache for the web browser is big and active). It is highly likely nobody has tested NFS with something this slow / obscure / prone to pauses.

    I could likely do a lot of ‘tuning’ and make it better, but it is unlikely to be worth it when the Odroid makes a fine NFS server and I really want isolation between the web-facing scraper and the interior-facing NFS server. (i.e. this was more exploring / playing with the tech than infrastructure build).

    But at least I’ve now eliminated “scraping load” as the cause of the “issues”.

  18. E.M.Smith says:

    Well, don’t know what to make of this…

    Added a TB dedicated disk. Moved /LVM/ems onto it. Exported (and mounted) it.

    Same problem with “NFS mount interrupted”.

    So something about the Orange Pi, its configuration, or the Mac interaction with it, “has Issues” where the Odroid does not. Is it the hardware? Likely not, as both use chips that are widely used. Debian vs Armbian? I’d suspect that one the most, since implementation issues on new ports are common, especially in edge cases like a Mac with long timeouts.

    Mounts on the Mac done the same way, so server side is where there is variation.

    Oddly, the issue only seems to show up when launching a browser (perhaps due to the high cache load) while things like the ls command have no issue.

    Well, it goes to the “someday” list. I’ll try using the scrape products via NFS and see how that goes (copy compressed wads, decompressing, etc) and move the NFS home dir stuff to another board…

  19. Pingback: Scraping GISS, CDIAC, NCDC / NCEI, and Me – Climate Collections

  20. p.g.sharrow says:

    @EMSmith; the eagle flies, finally, hope you can find more storage space for your treasures…pg

Comments are closed.