OK, a small update on computer status.
The R.PiM2 is now doing a wonderful job of site scraping, having accumulated about 150 GB so far. It is still running against the first two sites, and then I have two OTHER sites I want to scrape for temperature data as well. I think I’m gonna need another disk…
So sometime in the next week or three I’ll stop off at Best Buy or Walmart or Costco or… and get another TB disk for about $60, format it EXT4 or similar, and call it a dedicated temperature data archive.
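For anyone keeping score at home, prepping such a disk is quick. A minimal sketch follows; the device name and label are assumptions (check yours with lsblk before formatting anything), and the runnable part practices on a scratch image file so no real disk gets wiped:

```shell
# Practice run on a scratch image file -- the same mkfs line works on a real partition.
truncate -s 256M archive.img             # stand-in for the new 1 TB disk
mkfs.ext4 -q -F -L tempdata archive.img  # format EXT4, label the volume 'tempdata'

# On the real disk (device name is an assumption -- verify with lsblk first!):
#   sudo mkfs.ext4 -L tempdata /dev/sda1
#   sudo mkdir -p /mnt/tempdata
#   sudo mount LABEL=tempdata /mnt/tempdata
```

Labeling the volume means it can be mounted by LABEL= no matter which /dev/sdX slot the USB enumeration happens to hand it on a given boot.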
Along the way several small adventures happened (already posted), but a couple of things are not posted yet. One is ‘sort of amusing’… In the middle of all the data shuffle, I was trying to wedge an EXT3 file system into a file on NTFS (so as to avoid the lousy way permissions and ownership are handled with NTFS – you map them by hand, via creating a mapping table from NTFS to the original Linux / Unix owners…) and running headlong into the ntfs-3g driver behaving badly with small block sizes and sparse files. At that point I had all USB ports full: mouse, keyboard, uplink to the USB hub, a USB memory stick, and the powered USB hub driving 4 different USB disks.
A total of somewhere over 2.6 TB on 4 spindles. Yes, stuff was scattered around in different places and ought to have been pulled together into smaller spaces. But free space is where you find it as you cope with “issues” at 2 AM… Well, the USB disks tend to ‘idle down’ and sleep if they are not being accessed, so everything was going fine. Then I started launching more and more things. I got up to about 99% CPU usage (so all 4 cores drawing power) with all 4 spindles rotating and all heads seeking… and the little rainbow-colored ‘low volts’ indicator started to blink on about every minute. It got bad enough that the R.Pi detected a ‘plugging event’ on the USB drives and re-scanned for partitions.
Since I had some unmounted FAT32 partitions on the disk, it would pop up the ‘partition found’ menu and ask me what to do with them. That was my clue that things were ‘not well’. Occasionally, but not consistently, I’d had wget downloads ‘hang’ with various disk write errors – file system not mounted, or not write enabled, or some such – usually when I had been out of the room. The low-power blink together with the file mount request was the ‘Ah Ha’ moment: when all 4 disks were running, a spin-up or a synchronized head seek could sometimes exceed the power available from the powered hub. The hub would then draw a bit extra from the uplink, and that would sag the power to the R.Pi.
What was surprising was that the R.Pi kept on running and didn’t crash, but disk mounts would become half unmounted: not writable, yet still showing as mounted in the df listing. (I think this was a side effect of the ‘plugging event detection’ software.)
So I stopped a couple of processes and, since I had gotten the disk issue cleaned up (i.e. working on a non-sparse loop-mounted file with big_writes), I consolidated some of the activity. I no longer needed the tar.gz archive for the restore into that partition, so that disk could go offline. I now had swap on one of the Toshiba drives (the one that took 3 days to resize the NTFS partition), so the WD could go offline. Then a short ‘disk check’ via fsck and a reboot…
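For reference, the ‘non-sparse loop-mounted file with big_writes’ arrangement can be sketched as below. File and mount point names are made up for illustration; the essential bits are using dd (not truncate) so the container file is not sparse, and the big_writes option on the ntfs-3g mount:

```shell
# Build a NON-sparse container file. dd writes every block; a sparse file
# (e.g. one made with truncate) is exactly what upset the ntfs-3g driver.
dd if=/dev/zero of=linuxfs.img bs=1M count=64 status=none  # 64 MB for the demo; size to taste
mkfs.ext3 -q -F linuxfs.img   # EXT3 inside the file keeps real Unix owners and permissions

# Mounting needs root; names here are illustrative:
#   sudo mount -t ntfs-3g -o big_writes /dev/sdb1 /mnt/ntfs    # NTFS disk, with big_writes
#   sudo mount -o loop /mnt/ntfs/linuxfs.img /mnt/archive      # loop-mount the EXT3 file
```

Anything written under /mnt/archive then gets normal Linux ownership and permission handling, while physically living as one big ordinary file on the NTFS disk.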
Now everything is running fine and stable. Both downloads are proceeding smoothly (and the restart will have done a quick check that things match on both ends), and I’ve got 150 GB or so of additional disk space for them, spread between the two disks.
During most of this time, at least one of the downloads has been running to some disk or other, so the “futzing around” hasn’t stopped the process. But it has been interesting.
The major “lesson learned” here is that a “powered USB hub” can still suck power over the USB link to the R.Pi if you load it up with enough power-hungry things – and hard disks draw power fairly strongly, in surges at start-up. (Typical 2.5″ USB drives can surge toward an amp each at spin-up, so four spinning up together can outrun a 2 A or 2.5 A hub supply, with the shortfall pulled through the uplink.) This also implies the hub can supply a little power the other way if it has extra, stabilizing a marginal power supply on the R.Pi itself.
So don’t think of a powered USB hub as a power isolation / protection device. It sometimes isn’t.
OK, enough on power supplies.
I’m now, more or less, back to normal. I’ve got the R.PiM2 busy on a task, and it will be 2 weeks before it can need more disk space, given my link speed. I hope by then these 2 downloads are all done.
I have my main desktop machine back where I can use it for postings (no longer doing disk re-sizes that take forever…) and things are getting back to normal.
At this point, I’d ‘size and price out’ a data scraper system at about $150. One Raspberry Pi Model 2 kit for $60, a large USB disk for about $60, and about $30 of powered USB hub and misc cables. Then do the basic OS install and download the free software. With that, and the wget commands from the prior posting, you can effectively keep a mirrored copy of about 1 TB of climate and weather data from the sites of your choice. Get a 2 TB disk for about $30 more if you need it and you are still under $200. A Raspberry Pi B+ ought to also be fine for this use, so you could shave another $10 or so off the cost; leave out the case and such and you could likely get it down another $10. But frankly, it’s already cheap enough to be in the ‘noise’.
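The actual wget commands are in the prior posting; purely as a sketch, a mirroring run looks something like this (the URL is a placeholder, and the flags here are typical for the job rather than the exact ones used):

```shell
# Mirror a data directory, restartable and polite to the server.
#   -m   mirror: recursive, with timestamping so restarts only fetch what changed
#   -np  no-parent: don't wander up out of the target directory
#   -c   continue: resume partially downloaded files after an interruption
#   -w 2 wait 2 seconds between fetches, easy on the far end
wget -m -np -c -w 2 \
     --directory-prefix=/mnt/tempdata \
     'ftp://ftp.example.gov/pub/data/'
```

The -c and timestamping behavior is what makes the “restart does a quick check that things match on both ends” work: a re-run walks the tree and only pulls files that are missing or changed.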
Once the downloads are done I’ll post some sizing information based on actual data sizes, and put up a more organized “How To” on a DIY data scraper with links back to the R.Pi hardware set up, the wget command use, and formatting the file system more rationally. But for now, you have the rough “how it was done” to work from for anyone wishing to “play along at home”. And with that, the Temperature Data Archive Station is up and running.
FWIW, the CDIAC site has the GHCN V1 data in one of their dataset archives, so even though NOAA has ‘disappeared’ it (and V2), they can still be found. I have both and will be preserving them in a permanent archive for anyone who can’t find them online (to be made available at some future date if no online copies remain). I’ve also got a copy or two of the reputed “raw daily data”, though I’ve not characterized them as to just what is what (and they are large…).
So “sometime” after this process has run to completion I’m going to be making a catalog of what temperature data is in this pile. (Yes, I know, it would be far more efficient to have made that search / catalog first, then only done the download on the parts that were temperatures; but where’s the fun in that? ;-) Realistically, it’s much, much faster to search a real set of files on your own machine with full Linux / Unix tools than to do it on an FTP site, and sometimes you end up downloading a bag-o-bits anyway just to see what is in it (the online size and content information is, er, sparse…), so I decided to just “let a machine do the grabbing” and sort it out later.
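When catalog time comes, the stock Unix tools do the heavy lifting. A sketch of the idea (the directory, the file names, and the crude ‘temp’ keyword heuristic are all just illustrations):

```shell
# Fake a tiny archive so the pipeline below has something to chew on.
ARCHIVE=./scrape_demo
mkdir -p "$ARCHIVE"
echo "station mean temps 1901-1990" > "$ARCHIVE/v1.mean"
echo "precipitation totals"         > "$ARCHIVE/precip.dat"

# Catalog: every file with its size in bytes, biggest first.
find "$ARCHIVE" -type f -printf '%s\t%p\n' | sort -rn

# First cut at "which of these look like temperature data?"
grep -rli 'temp' "$ARCHIVE"
```

On the real pile you’d likely also lean on file(1) for the binary blobs, and zcat or tar tvf for peeking inside archives without unpacking them.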
And, with that, I’m off to a cup of morning coffee and a think about “What next?”