Sometimes something simple and easy and not at all a “problem” can turn into a quagmire.
That’s what I’m stuck in at the moment.
It will eventually sort itself out, just a matter of time. But I thought I’d take this moment to explain my slightly lower than typical ‘participation rate’ in all things internet and posting.
A little while ago, I showed how to download a chunk of data from a web site with one easy command. I also showed “how that can go wrong” if you let it follow, recursively, parent links. No problem, thought I. (Or more correctly “problem kept away”…) as I plunged back into the task of just sucking down some subsets of the data.
Well, one thing leads to another, and as my 64 GB SD card on the R.PiM2 filled up, I had to cope. Either halt the whole thing and think about it (which would have been the right choice, since the ‘wget’ can be set to modestly rapidly rescan and only download missing / new bits) or pause the process, shuffle some data around, and then be ‘fancy’ about the restart. Wanting to show off to myself how ‘trick’ I could be, I chose the second option.
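For reference, a rough sketch of what the "halt and rescan later" choice could look like. The exact command from the earlier posting is not repeated here, so the URL below (built from the host and path seen in the directory listings, with an assumed ftp:// scheme) and the choice of flags are illustrative assumptions. It only prints the command; the real transfer is left commented out.

```shell
#!/bin/sh
# Sketch only: a restartable mirror command.
# ASSUMPTION: the URL scheme and path are guesses based on the directory names.
url="ftp://ftp.ncdc.noaa.gov/pub/data/noaa/"

# --mirror     turns on recursion and timestamp checks, so a re-run skips
#              files that are already present and unchanged
# --no-parent  never climbs above the starting directory
mirror_cmd="wget --mirror --no-parent $url"

echo "would run: $mirror_cmd"

# Uncomment to start (or restart) the transfer for real:
# $mirror_cmd
```

Because of the timestamp checks, killing the transfer and re-running the same command later should pick up only the missing or changed files instead of starting over.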
Linux / Unix have a nice little feature. In a terminal window with a running process you can ‘suspend’ the process and return to a command shell to ‘do something’ and then resume the process. This is done with Ctrl-Z to suspend it and “fg” to bring it back to the foreground and start it running again.
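Under the hood this is just signals. Here is a small sketch of the same pause / resume idea done from a script, with a "sleep" standing in for the long-running wget. Ctrl-Z sends SIGTSTP from the terminal; the sketch uses SIGSTOP as a scriptable stand-in, and SIGCONT is what resumes the job. The helper name proc_state is made up for this example.

```shell
#!/bin/sh
# Sketch: pause and resume a background job with signals.

# Print the one-letter process state (T = stopped, S = sleeping, R = running).
proc_state() {
    if [ -r "/proc/$1/stat" ]; then
        # On Linux, field 3 of /proc/PID/stat is the state letter.
        awk '{print $3}' "/proc/$1/stat"
    else
        ps -o stat= -p "$1" | cut -c1
    fi
}

sleep 30 &            # stand-in for a long-running wget
pid=$!

kill -STOP "$pid"     # pause, much like Ctrl-Z
sleep 1               # give the kernel a moment to stop it
paused_state=$(proc_state "$pid")
echo "after STOP: $paused_state"

kill -CONT "$pid"     # resume, which is what fg does before waiting on the job
sleep 1
resumed_state=$(proc_state "$pid")
echo "after CONT: $resumed_state"

kill "$pid"           # clean up the stand-in job
wait "$pid" 2>/dev/null || true
```

In an interactive shell, "jobs" lists the suspended processes and "fg %2" picks a particular one to bring back.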
I had one transfer going on that had used most of the space, and two smaller ones that had not used much, so simple… just pause the ‘big one’, plug in a USB disk with about 70 GB free and a nice EXT type Linux file system, and move the directory where everything was being stored onto that disk. Last step was to put in place a ‘symbolic link’ in the old location that says “go look over there now”. At that point, about 34 GB of free space on the chip and “fg”… Off to the races again.
root@RaPiM2:/MIRRORS# ls -l
lrwxrwxrwx 1 root root   27 Aug 31 22:14 cdiac.ornl.gov -> /WD/MIRRORS/cdiac.ornl.gov/
drwxr-xr-x 3 pi   pi   4096 Aug 29 03:17 ftp.ncdc.noaa.gov
Notice that my ‘prompt’ says I’m in the “/MIRRORS” directory. Here there are top level directory names for each of the sites that are being mirrored. CDIAC (Carbon Dioxide Information …) was ‘the big one’ that I moved. That first line is a ‘link’, so notice that the first character is an ‘l’ for link; while the one just below it is a ‘d’ for ‘directory’. Here you can see that /MIRRORS/cdiac.ornl.gov is now pointed over to /WD/MIRRORS/cdiac.ornl.gov instead. Now, when the wget command is resumed, it just carries on as though the data were still in /MIRRORS, but it gets redirected to the new home on the /WD disk.
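The move-and-link shuffle can be sketched like this. To keep it safe to run anywhere, temporary directories stand in for /MIRRORS (the full SD card) and /WD (the USB disk), and file.txt is a made-up sample file; only the three numbered steps are the actual technique.

```shell
#!/bin/sh
# Sketch: relocate a directory to a bigger disk and leave a symlink behind.
# ASSUMPTION: temp directories stand in for /MIRRORS and /WD/MIRRORS.
work=$(mktemp -d)
old="$work/MIRRORS"
new="$work/WD/MIRRORS"
mkdir -p "$old/cdiac.ornl.gov" "$new"
echo "sample" > "$old/cdiac.ornl.gov/file.txt"

# 1. Copy the data across, preserving permissions and timestamps.
# 2. Remove the original only if the copy succeeded.
# 3. Leave a symbolic link at the old path pointing to the new home.
cp -a "$old/cdiac.ornl.gov" "$new/" &&
    rm -rf "$old/cdiac.ornl.gov" &&
    ln -s "$new/cdiac.ornl.gov" "$old/cdiac.ornl.gov"

ls -l "$old"                          # first character is now 'l' for link
cat "$old/cdiac.ornl.gov/file.txt"    # read through the link
```

One caveat with doing this under a paused process: any file it already had open when it was suspended still points at the old (now deleted) copy, so the one file that was in flight may need to be fetched again.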
And things continued to run for a few more days… Until both were filling up again…
root@RaPiM2:/home/pi# df
Filesystem     1K-blocks     Used Available Use% Mounted on
rootfs          59805812 56704004     40768 100% /
/dev/root       59805812 56704004     40768 100% /
[...]
/dev/sdb2       50264772 46496840   1207932  98% /WD
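Rather than finding out at 100%, a little check like the following could be run now and then. It is only a sketch: the 90% limit, the list of paths, and the helper name usage_pct are all assumptions picked for illustration.

```shell
#!/bin/sh
# Sketch: warn when a filesystem is getting full.

# Print the Use% figure (without the % sign) for the filesystem holding a path.
usage_pct() {
    # -P forces the portable one-line-per-filesystem format.
    df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

limit=90    # ASSUMPTION: warn at 90% full

for path in / /WD; do
    [ -d "$path" ] || continue          # skip mount points that do not exist here
    pct=$(usage_pct "$path")
    [ -n "$pct" ] || continue
    if [ "$pct" -ge "$limit" ]; then
        echo "WARNING: $path is ${pct}% full"
    else
        echo "$path is ${pct}% full"
    fi
done
```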
At this point I have all three transfers paused with Ctrl-Z.
Over on the SD card:
root@RaPiM2:/MIRRORS/ftp.ncdc.noaa.gov/pub/data# ls noaa
1901  1912  1923  1934  1945  1956  1967  1978  1989              dsi3260.pdf              ish-format-document.pdf  ishJava_ReadMe.pdf
1902  1913  1924  1935  1946  1957  1968  1979  1990              isd-history.csv          ish-history.csv          ish-qc.pdf
1903  1914  1925  1936  1947  1958  1969  1980  1991              isd-history.txt          ish-history.txt          ish-tech-report.pdf
1904  1915  1926  1937  1948  1959  1970  1981  1992              isd-inventory.csv        ish-inventory.csv        NOTICE-ISD-MERGE-ISSUE.TXT
1905  1916  1927  1938  1949  1960  1971  1982  1993              isd-inventory.csv.z      ish-inventory.csv.z      readme.txt
1906  1917  1928  1939  1950  1961  1972  1983  1994              isd-inventory.txt        ish-inventory.txt        station-chart.jpg
1907  1918  1929  1940  1951  1962  1973  1984  1995              isd-inventory.txt.z      ish-inventory.txt.z      updates.txt
1908  1919  1930  1941  1952  1963  1974  1985  1996              isd-problems.docx        ishJava.class
1909  1920  1931  1942  1953  1964  1975  1986  1997              isd-problems.pdf         ishJava.java
1910  1921  1932  1943  1954  1965  1976  1987  1998              ish-abbreviated.txt      ishJava.old.class
1911  1922  1933  1944  1955  1966  1977  1988  country-list.txt  ish-format-document.doc  ishJava.old.java
root@RaPiM2:/MIRRORS/ftp.ncdc.noaa.gov/pub/data# du -ks noaa/
30484864 noaa/
So I’m all the way up to near the end of the last century at year 1998, only 17 more years of data to go… and it’s already at over 30 GB…
The other transfer was in GHCN at the same site.
Almost 22 GB there and rising… The “biggy” there being “all”
18444948 all
      36 COOPDaily_announcement_042011.doc
     124 COOPDaily_announcement_042011.pdf
      68 COOPDaily_announcement_042011.rtf
 2822796 ghcnd_all.tar.gz
       4 ghcnd-countries.txt
  139556 ghcnd_gsn.tar.gz
  281244 ghcnd_hcn.tar.gz
   25676 ghcnd-inventory.txt
       4 ghcnd-states.txt
    8236 ghcnd-stations.txt
       4 ghcnd-version.txt
      24 readme.txt
      32 status.txt
which is “almost done”… but not quite….
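For keeping a running tally, the "du -ks" numbers can be added up with a couple of small helpers. This is a sketch; sum_kb and kb_to_gb are names made up here, and the GB conversion uses the same rough "a million KB is a GB" rule of thumb used in this posting rather than exact binary units.

```shell
#!/bin/sh
# Sketch: total up "du -ks" style output (size in KB, then a name).

# Read "SIZE NAME" lines on stdin and print the total size in KB.
sum_kb() {
    awk '{ total += $1 } END { printf "%.0f\n", total }'
}

# Convert a KB count to rough decimal gigabytes (treating 1 GB as 1,000,000 KB).
kb_to_gb() {
    awk -v kb="$1" 'BEGIN { printf "%.1f\n", kb / 1000000 }'
}

# The three largest entries from the listing above.
total=$(printf '%s\n' \
    "18444948 all" \
    "2822796 ghcnd_all.tar.gz" \
    "281244 ghcnd_hcn.tar.gz" | sum_kb)
echo "$total KB is about $(kb_to_gb "$total") GB"
```

On a live directory the same helper takes the output directly, as in "du -ks * | sum_kb".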
Now the first thing to realize is that had I done any ONE of these at a time, the total data transfer is highly likely to have fit on the SD card, or if not, on the external disk. It would have taken no more total time (as it is bandwidth limited on my internet pipe) and I’d have had far more ‘feedback’ along the way about how things were going.
But, since there is no sizing information on ‘how big’ on the web sites, I chose to just hope it would all fit and launched all three to “complete at night while I’m in bed”, which was about 4 or 5 days ago…
Lesson One: Do not ever guess how big a transfer / mirror will be from a government site. They are paid to create volume… (whenever I get this all done, I’m going to put up a set of “how big are they” numbers for the sub-directories…)
But Wait! There’s MORE!!
So I’m thinking “No Problem, I’ve done this once, I can use the same fix again”.
Lesson Two: Beware of Hubris… especially if “I’ve done this before” crosses your mind.
Pawing around on some disks, I see that I have a duplicate copy of one of the disks onto the newly bought 2 TB disk. It is a 1 TB disk, but only about 3/4 of a TB are used. Surely a couple of hundred GB will be enough for ONE of these data sets, perhaps two or all three… but the disk is formatted NTFS. Linux can read and write that, but it isn’t very efficient and I’m not all that keen on swapping file system types under a running paused process… So, bright idea time, I’ll just shrink that NTFS partition and make an EXT one on the free space.
I boot up the CentOS box on a ‘rescue CD’ that has a nice little disk utility on it that I know works well doing exactly this, as I’ve used it many times before (Important… not a time to find out which version doesn’t resize NTFS quite right…), and I’m ready to go. I launch “gparted”. Nice graphical User Interface and all.
Gparted examines the disk, finds the partitions, tells me how much is free. I tell it to take that big fat NTFS partition and shrink it down to about 7 GB free space. I want to leave a little bit of working space in case I need to later move, uncompress, or whatever some small files and not worry about it too much. No Problem, says Gparted. I get the layout set up as I want it (free space ahead, after, etc.) and click the “do it” button (actually a giant green Check Mark on the set of graphical commands). It pops up the “really?” and I say yes. It pops up the “Doing it NOW!” dialog box with the helpful note “Depending on the number and type of operations, this might take a long time”.
That was yesterday. Now I’m looking at this thinking “No Shit Sherlock!”. And again we have no indication of percent done or how big ‘long’ might be.
Lesson Three: A TByte is big. Really Really Big. It is not made smaller by being cheaper now. Moving it around takes a very very long time. Especially over slow WAN links, on slow SD Cards, and as a ‘tower of Hanoi’ (that I’m guessing they did in Gparted) defragment, relocate, resize on a slow NTFS file system on a slow USB-2 spigot.
So now I’ve got the main monitor tied up on the R.PiM2 with three paused processes, the WD disk tied up until I can move data (and it has my home dir on it for the EVO… that needs the same screen as is used by the R.Pi anyway – so doubly out of action) and the ASUS is busy fondling its disk. Oh, and the Chromebox can’t be used as it needs one of the two monitors that are both in use and locked to a process.
All of which leaves me with the Tablet as my only “do whatever” machine. Which works well for reading, not so well for things like making postings. (It is also pretty good at downloading, though it is a bit of an annoyance to pull the mini-sd card out of it to move bulk… which would need one of the busy machines anyway…)
So I’m making this posting from the Raspberry Pi Model2, despite it being ‘cluttered’ at the moment with a bunch of windows with paused processes and short on ‘disk’ space and… Sigh.
I’m hopeful that sometime today the partition resizing will complete. I’m also thinking I need to get a USB-3 speed box if I’m going to play with TB USB disks a lot. USB-2 is just way slow at that size. I’m also thinking that just tossing another $50 at a dedicated TB disk and formatting it to EXT would have been faster and smarter… But I’m now stuck in the muck and “woulda coulda shoulda” is not as important as “ought to do now”.
So for now, postings will be slower and a bit more limited as my data archives are kind of spread around and “in play”, the various boxes used for different things are ‘locked up’ or ‘locked out’, and I’m once again “Waiting For I/O, or someone like him…” to complete.
Once the disk diddle is done, I can get on with the format of EXT, put it on the R.PiM2 (that is known to take a ‘hot plug via the USB powered hub’ ok – it will crash a Pi with direct hot plug via the power sag), move about 30 GB to 50 GB of data, make the symbolic links, foreground a couple of processes, and once again saturate my internet pipe… in the hope that it will complete in a day or two.
That is the dim light I’m seeing at the end of this Big Data Small Pipe tunnel…
“But Hope is not a strategy. -E.M.Smith”, so there may be more ‘amusement’ to come…