Technically, the title is a little misleading. I cut the “ambition” of the script back to just one directory named “noaa”, and that part is what finally completed. There are still a couple of other directories to scrape at that same level, but I wanted this large part “in the bag” so I could look it over. As it was, this has been running for many days, most of each day, at a data rate of about 100k to 150k most of that time. (The rate was cut back when I wanted to do other internet things.)
This is the script doing the work. As with many such scripts, I’ve edited bits to change what it does as I’ve been running and restarting it. The first small bits had already completed, and I saw no reason to re-run them for updates while I was trying to get the “noaa” part done, so I commented them out (put a # in the first character). Similarly, the later bits were commented out so that when “noaa” completed, it would not keep on doing the others (potentially for a few more weeks) but would let me know by stopping. (Now I can go back, uncomment them, and let it run on them for a while.)
I’ve also changed the parameters of the wget a bit from time to time (apportioning bandwidth between processes and adjusting recursion, for example). A reminder for those not steeped in Unix / Linux: “pi@dnsTorrent” is userID@machine-name, the $ is the prompt after it, and “cat” is the “concatenate and print” command that is printing out the script I’ve named syncnoaa. The next line, starting with “# Fetch”, is a comment, as are all lines starting with a #, and they explain / document what the script is doing. In this script, the only really active line is the one “wget” without a # in front of it.
BTW, that first comment line needs updating, as this site clearly has a whole lot more going on than just the GHCN data. The comment about CDIAC isn’t relevant to this scrape, but to the copy of this script that is doing the CDIAC data; still, it is a good idea anyway and explains why that option is needed. Note that this one now does a “cd /Temps” as I’d moved all the prior data onto a dedicated USB disk mounted on /Temps.
The Script:
pi@dnsTorrent ~/bin $ cat syncnoaa
# Fetch a mirrored copy of the NOAA GHCN Daily temperature data.
#
# wget is the command that does the fetching.
# It can be fed an http: address or an ftp: address.
#
# The -w or --wait command specifies a number of seconds to pause
# between file fetches. This helps to prevent over pestering a
# server by nailing a connection constantly; while the
# --limit-rate={size} limits via an average of pausing between
# transfers. Over time this is about the rate of bandwidth used,
# but on a gaggle of small files can take a while to stabilize,
# thus the use of both.
#
# Since CDIAC uses a "parent" link that points "up one" you need
# to not follow those or you will end up duplicating the whole
# structure ( I know... don't ask...) thus the -np or
# --no-parent option.
#
# The -m or --mirror option sets a bunch of other flags (in effect)
# so as to recursively copy the entire subdirectory of the target
# given in the address. Fine, unless they use 'parent' a lot...
#
# Then you list the http://name.site.domain/directory or
# ftp://ftp.site.domain/directory to clone
#
# Long form looks like:
#
# wget --wait 10 --limit-rate=100k --no-parent --mirror http://cdiac.ornl.gov/ftp/ushcn_daily
#
# but I think the --commands look silly and are for people who can't
# keep a table of 4000 things that -c does in 3074 Unix / Linux
# commands, all of them different, in their head at all times ;-)
# so I use the short forms. Eventually not typing all those wasted
# letters will give me decades more time to spend on useful things,
# like comparing the merits of salami vs. prosciutto...

cd /Temps

#wget -w 10 --limit-rate=100k -np -m http://cdiac.ornl.gov/ftp/ushcn_daily
#wget -r -N -l inf --no-remove-listing -w 10 --limit-rate=100k -np http://cdiac.ornl.gov/ftp/ushcn_daily

echo
echo Doing World Weather Records
echo

#wget --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/wwr/
#wget --limit-rate=100k -nc -np -r -l inf ftp://ftp.ncdc.noaa.gov/pub/data/wwr/

echo
echo Doing World War II Data
echo

#wget --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/ww-ii-data/
#wget --limit-rate=100k -nc -np -r -l inf ftp://ftp.ncdc.noaa.gov/pub/data/ww-ii-data/

echo
echo Doing NOAA set
echo

#wget --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/noaa/
wget --limit-rate=100k -nc -np -r -l inf ftp://ftp.ncdc.noaa.gov/pub/data/noaa/

echo
echo Doing Global Data Bank set
echo

#wget --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/

echo
echo Doing GHCN
echo

#wget --limit-rate=100k -np -m ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/

echo
echo ALL DONE!!!
echo
So that “globaldatabank” isn’t done and is the next step. But I think I’ll let it rest a while before I start saturating my slow internet link for a week or two again…
Sizes?
We’ll start with just the disk and everything on it (so far).
pi@dnsTorrent ~/bin $ df
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/sdb3      955139868 333164376 573457144  37% /Temps
So I’ve used about 1/3 of a 1 TB USB disk, with a lot more left to do. That’s about a $60 disk at Best Buy. Not exactly a ‘break the bank’ operation to get your own copy.
Now we’ll look inside it at just what all makes up that 1/3 TB. Some of it is “old copies” of GHCN and related that I’ve stashed over the years. On the “to do” list is to compare them and see “what changed”. Kind of a very slow very coarse “audit” of degree of fiddle.
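When I get to that compare, the first pass will likely be something crude like a recursive diff between an old stash and a fresh pull. A sketch only; the directory names are from my stash, and all this gives is a “what files differ” listing, not a field-by-field audit:

diff -rq GHCNv3_1June2015 ftp.ncdc.noaa.gov/pub/data/ghcn/v3 > v3_changes.txt
wc -l v3_changes.txt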
pi@dnsTorrent /Temps $ ls
BEST_Temperature_Products  cdiac.ornl.gov     GHCNdaily1June2015  GISTemp        Temperature_Data               USHCNv2.5
BUPS.RH.gistemp            CDIAC_wget_log     GHCN_from_SG500     lost+found     Temperature Data from Mac
CDIAC                      ftp.ncdc.noaa.gov  GHCNv1_partial      NOAA_NCDC      Temperature_data_from_Mac.zip
cdiac.AntO.ndp032.txt      GHCN               GHCNv3_1June2015    NOAA_wget_log  testeph.f
Quite a hodge-podge eh? A chunk from B.E.S.T., a dash of USHCNv2.5, an archive from about 1/5 decade back on my old Macintosh, a copy of GISTemp. So it goes. But how big are these things?
One of the “fun” bits about Unix / Linux (collectively *NIX) is that you can create a small little command that can run for a long time. Like that wget that took weeks. Now we’re doing “du -ms * | sort -rn”, which rummages through that entire 1/3 TB counting up the sizes of every single file, finding the “Disk Usage” in Megabytes and Summarizing it for “*” (all names at the top level), then sending that via a pipe “|” to the sort command to sort it in Reverse Numeric order. So quick to flow off the fingers, but now I’m waiting and waiting… (I have this in a command named “DU”, for Disk Usage, that sends the output to a file, so I can usually just launch it and move on… but now I’m doing it live. Maybe I’ll go get coffee…)
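(That “DU” wrapper is nothing fancy, by the way. A minimal sketch of the idea, with the output file names being my arbitrary choice here:

du -ms * | sort -rn > DU_report.txt 2> DU_errors.txt &

Backgrounding it with “&” is what lets me launch it and wander off for coffee.)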
root@dnsTorrent:/Temps# du -ms * | sort -rn
172034	ftp.ncdc.noaa.gov
111641	cdiac.ornl.gov
15139	NOAA_NCDC
6926	GHCN
6624	BUPS.RH.gistemp
5491	GHCNdaily1June2015
2640	BEST_Temperature_Products
1231	Temperature Data from Mac
1069	Temperature_Data
760	Temperature_data_from_Mac.zip
669	CDIAC
543	GHCN_from_SG500
164	NOAA_wget_log
112	GISTemp
58	GHCNv3_1June2015
53	USHCNv2.5
32	CDIAC_wget_log
1	testeph.f
1	lost+found
1	GHCNv1_partial
1	cdiac.AntO.ndp032.txt
Now you can see the benefit of this process. It is instantly obvious that the only really big disk users are those two site scrapes (with the cdiac.ornl one still running, but only at a 50k rate), plus a couple of prior archives: NOAA_NCDC saves from over the years and a variety of old GHCN copies. (Plus honorable mention for a GISTemp archive with data.)
Once we are down to “just data” in GHCN Daily, it is ‘only’ 5.5 GB while the B.E.S.T. archive is only 2.6 GB. Everything else is chump change. USHCNv2.5 comes in at only 53 MB.
Notice the two files that end with _log, one starting with cdiac and the other with NOAA, each with _wget_ in the middle. That’s the log file of each scrape. Darned big logs! 164 MB just for the file names and transfer tracking of the latest NOAA run. Sheesh!
Here’s a bit of the log so you can see what it looks like. The “head” command gives you the first lines of a file; the “tail” command, the last lines. (For chunks in the middle you can ‘head’ about 1/2 the file, then ‘tail’ that.) I’ll start with just counting the lines in the file. (That is done with the word count command “wc”, giving it an option to just count the lines.) Note that I’ve swapped over to being “root”, the superuser, as some of these files have “root” as owner.
root@dnsTorrent:/Temps# wc -l NOAA_wget_log
2289637 NOAA_wget_log
root@dnsTorrent:/Temps#
2,289,637 lines of log file. I think I’ll not read the whole thing… ;-)
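So, using that “head then tail” trick from above, a peek at a chunk near the middle of those 2,289,637 lines would go something like this (a sketch; 1144818 is just half the line count, rounded):

head -1144818 NOAA_wget_log | tail -20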
Remember that in the script, I commented out some of the wget commands that had already finished, but left in the ‘echo’ lines announcing each step. So skip over the “World Weather Records” and “World War II Data”… The -20 says to give me 20 lines off the top instead of the default of only 10 lines.
root@dnsTorrent:/Temps# head -20 NOAA_wget_log

Doing World Weather Records


Doing World War II Data


Doing NOAA set

File `ftp.ncdc.noaa.gov/pub/data/noaa/.listing' already there; not retrieving.
Removed `ftp.ncdc.noaa.gov/pub/data/noaa/.listing'.
File `ftp.ncdc.noaa.gov/pub/data/noaa/country-list.txt' already there; not retrieving.
File `ftp.ncdc.noaa.gov/pub/data/noaa/dsi3260.pdf' already there; not retrieving.
File `ftp.ncdc.noaa.gov/pub/data/noaa/isd-history.csv' already there; not retrieving.
File `ftp.ncdc.noaa.gov/pub/data/noaa/isd-history.txt' already there; not retrieving.
File `ftp.ncdc.noaa.gov/pub/data/noaa/isd-inventory.csv' already there; not retrieving.
File `ftp.ncdc.noaa.gov/pub/data/noaa/isd-inventory.csv.z' already there; not retrieving.
File `ftp.ncdc.noaa.gov/pub/data/noaa/isd-inventory.txt' already there; not retrieving.
File `ftp.ncdc.noaa.gov/pub/data/noaa/isd-inventory.txt.z' already there; not retrieving.
File `ftp.ncdc.noaa.gov/pub/data/noaa/isd-problems.docx' already there; not retrieving.
When restarted, wget is smart enough to skip files it has already copied if you tell it to do that.
Here’s the bottom bit:
root@dnsTorrent:/Temps# tail -20 NOAA_wget_log
 50850K .......... .......... .......... .......... ..........  99%  104K 2s
 50900K .......... .......... .......... .......... ..........  99%  104K 1s
 50950K .......... .......... .......... .......... ..........  99%  104K 1s
 51000K .......... .......... .......... .......... ..........  99%  104K 0s
 51050K .......... .......... .......... .......... ........   100%  102K=8m31s

2015-10-01 04:56:13 (99.9 KB/s) - `ftp.ncdc.noaa.gov/pub/data/noaa/updates/999999-99999-1977.gz' saved [52325312]

FINISHED --2015-10-01 04:56:13--
Total wall clock time: 1d 18h 7m 50s
Downloaded: 104140 files, 9.7G in 1d 3h 27m 17s (103 KB/s)

Doing Global Data Bank set


Doing GHCN


ALL DONE!!!
Again, note the blanks under “Doing Global Data Bank set” and “Doing GHCN”, whose wget lines were deliberately commented out in the script. Working up from there, it says the last run downloaded 104,140 files for 9.7 GB and took a little over a day to do it. That is ONLY from the restart after moving the scrape from the PiM2 to the Pi_B+. Each file gets a progress report (those top lines) and a summary when done that includes the average data rate (about 100 KB/s) and the file name / size.
Here’s a little bit from the middle toward the end. Since we know it was a mostly finished restart, we’ll get a chunk from the end first, then take the top of that bit:
root@dnsTorrent:/Temps# tail -10000 NOAA_wget_log | head -20
--2015-10-01 04:19:20--  ftp://ftp.ncdc.noaa.gov/pub/data/noaa/updates/951809-99999-1968.gz
           => `ftp.ncdc.noaa.gov/pub/data/noaa/updates/951809-99999-1968.gz'
==> CWD not required.
==> PASV ... done.    ==> RETR 951809-99999-1968.gz ... done.
Length: 8284 (8.1K)

     0K ........                                              100%  187K=0.04s

2015-10-01 04:19:20 (187 KB/s) - `ftp.ncdc.noaa.gov/pub/data/noaa/updates/951809-99999-1968.gz' saved [8284]

--2015-10-01 04:19:20--  ftp://ftp.ncdc.noaa.gov/pub/data/noaa/updates/951809-99999-1969.gz
           => `ftp.ncdc.noaa.gov/pub/data/noaa/updates/951809-99999-1969.gz'
==> CWD not required.
==> PASV ... done.    ==> RETR 951809-99999-1969.gz ... done.
Length: 14334 (14K)

     0K .......... ...                                        100%  223K=0.06s

2015-10-01 04:19:21 (223 KB/s) - `ftp.ncdc.noaa.gov/pub/data/noaa/updates/951809-99999-1969.gz' saved [14334]
root@dnsTorrent:/Temps#
That’s what most of the process looks like: a long list of individual file transfers. That last one, being a very small file at 14 K, only managed 223 KB/s of transfer, as time was spent on the overhead of set-up and tear-down. A lot of small files looks like not much speed, while a single large file saturates the pipe. It happened at 4:19 A.M. while I was sleeping (yay!) and moved the file from ftp.ncdc.noaa.gov, in the /pub/data/noaa/updates directory, named “951809-99999-1969.gz”, which is in compressed “gzip” format (that .gz). What’s in it? Who knows… I’d guess an update of temperature data from 1969 for station 951809 or something similar. It will take a good long while to figure out what all is in the scrape. Much of it will likely be left compressed on disk just as a “someday” archive, after inspection of a bit shows it to be “uninteresting” to me.
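If curiosity strikes before “someday”, you can peek inside one of those gzip files without unpacking the on-disk copy; zcat (or gunzip -c) decompresses to the screen and leaves the .gz alone:

zcat ftp.ncdc.noaa.gov/pub/data/noaa/updates/951809-99999-1969.gz | head -5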
Inside NOAA subset
So what all is inside that biggest lump?
root@dnsTorrent:/Temps/ftp.ncdc.noaa.gov/pub/data# du -ms *
49983	ghcn
121929	noaa
17	ww-ii-data
107	wwr
root@dnsTorrent:/Temps/ftp.ncdc.noaa.gov/pub/data#
( I didn’t bother to sort this small a listing…)
We can see that the W.W.II data are the smallest at 17 MB, while “World Weather Records” is also small at 107 MB. At some point I’ll take the “#” off those wget lines and let them update on each run. I’d also downloaded GHCN in a prior run, and it is about 50 GB. Most of that is a couple of copies of the daily data. Then this last batch of ‘noaa’ at 122 GB. So what’s in THOSE two?
root@dnsTorrent:/Temps/ftp.ncdc.noaa.gov/pub/data/ghcn# du -ms * | sort -rn
47787	daily
1967	v3
115	blended
61	v2
30	forts
15	grid_gpcp_1979-2002.dat
4	v1
4	Lawrimore-ISTI-30Nov11.ppt
3	anom
2	snow
1	alaska-temperature-means.txt
1	alaska-temperature-anomalies.txt
For GHCN it is mostly daily data, then the V3 massaged results. That “ghcnd_all.tar.gz” is a handful at 334 MB. Then you also have it “by_year” and it doesn’t get any smaller when listed one year at a time (but you can choose to only download the newest year… if you already have the others). The “all” directory looks to me like it has daily records for each individual station, so you can just get the stations you care about. Basically, you get the same Gigs or so of data several times in several different sorts.
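For example, refreshing just the current year out of “by_year”, rather than re-pulling the lot, ought to look something like this (a sketch; the by_year files are named by year, e.g. 2015.csv.gz, and -N only fetches it if the server copy is newer than what’s on disk):

wget -N --limit-rate=100k ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2015.csv.gz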
Note that in the “ls -l” listing the “size” of a directory (lines starting with d) is the size of the directory tracking the data blocks, not the total of the actual data blocks sizes. The “all” is 25 GB of data, tracked in 3.6 MB of directory structure. For non-directory files, the “ls -l” lists the actual size of the file.
root@dnsTorrent:/Temps/ftp.ncdc.noaa.gov/pub/data/ghcn# ls -l daily
total 784852
drwxr-xr-x 2 pi pi   3608576 Sep  6 03:54 all
drwxr-xr-x 2 pi pi     12288 Sep  6 19:42 by_year
-rw-r--r-- 1 pi pi     34304 Apr 19  2011 COOPDaily_announcement_042011.doc
-rw-r--r-- 1 pi pi    125034 Apr 19  2011 COOPDaily_announcement_042011.pdf
-rw-r--r-- 1 pi pi     68083 Apr 19  2011 COOPDaily_announcement_042011.rtf
drwxr-xr-x 2 pi pi      4096 Sep  6 19:42 figures
-rw-r--r-- 1 pi pi 334052970 Sep  4 03:43 ghcnd_all.tar.gz
-rw-r--r-- 1 pi pi      3670 Jun 23 01:59 ghcnd-countries.txt
-rw-r--r-- 1 pi pi 142902064 Sep  4 03:43 ghcnd_gsn.tar.gz
-rw-r--r-- 1 pi pi 287989756 Sep  4 03:43 ghcnd_hcn.tar.gz
-rw-r--r-- 1 pi pi  26289506 Sep  2 04:28 ghcnd-inventory.txt
-rw-r--r-- 1 pi pi      1086 May 15  2011 ghcnd-states.txt
-rw-r--r-- 1 pi pi   8431010 Sep  2 04:28 ghcnd-stations.txt
-rw-r--r-- 1 pi pi       270 Sep  4 03:43 ghcnd-version.txt
drwxr-xr-x 3 pi pi      4096 Sep  6 20:50 grid
drwxr-xr-x 2 pi pi     36864 Sep  7 02:51 gsn
drwxr-xr-x 2 pi pi     40960 Sep  7 05:47 hcn
drwxr-xr-x 2 pi pi      4096 Sep  7 05:47 papers
-rw-r--r-- 1 pi pi     24088 Jul  9 06:56 readme.txt
-rw-r--r-- 1 pi pi     29430 Jul  8 02:31 status.txt
root@dnsTorrent:/Temps/ftp.ncdc.noaa.gov/pub/data/ghcn#

root@dnsTorrent:/Temps/ftp.ncdc.noaa.gov/pub/data/ghcn/daily# du -ms *
24925	all
13624	by_year
1	COOPDaily_announcement_042011.doc
1	COOPDaily_announcement_042011.pdf
1	COOPDaily_announcement_042011.rtf
8	figures
319	ghcnd_all.tar.gz
1	ghcnd-countries.txt
137	ghcnd_gsn.tar.gz
275	ghcnd_hcn.tar.gz
26	ghcnd-inventory.txt
1	ghcnd-states.txt
9	ghcnd-stations.txt
1	ghcnd-version.txt
5203	grid
865	gsn
2395	hcn
8	papers
1	readme.txt
1	status.txt
IMHO, this kind of “size of data” information would be very nice to have on their FTP site so folks knew what they were asking for with any given download. But now you have it.
What about those other v1 and v2 entries? Why are they so small? Well, simply because they took out the data. Note that there are various ghcnm.tavg*, ghcnm.tmax*, and ghcnm.tmin* entries under GHCN v3, but they are now missing from v1 and v2. (No worries, though; I’ve saved copies over the years AND I found the GHCN v1 data on the CDIAC site, which may still have v2 somewhere.)
root@dnsTorrent:/Temps/ftp.ncdc.noaa.gov/pub/data/ghcn# ls v1
country.codes  flag.for.Z    invent.sas.Z      press.sea.statinv.Z  press.sta.statinv.Z
data.for.Z     flag.sas.Z    press.sea.data.Z  press.sta.data.Z     README.ghcn
data.sas.Z     invent.for.Z  press.sea.flag.Z  press.sta.flag.Z     README.ghcn~
root@dnsTorrent:/Temps/ftp.ncdc.noaa.gov/pub/data/ghcn# ls v2
grid    v2.country.codes  v2.prcp.failed.qc.Z  v2.prcp.readme  v2.read.data.f  zipd
source  v2.prcp_adj.Z     v2.prcp.inv          v2.prcp.Z       v2.read.inv.f
root@dnsTorrent:/Temps/ftp.ncdc.noaa.gov/pub/data/ghcn# ls v3
archives                      ghcnm.tavg.latest.qcu.tar.gz  ghcnm.tmin.latest.qca.tar.gz  grid      software
country-codes                 ghcnm.tmax.latest.qca.tar.gz  ghcnm.tmin.latest.qcu.tar.gz  products  status.txt
ghcnm.tavg.latest.qca.tar.gz  ghcnm.tmax.latest.qcu.tar.gz  GHCNM-v3.2.0-FAQ.pdf          README    techreports
root@dnsTorrent:/Temps/ftp.ncdc.noaa.gov/pub/data/ghcn#
The “noaa” directory is even busier. It has some kind of archive by year that is the bulk of it, then some other files.
root@dnsTorrent:/Temps/ftp.ncdc.noaa.gov/pub/data/noaa# ls
1901  1931  1961  1991              isd-inventory.csv
1902  1932  1962  1992              isd-inventory.csv.z
1903  1933  1963  1993              isd-inventory.txt
1904  1934  1964  1994              isd-inventory.txt.z
1905  1935  1965  1995              isd-lite
1906  1936  1966  1996              isd-problems.docx
1907  1937  1967  1997              isd-problems.pdf
1908  1938  1968  1998              ish-abbreviated.txt
1909  1939  1969  1999              ISH-DVD2012
1910  1940  1970  2000              ish-format-document.doc
1911  1941  1971  2001              ish-format-document.pdf
1912  1942  1972  2002              ish-history.csv
1913  1943  1973  2003              ish-history.txt
1914  1944  1974  2004              ish-inventory.csv
1915  1945  1975  2005              ish-inventory.csv.z
1916  1946  1976  2006              ish-inventory.txt
1917  1947  1977  2007              ish-inventory.txt.z
1918  1948  1978  2008              ishJava.class
1919  1949  1979  2009              ishJava.java
1920  1950  1980  2010              ishJava.old.class
1921  1951  1981  2011              ishJava.old.java
1922  1952  1982  2012              ishJava_ReadMe.pdf
1923  1953  1983  2013              ish-qc.pdf
1924  1954  1984  2014              ish-tech-report.pdf
1925  1955  1985  2015              NOTICE-ISD-MERGE-ISSUE.TXT
1926  1956  1986  additional        readme.txt
1927  1957  1987  country-list.txt  software
1928  1958  1988  dsi3260.pdf       station-chart.jpg
1929  1959  1989  isd-history.csv   updates
1930  1960  1990  isd-history.txt   updates.txt
The “readme.txt” file covers it pretty well. Here’s a bit from the top:
root@dnsTorrent:/Temps/ftp.ncdc.noaa.gov/pub/data/noaa# cat readme.txt

This directory contains ISH/ISD data in directories by year. Please note that ISH and ISD refer to the same data--Integrated Surface Data, sometimes called Integrated Surface Hourly.

Updated files will be listed in the updates.txt file, and will be stored both in the /updates directory and in the respective data directories by year. However, for the current year, updates will be more frequent and will not be indicated in the updates.txt file.

Please note that all data files are compressed as indicated by the "gz" extension and can be uncompressed using Gunzip, WinZip or other similar software applications. The filenames correspond with the station numbers listed in the ish-history.txt file described below -- eg, 723150-03812-2006 corresponds with USAF number 723150 and WBAN number 03812.

Extensive updates to the dataset took place in November 2004, as Version 2.3 of the dataset was released, and periodic updates have continued since then.

Other files included in the main directory are:
and on it goes. So where are the “big lumps”? Not where I’d expected. First off, if we count the lines in a ‘listing’ we find there are 150 things to total up.
root@dnsTorrent:/Temps/ftp.ncdc.noaa.gov/pub/data/noaa# ls | wc -l
150
The distribution of “data by year” is interesting… But it is the “isd-lite” and “additional” directories that have the most stuff in them; something I learned while waiting endlessly for “additional” to complete, after I thought I had most of the “stuff” in the yearly batches (which had taken a looong time to download)… 18 Gig of “isd-lite” and 11 Gig of “additional”.
root@dnsTorrent:/Temps/ftp.ncdc.noaa.gov/pub/data/noaa# du -ms * | sort -rn
18371	isd-lite
11052	additional
6566	updates
4469	2012
4383	2011
4229	2010
4215	2014
4064	2013
3998	2009
3807	2008
3554	2007
3394	2006
3285	2005
3220	2015
2783	2004
2570	2003
2377	2002
2084	2001
1974	2000
1408	1999
1142	1998
1130	1997
1103	1996
1098	1991
1097	1992
1091	1993
1084	1995
1005	1990
984	1994
963	1988
962	1989
930	1987
906	1986
880	1985
872	1984
851	1983
819	1981
800	1982
789	1979
782	1980
774	1977
768	1976
766	1978
750	1975
720	1974
690	1973
293	1962
290	1961
289	1963
285	1960
281	1957
279	1958
278	1959
271	1954
266	1953
255	1964
252	1956
250	1952
249	1955
231	1951
212	1950
200	1949
199	1969
198	1970
169	1966
168	1965
165	1967
158	1968
137	1948
133	1971
131	ISH-DVD2012
95	1945
80	1972
80	1944
64	isd-inventory.txt
63	1943
58	1947
57	1946
51	isd-inventory.csv
34	1942
23	1941
19	1940
18	1937
16	1939
16	1936
15	1938
13	1935
12	isd-inventory.txt.z
11	isd-inventory.csv.z
11	1934
9	1933
9	1932
6	1931
3	isd-history.txt
3	isd-history.csv
2	ish-tech-report.pdf
2	1930
1	updates.txt
1	station-chart.jpg
1	software
1	readme.txt
1	NOTICE-ISD-MERGE-ISSUE.TXT
1	ish-qc.pdf
1	ishJava_ReadMe.pdf
1	ishJava.old.java
1	ishJava.old.class
1	ishJava.java
1	ishJava.class
1	ish-inventory.txt.z
1	ish-inventory.txt
1	ish-inventory.csv.z
1	ish-inventory.csv
1	ish-history.txt
1	ish-history.csv
1	ish-format-document.pdf
1	ish-format-document.doc
1	ish-abbreviated.txt
1	isd-problems.pdf
1	isd-problems.docx
1	dsi3260.pdf
1	country-list.txt
1	1929
1	1928
1	1927
1	1926
1	1925
1	1924
1	1923
1	1922
1	1921
1	1920
1	1919
1	1918
1	1917
1	1916
1	1915
1	1914
1	1913
1	1912
1	1911
1	1910
1	1909
1	1908
1	1907
1	1906
1	1905
1	1904
1	1903
1	1902
1	1901
root@dnsTorrent:/Temps/ftp.ncdc.noaa.gov/pub/data/noaa#
Not much data prior to W.W. II at all. Here is the count in kB of data sorted by year. Kind of makes me wonder what happened in 1972…
root@dnsTorrent:/Temps/ftp.ncdc.noaa.gov/pub/data/noaa# du -ks 1* 2*
76	1901
76	1902
76	1903
76	1904
72	1905
64	1906
64	1907
76	1908
88	1909
88	1910
88	1911
88	1912
104	1913
104	1914
104	1915
32	1916
104	1917
100	1918
92	1919
104	1920
100	1921
100	1922
88	1923
88	1924
88	1925
172	1926
124	1927
248	1928
816	1929
1856	1930
5368	1931
8616	1932
9152	1933
10348	1934
12468	1935
15472	1936
18296	1937
15348	1938
15944	1939
19020	1940
22872	1941
33992	1942
64368	1943
81088	1944
96320	1945
57916	1946
59124	1947
140280	1948
204436	1949
216940	1950
235816	1951
255252	1952
272056	1953
276780	1954
254144	1955
257884	1956
287332	1957
285580	1958
284004	1959
291780	1960
296548	1961
299124	1962
295648	1963
260996	1964
171432	1965
172104	1966
168152	1967
161100	1968
202792	1969
201748	1970
136076	1971
81032	1972
705788	1973
736588	1974
767920	1975
786268	1976
791928	1977
784300	1978
807800	1979
800276	1980
838408	1981
818604	1982
871176	1983
892128	1984
900176	1985
927064	1986
951728	1987
985148	1988
984864	1989
1028580	1990
1124208	1991
1122912	1992
1117148	1993
1007196	1994
1109632	1995
1129032	1996
1157036	1997
1169180	1998
1441192	1999
2020996	2000
2133168	2001
2433940	2002
2630936	2003
2849388	2004
3362872	2005
3475392	2006
3638676	2007
3897780	2008
4092976	2009
4329828	2010
4487960	2011
4575912	2012
4160708	2013
4315580	2014
3296520	2015
root@dnsTorrent:/Temps/ftp.ncdc.noaa.gov/pub/data/noaa#
In Conclusion
So that’s the look of the size of things in data land. I’m going to let the CDIAC scrape toodle along slowly and at some point let the next step of the NOAA script run. Then the question becomes “Do I archive a copy of all this and run the update, or just have a sync copy that updates?” I’ll likely do the archive with a separate update copy. It’s “only” about $40 worth of disk…
The main takeaway here, though, is just to realize that for about $200 for all the equipment, and letting your home data pipe be busy while you sleep, you, too, can have a giant archive of all the data. Just so it doesn’t tend to disappear on you (like those old v1 and v2 data sets that may have become an embarrassment to folks when compared and found different…)
And with that, I’m back to work on the next steps in too many projects ;-)
Karma is God’s way of remaining anonymous:
Representative Lamar Smith, right on the ball, demands that all IGES records be preserved for his committee on science, space and technology to investigate their activities. Those people may very well get the RICO investigation that they asked for.
Sometimes you are the windshield, sometimes you are the bug! ;-) …pg
@P.G.:
So maybe I ought to get the “How to scrape a web site” script and posting up next?!
It is a variation on the data scraper with different recursion settings to follow linked items a layer or so down.
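Roughly, the active line looks something like this (a sketch, not the final script; the URL is a stand-in, -l 2 limits recursion to a couple of layers down, -k rewrites links for local browsing, and -p pulls in the images and such a page needs to render):

wget -w 10 --limit-rate=100k -np -r -l 2 -k -p http://some.site.example/directory/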
Any thoughts on the Theory of Chaos, which seems to say that weather is a chaotic dynamic, or as summarized by Edward Lorenz: “When the present determines the future, but the approximate present does not approximately determine the future.” That sort of runs congruent with your thoughts on average temperature, I think. But maybe I have misunderstood. I definitely do not like to concentrate that hard.
The other option would be to crowd source the scrape by having several people with linux systems each take a bite of the data (several parallel scrapes running to different systems) then consolidate the individual scrapes into one archive file.
That would keep from beating your daily web connection for days or weeks.
@EMSmith; It would appear that you may have a nucleus of helpers as you move to the next step;-)…pg
@Larry L
That’s a great idea!! I’m in.
Well, I guess I’ll put together a set of what, where, and which is already done…
So far I have none of the non-USA data, so Australian, New Zealand, Canadian, Danish Arctic, Hadley, etc are all still ToDo.
I’ve done most of NOAA, but CDIAC is only part done. (I did the USHCN, then the rest was slow boat at 50 kb while I throttled up NOAA. )
There are a couple more, but I need to look up the names…
I mostly just launched NOAA and USHCN as I worked on other things, figuring I’d figure out what I didn’t really want later on my machine… but I would be smarter to figure out what was useful first and just get that.
I suppose anyone interested could just pick a site and start looking for temp related stuff, make a catalog of it, and put up a note…
Since a wget can be segmented by directory, and restarted with little lost time (use -nc to prevent updates of things already grabbed; otherwise each rerun does a resync update, and the dailies take time to overwrite…), I’ve tended to just let it run when not doing much internet stuff, and stop / restart it when I wanted an open pipe. Mostly I just let it run.
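As an example of that segmenting, a helper could take just one year directory out of the NOAA set and leave the rest to others (the year here is arbitrary; -nc keeps a restart from re-fetching anything already landed):

wget --limit-rate=100k -nc -np -r -l inf ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2012/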
The protocol is fairly polite and tends to get out of the way of browser traffic anyway. Setting a wget to limit at 110k to 150k on a 215k pipe, I have noticed little impact other than on other downloads. In some ways it just lets me use some of the bandwidth I otherwise waste while sleeping or not on the computer.
I’ll put up an updated “what I have” list in a few hours… after I sleep some…
For any future monster downloads, may I recommend the Samsung M3 drives? I picked up a 1TB one last year for just under 50 UKP, and was so impressed I went straight out and bought another one, helped along by discovering that the hardware inside them was Seagate. Pocket sized (not much bigger than its 2.5″ drive), and the darn thing is completely powered from its USB (3.0) connection. And I see there’s at least a 2TB one in the range now …
@Steve C.:
On my someday list is to do a performance compare on the 3 brands I’ve got. Western Digital, Seagate, and Toshiba.
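The compare will probably start with something crude like timed bulk writes and reads. A sketch only; the 1 GB size and file name are arbitrary, the target path would change per disk under test, and the cache drop is needed so the read actually hits the disk rather than RAM:

dd if=/dev/zero of=/Temps/ddtest.img bs=1M count=1024 conv=fsync
sync; echo 3 > /proc/sys/vm/drop_caches
dd if=/Temps/ddtest.img of=/dev/null bs=1M
rm /Temps/ddtest.img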
One interesting note is that putting swap on them is interesting… At least for the Toshiba, it tends to sleep and power down, and that then causes a hang when there’s a need to swap… (perhaps only on the swap in…). An older Western Digital doesn’t do that, and the Seagate has not caused a hang yet, but it has the data scrape running to it (which is why I moved swap there after the hangs… knowing that disk stayed active, preventing the sleep state).
Oh, and massive file copies cause a lot of memory use, including roll to swap. I’ve had up to 400 GB of swap used when doing combined scrapes and disk dups. I think that the OS is tuned to hang on to inode data due to flash disk sloth, but not smart enough to treat real disk differently. Doing a du on a few GB, the first one is slow, the second near instant, which implies the inodes are cached.
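(Watching that happen is easy enough from another window while a big copy runs; just a sketch of the usual suspects:

free -m
swapon -s
vmstat 5

The si / so columns in vmstat show the swap-in / swap-out traffic as it happens.)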
At any rate, pending testing, the Seagate seems best, WD next, and the Toshiba a bit slower and prone to falling asleep… So I put most active files on the Seagate with swap; and backups or low use files on the Toyshiba… But that could change with real data instead of hypothesis…
Not seen a Samsung disk locally, but I will look for one. The Samsung tablet has been very reliable and well made. Using it right now 8-)
Blimey O’Reilly! 400GB of swap? That sounds like “cruel and unusual punishment”, especially if it was a Pi doing it. The only way I can see of speeding things up at that scale would be to swap onto an SSD – except that for the price of half a TB of SSD you could probably have a rack of Pis with one job apiece. ;-)
I use the technique on the machine where I process audio; OK, it’s the program’s working directory rather than a swap, but the same general principle. It speeds up jobs like resampling, noise reduction and so on quite well, where the whole of a big temp file has to be processed and rewritten in its new version. There’s also the pleasure of not having to listen to your poor old spinning-rust drive clattering about all over the place finding and writing fragments of file, of course.