NOAA data scrape completes. At last.

Technically, the title is a little misleading. I cut the “ambition” of the script back to just one directory named “noaa” and that part is what finally completed. There are still a couple of other directories to scrape at that same level, but I wanted this large part “in the bag” so I could look it over. As it was, it ran for many days, for most of each day, at a data rate of about 100k to 150k. (I cut the rate back when I wanted to do other internet things.)

This is the script doing the work. As with many such, I’ve edited bits to change what it does as I’m running and restarting it. The first small bits had completed, and I saw no reason to be re-running them for updates when I was trying to get the “noaa” part done, so I commented them out (put a # as the first character). Similarly, the following bits were commented out so that when “noaa” completed, it would not keep on doing the others (potentially for a few more weeks) but would let me know by stopping. (Now I can go back, uncomment them, and let it run on them for a while.)

I’ve also changed the parameters of the wget a bit from time to time (apportioning bandwidth between processes and adjusting recursion, for example). A reminder to those not steeped in Unix / Linux – the pi@dnsTorrent is userID@machine name, the $ is the prompt after it, and “cat” is the “concatenate and print” command that is printing out the script I’ve named syncnoaa. The next line starting with #Fetch is a comment as are all lines starting with a # and they explain / document what the script is doing. In this script, the only really active line is the one “wget” without a # in front of it.

BTW, that first line needs updating as this site clearly has a whole lot more going on than just the GHCN data. The comment about CDIAC isn’t relevant to this scrape, but to the copy of this script that is doing the CDIAC data, but it is a good idea anyway and explains why that option is needed. Note that this one now does a “cd /Temps” as I’d moved all the prior data onto a dedicated USB disk mounted on /Temps.

The Script:

pi@dnsTorrent ~/bin $ cat syncnoaa
# Fetch a mirrored copy of the NOAA GHCN Daily temperature data.
# wget is the command that does the fetching.  
# It can be fed an http: address or an ftp: address.
# The -w or --wait command specifies a number of seconds to pause 
# between file fetches.  This helps to prevent over pestering a 
# server by nailing a connection constantly; while the 
# --limit-rate={size} limits via an average of pausing between 
# transfers.  Over time this is about the rate of bandwidth used, 
# but on a gaggle of small files can take a while to stabilize, 
# thus the use of both.
# Since CDIAC uses a "parent" link that points "up one" you need 
# to not follow those or you will end up duplicating the whole 
# structure ( I know... don't ask...) thus the -np or 
# --no-parent option.
# The -m or --mirror option sets a bunch of other flags (in effect)
# so as to recursively copy the entire subdirectory of the target 
# given in the address.  Fine, unless they use 'parent' a lot...
# Then you list the or 
# to clone
# Long form looks like:
# wget --wait 10 --limit-rate=100k --no-parent --mirror
#pi@dnsTorrent ~/bin $ cat syncnoaa
# but I think the --commands look silly and are for people who can't
# keep a table of 4000 things that -c does in 3074 Unix / Linux 
# commands, all of them different, in their head at all times ;-) 
# so I use the short forms.  Eventually not typing all those wasted
# letters will give me decades more time to spend on useful things,
# like comparing the merits of salami vs. prosciutto... 

cd /Temps 

#wget -w 10 --limit-rate=100k -np -m

#wget -r -N -l inf --no-remove-listing -w 10 --limit-rate=100k -np

echo Doing World Weather Records

#wget --limit-rate=100k -np -m

#wget --limit-rate=100k -nc -np -r -l inf

echo Doing World War II Data

#wget --limit-rate=100k -np -m

#wget --limit-rate=100k -nc -np -r -l inf

echo Doing NOAA set

#wget --limit-rate=100k -np -m

wget --limit-rate=100k -nc -np -r -l inf

echo Doing Global Data Bank set

#wget --limit-rate=100k -np -m

echo Doing GHCN

#wget --limit-rate=100k -np -m

echo  ALL DONE!!! 

So that “globaldatabank” isn’t done and is the next step. But I think I’ll let it rest a while before I start saturating my slow internet link for a week or two again…
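The comment-out-and-restart routine could also be handled by naming the stages on the command line. What follows is only a minimal sketch, not the script I actually ran: BASE_URL is a made-up placeholder address, the stage names are simply the directory names used in this post, and the options are spelled in long form (--no-clobber is -nc, --no-parent is -np, --recursive is -r, --level=inf is -l inf). It defaults to a dry run that only prints the command.

```sh
#!/bin/sh
# Sketch only. BASE_URL is a hypothetical placeholder, not the real server.
BASE_URL="ftp://ftp.example.org/pub/data"

# Print the long-form wget command for one stage (a directory name).
stage_command() {
    printf 'wget --limit-rate=%s --no-clobber --no-parent --recursive --level=inf %s/%s/\n' \
        "${RATE:-100k}" "$BASE_URL" "$1"
}

# Fetch one stage, or just show the command when DRY_RUN is 1 (the default).
fetch_stage() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        stage_command "$1"
    else
        wget --limit-rate="${RATE:-100k}" --no-clobber --no-parent \
             --recursive --level=inf "$BASE_URL/$1/"
    fi
}

# Work through the stages named as arguments, in order.
run_stages() {
    for stage in "$@"; do
        echo "Doing $stage"
        fetch_stage "$stage"
    done
}
```

Calling `run_stages noaa` prints what would be run; setting DRY_RUN=0 first would start the real download, so only do that once the address has been pointed at a real site.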


We’ll start with just the disk and everything on it (so far).

pi@dnsTorrent ~/bin $ df
Filesystem      1K-blocks       Used Available Use% Mounted on
/dev/sdb3       955139868  333164376 573457144  37% /Temps

So I’ve used about 1/3 of a 1 TB USB disk, with a lot more left to do. That’s about a $60 disk at Best Buy. Not exactly a ‘break the bank’ operation to get your own copy.

Now we’ll look inside it at just what all makes up that 1/3 TB. Some of it is “old copies” of GHCN and related that I’ve stashed over the years. On the “to do” list is to compare them and see “what changed”. Kind of a very slow very coarse “audit” of degree of fiddle.

pi@dnsTorrent /Temps $ ls
BEST_Temperature_Products     GHCNdaily1June2015  GISTemp	 Temperature_Data		USHCNv2.5
BUPS.RH.gistemp		   CDIAC_wget_log     GHCN_from_SG500	  lost+found	 Temperature  Data from Mac
cdiac.AntO.ndp032.txt	   GHCN		      GHCNv3_1June2015	  NOAA_wget_log  testeph.f

Quite a hodge-podge eh? A chunk from B.E.S.T., a dash of USHCNv2.5, an archive from about 1/5 decade back on my old Macintosh, a copy of GISTemp. So it goes. But how big are these things?

One of the “fun” bits about Unix / Linux (collectively *NIX) is that you can create a small little command that can run for a long time. Like that wget that took weeks. Now we’re doing “du -ms * | sort -rn ” that is rummaging through that entire 1/3 TB counting up the sizes of every single file finding the “Disk Usage” in Megabytes and Summarizing it, for “*” all names at the top level, then sending that via a pipe “|” to the sort command to sort it in Reverse Numeric order. So quick to flow off the fingers, but now I’m waiting and waiting… ( I have this in a command named “DU” for Disk Usage, that sends the output to a file, so I can usually just launch it and move on… but now I’m doing it live. Maybe I’ll go get coffee…)

root@dnsTorrent:/Temps# du -ms * | sort -rn
6926	GHCN
6624	BUPS.RH.gistemp
5491	GHCNdaily1June2015
2640	BEST_Temperature_Products
1231	Temperature  Data from Mac
1069	Temperature_Data
543	GHCN_from_SG500
164	NOAA_wget_log
112	GISTemp
58	GHCNv3_1June2015
53	USHCNv2.5
32	CDIAC_wget_log
1	testeph.f
1	lost+found
1	GHCNv1_partial
1	cdiac.AntO.ndp032.txt

Now you can see the benefit of this process. It is instantly obvious that the only really big disk users are those two site scrapes (with the cdiac.ornl one still running, but it’s only at 50k rate) and two prior archives of just NOAA_NCDC saves over the years and a variety of old GHCN copies in that archive. (Plus honorable mention for a GISTemp archive with data).

Once we are down to “just data” in GHCN Daily, it is ‘only’ 5.5 GB while the B.E.S.T. archive is only 2.6 GB. Everything else is chump change. USHCNv2.5 comes in at only 53 MB.
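For the curious, the “DU” wrapper mentioned earlier can be sketched roughly like this. It is a reconstruction for illustration, not a copy of my actual command; the function name du_report and its two arguments are made up here.

```sh
#!/bin/sh
# Sketch only: write a largest-first disk usage report (in MB) for the
# top level of a directory into a file, so it can run unattended.
# usage: du_report DIR OUTFILE
du_report() {
    # The subshell keeps the "cd" from changing the caller's directory.
    # Note that "*" skips dot-files at the top level.
    ( cd "$1" && du -ms -- * | sort -rn ) > "$2"
}
```

Something like `du_report /Temps /tmp/du_Temps.txt &` would then let it grind away in the background while you get on with other things.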

Notice the two files that end with _log, one starting with CDIAC and the other with NOAA, with _wget_ in the middle. Those are the log files of each scrape. Darned big logs! 164 MB just for the file names and tracking of the latest NOAA run. Sheesh!

Here’s a bit of the log so you can see what it looks like. The “head” command gives you the first lines of a file. The “tail” command, the last lines. (For chunks in the middle you can ‘head’ about 1/2 the file, then ‘tail’ that). I’ll start with just counting the lines in the file. (That is done with the word count command “wc” but giving it an option to just count the lines) Note that I’ve swapped over to being “root”, the superuser, as some of these files have “root” owner.

root@dnsTorrent:/Temps# wc -l NOAA_wget_log 

2289637 NOAA_wget_log


2,289,637 lines of log file. I think I’ll not read the whole thing… ;-)
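The head-then-tail trick for pulling a chunk out of the middle of a big file can be wrapped up as a small function. This is a minimal sketch; the name middle_lines and its argument order are made up for illustration.

```sh
#!/bin/sh
# Sketch only: print COUNT lines of FILE starting at line START (1-based).
# usage: middle_lines FILE START COUNT
middle_lines() {
    # head keeps everything up to the last wanted line,
    # tail then keeps only the final COUNT lines of that.
    # If the file is shorter than START+COUNT-1 lines, the result
    # is simply the last COUNT lines that exist.
    head -n "$(( $2 + $3 - 1 ))" "$1" | tail -n "$3"
}
```

For example, `middle_lines NOAA_wget_log 1000000 20` would show 20 lines starting at line one million.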

Remember that in the script, I commented out some of the wget commands that had already finished, but left in the ‘echo’ comments saying they were that step. So skip over the “World Weather Records” and “World War II Data”… The -20 says to give me 20 lines off the top instead of the default of only 10 lines.

root@dnsTorrent:/Temps# head -20 NOAA_wget_log 

Doing World Weather Records

Doing World War II Data

Doing NOAA set

File `' already there; not retrieving.
Removed `'.
File `' already there; not retrieving.
File `' already there; not retrieving.
File `' already there; not retrieving.
File `' already there; not retrieving.
File `' already there; not retrieving.
File `' already there; not retrieving.
File `' already there; not retrieving.
File `' already there; not retrieving.
File `' already there; not retrieving.

When restarted, wget is smart enough to skip files it has already copied if you tell it to do that.

Here’s the bottom bit:

root@dnsTorrent:/Temps# tail -20 NOAA_wget_log 
 50850K .......... .......... .......... .......... .......... 99%  104K 2s
 50900K .......... .......... .......... .......... .......... 99%  104K 1s
 50950K .......... .......... .......... .......... .......... 99%  104K 1s
 51000K .......... .......... .......... .......... .......... 99%  104K 0s
 51050K .......... .......... .......... .......... ........  100%  102K=8m31s

2015-10-01 04:56:13 (99.9 KB/s) - `' saved [52325312]

FINISHED --2015-10-01 04:56:13--
Total wall clock time: 1d 18h 7m 50s
Downloaded: 104140 files, 9.7G in 1d 3h 27m 17s (103 KB/s)

Doing Global Data Bank set

Doing GHCN


Again, note the blanks on “Global Data Bank set” and “GHCN” that were deliberately commented out in the script. Working up from there, it says the last run downloaded 104,140 files for 9.7 G and took a little over a day to do it. That is ONLY from the restart after moving the scrape from the PiM2 to the Pi_B+ and restarting… Each file gets a progress report (those top lines) and a summary when done that includes the average data rate (about 100 KB/s) and the file name / size.
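Since every finished file leaves a “saved [bytes]” line in the log, those lines can be tallied without reading the whole thing. A minimal sketch, assuming the log lines look like the ones shown in this post; the function name is made up, and I have not checked it against every variant of wget output.

```sh
#!/bin/sh
# Sketch only: count the "saved [N]" lines in a wget log and add up N.
# usage: summarize_wget_log LOGFILE
summarize_wget_log() {
    grep ' saved \[' "$1" |
        sed -e 's/.* saved \[\([0-9]*\).*/\1/' |
        awk '{ n++; bytes += $1 }
             END { printf "%d files, %.0f bytes\n", n, bytes }'
}
```

Files skipped as “already there” have no “saved” line, so they are not counted.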

Here’s a little bit from the middle toward the end. Since we know it was a mostly finished restart, we’ll get a chunk from the end first, then take the top of that bit:

root@dnsTorrent:/Temps# tail -10000 NOAA_wget_log | head -20
--2015-10-01 04:19:20--
           => `'
==> CWD not required.
==> PASV ... done.    ==> RETR 951809-99999-1968.gz ... done.
Length: 8284 (8.1K)

     0K ........                                              100%  187K=0.04s

2015-10-01 04:19:20 (187 KB/s) - `' saved [8284]

--2015-10-01 04:19:20--
           => `'
==> CWD not required.
==> PASV ... done.    ==> RETR 951809-99999-1969.gz ... done.
Length: 14334 (14K)

     0K .......... ...                                        100%  223K=0.06s

2015-10-01 04:19:21 (223 KB/s) - `' saved [14334]


That’s what most of the process looks like. A long list of individual file transfers. That last one being a very small file at 14k, it only reached 223 KB/s of transfer as time was spent waiting for the overhead of set-up and tear-down. A lot of small files looks like not much speed, while a single large file saturates the pipe. It happened at 4:19 A.M. while I was sleeping (yay!) and fetched the file named “951809-99999-1969.gz” from the /pub/data/noaa/updates directory. It is in compressed “gzip” format (that .gz). What’s in it? Who knows… I’d guess an update of temperature data from 1969 for station 951809 or something similar. It will take a good long while to figure out what all is in the scrape. Much of it will likely be left compressed on disk just as a “someday” archive, after inspection of a bit shows it to be “uninteresting” to me.
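The readme.txt quoted further down says names like 723150-03812-2006 are the USAF number, the WBAN number, then the year. Assuming the files in the log follow that same pattern (I have not checked every one), a name can be split apart with plain shell parameter expansion. A minimal sketch; the function name is made up.

```sh
#!/bin/sh
# Sketch only: split a name such as 723150-03812-2006.gz into its parts,
# assuming the USAF-WBAN-year pattern described in readme.txt.
parse_isd_name() {
    base=${1##*/}       # drop any leading directory
    base=${base%.gz}    # drop the compression suffix
    usaf=${base%%-*}    # text before the first dash
    rest=${base#*-}     # text after the first dash
    wban=${rest%%-*}
    year=${rest#*-}
    echo "USAF=$usaf WBAN=$wban year=$year"
}
```

So `parse_isd_name 951809-99999-1969.gz` reports USAF 951809, WBAN 99999 and year 1969.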

Inside NOAA subset

So what all is inside that biggest lump?

root@dnsTorrent:/Temps/ du -ms *
49983	ghcn
121929	noaa
17	ww-ii-data
107	wwr

( I didn’t bother to sort this small a listing…)

We can see that the W.W.II data are the smallest at 17 MB, while “World Weather Records” is also small at 107 MB. At some point I’ll take the “#” off those wget lines and let them update on each run. I’d also downloaded GHCN in a prior run, and it is about 50 GB. Most of that is a couple of copies of the daily data. Then this last batch of ‘noaa’ at 122 GB. So what’s in THOSE two?

root@dnsTorrent:/Temps/ du -ms * | sort -rn
47787	daily
1967	v3
115	blended
61	v2
30	forts
15	grid_gpcp_1979-2002.dat
4	v1
4	Lawrimore-ISTI-30Nov11.ppt
3	anom
2	snow
1	alaska-temperature-means.txt
1	alaska-temperature-anomalies.txt

For GHCN it is mostly daily data, then the V3 massaged results. That “ghcnd_all.tar.gz” is a handful at 334 MB. Then you also have it “by_year” and it doesn’t get any smaller when listed one year at a time (but you can choose to only download the newest year… if you already have the others). The “all” directory looks to me like it has daily records for each individual station, so you can just get the stations you care about. Basically, you get the same Gigs or so of data several times in several different sorts.

Note that in the “ls -l” listing the “size” of a directory (lines starting with d) is the size of the directory tracking the data blocks, not the total of the actual data blocks sizes. The “all” is 25 GB of data, tracked in 3.6 MB of directory structure. For non-directory files, the “ls -l” lists the actual size of the file.

root@dnsTorrent:/Temps/ ls -l daily
total 784852
drwxr-xr-x 2 pi pi   3608576 Sep  6 03:54 all
drwxr-xr-x 2 pi pi     12288 Sep  6 19:42 by_year
-rw-r--r-- 1 pi pi     34304 Apr 19  2011 COOPDaily_announcement_042011.doc
-rw-r--r-- 1 pi pi    125034 Apr 19  2011 COOPDaily_announcement_042011.pdf
-rw-r--r-- 1 pi pi     68083 Apr 19  2011 COOPDaily_announcement_042011.rtf
drwxr-xr-x 2 pi pi      4096 Sep  6 19:42 figures
-rw-r--r-- 1 pi pi 334052970 Sep  4 03:43 ghcnd_all.tar.gz
-rw-r--r-- 1 pi pi      3670 Jun 23 01:59 ghcnd-countries.txt
-rw-r--r-- 1 pi pi 142902064 Sep  4 03:43 ghcnd_gsn.tar.gz
-rw-r--r-- 1 pi pi 287989756 Sep  4 03:43 ghcnd_hcn.tar.gz
-rw-r--r-- 1 pi pi  26289506 Sep  2 04:28 ghcnd-inventory.txt
-rw-r--r-- 1 pi pi      1086 May 15  2011 ghcnd-states.txt
-rw-r--r-- 1 pi pi   8431010 Sep  2 04:28 ghcnd-stations.txt
-rw-r--r-- 1 pi pi       270 Sep  4 03:43 ghcnd-version.txt
drwxr-xr-x 3 pi pi      4096 Sep  6 20:50 grid
drwxr-xr-x 2 pi pi     36864 Sep  7 02:51 gsn
drwxr-xr-x 2 pi pi     40960 Sep  7 05:47 hcn
drwxr-xr-x 2 pi pi      4096 Sep  7 05:47 papers
-rw-r--r-- 1 pi pi     24088 Jul  9 06:56 readme.txt
-rw-r--r-- 1 pi pi     29430 Jul  8 02:31 status.txt
root@dnsTorrent:/Temps/ du -ms *
24925	all
13624	by_year
1	COOPDaily_announcement_042011.doc
1	COOPDaily_announcement_042011.pdf
1	COOPDaily_announcement_042011.rtf
8	figures
319	ghcnd_all.tar.gz
1	ghcnd-countries.txt
137	ghcnd_gsn.tar.gz
275	ghcnd_hcn.tar.gz
26	ghcnd-inventory.txt
1	ghcnd-states.txt
9	ghcnd-stations.txt
1	ghcnd-version.txt
5203	grid
865	gsn
2395	hcn
8	papers
1	readme.txt
1	status.txt
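The difference between the “ls -l” size of a directory and the size of what is inside it can be put side by side. A minimal sketch; the function name is made up, and it assumes the size is the fifth field of “ls -ld” output, which holds for the listings shown here.

```sh
#!/bin/sh
# Sketch only: print "<bytes used by the directory's own entry list>
# <kB used by everything inside it>" for one directory.
# usage: dir_vs_contents DIR
dir_vs_contents() {
    entry_bytes=$(ls -ld "$1" | awk '{ print $5 }')
    contents_kb=$(du -sk "$1" | awk '{ print $1 }')
    echo "$entry_bytes $contents_kb"
}
```

Run against “all”, the first number would be the few MB of directory structure and the second the roughly 25 GB of data it tracks.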

IMHO, this kind of “size of data” information would be very nice to have on their FTP site so folks knew what they were asking for with any given download. But now you have it.

What about those other v1 and v2 entries? Why are they so small? Well, simply because they took out the data. Note that they have various ghcnm.tavg*, ghcnm.tmax* and ghcnm.tmin* entries under GHCN v3, but the temperature files are now missing from v1 and v2. (No worries, though, I’ve saved copies over the years AND I found the GHCN v1 on the CDIAC site, who may still have v2 somewhere.)

root@dnsTorrent:/Temps/ ls v1
data.for.Z  invent.for.Z      press.sea.statinv.Z  press.sta.statinv.Z  README.ghcn~
flag.for.Z  press.sea.flag.Z  press.sta.flag.Z     README.ghcn

root@dnsTorrent:/Temps/ ls v2
grid  v2.prcp.failed.qc.Z  v2.prcp.readme  zipd
source	v2.prcp_adj.Z	  v2.prcp.inv	       v2.prcp.Z

root@dnsTorrent:/Temps/ ls v3
archives		      ghcnm.tavg.latest.qcu.tar.gz  ghcnm.tmin.latest.qca.tar.gz  grid	    software
country-codes		      ghcnm.tmax.latest.qca.tar.gz  ghcnm.tmin.latest.qcu.tar.gz  products  status.txt
ghcnm.tavg.latest.qca.tar.gz  ghcnm.tmax.latest.qcu.tar.gz  GHCNM-v3.2.0-FAQ.pdf	  README    techreports

The “noaa” directory is even more busy. It has some kind of archive by year that is the bulk of it, then some other files.

root@dnsTorrent:/Temps/ ls
1901  1931  1961  1991		    isd-inventory.csv
1902  1932  1962  1992		    isd-inventory.csv.z
1903  1933  1963  1993		    isd-inventory.txt
1904  1934  1964  1994		    isd-inventory.txt.z
1905  1935  1965  1995		    isd-lite
1906  1936  1966  1996		    isd-problems.docx
1907  1937  1967  1997		    isd-problems.pdf
1908  1938  1968  1998		    ish-abbreviated.txt
1909  1939  1969  1999		    ISH-DVD2012
1910  1940  1970  2000		    ish-format-document.doc
1911  1941  1971  2001		    ish-format-document.pdf
1912  1942  1972  2002		    ish-history.csv
1913  1943  1973  2003		    ish-history.txt
1914  1944  1974  2004		    ish-inventory.csv
1915  1945  1975  2005		    ish-inventory.csv.z
1916  1946  1976  2006		    ish-inventory.txt
1917  1947  1977  2007		    ish-inventory.txt.z
1918  1948  1978  2008		    ishJava.class
1919  1949  1979  2009
1920  1950  1980  2010		    ishJava.old.class
1921  1951  1981  2011
1922  1952  1982  2012		    ishJava_ReadMe.pdf
1923  1953  1983  2013		    ish-qc.pdf
1924  1954  1984  2014		    ish-tech-report.pdf
1925  1955  1985  2015		    NOTICE-ISD-MERGE-ISSUE.TXT
1926  1956  1986  additional	    readme.txt
1927  1957  1987  country-list.txt  software
1928  1958  1988  dsi3260.pdf	    station-chart.jpg
1929  1959  1989  isd-history.csv   updates
1930  1960  1990  isd-history.txt   updates.txt

The “readme.txt” file covers it pretty well. Here’s a bit from the top:

root@dnsTorrent:/Temps/ cat readme.txt 
This directory contains ISH/ISD data in directories by year.  Please note that ISH and ISD refer to
the same data--Integrated Surface Data, sometimes called Integrated Surface Hourly.

Updated files will be listed in the updates.txt file, and will be stored both in the /updates directory
and in the respective data directories by year.  However, for the current year, updates will
be more frequent and will not be indicated in the updates.txt file.  Please note that all data files
are compressed as indicated by the "gz" extension and can be uncompressed using Gunzip, WinZip or other
similar software applications. 

The filenames correspond with the station numbers listed in the ish-history.txt file described below -- 
eg, 723150-03812-2006 corresponds with USAF number 723150 and WBAN number 03812.

Extensive updates to the dataset took place in November 2004, as Version 2.3 of the dataset was 
released, and periodic updates have continued since then.

Other files included in the main directory are: 

and on it goes. So where are the “big lumps”? Not where I’d expected. First off, if we count the lines in a ‘listing’ we find there are 150 things to total up.

root@dnsTorrent:/Temps/ ls | wc -l
150

The distribution of “data by year” is interesting… But it is the “isd-lite” and “additional” that have the most stuff in them. Something I learned while waiting endlessly for “additional” to complete after I thought I had most of the “stuff” in the yearly batches (that had taken a looong time to download)… 18 Gig of “isd-lite” and 11 Gig of “additional”.

root@dnsTorrent:/Temps/ du -ms * | sort -rn
18371	isd-lite
11052	additional
6566	updates
4469	2012
4383	2011
4229	2010
4215	2014
4064	2013
3998	2009
3807	2008
3554	2007
3394	2006
3285	2005
3220	2015
2783	2004
2570	2003
2377	2002
2084	2001
1974	2000
1408	1999
1142	1998
1130	1997
1103	1996
1098	1991
1097	1992
1091	1993
1084	1995
1005	1990
984	1994
963	1988
962	1989
930	1987
906	1986
880	1985
872	1984
851	1983
819	1981
800	1982
789	1979
782	1980
774	1977
768	1976
766	1978
750	1975
720	1974
690	1973
293	1962
290	1961
289	1963
285	1960
281	1957
279	1958
278	1959
271	1954
266	1953
255	1964
252	1956
250	1952
249	1955
231	1951
212	1950
200	1949
199	1969
198	1970
169	1966
168	1965
165	1967
158	1968
137	1948
133	1971
131	ISH-DVD2012
95	1945
80	1972
80	1944
64	isd-inventory.txt
63	1943
58	1947
57	1946
51	isd-inventory.csv
34	1942
23	1941
19	1940
18	1937
16	1939
16	1936
15	1938
13	1935
12	isd-inventory.txt.z
11	isd-inventory.csv.z
11	1934
9	1933
9	1932
6	1931
3	isd-history.txt
3	isd-history.csv
2	ish-tech-report.pdf
2	1930
1	updates.txt
1	station-chart.jpg
1	software
1	readme.txt
1	ish-qc.pdf
1	ishJava_ReadMe.pdf
1	ishJava.old.class
1	ishJava.class
1	ish-inventory.txt.z
1	ish-inventory.txt
1	ish-inventory.csv.z
1	ish-inventory.csv
1	ish-history.txt
1	ish-history.csv
1	ish-format-document.pdf
1	ish-format-document.doc
1	ish-abbreviated.txt
1	isd-problems.pdf
1	isd-problems.docx
1	dsi3260.pdf
1	country-list.txt
1	1929
1	1928
1	1927
1	1926
1	1925
1	1924
1	1923
1	1922
1	1921
1	1920
1	1919
1	1918
1	1917
1	1916
1	1915
1	1914
1	1913
1	1912
1	1911
1	1910
1	1909
1	1908
1	1907
1	1906
1	1905
1	1904
1	1903
1	1902
1	1901

Not much data prior to W.W. II at all. Here is the count in kB of data sorted by year. Kind of makes me wonder what happened in 1972…

root@dnsTorrent:/Temps/ du -ks 1* 2*
76	1901
76	1902
76	1903
76	1904
72	1905
64	1906
64	1907
76	1908
88	1909
88	1910
88	1911
88	1912
104	1913
104	1914
104	1915
32	1916
104	1917
100	1918
92	1919
104	1920
100	1921
100	1922
88	1923
88	1924
88	1925
172	1926
124	1927
248	1928
816	1929
1856	1930
5368	1931
8616	1932
9152	1933
10348	1934
12468	1935
15472	1936
18296	1937
15348	1938
15944	1939
19020	1940
22872	1941
33992	1942
64368	1943
81088	1944
96320	1945
57916	1946
59124	1947
140280	1948
204436	1949
216940	1950
235816	1951
255252	1952
272056	1953
276780	1954
254144	1955
257884	1956
287332	1957
285580	1958
284004	1959
291780	1960
296548	1961
299124	1962
295648	1963
260996	1964
171432	1965
172104	1966
168152	1967
161100	1968
202792	1969
201748	1970
136076	1971
81032	1972
705788	1973
736588	1974
767920	1975
786268	1976
791928	1977
784300	1978
807800	1979
800276	1980
838408	1981
818604	1982
871176	1983
892128	1984
900176	1985
927064	1986
951728	1987
985148	1988
984864	1989
1028580	1990
1124208	1991
1122912	1992
1117148	1993
1007196	1994
1109632	1995
1129032	1996
1157036	1997
1169180	1998
1441192	1999
2020996	2000
2133168	2001
2433940	2002
2630936	2003
2849388	2004
3362872	2005
3475392	2006
3638676	2007
3897780	2008
4092976	2009
4329828	2010
4487960	2011
4575912	2012
4160708	2013
4315580	2014
3296520	2015
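Oddities like 1972 can be picked out mechanically by comparing each year with the one before it. A minimal sketch, assuming input in the “size name” form that du -ks prints; the function name and the default factor of 2 are made up for illustration.

```sh
#!/bin/sh
# Sketch only: read "size name" lines on standard input and report
# neighbouring lines whose sizes differ by more than FACTOR (default 2)
# in either direction.
# usage: du -ks 1* 2* | flag_jumps [FACTOR]
flag_jumps() {
    awk -v f="${1:-2}" '
        NR > 1 && prev > 0 && ($1 > prev * f || $1 * f < prev) {
            print pname " -> " $2 ": " prev " -> " $1
        }
        { prev = $1; pname = $2 }
    '
}
```

Fed the listing above, it would flag the jump from 1972 to 1973 (81032 to 705788), along with several of the early years where the counts are tiny.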

In Conclusion

So that’s the look of the size of things in data land. I’m going to let the CDIAC scrape toodle along slowly and at some point let the next step of the NOAA script run. Then the question becomes “Do I archive a copy of all this, and run the update, or just have a sync copy that updates?” I’ll likely do the archive with a separate update copy. It’s “only” about $40 worth of disk…

The main take away here, though, is to just realize that for about $200 for all the equipment and letting your home data pipe be busy while you sleep, you, too, can have a giant archive of all the data. Just so it doesn’t tend to disappear on you (like those old v1 and v2 data sets that may have become an embarrassment to folks when compared and found different…)

And with that, I’m back to work on the next steps in too many projects ;-)



About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present "hot buttons" are the mythology of Climate Change and ancient metrology; but things change...
This entry was posted in NCDC - GHCN Issues, Tech Bits.

11 Responses to NOAA data scrape completes. At last.

  1. p.g.sharrow says:

    Representative Lamar Smith! right on the ball, demands that all IGES records be preserved for his committee on science, space and technology to investigate their activities. Those people may very very well get the RICO investigation that they asked for.
    Sometimes you are the windshield, sometimes you are the bug! ;-) …pg

  2. E.M.Smith says:


    So maybe I ought to get the “How to scrape a web site” script and posting up next?!

    It is a variation on the data scraper with different recursion settings to follow linked items a layer or so down.

  3. John Howard says:

    Any thoughts on the Theory of Chaos, which seems to say that weather is a chaotic dynamic, or as summarized by Edward Lorenz: “When the present determines the future, but the approximate present does not approximately determine the future.” That sort of runs congruent with your thoughts on average temperature, I think. But maybe I have misunderstood. I definitely do not like to concentrate that hard.

  4. Larry Ledwick says:

    The other option would be to crowd source the scrape by having several people with linux systems each take a bite of the data (several parallel scrapes running to different systems) then consolidate the individual scrapes into one archive file.

    That would keep from beating your daily web connection for days or weeks.

  5. p.g.sharrow says:

    @EMSmith; It would appear that you may have a nucleus of helpers as you move to the next step;-)…pg

  6. Paul Hanlon says:

    @Larry L

    That’s a great idea!!. I’m in.

  7. E.M.Smith says:

    Well, I guess I’ll put together a set of what, where, and which is already done…

    So far I have none of the non-USA data, so Australian, New Zealand, Canadian, Danish Arctic, Hadley, etc are all still ToDo.

    I’ve done most of NOAA, but CDIAC is only part done. (I did the USHCN, then the rest was slow boat at 50 kb while I throttled up NOAA. )

    There are a couple more, but I need to look up the names…

    I mostly just launched NOAA and USHCN as I worked on other things, figuring I’d figure out what I didn’t really want later on my machine… but I would be smarter to figure out what was useful first and just get that.

    I suppose anyone interested could just pick a site and start looking for temp related stuff, make a catalog of it, and put up a note…

    Since a wget can be segmented by directory, and restarted with little lost time (use -nc to prevent updates of daily things already grabbed or each rerun does a resync update and the dailies take time to overwrite…) I’ve tended to just let it run when not doing much internet stuff, and stop / restart when I wanted an open pipe. Mostly just let it run.

    The protocol is fairly polite and tends to get out of the way of browser traffic anyway. Setting a wget to limit at 110k to 150k of a 215 k pipe I have noticed little impact other than on other downloads. In some ways it just lets me use some of the bandwidth I otherwise waste while sleeping or not on the computer.

    I’ll put up an updated “what I have” list in a few hours… after I sleep some…

  8. Steve C says:

    For any future monster downloads, may I recommend the Samsung M3 drives? I picked up a 1TB one last year for just under 50 UKP, and was so impressed I went straight out and bought another one, helped along by discovering that the hardware inside them was Seagate. Pocket sized (not much bigger than its 2.5″ drive), and the darn thing is completely powered from its USB (3.0) connection. And I see there’s at least a 2TB one in the range now …

  9. E.M.Smith says:

    @Steve C.:

    On my someday list is to do a performance compare on the 3 brands I’ve got. Western Digital, Seagate, and Toshiba.

    One interesting note is that doing swap to them is interesting… At least for the Toshiba it tends to sleep and power down, that then causes a hang on a need to swap… (perhaps only on the swap in…). An older Western Digital doesn’t do that, and the Seagate has not caused a hang yet, but it has the data scrape running to it (why I moved swap there after the hangs… knowing that disk stayed active, preventing the sleep state).

    Oh, and massive file copies cause a lot of memory use including roll to swap. I’ve had up to 400 GB of swap used when doing combined scrapes and disk dups. I think that the OS is tuned to hang on to inode data due to flash disk sloth, but not smart enough to treat real disk differently. Doing a du on a few GB, the first one is slow, the second near instant, which implies the inodes are cached.

    At any rate, pending testing, the Seagate seems best, WD next, and the Toshiba a bit slower and prone to falling asleep… So I put most active files on the Seagate with swap; and backups or low use files on the Toyshiba… But that could change with real data instead of hypothesis…

    Not seen Samsung disks locally, but will look for them. The Samsung tablet has been very reliable and well made. Using it right now 8-)

  10. Steve C says:

    Blimey O’Reilly! 400GB of swap? That sounds like “cruel and unusual punishment”, especially if it was a Pi doing it. The only way I can see of speeding things up at that scale would be to swap onto an SSD – except that for the price of half a TB of SSD you could probably have a rack of Pis with one job apiece. ;-)

    I use the technique on the machine where I process audio. OK, it’s the program’s working directory rather than a swap, but the same general principle. It speeds up jobs like resampling, noise reduction and so on quite well, where the whole of a big temp file has to be processed and rewritten in its new version. There’s also the pleasure of not having to listen to your poor old spinning rust drive clattering about all over the place finding and writing fragments of file, of course.
