CDIAC Scrape Finished

I’ve been shoveling TB of data around to make more room for things. Along the way, all the temperature scrapes got moved to their own disk. In the last round of tidy-up (the one just done) I discovered that, part way through it, the first cut scrape of CDIAC had reached completion. I’d originally just pointed it at the USHCN data, but left out the -np (no-parent) flag, and it had proceeded to wander all the parent links grabbing all sorts of stuff. So on the 2nd or 3rd restart I decided to just let it, but at a low rate. I set the rate to 50 kB/second and let it crawl.

Here is a snip from the bottom of the script showing some of the variations over time. Lines with a “#” in the first position are commented out, but were run in the past.

#wget --limit-rate=50k -np -m

wget --limit-rate=200k -m

#wget -r -N -l inf --no-remove-listing -w 10 --limit-rate=50k -np

#wget --limit-rate=50k -nc -r -l inf

#wget -nc -r -l inf

I’ve upped the speed limit to 200k and, in this final pass, set it to just mirror everything (-m), starting in the USHCN directory and following parent links for the rest.

I did set it to -nc (no-clobber) in prior runs so that on restarts it did not grab newer versions (skipping a daily update for every restart…) until completed. As of now, I’ve restarted the ‘sync’ run with clobber set, so any changed copies of data are being recopied with the newer version. Even while that is running it will still be fast, as any unchanged files are not re-sent. Just realize that any size data will change a little by the end of the day (and even after that as data sets are updated and added). The sizes will still be a decent guide to ‘how big is what’.
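For anyone wanting to try the same thing, here is a rough sketch of how those flag combinations fit together. It only prints the commands rather than running them. The starting URL is a made-up placeholder (the real one is not shown above), and mirror_cmd is just a helper name invented for this illustration. Per the wget manual, -m is shorthand for -r -N -l inf --no-remove-listing, which is what one of the commented-out lines spells out by hand; the -N (timestamping) part is what lets a re-run skip unchanged files.

```shell
# Hypothetical starting URL; substitute the directory you actually want.
START_URL="https://example.org/pub/data/ushcn/"

# Print (do not run) a wget mirror command.
#   $1 = rate limit, e.g. 50k
#   $2 = "stay" to add -np (never climb to the parent directory);
#        anything else lets the crawl roam up through parent links
#   $3 = starting URL
mirror_cmd() {
    rate=$1
    scope=$2
    url=$3
    set -- wget --limit-rate="$rate" -m
    if [ "$scope" = "stay" ]; then
        set -- "$@" -np
    fi
    set -- "$@" "$url"
    printf '%s\n' "$*"
}

mirror_cmd 50k stay "$START_URL"    # slow, confined to the starting directory
mirror_cmd 200k roam "$START_URL"   # faster, free to follow parent links
```

Leaving off -np is the only difference between a scrape that stays put and one that wanders the whole site.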

When it is done, I’m going to restart the NOAA scrape that mostly just needs one final large directory to be done, but for now I’ve paused it as CDIAC does a final tidy up.

Sizes In Total

Here’s the message from the end of my log file:

2015-10-04 23:36:39 (50.1 KB/s) - `' saved [94156/94156]

FINISHED --2015-10-04 23:36:39--
Total wall clock time: 5d 12h 44m 42s
Downloaded: 26270 files, 22G in 5d 11h 16m 39s (48.7 KB/s)

So this particular run ended just before midnight and had been running for 5 1/2 days at about 50 kB/second. This part of the run copied 26,270 files totaling 22 GB. But how big is this thing in total, including the prior runs?

root@RaPiM2:/Temps# cd
root@RaPiM2:/Temps/ du -ms .
127896	.

Roughly 127.9 GBytes of data and stuff. (As 22 GB took 5 1/2 days, you can figure that was about a month’s worth at the slow rate, or about a week at full speed for my link.)
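The back-of-envelope timing can be checked with a little awk. This is only a sketch: days_at_rate is a name made up here, the 48.7 KB/s figure is the average from the log message above, and using the 200k cap as a stand-in for "full speed" is my assumption for the example.

```shell
# Days needed to move a given number of MB at a given rate in KB/s.
#   $1 = size in MB (as printed by du -ms)
#   $2 = rate in KB/s
days_at_rate() {
    awk -v mb="$1" -v kbps="$2" \
        'BEGIN { printf "%.1f\n", mb * 1024 / kbps / 86400 }'
}

days_at_rate 127896 48.7   # whole archive at the logged average rate
days_at_rate 127896 200    # same archive at the 200k cap
```

That works out to about 31 days at the slow rate and a bit under 8 days at 200k, which agrees with the "month" and "week" guesses.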

There’s a LOT of stuff in there, and the top level directory is a bit ‘busy’. Lots of small files, and a few directories with big data caches in them. Here’s a listing of the top level:

root@RaPiM2:/Temps/ ls -l
total 13072
-rw-r--r--   1 pi pi     1791 Oct  6 09:06 1DU_mb_out
drwxr-xr-x   2 pi pi     4096 Oct  4 22:13 about
-rw-r--r--   1 pi pi    33705 Sep  7 21:23 aerosol_parameters.html
-rw-r--r--   1 pi pi    24773 Sep  7 21:23 aerosol_particle_types.html
-rw-r--r--   1 pi pi    20161 Oct  6 10:19 aerosols.html
drwxr-xr-x   2 pi pi     4096 Oct  4 02:31 authors
drwxr-xr-x   2 pi pi     4096 Sep 14 02:38 backgrnds
drwxr-xr-x   2 pi pi     4096 Sep 10 06:31 by_new
-rw-r--r--   1 pi pi    25412 Oct  6 09:30 carbon_cycle_data.html
-rw-r--r--   1 pi pi    21691 Oct  6 09:30 carbon_cycle.html
-rw-r--r--   1 pi pi    22107 Oct  6 10:19 carbonisotopes.html
drwxr-xr-x   7 pi pi     4096 Sep 23 00:19 carbonmanagement
-rw-r--r--   1 pi pi    22875 Sep  7 21:29 carbonmanagement.1
-rw-r--r--   1 pi pi    22875 Sep 16 06:14 carbonmanagement.10
-rw-r--r--   1 pi pi    22875 Sep 16 11:34 carbonmanagement.11
-rw-r--r--   1 pi pi    22875 Sep 23 13:50 carbonmanagement.12
-rw-r--r--   1 pi pi    22875 Sep 29 11:13 carbonmanagement.13
-rw-r--r--   1 pi pi    22875 Sep  9 05:31 carbonmanagement.2
-rw-r--r--   1 pi pi    22875 Sep  9 09:58 carbonmanagement.3
-rw-r--r--   1 pi pi    22875 Sep  9 14:22 carbonmanagement.4
-rw-r--r--   1 pi pi    22875 Sep  9 17:26 carbonmanagement.5
-rw-r--r--   1 pi pi    22875 Sep 13 09:11 carbonmanagement.6
-rw-r--r--   1 pi pi    22875 Sep 13 10:37 carbonmanagement.7
-rw-r--r--   1 pi pi    22875 Sep 13 13:00 carbonmanagement.8
-rw-r--r--   1 pi pi    22875 Sep 13 17:57 carbonmanagement.9
drwxr-xr-x   3 pi pi     4096 Sep 10 06:37 cdiac
-rw-r--r--   1 pi pi   148374 Aug 19  1998
-rw-r--r--   1 pi pi    21774 Oct  6 10:19 cfcs.html
-rw-r--r--   1 pi pi    20263 Oct  6 10:19 chcl3.html
drwxr-xr-x  13 pi pi     4096 Sep 23 00:19 climate
drwxr-xr-x   4 pi pi     4096 Sep 29 11:14 CO2_Emission
-rw-r--r--   1 pi pi     3872 Sep  9 09:57 CO2_Emission.1
-rw-r--r--   1 pi pi     3872 Sep 16 11:32 CO2_Emission.10
-rw-r--r--   1 pi pi     3872 Sep 23 13:49 CO2_Emission.11
-rw-r--r--   1 pi pi     3872 Sep 29 11:12 CO2_Emission.12
-rw-r--r--   1 pi pi     3872 Oct  6 10:18 CO2_Emission.13
-rw-r--r--   1 pi pi     3872 Sep  9 12:55 CO2_Emission.2
-rw-r--r--   1 pi pi     3872 Sep  9 14:21 CO2_Emission.3
-rw-r--r--   1 pi pi     3872 Sep  9 17:26 CO2_Emission.4
-rw-r--r--   1 pi pi     3872 Sep 13 09:05 CO2_Emission.5
-rw-r--r--   1 pi pi     3872 Sep 13 10:36 CO2_Emission.6
-rw-r--r--   1 pi pi     3872 Sep 13 12:59 CO2_Emission.7
-rw-r--r--   1 pi pi     3872 Sep 13 17:56 CO2_Emission.8
-rw-r--r--   1 pi pi     3872 Sep 16 06:13 CO2_Emission.9
-rw-r--r--   1 pi pi     1061 Sep 11 02:42 comments.html
drwxr-xr-x   2 pi pi     4096 Sep 23 00:19 css
-rw-r--r--   1 pi pi   110909 Oct  6 09:30 data_catalog.html
drwxr-xr-x   3 pi pi     4096 Aug 29 05:00 datasets
-rw-r--r--   1 pi pi    21223 Oct  6 09:30 datasubmission.html
-rw-r--r--   1 pi pi    20012 Oct  6 10:19 deuterium.html
-rw-r--r--   1 pi pi     3595 Sep  7 21:19 disclaimers.html
drwxr-xr-x   9 pi pi     4096 Oct  4 10:55 epubs
-rw-r--r--   1 pi pi    24114 Sep  7 21:23 factsdata.html
-rw-r--r--   1 pi pi    72588 Oct  6 09:30 faq.html
-rw-r--r--   1 pi pi    22345 Oct  6 09:30 frequent_data_products.html
drwxr-xr-x 192 pi pi    20480 Oct  6 09:28 ftp
-rw-r--r--   1 pi pi    59493 Oct  4 21:50 ftp.1
drwxr-xr-x   2 pi pi     4096 Sep 11 02:26 ftpdir
drwxr-xr-x   4 pi pi     4096 Sep 23 00:21 GCP
-rw-r--r--   1 pi pi    91264 Sep 10 07:03 glossary.html
-rw-r--r--   1 pi pi    20774 Oct  6 10:19 halons.html
-rw-r--r--   1 pi pi    20755 Oct  6 10:19 hcfc.html
-rw-r--r--   1 pi pi    20219 Oct  6 10:19 hfcs.html
-rw-r--r--   1 pi pi    29825 Sep  7 21:23 home.html
-rw-r--r--   1 pi pi    20543 Oct  6 10:19 hydrogen.html
-rw-r--r--   1 pi pi    27149 Sep  7 21:23 ice_core_no.html
-rw-r--r--   1 pi pi    29935 Sep  7 21:23 ice_cores_aerosols.html
drwxr-xr-x   2 pi pi     4096 Sep 30 23:40 icons
drwxr-xr-x   4 pi pi    12288 Oct  4 22:14 images
drwxr-xr-x   2 pi pi     4096 Aug 29 05:00 includes
-rw-r--r--   1 pi pi    29825 Oct  6 09:29 index.html
drwxr-xr-x   2 pi pi     4096 Sep 23 00:19 js
-rw-r--r--   1 pi pi    23674 Oct  6 09:30 land_use.html
drwxr-xr-x   3 pi pi     4096 Oct  4 21:48 library
-rw-r--r--   1 pi pi    21311 Oct  6 10:19 methane.html
-rw-r--r--   1 pi pi    19970 Oct  6 10:19 methylchloride.html
-rw-r--r--   1 pi pi    20252 Oct  6 10:19 methylchloroform.html
-rw-r--r--   1 pi pi    21918 Oct  6 09:30 mission.html
-rw-r--r--   1 pi pi    39401 Sep  7 21:23 modern_aerosols.html
-rw-r--r--   1 pi pi    37182 Oct  6 10:19 modern_halogens.html
-rw-r--r--   1 pi pi    40592 Sep  7 21:23 modern_no.html
drwxr-xr-x   2 pi pi     4096 Oct  4 22:14 ndps
drwxr-xr-x   2 pi pi     4096 Oct  4 23:36 new
drwxr-xr-x  14 pi pi     4096 Oct  4 02:31 newsletr
-rw-r--r--   1 pi pi    25963 Sep 11 02:27 newsletter.html
-rw-r--r--   1 pi pi    20620 Oct  6 10:19 no.html
drwxr-xr-x  53 pi pi    12288 Oct  4 21:49 oceans
-rw-r--r--   1 pi pi    14609 Sep 10 06:51 oceans.1
-rw-r--r--   1 pi pi    14609 Sep 13 09:18 oceans.2
-rw-r--r--   1 pi pi    14609 Sep 13 10:40 oceans.3
-rw-r--r--   1 pi pi    14609 Sep 13 13:01 oceans.4
-rw-r--r--   1 pi pi    14609 Sep 13 18:00 oceans.5
-rw-r--r--   1 pi pi    14609 Sep 16 06:15 oceans.6
-rw-r--r--   1 pi pi    14609 Sep 16 11:37 oceans.7
-rw-r--r--   1 pi pi    14609 Sep 23 15:13 oceans.8
-rw-r--r--   1 pi pi    14609 Sep 29 11:15 oceans.9
-rw-r--r--   1 pi pi    20501 Oct  6 10:19 oxygenisotopes.html
-rw-r--r--   1 pi pi    19963 Oct  6 10:19 ozone.html
-rw-r--r--   1 pi pi    20328 Oct  6 09:30 permission.html
drwxr-xr-x   2 pi pi     4096 Oct  4 02:32 pns
drwxr-xr-x   6 pi pi     4096 Oct  4 21:49 programs
-rw-r--r--   1 pi pi    33915 Oct  6 09:30 recent_publications.html
drwxr-xr-x   3 pi pi     4096 Aug 29 05:00 science-meeting
-rw-r--r--   1 pi pi      804 Sep 11 02:42 search.html
-rw-r--r--   1 pi pi    20218 Oct  6 10:19 sfsix.html
drwxr-xr-x   5 pi pi     4096 Oct  4 21:48 SOCCR
-rw-r--r--   1 pi pi    25392 Oct  6 09:30 staff.html
-rw-r--r--   1 pi pi    20205 Oct  6 10:19 tetrachloroethene.html
-rw-r--r--   1 pi pi    24615 Oct  6 09:30 trace_gas_emissions.html
-rw-r--r--   1 pi pi    22496 Oct  6 09:30 tracegases.html
drwxr-xr-x  16 pi pi     4096 Oct  4 21:48 trends
-rw-r--r--   1 pi pi    22899 Oct  6 09:30 vegetation.html
drwxr-xr-x   2 pi pi     4096 Sep 23 00:21 wdca
-rw-r--r--   1 pi pi    14568 Sep 10 07:03 wdcinfo.html
-rw-r--r--   1 pi pi    39921 Oct  6 09:30 whatsnew.html
-rw-r--r--   1 pi pi 11075997 Sep 11 02:44 wwwstat.html

Just as a reminder, lines starting with a ‘d’ are directories full of stuff; lines starting with a ‘-’ are just ordinary files. For a file, the size shown is the size of the file. For a directory, it is the size of the directory structure itself, NOT including the data files saved inside it. For example, wdca is shown as a 4k block. (That’s one disk block on this file system, holding the directory’s list of file names and pointers to their ‘inodes’, the index nodes that carry each file’s metadata; one block is enough for a modest number of files.) What’s in wdca?

root@RaPiM2:/Temps/ ls -l wdca
total 36
-rw-r--r-- 1 pi pi 31841 Oct  6 09:30 wdcinfo.html
-rw-r--r-- 1 pi pi  2350 Mar 29  1999 wdclogo.jpg

Two files of about 34 kB total size. Looks like a web page (.html) and a graphic (.jpg) in it.
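To see the difference between what ‘ls’ reports for a directory and what is actually stored in it, something like the following can be tried on a scratch directory. It is a sketch only; the two helper names are invented for this example and the scratch directory is created with mktemp.

```shell
# Size ls reports for the directory entry itself, in bytes.
dir_entry_bytes() { ls -ld -- "$1" | awk '{ print $5 }'; }

# Space used by the directory plus everything in it, in KB.
dir_usage_kb() { du -sk -- "$1" | cut -f1; }

# Demo: a scratch directory holding one 100 kB file of random bytes.
demo=$(mktemp -d)
head -c 102400 /dev/urandom > "$demo/blob"
echo "ls says: $(dir_entry_bytes "$demo") bytes for the directory itself"
echo "du says: $(dir_usage_kb "$demo") KB including contents"
rm -rf "$demo"
```

The ‘ls’ figure stays small no matter how much data sits inside, which is why ‘du’ is the tool for the size listings that follow.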

Here’s a sorted list of ‘big lumps’, cut off at a convenient place:

root@RaPiM2:/Temps/ cat 1DU_mb_out 
125576	ftp
574	oceans
167	epubs
74	trends
25	programs
22	carbonmanagement
19	newsletr
16	images
11	wwwstat.html
4	science-meeting
3	ndps
2	datasets
1	whatsnew.html
1	wdcinfo.html
1	wdca
1	vegetation.html
1	tracegases.html
1	trace_gas_emissions.html
1	tetrachloroethene.html
1	staff.html
1	sfsix.html
1	search.html
1	recent_publications.html
1	pns
1	permission.html
1	ozone.html
1	oxygenisotopes.html
1	oceans.9

Everything from there on down just shows as 1 MB as I counted these up in 1 MB chunks. (du -ms *)

As you can see, almost everything is in the ftp directory.

Even in that directory, it is a long list. I’m going to put all of it here. Names ending in a numeric suffix (.1, .2, …) seem, on random checks of a couple, to be web pages describing the associated data. Since MB chunks are not enlightening about the smaller files, I’m going to count up sizes in KB for the ftp directory. You can see that almost all of the data bulk is in ‘ameriflux’ and ‘oceans’ at 66 GB and 20 GB respectively.

root@RaPiM2:/Temps/ du -ks * | sort -rn
66151664	ameriflux
20475520	oceans
11390940	ndp026c
3568792	us_recordtemps
3043020	nlcd92
2739560	ushcn_snow
2435772	nlcd2001
2187568	ndp026b
2098636	FACE
1453216	ale_gage_Agage
1202680	ndp026d
1066480	Atul_Jain_etal_Land_Use_Fluxes
1045564	ushcn_daily
985912	ndp068
869632	russia_daily
765300	ndp048
667460	ndp088
667168	ndp048r1
564988	ndp048r0
502224	ndp076
491396	db1013_v2011
485332	ndp040
357004	images
289244	global_carbon
284736	ndp026e
277492	ndp070
276784	ndp081
233812	ndp005a
227204	ndp055
200344	CDIAC_UWG_Presentations_Sept2010
166652	Nassar_Emissions_Scale_Factors
163064	ICRCCM-radiative_fluxes
142736	ushcn_v2.5_monthly
127356	db1005
90392	fossil_fuel_CO2_emissions_gridded_monthly_v2009
89740	fossil_fuel_CO2_emissions_gridded_monthly_del13C_v2009
82816	ndp026a
82380	fossil_fuel_CO2_emissions_gridded_monthly_del13C_v2013
82344	fossil_fuel_CO2_emissions_gridded_monthly_v2013
81068	fossil_fuel_CO2_emissions_gridded_monthly_del13C_v2012
81032	fossil_fuel_CO2_emissions_gridded_monthly_v2012
79712	fossil_fuel_CO2_emissions_gridded_monthly_del13C_v2011
79676	fossil_fuel_CO2_emissions_gridded_monthly_v2011
78920	ndp017b
78296	fossil_fuel_CO2_emissions_gridded_monthly_del13C_v2010
78236	fossil_fuel_CO2_emissions_gridded_monthly_v2010
77216	ndp005
69740	ndp064
68568	db1019
66928	cdiacpubs
53860	ndp059
47116	Tris_West_US_County_Level_Cropland_C_Estimates
45624	ndp020
40328	ushcn_v2_monthly
36500	ndp078a
35468	ndp041
35188	ndp080
35052	ndp035
29824	db1015
27852	cdiac129
24788	ndp043c
24596	ndp054
23732	ndp043a
23232	ndp046
22708	trends
22156	ndp074
21712	ndp037
21008	ndp026
20760	ndp065
20056	ndp058_v2009
19948	db1013_v2009
19444	ndp075
18040	ndp018
17836	db1013_v2013
17792	ndp058_v2013
17728	db1013_v2012
17704	ndp058_v2012
17580	ndp058_v2011
17484	ndp058_v2010
17348	ndp042
17092	ndp090
16828	ndp058
14892	trends93
14784	ndp043b
14176	HIPPO
13700	ndp039
13468	ndp067
13100	ndp030
13100	fossilfuel-co2-emissions
12384	ndp055b
11968	ndp082
11892	ndp047
11032	ndp044
10776	cdiac140
10312	db1012
9760	tr051
9372	ndp062
9072	ndp004
8332	ndp056
7908	cmp002
7532	ndp086
7456	bibliography
7376	ndp049
7112	ndp071
6916	ndp089
6876	ndp051
6792	ndp011
6752	ndp017
6544	ndp027
6216	ndp066
6140	ndp021
6060	ndp053
5928	ndp057a
5772	CSEQ
5672	maps-co
5672	db1020
5492	ndp087
5356	tr055
5056	ndp060
4952	maunloa.calibration.tar.Z
4352	ndp052
4148	ndp077
4020	Global_Carbon_Project
3948	ndp084
3932	db1009
3844	co2sys
3808	db1016
3764	ndp063
3720	ndp036
3612	ndp045
3420	db1021
3356	db1008
3352	maunaloa.hourly5886
3312	ndp001a
3164	ndp085
2592	ndp009
2572	cdiac74
2540	ndp032
2480	ndp057
2420	ndp079
2372	db1007
2276	ndp006
2212	ndp025
2084	er0649t
2024	Smith_Rothwell_Land-Use_Change_Emissions
1988	ndp001
1988	maunaloa-co2
1888	db1004
1796	ndp033
1600	ndp058a
1560	db1011
1484	db1017
1452	GISS3-D
1428	ndp007
1284	ndp061a
1212	ndp073
1140	ndp050
1064	ndp072
968	ndp013
912	db1013
896	tdemodel
460	quay_dc13_ch4
400	cdiac130
312	ndp028
196	methyl_chloride-khalil_rasmussen
196	db1010
144	ndp029
128	ndp034
112	db1022
104	ndp023
84	ndp003
84	ndp002
80	ndp022
80	cdiac136
68	ndp014
68	db1014
64	db1017.1
60	index.html?C=S;O=D
60	index.html?C=S;O=A
60	index.html?C=N;O=D
60	index.html?C=N;O=A
60	index.html?C=M;O=D
60	index.html?C=M;O=A
60	index.html?C=D;O=D
60	index.html?C=D;O=A
60	index.html
56	ndp058.1
52	ndp048.1
52	ndp026b.9
52	ndp026b.8
52	ndp026b.7
52	ndp026b.6
52	ndp026b.5
52	ndp026b.4
52	ndp026b.3
52	ndp026b.2
52	ndp026b.1
48	ndp040.1
44	db1018
44	db1016.9
44	db1016.8
44	db1016.7
44	db1016.6
44	db1016.5
44	db1016.4
44	db1016.3
44	db1016.2
44	db1016.12
44	db1016.11
44	db1016.10
44	db1016.1
40	ndp034r1
40	ndp030r8
40	ndp022r2
40	ndp021r1
40	ndp020r1
40	ndp019r3
40	ndp019
40	ndp008r4
40	ndp008
40	ndp005r3
40	ndp004r1
40	ndp003r1
40	ndp001r7
40	db1013_v2010
40	cdiac115
24	ale_gage_Agage.1
20	ndp044.1
20	ndp039.8
20	ndp039.7
20	ndp039.6
20	ndp039.5
20	ndp039.4
20	ndp039.3
20	ndp039.2
20	ndp039.1
20	ndp005a.9
20	ndp005a.8
20	ndp005a.7
20	ndp005a.6
20	ndp005a.5
20	ndp005a.4
20	ndp005a.3
20	ndp005a.2
20	ndp005a.1
16	ushcn_daily.8
16	ushcn_daily.7
16	ushcn_daily.6
16	ushcn_daily.5
16	ushcn_daily.4
16	ushcn_daily.3
16	ushcn_daily.20
16	ushcn_daily.2
16	ushcn_daily.19
16	ushcn_daily.18
16	ushcn_daily.17
16	ushcn_daily.16
16	ushcn_daily.15
16	ushcn_daily.14
16	ushcn_daily.13
16	ushcn_daily.12
16	ushcn_daily.10
16	ushcn_daily.1
16	ndp070.5
16	ndp070.4
16	ndp070.3
16	ndp070.2
16	ndp070.1
16	moisture.indices.prc.dat
12	ndp068.1
12	ndp055.9
12	ndp055.8
12	ndp055.7
12	ndp055.6
12	ndp055.5
12	ndp055.4
12	ndp055.3
12	ndp055.2
12	ndp055.1
12	ndp035.1
8	ushcn_v2.5_monthly.9
8	ushcn_v2.5_monthly.8
8	ushcn_v2.5_monthly.7
8	ushcn_v2.5_monthly.6
8	ushcn_v2.5_monthly.5
8	ushcn_v2.5_monthly.4
8	ushcn_v2.5_monthly.3
8	ushcn_v2.5_monthly.2
8	ushcn_v2.5_monthly.12
8	ushcn_v2.5_monthly.11
8	ushcn_v2.5_monthly.10
8	ushcn_v2.5_monthly.1
8	quay_dc13_ch4.8
8	quay_dc13_ch4.7
8	quay_dc13_ch4.6
8	quay_dc13_ch4.5
8	quay_dc13_ch4.4
8	quay_dc13_ch4.3
8	quay_dc13_ch4.2
8	quay_dc13_ch4.1
8	ndp078a.1
8	ndp076.1
8	ndp067.1
8	ndp064.1
8	ndp061a.1
8	ndp059.1
8	ndp058a.1
8	ndp057.1
8	ndp047.1
8	ndp043c.1
8	ndp043b.1
8	ndp043a.1
8	ndp042.5
8	ndp042.4
8	ndp042.3
8	ndp042.2
8	ndp042.1
8	ndp041.1
8	ndp032.1
8	ndp026c.1
8	ndp026a.9
8	ndp026a.8
8	ndp026a.7
8	ndp026a.6
8	ndp026a.5
8	ndp026a.4
8	ndp026a.3
8	ndp026a.2
8	ndp026a.12
8	ndp026a.11
8	ndp026a.10
8	ndp026a.1
8	ndp011.1
8	ndp009.1
8	mlo88.dat
8	db1015.1
8	1DU_mb_out
4	ushcn_daily.9
4	ushcn_daily.11
4	russia_daily.9
4	russia_daily.8
4	russia_daily.7
4	russia_daily.6
4	russia_daily.5
4	russia_daily.4
4	russia_daily.3
4	russia_daily.2
4	russia_daily.12
4	russia_daily.11
4	russia_daily.10
4	russia_daily.1
4	ndp077.1
4	ndp074.1
4	ndp073.1
4	ndp072.1
4	ndp071.1
4	ndp066.1
4	ndp065.1
4	ndp063.1
4	ndp062.1
4	ndp060.1
4	ndp057a.9
4	ndp057a.8
4	ndp057a.7
4	ndp057a.6
4	ndp057a.5
4	ndp057a.4
4	ndp057a.3
4	ndp057a.2
4	ndp057a.1
4	ndp056.1
4	ndp054.1
4	ndp053.1
4	ndp052.1
4	ndp051.1
4	ndp050.1
4	ndp049.9
4	ndp049.8
4	ndp049.7
4	ndp049.6
4	ndp049.5
4	ndp049.4
4	ndp049.3
4	ndp049.2
4	ndp049.13
4	ndp049.12
4	ndp049.11
4	ndp049.10
4	ndp049.1
4	ndp046.1
4	ndp045.1
4	ndp037.1
4	ndp036.1
4	ndp034.1
4	ndp033.1
4	ndp030.9
4	ndp030.8
4	ndp030.7
4	ndp030.6
4	ndp030.5
4	ndp030.4
4	ndp030.3
4	ndp030.2
4	ndp030.12
4	ndp030.11
4	ndp030.10
4	ndp030.1
4	ndp029.1
4	ndp028.1
4	ndp027.1
4	ndp026.9
4	ndp026.8
4	ndp026.7
4	ndp026.6
4	ndp026.5
4	ndp026.4
4	ndp026.3
4	ndp026.2
4	ndp026.13
4	ndp026.12
4	ndp026.11
4	ndp026.10
4	ndp026.1
4	ndp025.1
4	ndp023.9
4	ndp023.8
4	ndp023.7
4	ndp023.6
4	ndp023.5
4	ndp023.4
4	ndp023.3
4	ndp023.2
4	ndp023.12
4	ndp023.11
4	ndp023.10
4	ndp023.1
4	ndp022.1
4	ndp021.1
4	ndp020.1
4	ndp019.1
4	ndp018.1
4	ndp017.9
4	ndp017.8
4	ndp017.7
4	ndp017.6
4	ndp017.5
4	ndp017.4
4	ndp017.3
4	ndp017.2
4	ndp017.1
4	ndp014.1
4	ndp013.1
4	ndp008.1
4	ndp007.9
4	ndp007.8
4	ndp007.7
4	ndp007.6
4	ndp007.5
4	ndp007.4
4	ndp007.3
4	ndp007.2
4	ndp007.12
4	ndp007.11
4	ndp007.10
4	ndp007.1
4	ndp006.1
4	ndp005.1
4	ndp004.1
4	ndp002.1
4	ndp001.9
4	ndp001.8
4	ndp001.7
4	ndp001.6
4	ndp001.5
4	ndp001.4
4	ndp001.3
4	ndp001.2
4	ndp001.13
4	ndp001.12
4	ndp001.11
4	ndp001.10
4	ndp001.1
4	methyl_chloride-khalil_rasmussen.9
4	methyl_chloride-khalil_rasmussen.8
4	methyl_chloride-khalil_rasmussen.7
4	methyl_chloride-khalil_rasmussen.6
4	methyl_chloride-khalil_rasmussen.5
4	methyl_chloride-khalil_rasmussen.4
4	methyl_chloride-khalil_rasmussen.3
4	methyl_chloride-khalil_rasmussen.2
4	methyl_chloride-khalil_rasmussen.13
4	methyl_chloride-khalil_rasmussen.12
4	methyl_chloride-khalil_rasmussen.11
4	methyl_chloride-khalil_rasmussen.10
4	methyl_chloride-khalil_rasmussen.1
4	images.1
4	ICRCCM-radiative_fluxes.1
4	GISS3-D.1
4	db1021.9
4	db1021.8
4	db1021.7
4	db1021.6
4	db1021.5
4	db1021.4
4	db1021.3
4	db1021.2
4	db1021.13
4	db1021.12
4	db1021.11
4	db1021.10
4	db1021.1
4	db1020.9
4	db1020.8
4	db1020.7
4	db1020.6
4	db1020.5
4	db1020.4
4	db1020.3
4	db1020.2
4	db1020.12
4	db1020.11
4	db1020.10
4	db1020.1
4	db1019.9
4	db1019.8
4	db1019.7
4	db1019.6
4	db1019.5
4	db1019.4
4	db1019.3
4	db1019.2
4	db1019.12
4	db1019.11
4	db1019.10
4	db1019.1
4	db1018.1
4	db1014.9
4	db1014.8
4	db1014.7
4	db1014.6
4	db1014.5
4	db1014.4
4	db1014.3
4	db1014.2
4	db1014.13
4	db1014.12
4	db1014.11
4	db1014.10
4	db1014.1
4	db1013.1
4	db1012.1
4	db1011.1
4	db1010.9
4	db1010.8
4	db1010.7
4	db1010.6
4	db1010.5
4	db1010.4
4	db1010.3
4	db1010.2
4	db1010.12
4	db1010.11
4	db1010.10
4	db1010.1
4	db1009.1
4	db1008.1
4	db1007.9
4	db1007.8
4	db1007.7
4	db1007.6
4	db1007.5
4	db1007.4
4	db1007.3
4	db1007.2
4	db1007.13
4	db1007.12
4	db1007.11
4	db1007.10
4	db1007.1
4	db1005.9
4	db1005.8
4	db1005.7
4	db1005.6
4	db1005.5
4	db1005.4
4	db1005.3
4	db1005.2
4	db1005.13
4	db1005.12
4	db1005.11
4	db1005.10
4	db1005.1
4	db1004.1
4	co2sys.1
4	cmp002.1
4	cdiac129.1
4	bibliography.1
4	ameriflux.1
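As a rough cross-check on the two big numbers at the top of that listing: du -k counts in 1024-byte units, so dividing by 1024 × 1024 gives GiB. The sketch below does that with awk; kb_to_gib is a name made up for this example, and the two input lines are copied from the listing above.

```shell
# Read "size_in_KB<TAB>name" lines (du -ks output) and print sizes in GiB.
kb_to_gib() {
    awk -F'\t' '{ printf "%.1f GiB\t%s\n", $1 / 1048576, $2 }'
}

printf '66151664\tameriflux\n20475520\toceans\n' | kb_to_gib
```

By that measure ameriflux is about 63.1 GiB and oceans about 19.5 GiB, a little under the round figures quoted earlier.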

So what’s that 3rd thing, ndp026c?

root@RaPiM2:/Temps/ ls -ld ndp026c*
drwxr-xr-x 10 pi pi 4096 Sep 23 00:18 ndp026c
-rw-r--r--  1 pi pi 4995 Oct  4 10:54 ndp026c.1
root@RaPiM2:/Temps/ cat ndp026c.1

[I'm stripping out the HTML tags from this web page doc so WordPress doesn't go batshit crazy on them... -E.M.Smith]


Index of /ftp/ndp026


NOTE: Land data have now been updated through 2009.
Please see the updated documentation file NDP-026C_EECRA_Update_1997-2009.pdf.

The original documentation file for NDP-026C: Extended Edited Synoptic Cloud Reports 
from Ships and Land Stations Over the Globe, 1952-1996, is ndp026c.pdf.
The file ndp026c.txt is an ASCII version of
the pdf file that contains most text documentation, but lacks tables and figures.
The file ndp026c_readme.txt is an ASCII file that contains a basic overview
of the database and a few key tables related to the contents of the data files.

The monthly cloud data files for land and ocean are contained in the "land" and
"ship" subdirectories, respectively.  The period of record contained in each
subdirectory is apparent from the subdirectory name, e.g., the subdirectory
land_197101_197404 contains all land monthly data files from January 1971 through
April of 1974.

August 21, 2012


So that’s an example of how you can dredge through this kind of stuff with ‘grep’ to search for stuff or just ‘cat’ to see what’s in a text type file. How do I know it’s text? The ‘file’ command is your friend:

root@RaPiM2:/Temps/ file ndp026c.1
ndp026c.1: HTML document, ASCII text

So we now know that ndp026c is a load of “Extended Edited Synoptic Cloud Reports”… looking for a use…

Here’s a sample search using “USHCN” as the search string.

root@RaPiM2:/Temps/ grep USHCN *
grep: ale_gage_Agage: Is a directory
grep: ameriflux: Is a directory
grep: Atul_Jain_etal_Land_Use_Fluxes: Is a directory
grep: bibliography: Is a directory
grep: cdiac115: Is a directory
grep: cdiac129: Is a directory

The ‘grep’ command is ‘global regular expression print’. It takes all sorts of regular expressions, like ^Pr, which says “starting at the start of the line (that’s the ^), look for ‘Pr’, then print the line out”. Given just plain text, it looks for that text. So here we said ‘look for USHCN and print those lines’. Directories are not files, so it complains about each of them. (There are ways to avoid that, but they are beyond this intro level, so I’ll skip the complaints in what I paste in.)

ndp042.1:These files comprise a very early version of the USHCN data database
ndp042.2:These files comprise a very early version of the USHCN data database
ndp042.3:These files comprise a very early version of the USHCN data database
ndp042.4:These files comprise a very early version of the USHCN data database
ndp042.5:These files comprise a very early version of the USHCN data database
ndp070.1:These files comprise CDIAC's version of USHCN daily data through 2005.
ndp070.2:These files comprise CDIAC's version of USHCN daily data through 2005.
ndp070.3:These files comprise CDIAC's version of USHCN daily data through 2005.
ndp070.4:These files comprise CDIAC's version of USHCN daily data through 2005.
ndp070.5:These files comprise CDIAC's version of USHCN daily data through 2005.
ushcn_daily.1:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.10:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.12:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.13:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.14:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.15:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.16:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.17:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.18:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.19:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.2:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.20:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.3:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.4:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.5:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.6:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.7:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.8:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.9:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_v2.5_monthly.1:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.1:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.10:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.10:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.11:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.11:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.12:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.12:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.2:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.2:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.3:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.3:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.4:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.4:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.5:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.5:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.6:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.6:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.7:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.7:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.8:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.8:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.9:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.9:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
grep: ushcn_v2_monthly: Is a directory

So there’s an example of how you can rapidly find some interesting files to look through for USHCN. Unfortunately, a similar search on GHCN yielded nothing of interest. (But I already have v1, v2, and v3 data versions…)
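For those who want to go one step past the intro level, here is one way to avoid the “Is a directory” complaints seen in the search above. It is a sketch, not the only way: search_files is a name invented for this example, and it simply tests each name with [ -f ] before handing it to grep. With GNU grep, adding ‘-d skip’ to the original command should have a similar effect.

```shell
# List the ordinary files (not directories) in a directory that
# contain a given pattern.
#   $1 = pattern to look for
#   $2 = directory to search (defaults to the current one)
search_files() {
    pattern=$1
    dir=${2:-.}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue        # skip directories and anything odd
        grep -l -- "$pattern" "$f"     # -l prints just the file name on a match
    done
    return 0
}

# Example: which files in the current directory mention USHCN?
search_files USHCN .
```

Printing only the file names (-l) also keeps the output short when many near-identical copies of a page match.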

In Conclusion

It is quite reasonable for anyone with a modest internet connection and $120 for a Raspberry Pi and hard disk (with powered hub) to set up a scraper to gather very large blocks of data and save them, even if The Officious Agencies don’t…

I find it rather amazing that things like the historical GHCN v1 and V2 can be tossed in the trash, yet vast swathes of disk space are used for “Ameriflux”. At least 32 x larger. In all their $Billions of US Taxpayer Dollars, they can’t find just $60 for a TB disk at Best Buy to save it? Sigh.

On my “Someday” list is now to put up a temperature ftp server with old and archived copies of the data set so that ‘change over time’ can be analyzed.

Next step is to list the other temperature agencies and where they have data available for gathering and preserving. I’ll start by listing GECOS. If anyone else has a link worth a look, put up a comment, please. Then we can sort out priorities and “who does what”.

About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present "hot buttons" are the mythology of Climate Change and ancient metrology; but things change...
This entry was posted in AGW Science and Background, Earth Sciences, NCDC - GHCN Issues.

13 Responses to CDIAC Scrape Finished

  1. Excellent work. As you have been in the vanguard of unearthing, the changes to temperature data within versions are significant as well. Some sort of directory or file naming conventions to establish a date stamp, and occasional rescrapes of that data, will prove interesting.

    Your use of a modest rate is prudent, I think. NASA was quick to block Steve McIntyre years ago when he was attempting a data scrape; they marked his IP as PNG and pronounced him a hacker. Hopefully, your gentle retrieves of data will not elicit similar attention.

    Good show!

    ===|==============/ Keith DeHavelle

  2. Paul Hanlon says:

    It might be an idea to use Git to help sort out the different versions. Basically store the new file over the old, and then do a git add. If there are changes, Git will store them as differences, minimising the amount of space used, and also showing at a glance what changed. I’m sure there are other ways, but this is what Git was optimised to do.

  3. E.M.Smith says:


    Well, one hopes they have learned that “Hide The Data” is a losing game for them once the press and congress start asking questions. Besides, I already have the parts that matter… it’s mostly other “products” to complete the set that are in process (and for NOAA that set is duplicated elsewhere…) so mostly they would just be freeing me up to do other more productive things… probably not what they would want ;-)

    Also, since I’m doing this from a “number du jour”, blocking by IP would be particularly dumb. Aside from my tendency to wander to different places (and the Raspberry Pi has already been made portable via the Dongle Pi posting) and since my home IP resets on a powerfail / restart (or at worst after a DHCP timeout) that’s not going to be very useful either (it would eventually just hit whoever got that last IP after the most recent powerfail / restart of the neighborhood…)

    And not to mention next on my list of networking ToDo’s is to get set up with a VPN to a remote place (preferably with some kind of ‘cloud’ storage available) and make the whole thing more geography unlinked… well, they could speed up my progress on that…

    Finally, part of why I’m posting about this is so that others can do it too. There is beauty in parallel processing and freedom in independent processors with a common cause… at all levels. (From markets to militia in the original Minute Men form to private businesses to independent researchers to… ) so at most there would be a momentary inconvenience to me and a bit more discussion with friends… (There are at least 1/2 dozen folks around me who would be happy to let me leech off their links and they are on at least 4 different providers… then there are the two mobile hot spots I own, though the data cost would be higher… I’d rather just park outside of a local Starbucks and let it run from the car while I have a nice long Grande Mocha inside… With my parabolic antenna I can be about 1/8 mile away and still get good speed. (It is only 3 inch, so a 1 foot would do even better…))

    FWIW, there are settings for wget that let it masquerade as a person. You can set browser type, random pause between gets (that wait of 10 can be randomized around whatever value you set; wget varies it between 0.5 and 1.5 times the wait) and more. A whole lot of folks have gone ahead of me on the whole “block ME will you!” response methods. So IFF somehow they were daft enough to block the IP, I’d just drop the kit in the portable case, cross the street to the friend’s house, plug it in and set it to pretend to be a Windoze Browser with random wait of a minute between gets, at about a 100 kB rate, and on an ‘at’ command that only launched it when folks were not using their net (i.e. between about midnight and 7 am). Segment the wget into smaller subdirectories too, so each one runs, say, one day of the week. Set up 2 or 3 of those at different places and I’m getting faster download than now. What’s not to like? 8-)

    (Yes, this isn’t my first Rodeo… )

    But still, it is a good idea to be polite about network use in any case. Most sys admins are appreciative of the effort and if they see a polite scrape on public data just figure it isn’t a problem. So I’ve tried to always keep the aggregate below 200 kB / second and most of the time no more than 150 kB/sec to any one site (the exception being the ‘polish’ at the end where I let it catch up the skipped daily updates, since much of the traffic is just ‘not retrieving’ anyway… plus it’s nice if all the ‘daily updates’ end up being from the same day, so better if it doesn’t span several…).
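
    As a rough sketch of that kind of polite scrape (not the exact command used here): the function name polite_mirror, the DRY_RUN switch and the default numbers are assumptions for illustration, while --mirror, --no-parent, --limit-rate, --wait and --random-wait are ordinary wget options.

```shell
#!/bin/sh
# polite_mirror: hypothetical wrapper around wget for a rate-limited mirror.
#   $1 = URL of the public directory to mirror
#   $2 = per-site rate cap (default 150k, i.e. about 150 kB/second)
#   $3 = base pause in seconds between requests (default 10)
# Set DRY_RUN=1 to print the command instead of running it.
polite_mirror() {
    url=$1
    rate=${2:-150k}
    pause=${3:-10}
    # Build the command as the positional parameters so it can be
    # either printed or executed without repeating it.
    set -- wget --mirror --no-parent \
        --limit-rate="$rate" --wait="$pause" --random-wait "$url"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}
```

    Running it with DRY_RUN=1 first shows the command so it can be checked before anything is fetched. Dropping --no-parent would let it wander up the parent links, as the first cut of this scrape did.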


    I thought about it, but I’m still learning Git. It is intended for distributed projects, so more suited to a group effort (once / if one forms). For use on a single machine, the older ‘one machine only’ Source Code Control Systems would be easier to use.

    But for now, my first Big Step is just to get a single Golden Master for each of GHCN V1, V2, and V3 (and maybe V3.5) sorted out of my big ball of accumulated dross… along with vetting my copies of the Dailies. I think I’ve got about 4 from over about as many years, but need to do a better categorizing of them. Then the same with the USHCN that I think comes in V1, V2, and V2.5 all of which I ought to have along with some uncharacterized dailies… I’ll worry about incremental update / variation “going forward” on more of a quarterly basis; and that means I have until about Jan 1, 2016 to build something for that ;-)

    I’m also playing with SQL and I’ve got a basic data load working. So one thought I’m kicking around is to make a unique station / instrument identifier / Version key and store every single temperature from all the data sets in the database. Then reporting changes becomes pretty easy… But that needs some database design… and the input datasets categorized and vetted (see prior paragraph…)

    Tossing Git in the middle of all that is not high on my list of Oh Boy! moments… though I might end up doing it anyway…

    Mostly I’m still in the “what have I got?” and sort it out stage for now…

  4. Gary says:

    “I find it rather amazing that things like the historical GHCN v1 and V2 can be tossed in the trash, yet vast swathes of disk space are used for “Ameriflux”.”

    NASA lost the original Apollo 11 moon tapes so nothing surprises me about this. The hand having written moves on…

  5. Paul Hanlon says:

    Hi ChiefIO,
    Thanks for the reply. As you say, it’s for “down the road”. With regard to the SQL, I’d say definitely get the metadata, i.e. individual station info, that sort of thing. But for the “raw” data, the actual measurements, you are probably better off parsing them into a CSV format that you are comfortable with. It’s a good deal faster, simpler for something like R to parse, and a lot more efficient storage-wise.

    With regard to the downloading of data, would the ICOADS (the sea temperature stuff) be useful? I could have that running in the background here and maybe expose the file on cubie for downloading remotely. There’s also the GSOD dataset, and of course KNMI. It might be easier doing it this way than asking you to do even more work describing what you want and how to get it.

  6. E.M.Smith says:

    Well, the “catch up” synchronization finished:

    pi@dnsTorrent /Temps $ tail -f CDIAC_wget_log 
    Length: 1645 (1.6K) [text/html]
    Saving to: `;O=D'
         0K .                                                     100%  713K=0.002s
    2015-10-07 14:58:57 (713 KB/s) - `;O=D' saved [1645/1645]
    FINISHED --2015-10-07 14:58:57--
    Total wall clock time: 1d 5h 29m 59s
    Downloaded: 61662 files, 10.0G in 17h 0m 16s (171 KB/s)

    A day and a quarter to catch up the changes from start of collection, and that was about 8% of the dataset at 10 GB. I’m now going to restart the NOAA scrape and let it finish that last directory unimpeded by competition for the wire or the disk.

    Well, maybe a little competition for the disk… I’m going to launch a mksquashfs against the CDIAC data and see if it gets any smaller… but the Temps file system will be NFS mounted onto the R.PiM2 (as the 4 cores matters a lot to compression) and with output to a disk local to the M2, so I don’t expect the NFS reads to dominate anything…

    With this, the CDIAC collection is done. And time to archive it. At about the end of the year I’ll look at doing another “catch up” and see what changes. As a log file is spit out of the process, I can get a clue about changes without needing to get fancy…

    pi@dnsTorrent /Temps $ grep ^Saving CDIAC_wget_log | wc -l

    Here you can see how using ‘grep’ to look for lines starting with “Saving” finds the same number of lines of text as the final command says were updated. If one leaves off the “| wc -l” that does the line counting, it would print out the lines instead.
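
    A small sketch of the same count as a reusable function. The name count_saved is made up for this example; grep -c is standard and reports the number of matching lines itself, so no wc is needed.

```shell
#!/bin/sh
# count_saved: count the "Saving to:" lines in a wget log.
#   $1 = path to the log file
# grep -c prints how many lines matched, the same figure that
# "grep ^Saving LOG | wc -l" gives.
count_saved() {
    grep -c '^Saving' "$1"
}
```

    Called as count_saved CDIAC_wget_log, it should print the same number as the pipeline above.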

    You can also stack up the ‘grep’ commands connected with pipes:

    pi@dnsTorrent /Temps $ grep ^Saving CDIAC_wget_log | grep ameriflux | wc -l

    Tells me 24,510 of the “Saved” files are “ameriflux” while

    pi@dnsTorrent /Temps $ grep ^Saving CDIAC_wget_log | grep ocean | wc -l

    Almost 30,000 of them are ocean data. That’s 53,000+ right there. That’s most of the changed data files. (How did I know to pick those two to search on? That prior ‘find the ^Saving lines and list them’ grep was run to the screen and I just watched stuff fly by. What was there a LOT was easy to read in the blur…)

    How to do this searching in a bit more compact way? Well, we can ‘invert’ the grep and only keep what is NOT the search key…

    grep ^Saving CDIAC_wget_log > 
    pi@dnsTorrent /Temps $ grep -v ameriflux | grep -v oceans > EMS.log.OUT.noAmerOceans
    pi@dnsTorrent /Temps $ wc -l EMS.log.OUT.noAmerOceans 
    7591 EMS.log.OUT.noAmerOceans

    Yes, these can also all be stacked up with pipe symbols “|” but catching the intermediate results in a file lets you rummage around a bit more efficiently… And now I have a file with 7591 lines in it that are the non-Ameriflux non-Oceans files that were saved. Repeat the process until you find something interesting….
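
    The two-step version with an intermediate file can be sketched like this. The function name and the .saved / .rest file suffixes are hypothetical, not the file names used above, and the pattern “ocean” is the one from the earlier count.

```shell
#!/bin/sh
# split_saved: keep the intermediate result in a file so it can be reused.
#   $1 = wget log file
#   $2 = prefix for the two output files (hypothetical naming)
split_saved() {
    log=$1
    prefix=$2
    # Step 1: every "Saving to:" line from the log.
    grep '^Saving' "$log" > "$prefix.saved"
    # Step 2: -v inverts the match, so only lines that do NOT mention
    # ameriflux or ocean survive.
    grep -v ameriflux "$prefix.saved" | grep -v ocean > "$prefix.rest"
    # Report how many lines are left to look through.
    wc -l < "$prefix.rest"
}
```

    Because the first file is kept, a different pair of -v patterns can be tried against it later without reading the whole log again.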

    (Beginning to see why I love *NIX systems? One scrape, and now with just a few lines of typing I can do all sorts of interesting “Digging” Here! ;-)

    At 7k lines, I’m OK with just sucking it into an editor and poking around. Here’s one “no surprise”:

    Saving to: `'
    Saving to: `;O=D'
    Saving to: `;O=A'
    Saving to: `;O=A'
    Saving to: `;O=A'
    Saving to: `'
    Saving to: `;O=A'
    Saving to: `;O=D'
    Saving to: `;O=D'
    Saving to: `;O=D'
    Saving to: `;O=D'

    The USHCN Daily data changed…. I’d expect that over a few days.

    I also spanned a month end, so this change in “monthlies” is not a surprise either:

    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'

    There are also a lot of lines like the ones below, with “index” in them, that appear to show a change. These are spurious: “wget” fetches some listings of what’s in a directory fresh each time, regardless of changes to the actual files IN the directory. They are the things copied over to decide what to actually copy over. I could remove all those lines with one more “grep -v”… (actual text left as an exercise… )

    Saving to: `;O=D'
    Saving to: `;O=A'
    Saving to: `;O=A'
    Saving to: `;O=A'
    Saving to: `;O=D'

    When you do that, you are down to almost 2 k lines:

    pi@dnsTorrent /Temps $ grep -v index.html EMS.log.OUT.noAmerOceans >EMS.Short_Log
    pi@dnsTorrent /Temps $ wc -l EMS.Short_Log 
    1822 EMS.Short_Log
    pi@dnsTorrent /Temps $ 

    That file tends to be MUCH more interesting and a whole lot easier to read. Here’s the top of it:

    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'

    If you want the whole 1800 lines of it, well, you now know how to do it yourself ;-)

    (Or for serious enquirers I can send a copy to folks. Or even post the whole thing if enough folks want it – with “enough” being about 3 ;-)

    I’m especially interested in what changed in that last line:

    Maybe someone updated their CV?…

    Further down I found:

    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'

    Hansen? I thought he was gone already?… and there is some more playing with USHCN…

    Looks like there’s a workshop going on with NOAA:

    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'
    Saving to: `'

    Don’t know what SOCCR is, but somebody is up to something again…

    Hope this lets folks “get ideas” about the utility of such a system of archival, not just for capturing a static copy of the data, but also for picking out “what changed” and finding bits of interest you might not otherwise notice. Just remember that even if it is a bit ugly, and not a lot of fun at parties, *NIX is your friend and “a grep is a terrible thing to waste” ;-)

    FWIW, I have the ‘wget’ command in an executable file named “synccdiac” and this ‘wrapper script’ goes around it to make the log file and such:

    pi@dnsTorrent /Temps $ bcat synccdiacb
    nohup synccdiac > /Temps/CDIAC_wget_log& 

    I have it named ‘synccdiacb’ (the trailing ‘b’ meaning ‘background task’) so I only need to type:

    synccdiacb

    at the command prompt and it all launches. Yes, you could type “nohup synccdiac…” long hand each time, but why bother? The trailing ‘&’ puts the synccdiac command in the background and the leading ‘nohup’ says “keep it running even if I log out and hangup the line”.
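
    The same wrapper idea can be sketched as a general helper. The name run_in_background and the BG_PID variable are assumptions for illustration. The sketch also adds 2>&1, which the one-liner above does not have: wget normally writes its messages to standard error, so unless the inner script already redirects them, a plain “>” would leave the log empty.

```shell
#!/bin/sh
# run_in_background: hypothetical generalisation of the wrapper script.
#   $1 = log file, remaining arguments = the command to run
run_in_background() {
    log=$1
    shift
    # nohup keeps the job alive after logout; '&' puts it in the background;
    # 2>&1 sends error output to the same log as normal output.
    nohup "$@" > "$log" 2>&1 &
    # Remember the process id so the caller can check on or wait for the job.
    BG_PID=$!
}
```

    For example, run_in_background /Temps/CDIAC_wget_log synccdiac would do what the wrapper does, and tail -f on the log shows progress.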

    So can you guess what I need to do to restart the NOAA run?…


    Aren’t threaded interpretive languages fun? (AND you don’t have to type so much… Yes, the ‘learning curve’ is steep, but man does the payoff get huge fast… and stays that way the rest of your life.)

  7. E.M.Smith says:

    @Paul Hanlon:

    I see this as a Fahrenheit 451 kind of thing… If it has to do with Temperatures, Global Warming, and any other thing that might have the data fudged to achieve The Agenda (21) then someone ought to save it. Each of us can “be a book” as our own interests drive us.

    For me, I’m planning on getting:

    Cru / Hadley
    Antarctic data sets
    Anything Arctic
    Whatever local BOMs have data available. Australia? New Zealand? Canada? Iceland? etc.

    I’ve already got the first one done. The second one is about 90% I’d guess. Between them I’ve got GHCN v1, v2, and v3 along with a couple of USHCN variations. As GIStemp is just a screw over of that data and I’ve got GIStemp code running here, I don’t see any need to archive their stuff, but it might make sense for showing how what they present as ‘fact’ keeps changing…

    I’ve also got a couple of random grabs of some Hadley temperature data in some bucket somewhere, but have not got anything newer than a couple of versions back…

    So pick something you like and “go for it”. I lay claim to NOAA at the moment, everything else is unclaimed. Also feel free to put up any other dataset idea you have. I have no monopoly on ideas here…

    If, someday, I need to merge in a few TB of data, well, I can find a fast pipe to borrow for the weekend ;-)

    Oh, and since I might get hit by lightning some day, it doesn’t hurt if someone duplicates a copy that I’ve already got. More is better than not enough.

  8. E.M.Smith says:

    Well, I decided to try making a squashfs file out of the old copy first (since the alternative is to just throw it away, and since any mistake on the new copy would be A Very Bad Thing…) and that mksquashfs run has sucked up all 4 cores of the R.PiM2 for about 12 hours now, and it is 71% finished… So figure about 18 hours all told. Maybe it’s worth it ;-)

    Compression looks to be running about 50%, so a decent space savings (est. about 70 GB instead of about 120 GB). It is CPU limited, not disk limited, so using a journalling file system is just fine (i.e. any file system that supports files of many GB will do).

    FWIW, I have a LOT of stuff that’s in compressed archives. Various machine backups, archives of old projects, what have you. They have historically been kept as gzip-ed tar archives (compressed Tape ARchive format files). As you can mount a squashfs file system and wander around in it without decompressing and unpacking the whole thing, and that’s more convenient than a gzip tar archive, I’m going to oh so slowly unpack things, toss out the trash, make a big backups file tree of it, and then make squashfs file systems out of chunks. It uses the same compression methods, so just as compact, but a whole lot more accessible for “I want a copy of that old 1 KB file in that 20 GB archive”…

    As I’ve become comfortable with using squashfs file systems in loop mounted files, it has become clear that while the up front CPU load of compression hits you, the downstream “just mount it and get what you want out” along with the ability to append is a very nice combination.

    Not a big priority, just while sitting pondering over morning coffee, just the sort of thing you can launch into the background to keep the equipment busy doing something of value…

  9. p.g.sharrow says:

    @EMSmith; Pondering is good, I do it a lot ;-), Consideration of others that follow and might utilize your labors takes time but means your time and effort is an investment in the future rather than just for personal satisfaction. I am impressed with your progress in this creation and follow every post and comment…pg

  10. E.M.Smith says:


    Well, thank you kindly. My hope is that I can “lead by example” and both save a lot of other folks a lot of time and also perhaps inspire a few to “Dig Here!” on doing their own data archival / comparisons. I’ve tried to make it pretty easy and low cost to do, and I think I’ve done that. 8-}

    The “mksquashfs” on the older copy has completed, so I’ve now got ‘stats’ on it. I made a ‘one line script’ to do it that I called “squish” 8-) it just does the mksquashfs {first directory} {second directory} and with an optional change of blocksize. Similarly, I have one called “sqit” with the default block size override to 64 KB. The use of /tmp as the target by default is just to prevent it from doing any Bad Thing if launched without an argument for from or to directories… (Paranoid Programming? Nope… just experience … and it’s called “defensive programming” ;-)

    Here’s “sqit” that does an ‘in place’ mksquashfs with only one argument:

    root@RaPiM2:/TempsArc# bcat sqit
    mksquashfs ${1-/tmp} ${1-/tmp}.sqsh -b 65536

    So if I have a directory named Foo I can just type “sqit Foo” and it runs off to make Foo.sqsh in that same location but with a block size half the default.

    The one line command “squish” is similar, but keeps the default larger block size and lets you direct the output to a different location; useful for me as the data can come from an NFS mounted copy of the Temps data and be sent to a different local real disk…

    root@RaPiM2:/TempsArc# bcat squish
    mksquashfs ${1-/tmp} ${2-/tmp}.sqsh $3 $4

    Again, the $1 is the first directory argument and $2 is the second one. The use of the curly braces and the dash just cause /tmp to be the default value if none is given and thus prevent A Bad Thing sometimes… given that /tmp is disposable.
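
    A slightly more defensive variant, sketched as a shell function rather than the one-line script shown above. The SQUISH_ECHO dry-run switch is an assumption added for illustration. The other changes are the “:-” form, which also falls back to /tmp when an argument is given but empty (the plain “-” form only covers a missing argument), and quoting, so directory names with spaces survive.

```shell
#!/bin/sh
# squish: hypothetical function version of the one-line script.
#   $1 = source directory (default /tmp)
#   $2 = target name, .sqsh is appended (default /tmp)
#   any further arguments are passed straight through to mksquashfs
squish() {
    src=${1:-/tmp}
    dst=${2:-/tmp}.sqsh
    # Drop the two directory arguments (or however many were given).
    [ "$#" -ge 2 ] && shift 2 || shift "$#"
    # With SQUISH_ECHO=echo the command is only printed, not run.
    ${SQUISH_ECHO:-} mksquashfs "$src" "$dst" "$@"
}
```

    With SQUISH_ECHO=echo set, squish Foo Bar -b 65536 prints the mksquashfs command for Foo and Bar.sqsh without starting a multi-hour compression run.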

    So I said to ‘squish’ the data from my /TempsArc temperature archive disk old copy of cdiac and put the result into a file named CDIAC.sqsh in the same location (note that $2… above has a .sqsh appended to the file name automagically… so I don’t have to keep typing it and so that the output is very unlikely to overwrite any real data by accidentally forgetting to type the .sqsh…)

    root@RaPiM2:/TempsArc# squish CDIAC
    Parallel mksquashfs: Using 4 processors
    Creating 4.0 filesystem on CDIAC.sqsh, block size 131072.
    [==============================================================================================\] 1144843/1144843 100%
    Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072
    	compressed data, compressed metadata, compressed fragments, compressed xattrs
    	duplicates are removed
    Filesystem size 70295898.56 Kbytes (68648.34 Mbytes)
    	54.39% of uncompressed filesystem size (129236917.59 Kbytes)
    Inode table size 4009811 bytes (3915.83 Kbytes)
    	35.84% of uncompressed inode table size (11187422 bytes)
    Directory table size 1436877 bytes (1403.20 Kbytes)
    	29.43% of uncompressed directory table size (4882099 bytes)
    Number of duplicate files found 25418
    Number of inodes 192127
    Number of files 187607
    Number of fragments 37568
    Number of symbolic links  0
    Number of device nodes 0
    Number of fifo nodes 0
    Number of socket nodes 0
    Number of directories 4520
    Number of ids (unique uids + gids) 1
    Number of uids 1
    	pi (1000)
    Number of gids 1
    	pi (1000)

    At 70.3 GB, my earlier guess of “est. about 70 GB” was pretty darned good! (You might guess I’ve spent far too much of my life waiting for compress / decompress cycles… thus the origin of the “Chiefio” tag as “Chief of I/O”… back when doing a lot of Systems Admin work… long before moving up to be Director of I.T. (and since I reported to the V.P. Business Affairs, i.e. head lawyer, I was the top I.T. guy of the company. Also had Facilities and a few other bits…)) But I digress.

    There are other interesting stats, like 187,607 files of which 25,418 are duplicates (wonder where they are…) and a 54.4% compressed size from the 129 GB original size.
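
    One way to go looking for those duplicates, sketched under assumptions: the function name is hypothetical, and it groups files by the CRC and byte count that the POSIX cksum command prints. A matching CRC and size is only a strong hint, so treat the output as candidates and confirm a pair with cmp before deleting anything.

```shell
#!/bin/sh
# list_duplicate_candidates: print cksum lines for files that share
# the same CRC and size with at least one other file.
#   $1 = directory to scan (default: current directory)
list_duplicate_candidates() {
    find "${1:-.}" -type f -exec cksum {} + |
        awk '{
                 key = $1 " " $2              # CRC plus size in bytes
                 count[key]++
                 lines[key] = lines[key] $0 "\n"
             }
             END {
                 # Only groups with more than one member are of interest.
                 for (key in count)
                     if (count[key] > 1)
                         printf "%s", lines[key]
             }'
}
```

    On a tree of this size it reads every file once, so it is best launched in the background like the other long jobs.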

    Guess now I need to mount it and take it for a test drive. See if everything is ‘as expected’ and then contemplate tossing out the original uncompressed archive… Or just flag it as ‘disposable’ and leave it sit until I need the disk space. I’ll often make a directory “copied_to_foo” and put things in it. That way I have a de facto backup copy, but can just delete it any time I actually need the disk space. A very old habit that has been helpful too many times ;-)

    FWIW, as of right now, I have 3 TB of disks that are full of such “duplicates” as I’ve weeded down to about 2 TB of what looks like mostly single copies of what I need to keep. (I think I can get that down by another 500 MB without too much risk). Once that whole ‘weed and shrink’ is done, then I’m looking at making a squashfs version of it onto one of those 3 TB of “backups”… and empty the rest… It’s amazing how much stuff can accumulate over 30 years … especially when you tend to keep the old copies around in ‘delete me’ directories but then just grab the whole machine archive set when something goes flaky and then THAT ends up being duplicated a few times… Maybe I can get it down to under 1 TB of actual stuff … 8-} (Like, do I REALLY need that canonical collection of all the Debian 6.0 releases and variations by processor type? And just what will I be installing Red Hat 6 and 7 onto anyway?… )

    Well, time to check that the “bread has done rizz” and turn it to bake, then come back and test drive the squashfs file system, then a cup of tea with fresh bread as I figure out what’s next ;-)

  11. E.M.Smith says:

    Well, the bread has risen, baked, and first warm slices ‘down the hatch’… yum! Fresh real butter at room temperature soaking into warm fresh from the oven bread… hard to beat! It was an “artisan” bread, meaning crusty with coarser texture… but very nice. ( I’m playing around with a ‘no knead’ recipe… posting up in a few hours after ’round two’ is done… )

    So the first thing I “figured out what’s next” is just that the mounted squashfs file system is really nice and surprisingly fast and all (what the decompression taketh away, the far lower number of seeks and block reads on the disk seems to giveth back…); however: As it is mounted read-only any compressed thing you left IN that re-compressed image is a PITA to look inside…

    ls ushcn_v2_monthly/
    9641C_201112_F52.avg.gz  9641C_201112_tob.avg.gz  index.html?C=M;O=A
    9641C_201112_F52.max.gz  9641C_201112_tob.max.gz  index.html?C=M;O=D
    9641C_201112_F52.min.gz  9641C_201112_tob.min.gz  index.html?C=N;O=A
    9641C_201112_F52.pcp.gz  9641C_err_F52.max.gz	  index.html?C=N;O=D
    9641C_201112_raw.avg.gz  9641C_err_F52.min.gz	  index.html?C=S;O=A
    9641C_201112_raw.max.gz  index.html		  index.html?C=S;O=D
    9641C_201112_raw.min.gz  index.html?C=D;O=A	  readme.txt
    9641C_201112_raw.pcp.gz  index.html?C=D;O=D	  ushcn-stations.txt

    So this is an ‘ls’ listing of the USHCN v2 monthly data still hanging out in that blob. Notice that all of the data files are “.gz” file type? Normally you just “gunzip foo.gz” and root around in the extract. Except that this is a RO file system… so you end up in the “move somewhere else to unpack and inspect” game… which kind of defeats the whole reason for the squashfs file system in the first place…
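
    A partial workaround for just looking inside one file: gunzip -c (the same thing zcat does) writes the decompressed data to standard output and leaves the file alone, so it works on a read-only mount. Unpacking a whole tree still needs writable space somewhere else. The helper name peek_gz is made up for this sketch.

```shell
#!/bin/sh
# peek_gz: show the first lines of a gzip-ed file without writing anything.
#   $1 = path to a .gz file
#   $2 = number of lines to show (default 20)
peek_gz() {
    # -c sends the decompressed data to stdout; the .gz file is untouched.
    gunzip -c "$1" | head -n "${2:-20}"
}
```

    The output can also be piped into grep or less instead of head when hunting for one station or one year.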

    Other than that, things have been fine so far.

    So “lesson learned” is to just go through any giant blob you are going to turn into a squashfs file system and unzip, gunzip, uncompress, etc etc etc all the wads inside it until it is ALL unpacked and unzipped and uncompressed. THEN make your squashfs file system file (that will recompress it all but in a way that lets you look inside the bits).

    This can take some fair amount of effort as you need to check each compressed wad before you uncompress it to assure it does something sane (like puts things in an appropriately named subdirectory) instead of something stupid (like extracting a bunch of files and such in place where you will be unpacking a dozen other such compressed wads all of them expecting to write out a README file …and only one survives).
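
    For gzip-ed tar archives, that check can be done without extracting anything, since tar -t only lists the contents. A minimal sketch, with a made-up function name: one top-level name in the output means the archive unpacks into its own subdirectory, several names mean it will spray files into wherever you run it.

```shell
#!/bin/sh
# archive_top_levels: print the distinct top-level names inside a
# gzip-ed tar archive, without unpacking it.
#   $1 = path to the .tar.gz / .tgz file
archive_top_levels() {
    tar -tzf "$1" |
        awk -F/ '{
                     # Paths may be stored as "./name/..." or "name/...".
                     top = ($1 == "." ? $2 : $1)
                     if (top != "") seen[top] = 1
                 }
                 END { for (top in seen) print top }'
}
```

    If more than one name comes back, making a fresh, suitably named directory and extracting there with tar -C keeps the README files from landing on top of each other.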

    Also, you get to FIND all those compressed files. Yes, you can use the ‘find’ command (that has more options than a mangy dog has fleas) but that can be a pain for the uninitiated to do as find is a bear to get right the first few times and “A find is a terrible thing to waste!” (Groan… 8-} )
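
    For the uninitiated, a starting point for that ‘find’ might look like the sketch below. The function name and the list of suffixes are assumptions; extend the list to whatever compressed formats are actually hiding in your archive.

```shell
#!/bin/sh
# find_compressed: list files whose names look like compressed wads.
#   $1 = directory to search (default: current directory)
find_compressed() {
    # The escaped parentheses group the -name tests so that -type f
    # applies to all of them; -o means "or".
    find "${1:-.}" -type f \( -name '*.gz' -o -name '*.tgz' \
        -o -name '*.bz2' -o -name '*.zip' -o -name '*.Z' \) -print
}
```

    Piping the result through wc -l gives a quick idea of how much unpacking work is waiting before the squashfs gets built.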

    At any rate, just realize that it is very easy to make the squashfs, it works as a live file system very very fast and efficiently; but any compressed wads inside of it become a bit of a pill to deal with and you end up back in the ‘copy and extract’ game.

    Oh, and the tea was a nice Ceylon Earl Grey loose tea from the local Middle East Persian market… very nice… They also have Baltic area sprats with Cyrillic writing on the tin that are wonderfully and deeply smoked. Latvian or some such… but again I digress…

    OK, with that, I think it’s time for me to move on to making some new posting or two. The scrapes are well detailed, the sizes and issues characterized, the how to store explored, and the best way to do the squashfs (i.e. after unzipping and uncompressing) discovered. From here on out it is more grunt grinding it out than exploration, so I’ll be saying less about it. Mostly just updates as some wad gets done or IFF I run into some kind of ‘gotcha’ issues.

  12. Paul Hanlon says:

    Excellent ChiefIO,

    Okay, I’ll get downloading GSOD and ICOADS. ICOADS is what is used for HadSST, so that might be interesting. I’d also like to explore the KNMI data archive (Bob Tisdale seems to swear by it, and that’s good enough for me:-)). Also, I think the DMI dataset is stored there, and that would be very interesting as I think it is the only one with *actual* temp measurements from the buoys in the Arctic. It will take a few days between that and putting up a page with links to them on the web, but once I have it done I’ll post up a link here.

  13. Pingback: CDIAC, Compression, Squashfs, And Oddities | Musings from the Chiefio
