I’ve been shoveling TB of data around to make more room for things. Along the way, all the temperature scrapes got moved to their own disk. In the last round of tidy-up (the one just done) I discovered that, partway through it, the first-cut scrape of cdiac.ornl.gov had reached completion. I’d originally just pointed it at the USHCN data, but left out the -np “no parent” flag, and it had proceeded to wander all the parent links grabbing all sorts of stuff. So on the 2nd or 3rd restart I decided to just let it, but at a low rate. I set the rate limit to 50 kB/second and just let it crawl.
Here is a snip from the bottom of the script showing some of the variations over time. Lines with a “#” in the first position are commented out, but were run in the past.
#wget --limit-rate=50k -np -m http://cdiac.ornl.gov/ftp/ushcn_daily
wget --limit-rate=200k -m http://cdiac.ornl.gov/ftp/ushcn_daily
#wget -r -N -l inf --no-remove-listing -w 10 --limit-rate=50k -np http://cdiac.ornl.gov/ftp/ushcn_daily
#wget --limit-rate=50k -nc -r -l inf http://cdiac.ornl.gov/ftp/ushcn_daily
#wget -nc -r -l inf http://cdiac.ornl.gov/ftp/ushcn_daily
I’ve upped the speed limit to 200k and, in this final pass, set it to just mirror everything (-m), but starting in the USHCN directory and following parent links for the rest.
I did set it to -nc “no-clobber” in prior runs so that on restarts it did not grab newer versions (which would mean re-fetching a daily update on every restart…) until the whole scrape completed. As of now, I’ve restarted the ‘sync’ run with clobber allowed, so any changed copies of data are being recopied with the newer version. Even though that is running, it will still be fast, as any unchanged files are not re-sent. Just realize that the various sizes will change a little by the end of the day (and even after that, as data sets are updated and added). The sizes will still be a decent guide to ‘how big is what’.
When it is done, I’m going to restart the NOAA scrape that mostly just needs one final large directory to finish, but for now I’ve paused it while the CDIAC scrape does its final tidy-up.
Sizes In Total
Here’s the message from the end of my log file:
2015-10-04 23:36:39 (50.1 KB/s) - `cdiac.ornl.gov/new/co2analy.jpg' saved [94156/94156]

FINISHED --2015-10-04 23:36:39--
Total wall clock time: 5d 12h 44m 42s
Downloaded: 26270 files, 22G in 5d 11h 16m 39s (48.7 KB/s)
root@RaPiM2:/Temps#
So this particular run ended just before midnight and had been running for 5 1/2 days at about 50 kB/second. This part of the run copied 26,270 files for 22 GB of size. But how big is this thing in total, including the prior runs?
root@RaPiM2:/Temps# cd cdiac.ornl.gov/
root@RaPiM2:/Temps/cdiac.ornl.gov# du -ms .
127896  .
Roughly 127.9 GBytes of data and stuff. (As 22 GB took 5 1/2 days, you can figure the whole thing was about a month’s worth at the slow rate, or about a week at full speed for my link.)
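That “about a month” figure is easy to sanity-check with a few lines of shell arithmetic, using the numbers reported above:

```shell
# Back-of-envelope check: 127896 MB total at the 50 kB/s throttle.
total_kb=$((127896 * 1024))   # du -ms reported MB; convert to kB
secs=$((total_kb / 50))       # seconds at 50 kB/s
days=$((secs / 86400))        # seconds per day
echo "about $days days at 50 kB/s"
```

Which comes out right at about 30 days, so the estimate holds up.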
There’s a LOT of stuff in there, and the top level directory is a bit ‘busy’. Lots of small files, and a few directories with big data caches in them. Here’s a listing of the top level:
root@RaPiM2:/Temps/cdiac.ornl.gov# ls -l
total 13072
-rw-r--r--   1 pi pi     1791 Oct  6 09:06 1DU_mb_out
drwxr-xr-x   2 pi pi     4096 Oct  4 22:13 about
-rw-r--r--   1 pi pi    33705 Sep  7 21:23 aerosol_parameters.html
-rw-r--r--   1 pi pi    24773 Sep  7 21:23 aerosol_particle_types.html
-rw-r--r--   1 pi pi    20161 Oct  6 10:19 aerosols.html
drwxr-xr-x   2 pi pi     4096 Oct  4 02:31 authors
drwxr-xr-x   2 pi pi     4096 Sep 14 02:38 backgrnds
drwxr-xr-x   2 pi pi     4096 Sep 10 06:31 by_new
-rw-r--r--   1 pi pi    25412 Oct  6 09:30 carbon_cycle_data.html
-rw-r--r--   1 pi pi    21691 Oct  6 09:30 carbon_cycle.html
-rw-r--r--   1 pi pi    22107 Oct  6 10:19 carbonisotopes.html
drwxr-xr-x   7 pi pi     4096 Sep 23 00:19 carbonmanagement
-rw-r--r--   1 pi pi    22875 Sep  7 21:29 carbonmanagement.1
-rw-r--r--   1 pi pi    22875 Sep 16 06:14 carbonmanagement.10
-rw-r--r--   1 pi pi    22875 Sep 16 11:34 carbonmanagement.11
-rw-r--r--   1 pi pi    22875 Sep 23 13:50 carbonmanagement.12
-rw-r--r--   1 pi pi    22875 Sep 29 11:13 carbonmanagement.13
-rw-r--r--   1 pi pi    22875 Sep  9 05:31 carbonmanagement.2
-rw-r--r--   1 pi pi    22875 Sep  9 09:58 carbonmanagement.3
-rw-r--r--   1 pi pi    22875 Sep  9 14:22 carbonmanagement.4
-rw-r--r--   1 pi pi    22875 Sep  9 17:26 carbonmanagement.5
-rw-r--r--   1 pi pi    22875 Sep 13 09:11 carbonmanagement.6
-rw-r--r--   1 pi pi    22875 Sep 13 10:37 carbonmanagement.7
-rw-r--r--   1 pi pi    22875 Sep 13 13:00 carbonmanagement.8
-rw-r--r--   1 pi pi    22875 Sep 13 17:57 carbonmanagement.9
drwxr-xr-x   3 pi pi     4096 Sep 10 06:37 cdiac
-rw-r--r--   1 pi pi   148374 Aug 19  1998 cdiac_welcome.au
-rw-r--r--   1 pi pi    21774 Oct  6 10:19 cfcs.html
-rw-r--r--   1 pi pi    20263 Oct  6 10:19 chcl3.html
drwxr-xr-x  13 pi pi     4096 Sep 23 00:19 climate
drwxr-xr-x   4 pi pi     4096 Sep 29 11:14 CO2_Emission
-rw-r--r--   1 pi pi     3872 Sep  9 09:57 CO2_Emission.1
-rw-r--r--   1 pi pi     3872 Sep 16 11:32 CO2_Emission.10
-rw-r--r--   1 pi pi     3872 Sep 23 13:49 CO2_Emission.11
-rw-r--r--   1 pi pi     3872 Sep 29 11:12 CO2_Emission.12
-rw-r--r--   1 pi pi     3872 Oct  6 10:18 CO2_Emission.13
-rw-r--r--   1 pi pi     3872 Sep  9 12:55 CO2_Emission.2
-rw-r--r--   1 pi pi     3872 Sep  9 14:21 CO2_Emission.3
-rw-r--r--   1 pi pi     3872 Sep  9 17:26 CO2_Emission.4
-rw-r--r--   1 pi pi     3872 Sep 13 09:05 CO2_Emission.5
-rw-r--r--   1 pi pi     3872 Sep 13 10:36 CO2_Emission.6
-rw-r--r--   1 pi pi     3872 Sep 13 12:59 CO2_Emission.7
-rw-r--r--   1 pi pi     3872 Sep 13 17:56 CO2_Emission.8
-rw-r--r--   1 pi pi     3872 Sep 16 06:13 CO2_Emission.9
-rw-r--r--   1 pi pi     1061 Sep 11 02:42 comments.html
drwxr-xr-x   2 pi pi     4096 Sep 23 00:19 css
-rw-r--r--   1 pi pi   110909 Oct  6 09:30 data_catalog.html
drwxr-xr-x   3 pi pi     4096 Aug 29 05:00 datasets
-rw-r--r--   1 pi pi    21223 Oct  6 09:30 datasubmission.html
-rw-r--r--   1 pi pi    20012 Oct  6 10:19 deuterium.html
-rw-r--r--   1 pi pi     3595 Sep  7 21:19 disclaimers.html
drwxr-xr-x   9 pi pi     4096 Oct  4 10:55 epubs
-rw-r--r--   1 pi pi    24114 Sep  7 21:23 factsdata.html
-rw-r--r--   1 pi pi    72588 Oct  6 09:30 faq.html
-rw-r--r--   1 pi pi    22345 Oct  6 09:30 frequent_data_products.html
drwxr-xr-x 192 pi pi    20480 Oct  6 09:28 ftp
-rw-r--r--   1 pi pi    59493 Oct  4 21:50 ftp.1
drwxr-xr-x   2 pi pi     4096 Sep 11 02:26 ftpdir
drwxr-xr-x   4 pi pi     4096 Sep 23 00:21 GCP
-rw-r--r--   1 pi pi    91264 Sep 10 07:03 glossary.html
-rw-r--r--   1 pi pi    20774 Oct  6 10:19 halons.html
-rw-r--r--   1 pi pi    20755 Oct  6 10:19 hcfc.html
-rw-r--r--   1 pi pi    20219 Oct  6 10:19 hfcs.html
-rw-r--r--   1 pi pi    29825 Sep  7 21:23 home.html
-rw-r--r--   1 pi pi    20543 Oct  6 10:19 hydrogen.html
-rw-r--r--   1 pi pi    27149 Sep  7 21:23 ice_core_no.html
-rw-r--r--   1 pi pi    29935 Sep  7 21:23 ice_cores_aerosols.html
drwxr-xr-x   2 pi pi     4096 Sep 30 23:40 icons
drwxr-xr-x   4 pi pi    12288 Oct  4 22:14 images
drwxr-xr-x   2 pi pi     4096 Aug 29 05:00 includes
-rw-r--r--   1 pi pi    29825 Oct  6 09:29 index.html
drwxr-xr-x   2 pi pi     4096 Sep 23 00:19 js
-rw-r--r--   1 pi pi    23674 Oct  6 09:30 land_use.html
drwxr-xr-x   3 pi pi     4096 Oct  4 21:48 library
-rw-r--r--   1 pi pi    21311 Oct  6 10:19 methane.html
-rw-r--r--   1 pi pi    19970 Oct  6 10:19 methylchloride.html
-rw-r--r--   1 pi pi    20252 Oct  6 10:19 methylchloroform.html
-rw-r--r--   1 pi pi    21918 Oct  6 09:30 mission.html
-rw-r--r--   1 pi pi    39401 Sep  7 21:23 modern_aerosols.html
-rw-r--r--   1 pi pi    37182 Oct  6 10:19 modern_halogens.html
-rw-r--r--   1 pi pi    40592 Sep  7 21:23 modern_no.html
drwxr-xr-x   2 pi pi     4096 Oct  4 22:14 ndps
drwxr-xr-x   2 pi pi     4096 Oct  4 23:36 new
drwxr-xr-x  14 pi pi     4096 Oct  4 02:31 newsletr
-rw-r--r--   1 pi pi    25963 Sep 11 02:27 newsletter.html
-rw-r--r--   1 pi pi    20620 Oct  6 10:19 no.html
drwxr-xr-x  53 pi pi    12288 Oct  4 21:49 oceans
-rw-r--r--   1 pi pi    14609 Sep 10 06:51 oceans.1
-rw-r--r--   1 pi pi    14609 Sep 13 09:18 oceans.2
-rw-r--r--   1 pi pi    14609 Sep 13 10:40 oceans.3
-rw-r--r--   1 pi pi    14609 Sep 13 13:01 oceans.4
-rw-r--r--   1 pi pi    14609 Sep 13 18:00 oceans.5
-rw-r--r--   1 pi pi    14609 Sep 16 06:15 oceans.6
-rw-r--r--   1 pi pi    14609 Sep 16 11:37 oceans.7
-rw-r--r--   1 pi pi    14609 Sep 23 15:13 oceans.8
-rw-r--r--   1 pi pi    14609 Sep 29 11:15 oceans.9
-rw-r--r--   1 pi pi    20501 Oct  6 10:19 oxygenisotopes.html
-rw-r--r--   1 pi pi    19963 Oct  6 10:19 ozone.html
-rw-r--r--   1 pi pi    20328 Oct  6 09:30 permission.html
drwxr-xr-x   2 pi pi     4096 Oct  4 02:32 pns
drwxr-xr-x   6 pi pi     4096 Oct  4 21:49 programs
-rw-r--r--   1 pi pi    33915 Oct  6 09:30 recent_publications.html
drwxr-xr-x   3 pi pi     4096 Aug 29 05:00 science-meeting
-rw-r--r--   1 pi pi      804 Sep 11 02:42 search.html
-rw-r--r--   1 pi pi    20218 Oct  6 10:19 sfsix.html
drwxr-xr-x   5 pi pi     4096 Oct  4 21:48 SOCCR
-rw-r--r--   1 pi pi    25392 Oct  6 09:30 staff.html
-rw-r--r--   1 pi pi    20205 Oct  6 10:19 tetrachloroethene.html
-rw-r--r--   1 pi pi    24615 Oct  6 09:30 trace_gas_emissions.html
-rw-r--r--   1 pi pi    22496 Oct  6 09:30 tracegases.html
drwxr-xr-x  16 pi pi     4096 Oct  4 21:48 trends
-rw-r--r--   1 pi pi    22899 Oct  6 09:30 vegetation.html
drwxr-xr-x   2 pi pi     4096 Sep 23 00:21 wdca
-rw-r--r--   1 pi pi    14568 Sep 10 07:03 wdcinfo.html
-rw-r--r--   1 pi pi    39921 Oct  6 09:30 whatsnew.html
-rw-r--r--   1 pi pi 11075997 Sep 11 02:44 wwwstat.html
Just as a reminder: lines starting with a ‘d’ are directories full of stuff; lines starting with a ‘-‘ are just ordinary files. For a file, the size shown is the size of the file itself; for a directory, it is the size of the directory structure only, NOT including the data files saved inside it. For example, wdca is shown as one 4k block. (That’s one ‘inode’, or information node, size on this file system, and can hold pointers to a modest number of files plus their metadata.) What’s in wdca?
root@RaPiM2:/Temps/cdiac.ornl.gov# ls -l wdca
total 36
-rw-r--r-- 1 pi pi 31841 Oct  6 09:30 wdcinfo.html
-rw-r--r-- 1 pi pi  2350 Mar 29  1999 wdclogo.jpg
root@RaPiM2:/Temps/cdiac.ornl.gov#
Two files of about 34 kB total size. Looks like a web page (.html) and a graphic (.jpg) in it.
Here’s a sorted list of ‘big lumps’, cut off at a convenient place:
root@RaPiM2:/Temps/cdiac.ornl.gov# cat 1DU_mb_out
125576  ftp
574     oceans
167     epubs
74      trends
70      SOCCR
25      programs
22      carbonmanagement
19      newsletr
16      images
11      wwwstat.html
4       science-meeting
3       ndps
2       datasets
1       whatsnew.html
1       wdcinfo.html
1       wdca
1       vegetation.html
1       tracegases.html
1       trace_gas_emissions.html
1       tetrachloroethene.html
1       staff.html
1       sfsix.html
1       search.html
1       recent_publications.html
1       pns
1       permission.html
1       ozone.html
1       oxygenisotopes.html
1       oceans.9
Everything from there on down just shows as 1 MB, as I counted these up in 1 MB chunks (du -ms *).
As you can see, almost everything is in the ftp directory.
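For anyone wanting to make their own 1DU_mb_out style summary, it’s just the output of that du command piped through a reverse numeric sort. A tiny stand-alone demonstration (the sample directory names here are made up for illustration):

```shell
# Build a toy tree, then size it the same way: du -ms * | sort -rn
mkdir -p sample/big sample/small
dd if=/dev/zero of=sample/big/blob bs=1024 count=3072 2>/dev/null  # ~3 MB file
echo "hello" > sample/small/tiny
( cd sample && du -ms * | sort -rn )   # biggest lumps first, sizes in MB
```

The -m flag counts in 1 MB units (rounding up), -s gives one summary line per argument, and sort -rn puts the biggest first.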
Even in that directory, it is a long list. I’m going to put all of it here. Files ending in a single digit, on random checks of a couple, seem to be web pages describing the associated data. Since MB chunks are not enlightening about the smaller files, I’m going to count up sizes in kB for the ftp directory. You can see that almost all of the data bulk is in ‘ameriflux’ and ‘oceans’, at 66 GB and 20 GB respectively.
root@RaPiM2:/Temps/cdiac.ornl.gov/ftp# du -ks * | sort -rn 66151664 ameriflux 20475520 oceans 11390940 ndp026c 3568792 us_recordtemps 3043020 nlcd92 2739560 ushcn_snow 2435772 nlcd2001 2187568 ndp026b 2098636 FACE 1453216 ale_gage_Agage 1202680 ndp026d 1066480 Atul_Jain_etal_Land_Use_Fluxes 1045564 ushcn_daily 985912 ndp068 869632 russia_daily 765300 ndp048 667460 ndp088 667168 ndp048r1 564988 ndp048r0 502224 ndp076 491396 db1013_v2011 485332 ndp040 357004 images 289244 global_carbon 284736 ndp026e 277492 ndp070 276784 ndp081 233812 ndp005a 227204 ndp055 200344 CDIAC_UWG_Presentations_Sept2010 166652 Nassar_Emissions_Scale_Factors 163064 ICRCCM-radiative_fluxes 142736 ushcn_v2.5_monthly 127356 db1005 90392 fossil_fuel_CO2_emissions_gridded_monthly_v2009 89740 fossil_fuel_CO2_emissions_gridded_monthly_del13C_v2009 82816 ndp026a 82380 fossil_fuel_CO2_emissions_gridded_monthly_del13C_v2013 82344 fossil_fuel_CO2_emissions_gridded_monthly_v2013 81068 fossil_fuel_CO2_emissions_gridded_monthly_del13C_v2012 81032 fossil_fuel_CO2_emissions_gridded_monthly_v2012 79712 fossil_fuel_CO2_emissions_gridded_monthly_del13C_v2011 79676 fossil_fuel_CO2_emissions_gridded_monthly_v2011 78920 ndp017b 78296 fossil_fuel_CO2_emissions_gridded_monthly_del13C_v2010 78236 fossil_fuel_CO2_emissions_gridded_monthly_v2010 77216 ndp005 69740 ndp064 68568 db1019 66928 cdiacpubs 53860 ndp059 47116 Tris_West_US_County_Level_Cropland_C_Estimates 45624 ndp020 40328 ushcn_v2_monthly 36500 ndp078a 35468 ndp041 35188 ndp080 35052 ndp035 29824 db1015 27852 cdiac129 24788 ndp043c 24596 ndp054 23732 ndp043a 23232 ndp046 22708 trends 22156 ndp074 21712 ndp037 21008 ndp026 20760 ndp065 20056 ndp058_v2009 19948 db1013_v2009 19444 ndp075 18040 ndp018 17836 db1013_v2013 17792 ndp058_v2013 17728 db1013_v2012 17704 ndp058_v2012 17580 ndp058_v2011 17484 ndp058_v2010 17348 ndp042 17092 ndp090 16828 ndp058 14892 trends93 14784 ndp043b 14176 HIPPO 13700 ndp039 13468 ndp067 13100 ndp030 13100 fossilfuel-co2-emissions 
12384 ndp055b 11968 ndp082 11892 ndp047 11032 ndp044 10776 cdiac140 10312 db1012 9760 tr051 9372 ndp062 9072 ndp004 8332 ndp056 7908 cmp002 7532 ndp086 7456 bibliography 7376 ndp049 7112 ndp071 6916 ndp089 6876 ndp051 6792 ndp011 6752 ndp017 6544 ndp027 6216 ndp066 6140 ndp021 6060 ndp053 5928 ndp057a 5772 CSEQ 5672 maps-co 5672 db1020 5492 ndp087 5356 tr055 5056 ndp060 4952 maunloa.calibration.tar.Z 4352 ndp052 4148 ndp077 4020 Global_Carbon_Project 3948 ndp084 3932 db1009 3844 co2sys 3808 db1016 3764 ndp063 3720 ndp036 3612 ndp045 3420 db1021 3356 db1008 3352 maunaloa.hourly5886 3312 ndp001a 3164 ndp085 2592 ndp009 2572 cdiac74 2540 ndp032 2480 ndp057 2420 ndp079 2372 db1007 2276 ndp006 2212 ndp025 2084 er0649t 2024 Smith_Rothwell_Land-Use_Change_Emissions 1988 ndp001 1988 maunaloa-co2 1888 db1004 1796 ndp033 1600 ndp058a 1560 db1011 1484 db1017 1452 GISS3-D 1428 ndp007 1284 ndp061a 1212 ndp073 1140 ndp050 1064 ndp072 968 ndp013 912 db1013 896 tdemodel 460 quay_dc13_ch4 400 cdiac130 312 ndp028 196 methyl_chloride-khalil_rasmussen 196 db1010 144 ndp029 128 ndp034 112 db1022 104 ndp023 84 ndp003 84 ndp002 80 ndp022 80 cdiac136 68 ndp014 68 db1014 64 db1017.1 60 index.html?C=S;O=D 60 index.html?C=S;O=A 60 index.html?C=N;O=D 60 index.html?C=N;O=A 60 index.html?C=M;O=D 60 index.html?C=M;O=A 60 index.html?C=D;O=D 60 index.html?C=D;O=A 60 index.html 56 ndp058.1 52 ndp048.1 52 ndp026b.9 52 ndp026b.8 52 ndp026b.7 52 ndp026b.6 52 ndp026b.5 52 ndp026b.4 52 ndp026b.3 52 ndp026b.2 52 ndp026b.1 48 ndp040.1 44 db1018 44 db1016.9 44 db1016.8 44 db1016.7 44 db1016.6 44 db1016.5 44 db1016.4 44 db1016.3 44 db1016.2 44 db1016.12 44 db1016.11 44 db1016.10 44 db1016.1 40 ndp034r1 40 ndp030r8 40 ndp022r2 40 ndp021r1 40 ndp020r1 40 ndp019r3 40 ndp019 40 ndp008r4 40 ndp008 40 ndp005r3 40 ndp004r1 40 ndp003r1 40 ndp001r7 40 db1013_v2010 40 cdiac115 24 ale_gage_Agage.1 20 ndp044.1 20 ndp039.8 20 ndp039.7 20 ndp039.6 20 ndp039.5 20 ndp039.4 20 ndp039.3 20 ndp039.2 20 ndp039.1 20 ndp005a.9 
20 ndp005a.8 20 ndp005a.7 20 ndp005a.6 20 ndp005a.5 20 ndp005a.4 20 ndp005a.3 20 ndp005a.2 20 ndp005a.1 16 ushcn_daily.8 16 ushcn_daily.7 16 ushcn_daily.6 16 ushcn_daily.5 16 ushcn_daily.4 16 ushcn_daily.3 16 ushcn_daily.20 16 ushcn_daily.2 16 ushcn_daily.19 16 ushcn_daily.18 16 ushcn_daily.17 16 ushcn_daily.16 16 ushcn_daily.15 16 ushcn_daily.14 16 ushcn_daily.13 16 ushcn_daily.12 16 ushcn_daily.10 16 ushcn_daily.1 16 ndp070.5 16 ndp070.4 16 ndp070.3 16 ndp070.2 16 ndp070.1 16 moisture.indices.prc.dat 12 ndp068.1 12 ndp055.9 12 ndp055.8 12 ndp055.7 12 ndp055.6 12 ndp055.5 12 ndp055.4 12 ndp055.3 12 ndp055.2 12 ndp055.1 12 ndp035.1 8 ushcn_v2.5_monthly.9 8 ushcn_v2.5_monthly.8 8 ushcn_v2.5_monthly.7 8 ushcn_v2.5_monthly.6 8 ushcn_v2.5_monthly.5 8 ushcn_v2.5_monthly.4 8 ushcn_v2.5_monthly.3 8 ushcn_v2.5_monthly.2 8 ushcn_v2.5_monthly.12 8 ushcn_v2.5_monthly.11 8 ushcn_v2.5_monthly.10 8 ushcn_v2.5_monthly.1 8 quay_dc13_ch4.8 8 quay_dc13_ch4.7 8 quay_dc13_ch4.6 8 quay_dc13_ch4.5 8 quay_dc13_ch4.4 8 quay_dc13_ch4.3 8 quay_dc13_ch4.2 8 quay_dc13_ch4.1 8 ndp078a.1 8 ndp076.1 8 ndp067.1 8 ndp064.1 8 ndp061a.1 8 ndp059.1 8 ndp058a.1 8 ndp057.1 8 ndp047.1 8 ndp043c.1 8 ndp043b.1 8 ndp043a.1 8 ndp042.5 8 ndp042.4 8 ndp042.3 8 ndp042.2 8 ndp042.1 8 ndp041.1 8 ndp032.1 8 ndp026c.1 8 ndp026a.9 8 ndp026a.8 8 ndp026a.7 8 ndp026a.6 8 ndp026a.5 8 ndp026a.4 8 ndp026a.3 8 ndp026a.2 8 ndp026a.12 8 ndp026a.11 8 ndp026a.10 8 ndp026a.1 8 ndp011.1 8 ndp009.1 8 mlo88.dat 8 db1015.1 8 1DU_mb_out 4 ushcn_daily.9 4 ushcn_daily.11 4 russia_daily.9 4 russia_daily.8 4 russia_daily.7 4 russia_daily.6 4 russia_daily.5 4 russia_daily.4 4 russia_daily.3 4 russia_daily.2 4 russia_daily.12 4 russia_daily.11 4 russia_daily.10 4 russia_daily.1 4 ndp077.1 4 ndp074.1 4 ndp073.1 4 ndp072.1 4 ndp071.1 4 ndp066.1 4 ndp065.1 4 ndp063.1 4 ndp062.1 4 ndp060.1 4 ndp057a.9 4 ndp057a.8 4 ndp057a.7 4 ndp057a.6 4 ndp057a.5 4 ndp057a.4 4 ndp057a.3 4 ndp057a.2 4 ndp057a.1 4 ndp056.1 4 ndp054.1 4 ndp053.1 4 ndp052.1 4 
ndp051.1 4 ndp050.1 4 ndp049.9 4 ndp049.8 4 ndp049.7 4 ndp049.6 4 ndp049.5 4 ndp049.4 4 ndp049.3 4 ndp049.2 4 ndp049.13 4 ndp049.12 4 ndp049.11 4 ndp049.10 4 ndp049.1 4 ndp046.1 4 ndp045.1 4 ndp037.1 4 ndp036.1 4 ndp034.1 4 ndp033.1 4 ndp030.9 4 ndp030.8 4 ndp030.7 4 ndp030.6 4 ndp030.5 4 ndp030.4 4 ndp030.3 4 ndp030.2 4 ndp030.12 4 ndp030.11 4 ndp030.10 4 ndp030.1 4 ndp029.1 4 ndp028.1 4 ndp027.1 4 ndp026.9 4 ndp026.8 4 ndp026.7 4 ndp026.6 4 ndp026.5 4 ndp026.4 4 ndp026.3 4 ndp026.2 4 ndp026.13 4 ndp026.12 4 ndp026.11 4 ndp026.10 4 ndp026.1 4 ndp025.1 4 ndp023.9 4 ndp023.8 4 ndp023.7 4 ndp023.6 4 ndp023.5 4 ndp023.4 4 ndp023.3 4 ndp023.2 4 ndp023.12 4 ndp023.11 4 ndp023.10 4 ndp023.1 4 ndp022.1 4 ndp021.1 4 ndp020.1 4 ndp019.1 4 ndp018.1 4 ndp017.9 4 ndp017.8 4 ndp017.7 4 ndp017.6 4 ndp017.5 4 ndp017.4 4 ndp017.3 4 ndp017.2 4 ndp017.1 4 ndp014.1 4 ndp013.1 4 ndp008.1 4 ndp007.9 4 ndp007.8 4 ndp007.7 4 ndp007.6 4 ndp007.5 4 ndp007.4 4 ndp007.3 4 ndp007.2 4 ndp007.12 4 ndp007.11 4 ndp007.10 4 ndp007.1 4 ndp006.1 4 ndp005.1 4 ndp004.1 4 ndp002.1 4 ndp001.9 4 ndp001.8 4 ndp001.7 4 ndp001.6 4 ndp001.5 4 ndp001.4 4 ndp001.3 4 ndp001.2 4 ndp001.13 4 ndp001.12 4 ndp001.11 4 ndp001.10 4 ndp001.1 4 methyl_chloride-khalil_rasmussen.9 4 methyl_chloride-khalil_rasmussen.8 4 methyl_chloride-khalil_rasmussen.7 4 methyl_chloride-khalil_rasmussen.6 4 methyl_chloride-khalil_rasmussen.5 4 methyl_chloride-khalil_rasmussen.4 4 methyl_chloride-khalil_rasmussen.3 4 methyl_chloride-khalil_rasmussen.2 4 methyl_chloride-khalil_rasmussen.13 4 methyl_chloride-khalil_rasmussen.12 4 methyl_chloride-khalil_rasmussen.11 4 methyl_chloride-khalil_rasmussen.10 4 methyl_chloride-khalil_rasmussen.1 4 images.1 4 ICRCCM-radiative_fluxes.1 4 GISS3-D.1 4 db1021.9 4 db1021.8 4 db1021.7 4 db1021.6 4 db1021.5 4 db1021.4 4 db1021.3 4 db1021.2 4 db1021.13 4 db1021.12 4 db1021.11 4 db1021.10 4 db1021.1 4 db1020.9 4 db1020.8 4 db1020.7 4 db1020.6 4 db1020.5 4 db1020.4 4 db1020.3 4 db1020.2 4 db1020.12 4 
db1020.11 4 db1020.10 4 db1020.1 4 db1019.9 4 db1019.8 4 db1019.7 4 db1019.6 4 db1019.5 4 db1019.4 4 db1019.3 4 db1019.2 4 db1019.12 4 db1019.11 4 db1019.10 4 db1019.1 4 db1018.1 4 db1014.9 4 db1014.8 4 db1014.7 4 db1014.6 4 db1014.5 4 db1014.4 4 db1014.3 4 db1014.2 4 db1014.13 4 db1014.12 4 db1014.11 4 db1014.10 4 db1014.1 4 db1013.1 4 db1012.1 4 db1011.1 4 db1010.9 4 db1010.8 4 db1010.7 4 db1010.6 4 db1010.5 4 db1010.4 4 db1010.3 4 db1010.2 4 db1010.12 4 db1010.11 4 db1010.10 4 db1010.1 4 db1009.1 4 db1008.1 4 db1007.9 4 db1007.8 4 db1007.7 4 db1007.6 4 db1007.5 4 db1007.4 4 db1007.3 4 db1007.2 4 db1007.13 4 db1007.12 4 db1007.11 4 db1007.10 4 db1007.1 4 db1005.9 4 db1005.8 4 db1005.7 4 db1005.6 4 db1005.5 4 db1005.4 4 db1005.3 4 db1005.2 4 db1005.13 4 db1005.12 4 db1005.11 4 db1005.10 4 db1005.1 4 db1004.1 4 co2sys.1 4 cmp002.1 4 cdiac129.1 4 bibliography.1 4 ameriflux.1 root@RaPiM2:/Temps/cdiac.ornl.gov/ftp#
So what’s that 3rd thing, ndp026c?
root@RaPiM2:/Temps/cdiac.ornl.gov/ftp# ls -ld ndp026c*
drwxr-xr-x 10 pi pi 4096 Sep 23 00:18 ndp026c
-rw-r--r--  1 pi pi 4995 Oct  4 10:54 ndp026c.1
[...]
root@RaPiM2:/Temps/cdiac.ornl.gov/ftp# cat ndp026c.1

[I’m stripping out the HTML tags from this web page doc so WordPress doesn’t go batshit crazy on them… -E.M.Smith]

DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"
Index of /ftp/ndp026

PLEASE NOTE: Land data have now been updated through 2009. Please see the updated documentation file NDP-026C_EECRA_Update_1997-2009.pdf. The original documentation file for NDP-026C: Extended Edited Synoptic Cloud Reports from Ships and Land Stations Over the Globe, 1952-1996, is ndp026c.pdf. The file ndp026c.txt is an ASCII version of the pdf file that contains most text documentation, but lacks tables and figures. The file ndp026c_readme.txt is an ASCII file that contains a basic overview of the database and a few key tables related to the contents of the data files. The monthly cloud data files for land and ocean are contained in the "land" and "ship" subdirectories, respectively. The period of record contained in each subdirectory is apparent from the subdirectory name, e.g., the subdirectory land_197101_197404 contains all land monthly data files from January 1971 through April of 1974.

August 21, 2012
root@RaPiM2:/Temps/cdiac.ornl.gov/ftp#
So that’s an example of how you can dredge through this kind of stuff with ‘grep’ to search for things, or just ‘cat’ to see what’s in a text type file. How do I know it’s text? The ‘file’ command is your friend:
root@RaPiM2:/Temps/cdiac.ornl.gov/ftp# file ndp026c.1
ndp026c.1: HTML document, ASCII text
So we now know that ndp026c is a load of “Extended Edited Synoptic Cloud Reports”… looking for a use…
Here’s a sample search using “USHCN” as the search string.
root@RaPiM2:/Temps/cdiac.ornl.gov/ftp# grep USHCN *
grep: ale_gage_Agage: Is a directory
grep: ameriflux: Is a directory
grep: Atul_Jain_etal_Land_Use_Fluxes: Is a directory
grep: bibliography: Is a directory
grep: cdiac115: Is a directory
grep: cdiac129: Is a directory
[...]
The ‘grep’ command is ‘global regular expression print’. It takes all sorts of regular expressions, like ^Pr, which would say “starting at the beginning of the line (that’s the ^), look for ‘Pr’, then print it out”. Given plain text, it just looks for that text. So here we said ‘look for USHCN and print those lines’. Directories are not files, so it complains for each of them. (There are ways to avoid that, but they are beyond this intro level, so I’ll skip them in what I paste in.)
ndp042.1:These files comprise a very early version of the USHCN data database
ndp042.2:These files comprise a very early version of the USHCN data database
ndp042.3:These files comprise a very early version of the USHCN data database
ndp042.4:These files comprise a very early version of the USHCN data database
ndp042.5:These files comprise a very early version of the USHCN data database
ndp070.1:These files comprise CDIAC's version of USHCN daily data through 2005.
ndp070.2:These files comprise CDIAC's version of USHCN daily data through 2005.
ndp070.3:These files comprise CDIAC's version of USHCN daily data through 2005.
ndp070.4:These files comprise CDIAC's version of USHCN daily data through 2005.
ndp070.5:These files comprise CDIAC's version of USHCN daily data through 2005.
ushcn_daily.1:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.1:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.10:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.10:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.12:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.12:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.13:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.13:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.14:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.14:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.15:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.15:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.16:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.16:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.17:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.17:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.18:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.18:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.19:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.19:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.2:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.2:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.20:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.20:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.3:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.3:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.4:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.4:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.5:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.5:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.6:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.6:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.7:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.7:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.8:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.8:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_daily.9:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Daily Dataset
ushcn_daily.9:These files comprise CDIAC's most current version of USHCN daily data.
ushcn_v2.5_monthly.1:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.1:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.10:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.10:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.11:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.11:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.12:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.12:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.2:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.2:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.3:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.3:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.4:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.4:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.5:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.5:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.6:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.6:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.7:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.7:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.8:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.8:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
ushcn_v2.5_monthly.9:UNITED STATES HISTORICAL CLIMATOLOGY NETWORK (USHCN) Version 2.5 Serial Monthly Dataset
ushcn_v2.5_monthly.9:These files comprise CDIAC's most current version of USHCN Vs. 2.5 monthly data
grep: ushcn_v2_monthly: Is a directory
So there’s an example of how you can rapidly find some interesting files to look through for USHCN. Unfortunately, a similar search on GHCN yielded nothing of interest. (But I already have v1, v2, and v3 data versions…)
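About those “Is a directory” complaints mentioned above: one of the ways to avoid them is GNU grep’s -d (--directories) option, which tells grep what to do when handed a directory. A tiny local demonstration (the demo directory and file names here are made up for illustration):

```shell
# -d skip tells GNU grep to pass over directories silently.
mkdir -p demo/a_directory
printf 'USHCN daily data\nsomething else\n' > demo/notes.txt
grep -d skip USHCN demo/*    # matches print; no "Is a directory" noise
```

With -d skip the matching lines come out just as before, minus the complaints; grep -r is the other common route, descending into directories instead of skipping them.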
In Conclusion
It is quite reasonable for anyone with a modest internet connection and $120 for a Raspberry Pi and hard disk (with powered hub) to set up a scraper to gather very large blocks of data and save them, even if The Officious Agencies don’t…
I find it rather amazing that things like the historical GHCN v1 and v2 can be tossed in the trash, yet vast swathes of disk space are used for “Ameriflux”, at least 32x larger. With all their $Billions of US Taxpayer Dollars, they can’t find just $60 for a TB disk at Best Buy to save it? Sigh.
On my “Someday” list is now to put up a temperature ftp server with old and archived copies of the data sets so that ‘change over time’ can be analyzed.
Next step is to list the other temperature agencies and where they have data available for gathering and preserving. I’ll start by listing GCOS. If anyone else has a link worth a look, put up a comment, please. Then we can sort out priorities and “who does what”.
http://www.wmo.int/pages/prog/gcos/index.php?name=ObservingSystemsandData
Excellent work. As you have been in the vanguard of unearthing, the changes to temperature data within versions are significant as well. Some sort of directory or file naming convention to establish a date stamp, and occasional rescrapes of that data, will prove interesting.
Your use of a modest rate is prudent, I think. NASA was quick to block Steve McIntyre years ago when he was attempting a data scrape; they marked his IP as PNG and pronounced him a hacker. Hopefully, your gentle retrieves of data will not elicit similar attention.
Good show!
===|==============/ Keith DeHavelle
It might be an idea to use Git to help sort out the different versions. Basically, store the new file over the old and then do a git add. If there are changes, Git will store them as differences, minimising the amount of space used, and also showing at a glance what changed. I’m sure there are other ways, but this is what Git was optimised to do.
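A minimal sketch of that workflow (the repo and file names here are hypothetical; the git commands are standard):

```shell
# Snapshot successive scrapes into a git repo: overwrite files with the
# new copy, add, commit. Unchanged files cost nothing extra, and
# 'git diff' shows at a glance what a rescrape changed.
mkdir -p scrape-repo
git -C scrape-repo init -q
echo "station data, version 1" > scrape-repo/ushcn_daily.txt
git -C scrape-repo add -A
git -C scrape-repo -c user.name=scraper -c user.email=scraper@example.org \
    commit -qm "first scrape"
echo "station data, version 2" > scrape-repo/ushcn_daily.txt  # rescrape overwrites
git -C scrape-repo add -A
git -C scrape-repo -c user.name=scraper -c user.email=scraper@example.org \
    commit -qm "rescrape"
git -C scrape-repo log --oneline    # one commit per snapshot, date-stamped
```

Each commit is automatically date-stamped, which also covers the naming-convention point above, and `git diff` between any two commits shows exactly what changed between scrapes.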
@Keith:
Well, one hopes they have learned that “Hide The Data” is a losing game for them once the press and congress start asking questions. Besides, I already have the parts that matter… it’s mostly other “products” to complete the set that are in process (and for NOAA that set is duplicated elsewhere…) so mostly they would just be freeing me up to do other more productive things… probably not what they would want ;-)
Also, since I’m doing this from a “number du jour”, blocking by IP would be particularly dumb. Aside from my tendency to wander to different places (and the Raspberry Pi has already been made portable via the Dongle Pi posting), my home IP resets on a powerfail / restart (or at worst after a DHCP timeout), so that’s not going to be very useful either (it would eventually just hit whoever got that last IP after the most recent powerfail / restart of the neighborhood…)
Not to mention, next on my list of networking ToDo’s is to get set up with a VPN to a remote place (preferably with some kind of ‘cloud’ storage available) and make the whole thing more geography-unlinked… well, they could speed up my progress on that…
Finally, part of why I’m posting about this is so that others can do it too. There is beauty in parallel processing and freedom in independent processors with a common cause… at all levels. (From markets to militia in the original Minute Men form to private businesses to independent researchers to… ) so at most there would be a momentary inconvenience to me and a bit more discussion with friends… (There are at least 1/2 dozen folks around me who would be happy to let me leech off their links and they are on at least 4 different providers… then there are the two mobile hot spots I own, though the data cost would be higher… I’d rather just park outside of a local Starbucks and let it run from the car while I have a nice long Grande Mocha inside… With my parabolic antenna I can be about 1/8 mile away and still get good speed. (It is only 3 inches, so a 1 foot would do even better…))
FWIW, there are settings for wget that let it masquerade as a person. You can set browser type, random pause between gets (that wait 10 can be a randomized wait from 0 to 10, or whatever you set) and more. A whole lot of folks have gone ahead of me on the whole “block ME will you!” response methods. So IFF somehow they were daft enough to block the IP, I’d just drop the kit in the portable case, cross the street to the friend’s house, plug it in and set it to pretend to be a Windoze Browser with a random wait of a minute between gets, at about a 100 kB rate, and on an ‘at’ command that only launched it when folks were not using their net (i.e. between about midnight and 7 am). Segment the wget into smaller subdirectories too, so each one runs, say, one day of the week. Set up 2 or 3 of those at different places and I’m getting faster download than now. What’s not to like? 8-)
(Yes, this isn’t my first Rodeo… )
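For the curious, the sort of thing I mean looks about like this (the user-agent string and rate are illustrative, not a recommendation):

```shell
# Sketch of a "polite but person-like" wget; written to a script file
# so it can be launched from 'at' or cron. Not executed here.
cat > /tmp/politeget <<'EOF'
#!/bin/sh
wget --mirror --no-parent \
     --limit-rate=100k \
     --wait=60 --random-wait \
     --user-agent="Mozilla/5.0 (Windows NT 6.1)" \
     http://cdiac.ornl.gov/ftp/ushcn_daily
EOF
chmod +x /tmp/politeget
sh -n /tmp/politeget && echo "script parses OK"
```

The `--random-wait` flag turns each `--wait` into a randomized pause, which is the bit that makes the traffic pattern look less like a robot.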
But still, it is a good idea to be polite about network use in any case. Most sys admins are appreciative of the effort, and if they see a polite scrape on public data just figure it isn’t a problem. So I’ve tried to always keep the aggregate below 200 kB / second and most of the time no more than 150 kB/sec to any one site (the exception being the ‘polish’ at the end where I let it catch up the skipped daily updates; since much of the traffic is just ‘not retrieving’ anyway… plus it’s nice if all the ‘daily updates’ end up being from the same day, so better if it doesn’t span several…)
@Paul:
I thought about it, but I’m still learning Git. It is intended for distributed projects, so more suited to a group effort (once / if one forms). For use on a single machine, the older ‘one machine only’ Source Code Control Systems would be easier to use.
But for now, my first Big Step is just to get a single Golden Master for each of GHCN V1, V2, and V3 (and maybe V3.5) sorted out of my big ball of accumulated dross… along with vetting my copies of the Dailies. I think I’ve got about 4 from over about as many years, but need to do a better categorizing of them. Then the same with the USHCN that I think comes in V1, V2, and V2.5 all of which I ought to have along with some uncharacterized dailies… I’ll worry about incremental update / variation “going forward” on more of a quarterly basis; and that means I have until about Jan 1, 2016 to build something for that ;-)
I’m also playing with SQL and I’ve got a basic data load working. So one thought I’m kicking around is to make a unique station / instrument identifier / Version key and store every single temperature from all the data sets in the database. Then reporting changes becomes pretty easy… But that needs some database design… and the input datasets categorized and vetted (see prior paragraph…)
Tossing Git in the middle of all that is not high on my list of Oh Boy! moments… though I might end up doing it anyway…
Mostly I’m still in the “what have I got?” and sort it out stage for now…
“I find it rather amazing that things like the historical GHCN v1 and V2 can be tossed in the trash, yet vast swathes of disk space are used for “Ameriflux”.”
NASA lost the original Apollo 11 moon tapes so nothing surprises me about this. The hand having written moves on…
Hi ChiefIO,
Thanks for the reply. As you say, it’s for “down the road”. With regard to the SQL, I’d say definitely get the metadata, i.e. individual station info, that sort of thing. But for the “raw” data, the actual measurements, you are probably better off parsing them into a CSV format that you are comfortable with. It’s a good deal faster, simpler for something like R to parse, and a lot more efficient storage-wise.
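A toy version of that parse-to-CSV idea, using awk to flatten a whitespace-separated record (the field layout here is invented, not the real USHCN format):

```shell
# Two fake fixed-format records: station, year, month, temp in tenths
printf 'USH0001 1998 07 224\nUSH0001 1998 08 231\n' > /tmp/raw.txt
# Flatten to CSV, converting tenths-of-a-degree to degrees as we go
awk '{ printf "%s,%s-%s,%.1f\n", $1, $2, $3, $4/10 }' /tmp/raw.txt > /tmp/raw.csv
cat /tmp/raw.csv
```

The resulting CSV is trivial for R (or anything else) to read, and as plain text it compresses well too.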
With regard to the downloading of data, would the ICOADS (the sea temperature stuff) be useful. I could have that running in the background here and maybe expose the file on cubie for downloading remotely. There’s also the GSOD dataset, and of course KNMI. It might be easier doing it this way than asking you to do even more work describing what you want and how to get it.
Well, the “catch up” synchronization finished:
A day and a quarter to catch up the changes from start of collection, and that was about 8% of the dataset at 10 GB. I’m now going to restart the NOAA scrape and let it finish that last directory unimpeded by competition for the wire or the disk.
Well, maybe a little competition for the disk… I’m going to launch a mksquashfs against the CDIAC data and see if it gets any smaller… but the Temps file system will be NFS mounted onto the R.PiM2 (as the 4 cores matters a lot to compression) and with output to a disk local to the M2, so I don’t expect the NFS reads to dominate anything…
With this, the CDIAC collection is done. And time to archive it. At about the end of the year I’ll look at doing another “catch up” and see what changes. As a log file is spit out of the process, I can get a clue about changes without needing to get fancy…
Here you can see how using ‘grep’ to look for lines starting with “Saving” finds the same number of lines of text as the final command says were updated. If one leaves off the “| wc -l” that does the line counting, it would print out the lines instead.
You can also stack up the ‘grep’ commands connected with pipes:
Tells me 24,510 of the “Saved” files are “ameriflux” while
Almost 30,000 of them are ocean data. That’s 53,000+ right there. That’s most of the changed data files. (How did I know to pick those two to search on? That prior ‘find the ^Saving lines and list them’ grep was run to the screen and I just watched stuff fly by. What showed up a LOT was easy to read in the blur…)
How to do this searching in a bit more compact way? Well, we can ‘invert’ the grep and only keep what is NOT the search key…
Yes, these can also all be stacked up with pipe symbols “|” but catching the intermediate results in a file lets you rummage around a bit more efficiently… And now I have a file with 7591 lines in it that are the non-Ameriflux non-Oceans files that were saved. Repeat the process until you find something interesting….
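For those following along at home, the whole dance looks about like this (the exact “Saving” line format is my assumption about the wget log layout; the log content here is faked up for illustration):

```shell
# Fake a few wget log lines so the pipeline has something to chew on
LOG=/tmp/cdiac-demo.log
printf '%s\n' \
  'Saving to: ameriflux/a.txt' \
  'Saving to: oceans/b.txt' \
  'Saving to: ushcn_daily/c.txt' \
  'some other log noise' > "$LOG"

grep '^Saving' "$LOG" | wc -l                    # total files saved
grep '^Saving' "$LOG" | grep ameriflux | wc -l   # ...that are ameriflux
# Invert with -v: keep what is NOT ameriflux and NOT oceans,
# catching the result in a file for further rummaging
grep '^Saving' "$LOG" | grep -v ameriflux | grep -v oceans > /tmp/leftover.txt
wc -l < /tmp/leftover.txt
```

Same shape, just point it at the real log instead of the faked one.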
(Beginning to see why I love *NIX systems? One scrape, and now with just a few lines of typing I can do all sorts of interesting “Digging Here!” ;-)
At 7k lines, I’m OK with just sucking it into an editor and poking around. Here’s one “no surprise”:
The USHCN Daily data changed…. I’d expect that over a few days.
I also spanned a month end, so this change in “monthlies” is not a surprise either:
There are also a lot of lines with “index” in them that show a change. These are just spurious, as wget downloads fresh directory listings each time, regardless of changes to the actual files IN the directory. They are the things copied over to decide what to actually copy over. I could remove all those lines with one more “grep -v”… (actual text left as an exercise… )
When you do that, you are down to almost 2 k lines:
That file tends to be MUCH more interesting and a whole lot easier to read. Here’s the top of it:
If you want the whole 1800 lines of it, well, you now know how to do it yourself ;-)
(Or for serious enquirers I can send a copy to folks. Or even post the whole thing if enough folks want it – with “enough” being about 3 ;-)
I’m especially interested in what changed in that last line:
/jonescru/jones.html
Maybe someone updated their CV?…
Further down I found:
Hansen? I thought he was gone already?… and there is some more playing with USHCN…
Looks like there’s a workshop going on with NOAA:
Don’t know what SOCCR is, but somebody is up to something again…
Hope this lets folks “get ideas” about the utility of such a system of archival, not just for capturing a static copy of the data, but also for picking out “what changed” and finding bits of interest you might not otherwise notice. Just remember that even if it is a bit ugly, and not a lot of fun at parties, *NIX is your friend and “a grep is a terrible thing to waste” ;-)
FWIW, I have the ‘wget’ command in an executable file named “synccdiac” and this ‘wrapper script’ goes around it to make the log file and such:
I have it named ‘synccdiacb’ (the trailing ‘b’ meaning ‘background task’) so I only need to type:
synccdiacb
at the command prompt and it all launches. Yes, you could type “nohup synccdiac…” long hand each time, but why bother? The trailing ‘&’ puts the synccdiac command in the background and the leading ‘nohup’ says “keep it running even if I log out and hangup the line”.
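A guess at the shape of that wrapper, with the inner ‘synccdiac’ replaced by a stand-in that just echoes instead of running the real wget:

```shell
# Stand-in for the real synccdiac (which holds the wget command)
cat > /tmp/synccdiac <<'EOF'
#!/bin/sh
echo "wget would run here"
EOF
# The 'b' wrapper: background it, keep it alive past logout,
# and capture a log file for later grep-ing
cat > /tmp/synccdiacb <<'EOF'
#!/bin/sh
nohup /tmp/synccdiac > /tmp/cdiac_sync.log 2>&1 &
EOF
chmod +x /tmp/synccdiac /tmp/synccdiacb
/tmp/synccdiacb
sleep 1                      # give the background job a moment
cat /tmp/cdiac_sync.log
```

The log path and names here are illustrative; the point is the one-word launch with nohup, redirection, and `&` baked in.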
So can you guess what I need to do to restart the NOAA run?…
Yup,
syncnoaab
Aren’t threaded interpretive languages fun? (AND you don’t have to type so much… Yes, the ‘learning curve’ is steep, but man does the payoff get huge fast… and stays that way the rest of your life.)
@Paul Hanlon:
I see this as a Fahrenheit 451 kind of thing… If it has to do with Temperatures, Global Warming, and any other thing that might have the data fudged to achieve The Agenda (21) then someone ought to save it. Each of us can “be a book” as our own interests drive us.
For me, I’m planning on getting:
CDIAC
NOAA
Cru / Hadley
Antarctic data sets
GSOD
Anything Arctic
Whatever local BOMs have data available. Australia? New Zealand? Canada? Iceland? etc.
I’ve already got the first one done. The second one is about 90% I’d guess. Between them I’ve got GHCN v1, v2, and v3 along with a couple of USHCN variations. As GIStemp is just a screw over of that data and I’ve got GIStemp code running here, I don’t see any need to archive their stuff, but it might make sense for showing how what they present as ‘fact’ keeps changing…
I’ve also got a couple of random grabs of some Hadley temperature data in some bucket somewhere, but have not got anything newer than a couple of versions back…
So pick something you like and “go for it”. I lay claim to NOAA at the moment, everything else is unclaimed. Also feel free to put up any other dataset idea you have. I have no monopoly on ideas here…
If, someday, I need to merge in a few TB of data, well, I can find a fast pipe to borrow for the weekend ;-)
Oh, and since I might get hit by lightning some day, it doesn’t hurt if someone duplicates a copy that I’ve already got. More is better than not enough.
Well, I decided to try making a squashfs file out of the old copy first (since the alternative is to just throw it away, and since any mistake on the new copy would be A Very Bad Thing…) and that mksquashfs run has sucked up all 4 cores of the R.PiM2 for about 12 hours now, and it is 71% finished… So figure about 18 hours all told. Maybe it’s worth it ;-)
Compression looks to be running about 50%, so a decent space savings (est. about 70 GB instead of about 120 GB). It is CPU limited, not disk limited, so using a journalling file system is just fine (i.e. any file system that supports files of many GB will do).
FWIW, I have a LOT of stuff that’s in compressed archives. Various machine backups, archives of old projects, what have you. They have historically been kept as gzip-ed tar archives (compressed Tape ARchive format files). As you can mount a squashfs file system and wander around in it without decompressing and unpacking the whole thing, which is more convenient than a gzip tar archive, I’m going to oh so slowly unpack things, toss out the trash, make a big backups file tree of it, and then make squashfs file systems out of chunks. It uses the same compression methods, so just as compact, but a whole lot more accessible for “I want a copy of that old 1 KB file in that 20 GB archive”…
As I’ve become comfortable with using squashfs file systems in loop mounted files, it has become clear that while the up front CPU load of compression hits you, the downstream “just mount it and get what you want out” along with the ability to append is a very nice combination.
Not a big priority, just while sitting pondering over morning coffee, just the sort of thing you can launch into the background to keep the equipment busy doing something of value…
@EMSmith; Pondering is good, I do it a lot ;-). Consideration of others that follow and might utilize your labors takes time, but it means your time and effort is an investment in the future rather than just for personal satisfaction. I am impressed with your progress in this creation and follow every post and comment…pg
@P.G.:
Well, thank you kindly. My hope is that I can “lead by example” and both save a lot of other folks a lot of time and also perhaps inspire a few to “Dig Here!” on doing their own data archival / comparisons. I’ve tried to make it pretty easy and low cost to do, and I think I’ve done that. 8-}
The “mksquashfs” on the older copy has completed, so I’ve now got ‘stats’ on it. I made a ‘one line script’ to do it that I called “squish” 8-) It just does the mksquashfs {first directory} {second directory} with an optional change of blocksize. Similarly, I have one called “sqit” with the default block size overridden to 64 KB. The use of /tmp as the target by default is just to prevent it from doing any Bad Thing if launched without an argument for the from or to directories… (Paranoid Programming? Nope… just experience… and it’s called “defensive programming” ;-)
Here’s “sqit”, which does an ‘in place’ mksquashfs with only one argument:
So if I have a directory named Foo I can just type “sqit Foo” and it runs off to make Foo.sqsh in that same location, but with half the default block size.
The one line command “squish” is similar, but keeps the default larger block size and lets you direct the output to a different location; useful for me as the data can come from an NFS mounted copy of the Temps data and be sent to a different local real disk…
Again, the $1 is the first directory argument and $2 is the second one. The use of the curly braces and the dash just cause /tmp to be the default value if none is given and thus prevent A Bad Thing sometimes… given that /tmp is disposable.
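A reconstruction of the two one-liners as described (the block size numbers and /tmp defaults match the text above; the exact form is my guess):

```shell
# "sqit Foo" -> Foo.sqsh alongside Foo, with half-size (64 KB) blocks
cat > /tmp/sqit <<'EOF'
#!/bin/sh
mksquashfs "${1:-/tmp}" "${1:-/tmp}".sqsh -b 65536
EOF
# "squish FROM TO" -> TO.sqsh at the default block size;
# /tmp is the harmless fallback if an argument is forgotten
cat > /tmp/squish <<'EOF'
#!/bin/sh
mksquashfs "${1:-/tmp}" "${2:-/tmp}".sqsh
EOF
chmod +x /tmp/sqit /tmp/squish
sh -n /tmp/sqit && sh -n /tmp/squish && echo "both parse OK"
```

The `${1:-/tmp}` form is the shell’s “use $1, or /tmp if it’s empty” default, which is the whole defensive-programming trick.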
So I said to ‘squish’ the data from my /TempArc temperature archive disk old copy of cdiac and put the result into a file named CDIAC.sqsh in the same location (note that $2… above has a .sqsh appended to the file name automagically… so I don’t have to keep typing it and so that the output is very unlikely to overwrite any real data by accidentally forgetting to type the .sqsh…)
At 70.2 GB, my earlier guess of “est. about 70 GB” was pretty darned good! (You might guess I’ve spent far too much of my life waiting for compress / decompress cycles… thus the origin of the “Chiefio” tag as “Chief of I/O”… back when doing a lot of Systems Admin work… long before moving up to be Director of I.T. And since I reported to the V.P. Business Affairs, i.e. head lawyer, I was the top I.T. guy of the company. Also had Facilities and a few other bits…) But I digress.
There are other interesting stats, like 187,607 files of which 25,418 are duplicates (wonder where they are…) and a 54.4% compressed size from the 129 GB original size.
Guess now I need to mount it and take it for a test drive. See if everything is ‘as expected’ and then contemplate tossing out the original uncompressed archive… Or just flag it as ‘disposable’ and leave it sit until I need the disk space. I’ll often make a directory “copied_to_foo” and put things in it. That way I have a de facto backup copy, but can just delete it any time I actually need the disk space. A very old habit that has been helpful too many times ;-)
FWIW, as of right now, I have 3 TB of disks that are full of such “duplicates” as I’ve weeded down to about 2 TB of what looks like mostly single copies of what I need to keep. (I think I can get that down by another 500 MB without too much risk). Once that whole ‘weed and shrink’ is done, then I’m looking at making a squashfs version of it onto one of those 3 TB of “backups”… and empty the rest… It’s amazing how much stuff can accumulate over 30 years … especially when you tend to keep the old copies around in ‘delete me’ directories but then just grab the whole machine archive set when something goes flaky and then THAT ends up being duplicated a few times… Maybe I can get it down to under 1 TB of actual stuff … 8-} (Like, do I REALLY need that canonical collection of all the Debian 6.0 releases and variations by processor type? And just what will I be installing Red Hat 6 and 7 onto anyway?… )
Well, time to check that the “bread has done rizz” and turn it to bake, then come back and test drive the squashfs file system, then a cup of tea with fresh bread as I figure out what’s next ;-)
Well, the bread has risen, baked, and first warm slices ‘down the hatch’… yum! Fresh real butter at room temperature soaking into warm fresh from the oven bread… hard to beat! It was an “artisan” bread, meaning crusty with coarser texture… but very nice. ( I’m playing around with a ‘no knead’ recipe… posting up in a few hours after ’round two’ is done… )
So the first thing I “figured out what’s next” is just that the mounted squashfs file system is really nice and surprisingly fast and all (what the decompression taketh away, the far lower number of seeks and block reads on the disk seems to giveth back…); however: as it is mounted read-only, any compressed thing you left IN that re-compressed image is a PITA to look inside…
So this is an ‘ls’ listing of the USHCN v2 monthly data still hanging out in that blob. Notice that all of it is “.gz” file type? Normally you just “gunzip foo.gz” and root around in the extract. Except that this is a RO file system… so you end up in the “move somewhere else to unpack and inspect” game… which kind of defeats the whole reason for the squashfs file system in the first place…
Other than that, things have been fine so far.
So “lesson learned” is to just go through any giant blob you are going to turn into a squashfs file system and unzip, gunzip, uncompress, etc etc etc all the wads inside it until it is ALL unpacked and unzipped and uncompressed. THEN make your squashfs file system file (that will recompress it all but in a way that lets you look inside the bits).
This can take some fair amount of effort, as you need to check each compressed wad before you uncompress it to assure it does something sane (like putting things in an appropriately named subdirectory) instead of something stupid (like extracting a bunch of files in place where you will be unpacking a dozen other such compressed wads, all of them expecting to write out a README file… and only one survives).
Also, you get to FIND all those compressed files. Yes, you can use the ‘find’ command (that has more options than a mangy dog has fleas) but that can be a pain for the uninitiated to do as find is a bear to get right the first few times and “A find is a terrible thing to waste!” (Groan… 8-} )
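For the uninitiated, here’s the basic shape of that find (the extension list is just the usual suspects; extend as needed, and the demo tree is faked up for illustration):

```shell
# Fake a small tree with a couple of compressed wads in it
mkdir -p /tmp/finddemo/a /tmp/finddemo/b
touch /tmp/finddemo/a/data.gz /tmp/finddemo/b/old.zip /tmp/finddemo/b/plain.txt
# List every compressed file anywhere under the tree; the \( ... \)
# groups the -o (OR) clauses so -type f applies to all of them
find /tmp/finddemo -type f \( -name '*.gz' -o -name '*.tgz' \
     -o -name '*.zip' -o -name '*.Z' -o -name '*.bz2' \) -print | sort
```

Point it at the real archive tree instead of /tmp/finddemo and you have your worklist of wads to unpack before squashing.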
At any rate, just realize that it is very easy to make the squashfs, it works as a live file system very very fast and efficiently; but any compressed wads inside of it become a bit of a pill to deal with and you end up back in the ‘copy and extract’ game.
Oh, and the tea was a nice Ceylon Earl Gray loose tea from the local Middle East Persian market… very nice… They also have Baltic area sprats with Cyrillic writing on the tin that are wonderfully and deeply smoked. Latvian or some such… but again I digress…
OK, with that, I think it’s time for me to move on to making some new posting or two. The scrapes are well detailed, the sizes and issues characterized, the how to store explored, and the best way to do the squashfs (i.e. after unzipping and uncompressing) discovered. From here on out it is more grunt grinding it out than exploration, so I’ll be saying less about it. Mostly just updates as some wad gets done or IFF I run into some kind of ‘gotcha’ issues.
Excellent ChiefIO,
Okay, I’ll get downloading GSOD and ICOADS. ICOADS is what is used for HadSST, so that might be interesting. I’d also like to explore the KNMI data archive (Bob Tisdale seems to swear by it, and that’s good enough for me:-)). Also, I think the DMI dataset is stored there, and that would be very interesting as I think it is the only one with *actual* temp measurements from the buoys in the Arctic. It will take a few days between that and putting up a page with links to them on the web, but once I have it done I’ll post up a link here.
Pingback: CDIAC, Compression, Squashfs, And Oddities | Musings from the Chiefio