CDIAC, Compression, Squashfs, And Oddities

In a prior posting, in a comment here:

I’d found that though turning the scrape file tree into a squashfs file system worked Just Dandy and the speed in many things (like listing directories and counting up file sizes) seemed faster (since fewer disk seeks and block reads and as long as you have the CPU left over, you don’t notice the decompression taking CPU); there was One Small Problem…

Many of the files in the file system were themselves already compressed wads. Some ending in .Z so “Zip”ed while others ended in .gz for the GNU gzip format. First off, it doesn’t make much sense to try and compress already compressed wads. The CPU is wasted searching for a compression approach that never has benefit. Second, if / when you want to look at some file, you find it in a .Z or .gz wad that you must uncompress… but since the squashfs file system is Read Only, you can’t just say “unzip FOO.Z” or “gunzip Bar.gz” and unpack in place. Nooo… you have to copy all those bits to a r/w file system (assuming you have the free space… which may be many GB) and THEN do the unzipping… Which kind of defeats the whole purpose of the squashfs file system as an easy to read / search (yet still compressed) archive.

The answer, of course, is to first unzip all the zips and THEN compress them all.

Then Reality Struck

I had saved a copy of the cdiac.ornl download partway through (near the end) during one long pause. Then did a ‘restart’ later that continued to completion. I’ll often save intermediate copies of long running things, just so that ‘restart’ after any kind of disaster is easier.

I also can feel free to use that mid-point copy as a test model for things like a decompress / recompress cycle. If it gets broken, I just copy the final one over again. So that’s what I was doing. Using it as the test copy when I found that the zipped files were numerous.

Testing the size of a couple when unzipped, and looking through directories by hand to get an idea of “how many?” showed “a whole lot”… Hundreds.

Far too many to convert by hand.

How many? I made a script to find compressed files, and print their names (named “findcomp”) and then piped that into the “wc -l” command that counts lines. Thus making a command to count number of compressed files found.

pi@RaPiM2 /Temps $ findcomp | wc -l

Yup, 38,607 of those suckers. Yikes! Not going to do THAT by hand…

So I did what any good *Nix Systems Admin would do. I made a couple of short scripted commands to do the work for me. Most of this posting will be about those commands (scripts) and what they found.

Mostly it comes down to my needing to use the “find” command.

A Short Digression On Find:

I have a love / hate relationship with Unix / Linux systems. I love what they let me do, and sometimes I hate how they make me do it. The “find” command is one of those. It is God’s Own Gift to SysAdmins. It lets you root around in massive amounts of disk space and, based on a long list of things you might specify, do darned near anything else on only those files you found that met your rules. Great!

Except the syntax is painful.

So bear with me and I’ll try to make it as non-painful as possible.

Most of the issue is special characters that are grabbed by the command interpreter “shell” and used there, but you want to stop that and let “find” use them as it “finds” stuff. So you end up in a couple of levels of quoting. Which would not be too bad if *Nix didn’t let you quote in most any imaginable way often mixing different kinds on one line. There are paragraphs on this in the “man” pages. The “short form” is that regular quotes like ” quote some things, while the backslash \ quotes the single character after it. Single quotes can also be used and are almost the same, but not quite… and more…
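Here is a small sketch of the three quoting styles just described. The variable name "name" and the strings are made up purely for illustration; nothing here comes from the actual scripts.

```shell
#!/bin/sh
# Hypothetical demo of shell quoting; 'name' is an illustrative variable.
name="world"

# Double quotes: no globbing or word splitting, but $variables still expand.
double="hello $name *"

# Single quotes: everything inside is literal, including $ and *.
single='hello $name *'

# Backslash: quotes only the single character that follows it.
escaped=hello\ \$name\ \*

printf '%s\n' "$double" "$single" "$escaped"
```

The first line printed has the variable filled in, while the other two keep the literal $name. In none of them does the shell expand the splat, which is the whole point when handing a pattern to find.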

Then you need to be able to hand the file name that find found to “whatever” you choose to execute on it. That gets its own {} method of marking substitution.

Finally, there’s name globbing / regular expressions. *Nix lets you specify all sorts of special characters that get ‘expanded’ or turned into many other names / characters at the time of use. So the asterisk * (called “splat”) can “match zero or more of any characters”. Thus the use in *nix meaning anything ending in “nix” such as Unix, Xenix, Minix… (but humans are not as literal as computers, so we think it also means Linux, Ultrix and Posix, but notice those don’t actually end in ‘nix’ … but people are like that… computers not so much…)
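To see how literal the computer is about it, here is a minimal sketch using the shell's own pattern matching. The function name matches_nix is made up for the demo. Note that in the shell the splat also matches nothing at all, so the bare word "nix" matches too.

```shell
#!/bin/sh
# Hypothetical demo: which names does the glob *nix match?
matches_nix() {
    case "$1" in
        *nix) echo "match" ;;
        *)    echo "no match" ;;
    esac
}

for word in Unix Xenix nix Linux Posix; do
    printf '%-7s %s\n' "$word" "$(matches_nix "$word")"
done
```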

Any one of those is a bit of a chunk to chew, and “find” mixes all of them in a joyous confusion of syntax. Which is why a few times now I’ve alluded to “find” and said something like “There’s a way to do that, but I’m not going to get into that because it’s complicated”… or similar. So now we’re going to get into it. But only a little and only enough for this problem.

But part of the reason I made “cmd” that tosses me into an editor in my “bin” directory and makes the file executable when done (so I can write little ‘one line scripts’ easily) and ‘bcat’ that prints out those commands from my ‘bin’ directory was directly because I’d mistyped a find command a few times trying to ‘get it right’ and just got tired of it. So I automated the process of capturing a “one line command” so that I could ‘get it right once and be done’.

The Find Commands

First off, that “findcomp” command. How did it find those files?

Here I just use the regular “cat” command to print it out. (As it is in the bin shared with everyone including root, not in my private bin). Note that I use “advisory print” via the ‘echo’ command. That just puts stuff on your screen.

First I look for all the “gunzip” type files ending in .gz, then I look for the ones ending in .z or .Z (two variations on ‘zip’).

pi@RaPiM2 /Temps $ cat /usr/local/bin/findcomp
echo Looking for gunzip .gz files in $1
find $*  -iname "*.gz" -print  | more
echo Now looking for unzip .z or Gunzip-able .Z files
find $*  -iname "*.z" -print | more
echo All Done!

OK, let’s decode the “find” commands. The “$” says “passed in argument” and the “*” says “however many you sent”. Each one gets a number. $1 or $2 or $3. etc. So if I’m in a directory with the files Foo, Bar, and Gilligan; typing “cat *” inside cat has $1=Foo, $2=Bar, and $3=Gilligan. Note that the file globbing via the splat * is very similar to, but not exactly the same, as the use of splat to mean “all passed arguments”. But you can pretty much ignore the difference and live a long and happy life.
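One place where the difference does bite is a directory name with a space in it. The sketch below is an illustration, not part of the scripts above: count_args is a made-up helper, and the two arguments are invented. It compares the unquoted $* with the quoted "$@" form that keeps each argument whole.

```shell
#!/bin/sh
# Hypothetical helper: report how many arguments it was handed.
count_args() {
    echo "$#"
}

# Simulate a script called with two arguments, one containing a space.
set -- "ushcn daily" "ftp"

unquoted=$(count_args $*)     # re-split on whitespace: 3 words
quoted=$(count_args "$@")     # one word per original argument: 2

echo "unquoted \$* gives $unquoted arguments"
echo "quoted \"\$@\" gives $quoted arguments"
```

For a tree like this one, with no spaces in the top level directory names, either form hands find the same list.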

So when I type “findcomp” that gets turned into:

find -iname “*.gz” -print | more

in that first line.

Now the fun bits… There are dozens of settings you can give to find. You can have it look for creation time, or last modification time, or sizes or … The “name” and “iname” (and many others) look at the name of the file, not the metadata or the contents. Name just matches the name. But “iname” is Insensitive to case in the matching. So this says to match .gz or .Gz or .gZ or .GZ at the end of the file. This doesn’t do much as gunzip files pretty much always end in .gz (unless they have been on a FAT file system that makes everything uppercase). The next line using iname and .z matters more, as both .z and .Z turn up and they are not quite the same format (.Z is traditionally the suffix of the old Unix “compress” program). More on that below.

Note, too, that in a regular expression the period . (called dot) is a wild card itself that says ‘match any character, but only and exactly ONE character’ and would need to be ‘escaped’ with a quote of its own (the single character quoting backslash, so: “\.gz” ). The -name and -iname tests use shell style glob patterns instead, where the dot is just a dot, so “*.gz” will not match a file named like borgz or frogz. If you ever switch to the -regex test, though, that little bit of finesse matters.
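A quick way to check how the dot behaves, as a sketch in a scratch directory with made-up file names: find's -name test uses shell style glob patterns, where a dot is literal, while a regular expression (here via grep) treats an unescaped dot as "any one character".

```shell
#!/bin/sh
# Demo in a scratch directory; the file names are made up for illustration.
tmp=$(mktemp -d)
touch "$tmp/data.gz" "$tmp/frogz" "$tmp/borgz"

# -name uses glob patterns: the dot only matches a literal dot.
glob_hits=$(find "$tmp" -type f -name "*.gz" | wc -l)

# grep uses regular expressions: an unescaped dot matches any character.
regex_hits=$(find "$tmp" -type f | grep -c '.gz$')

# Escaping the dot makes the regular expression literal again.
escaped_hits=$(find "$tmp" -type f | grep -c '\.gz$')

echo "glob: $glob_hits  regex: $regex_hits  escaped regex: $escaped_hits"
rm -rf "$tmp"
```

Only the unescaped regular expression picks up frogz and borgz.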

There are ways to put both searches into one ‘find’, but I find it easier to leave them separate and traverse the name space (file tree) repeated times (unless the size is horrid… then I’ll put in the time to make it more efficient).
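For the curious, a one pass version might look something like the sketch below. The name findcomp1 is hypothetical, and the -type f test and the quoted "$@" are my additions, not part of the script shown above. The escaped parentheses group the two name tests and -o means "or".

```shell
#!/bin/sh
# Hypothetical one-pass variant of findcomp (the name findcomp1 is made up).
findcomp1() {
    # \( ... \) groups the alternatives; -o means "or".
    find "$@" -type f \( -iname "*.gz" -o -iname "*.z" \) -print
}

# Scratch-directory demo with made-up file names.
tmp=$(mktemp -d)
touch "$tmp/a.gz" "$tmp/b.Z" "$tmp/c.z" "$tmp/d.txt"
findcomp1 "$tmp" | wc -l
rm -rf "$tmp"
```

The tree is walked once instead of twice, at the cost of yet more characters that need protecting from the shell.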

Once a ‘name’ is found, what do we do with it?

Again there are dozens of choices. The simplest is just “print it out”. That is what -print says to do. Finally, I glued on a | symbol (called ‘pipe’) that connects the output of the find to the input of the “more” command. It is just a ‘pager’ that displays a page, then stops scrolling until you hit enter. Useful if you want to look at 38,000 lines and not have them scroll past in a blur.

Notice that in my example, I piped “findcomp” into “wc -l”. The “more” pager is smart enough to notice that its output is going into a pipe and not to a screen, so it just passes everything through and doesn’t get in the way…
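A script can make the same kind of decision itself by testing whether its standard output is a terminal. This is only a sketch of the idea, with a made-up function name, and not a claim about how any particular pager does it internally.

```shell
#!/bin/sh
# Hypothetical wrapper: page only when stdout is a terminal.
page_if_tty() {
    if [ -t 1 ]; then
        more            # a person is watching: page the output
    else
        cat             # output goes to a pipe or file: pass it through
    fi
}

# In a pipeline stdout is not a terminal, so all lines reach wc untouched.
seq 1 100 | page_if_tty | wc -l
```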

So that’s the command that will go wandering whatever chunk of file name space you feed it and look for anything that reminds you of a compressed file.

The Find and Unzip it command

After testing that one and getting it showing only what really was zipped and needing unzipping (and looking over what was actually there; .gz or .GZ or…) it was time to actually unzip some stuff. I did one or two by hand just to make sure I had it right, then launched this on the whole tree.

Notice that the first ‘find’ is now just -name as I saw only .gz endings in the look-see.

Notice too that I split the .z and .Z into two passes. Turns out unzip will handle .z (but gunzip doesn’t on this box) while unzip doesn’t understand .Z here and gunzip does. Go figure… (but capabilities of each do vary with the particular *Nix distribution).

pi@RaPiM2 /Temps $ cat /usr/local/bin/findgunzip
echo Looking for gunzip .gz files in $1

find $*  -name "*.gz" -print -exec gunzip '{}' \;

echo Now looking for unzip .z or Gunzip-able .Z files

find $*  -name "*.z" -print -exec unzip '{}' \;

find $*  -name "*.Z" -print -exec gunzip '{}' \;

echo All Done!

OK, here’s where that talk about quoting above comes in. I use double quotes around the *.z so that the * doesn’t get taken by the shell, but is still substitutable by the find command. The curly braces {} say “stick the name you found here”, but also need protection from the shell. Here I used single quotes because all the examples did… it might or might not matter. Finally, when you use the -exec command that “executes” a non-find-built-in command (like unzip) to which arguments are passed (those {} that stand in for a file name) you must end the line with a semicolon… but the shell wants to grab that, too, so it is quoted as a single character by using the \ in front of it.
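Here is the .gz pass again as a small self contained sketch you can try in a scratch directory. The function name gunzip_tree and the sample file are made up, and the quoted "$@" and -type f are my additions, there so that names with spaces survive.

```shell
#!/bin/sh
# Hypothetical variant of the .gz pass (the function name is made up).
# "$@" keeps directory names with spaces intact; '{}' is replaced by each
# file name found; the escaped \; ends the -exec command.
gunzip_tree() {
    find "$@" -type f -name "*.gz" -print -exec gunzip '{}' \;
}

# Scratch-directory demo with a made-up file.
tmp=$(mktemp -d)
echo "some data" > "$tmp/sample.txt"
gzip "$tmp/sample.txt"          # leaves sample.txt.gz
gunzip_tree "$tmp"              # prints the name, then unpacks it
ls "$tmp"
rm -rf "$tmp"
```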

Now you see why I’ve avoided mentioning the ‘find’ command all these years…

Yes, it’s a Royal PITA. But the good news is that it’s about as bad as things ever get.

And why I construct it in scripts where I can avoid typing all those quotes of various kinds over and over and over…

So, with those two made, and made ‘executable’ via “chmod +x scriptname”, I can stop caring about how they work and just use them…

Inside the Partial Download

I emphasize that this is the partial download as there were a couple of ‘surprises’ that might be artifacts of my stop / start / stop / start running of it with -nc “no clobber” set. Partial files likely didn’t get over written with the settings I had, leaving the broken 1/2 download in place. (That was the job of the final preening and sync run that was not done on this partial file system). Still, the “errors” are instructive of what to watch out for in the final version. (More on that in a few days…)

So here’s a bit of running commentary on what I’ve been doing the last couple of days digging around in this thing.

Some Sizing. I usually like to ‘size’ a job before I dive in. You saw one of the numbers already. Total number of compressed file wads. Here’s some quotes from my log file of actions ( I usually keep such a journal of actions. Helps the next guy, which may be me coming back to it years later and wondering what I did ;-)

Some very interesting sizes from after it's all done, 
moved here to the top:

root@RaPiM2:/TempsArc# du -ms
root@RaPiM2:/TempsArc# grep 1_DU_out_mb_presquash
root@RaPiM2:/TempsArc# du -ms CDIAC.sqsh
68649   CDIAC.sqsh

So it’s about 232 GB when all is uncompressed (that top line) but it was only 126 GB prior to the uncompression (that line in the saved record of sizes file).

But the “old squashfs” (that squashed the zips, or tried) was only 68 GB.

Meaning I’ve inflated 126 GB to 232 GB and I’m expecting that the total recompression will get it back near the prior 68 GB. That “mksquashfs” run was started about 4 hours ago and is 19% done, so I’m thinking about 16 more hours to go… Details on result sometime in the afternoon tomorrow ;-)

Then we have the process... Took about 12 hours wall clock time, 
might have been closer to 8 if I'd stayed on top of it all the time 
and didn't take breaks or buy dinner or...

Yes, for about 12 hours today I have been doing various things to get all this ready to mksquash again. It MIGHT have been faster if I wasn’t doing other things interleaved with it. But I’m not going to sit for hours to type “no” a couple of times. The machine can wait until I check it.

First I tested “findgunzip” on one smaller directory. Here’s a sample of the output:

findgunzip ushcn_daily

Looing for gunzip .gz files in ushcn_daily


Working pretty good.


Now looking for unzip .z or Gunzip-able .Z files

All Done!

But it didn’t find any .z or .Z files. So I had to test that elsewhere…

During the next group, I ran into my first error cases:

You can get errors like:

gzip: oceans/NOAA_Workshop/McKinley_MITgcm_Data/ invalid compressed data--format violated

so worth watching the output when it runs... or at least screening it later...


gzip: oceans/NOAA_Workshop/McKinley_Data/pacific_pco2_cycle.avi.gz: invalid compressed data--crc error

gzip: oceans/NOAA_Workshop/McKinley_Data/pacific_pco2_cycle.avi.gz: invalid compressed data--length error

zip: oceans/NOAA_Workshop/Deutsch_UW_Data/var_ocmip_h2000_specjo2.cdf.gz: invalid compressed data--format violated

So one “Lesson Learned” is that the mass uncompression also finds broken downloads that you could then re-get. It will be very interesting to see if the final preened / synced version has any errors. If so, then I need to fetch Just That File while watching carefully to find if it is my download that failed, or if they have a broken file and never did the QA test.
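If you only want the list of broken wads, without unpacking anything, gzip has a -t option that tests an archive and writes nothing. A minimal sketch follows; the function name find_broken_gz and the sample files are hypothetical.

```shell
#!/bin/sh
# Hypothetical checker: list .gz files that fail gzip's integrity test.
find_broken_gz() {
    find "$@" -type f -name "*.gz" | while IFS= read -r f; do
        # gzip -t tests the archive without writing anything.
        gzip -t "$f" 2>/dev/null || printf '%s\n' "$f"
    done
}

# Scratch-directory demo: one good archive, one file that only looks like one.
tmp=$(mktemp -d)
echo "good data" | gzip > "$tmp/good.gz"
echo "this is not gzip data" > "$tmp/bad.gz"
find_broken_gz "$tmp"
rm -rf "$tmp"
```

Since nothing is written, this kind of check also works on a read only tree.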

For now, it is most likely these are what was being downloaded during one of the dozen stop / restarts with -nc set and without a final mirror sync pass. Still, will be fun to find out. ( I have that download done and wrapped, just using this one to ‘debug’ the mksquashfs process so it is known good before applying it to a few weeks work…)

In this case it did find some .Z files:

Now looking for unzip .z or Gunzip-able .Z files


So now we know all of that finding worked fine.

At one point an uncompress run was taking a long time. Turned out that particular wad was rather large:

Well, that's a big one... no wonder it was taking a long time...

pi@RaPiM2 /TempsArc/ $ cd data
pi@RaPiM2 /TempsArc/ $ ls -l Level2/AllSites/ameriflux.allsites.L2_data.10Sep2015.tar
-rw------- 1 root root 10475053056 Oct 10 13:25 Level2/AllSites/ameriflux.allsites.L2_data.10Sep2015.tar

@RaPiM2 /TempsArc/ $ ls -l Level3/AllSites/L3.AllSites.tar Level4/AllSites/L4.AllSites.tar
-rw-r--r-- 1 pi pi 1555814400 Jan  8  2009 Level3/AllSites/L3.AllSites.tar
-rw-r--r-- 1 pi pi 1566453760 Jan  8  2009 Level4/AllSites/L4.AllSites.tar
pi@RaPiM2 /TempsArc/ $

While the last two are 1.5 GB, that first one was 10 GB. Surprise! Just be ready for some bits to be a lot bigger than others.
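One way to be ready is to ask find for the big ones before starting. This is a sketch only: the helper name bigger_than is made up, and the size suffixes (k, M, G) assume a find that accepts them, as the GNU one does.

```shell
#!/bin/sh
# Hypothetical helper: list files larger than a given size.
# The size argument uses find's -size syntax, e.g. +1G, +500M or +10k.
bigger_than() {
    size=$1; shift
    find "$@" -type f -size "$size" -exec ls -l '{}' \;
}

# Scratch-directory demo: one 20 KB file and one tiny one.
tmp=$(mktemp -d)
dd if=/dev/zero of="$tmp/big.tar" bs=1024 count=20 2>/dev/null
echo "tiny" > "$tmp/small.txt"
bigger_than +10k "$tmp"
rm -rf "$tmp"
```

On the real tree something like “bigger_than +1G .” would be the call to try.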

There were also a few that said the ‘tar’ file already existed and did I want to overwrite it. I answered ‘no’, until I can look them over. As I’d unzipped and gunzipped a couple, and didn’t keep track, it wasn’t clear which was new and not. But I’m fairly sure I’d not unpacked all that many. Were these just in the process of being packed up when I did the download? Or left open and packed on the server? Don’t know. But don’t really care until the final batch is processed. Then I’ll check. After all was unzipped, I did another pass (that is MUCH quicker as everything is unpacked or an error case); but it does document the ‘odd ducks’ in a clear listing.
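Those “already exists” prompts can be predicted before the run. The sketch below, with a made-up function name and made-up files, lists every .gz whose unpacked name is already sitting next to it.

```shell
#!/bin/sh
# Hypothetical pre-check: report .gz files whose unpacked name already exists.
find_collisions() {
    find "$@" -type f -name "*.gz" | while IFS= read -r f; do
        target=${f%.gz}                 # the name gunzip would write to
        [ -e "$target" ] && printf '%s\n' "$target"
    done
    return 0
}

# Scratch-directory demo: event.tar collides, month.tar does not.
tmp=$(mktemp -d)
touch "$tmp/event.tar" "$tmp/event.tar.gz" "$tmp/month.tar.gz"
find_collisions "$tmp"
rm -rf "$tmp"
```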

One more pass from the top level:

root@RaPiM2:/TempsArc/ findgunzip .

Looing for gunzip .gz files in .


gzip: ./ftp/oceans/NOAA_Workshop/McKinley_MITgcm_Data/ invalid compressed data--format violated

gzip: ./ftp/oceans/NOAA_Workshop/McKinley_Data/pacific_pco2_cycle.avi.gz: invalid compressed data--crc error

gzip: ./ftp/oceans/NOAA_Workshop/McKinley_Data/pacific_pco2_cycle.avi.gz: invalid compressed data--length error

gzip: ./ftp/oceans/NOAA_Workshop/Deutsch_UW_Data/var_ocmip_h2000_specjo2.cdf.gz: invalid compressed data--format violated

gzip: ./ftp/ameriflux/data/Level1/Sites_ByName/ARM_SGP_Main/aircraft_data/request.38728.20110714.112501.tar.gz: unexpected end of file

gzip: ./ftp/ameriflux/data/Level1/Sites_ByName/Quebec_Boreal_Cutover_Site/CA-Qcu.tar.gz: not in gzip format
gzip: ./ftp/ndp005/event.tar already exists; do you wish to overwrite (y or n)? n
        not overwritten
gzip: ./ftp/ndp005/month.tar already exists; do you wish to overwrite (y or n)? n
        not overwritten

Now looking for unzip .z or Gunzip-able .Z files

gzip: ./ftp/ndp005a/ already exists; do you wish to overwrite (y or n)? n
        not overwritten
gzip: ./ftp/ndp005a/ already exists; do you wish to overwrite (y or n)? n
        not overwritten
gzip: ./ftp/ndp005a/ already exists; do you wish to overwrite (y or n)? n
gzip: ./ftp/ndp005a/ already exists; do you wish to overwrite (y or n)? n
        not overwritten


gzip: ./ftp/ameriflux/data/Level1/Sites_ByName/La_Selva/1998-11/d3320200.raw.Z: unexpected end of file

gzip: ./ftp/ameriflux/data/Level1/Sites_ByName/La_Selva/1998-11/d3320330.raw.Z: unexpected end of file

gzip: ./ftp/ameriflux/data/Level1/Sites_ByName/La_Selva/1998-11/d3320400.raw.Z: unexpected end of file

gzip: ./ftp/ameriflux/data/Level1/Sites_ByName/La_Selva/1998-11/d3320230.raw.Z: unexpected end of file

gzip: ./ftp/ameriflux/data/Level1/Sites_ByName/La_Selva/1998-11/d3320100.raw.Z: unexpected end of file

gzip: ./ftp/ameriflux/data/Level1/Sites_ByName/La_Selva/1998-11/d3320430.raw.Z: unexpected end of file

gzip: ./ftp/ameriflux/data/Level1/Sites_ByName/La_Selva/1998-11/d3320300.raw.Z: unexpected end of file

gzip: ./ftp/ameriflux/data/Level1/Sites_ByName/La_Selva/1998-11/d3320130.raw.Z: unexpected end of file
gzip: ./ftp/ndp005/month.tar already exists; do you wish to overwrite (y or n)?         not overwritten
gzip: ./ftp/ndp005/event.tar already exists; do you wish to overwrite (y or n)? n
        not overwritten

All Done!

In Conclusion

Now you know what I’ve been up to today (and yesterday…).

I’ve now got a pretty good procedure to make a more usable permanent Read Only archive as a squashfs file system; but one where the compressed wads have been uncompressed for easier looking over. In some other file systems of specific data sets I’ve also un-tarred the .tar files into a plain file tree prior to mksquashfs. That’s even better.

But I didn’t do it here for two reasons.

First off, much of this data is stuff I will never look at. So why bother (and with tar you need to be a bit careful that things don’t over write each other, so not quite as ‘just do it’ friendly…)

Second, you can just do a ‘tar tvf foo.tar | more’ to look at what is IN the tar file. It doesn’t need to be unpacked into a r/w file system just to look around.
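As a small sketch of that, with a made-up tar file built in a scratch directory: the t option lists the contents, and the -O option (supported by both GNU tar and bsdtar) sends one member to the screen instead of writing it to disk, which is handy on a read only file system.

```shell
#!/bin/sh
# Scratch demo with made-up file names: look inside a tar without unpacking.
tmp=$(mktemp -d)
mkdir "$tmp/site"
echo "a" > "$tmp/site/one.dat"
echo "b" > "$tmp/site/two.dat"
tar -cf "$tmp/foo.tar" -C "$tmp" site

# t = list contents, f = archive file; add v for sizes and dates.
listing=$(tar -tf "$tmp/foo.tar")
printf '%s\n' "$listing"

# Pull one member to stdout (-O) instead of writing it to disk.
member=$(tar -xOf "$tmp/foo.tar" site/one.dat)
printf '%s\n' "$member"
rm -rf "$tmp"
```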

Between those, it looked like work I didn’t need to do for little return. As it was, I was already using a lot more disk space for a wad of stuff that was mostly going to sit unused. (It is now only insurance for a screwup on my part in the processing of the final version). Besides, if it ever does become an issue, there’s the unsquashfs command to turn it back into a normal file system and I can un-tar things then.

At the start of this, the file system was about 40% used. Here you can see it’s about 20% more used now. And I still have the ‘real one’ to go and the ‘squash’ of this one had not started yet when this grab was done (so another 70 GB or so to be sucked up tonight). So I need to start watching the GB or pop another $60 for another TB disk… (Strange how that works… start slugging multiple copies of a 240 GB or so file wad around and pretty soon you’re talking real TB ;-)

Oh, and it took about 20% of the disk to hold all this expanded stuff:

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdc3            955008796 537669208 368804856  60% /TempsArc

With that, I’m back to the grindstone on this one. I hope to have the trial run all wrapped up mid tomorrow and the final data copy processed by tomorrow night into a clean uncompressed file system. Then the mksquash can run on it all of tomorrow night.

Sometime Monday I ought to have the final data on the real final copy sizes, and the comparison of error file sets (if any are in the final copy).

Then I can “take out all the trash” of intermediate copies and the uncompressed file system trees and get back most of a TB of disk ;-)

All this, btw, being done on that $40 Raspberry Pi Model 2 ( $60 for the full kit with case, card, power supply etc). It is remarkably capable for such a small device.

With that, time to call it a night and end the “work day” that started about lunchtime yesterday…


About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present "hot buttons" are the mythology of Climate Change and ancient metrology; but things change...

6 Responses to CDIAC, Compression, Squashfs, And Oddities

  1. E.M.Smith says:

    Golly, nothing to say about “find” syntax after a couple of days? Somehow I’m not surprised ;-)

    Finally the compression finished. At the end of it all the compressed size is slightly smaller:

    root@RaPiM2:/TempsArc# du -ms cdiac.ornl.early.sqsh 
    67600	cdiac.ornl.early.sqsh

    So about 1 GB smaller when it is globally compressed rather than having lots of uncompressible (already compressed) chunks in it. Here’s the run log:

    root@RaPiM2:/TempsArc# mksquashfs ./ cdiac.ornl.early.sqsh -b 65536 
    Parallel mksquashfs: Using 4 processors
    Creating 4.0 filesystem on cdiac.ornl.early.sqsh, block size 65536.
    [================================================================================================================================================================================|] 3843735/3843735 100%
    Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 65536
    	compressed data, compressed metadata, compressed fragments, compressed xattrs
    	duplicates are removed
    Filesystem size 69221571.70 Kbytes (67599.19 Mbytes)
    	29.11% of uncompressed filesystem size (237768151.91 Kbytes)
    Inode table size 9750642 bytes (9522.11 Kbytes)
    	42.47% of uncompressed inode table size (22959746 bytes)
    Directory table size 1461633 bytes (1427.38 Kbytes)
    	30.32% of uncompressed directory table size (4819949 bytes)
    Number of duplicate files found 25618
    Number of inodes 192132
    Number of files 187612
    Number of fragments 24514
    Number of symbolic links  0
    Number of device nodes 0
    Number of fifo nodes 0
    Number of socket nodes 0
    Number of directories 4520
    Number of ids (unique uids + gids) 2
    Number of uids 2
    	pi (1000)
    	root (0)
    Number of gids 2
    	pi (1000)
    	root (0)

    So it looks like, with about a 1 to 2 day “run time” for the compressor, I can make a read only squashfs “file system in a file” out of the scrape and use 67 GB instead of 232 GB, with everything inside of it unpacked and ready to be looked over. Nice, if tedious to do…
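The arithmetic behind those figures can be redone in a couple of lines, as a sketch. The Kbyte figures are copied from the mksquashfs run log above, and the two MB figures are the du -ms results for the old and new squashfs files; the variable names are made up.

```shell
#!/bin/sh
# Figures copied from the mksquashfs run log above, in Kbytes.
compressed_kb=69221571.70
uncompressed_kb=237768151.91

# Compressed size as a percentage of the uncompressed size.
ratio=$(awk -v c="$compressed_kb" -v u="$uncompressed_kb" \
    'BEGIN { printf "%.2f", c / u * 100 }')

# Old squashfs (zips left packed) versus new one (zips unpacked), in MB.
old_mb=68649
new_mb=67600
diff_mb=$((old_mb - new_mb))

echo "compressed size is $ratio percent of the uncompressed size"
echo "new squashfs is $diff_mb MB smaller than the old one"
```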

    With this result in hand and tested, I’m ready to do the final version data.

    No, don’t worry, I’m not going to post more ‘find’ syntax when it’s done ;-)

    It will just be done “without comment” other than at the end maybe stating if the ‘broken’ files in this run were not broken in the preened and resync-ed version. ( I get as tired of this kind of tech-detail stuff as anyone else… I’m just willing to put up with it to get a result I want ;-)

    But it is kind of nice to know I can scrape and archive such a very large block of data into a nice single lump that isn’t nearly so huge a chunk of disk, yet is still reasonably fast to read / process / search (as long as you are not decompressing the whole thing ;-) … which might be interesting to do just to see how long it takes… but that is for another day…)

    Time for morning tea and figuring out ‘the next step’.

  2. Paul Hanlon says:

    Awesome, ChiefIO

    Including the find syntax. I’ve used it a few times, but never to that level. I usually use updatedb and locate from the mlocate package, which seems to be much faster, probably because it caches a snapshot of the filesystem.

    And one gigabyte smaller by just compressing it once, and being able to view all files. That has to be worth the extra effort, even if it does mean a couple of days tying up a computer. It should save a pile of time later when you actually do start using it, and it keeps everything *tidy*.

  3. E.M.Smith says:


    Aw, you’ve caught it, or me, in the *tidy* thing! ;-)

    Yes, eventually you find yourself sinking a few days into finding some slightly more “tidy” way of doing things that will pay off for decades to come… (Then that habit sticks with you even when you are not so sure you have “decades” in front of you any more, but maybe…)

    But if not for you, perhaps for others, comes the altruistic justification for an old habit that you can no longer let go…

    So it goes…

    And, per “find”, just remember:

    “A Find is a terrible thing to waste!” so study the syntax often and deeply! 8-0

    While painful, it has saved me more time than it has consumed… ( I think… ;-)

    I’m just glad someone, even just one someone, noticed and cared… Thanks for that…

  4. Soronel Haetir says:

    The only comment/question I have is why bother with -print as that is the default action if nothing else is specified?

  5. E.M.Smith says:

    @Soronel Haetir:

    Since defaults can change, something else is specified, and it never hurts to specify that the default is what you are doing, especially if it is going to be explained to someone else, or someone else might be looking at the code you wrote some time later and not know you are depending on an unspecified default.
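A small sketch of the “something else is specified” point, in a scratch directory with a made-up file: once an -exec is on the line, find no longer prints names on its own, so the explicit -print is what puts them on the screen. The “true” command just stands in for gunzip here.

```shell
#!/bin/sh
# Scratch demo; 'true' is a stand-in for the real command (e.g. gunzip).
tmp=$(mktemp -d)
touch "$tmp/a.gz"

# With only -exec, find runs the command but prints no names itself.
without=$(find "$tmp" -name "*.gz" -exec true '{}' \;)

# Adding -print before -exec also lists each name as it is processed.
with=$(find "$tmp" -name "*.gz" -print -exec true '{}' \;)

echo "without -print: '$without'"
echo "with -print:    '$with'"
rm -rf "$tmp"
```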

    Same reason that causes me to write code like:

    If A do foo
    if NOT A do bar
    else print “Error, you can’t get here line Foo Bar Selector”; exit.
    end if
    end if;

    I’ve actually had that code print out the error message. It was a compiler error / bug… but the code handled it…
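For anyone who wants to try the pattern, here is one possible rendering in shell. It is a sketch only: the function name select_branch and the "yes" / "no" flag standing in for condition A are both made up.

```shell
#!/bin/sh
# Hypothetical shell rendering of the belt-and-braces test above.
# 'flag' stands in for condition A and is expected to be "yes" or "no".
select_branch() {
    flag=$1
    if [ "$flag" = "yes" ]; then
        echo "foo"
    elif [ "$flag" != "yes" ]; then
        echo "bar"
    else
        # Logically unreachable; getting here means something is badly wrong.
        echo "Error, you can't get here: Foo Bar Selector" >&2
        return 1
    fi
}

select_branch yes
select_branch no
```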

    After a few decades running into that kind of thing you get a bit paranoid about how you write code, and that’s a good thing! ;-0

  6. Pingback: 64 bit vs Raspberry Pi Model 2, a surprising thing | Musings from the Chiefio
