I’ve used ‘loopback’ file systems before. I’ve used file systems based on a ‘sparse file’ before. Many Virtual Machines put their data in a file system built inside a sparse file, for example. I just never bothered to ‘roll my own’ before.
A sparse file is one that claims a large size, but only actually uses data blocks for the bits that hold real data. So you can say it will be 5 GB, but if you only put 1 GB in it, it will only use 1 GB. So for a file system in a file container, you can build all the inodes (information nodes) and all the metadata structures as though the file really were 5 GB of space, and have them ‘scattered’ through that space, but all the ‘data blocks’ are empty and unused until some real data shows up.
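If you want to see that behavior for yourself, a quick throw-away demonstration on any Linux box looks about like this (‘demo_sparse’ is just a made-up name, not anything used below):

truncate -s 5G demo_sparse          # claim 5 GB of apparent size
ls -ls demo_sparse                  # first column: 0 blocks actually allocated
du -h demo_sparse                   # real usage: 0
du -h --apparent-size demo_sparse   # apparent size: 5.0G
rm demo_sparse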
Turns out, it can be a feature…
For one thing, you can have your files in a file system that is encrypted, inside this file that looks like it is giant, but is only as big as it needs to be. Such a ‘file container with encryption’ is one of the benefits of things like TrueCrypt. Yes, to be “NSA proof” you would need the whole OS hardened and encrypting, but for “Barney Fife” secure, it’s way more than enough.
I’m going to walk through the entire process of adding the disk, making the sparse file, and then mounting that as a newly made EXT3 file system. I already had an entry in my /etc/fstab for mounting the Seagate drive.
/dev/sdc1 /SG ntfs-3g defaults 0 1
So remember that you need to get the drive mounted somehow: either mount it by hand, or put an entry in /etc/fstab for it and issue the mount command.
root@RaPiM2:/WD# mount /SG
root@RaPiM2:/WD# df
Filesystem     1K-blocks      Used Available Use% Mounted on
rootfs          59805812  55000212   1744560  97% /
[...]
/dev/sdc1      488384000 330276856 158107144  68% /SG
That rootfs / is the 64 GB card on the RaPiM2 and is rapidly filling up with the GHCN data. Now over 50 GB and with only 1.7 GB left before it locks up the whole system by filling the ‘root disk’. Before then, I need to re-point that wget download to a different location. I’d rather not swap it from expecting an EXT3 file system to NTFS. I don’t think anything would break, but with a couple of days invested in this, I’d rather not ‘risk it’ on a pause / restart of the process. Besides, EXT is the native Linux file system and I just like it better. ;-) And I have to think Linux likes it better too.
But what I have is an NTFS file system on a disk that’s a pain to resize… So I’m going to stuff that EXT file system inside a ‘bag of bits’ handed to me by the NTFS driver from the NTFS disk, which can easily hold a single large TB scale file, while avoiding the NTFS way of handling metadata for all those Linux / Unix world files.
You can see from the df output that I now have 158 GB of NTFS space available on /SG. So let’s ‘go there’ and make a container. I could just use a program like ‘dd’ to make a 150 GB file full of blanks and use it, but then again, I don’t really know how big GHCN is going to get. It would be a bit stupid to use 150 GB if the process is going to use 10 more and be done. So I’m going to make a ‘sparse file’ instead. This too can be done with ‘dd’, but I’m going to use the ‘truncate’ command. It can shorten a file, or make it bigger than it seems…
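For reference, the same sparse container can be made with ‘dd’ by writing nothing at all but seeking out to the far end of the file; this is just the equivalent incantation, not what I actually ran:

dd if=/dev/zero of=GHCN_filesys bs=1 count=0 seek=150G   # zero blocks written, file size set to 150G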
root@RaPiM2:/WD# cd /SG
root@RaPiM2:/SG# truncate -s 150G GHCN_filesys
root@RaPiM2:/SG# ls -l
drwxrwxrwx 1 root root            0 Sep  2  2010 Administrator_Backup
drwxrwxrwx 1 root root        20480 Jun 28 09:02 Evo
-rwxrwxrwx 1 root root 161061273600 Sep  7 17:25 GHCN_filesys
drwxrwxrwx 1 root root            0 May 13  2011 _Memeo
drwxrwxrwx 1 root root            0 May  3  2012 $RECYCLE.BIN
drwxrwxrwx 1 root root            0 Oct 21  2010 RECYCLER
drwxrwxrwx 1 root root         4096 May  3  2012 System Volume Information
root@RaPiM2:/SG# df .
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/sdc1      488384000 330276856 158107144  68% /SG
I’ve chopped a few bits out of the ls listing. You can see here that it says GHCN_filesys is 161 GB; 150 x 1024 x 1024 x 1024 = 161,061,273,600 bytes (note that since a KB is really 1024 bytes, the meaning of a MB can wander between 1,000,000 bytes and 1024 x 1024 bytes… depending on user and context – base ten or binary. Don’t let that bother you…)
Yet our ‘used space’ hasn’t changed per the ‘df’ command. Neat. (Just be aware that some programs are not so bright about this and using things like ‘tar’ might end up mysteriously creating 150 GB after a move / copy…)
root@RaPiM2:/SG# du -h --apparent-size GHCN_filesys
150G    GHCN_filesys
root@RaPiM2:/SG# du -h GHCN_filesys
0       GHCN_filesys
So it looks like 150 GB apparent per the ‘du’ disk usage program, but it really is empty.
Now let’s make an EXT3 Linux native journaling file system inside that container. Since I know a lot of the files in that GHCN copy are large compressed files, I’m going to give it a large ‘block size’ to keep the overhead of tracking blocks down just a little. That -b 4096 says to make 4k sized blocks. If I had a use with a gazillion tiny 100 to 500 byte files, I’d make it a 1k or 512 byte block size instead and save myself wasting up to (4096 – 512) bytes per tiny file.
Note that on the 3rd line down mkfs.ext3 notices that I have not given it a real disk on a real ‘block special device’ and asks me to say ‘y’ before it goes on. It then complains that there isn’t a real disk geometry so can’t get that data; which we already knew, so can ignore. I’ve bolded the question.
root@RaPiM2:/SG# mkfs.ext3 -b 4096 GHCN_filesys
mke2fs 1.42.5 (29-Jul-2012)
GHCN_filesys is not a block special device.
Proceed anyway? (y,n) y
warning: Unable to get device geometry for GHCN_filesys
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
9830400 inodes, 39321600 blocks
1966080 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=0
1200 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
        2654208, 4096000, 7962624, 11239424, 20480000, 23887872

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
It took a few minutes to do the actual creation as EXT3 preallocates and writes all the ‘inodes’ or information nodes that hold all the information about data blocks and the metadata about files. That stuff you see in ‘ls’ listings like name, permissions, size…
Doing the mkfs.ext3 took about 88% of one core for the ntfs-3g driver during the “Writing inode tables” part. You still get all the lovely inefficiency and CPU usage of NTFS, but you avoid all the various other bits of how NTFS keeps metadata and all that for your Linux file system that is now built inside that NTFS box. All NTFS has to do is glue on more blocks as the container grows and asks for them.
So here, in about 6 minutes all told, I went from a very large NTFS file system on a disk with left over free space, to a usable mounted (if less efficient) EXT3 file system. Far far faster than that 3 day ntfsresize… And, since the use of this ‘partition’ is limited to the speed of the internet download for most uses, and since I have an idle CPU core almost all the time, the inefficiency just doesn’t matter much.
Had we built one of the file system types with dynamic inodes, this space would not be allocated until data is written, so more space efficient, but with more writes later. IIRC, ReiserFS and XFS are that way (EXT4 still preallocates its inode tables, though it can defer writing them). I’m a bit more fond of EXT3 as it is compatible with EXT2, so I can swap to a non-journaling (lower write load) file system if desired. Eventually I’ll get around to using all the complicated features of things like btrfs, but that’s way overkill for this immediate need.
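That EXT3 to EXT2 fallback is a one-liner if I ever want it. A sketch only, not something I’ve run on this container: either mount it explicitly as ext2 (the journal is simply ignored), or strip the journal for good with tune2fs while it is unmounted.

mount -t ext2 -o loop GHCN_filesys /GHCN   # treat the ext3 file system as plain ext2 for this mount
tune2fs -O ^has_journal GHCN_filesys       # or remove the journal entirely (unmount it first)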
So if we inspect file size again, we see:
root@RaPiM2:/SG# ls -l GHCN_filesys
-rwxrwxrwx 1 root root 161061273600 Sep  7 17:54 GHCN_filesys
root@RaPiM2:/SG# du -h GHCN_filesys
2.5G    GHCN_filesys
root@RaPiM2:/SG# df .
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/sdc1      488384000 332880068 155503932  69% /SG
root@RaPiM2:/SG# du -h --apparent-size GHCN_filesys
150G    GHCN_filesys
It really is using 2.5 GB for all those inodes and stuff. It looks like 150 GB to both an ‘ls’ and a ‘du’ with --apparent-size set.
Next we mount it to the ‘name space’ so it looks like a real file system. I’ll start by making the /GHCN directory as a mount point. Then the “loopback” interface is used in the mount command. This lets us use the loop drivers to get to the file system inside the file.
root@RaPiM2:/SG# mkdir /GHCN
root@RaPiM2:/SG# mount -o loop GHCN_filesys /GHCN
Now what do we see on a ‘df’ listing?
root@RaPiM2:/SG# df
Filesystem      1K-blocks      Used Available Use% Mounted on
rootfs           59805812  55097532   1647240  98% /
/dev/root        59805812  55097532   1647240  98% /
devtmpfs           470416         0    470416   0% /dev
tmpfs               94944       396     94548   1% /run
tmpfs                5120         0      5120   0% /run/lock
tmpfs              189880         0    189880   0% /run/shm
/dev/mmcblk0p6      61302     57554      3748  94% /boot
/dev/mmcblk0p5     499656       676    462284   1% /media/data
/dev/mmcblk0p3      27633       444     24896   2% /media/SETTINGS
[...]
/dev/sdc1       488384000 332880068 155503932  69% /SG
/dev/loop0      154687468     60996 146762152   1% /GHCN
A somewhat fuller rootfs or “/” (as that wget is still running filling it up…), the original /SG mount with 155 GB still free, and what sure looks like 146 GB of free space on a file system named /GHCN mounted on /dev/loop0.
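If I decide I want this container mounted automatically at boot, the loop option can go straight into /etc/fstab. Something like the line below ought to do it, assuming the /SG line comes earlier in the file so the NTFS disk is mounted first; I haven’t actually added it yet:

/SG/GHCN_filesys /GHCN ext3 loop,defaults 0 0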
At this point, I can ‘pause’ my GHCN wget, move the data off the SD card to /GHCN, put a symbolic link from the old /MIRRORS location to /GHCN, and then ‘fg’ bring the job back to running in the foreground and “move on”.
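The planned shuffle, roughly sketched (the /MIRRORS paths here are illustrative from memory, not exact):

(Ctrl-Z in the wget window to pause the job)
mv /MIRRORS/ftp.ncdc.noaa.gov /GHCN/                       # get the data off the SD card
ln -s /GHCN/ftp.ncdc.noaa.gov /MIRRORS/ftp.ncdc.noaa.gov   # pointer from the old location to the new
fg                                                         # resume the wget where it left off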
FWIW, I also have the 140 GB (after resize / format / etc) of EXT3 file system built on the older Toshiba drive mounted and the CDIAC wget running against it, too.
/dev/sdd3 140745608 43953180 89636308 33% /Temps
So I’m back to normal running, more or less, with nearly 300 GB more space, half as a real EXT3 disk partition that took 3 days to get up and attached, the other half as EXT3 inside an NTFS disk file container that took about 10 minutes including reading web page HowTo…
Later I’m going to do some speed tests to see what kind of penalty there is to this method, but right now the whole I/O system is saturated by the wget so any numbers would be contaminated and not very informative.
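When I do get around to it, the speed test will likely be nothing fancier than timed large writes to each file system, something like the lines below (not run yet, for the reason just given):

dd if=/dev/zero of=/GHCN/speedtest bs=1M count=512 conv=fdatasync    # through the loop + ntfs-3g stack
dd if=/dev/zero of=/Temps/speedtest bs=1M count=512 conv=fdatasync   # native EXT3 partition for comparison
rm /GHCN/speedtest /Temps/speedtest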
I searched several web pages on “how to” do this. The model I chose to follow was this very well written one from the Arch Linux folks. It also covers some of the details on how to copy / move such a file without hitting the ‘sudden size’ issue.
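The short version of that copy / move advice, as I understand it, is to tell the tool to keep the holes as holes. A few hedged examples (the destinations are placeholders; check the man pages):

cp --sparse=always GHCN_filesys /some/other/disk/   # cp can re-create the holes at the destination
rsync --sparse GHCN_filesys otherbox:/backup/       # rsync has the same idea
tar -S -cf container.tar GHCN_filesys               # tar -S (--sparse) stores sparse files efficiently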
This link is similar, and has information on setting it up as an encrypting file system and using ACLs (Access Control Lists).
It has more details on using “dd” to build the file along with how to set it up as an encrypted container. I’ll likely do that as a test ‘some other day’ when not in a race condition with wget… But I think it is pretty clear that just putting your ‘stuff’ in a non-PC file system hidden as an encrypted file system in a big bag of bits with a name like “failed_binary_image” on an NTFS drive would get it past all but the more careful of forensics folks.
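For when that ‘some other day’ arrives, the encrypted-container recipe from those pages boils down to roughly this (untested by me so far, and the names are placeholders): make the sparse file, turn it into a LUKS container, open it as a mapped device, then build and mount the file system on the mapping.

truncate -s 150G secret_container
cryptsetup luksFormat secret_container           # sets a passphrase, writes the LUKS header
cryptsetup luksOpen secret_container secretfs    # appears as /dev/mapper/secretfs
mkfs.ext3 /dev/mapper/secretfs
mkdir -p /mnt/secret
mount /dev/mapper/secretfs /mnt/secret
 ... use it ...
umount /mnt/secret
cryptsetup luksClose secretfs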
For now, though, I’m down to 1.3 GB on my root partition on the SD card and need to get busy moving 50+ GB to my new space… Performance testing and ACLs / encryption can be for another day.
UPDATE: About 5 minutes later…
Well, no sooner was I done posting, and ready to do the “pause / move / restart” whenever free space in rootfs fell under 1 GB… than the “GHCN” wget finished. Note that this is the ‘restart’ run after the original one had gone a couple of days already… I’ve clipped a bit from the bottom of the listing and bolded the stats portion:
--2015-09-07 19:43:11--  ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v3/techreports/Technical%20Report%20NCDC%20No12-02-3.2.0-29Aug12.pdf
           => `ftp.ncdc.noaa.gov/pub/data/ghcn/v3/techreports/Technical Report NCDC No12-02-3.2.0-29Aug12.pdf'
==> CWD not required.
==> PASV ... done.    ==> RETR Technical Report NCDC No12-02-3.2.0-29Aug12.pdf ... done.
Length: 2343758 (2.2M)

100%[================================================================================>] 2,343,758    132K/s   in 14s

2015-09-07 19:43:25 (169 KB/s) - `ftp.ncdc.noaa.gov/pub/data/ghcn/v3/techreports/Technical Report NCDC No12-02-3.2.0-29Aug12.pdf' saved

FINISHED --2015-09-07 19:43:25--
Total wall clock time: 1d 18h 14m 28s
Downloaded: 30334 files, 30G in 1d 13h 10m 51s (237 KB/s)
Never has “FINISHED” looked quite so good…
All told, the download (both parts) amounted to about 51 GB of ‘stuff’:
pi@RaPiM2 ~/ftp.ncdc.noaa.gov/pub/data $ du -ks *
51182296        ghcn
So now I can put it somewhere a bit more permanent than the SD card, and never need worry about going back to the NOAA well again unless I want to do an “update”. Most likely then it would still be only a few GB of “daily data” that changed and would be re-downloaded.
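When that update day comes, re-running essentially the same mirror from the same directory ought to fetch only what changed. Hedged, since my exact original command line isn’t shown here, but something along these lines:

wget -m -np ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/
(-m turns on timestamping so unchanged files are skipped; -np keeps it from wandering above the ghcn directory)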
Inside the ghcn directory, here are the sizes:
pi@RaPiM2 ~/ftp.ncdc.noaa.gov/pub/data/ghcn $ du -ks *
12      alaska-temperature-anomalies.txt
4       alaska-temperature-means.txt
2452    anom
117412  blended
48933468        daily
30116   forts
14604   grid_gpcp_1979-2002.dat
3796    Lawrimore-ISTI-30Nov11.ppt
1492    snow
3584    v1
62224   v2
2013128 v3
Clearly it is that ‘daily’ archive that is the largest bit, and that is likely to be in chunks where only some copies change. Like the “by_year” directory where only recent years ought to change.
pi@RaPiM2 ~/ftp.ncdc.noaa.gov/pub/data/ghcn/daily $ du -ks *
25522240        all
13950108        by_year
36      COOPDaily_announcement_042011.doc
124     COOPDaily_announcement_042011.pdf
68      COOPDaily_announcement_042011.rtf
7872    figures
326224  ghcnd_all.tar.gz
4       ghcnd-countries.txt
139556  ghcnd_gsn.tar.gz
281244  ghcnd_hcn.tar.gz
25676   ghcnd-inventory.txt
4       ghcnd-states.txt
8236    ghcnd-stations.txt
4       ghcnd-version.txt
5327224 grid
885188  gsn
2451972 hcn
7628    papers
24      readme.txt
32      status.txt
But all that kind of ‘size listings’ and inventory of just what all is in here really belongs in its own posting… that will come on ‘another day’…