> On 06/05/2010 21:07, Erik Trimble wrote:
>> VM images contain large quantities of executable files, most of which
>> compress poorly, if at all.
>
> What data are you basing that generalisation on ?
note : I can't believe someone said that.

warning : I just detected a fast rise time on my pedantic input line and I am
in full geek herd mode :

    http://www.blastwave.org/dclarke/blog/?q=node/160

The degree to which a file can be compressed is often related to the degree of
randomness or "entropy" in the bit sequences in that file. We tend to look at
files in chunks of bits called "bytes" or "words" or "blocks" of some given
length, but the harsh reality is that a file is just a sequence of one and
zero values and nothing more. However, a compressor can spot blocks or
patterns in there and then create tokens that represent repeating blocks.

If you want a really random file that you are certain has nearly perfect high
entropy, then just get a coin and flip it 1024 times while recording the heads
and tails results. Then enter that data into a file as a sequence of one and
zero bits and you have a very neatly random chunk of data. Good luck trying to
compress that thing.

Pardon me .. here it comes. I spent waay too many years in labs doing work
with RNG hardware and software to just look the other way. And I'm in a good
mood.

Suppose that C is some discrete random variable. That means that C can have
well defined values like HEAD or TAIL. You usually have a bunch ( n of them )
of possible values x1, x2, x3, ..., xn that C can be. Each of those shows up
in the data set with specific probabilities p1, p2, p3, ..., pn, where the sum
of those adds up to exactly one. This means that x1 will appear in the dataset
with an "expected" probability of p1. All of those probabilities are expressed
as a value between 0 and 1. A value of 1 means "certainty".

Okay, so in the case of a coin ( not the one in Batman: The Dark Knight ) you
have x1=TAIL and x2=HEAD with ( we hope ) p1=0.5=p2 such that p1+p2 = 1
exactly, unless the coin lands on its edge and the universe collapses due to
entropy implosion. That is a joke. I used to teach this as a TA in university,
so bear with me.

So go flip a coin a few thousand times and you will get fairly random data.
That is a Random Number Generator that you have, and it's always kicking
around your lab or in your pocket or on the street. Pretty cheap, but the baud
rate is hellishly low.

If you get tired of flipping bits using a coin then you may have to just give
up on that ( or buy a radioactive source where you can monitor the particles
emitted as it decays for input data ) OR be really cheap and look at
/dev/urandom on a decent Solaris machine :

$ ls -lap /dev/urandom
lrwxrwxrwx 1 root root 34 Jul  3  2008 /dev/urandom -> ../devices/pseudo/ran...@0:urandom

That thing right there is a pseudo random number generator. It will make for
really random data, but there is no promise that over a given run of bits the
observed frequencies of ones and zeros will each be precisely 0.5. It will be
real real close, however, to a very random ( high entropy ) data source.

Need 1024 bits of random data ?

$ /usr/xpg4/bin/od -Ax -N 128 -t x1 /dev/urandom
0000000 ef c6 2b ba 29 eb dd ec 6d 73 36 06 58 33 c8 be
0000010 53 fa 90 a2 a2 70 25 5f 67 1b c3 72 4f 26 c6 54
0000020 e9 83 44 c6 b9 45 3f 88 25 0c 4d c7 bc d5 77 58
0000030 d3 94 8e 4e e1 dd 71 02 dc c2 d0 19 f6 f4 5c 44
0000040 ff 84 56 9f 29 2a e5 00 33 d2 10 a4 d2 8a 13 56
0000050 d1 ac 86 46 4d 1e 2f 10 d9 0b 33 d7 c2 d4 ef df
0000060 d9 a2 0b 7f 24 05 72 39 2d a6 75 25 01 bd 41 6c
0000070 eb d9 4f 23 d9 ee 05 67 61 7c 8a 3d 5f 3a 76 e3
0000080

There ya go. That was faster than flipping a coin eh?
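( An aside that is not in the numbers above, just a sketch : if you want to
put an actual figure on "how random" a file looks, a few lines of od and awk
will estimate the empirical Shannon entropy per byte. The script name and the
exact flags below are my own choices ; any POSIX od and awk should do. )

#!/bin/sh
# entropy.sh -- rough empirical byte-entropy estimate, in bits per byte.
# ( hypothetical helper, not part of the original demo )
# usage : ./entropy.sh somefile
od -An -v -t u1 "$1" | awk -v fname="$1" '
    { for (i = 1; i <= NF; i++) { count[$i]++; n++ } }
    END {
        H = 0
        for (b in count) { p = count[b] / n; H -= p * log(p) / log(2) }
        printf "%s : %.4f bits per byte ( 8.0 means it looks like pure noise )\n", fname, H
    }'

Point it at a big ELF library and then at a dump from /dev/urandom and you
would expect the library to sit well below 8 bits per byte while the urandom
file sits essentially at 8.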
( my Canadian bit just flipped )

So you were saying ( or someone somewhere had the crazy idea ) that ZFS with
dedupe and compression enabled won't really be of great benefit because of all
the binary files in the filesystem. Well, that's just nuts. Sorry, but it is.
Those binary files are made up of ELF headers and opcodes from a specific set
of opcodes for a given architecture, and that means the input set C consists
of a "discrete set of possible values" and NOT pure random, high entropy data.

Want a demo ? Here :

(1) take a nice big lib

$ uname -a
SunOS aequitas 5.11 snv_138 i86pc i386 i86pc
$ ls -lap /usr/lib | awk '{ print $5 " " $9 }' | sort -n | tail
4784548 libwx_gtk2u_core-2.8.so.0.6.0
4907156 libgtkmm-2.4.so.1.1.0
6403701 llib-lX11.ln
8939956 libicudata.so.2
9031420 libgs.so.8.64
9300228 libCg.so
9916268 libicudata.so.3
14046812 libicudata.so.40.1
21747700 libmlib.so.2
40736972 libwireshark.so.0.0.1

$ cp /usr/lib/libwireshark.so.0.0.1 /tmp
$ ls -l /tmp/libwireshark.so.0.0.1
-r-xr-xr-x 1 dclarke csw 40736972 May  7 14:20 /tmp/libwireshark.so.0.0.1

What is the SHA256 hash for that file ?

$ cd /tmp

Now compress it with gzip ( a good test case ) :

$ /opt/csw/bin/gzip -9v libwireshark.so.0.0.1
libwireshark.so.0.0.1: 76.1% -- replaced with libwireshark.so.0.0.1.gz
$ ls -l libwireshark.so.0.0.1.gz
-r-xr-xr-x 1 dclarke csw 9754053 May  7 14:20 libwireshark.so.0.0.1.gz

$ bc
scale=9
9754053/40736972
0.239439814

I see compression there. Let's see what happens with really random data :

$ dd if=/dev/urandom of=/tmp/foo.dat bs=8192 count=8192
8192+0 records in
8192+0 records out
$ ls -l /tmp/foo.dat
-rw-r--r-- 1 dclarke csw 67108864 May  7 15:21 /tmp/foo.dat
$ ls -l /tmp/foo.dat.gz
-rw-r--r-- 1 dclarke csw 67119130 May  7 15:21 /tmp/foo.dat.gz

QED.
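( If you want to repeat the experiment on your own box, here is a sketch ;
the library path is just an example, pick any large ELF object you have, and
digest(1) is the Solaris one. )

#!/bin/sh
# repeat the demo : structured ELF data vs. high-entropy data.
# ( sketch only, not the transcript above ; substitute any big library )
LIB=/usr/lib/libwireshark.so.0.0.1

cp "$LIB" /tmp/sample.bin
dd if=/dev/urandom of=/tmp/noise.bin bs=8192 count=8192

# the hash asked about above, if you want it ( Solaris digest(1) ) :
digest -a sha256 /tmp/sample.bin

gzip -9v /tmp/sample.bin      # ELF headers and opcodes : expect a big reduction
gzip -9v /tmp/noise.bin       # urandom data : expect no gain, often a small loss

ls -l /tmp/sample.bin.gz /tmp/noise.bin.gz

With anything like the numbers above you should see the library shrink to
roughly a quarter of its size while the urandom file comes back slightly
larger than it went in.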