> On 06/05/2010 21:07, Erik Trimble wrote:
>> VM images contain large quantities of executable files, most of which
>> compress poorly, if at all.
>
> What data are you basing that generalisation on ?
note : I can't believe someone said that.

warning : I just detected a fast rise time on my pedantic input line and I
am in full geek herd mode :
http://www.blastwave.org/dclarke/blog/?q=node/160

The degree to which a file can be compressed is often related to the
degree of randomness, or "entropy", in the bit sequences in that file. We
tend to look at files in chunks of bits called "bytes" or "words" or
"blocks" of some given length, but the harsh reality is that a file is
just a sequence of ones and zeros and nothing more. A compressor can spot
blocks or patterns in that sequence and then create tokens that represent
the repeating blocks. If you want a really random file that you are
certain has nearly perfect entropy, just get a coin and flip it 1024 times
while recording the heads and tails results. Then write that data into a
file as a sequence of one and zero bits and you have a very nearly ideal
chunk of random data.

Good luck trying to compress that thing.
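
If you would rather see that in code than with your thumb, here is a rough
Python sketch ( my addition, nothing from the original thread ) that grabs
1024 "coin flip" bits from the OS and 1024 bits of a dull repeating
pattern, then asks zlib to shrink both :

import os, zlib

# 1024 bits of "coin flips", packed into 128 bytes from the OS entropy pool.
random_bits = os.urandom(128)

# 1024 bits of a boring repeating pattern : 10101010 over and over.
patterned_bits = bytes([0b10101010]) * 128

for name, data in (("random", random_bits), ("patterned", patterned_bits)):
    packed = zlib.compress(data, 9)
    print(name, ": raw", len(data), "bytes -> compressed", len(packed), "bytes")

On the random bytes zlib usually hands back something slightly bigger than
it was given ( the stream framing costs a few bytes ), while the patterned
run collapses to a dozen or so bytes.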

Pardon me .. here it comes. I spent waay too many years in labs doing work
with RNG hardware and software to just look the other way. And I'm in a
good mood.

Suppose that C is some discrete random variable. That means that C can
take well defined values like HEAD or TAIL. You usually have a bunch ( n
of them ) of possible values x1, x2, x3, ..., xn that C can take. Each of
those shows up in the data set with a specific probability p1, p2, p3,
..., pn, and those probabilities sum to exactly one. This means that x1
will appear in the dataset with an "expected" probability of p1. All of
those probabilities are expressed as a value between 0 and 1, and a value
of 1 means "certainty". Okay, so in the case of a coin ( not the one in
Batman: The Dark Knight ) you have x1=TAIL and x2=HEAD with ( we hope )
p1=0.5=p2 such that p1+p2 = 1 exactly, unless the coin lands on its edge
and the universe collapses due to entropy implosion. That is a joke. I
used to teach this as a TA in university, so bear with me.
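
The textbook way to put a number on all of that ( my aside, not something
anyone in the thread claimed ) is Shannon entropy, H(C) = - sum over i of
p_i * log2( p_i ), which tops out at exactly 1 bit per flip for the fair
coin and shrinks as the coin gets more lopsided. A few lines of Python
show it :

import math

def entropy(probs):
    # Shannon entropy in bits : H = -sum p_i * log2(p_i), skipping zero terms.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin        -> 1.0 bit per flip
print(entropy([0.9, 0.1]))   # lopsided coin    -> roughly 0.469 bits per flip
print(entropy([1.0, 0.0]))   # two-headed coin  -> 0.0 bits, no surprise left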

So go flip a coin a few thousand times and you will get fairly random
data. That coin is a Random Number Generator, and it's always kicking
around your lab or in your pocket or on the street. Pretty cheap, but the
baud rate is hellishly low.

If you get tired of flipping bits using a coin then you may have to just
give up on that ( or buy a radioactive source and monitor the particles it
emits as it decays for your input data ) OR be really cheap and look at
/dev/urandom on a decent Solaris machine :

$ ls -lap /dev/urandom
lrwxrwxrwx   1 root     root          34 Jul  3  2008 /dev/urandom ->
../devices/pseudo/random@0:urandom

That thing right there is a pseudo random number generator. It will hand
you very random looking data, but there is no promise that over a given
number of bits the ones and the zeros will show up in exactly equal
proportion. It will be real, real close, however, to a truly random
( high entropy ) data source.
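
You can check that balance claim yourself with a few lines of Python ( my
sketch ; os.urandom reads the same sort of pseudo random pool on most Unix
flavours, though the exact plumbing is an assumption here ) :

import os

n_bytes = 1 << 20                     # one megabyte of pseudo random data
data = os.urandom(n_bytes)

# Count the one-bits across the whole buffer.
ones = sum(bin(b).count("1") for b in data)
total = n_bytes * 8

print("one-bits :", ones, "of", total, "->", ones / total)

Run that a few times and the ratio wanders around 0.5 somewhere out in the
fourth decimal place, which is exactly the "real, real close but not
exactly equal" behaviour described above.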

Need 1024 bits of random data ?

$ /usr/xpg4/bin/od -Ax -N 128 -t x1 /dev/urandom
0000000 ef c6 2b ba 29 eb dd ec 6d 73 36 06 58 33 c8 be
0000010 53 fa 90 a2 a2 70 25 5f 67 1b c3 72 4f 26 c6 54
0000020 e9 83 44 c6 b9 45 3f 88 25 0c 4d c7 bc d5 77 58
0000030 d3 94 8e 4e e1 dd 71 02 dc c2 d0 19 f6 f4 5c 44
0000040 ff 84 56 9f 29 2a e5 00 33 d2 10 a4 d2 8a 13 56
0000050 d1 ac 86 46 4d 1e 2f 10 d9 0b 33 d7 c2 d4 ef df
0000060 d9 a2 0b 7f 24 05 72 39 2d a6 75 25 01 bd 41 6c
0000070 eb d9 4f 23 d9 ee 05 67 61 7c 8a 3d 5f 3a 76 e3
0000080

There ya go. That was faster than flipping a coin eh? ( my Canadian bit
just flipped )

So you were saying ( or someone somewhere had the crazy idea ) that ZFS
with dedupe and compression enabled won't really be of great benefit
because of all the binary files in the filesystem. Well, that's just nuts.
Sorry, but it is. Those binary files are made up of ELF headers and
opcodes drawn from a specific instruction set for a given architecture,
which means the input set C consists of a "discrete set of possible
values" and NOT pure random high entropy data.

Want a demo ?

Here :

(1) take a nice big lib

$ uname -a
SunOS aequitas 5.11 snv_138 i86pc i386 i86pc
$ ls -lap /usr/lib | awk '{ print $5 " " $9 }' | sort -n | tail
4784548 libwx_gtk2u_core-2.8.so.0.6.0
4907156 libgtkmm-2.4.so.1.1.0
6403701 llib-lX11.ln
8939956 libicudata.so.2
9031420 libgs.so.8.64
9300228 libCg.so
9916268 libicudata.so.3
14046812 libicudata.so.40.1
21747700 libmlib.so.2
40736972 libwireshark.so.0.0.1

$ cp /usr/lib/libwireshark.so.0.0.1 /tmp

$ ls -l /tmp/libwireshark.so.0.0.1
-r-xr-xr-x   1 dclarke  csw      40736972 May  7 14:20
/tmp/libwireshark.so.0.0.1

What is the SHA256 hash for that file ?

$ cd /tmp

Now compress it with gzip ( a good test case ) :

$ /opt/csw/bin/gzip -9v libwireshark.so.0.0.1
libwireshark.so.0.0.1:   76.1% -- replaced with libwireshark.so.0.0.1.gz

$ ls -l libwireshark.so.0.0.1.gz
-r-xr-xr-x   1 dclarke  csw      9754053 May  7 14:20
libwireshark.so.0.0.1.gz

$ bc
scale=9
9754053/40736972
0.239439814

That is about 24% of the original size, which lines up with the 76.1%
saving that gzip reported. I see compression there.

Let's see what happens with really random data :

$ dd if=/dev/urandom of=/tmp/foo.dat bs=8192 count=8192
8192+0 records in
8192+0 records out
$ ls -l /tmp/foo.dat
-rw-r--r--   1 dclarke  csw      67108864 May  7 15:21 /tmp/foo.dat

Run that file through gzip the same way as before and check the result :

$ ls -l /tmp/foo.dat.gz
-rw-r--r--   1 dclarke  csw      67119130 May  7 15:21 /tmp/foo.dat.gz

The "compressed" file is actually about 10 KB bigger than the original.

QED.
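
If you want to reproduce that comparison without gzip and dd, here is a
rough Python equivalent ( my sketch ; the library path is just whatever
large .so happens to be lying around, not anything special ) that runs
zlib over a shared library and over the same number of urandom bytes :

import os, zlib

# Any big shared library will do ; this path is only an example.
lib_path = "/usr/lib/libwireshark.so.0.0.1"

with open(lib_path, "rb") as f:
    lib_data = f.read()

rand_data = os.urandom(len(lib_data))

for name, data in (("library", lib_data), ("urandom", rand_data)):
    packed = zlib.compress(data, 9)
    print("%-8s %10d -> %10d bytes, ratio %.3f"
          % (name, len(data), len(packed), len(packed) / len(data)))

The library should land well under half of its original size, while the
urandom blob comes back a touch larger than it went in, which is the same
story the gzip run above just told.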





