Dennis Clarke wrote:
On Sat, 2009-11-07 at 17:41 -0500, Dennis Clarke wrote:
Does the dedupe functionality happen at the file level or a lower block
level?
It occurs at the block allocation level.
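For example, dedup is a per-dataset property and would be switched on with something like this (the zp_dd/tester dataset name here is only an illustration):
$ zfs set dedup=on zp_dd/tester
Only blocks written after the property is set participate in dedup; existing data is not rewritten.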
I am writing a large number of files that have the following structure:
------ file begins
1024 lines of random ASCII chars 64 chars long
some tilde chars .. about 1000 of them
some text ( English ) for 2K
more text ( English ) for 700 bytes or so
------------------
ZFS's default block size is 128K and is controlled by the "recordsize"
filesystem property. Unless you changed "recordsize", each of the files
above would be a single block distinct from the others.
You may or may not get better dedup ratios with a smaller recordsize,
depending on how the common parts of the file line up with block
boundaries.
The cost of the additional indirect blocks might overwhelm the savings
from deduping a small common piece of the file.
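For example, recordsize can be checked and lowered per dataset (the 8K value is only an illustration, and only files written after the change use the new size):
$ zfs get recordsize zp_dd/tester
$ zfs set recordsize=8K zp_dd/tester
A smaller recordsize means more blocks per file, and therefore more block pointers and indirect blocks to store, which is where that overhead comes from.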
- Bill
Well, I was curious about this sort of thing and figured that a simple
test would show me the behavior.
Now the first test I did was to write 26^2 files [a-z][a-z].dat in 26^2
directories named [a-z][a-z], where each file is 64K of random
non-compressible data followed by some English text.
I guess I was wrong about that 64K chunk being non-compressible .. because I
wrote it out as chars from the set { [A-Z][a-z][0-9] } and thus ..
compressible ASCII data as opposed to random binary data.
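Roughly speaking, the generation loop looked something like this (a sketch only, assuming bash and a GNU userland; english-2k.txt and english-700b.txt stand in for the English text portions, and the real script differed in detail):
$ for d in {a..z}{a..z}; do
>   mkdir -p /tester/foo/1/$d
>   for f in {a..z}{a..z}; do
>     { head -c 400000 /dev/urandom | tr -dc 'A-Za-z0-9' | head -c 65536 | fold -w 64
>       printf '~%.0s' {1..1000}; echo
>       cat english-2k.txt english-700b.txt
>     } > /tester/foo/1/$d/$f.dat
>   done
> done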
So ... after doing that a few times I now see something fascinating:
$ ls -lo /tester/foo/*/aa/aa.dat
-rw-r--r-- 1 dclarke 68330 Nov 7 22:38 /tester/foo/1/aa/aa.dat
-rw-r--r-- 1 dclarke 68330 Nov 7 22:45 /tester/foo/2/aa/aa.dat
-rw-r--r-- 1 dclarke 68330 Nov 7 22:43 /tester/foo/3/aa/aa.dat
-rw-r--r-- 1 dclarke 68330 Nov 7 22:43 /tester/foo/4/aa/aa.dat
$ ls -lo /tester/foo/*/zz/az.dat
-rw-r--r-- 1 dclarke 68330 Nov 7 22:39 /tester/foo/1/zz/az.dat
-rw-r--r-- 1 dclarke 68330 Nov 7 22:47 /tester/foo/2/zz/az.dat
-rw-r--r-- 1 dclarke 68330 Nov 7 22:45 /tester/foo/3/zz/az.dat
-rw-r--r-- 1 dclarke 68330 Nov 7 22:47 /tester/foo/4/zz/az.dat
$ find /tester/foo -type f | wc -l
70304
Those files, all 70,000+ of them, are unique and smaller than the
filesystem blocksize.
However:
$ zfs get used,available,referenced,compressratio,recordsize,compression,dedup zp_dd/tester
NAME          PROPERTY       VALUE  SOURCE
zp_dd/tester  used           4.51G  -
zp_dd/tester  available      3.49G  -
zp_dd/tester  referenced     4.51G  -
zp_dd/tester  compressratio  1.00x  -
zp_dd/tester  recordsize     128K   default
zp_dd/tester  compression    off    local
zp_dd/tester  dedup          on     local
Compression factors don't interest me at the moment .. but see this:
$ zpool get all zp_dd
NAME   PROPERTY       VALUE                 SOURCE
zp_dd  size           67.5G                 -
zp_dd  capacity       6%                    -
zp_dd  altroot        -                     default
zp_dd  health         ONLINE                -
zp_dd  guid           14649016030066358451  default
zp_dd  version        21                    default
zp_dd  bootfs         -                     default
zp_dd  delegation     on                    default
zp_dd  autoreplace    off                   default
zp_dd  cachefile      -                     default
zp_dd  failmode       wait                  default
zp_dd  listsnapshots  off                   default
zp_dd  autoexpand     off                   default
zp_dd  dedupratio     1.95x                 -
zp_dd  free           63.3G                 -
zp_dd  allocated      4.22G                 -
The dedup ratio has climbed to 1.95x with all those unique files, each
smaller than the 128K recordsize.
You can get more dedup information by running 'zdb -DD zp_dd'. This
should show you how we break things down. Add more 'D' options to get
even more detail.
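For example (usage only; what each level prints is roughly as described in the zdb man page):
$ zdb -D zp_dd       # summary of DDT entries and the overall dedup ratio
$ zdb -DD zp_dd      # adds a histogram of blocks bucketed by reference count
$ zdb -DDD zp_dd     # per-DDT statistics
$ zdb -DDDD zp_dd    # dumps the individual DDT entries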
- George