> On Sat, 2009-11-07 at 17:41 -0500, Dennis Clarke wrote:
>> Does the dedupe functionality happen at the file level or a lower
>> block level?
>
> it occurs at the block allocation level.
>
>> I am writing a large number of files that have the following structure :
>>
>> ------ file begins
>> 1024 lines of random ASCII chars 64 chars long
>> some tilde chars .. about 1000 of them
>> some text ( english ) for 2K
>> more text ( english ) for 700 bytes or so
>> ------------------
>
> ZFS's default block size is 128K and is controlled by the "recordsize"
> filesystem property. Unless you changed "recordsize", each of the files
> above would be a single block distinct from the others.
>
> you may or may not get better dedup ratios with a smaller recordsize,
> depending on how the common parts of the file line up with block
> boundaries.
>
> the cost of additional indirect blocks might overwhelm the savings from
> deduping a small common piece of the file.
>
> - Bill
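
A quick aside on the smaller-recordsize suggestion, before the results
below : recordsize only applies to blocks written after the property is
set, so the cleanest comparison is probably a second dataset created with
a smaller recordsize and the same files written into it. Something along
these lines should do it, where the dataset name is only an example :

$ zfs create -o recordsize=8K -o dedup=on zp_dd/tester8k
$ zfs get recordsize,dedup zp_dd/tester8k
  ... write the same test files into zp_dd/tester8k, then ...
$ zpool get dedupratio zp_dd

Everything below is against the default 128K recordsize.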
Well, I was curious about these sorts of things and figured that a simple
test would show me the behavior.

Now the first test I did was to write 26^2 files named [a-z][a-z].dat into
26^2 directories named [a-z][a-z], where each file is 64K of random
non-compressible data followed by some english text. I guess I was wrong
about that 64K chunk being non-compressible, because I wrote the data out
as chars from the set { [A-Z][a-z][0-9] }, and thus it is compressible
ASCII data as opposed to random binary data.

So ... after doing that a few times I now see something fascinating :

$ ls -lo /tester/foo/*/aa/aa.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:38 /tester/foo/1/aa/aa.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:45 /tester/foo/2/aa/aa.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:43 /tester/foo/3/aa/aa.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:43 /tester/foo/4/aa/aa.dat

$ ls -lo /tester/foo/*/zz/az.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:39 /tester/foo/1/zz/az.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:47 /tester/foo/2/zz/az.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:45 /tester/foo/3/zz/az.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:47 /tester/foo/4/zz/az.dat

$ find /tester/foo -type f | wc -l
   70304

Those files, all 70,000+ of them, are unique and each one is smaller than
the filesystem blocksize. However :

$ zfs get used,available,referenced,compressratio,recordsize,compression,dedup zp_dd/tester
NAME          PROPERTY       VALUE    SOURCE
zp_dd/tester  used           4.51G    -
zp_dd/tester  available      3.49G    -
zp_dd/tester  referenced     4.51G    -
zp_dd/tester  compressratio  1.00x    -
zp_dd/tester  recordsize     128K     default
zp_dd/tester  compression    off      local
zp_dd/tester  dedup          on       local

Compression factors don't interest me at the moment .. but see this :

$ zpool get all zp_dd
NAME   PROPERTY       VALUE                 SOURCE
zp_dd  size           67.5G                 -
zp_dd  capacity       6%                    -
zp_dd  altroot        -                     default
zp_dd  health         ONLINE                -
zp_dd  guid           14649016030066358451  default
zp_dd  version        21                    default
zp_dd  bootfs         -                     default
zp_dd  delegation     on                    default
zp_dd  autoreplace    off                   default
zp_dd  cachefile      -                     default
zp_dd  failmode       wait                  default
zp_dd  listsnapshots  off                   default
zp_dd  autoexpand     off                   default
zp_dd  dedupratio     1.95x                 -
zp_dd  free           63.3G                 -
zp_dd  allocated      4.22G                 -

The dedupe ratio has climbed to 1.95x even though all of those files are
unique and each is smaller than the 128K recordsize.

-- 
Dennis Clarke
dcla...@opensolaris.ca   <- Email related to the open source Solaris
dcla...@blastwave.org    <- Email related to open source for Solaris
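
P.S. In case anyone wants to build a similar pile of files, something
along these lines should produce the layout described above. This is a
rough sketch, not the exact script I ran -- it assumes bash, a head that
supports -c ( GNU coreutils ), and an english.txt standing in for
whatever text gets appended :

#!/usr/bin/env bash
# One pass of the generator: 26^2 directories named [a-z][a-z], each
# holding 26 files of ~64K compressible ASCII plus a little english text.
# Usage: mkfiles.sh /tester/foo/1
run=${1:?usage: mkfiles.sh <run-directory>}
for d1 in {a..z}; do
  for d2 in {a..z}; do
    dir="$run/$d1$d2"
    mkdir -p "$dir"
    for f in {a..z}; do
      out="$dir/$f$d2.dat"
      # 64K of chars from the set [A-Za-z0-9] -- compressible ASCII, but
      # different in every file -- followed by a couple of KB of text.
      LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 65536 > "$out"
      cat english.txt >> "$out"
    done
  done
done

One pass writes 26^3 = 17576 files, so four passes into /tester/foo/1
through /tester/foo/4 line up with the 70304 that find counted above.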