> On Sat, 2009-11-07 at 17:41 -0500, Dennis Clarke wrote:
>> Does the dedupe functionality happen at the file level or a lower block
>> level?
>
> it occurs at the block allocation level.
>
>> I am writing a large number of files that have the following structure :
>>
>> ------ file begins
>> 1024 lines of random ASCII chars 64 chars long
>> some tilde chars .. about 1000 of them
>> some text ( english ) for 2K
>> more text ( english ) for 700 bytes or so
>> ------------------
>
> ZFS's default block size is 128K and is controlled by the "recordsize"
> filesystem property.  Unless you changed "recordsize", each of the files
> above would be a single block distinct from the others.
>
> you may or may not get better dedup ratios with a smaller recordsize
> depending on how the common parts of the file line up with block
> boundaries.
>
> the cost of additional indirect blocks might overwhelm the savings from
> deduping a small common piece of the file.
>
>                                               - Bill

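As an aside, recordsize is a per-dataset property and only affects files
written after it is changed, so testing the smaller-recordsize idea would
just be a matter of something like this before re-writing the data, with
8K only as an example value :

$ zfs set recordsize=8K zp_dd/tester
$ zfs get recordsize zp_dd/tester
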
Well, I was curious about this sort of thing and figured that a simple
test would show me the behavior.

The first test I did was to write 26^2 files named [a-z][a-z].dat in 26^2
directories named [a-z][a-z], where each file is 64K of random
non-compressible data followed by some English text.

I guess I was wrong to call that 64K chunk non-compressible, because I
wrote the data out as chars from the set { [A-Z][a-z][0-9] } and thus it
is compressible ASCII data as opposed to random binary data.
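
For reference, a file with roughly that layout can be produced with
standard tools, something along these lines.  This is only a sketch, not
the exact script; the byte counts and the path to the English text are
placeholders :

$ ( LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | fold -w 64 | head -n 1024   # 1024 lines of 64 chars
    perl -e 'print "~" x 1000, "\n"'                                          # ~1000 tilde chars
    cat /some/english/text.txt                                                # a few K of English text
  ) > aa/aa.dat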

So ... after doing that a few times I now see something fascinating :

$ ls -lo /tester/foo/*/aa/aa.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:38 /tester/foo/1/aa/aa.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:45 /tester/foo/2/aa/aa.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:43 /tester/foo/3/aa/aa.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:43 /tester/foo/4/aa/aa.dat
$ ls -lo /tester/foo/*/zz/az.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:39 /tester/foo/1/zz/az.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:47 /tester/foo/2/zz/az.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:45 /tester/foo/3/zz/az.dat
-rw-r--r--   1 dclarke    68330 Nov  7 22:47 /tester/foo/4/zz/az.dat

$ find /tester/foo -type f | wc -l
   70304

Those files, all 70,000+ of them, are unique and smaller than the
filesystem blocksize.
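
A quick way for anyone to verify that uniqueness claim is to checksum the
same relative file across the four runs with digest(1), for example :

$ digest -v -a md5 /tester/foo/*/aa/aa.dat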

However :

$ zfs get used,available,referenced,compressratio,recordsize,compression,dedup zp_dd/tester
NAME          PROPERTY       VALUE             SOURCE
zp_dd/tester  used           4.51G             -
zp_dd/tester  available      3.49G             -
zp_dd/tester  referenced     4.51G             -
zp_dd/tester  compressratio  1.00x             -
zp_dd/tester  recordsize     128K              default
zp_dd/tester  compression    off               local
zp_dd/tester  dedup          on                local

Compression factors don't interest me at the moment .. but see this :

$ zpool get all zp_dd
NAME   PROPERTY       VALUE       SOURCE
zp_dd  size           67.5G       -
zp_dd  capacity       6%          -
zp_dd  altroot        -           default
zp_dd  health         ONLINE      -
zp_dd  guid           14649016030066358451  default
zp_dd  version        21          default
zp_dd  bootfs         -           default
zp_dd  delegation     on          default
zp_dd  autoreplace    off         default
zp_dd  cachefile      -           default
zp_dd  failmode       wait        default
zp_dd  listsnapshots  off         default
zp_dd  autoexpand     off         default
zp_dd  dedupratio     1.95x       -
zp_dd  free           63.3G       -
zp_dd  allocated      4.22G       -

The dedup ratio has climbed to 1.95x with all of those unique files that
are smaller than recordsize bytes.
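
To dig further into where that 1.95x comes from, the dedup table itself
can be dumped with zdb.  I believe the -D option (and -DD for more detail)
arrived with the dedup integration, so treat the exact flags as an
assumption on my part, but something like this should show how many DDT
entries are referenced more than once :

$ zdb -DD zp_dd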

-- 
Dennis Clarke
dcla...@opensolaris.ca  <- Email related to the open source Solaris
dcla...@blastwave.org   <- Email related to open source for Solaris


