[zfs-discuss] No write coalescing after upgrade to Solaris 11 Express

2011-04-27 Thread Matthew Anderson
Hi All,

I've run into a massive performance problem after upgrading to Solaris 11 
Express from oSol 134.

Previously the server was performing a batch write every 10-15 seconds and the 
client servers (connected via NFS and iSCSI) had very low wait times. Now I'm 
seeing constant writes to the array with a very low throughput and high wait 
times on the client servers. Zil is currently disabled. There is currently one 
failed disk that is being replaced shortly.

Is there any ZFS tunable to revert Solaris 11 back to the behaviour of oSol 134?
I attempted to remove Sol 11 and reinstall 134 but it keeps freezing during 
install which is probably another issue entirely...

iostat output is below. When running iostat -v 2, the write ops and throughput 
stay at roughly that level and are very constant.

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
MirrorPool  12.2T  4.11T    153  4.63K  6.06M  33.6M
  mirror    1.04T   325G     11    416   400K  2.80M
    c7t0d0      -      -      5    114   163K  2.80M
    c7t1d0      -      -      6    114   237K  2.80M
  mirror    1.04T   324G     10    374   426K  2.79M
    c7t2d0      -      -      5    108   190K  2.79M
    c7t3d0      -      -      5    107   236K  2.79M
  mirror    1.04T   324G     15    425   537K  3.15M
    c7t4d0      -      -      7    115   290K  3.15M
    c7t5d0      -      -      8    116   247K  3.15M
  mirror    1.04T   325G     13    412   572K  3.00M
    c7t6d0      -      -      7    115   313K  3.00M
    c7t7d0      -      -      6    116   259K  3.00M
  mirror    1.04T   324G     13    381   580K  2.85M
    c7t8d0      -      -      7    111   362K  2.85M
    c7t9d0      -      -      5    111   219K  2.85M
  mirror    1.04T   325G     15    408   654K  3.10M
    c7t10d0     -      -      7    122   336K  3.10M
    c7t11d0     -      -      7    123   318K  3.10M
  mirror    1.04T   325G     14    461   681K  3.22M
    c7t12d0     -      -      8    130   403K  3.22M
    c7t13d0     -      -      6    132   278K  3.22M
  mirror     749G   643G      1    279   140K  1.07M
    c4t14d0     -      -      0      0      0      0
    c7t15d0     -      -      1     83   140K  1.07M
  mirror    1.05T   319G     18    333   672K  2.74M
    c7t16d0     -      -     11     96   406K  2.74M
    c7t17d0     -      -      7     96   266K  2.74M
  mirror    1.04T   323G     13    353   540K  2.85M
    c7t18d0     -      -      7     98   279K  2.85M
    c7t19d0     -      -      6    100   261K  2.85M
  mirror    1.04T   324G     12    459   543K  2.99M
    c7t20d0     -      -      7    118   285K  2.99M
    c7t21d0     -      -      4    119   258K  2.99M
  mirror    1.04T   324G     11    431   465K  3.04M
    c7t22d0     -      -      5    116   195K  3.04M
    c7t23d0     -      -      6    117   272K  3.04M
  c8t2d0        0  29.5G      0      0      0      0
cache           -      -      -      -      -      -
  c8t3d0    59.4G  3.88M    113     64  6.51M  7.31M
  c8t1d0    59.5G    48K     95     69  5.69M  8.08M


Thanks
-Matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express

2011-04-27 Thread Andrew Gabriel

Matthew Anderson wrote:

Hi All,

I've run into a massive performance problem after upgrading to Solaris 11 
Express from oSol 134.

Previously the server was performing a batch write every 10-15 seconds and the 
client servers (connected via NFS and iSCSI) had very low wait times. Now I'm 
seeing constant writes to the array with a very low throughput and high wait 
times on the client servers. Zil is currently disabled.


How/Why?


 There is currently one failed disk that is being replaced shortly.

Is there any ZFS tunable to revert Solaris 11 back to the behaviour of oSol 134?
  


What does "zfs get sync" report?

--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express

2011-04-27 Thread Matthew Anderson
NAME                    PROPERTY  VALUE     SOURCE
MirrorPool              sync      disabled  local
MirrorPool/CCIT         sync      disabled  local
MirrorPool/EX01         sync      disabled  inherited from MirrorPool
MirrorPool/EX02         sync      disabled  inherited from MirrorPool
MirrorPool/FileStore1   sync      disabled  inherited from MirrorPool


Sync was disabled on the main pool and then left to inherit to everything else. 
The reason for disabling it in the first place was to fix bad NFS write 
performance (even with the ZIL on an X25-E SSD it was under 1MB/s).
I've also tried setting logbias to both throughput and latency, but they 
perform at around the same level.
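
For reference, the settings described above map onto commands along these lines 
(a sketch using dataset names from the zfs get output above, not the exact 
command history):

# zfs set sync=disabled MirrorPool
# zfs set logbias=throughput MirrorPool/FileStore1
# zfs set logbias=latency MirrorPool/FileStore1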

Thanks
-Matt


-Original Message-
From: Andrew Gabriel [mailto:andrew.gabr...@oracle.com] 
Sent: Wednesday, 27 April 2011 3:41 PM
To: Matthew Anderson
Cc: 'zfs-discuss@opensolaris.org'
Subject: Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 
Express

Matthew Anderson wrote:
> Hi All,
>
> I've run into a massive performance problem after upgrading to Solaris 11 
> Express from oSol 134.
>
> Previously the server was performing a batch write every 10-15 seconds and 
> the client servers (connected via NFS and iSCSI) had very low wait times. Now 
> I'm seeing constant writes to the array with a very low throughput and high 
> wait times on the client servers. Zil is currently disabled.

How/Why?

>  There is currently one failed disk that is being replaced shortly.
>
> Is there any ZFS tunable to revert Solaris 11 back to the behaviour of oSol 
> 134?
>   

What does "zfs get sync" report?

-- 
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express

2011-04-27 Thread Markus Kovero


> Sync was disabled on the main pool and then left to inherit to everything 
> else. The reason for disabling it in the first place was to fix bad NFS 
> write performance (even with the ZIL on an X25-E SSD it was under 1MB/s).
> I've also tried setting logbias to both throughput and latency, but they 
> perform at around the same level.

> Thanks
> -Matt

I believe you're hitting bug "7000208: Space map trashing affects NFS write 
throughput". We did too, and it impacted iSCSI as well.

If you have enough RAM you can try enabling metaslab debugging (which makes the 
problem vanish):

# echo metaslab_debug/W1 | mdb -kw
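
To read the value back and confirm the change (a sketch; it assumes the 
metaslab_debug symbol is present in your kernel build):

# echo metaslab_debug/D | mdb -k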

And to calculate the amount of RAM needed:


/usr/sbin/amd64/zdb -mm <poolname> > /tmp/zdb-mm.out

awk '/segments/ {s+=$2} END {printf("sum=%d\n",s)}' /tmp/zdb-mm.out

93373117 sum of segments
16 VDEVs
116 metaslabs per vdev
1856 metaslabs in total

93373117/1856 = 50308 average number of segments per metaslab

50308*1856*64
5975785472

5975785472/1024/1024/1024
5.56

= 5.56 GB
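
The same estimate in one pass, for anyone following along (a sketch: it assumes 
the zdb output file above and the 64 bytes per in-core segment used in the 
calculation, so expect small rounding differences):

awk '/segments/ {s+=$2} END {printf("%.2f GB\n", s*64/1024/1024/1024)}' /tmp/zdb-mm.out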

Yours
Markus Kovero
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express

2011-04-27 Thread Tomas Ögren
On 27 April, 2011 - Matthew Anderson sent me these 3,2K bytes:

> Hi All,
> 
> I've run into a massive performance problem after upgrading to Solaris 11 
> Express from oSol 134.
> 
> Previously the server was performing a batch write every 10-15 seconds and 
> the client servers (connected via NFS and iSCSI) had very low wait times. Now 
> I'm seeing constant writes to the array with a very low throughput and high 
> wait times on the client servers. Zil is currently disabled. There is 
> currently one failed disk that is being replaced shortly.
> 
> Is there any ZFS tunable to revert Solaris 11 back to the behaviour of oSol 
> 134?
> I attempted to remove Sol 11 and reinstall 134 but it keeps freezing during 
> install which is probably another issue entirely...
> 
> iostat output is below. When running iostat -v 2, the write ops and 
> throughput stay at roughly that level and are very constant.
> 
>                capacity     operations    bandwidth
> pool        alloc   free   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> MirrorPool  12.2T  4.11T    153  4.63K  6.06M  33.6M
>   mirror    1.04T   325G     11    416   400K  2.80M
>     c7t0d0      -      -      5    114   163K  2.80M
>     c7t1d0      -      -      6    114   237K  2.80M
>   mirror    1.04T   324G     10    374   426K  2.79M
>     c7t2d0      -      -      5    108   190K  2.79M
>     c7t3d0      -      -      5    107   236K  2.79M
>   mirror    1.04T   324G     15    425   537K  3.15M
>     c7t4d0      -      -      7    115   290K  3.15M
>     c7t5d0      -      -      8    116   247K  3.15M
>   mirror    1.04T   325G     13    412   572K  3.00M
>     c7t6d0      -      -      7    115   313K  3.00M
>     c7t7d0      -      -      6    116   259K  3.00M
>   mirror    1.04T   324G     13    381   580K  2.85M
>     c7t8d0      -      -      7    111   362K  2.85M
>     c7t9d0      -      -      5    111   219K  2.85M
>   mirror    1.04T   325G     15    408   654K  3.10M
>     c7t10d0     -      -      7    122   336K  3.10M
>     c7t11d0     -      -      7    123   318K  3.10M
>   mirror    1.04T   325G     14    461   681K  3.22M
>     c7t12d0     -      -      8    130   403K  3.22M
>     c7t13d0     -      -      6    132   278K  3.22M
>   mirror     749G   643G      1    279   140K  1.07M
>     c4t14d0     -      -      0      0      0      0
>     c7t15d0     -      -      1     83   140K  1.07M
>   mirror    1.05T   319G     18    333   672K  2.74M
>     c7t16d0     -      -     11     96   406K  2.74M
>     c7t17d0     -      -      7     96   266K  2.74M
>   mirror    1.04T   323G     13    353   540K  2.85M
>     c7t18d0     -      -      7     98   279K  2.85M
>     c7t19d0     -      -      6    100   261K  2.85M
>   mirror    1.04T   324G     12    459   543K  2.99M
>     c7t20d0     -      -      7    118   285K  2.99M
>     c7t21d0     -      -      4    119   258K  2.99M
>   mirror    1.04T   324G     11    431   465K  3.04M
>     c7t22d0     -      -      5    116   195K  3.04M
>     c7t23d0     -      -      6    117   272K  3.04M
>   c8t2d0        0  29.5G      0      0      0      0

Btw, this disk seems alone, unmirrored and a bit small..?

> cache           -      -      -      -      -      -
>   c8t3d0    59.4G  3.88M    113     64  6.51M  7.31M
>   c8t1d0    59.5G    48K     95     69  5.69M  8.08M
> 
> 
> Thanks
> -Matt


/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive

2011-04-27 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Lamp Zy
>
> One of my drives failed in Raidz2 with two hot spares:
> 

What zpool & zfs version are you using?  What OS version?

Are all the drives precisely the same size (same make/model number?) and all
at the same firmware level?

Up to some point (I don't know which zpool version) there was a
characteristic (some would say a bug) whereby a drive even one byte smaller
would be an unsuitable replacement for a failed drive.  And it was certainly
known to happen that a single manufacturer & model of drive would
occasionally have tiny size variations between supposedly identical drives.
A workaround for this was added in some later zpool version.
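
One quick way to compare the raw capacities the drives actually report, before
attaching a replacement (a sketch; any of the usual tools will do):

# iostat -En | egrep 'Soft Errors|Size:'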

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive

2011-04-27 Thread Lamp Zy

On 04/26/2011 01:25 AM, Nikola M. wrote:

On 04/26/11 01:56 AM, Lamp Zy wrote:

Hi,

One of my drives failed in Raidz2 with two hot spares:

What are your zpool/zfs versions? (zpool upgrade Ctrl+c, zfs upgrade Ctrl+c).
The latest zpool/zfs versions available by numerical designation in all
OpenSolaris-based distributions are zpool 28 and zfs v5. (That is why one
should not upgrade to the S11Ex zpool/zfs versions if you want to keep using,
or keep installed, other OpenSolaris-based distributions in multiple ZFS BEs.)

What OS are you using with ZFS?
Do you use a Solaris 10 update release, Solaris 11 Express, OpenIndiana
oi_148 dev / 148b with illumos, OpenSolaris 2009.06/snv_134b, Nexenta,
Nexenta Community, Schillix, FreeBSD, or Linux zfs-fuse? (I guess you're
not using Linux with the ZFS kernel module yet, but just to mention it's
available... and OS X too.)


Thank you for all replies.

Here is what we are using.

- Hardware:

Server: SUN SunFire X4240
DAS Storage: SUN Storage J4400 with 24x1TB SATA drives. Original drives. 
I assume they are identical.


- Software:
OS: Solaris 10 5/09 s10x_u7wos_08 X86; Stock install. No upgrades, no 
patches.

ZFS pool version 10
ZFS filesystem version 3

Another confusing thing is that I wasn't able to take the failed drive 
offline because there weren't "enough replicas" (?). First, the drive had 
already failed, and second, it's raidz2, which is the equivalent of RAID6 
and should be able to handle two failed drives. I skipped that step but 
wanted to mention it here.


I used "zpool replace" and resilvering finished successfully.

Then "zpool detach" removed the drive, and now I have this:

# zpool status fwgpool0
  pool: fwgpool0
 state: ONLINE
 scrub: resilver completed after 12h59m with 0 errors on Wed Apr 27 
05:15:17 2011

config:

NAME   STATE READ WRITE CKSUM
fwgpool0   ONLINE   0 0 0
  raidz2   ONLINE   0 0 0
c4t5000C500108B406Ad0  ONLINE   0 0 0
c4t5000C50010F436E2d0  ONLINE   0 0 0
c4t5000C50011215B6Ed0  ONLINE   0 0 0
c4t5000C50011234715d0  ONLINE   0 0 0
c4t5000C50011252B4Ad0  ONLINE   0 0 0
c4t5000C500112749EDd0  ONLINE   0 0 0
c4t5000C50014D70072d0  ONLINE   0 0 0
c4t5000C500112C4959d0  ONLINE   0 0 0
c4t5000C50011318199d0  ONLINE   0 0 0
c4t5000C500113C0E9Dd0  ONLINE   0 0 0
c4t5000C500113D0229d0  ONLINE   0 0 0
c4t5000C500113E97B8d0  ONLINE   0 0 0
c4t5000C50014D065A9d0  ONLINE   0 0 0
c4t5000C50014D0B3B9d0  ONLINE   0 0 0
c4t5000C50014D55DEFd0  ONLINE   0 0 0
c4t5000C50014D642B7d0  ONLINE   0 0 0
c4t5000C50014D64521d0  ONLINE   0 0 0
c4t5000C50014D69C14d0  ONLINE   0 0 0
c4t5000C50014D6B2CFd0  ONLINE   0 0 0
c4t5000C50014D6C6D7d0  ONLINE   0 0 0
c4t5000C50014D6D486d0  ONLINE   0 0 0
c4t5000C50014D6D77Fd0  ONLINE   0 0 0
spares
  c4t5000C50014D7058Dd0    AVAIL

errors: No known data errors
#
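
For the archive, the command forms used above were roughly as follows;
<failed-disk> and <spare-disk> here are placeholders, not the exact device
names from this pool:

# zpool replace fwgpool0 <failed-disk> <spare-disk>
# zpool detach fwgpool0 <failed-disk>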

Great. So, now how do I identify which drive out of the 24 in the 
storage unit is the one that failed?


I looked on the Internet for help but the problem is that this drive 
completely disappeared. Even "format" and "iostat -En" show only 23 
drives when there are physically 24.


Any ideas how to identify which drive is the one that failed so I can 
replace it?


Thanks
Peter
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive

2011-04-27 Thread Brandon High
On Wed, Apr 27, 2011 at 12:51 PM, Lamp Zy  wrote:
> Any ideas how to identify which drive is the one that failed so I can
> replace it?

Try the following:
# fmdump -eV
# fmadm faulty

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive

2011-04-27 Thread Paul Kraus
On Wed, Apr 27, 2011 at 3:51 PM, Lamp Zy  wrote:

> Great. So, now how do I identify which drive out of the 24 in the storage
> unit is the one that failed?
>
> I looked on the Internet for help but the problem is that this drive
> completely disappeared. Even "format" and "iostat -En" show only 23 drives
> when there are physically 24.
>
> Any ideas how to identify which drive is the one that failed so I can
> replace it?

We are using CAM to monitor our J4400s and through that interface you
can see which drive is in which slot.

http://www.oracle.com/us/products/servers-storage/storage/storage-software/031603.htm

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-27 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Erik Trimble
> 
> (BTW, is there any way to get a measurement of number of blocks consumed
> per zpool?  Per vdev?  Per zfs filesystem?)  *snip*.
> 
> 
> you need to use zdb to see what the current block usage is for a
> filesystem.  I'd have to look up the particular CLI usage for that, as I
> don't know what it is off the top of my head.

Anybody know the answer to that one?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-27 Thread Tomas Ögren
On 27 April, 2011 - Edward Ned Harvey sent me these 0,6K bytes:

> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> > boun...@opensolaris.org] On Behalf Of Erik Trimble
> > 
> > (BTW, is there any way to get a measurement of number of blocks consumed
> > per zpool?  Per vdev?  Per zfs filesystem?)  *snip*.
> > 
> > 
> > you need to use zdb to see what the current block usage is for a
> > filesystem.  I'd have to look up the particular CLI usage for that, as I
> > don't know what it is off the top of my head.
> 
> Anybody know the answer to that one?

zdb -bb pool
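
Usage sketch, from memory (it can take quite a while on a large pool, and the
exact output format varies by build):

# zdb -bb <poolname>

The summary near the end includes the total block count and a per-object-type
breakdown of blocks and sizes.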

/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express

2011-04-27 Thread zfs user

On 4/27/11 4:00 AM, Markus Kovero wrote:




Sync was disabled on the main pool and then left to inherit to everything else. 
The reason for disabling it in the first place was to fix bad NFS write 
performance (even with the ZIL on an X25-E SSD it was under 1MB/s).
I've also tried setting logbias to both throughput and latency, but they 
perform at around the same level.



Thanks
-Matt


I believe you're hitting bug "7000208: Space map trashing affects NFS write 
throughput". We did too, and it impacted iSCSI as well.

If you have enough RAM you can try enabling metaslab debugging (which makes the 
problem vanish):

# echo metaslab_debug/W1 | mdb -kw

And to calculate the amount of RAM needed:


/usr/sbin/amd64/zdb -mm <poolname> > /tmp/zdb-mm.out


metaslab 65   offset  410   spacemap258   free   Assertion 
failed: space_map_load(sm, zfs_metaslab_ops, SM_FREE, smo, 
spa->spa_meta_objset) == 0, file ../zdb.c, line 571, function dump_metaslab


Is this something I should worry about?

uname -a
SunOS E55000 5.11 oi_148 i86pc i386 i86pc Solaris



awk '/segments/ {s+=$2}END {printf("sum=%d\n",s)}' zdb_mm.out

93373117 sum of segments
16 VDEVs
116 metaslabs
1856 metaslabs in total

93373117/1856 = 50308 average number of segments per metaslab

50308*1856*64
5975785472

5975785472/1024/1024/1024
5.56

= 5.56 GB

Yours
Markus Kovero

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-27 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Neil Perrin
> 
> No, that's not true. The DDT is just like any other ZFS metadata and can be
> split over the ARC, cache device (L2ARC) and the main pool devices. An
> infrequently referenced DDT block will get evicted from the ARC to the L2ARC
> then evicted from the L2ARC.

When somebody has their "baseline" system, and they're thinking about adding
dedup and/or cache, I'd like to understand the effect of not having enough
ram.  Obviously the impact will be performance, but precisely...

At bootup, I presume the arc & l2arc are all empty.  So all the DDT entries
reside in pool.  As the system reads things (anything, files etc) from pool,
it will populate arc, and follow fill rate policies to populate the l2arc
over time.  Every entry in l2arc requires 200 bytes of arc, regardless of
what type of entry it is.  (A DDT entry in l2arc consumes just as much arc
memory as any other type of l2arc entry.)  (Ummm...  What's the point of
that?  Aren't DDT entries 270 bytes and ARC references 200 bytes?  Seems
like a very questionable benefit to allow DDT entries to get evicted into
L2ARC.)  So the ram consumption caused by the presence of l2arc will
initially be zero after bootup, and it will grow over time as the l2arc
populates, up to a maximum which is determined linearly as 200 bytes * the
number of entries that can fit in the l2arc.  Of course that number varies
based on the size of each entry and size of l2arc, but at least you can
estimate and establish upper and lower bounds.
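
As a rough worked example of that upper bound (using the 200 bytes/entry figure
assumed in this message, and a hypothetical 120GB cache device filled entirely
with 4KB entries):

120GB / 4KB = ~31 million entries
31 million entries * 200 bytes = ~6GB of ARC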

So that's how the l2arc consumes system memory in arc.  The penalty of
insufficient ram, in conjunction with enabled L2ARC, is insufficient arc
availability for other purposes - Maybe the whole arc is consumed by l2arc
entries, and so the arc doesn't have any room for other stuff like commonly
used files.  Worse yet, your arc consumption could be so large, that
PROCESSES don't fit in ram anymore.  In this case, your processes get pushed
out to swap space, which is really bad.

Correct me if I'm wrong, but the dedup sha256 checksum happens in addition
to (not instead of) the fletcher2 integrity checksum.  So after bootup,
while the system is reading a bunch of data from the pool, all those reads
are not populating the arc/l2arc with DDT entries.  Reads are just
populating the arc and l2arc with other stuff.

DDT entries don't get into the arc/l2arc until something tries to do a
write.  When performing a write, dedup calculates the checksum of the block
to be written, and then it needs to figure out if that's a duplicate of
another block that's already on disk somewhere.  So (I guess this part)
there's probably a tree-structure (I'll use the subdirectories and files
analogy even though I'm certain that's not technically correct) on disk.
You need to find the DDT entry, if it exists, for the block whose checksum
is 1234ABCD.  So you start by looking under the 1 directory, and from there
look for the 2 subdirectory, and then the 3 subdirectory, [...etc...] If you
encounter "not found" at any step, then the DDT entry doesn't already exist
and you decide to create a new one.  But if you get all the way down to the
C subdirectory and it contains a file named "D,"  then you have found a
possible dedup hit - the checksum matched another block that's already on
disk.  Now the DDT entry is stored in ARC just like anything else you read
from disk.

So the point is - Whenever you do a write, and the calculated DDT is not
already in ARC/L2ARC, the system will actually perform several small reads
looking for the DDT entry before it finally knows that the DDT entry
actually exists.  So the penalty of performing a write, with dedup enabled,
and the relevant DDT entry not already in ARC/L2ARC is a very large penalty.
What originated as a single write quickly became several small reads plus a
write, due to the fact the necessary DDT entry was not already available.

The penalty of insufficient ram, in conjunction with dedup, is terrible
write performance.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-27 Thread Richard Elling
On Apr 27, 2011, at 9:26 PM, Edward Ned Harvey 
 wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Neil Perrin
>> 
>> No, that's not true. The DDT is just like any other ZFS metadata and can be
>> split over the ARC, cache device (L2ARC) and the main pool devices. An
>> infrequently referenced DDT block will get evicted from the ARC to the L2ARC
>> then evicted from the L2ARC.
> 
> When somebody has their "baseline" system, and they're thinking about adding
> dedup and/or cache, I'd like to understand the effect of not having enough
> ram.  Obviously the impact will be performance, but precisely...

Precision is only possible if you know what the data looks like...

> At bootup, I presume the arc & l2arc are all empty.  So all the DDT entries
> reside in pool.  As the system reads things (anything, files etc) from pool,
> it will populate arc, and follow fill rate policies to populate the l2arc
> over time.  Every entry in l2arc requires 200 bytes of arc, regardless of
> what type of entry it is.  (A DDT entry in l2arc consumes just as much arc
> memory as any other type of l2arc entry.)  (Ummm...  What's the point of
> that?  Aren't DDT entries 270 bytes and ARC references 200 bytes?

No. The DDT entries vary in size.

>  Seems
> like a very questionable benefit to allow DDT entries to get evicted into
> L2ARC.)  So the ram consumption caused by the presence of l2arc will
> initially be zero after bootup, and it will grow over time as the l2arc
> populates, up to a maximum which is determined linearly as 200 bytes * the
> number of entries that can fit in the l2arc.  Of course that number varies
> based on the size of each entry and size of l2arc, but at least you can
> estimate and establish upper and lower bounds.

The upper and lower bounds vary by 256x, unless you know what the data
looks like more precisely.

> So that's how the l2arc consumes system memory in arc.  The penalty of
> insufficient ram, in conjunction with enabled L2ARC, is insufficient arc
> availability for other purposes - Maybe the whole arc is consumed by l2arc
> entries, and so the arc doesn't have any room for other stuff like commonly
> used files.  

I've never seen this.

> Worse yet, your arc consumption could be so large, that
> PROCESSES don't fit in ram anymore.  In this case, your processes get pushed
> out to swap space, which is really bad.

[for Solaris, illumos, and NexentaOS]
This will not happen unless the ARC size is at arc_min. At that point you are
already close to severe memory shortfall.

> Correct me if I'm wrong, but the dedup sha256 checksum happens in addition
> to (not instead of) the fletcher2 integrity checksum.  

You are mistaken.

> So after bootup,
> while the system is reading a bunch of data from the pool, all those reads
> are not populating the arc/l2arc with DDT entries.  Reads are just
> populating the arc and l2arc with other stuff.

L2ARC is populated by a separate thread that watches the to-be-evicted list.
The L2ARC fill rate is also throttled, so that under severe shortfall, blocks
will be evicted without being placed in the L2ARC.

> DDT entries don't get into the arc/l2arc until something tries to do a
> write.  

No, the DDT entry contains the references to the actual data.

> When performing a write, dedup calculates the checksum of the block
> to be written, and then it needs to figure out if that's a duplicate of
> another block that's already on disk somewhere.  So (I guess this part)
> there's probably a tree-structure (I'll use the subdirectories and files
> analogy even though I'm certain that's not technically correct) on disk.

Implemented as an AVL tree.

> You need to find the DDT entry, if it exists, for the block whose checksum
> is 1234ABCD.  So you start by looking under the 1 directory, and from there
> look for the 2 subdirectory, and then the 3 subdirectory, [...etc...] If you
> encounter "not found" at any step, then the DDT entry doesn't already exist
> and you decide to create a new one.  But if you get all the way down to the
> C subdirectory and it contains a file named "D,"  then you have found a
> possible dedup hit - the checksum matched another block that's already on
> disk.  Now the DDT entry is stored in ARC just like anything else you read
> from disk.

DDT is metadata, not data, so it is more constrained than data entries in the
ARC.

> So the point is - Whenever you do a write, and the calculated DDT is not
> already in ARC/L2ARC, the system will actually perform several small reads
> looking for the DDT entry before it finally knows that the DDT entry
> actually exists.  So the penalty of performing a write, with dedup enabled,
> and the relevant DDT entry not already in ARC/L2ARC is a very large penalty.
> What originated as a single write quickly became several small reads plus a
> write, due to the fact the necessary DDT entry was not already available.
> 
> The penalty of insufficient ram, in conjunction with dedup, is terrible
> write performance.

Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-27 Thread Erik Trimble
OK, I just re-looked at a couple of things, and here's what I /think/ are
the correct numbers.

A single entry in the DDT is defined in the struct "ddt_entry" :

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/ddt.h#108

I just checked, and the current size of this structure is 0x178, or 376
bytes.


Each ARC entry, which points to either an L2ARC item (of any kind,
cached data, metadata, or a DDT line) or actual data/metadata/etc., is
defined in the struct "arc_buf_hdr" :

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#431

Its current size is 0xb0, or 176 bytes.

These are fixed-size structures.
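
A quick way to double-check those sizes on a live system (a sketch; it assumes
mdb can see the zfs module's CTF type data, and that these typedef names match):

# echo '::sizeof ddt_entry_t' | mdb -k
# echo '::sizeof arc_buf_hdr_t' | mdb -k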


PLEASE - someone correct me if these two structures AREN'T what we
should be looking at.



So, our estimate calculations have to be based on these new numbers.


Back to the original scenario:

1TB (after dedup) of 4k blocks: how much space is needed for the DDT,
and how much ARC space is needed if the DDT is kept in a L2ARC cache
device?

Step 1)  1TB (2^40 bytes) stored in blocks of 4k (2^12) = 2^28 blocks
total, which is about 268 million.

Step 2)  2^28 blocks of information in the DDT requires  376 bytes/block
* 2^28 blocks = 94 * 2^30 = 94 GB of space.  

Step 3)  Storing a reference to 268 million (2^28) DDT entries in the
L2ARC will consume the following amount of ARC space: 176 bytes/entry *
2^28 entries = 44GB of RAM.


That's pretty ugly.


So, to summarize:

For 1TB of data, broken into the following block sizes:
            DDT size          ARC consumption
  512b      752GB   (73%)     352GB   (34%)
  4k        94GB    (9%)      44GB    (4.3%)
  8k        47GB    (4.5%)    22GB    (2.1%)
  32k       11.75GB (1.1%)    5.5GB   (0.5%)
  64k       5.9GB   (0.6%)    2.75GB  (0.3%)
  128k      2.9GB   (0.3%)    1.4GB   (0.1%)

ARC consumption presumes the whole DDT is stored in the L2ARC.

Percentage size is relative to the original 1TB total data size
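
A minimal shell sketch that reproduces the table above from those two structure
sizes (integer arithmetic, so the results round down; an estimate, not a
measurement):

DATA=$((1 << 40))                      # 1TB of already-deduped data
for BS in 512 4096 8192 32768 65536 131072; do
  BLOCKS=$((DATA / BS))
  printf "%7d  DDT %4d GB   ARC %4d GB\n" $BS \
    $((BLOCKS * 376 / 1024 / 1024 / 1024)) \
    $((BLOCKS * 176 / 1024 / 1024 / 1024))
done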



Of course, the trickier proposition here is that we DON'T KNOW what our
dedup value is ahead of time on a given data set.  That is, given a data
set of X size, we don't know how big the deduped data size will be. The
above calculations are for DDT/ARC size for a data set that has already
been deduped down to 1TB in size.


Perhaps it would be nice to have some sort of userland utility that
builds its own DDT as a test and does all the above calculations, to
see how dedup would work on a given dataset.  'zdb -S' sorta, kinda does
that, but...


-- 
Erik Trimble
Java System Support
Mailstop:  usca22-317
Phone:  x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss