[EMAIL PROTECTED] said:
> Thanks for the tips.  I'm not sure if they will be relevant, though.  We
> don't talk directly with the AMS1000.  We are using a USP-VM to virtualize
> all of our storage and we didn't have to add anything to the drv
> configuration files to see the new disk (mpxio was already turned on).  We
> are using the Sun drivers and mpxio and we didn't require any tinkering to
> see the new LUNs.

Yes, the fact that the Solaris drivers recognized the USP-VM automatically
is a good sign.  I'd still check which queue-depth and disksort values the
automatic settings left you with:

  echo "*ssd_state::walk softstate |::print -t struct sd_lun un_throttle" \
   | mdb -k

The "ssd_state" would be "sd_state" on an x86 machine (Solaris-10).
The "un_throttle" above will show the current max_throttle (queue depth);
Replace it with "un_min_throttle" to see the min, and "un_f_disksort_disabled"
to see the current queue-sort setting.
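
For example, the same check for the disksort setting would look like:

  echo "*ssd_state::walk softstate |::print -t struct sd_lun un_f_disksort_disabled" \
   | mdb -k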

The HDS docs for the 9500 series suggested 32 as the max_throttle to use,
and the default setting (Solaris 10) was 256 (hopefully with the USP-VM you
get something more reasonable).  And while 32 did work for us -- no
operations were ever lost as far as I could tell -- the array back-end (the
drives themselves and the internal SATA shelf connections) has an actual
queue depth of four per array controller.  The AMS1000 has the same
limitation for SATA shelves, according to our HDS engineer.
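
If you end up lowering it, the usual knob is a global setting in
/etc/system, along these lines (just a sketch: 32 is the HDS-recommended
value from above, and on x86 the variable would be sd:sd_max_throttle
instead):

  * lower the per-LUN queue depth; takes effect on reboot
  set ssd:ssd_max_throttle = 32

I believe Solaris 10 can also take per-device throttle-max and disksort
settings through an ssd-config-list entry in /kernel/drv/ssd.conf, keyed
on the array's inquiry string, if you'd rather not change things globally.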

In short, Solaris, especially with ZFS, functions much better if it does
not try to send more FC operations to the array than the actual physical
devices can handle.  We were seeing NFS client operations hang for minutes
at a time whenever the SAN-hosted NFS server made its ZFS devices busy --
and this was true even for clients using different devices from the busy
ones.  Since making the changes described above we no longer see these
hangs, and I believe that's because the OS is no longer waiting around for
a response from devices that aren't going to respond in a reasonable
amount of time.
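
You can watch the throttle doing its job with iostat; the actv column
shows the number of commands currently outstanding to each device, so
with a max_throttle of 32 it should never climb past 32:

  iostat -xnz 5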

Yes, having the USP between the host and the AMS1000 will affect things;
there's probably some huge cache in there somewhere.  But unless you've
got hundreds of GB of cache, at some point a resilver operation is going
to end up running at the speed of the actual back-end device.

Regards,

Marion

