On Oct 23, 2009, at 4:46 PM, Tim Cook wrote:
On Fri, Oct 23, 2009 at 6:32 PM, Adam Cheal <ach...@pnimedia.com> wrote:
I don't think there was any intention on Sun's part to ignore the
problem...obviously their target market wants a performance-oriented
box and the x4540 delivers that. Each 1068E controller chip supports
8 SAS PHY channels = 1 channel per drive = no contention for
channels. The x4540 is a monster and performs like a dream with
snv_118 (we have a few ourselves).
My issue is that implementing an archival-type solution demands a
dense, simple storage platform that performs at a reasonable level,
nothing more. Our design has the same controller chip (8 SAS PHY
channels) driving 46 disks, so there is bound to be contention there, especially in high-load situations. I just need it to work and handle load gracefully, not time out and cause disk "failures"; at this point I can't even scrub the zpools to verify that the data we have on there is valid. From a hardware perspective, the 3801E card is spec'ed to handle our architecture; the OS just seems to fall over somewhere, though, and fails to throttle itself in certain IO-intensive situations.
That said, I don't know whether to point the finger at LSI's firmware or at the mpt driver/ZFS. Sun obviously has a good relationship with LSI, as the 1068E is their recommended SAS controller chip and is used in their own products. At least we've got a bug filed now,
and we can hopefully follow this through to find out where the
system breaks down.
Have you checked with LSI to verify the IOPS capability of the chip? Just because it supports having 46 drives attached to one ASIC doesn't mean it can actually service all 46 at once. You're talking (VERY conservatively) 2800 IOPS.
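(Rough arithmetic behind that figure, assuming a conservative ~60 random IOPS per 7200 RPM drive: 46 drives x ~60 IOPS/drive = ~2,760, i.e. roughly 2800 IOPS in aggregate.)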
Tim has a valid point. By default, ZFS will queue 35 commands per disk. For 46 disks that is 1,610 concurrent I/Os. Historically, it has proven to be relatively easy to crater performance or cause problems with very, very, very expensive arrays that are easily overrun by Solaris. As a result, it is not uncommon to see references to setting throttles, especially in older docs.
Fortunately, this is simple to test by reducing the number of I/Os ZFS will queue. See the Evil Tuning Guide:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29
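For example (a sketch along the lines of the guide above; 10 is just a starting point, not a recommendation), the per-vdev queue depth can be dropped from the default of 35 either on the running kernel with mdb or persistently via /etc/system:

   # show the current value
   echo zfs_vdev_max_pending/D | mdb -k

   # set it to 10 on the live system (takes effect immediately, not persistent)
   echo zfs_vdev_max_pending/W0t10 | mdb -kw

   # or add this line to /etc/system and reboot to make it stick
   set zfs:zfs_vdev_max_pending = 10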
The mpt source is not open, so the mpt driver's reaction to 1,610 concurrent I/Os can only be guessed from afar -- public LSI docs mention a figure of 511 concurrent I/Os for the SAS1068, but it is not clear to me whether that is an explicit limit. If you have success with zfs_vdev_max_pending set to 10, then the mystery might be solved. Use iostat to observe the wait and actv columns, which show the number of transactions in the queues.
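Something like this (extended statistics, per-device names, skipping idle devices, 1-second samples) will show those queues under load:

   iostat -xnz 1

With the default setting, the sum of wait + actv per device should hover near the 35-command limit when the pool is saturated; if dropping zfs_vdev_max_pending helps, you should see that sum capped near the new value and, hopefully, no more timeouts.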
NB: sometimes a driver's limit is configurable. For example, to get high performance out of a high-end array attached to a qlc card, I've set execution-throttle in /kernel/drv/qlc.conf to more than two orders of magnitude above its default of 32. /kernel/drv/mpt*.conf does not seem to have a similar throttle.
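For reference, that qlc tweak is just a property line in the conf file (the value below is illustrative; pick something your particular array can actually absorb), followed by a reboot:

   # /kernel/drv/qlc.conf
   execution-throttle=4096;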
-- richard
Even ignoring that, I know for a fact that the chip can't handle the raw throughput of 46 disks unless you've got some very severe RAID overhead. That chip is good for roughly 2GB/sec in each direction; 46 7200 RPM drives can fairly easily push 4x that amount in streaming IO loads.
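(For scale, assuming ~100-130 MB/sec of sequential throughput per 7200 RPM drive of that era: 46 drives x ~100-130 MB/sec is roughly 4.5-6 GB/sec aggregate, well over what the chip can move in one direction.)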
Long story short, it appears you're trying to fit a 50lbs load into a 5lbs bag...
--Tim
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss