>>>>> "bh" == Brandon High <bh...@freaks.com> writes:

    bh> For those 5 minutes, you'll see horrible performance. If the
    bh> drive returns an error within 7-10 seconds, it would only take
    bh> 35-50 seconds to fail.

For those 1 - 5 minutes, AIUI you see NO performance, not bad
performance.  And pools other than the one containing the failing
drive may be frozen as well, e.g. for NFS client mounts.

But if it were just the difference between a 5min freeze when a drive
fails and a 1min freeze when a drive fails, I don't see that anyone
would care---both are bad enough to trip upper-layer application
timeouts in iSCSI connections and load balancers, but neither is
disastrous.

But it's not.  ZFS doesn't immediately offline the drive after 1 read
error.  Some people find it doesn't offline the drive at all, until
they notice which drive is taking multiple seconds to complete
commands and offline it manually.  So you get 1 - 5 minute freezes
several times a day, every time the slowly-failing drive hits a latent
sector error.

I'm saying the works:notworks comparison is not between TLER-broken
and non-TLER-broken.  I think the TLER fans are taking advantage of
people's binary debating bias to imply that TLER is the ``works OK''
case and non-TLER is ``broken: dont u see it's 5x slower.''  There are
three cases to compare for any given failure mode: TLER-failed,
non-TLER-failed, and working.  The proper comparison is therefore
between a successful read (7ms) and an unsuccessful read (7000ms * <n>
cargo-cult retries put into various parts of the stack to work around
some scar someone has on their knee from some weird thing an FC switch
once did in 1999).
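
Putting rough numbers on that comparison (a back-of-the-envelope
sketch in Python; the 7ms/7s/60s figures and the retry count are
illustrative assumptions, not measurements from any particular stack):

    # Latency of a working read vs. a failed read, with and without
    # TLER.  All figures below are illustrative assumptions.
    working_read_s    = 0.007  # ~7ms: a normal rotating-disk read
    tler_timeout_s    = 7.0    # TLER drive gives up after ~7s
    no_tler_timeout_s = 60.0   # non-TLER drive may grind for a minute
    stack_retries     = 5      # cargo-cult retries through the stack

    tler_fail_s    = tler_timeout_s * stack_retries      # 35s
    no_tler_fail_s = no_tler_timeout_s * stack_retries   # 300s

    print("TLER fail:     %6.0fx slower than a good read"
          % (tler_fail_s / working_read_s))              # ~5000x
    print("non-TLER fail: %6.0fx slower than a good read"
          % (no_tler_fail_s / working_read_s))           # ~42857x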

The unsuccessful read is thousands of times slower than normal
performance.  It doesn't make your array seem 5x slower during the
failure, like the false TLER vs non-TLER comparison makes it seem.  It
makes your array seem entirely frozen.  The actual speed doesn't
matter: it's FROZEN.  Having TLER does not make FROZEN any faster than
FROZEN.

The story here sounds great, so I can see why it spreads so well:
``during drive failures, the array drags performance a little, maybe
5x, until you locate the drive and replace it.  However, if you have
used +1 MAGICAL DRIVES OF RECKONING, the dragging is much reduced!
Unfortunately +1 magical drives are only appropriate for ENTERPRISE
use while at home we use non-magic drives, but you get what you pay
for.''  That all sounds fair, reasonable, and like good fun gameplay.
Unfortunately ZFS isn't a video game: it just fucking freezes.

    bh> The difference is that a fast fail with ZFS relies on ZFS to
    bh> fix the problem rather than degrading the array.

OK, but the decision of ``degrading the array'' means ``not sending
commands to the slowly-failing drive any more''.

Which is actually the correct decision; the wrong course is to keep
sending commands there and ``patiently waiting'' for them to fail
instead of re-issuing them to redundant drives, even when the wait is
thousands of standard deviations outside the mean request time.  TLER
or not, a failing drive will poison the array by making reads
thousands of times slower.
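
To make ``thousands of standard deviations'' concrete, here is a
minimal sketch (in Python, and NOT anything ZFS actually implements)
of the policy I'm arguing for: track each drive's recent service
times, and reissue any request that's been outstanding far longer
than the distribution says it should be:

    import statistics

    class HedgedReadPolicy:
        """Reissue a read to a redundant drive instead of waiting
        on a drive whose request is absurdly late."""
        def __init__(self, cutoff_sigmas=50, min_samples=100):
            self.samples = []         # recent service times, seconds
            self.cutoff_sigmas = cutoff_sigmas
            self.min_samples = min_samples

        def record(self, service_time_s):
            self.samples.append(service_time_s)
            self.samples = self.samples[-1000:]   # sliding window

        def should_reissue(self, outstanding_s):
            if len(self.samples) < self.min_samples:
                return False
            mean = statistics.mean(self.samples)
            sd = statistics.pstdev(self.samples) or mean  # sd==0 guard
            return outstanding_s > mean + self.cutoff_sigmas * sd

With a healthy drive averaging 7ms +/- 2ms, even that absurdly
conservative 50-sigma cutoff fires after ~107ms, long before the
7000ms a TLER drive makes you wait, never mind a non-TLER one.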

And ZFS or HW, fail or degrade, the problem is still fixed for the
upper layers.  You make it sound like ``degrading the array'' means
the upper layers got an error from the HW controller but good data
from ZFS.  Not so.  If anything, the thread above ZFS gave up waiting
on read() for ``fixed'' data to come back and got killed by a request
timeout, or the user pressed ^Z^Z^C^C^C^C^C^\^\^acpkill -9 vi


If the disk manufacturers could find a way to make all errors return
in 7 seconds (to reduce the number of HW RAID 'degraded' marks leading
to warranty returns), but still charge people double for drives that
have some silly feature they think they need, I bet they'd do it.  The
only real problem we've got is the one we've always had: the Solaris
storage stack and vdev layer don't handle slowly-failing drives with
any reasonable grace, and that is how most drives fail.

Now suppose they built a drive with a ``streaming'' mode (modeled in
code after the list):

 * with the ``streaming'' jumper in place, drive starts spun down.

 * drive must be sent a magical ENABLE command, otherwise it returns
   failure to everything.  Once the magical command is sent, the rules
   below apply.

 * the first read must be LBA 0; that read spins up the drive.

 * the drive's head now ignores you, and reads from one end of the
   disk to the other, dumping data into the on-disk cache.

 * if you issue a read that's a higher LBA than the head's current
   position, then your read WAITS for the head to pass that position.
   This is the only time any command waits on mechanics.

 * if you issue a read that's in the cache, it returns data from the
   cache and ``success''.

 * if you issue a read of lower LBA than the head, and the data is not
   in the cache, then the read immediately returns ``failure''.

 * no writes.

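A toy model of those rules (Python; the bounded cache size and the
way I model ``waiting'' by simply teleporting the head forward are
simplifying assumptions, not a spec):

    class StreamingDrive:
        def __init__(self, size_lbas, cache_lbas):
            self.size = size_lbas
            self.cache_lbas = cache_lbas  # how many LBAs cache holds
            self.enabled = False          # ENABLE not yet sent
            self.head = None              # None == still spun down

        def enable(self):                 # the magical ENABLE command
            self.enabled = True

        def write(self, lba, data):
            return "failure"              # no writes, ever

        def read(self, lba):
            if not self.enabled:
                return "failure"          # everything fails pre-ENABLE
            if self.head is None:
                if lba != 0:
                    return "failure"      # first read must be LBA 0
                self.head = 0             # spin up; the sweep begins
            if lba >= self.size:
                return "failure"          # off the end of the disk
            if lba >= self.head:
                self.head = lba           # WAIT for the head (the only
                return "success"          # command that blocks)
            if lba >= self.head - self.cache_lbas:
                return "success"          # behind the head, cached
            return "failure"              # fell out of cache: instant
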
I might pay extra for that feature, but it's more a desktop-grade
feature for recovering data.  If the drive's in an array, I may as
well send it back the first time it reports an error and let the
manufacturer deal with it while I resilver.  The only question is
FINDING the bad drive, which doesn't seem any easier with or without
TLER: either way you wait until you notice your pool freezing now and
then, and then you look at 'iostat' for service times that are a
thousand times too high.
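
Something like the following is what that hunt amounts to (a rough
Python sketch; it assumes Solaris 'iostat -xn' output, where asvc_t,
the active service time in ms, is the 8th column and the device name
is the last, so adjust the column indices for your platform):

    import subprocess

    THRESHOLD_MS = 700.0   # ~100x a healthy ~7ms read; tune to taste

    out = subprocess.run(["iostat", "-xn", "5", "2"],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if len(fields) < 11 or fields[-1] == "device":
            continue                   # headers, blank lines, etc.
        try:
            asvc_t = float(fields[7])  # active service time, ms
        except ValueError:
            continue
        if asvc_t > THRESHOLD_MS:
            print("%s: asvc_t %.0fms -- candidate for 'zpool offline'"
                  % (fields[-1], asvc_t))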
