>>>>> "r" == Ross  <[EMAIL PROTECTED]> writes:

     r> 1.  Loss of a server is very much the worst case scenario.
     r> Disk errors are much more likely, and with raid-z2 pools on
     r> the individual servers

Yeah, it kind of sucks that the slow resilvering speed enforces this
two-tier scheme.

Also, if you're going to have 1000 spinning platters, you'll have a
drive failure every four days or so (see the back-of-envelope
arithmetic below).  You need to be able to do more than one resilver
at a time, and you need to do resilvers without interrupting scrubs,
which could take so long to run that you run them continuously.  The
ZFS-on-zvol hack lets you do both, up to a point, but I think it's an
ugly workaround for the lack of scalability in flat ZFS, not the
ideal way to do things.
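
A back-of-envelope check on that four-day figure: with N identical,
independent drives, the pool as a whole sees a failure roughly every
MTBF/N hours.  Assuming, purely for illustration, a per-drive MTBF of
100,000 hours:

    $ echo 'scale=1; 100000 / 1000 / 24' | bc
    4.1

i.e., about one dead drive every four days across 1000 spindles.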

     r> A motherboard / backplane / PSU failure will offline that
     r> server, but once the faulted components are replaced your pool
     r> will come back online.  Once the pool is online, ZFS has the
     r> ability to resilver just the changed data,

Except that is not what actually happens in my iSCSI setup.  If I
'zpool offline' the target before taking it down, it usually does
work as you describe: a relatively fast resilver kicks off, and no
CKSUM errors appear later.  Mind you, I've only used it gently; I
haven't offlined a raidz2 device for three weeks while writing
gigabytes to the pool in the meantime.  But for my gentle use it does
seem to work.
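
For reference, the gentle sequence that works for me looks roughly
like this (pool and device names invented):

    $ zpool offline tank c2t0d0   # cleanly detach the iSCSI-backed vdev
      # ...take the target server down, fix it, bring it back up...
    $ zpool online tank c2t0d0    # kicks off the fast, changed-data resilver
    $ zpool status tank           # watch the resilver complete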

But if the iSCSI target goes down unexpectedly (e.g., because I pull
the network cord), it does come back online and does resilver, but
latent CKSUM errors show up weeks later.

Also, if the head node reboots during a resilver, ZFS totally forgets
what it was doing; upon reboot it just blindly mounts the unclean
component as if it were clean, and later calls all the differences
CKSUM errors.  The same thing happens if you offline a device and
then reboot: the ``persistent'' offlining doesn't seem to work, and
in any case the device comes back online without a proper resilver.
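
(For what it's worth, the zpool man page distinguishes persistent
from temporary offlining; it's the persistent form, the default, that
isn't surviving reboots for me.  Names invented, as before:

    $ zpool offline tank c2t0d0      # supposed to persist across reboots
    $ zpool offline -t tank c2t0d0   # explicitly temporary; expires at reboot

)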

SVM had dirty-region logging stored in the metadb so that resilvers
could continue where they left off across reboots.  I believe SVM
usually did a full resilver when a component disappeared, but I'm not
sure that was always the case.  Anyway, ZFS doesn't seem to have a
similar capability, at least not one that works.

So, in practice, whenever any iSCSI component goes away unexpectedly
(target server failure, power failure, kernel panic, L2
spanning-tree reconfiguration, whatever), you have to scrub the whole
pool from the head node.
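
Concretely, that means running, and waiting out, a full:

    $ zpool scrub tank
    $ zpool status -v tank   # progress, plus any CKSUM errors it turns up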


It's interesting how the speed and optimisation of these maintenance
activities limit pool size.  It's not just full scrubs.  If the
filesystem is subject to corruption, you need a backup.  If the
filesystem takes two months to back up or restore, then you need
really solid incremental backup/restore features, and the backup
needs to be a cold spare, not just a backup: restoring means
switching the roles of the primary and backup systems, not actually
moving data.
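
zfs send/recv gets you part of the way there already.  A sketch, with
invented host and dataset names, of a backup that only ever moves
deltas after the initial full copy:

    $ zfs snapshot tank/data@mon
    $ zfs send tank/data@mon | ssh backuphost zfs recv backup/data
      # ...a day later, send only the blocks changed since @mon:
    $ zfs snapshot tank/data@tue
    $ zfs send -i @mon tank/data@tue | ssh backuphost zfs recv backup/data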

Finally, for really big pools, even O(n) might be too slow.  The ZFS
best-practices guide for converting UFS to ZFS says ``start multiple
rsync's in parallel,'' but I think we're finding that zpool scrubs
and zfs sends are not well parallelized.
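
(The rsync half of that is easy enough to parallelize by hand;
something like this, one rsync per top-level directory, with paths
invented for illustration:

    $ for d in /ufs/export/*; do rsync -a "$d" /tank/export/ & done; wait

The point is that nothing comparable exists for scrub or send, which
are the operations that actually have to scale with the pool.)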

These reliability limitations and the performance characteristics of
maintenance tasks seem to make a sort of max-pool-size wall, beyond
which you end up painted into corners.  If they were improved, I
think you'd later hit another wall at the maximum amount of data you
could push through one head node, and you'd have to switch to some
QFS/GFS/OCFS-type separate-data-and-metadata filesystem; to match
ZFS, that filesystem would have to do scrubs, resilvers, and backups
in a distributed way, not just distribute normal data access.  A
month ago I might have ranted, ``head node speed puts a cap on how
_busy_ the filesystem can be, not how big it can be, so ZFS (modulo a
lot of bug fixes) could be fantastic for data sets of virtually
unlimited size even with its single-initiator, single-head-node
limitation, so long as the pool gets very light access.''  Now I
don't think so, because scrubbing/resilvering/backup-restore traffic
has to flow through the head node, too.

This observation also means my preference for a ``recovery tool''
that treats corrupt pools as read-only, over an fsck (online or
offline), isn't very scalable.  The original ZFS kool-aid ``online
maintenance'' model, a cheap check at import time plus a long O(n)
check via online scrubs, is the only one with a future in a world
where maintenance activities can take months.
