>>>>> "r" == Ross <[EMAIL PROTECTED]> writes:
r> 1. Loss of a server is very much the worst case scenario.
r> Disk errors are much more likely, and with raid-z2 pools on
r> the individual servers

Yeah, it kind of sucks that the slow resilvering speed enforces this
two-tier scheme.  Also, if you're going to have 1000 spinning platters,
you'll have a drive failure every four days or so.  You need to be able
to run more than one resilver at a time, and you need to run resilvers
without interrupting scrubs, which could take so long that you end up
running them continuously.  The ZFS-on-zvol hack lets you do both to a
point, but I think it's an ugly workaround for the lack of scalability
in flat ZFS, not the ideal way to do things.

r> A motherboard / backplane / PSU failure will offline that
r> server, but once the faulted components are replaced your pool
r> will come back online. Once the pool is online, ZFS has the
r> ability to resilver just the changed data,

except that is not what actually happens with my iSCSI setup.  If I
'zpool offline' the target before taking it down, it usually does work
as you describe: a relatively fast resilver kicks off, and no CKSUM
errors appear later.  I've only used it gently (I haven't offlined a
raidz2 device for three weeks while writing gigabytes to the pool in
the meantime), but for that gentle use it does seem to work.

But if the iSCSI target goes down unexpectedly, e.g. because I pull the
network cord, it does come back online and does resilver, but latent
CKSUM errors show up weeks later.

Also, if the head node reboots during a resilver, ZFS totally forgets
what it was doing: upon reboot it just blindly mounts the unclean
component as if it were clean, and later calls all the differences
CKSUM errors.  The same thing happens if you offline a device and then
reboot.  The ``persistent'' offlining doesn't seem to work, and in any
case the device comes back online without a proper resilver.

SVM had dirty-region logging stored in the metadb so that resilvers
could continue where they left off across reboots.  I believe SVM
usually did a full resilver when a component disappeared, but I'm not
sure that was always the case.  Anyway, ZFS doesn't seem to have a
similar capability, at least not one that works.

So, in practice, whenever any iSCSI component goes away unexpectedly
(target server failure, power failure, kernel panic, L2 spanning-tree
reconfiguration, whatever), you have to scrub the whole pool from the
head node.

It's interesting how the speed and optimisation of these maintenance
activities limit pool size.  It's not just full scrubs.  If the
filesystem is subject to corruption, you need a backup.  If the
filesystem takes two months to back up or restore, then you need really
solid incremental backup/restore features, and the backup needs to be a
cold spare, not just a backup: restoring means switching the roles of
the primary and backup systems, not actually moving data.

Finally, for really big pools, even O(n) might be too slow.  The ZFS
best-practices guide for converting UFS to ZFS says ``start multiple
rsync's in parallel,'' but I think we're finding that zpool scrubs and
zfs sends are not well parallelized.

These reliability limitations and the performance characteristics of
maintenance tasks seem to make a sort of max-pool-size Wall beyond
which you end up painted into corners.
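To put rough numbers behind the ``failure every four days'' and ``two
months to back up'' claims, here is a back-of-envelope sketch in
Python.  The MTBF, pool size, and throughput figures are illustrative
assumptions of mine, not measurements from any real setup:

    # Back-of-envelope maintenance arithmetic for a big pool.  All the
    # input figures (MTBF, pool size, throughput) are made-up
    # assumptions for illustration, not measured values.

    def mean_days_between_failures(n_drives, mtbf_hours=100_000):
        # Expected interval between failures anywhere in the pool,
        # assuming independent drives with the given per-drive MTBF.
        return mtbf_hours / n_drives / 24.0

    def full_pass_days(pool_bytes, bytes_per_second):
        # Days for any O(n) pass over the pool (scrub, full resilver,
        # or a full zfs send) at a sustained aggregate throughput.
        return pool_bytes / bytes_per_second / 86_400.0

    # ~1000 spindles at a 100k-hour MTBF: a failure roughly every 4 days.
    print(mean_days_between_failures(1000))       # ~4.2

    # A 1 PiB pool pushed through one head node at 200 MB/s: a single
    # scrub or backup pass takes on the order of two months.
    print(full_pass_days(2**50, 200e6))           # ~65 days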
If those maintenance operations were made faster, I think you'd later
hit another wall at the maximum amount of data you can push through one
head node, and you'd have to switch to some QFS/GFS/OCFS-type
separate-data-and-metadata filesystem; to match ZFS, that filesystem
would have to do scrubs, resilvers, and backups in a distributed way,
not just distribute normal data access.

A month ago I might have ranted, ``head node speed puts a cap on how
_busy_ the filesystem can be, not how big it can be, so ZFS (modulo a
lot of bug fixes) could be fantastic for data sets of virtually
unlimited size even with its single-initiator, single-head-node
limitation, so long as the pool gets very light access.''  Now, I don't
think so, because scrubbing, resilvering, and backup/restore have to
flow through the head node, too.

This observation also means my preference for a ``recovery tool'' that
treats corrupt pools as read-only, over an fsck (online or offline),
isn't very scalable.  The original ZFS kool-aid ``online maintenance''
model, a cheap fsck at import time plus a long O(n) fsck through online
scrubs, is the only one with a future in a world where maintenance
activities can take months.
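To illustrate why maintenance that all flows through one head node ends
up capping pool size, here is one more tiny sketch.  The throughput,
scrub interval, and bandwidth split below are my own guesses, picked
only to show the shape of the ceiling:

    # Rough pool-size ceiling imposed by maintenance traffic alone, on
    # the assumption that every scrub/resilver/send byte flows through
    # one head node.  Throughput, scrub interval, and the bandwidth
    # split are illustrative guesses.

    def max_pool_bytes(head_bytes_per_second, scrub_interval_days,
                       maintenance_fraction=0.5):
        # Largest pool for which one full scrub fits in the target
        # interval while leaving the rest of the head node's bandwidth
        # for user I/O and backup sends.
        return (head_bytes_per_second * maintenance_fraction
                * scrub_interval_days * 86_400)

    # 500 MB/s head node, scrub every 30 days, half the bandwidth
    # reserved for maintenance: roughly 0.6 PiB before the Wall.
    print(max_pool_bytes(500e6, 30) / 2**50)      # ~0.58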