On Tue, Nov 25, 2008 at 11:55:17AM +0100, [EMAIL PROTECTED] wrote:
> >My idea is simply to allow the pool to continue operation while
> >waiting for the drive to fault, even if that's a faulty write.  It
> >just means that the rest of the operations (reads and writes) can keep
> >working for the minute (or three) it takes for FMA and the rest of the
> >chain to flag a device as faulty.
>
> Except when you're writing a lot; 3 minutes can cause a 20GB backlog
> for a single disk.
If we're talking isolated, or even clumped-but-relatively-few bad
sectors, then having a short timeout for writes and remapping should be
possible to do without running out of memory to cache those writes.
But...

...writes to bad sectors will happen when txgs flush, and depending on
how bad sector remapping is done (say, by picking a new block address
and changing the blkptrs that referred to the old one) that might mean
redoing large chunks of the txg in the next one, which might mean that
fsync() could be delayed an additional 5 seconds or so.  And even if
that's not the case, writes to mirrors are supposed to be synchronous,
so one would think that bad block remapping should be synchronous also,
thus there must be a delay on writes to bad blocks no matter what --
though that delay could be tuned to be no more than a few seconds.

That points to a possibly decent heuristic on writes: vdev-level
timeouts that result in bad block remapping, but if the queue of
outstanding bad block remappings grows too large -> treat the disk as
faulted and degrade the pool.

Sounds simple, but it needs to be combined at a higher layer with
information from other vdevs.  Unplugging a whole jbod shouldn't
necessarily fault all the vdevs on it -- perhaps it should cause pool
operation to pause until the jbod is plugged back in... which should
then cause those outstanding bad block remappings to be rolled back
since they weren't bad blocks after all.

That's a lot of fault detection and handling logic across many layers.

Incidentally, cables do fall out, or, rather, get pulled out
accidentally.  What should be the failure mode of a jbod disappearing
due to a pulled cable (or power supply failure)?  A pause in operation
(hangs)?  Or faulting of all affected vdevs, and if you're mirrored
across different jbods, incurring the need to re-silver later, with
degraded operation for hours on end?

I bet answers will vary.
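To make the write heuristic above a bit more concrete, here is a rough
sketch of it as a toy state machine.  This is not ZFS code: the class,
the method names, the 2-second timeout and the queue limit of 64 are all
made up for illustration, and it ignores the cross-vdev logic entirely.

```python
from dataclasses import dataclass, field


@dataclass
class VdevWritePolicy:
    """Toy model of the heuristic; all names and thresholds are hypothetical."""
    write_timeout_s: float = 2.0    # assumed per-write timeout
    max_pending_remaps: int = 64    # assumed limit before faulting the disk
    pending_remaps: list = field(default_factory=list)
    faulted: bool = False

    def on_write_done(self, block: int, elapsed_s: float) -> str:
        """Classify one write as 'ok', 'remap' or 'faulted'."""
        if self.faulted:
            return "faulted"
        if elapsed_s <= self.write_timeout_s:
            return "ok"
        # Timed out: treat the block as bad and queue a remapping.
        self.pending_remaps.append(block)
        if len(self.pending_remaps) > self.max_pending_remaps:
            # Too many outstanding remappings: give up on the disk.
            self.faulted = True
            return "faulted"
        return "remap"

    def on_device_returned(self) -> list:
        """Device came back (e.g. cable re-plugged): hand back the
        outstanding remappings so the caller can roll them back."""
        rolled_back = list(self.pending_remaps)
        self.pending_remaps.clear()
        return rolled_back


if __name__ == "__main__":
    policy = VdevWritePolicy(write_timeout_s=1.0, max_pending_remaps=2)
    for blk, secs in [(1, 0.1), (2, 5.0), (3, 5.0), (4, 5.0)]:
        print(blk, policy.on_write_done(blk, secs))
```

The hard part, deciding at a higher layer whether a burst of timeouts
means "bad disk" or "whole jbod went away", is exactly what this sketch
leaves out.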

The best answer is to provide enough redundancy (multiple power
supplies, multi-pathing, ...) to make such situations less likely, but
that's not a complete answer.

Nico
--
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss