On Thu, Oct 26, 2006 at 01:30:46AM -0700, Dan Price wrote:
>
> scsi: WARNING: /[EMAIL PROTECTED],700000/[EMAIL PROTECTED] (glm0):
>       Resetting scsi bus, got incorrect phase from (1,0)
> genunix: NOTICE: glm0: fault detected in device; service still available
> genunix: NOTICE: glm0: Resetting scsi bus, got incorrect phase from (1,0)
> scsi: WARNING: /[EMAIL PROTECTED],700000/[EMAIL PROTECTED] (glm0):
>       got SCSI bus reset
> genunix: NOTICE: glm0: fault detected in device; service still available
> genunix: NOTICE: glm0: got SCSI bus reset
> scsi: WARNING: /[EMAIL PROTECTED],700000/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd11):
>       auto request sense failed (reason=reset)
>
> Eventually I had to drive in to work to reboot the machine, although
> the system did not tip over.  After a reboot to single user mode, the
> same symptoms recurred (since it seems that the resilver kicked off
> again... and at a certain stage hit this problem over again).

This is where the next phase of ZFS/FMA interoperability (which I've
been sketching out for a while and am starting to work on now) will come
in handy.  Currently, ZFS will drive on forever even if a disk is
arbitrarily misbehaving.  In this case, it caused the scrub to grind to
a halt (it was likely making progress, just very slowly).  In the future
ZFS/FMA world, the number of errors on the device would have exceeded an
appropriate threshold (via a SERD engine) and the device would have been
placed into the 'FAULTED' state.  The scrub would have finished, you
would have a nice FMA message on your console, and one of the drives
would have been faulted.  There are a lot of subtleties here,
particularly w.r.t. other I/O FMA work, but we're making some progress.
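To make the thresholding concrete, here is a toy sketch of the SERD
idea: declare a fault once more than N soft errors arrive within a
sliding window of T seconds.  The names, threshold, and window below
are made up for illustration and are not the actual FMA interfaces:

    /*
     * Toy SERD-style engine: fires once more than SERD_N errors are
     * seen within a SERD_T-second window.  Illustrative values only.
     */
    #include <stdio.h>
    #include <time.h>

    #define SERD_N  10          /* hypothetical error-count threshold */
    #define SERD_T  600         /* hypothetical window, in seconds */

    typedef struct serd_eng {
            time_t  se_events[SERD_N];  /* timestamps of recent errors */
            int     se_count;           /* slots currently in use */
    } serd_eng_t;

    /* Record one error; return 1 if the engine fires (device faulted). */
    static int
    serd_record(serd_eng_t *se, time_t now)
    {
            int i, live = 0;

            /* Drop events that have aged out of the window. */
            for (i = 0; i < se->se_count; i++) {
                    if (now - se->se_events[i] <= SERD_T)
                            se->se_events[live++] = se->se_events[i];
            }
            se->se_count = live;

            if (se->se_count == SERD_N)
                    return (1);     /* too many errors, too quickly */

            se->se_events[se->se_count++] = now;
            return (0);
    }

    int
    main(void)
    {
            serd_eng_t se = { { 0 }, 0 };
            int i;

            /* Simulate a burst of back-to-back I/O errors. */
            for (i = 0; i <= SERD_N; i++) {
                    if (serd_record(&se, time(NULL)))
                            (void) printf("SERD fired: device would be "
                                "FAULTED\n");
            }
            return (0);
    }

In the real design the inputs would be FMA error reports rather than a
simulated loop, and the threshold and window would be tuned per device,
but the firing decision has the same shape.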
> The only recourse was to reboot to single user mode, rapidly log in, and
> detach the problem-causing side of the mirror.  This led me to
> suggestion #1:
>
>     - It'd be nice if auto-resilvering did not kick off until
>       sometime after we leave single user mode.

This isn't completely straightforward, but it's obviously doable.  It's
also unclear whether this is just a temporary stopgap in lieu of a
complete FMA solution.  Please file an RFE anyway so that the problem is
recorded somewhere.

> This is awesome.  I can pinpoint any corruption, which is great.
> But... So this may be a stupid question, but it's unclear how to
> locate the object in question.

See:

    6410433 'zpool status -v' would be more useful with filenames

> I did a find -inum 42073, which located some help.jar file in a copy
> of netbeans I have in the zpool.  If that's all I've lost, then
> hooray!
>
> But I wasn't sure if that was the right thing to do.  It'd be great if
> the documentation was clearer on this point:
>
> http://docs.sun.com/app/docs/doc/819-5461/6n7ht6qt1?a=view#gbcuz
>
> Just says to try 'rm' on "the file" but does not mention how to
> locate it.

Yeah, partly because there is no good way ;-)  We want the answer to be
6410433, but even then there are tricky edge conditions (such as
directories and dnode corruption) that can't simply be removed, because
they reference arbitrary amounts of metadata.  The documentation can be
improved in the meantime (to mention 'find -inum' at the very least),
but we really need to sit down again and think about what we want the
user experience to be when dealing with corruption.  (A sketch of what
'find -inum' is actually doing is appended after the signature, for
the curious.)

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
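As promised above, here is what 'find <mountpoint> -inum <number>'
boils down to, written out as a small C program so the mechanism is
explicit for the documentation discussion: walk a single filesystem and
print every path whose inode number matches.  This is illustrative
only; it is not ZFS code, and in practice plain find(1) is the right
tool:

    /*
     * Map an inode number (as reported for a corrupt object) back to
     * pathnames by walking one filesystem, which is the same thing
     * that "find <mountpoint> -inum <number>" does.
     */
    #include <ftw.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    static ino_t target_ino;

    static int
    check(const char *path, const struct stat *st, int type, struct FTW *ftw)
    {
            if (st->st_ino == target_ino)
                    (void) printf("%s\n", path);
            return (0);     /* keep walking; the object may be hard-linked */
    }

    int
    main(int argc, char **argv)
    {
            if (argc != 3) {
                    (void) fprintf(stderr, "usage: %s <mountpoint> <inode>\n",
                        argv[0]);
                    return (1);
            }
            target_ino = (ino_t)strtoull(argv[2], NULL, 10);

            /* FTW_MOUNT stays on one filesystem, like find's -xdev. */
            if (nftw(argv[1], check, 64, FTW_PHYS | FTW_MOUNT) == -1) {
                    perror("nftw");
                    return (1);
            }
            return (0);
    }

Run it against the pool's mountpoint with the inode number reported for
the corrupt object (e.g. 42073 above); directories and other metadata
objects are exactly the cases where this, like 'rm', falls short.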