Hi John,

On Thu, 2008-09-11 at 20:23 -0600, John Antonio wrote:
> It is operating with Sol 10 u3 and also u4. Sun support is claiming
> the issue is related to quiet corruptions.
Probably, yes.

> Since the ZFS structure was not cleanly exported because of the event
> (Node crash), the statement from support is that these types of
> corruptions could occur.

I don't think the corruption happened because the pool wasn't cleanly
exported. If corruption only happens when a node crashes, there are 3
likely causes for this problem:

1) The storage subsystem is ignoring disk write cache flush requests
   and allowing writes to go out of order, so the uberblock can reach
   the disk before other important metadata blocks.
2) You're running into a bug that is corrupting metadata.
3) You're experiencing memory corruption.

The first one should be fixable, and there is already a bug open for it
(CR 6667683). The second is fixable once the bug is identified. The
third is harder to solve. The ZFS team and a few folks in the Lustre
group are looking into making ZFS more resilient against corrupted
metadata, but this is definitely a hard problem to solve.

> The panic response is apparently the expected behavior during a zpool
> import if this situation occurs.

I wouldn't say that is the expected behavior.. :-) I'd say a panic when
importing a pool is a bug.

> Apparently in u6, there will be additional zpool import options that
> will make the identification of a corruption a passive event. The pool
> won't import but instead of panicing the server it would respond with
> a failure status.

Interesting.. I'd love to see the CR for this.

> Regardless of a passive response or not, it concerns me that the
> condition can occur period. Not that other filesystems don't
> experience silent corruptions, the concern here is ZFS had been
> promoted as overcoming these exact issues.

"Silent corruptions" is a bit vague :-) ZFS is promoted as being
resilient against most kinds of disk corruption, but memory corruption
and potential bugs are different issues.

Note that there may be several causes for panicking when importing a
pool, depending on which metadata was corrupted.
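If the node saved a kernel crash dump, the panic stack can usually be
pulled out of it with mdb. A minimal sketch, assuming the default
savecore directory (/var/crash/<hostname>) and that the most recent
dump pair is unix.0/vmcore.0 (adjust the numbers for your system):

```shell
# Check where savecore writes crash dumps on this node.
dumpadm

cd /var/crash/$(hostname)

# Open the dump pair and print the panic summary, the panicking
# thread's stack trace, and the tail of the console message buffer.
mdb unix.0 vmcore.0 <<'EOF'
::status
::stack
::msgbuf
EOF
```

The ::stack output is the piece that makes it possible to match your
panic against a known CR.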
Some causes may be easily fixable, others harder. That's why providing
a stack trace of the panic would help identify which particular issue
you're running into.

Also note that there are efforts being made to solve these problems. As
an example, Victor Latushkin very recently identified a similar panic
when importing a pool (CR 6720531) and provided a patch that allows the
corrupted pool to be successfully imported (only for that particular
kind of corruption, of course).

> The fact that it has been certified to work in a cluster deployment,
> this situation suggests that it may not be ready or a significant bug
> exists.

Yes, it does appear that you've run into a significant bug. Knowing the
exact bug you're running into would be helpful.

Best regards,
Ricardo

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss