Hi John,

On Thu, 2008-09-11 at 20:23 -0600, John Antonio wrote:
> It is operating with Sol 10 u3 and also u4. Sun support is claiming
> the issue is related to quiet corruptions.

Probably, yes.

>  Since the ZFS structure was not cleanly exported because of the event
> (Node crash), the statement from support is that these types of
> corruptions could occur.

I don't think the corruption is caused by the pool not having been
cleanly exported.

If corruption only happens when a node crashes, there are 3 likely
causes for this problem:

1) The storage subsystem is ignoring disk write cache flush requests and
allows writes to go out-of-order, making the uberblock reach the disk
before other important metadata blocks.

2) Or you're running into a bug that is corrupting metadata.

3) Or you're experiencing memory corruption.

The first one should be fixable, and there is already a bug open for it
(CR 6667683). The second one is fixable once identified. The third one
is harder to solve.
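
To make the first cause a bit more concrete, here is a toy Python model
of it. This is only a sketch, not ZFS code: ToyDisk, is_consistent and
crash_after_uberblock_write are names I made up for illustration, and
the crash behaviour is a deliberately pessimistic assumption (only the
newest cached write survives). It shows why the flush between the
metadata write and the uberblock write matters.

```python
class ToyDisk:
    """Hypothetical disk with a volatile write cache (illustration only)."""

    def __init__(self, honors_flush):
        self.honors_flush = honors_flush
        self.cache = []    # writes acknowledged but not yet on stable storage
        self.stable = {}   # blocks that have actually reached stable storage

    def write(self, name, data):
        # Writes land in the volatile cache first.
        self.cache.append((name, data))

    def flush(self):
        # A well-behaved device drains its cache on a flush request.
        # A misbehaving one acknowledges the request and does nothing.
        if self.honors_flush:
            for name, data in self.cache:
                self.stable[name] = data
            self.cache.clear()

    def crash(self):
        # Assumed worst-case reordering: only the newest cached write
        # reached stable storage before the node went down.
        if self.cache:
            name, data = self.cache[-1]
            self.stable[name] = data
        self.cache.clear()
        return self.stable


def is_consistent(stable):
    """The uberblock must never point at metadata that is not on disk."""
    uberblock = stable.get("uberblock")
    if uberblock is None:
        return True  # nothing to dereference yet
    txg = uberblock.split()[-1]
    return stable.get("metadata") == f"tree for txg {txg}"


def crash_after_uberblock_write(honors_flush):
    disk = ToyDisk(honors_flush)
    disk.write("metadata", "tree for txg 1")
    disk.flush()  # barrier: metadata should be stable before the uberblock
    disk.write("uberblock", "points to txg 1")
    return is_consistent(disk.crash())


print(crash_after_uberblock_write(True))   # device honors the flush
print(crash_after_uberblock_write(False))  # device ignores the flush
```

In this model the device that honors the flush always has the metadata
on stable storage before the uberblock can land, while the device that
ignores it can end up with an uberblock pointing at metadata that was
never written.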

The ZFS team and a few folks in the Lustre group are looking into making
ZFS more resilient against corrupted metadata, but this is definitely a
hard-to-solve issue.

>  The panic response is apparently the expected behavior during a zpool
> import if this situation occurs.

I wouldn't say that is the expected behavior.. :-) I'd say a panic when
importing a pool is a bug.

>  Apparently in u6, there will be additional zpool import options that
> will make the identification of a corruption a passive event. The pool
> won't import but instead of panicking the server it would respond with
> a failure status.

Interesting.. I'd love to see the CR for this.

>  Regardless of a passive response or not, it concerns me that the
> condition can occur period. Not that other filesystems don't
> experience silent corruptions, the concern here is ZFS had been
> promoted as overcoming these exact issues.

"Silent corruptions" is a bit vague :-)
ZFS is promoted as being resilient against most kinds of disk
corruption, but memory corruption and potential bugs are different
issues.

Note that there may be several causes for panicking when importing a
pool, depending on which metadata was corrupted. Some may be easily
fixable, others may be harder.
That's why providing a stack trace of the panic would help identify
which particular issue you're running into.

Also note that efforts are under way to solve these problems.
As an example, Victor Latushkin has very recently identified a similar
panic when importing a pool (CR 6720531) and provided a patch that
allows the corrupted pool to be successfully imported (only for that
particular kind of corruption, of course).

>  The fact that it has been certified to work in a cluster deployment,
> this situation suggests that it may not be ready or a significant bug
> exists.

Yes, it does appear that you've run into a significant bug.
Knowing the exact bug you're running into would be helpful.

Best regards,
Ricardo


