Ugh, thanks for exploring this and isolating the problem. We will look
into what is going on (wrong) here. I have filed bug:
6545015 RAID-Z resilver broken
to track this problem.
-Mark
Marco van Lienen wrote:
On Sat, Apr 07, 2007 at 05:05:18PM -0500, in a galaxy far far away, Chris
Csanady said:
In a recent message, I detailed the excessive checksum errors that
occurred after replacing a disk. It seems that after a resilver
completes, it leaves a large number of blocks in the pool which fail
to checksum properly. Afterward, it is necessary to scrub the pool in
order to correct these errors.
After some testing, it seems that this only occurs with RAID-Z. The
same behavior can be observed on both snv_59 and snv_60, though I do
not have any other installs to test at the moment.
A colleague at work and I followed the same steps, including running a
digest on /test/file, on an SXCE 61 build today, and can confirm the exact
same (and disturbing?) result.
My colleague mentioned to me he has witnessed the same 'resilver' behavior on
builds 57 and 60.
The box on which these steps were performed was 'luupgraded' from SXCE 60 to
61 using the SUNWlu* packages from 61!
# cat /etc/release
Solaris Nevada snv_61 X86
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 26 March 2007
# mkdir /tmp/test
# mkfile 64m /tmp/test/0 /tmp/test/1
# zpool create test raidz /tmp/test/0 /tmp/test/1
# mkfile 16m /test/file
# digest -v -a sha1 /test/file
sha1 (/test/file) = 3b4417fc421cee30a9ad0fd9319220a8dae32da2
#
# zpool export test
# rm /tmp/test/0
# zpool import -d /tmp/test test
# mkfile 64m /tmp/test/0
# zpool replace test /tmp/test/0
# digest -v -a sha1 /test/file
sha1 (/test/file) = 3b4417fc421cee30a9ad0fd9319220a8dae32da2
# zpool status test
pool: test
state: ONLINE
scrub: resilver completed with 0 errors on Wed Apr 11 15:19:15 2007
config:
        NAME             STATE     READ WRITE CKSUM
        test             ONLINE       0     0     0
          raidz1         ONLINE       0     0     0
            /tmp/test/0  ONLINE       0     0     0
            /tmp/test/1  ONLINE       0     0     0
errors: No known data errors
# zpool scrub test
#
# zpool status test
pool: test
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: scrub completed with 0 errors on Wed Apr 11 15:22:30 2007
config:
        NAME             STATE     READ WRITE CKSUM
        test             ONLINE       0     0     0
          raidz1         ONLINE       0     0     0
            /tmp/test/0  ONLINE       0     0    17
            /tmp/test/1  ONLINE       0     0     0
errors: No known data errors
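For anyone repeating this on other builds, the CKSUM column in the status output above can be checked mechanically rather than by eye. The following is only a rough sketch: the function name cksum_errors is made up for illustration, and it assumes the five-column config table (NAME STATE READ WRITE CKSUM) with plain integer counts, as in the output shown here.

```shell
# Hypothetical helper: read `zpool status` text on stdin and print
# "name count" for every config row whose CKSUM column is non-zero.
# Assumes the five-column table layout shown in the transcript above.
cksum_errors() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $5 == "CKSUM" { in_table = 1; next }
        # Table rows carry an integer in the fifth (CKSUM) column.
        in_table && $5 ~ /^[0-9]+$/ { if ($5 > 0) print $1, $5; next }
        # Any other line (blank line, "errors: ...") ends the table.
        in_table { in_table = 0 }
    '
}
```

Run as `zpool status test | cksum_errors` after the scrub, this should print one line per affected row; for the post-scrub output above that would be `/tmp/test/0 17`, and nothing at all for the post-resilver output.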
I don't think these checksum errors are a good sign.
The sha1 digest of the file *is* the same, so the question arises: is the
resilver process truly broken, even though in this test case the file
appears to be unchanged based on the sha1 digest?
Marco
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss