Hello,

Recently, one of the disks in a raidz1 on my OpenSolaris (snv_118)
file server failed.
It continued operating in the DEGRADED state for a day or so until I noticed,
at which point I shut the machine down, removed the faulted disk and turned
it back on (to confirm I had removed the correct disk).
When I then installed the replacement disk, I wasn't able to bring the pool
out of the FAULTED state.

I know I replaced the correct disk, and I can see that the disk I removed
was indeed bad, as it generates hard errors whenever I reconnect it.
(On a side note, those disk hard errors really kill system performance -
if the rpool disk is still fine, why do they have such an impact on the
whole system?)

Hardware:
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 4-core
8GB (2G + 2G + 2G + 2G); 8G maximum
Gigabyte EP45-DS3 Motherboard
    Intel Corporation 82801JI (ICH10 Family) 2 port SATA IDE Controller
    Intel Corporation 82801JI (ICH10 Family) 4 port SATA IDE Controller

The pool originally had 5 Western Digital 1TB disks in raidz1 (all SATA II disks).
I replaced the failed one with an equivalent Seagate model (that was what the shop had in stock).
My rpool is c8d0.

The pool:
    NAME        STATE     READ WRITE CKSUM
    data1       FAULTED      0     0     1  bad intent log
      raidz1    DEGRADED     0     0     6
        c8d1    ONLINE       0     0     0
        c10d0   UNAVAIL      0     0     0  cannot open
        c11d0   ONLINE       0     0     0
        c9d0    ONLINE       0     0     0
        c9d1    ONLINE       0     0     0

As the raidz1 state is DEGRADED and 4/5 of my disks are ONLINE, I'm
confident no data has been lost (yet) beyond those few CKSUM errors.
Losing the 6 files with failed checksums while the raidz1 was missing its
parity is acceptable, especially compared to losing the entire pool.
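
I assume that, once the pool is importable again, something like this will
show exactly which files those checksum errors hit (a guess at the
invocation, not verified on this pool yet):

    zpool status -v data1    # -v should list the files with permanent errors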

zpool clear is the suggested action in the documentation, and I am not
interested in preserving the intent log.
When I tried it, the command failed. It also marked every device in the
pool as FAULTED until I rebooted, after which they returned to the states
shown above.
The replacement disk shows up as c0d0 instead of c10d0.
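
For what it's worth, this is roughly how I confirmed the replacement's
device name (reconstructed from memory, so the details are approximate):

    format        # the new disk shows up as c0d0 in the disk list
    iostat -En    # the Seagate model string appears against c0d0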

The core of the problem is this:
`zpool clear data1` - fails because c10d0 is unavailable, and this puts
all the devices into the FAULTED state.
`zpool replace data1 c10d0 c0d0` - fails because the pool is faulted and
therefore inaccessible.
I don't know whether these failures might instead be caused by those CKSUM errors.

Perhaps something along the lines of `zpool clear -f data1` is required?
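
In other words, the sequence I'm hoping to get through, once the pool lets
me, is roughly this (a sketch, assuming c0d0 really is the right
replacement device):

    zpool clear data1                 # discard the intent log and the CKSUM errors
    zpool replace data1 c10d0 c0d0    # resilver onto the new Seagate
    zpool scrub data1                 # then verify the whole pool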

I found a couple of threads on the mailing list where a very similar
issue was described:
http://mail.opensolaris.org/pipermail/zfs-discuss/2009-January/025481.html
http://mail.opensolaris.org/pipermail/zfs-discuss/2009-January/025574.html

Things I have tried:
- zpool clear data1 - fails due to c10d0 being unavailable.
- zpool online data1 c10d0 - fails due to the pool being faulted and
therefore inaccessible.
- zpool replace data1 c10d0 c0d0 - fails due to the pool being faulted
and therefore inaccessible.
- zpool replace -f data1 c10d0 c0d0 - fails as above
- restarting, then replacing or clearing - fails as above
- symlinking the c0d0 device nodes to c10d0 names (ln -s), then replacing
- fails as above (see the sketch after this list)
- removing /etc/zfs/zpool.cache, then restarting - the pool fails to
import (`zpool import` shows all devices as FAULTED)
- restarting with the new hard disk not attached
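
For completeness, the symlink attempt mentioned above was roughly the
following (reconstructed from memory, so the slice names may not be exact -
the actual output is in the link below):

    cd /dev/dsk
    for s in 0 1 2 3 4 5 6 7; do ln -s c0d0s$s c10d0s$s; done
    # (similarly under /dev/rdsk), then:
    zpool replace data1 c10d0 c0d0    # still failed as above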

Some output from these attempts:
http://dersaidin.ath.cx/other/osol/commands.txt

Output from zdb -l for each disk.
http://dersaidin.ath.cx/other/osol/zdbout.txt
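
For anyone not wanting to follow the link: zdb -l prints the four ZFS
labels kept on each device, so my expectation is that the four surviving
disks still carry the data1 configuration while the brand-new c0d0 has no
valid labels yet, e.g.:

    zdb -l /dev/dsk/c8d1s0    # should show the data1 pool config for this vdev
    zdb -l /dev/dsk/c0d0s0    # should find no valid labels on the new disk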

Thanks,
Andrew Browne
