> Is your ZFS pool configured with redundancy (e.g. mirrors, raidz) or is
> it non-redundant? If non-redundant, then there is not much that ZFS
> can really do if a device begins to fail.

It's RAID 10 (more info here: 
http://www.opensolaris.org/jive/thread.jspa?threadID=57425):

NAME        STATE     READ WRITE CKSUM
box5        ONLINE       0     0     4
  mirror    ONLINE       0     0     2
    c1d0    ONLINE       0     0     4
    c2d0    ONLINE       0     0     4
  mirror    ONLINE       0     0     2
    c2d1    ONLINE       0     0     4
    c1d1    ONLINE       0     0     4
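
For reference, a pool with that layout would have been created with something 
like the following (just a sketch; pool and device names taken from the status 
output above):

zpool create box5 mirror c1d0 c2d0 mirror c2d1 c1d1

That is two 2-way mirrors striped together, which is why I call it RAID 10.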

Actually, there is no damaged data so far. I don't get any "unable to 
read/write" kind of errors. It's just these very strange checksum errors, 
synchronized across all disks.
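
If it helps, the usual way to double-check would be something along these 
lines (a sketch; box5 is the pool shown above): reset the counters, re-read 
everything, and see whether the errors come back.

zpool clear box5       # reset the READ/WRITE/CKSUM counters
zpool scrub box5       # re-read all data and verify checksums
zpool status -v box5   # -v lists any files with unrecoverable errors

The catch is that scrubbing is exactly what ends in the panic below.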

> That's a bit harsh.  ZFS is telling you that you have corrupted data 
> based on the checksums.  Other types of filesystems would likely simply 
> pass the corrupted data on silently.

Checksums are good, no complaints about that.

> Do you have the panic messages?  ZFS won't cause panics based on bad 
> checksums.  It will by default cause panic if it can't write data out to 
> any device or if it completely loses access to non-redundant devices or 
> loses both redundant devices at the same time.

A number of panic messages and the crash dump stack trace are attached to the 
original post (http://www.opensolaris.org/jive/thread.jspa?threadID=57425). 
Here is a short snippet:

> ::status
debugging crash dump vmcore.5 (64-bit) from core
operating system: 5.10 Generic_127112-07 (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffe800017f8d0 addr=238 
occurred in module "unix" due to a NULL pointer dereference
dump content: kernel pages only
>
> ::stack
mutex_enter+0xb()
zio_buf_alloc+0x1a()
zio_read+0xba()
spa_scrub_io_start+0xf1()
spa_scrub_cb+0x13d()
traverse_callback+0x6a()
traverse_segment+0x118()
traverse_more+0x7b()
spa_scrub_thread+0x147()
thread_start+8()
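
For completeness, the snippet above comes from running mdb against the saved 
dump, roughly like this (assuming the default /var/crash/<hostname> location):

cd /var/crash/`hostname`
mdb unix.5 vmcore.5
> ::status
> ::stack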

> Since this seems to show the same number of checksum errors across 2 
> different channels and 4 different drives.  Given that, I'd assume that 
> this is likely a dual-channel HBA of some sort.  It would appear that 
> you either have bad hardware or some sort of driver issue.

You're right, this is Intel's dual-channel ICH6 SATA controller. 10U4 has 
native support/drivers for this SATA controller (AHCI drivers, AFAIK). The 
thing is that this hardware and ZFS have been in production for almost 2 years 
(OK, not the best argument). However, this problem only appeared recently 
(about 20 days ago). It's even stranger because I didn't make any OS/driver 
upgrade or apply any patch during the last 2-3 months.

However, this is a good point. I've seen that some new SATA/AHCI drivers are 
available in 10U5. Maybe I should try to upgrade and see if that helps. 
Thanks, Phil.
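
Before upgrading it's probably also worth checking whether FMA or the driver 
layer has logged anything against the controller or the individual disks, 
roughly (a sketch; exact ereport classes and driver name will vary):

fmdump -e                    # error telemetry; look for ereport.fs.zfs.* / ereport.io.* entries
fmadm faulty                 # any faults FMA has already diagnosed
iostat -En                   # per-device soft/hard/transport error counters
prtconf -D | grep -i ahci    # confirm which driver is bound to the controller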

--
Rustam
 
 