We recently installed a 24-disk SATA array with an LSI controller attached
to a box running Solaris 10 x86, Release 4. The drives were set up in one
big raidz pool, and it worked great for about a month. On the 4th, the
system kernel-panicked and crashed, and it has been behaving very badly
since. Here's the diagnostic data I've been able to collect so far:
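For reference, the pool was created as a single raidz vdev across all 24 drives, along these lines (the c2tNd0 device names here are placeholders, not our actual ones):

```shell
# Sketch of the pool layout -- one raidz vdev spanning all 24 drives.
# c2tNd0 names are hypothetical stand-ins for the drives behind the
# LSI controller; the pool name LogData matches zpool status below.
zpool create LogData raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 \
    c2t6d0 c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0 c2t12d0 c2t13d0 \
    c2t14d0 c2t15d0 c2t16d0 c2t17d0 c2t18d0 c2t19d0 c2t20d0 c2t21d0 \
    c2t22d0 c2t23d0
```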

In the messages file:

Nov  4 13:24:11 mondo4 savecore: [ID 570001 auth.error] reboot after panic:
ZFS: I/O failure (write on <unknown> off 0: zio ffffffff97c86a00 [L0 DMU
dnode] 4000L/1000P DVA[0]=<0
:d08cf11b800:1800> DVA[1]=<0:1020a711c800:1800> fletcher4 lzjb LE contiguous
birth=731555 fill=32
Nov  4 13:24:06 mondo4 savecore: [ID 748169 auth.error] saving system crash
dump in /var/crash/mondo4/*.0


And yes, we've got the core files.
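In case it helps, here's the first pass we can take over that dump with mdb (file names follow savecore's unix.N/vmcore.N convention, N=0 per the messages above):

```shell
# Open the saved kernel crash dump with the modular debugger.
cd /var/crash/mondo4
# ::status prints the panic string and dump summary, ::stack the
# panicking thread's stack trace, and ::msgbuf the kernel messages
# leading up to the panic.
mdb -k unix.0 vmcore.0 <<'EOF'
::status
::stack
::msgbuf
EOF
```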

The box came back up and seemed to run okay for a couple of days, but we
noticed today that things were very odd.

Doing a df on the filesystem hung, and an ls on the local box hung as
well.

Looking at the output of dmesg, we see a lot of messages that look like:

Nov  8 03:58:22 mondo4 scsi: [ID 107833 kern.notice]    Requested Block:
1450319385                Error Block: 1450319385
Nov  8 03:58:22 mondo4 scsi: [ID 107833 kern.notice]    Vendor: ATA
Serial Number:
Nov  8 03:58:22 mondo4 scsi: [ID 107833 kern.notice]    Sense Key: Unit
Attention
Nov  8 03:58:22 mondo4 scsi: [ID 107833 kern.notice]    ASC: 0x29 (power on,
reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Nov  8 04:13:59 mondo4 scsi: [ID 107833 kern.notice]    Requested Block:
1450487074                Error Block: 1450487074
Nov  8 04:13:59 mondo4 scsi: [ID 107833 kern.notice]    Vendor: ATA
Serial Number:
Nov  8 04:13:59 mondo4 scsi: [ID 107833 kern.notice]    Sense Key: Unit
Attention
Nov  8 04:13:59 mondo4 scsi: [ID 107833 kern.notice]    ASC: 0x29 (power on,
reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
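The Unit Attention / "power on, reset, or bus reset occurred" sense data suggests a drive or the controller keeps resetting. To see which disk the errors cluster on, the next step we had in mind was to check the per-device error counters and the FMA error log (stock Solaris 10 commands):

```shell
# Cumulative soft/hard/transport error counts per device since boot;
# a disk with climbing hard or transport errors is the prime suspect.
iostat -En

# Fault-management error telemetry, with timestamps and device paths.
fmdump -eV | more

# Anything FMA has already diagnosed as faulted.
fmadm faulty
```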


Finally, trying to do a zpool status yields:

[EMAIL PROTECTED]:/# zpool status -v
  pool: LogData
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested

At which point the shell hangs and cannot be interrupted with Ctrl-C.


Any thoughts on how to proceed? I'm guessing we have a bad disk, but I'm not
sure. Anything you can recommend to diagnose this would be welcome.
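Once we identify the bad disk, we assume the recovery path is the one zpool status itself suggests, sketched here with a hypothetical device name:

```shell
# If the disk checks out and the errors were transient (e.g. a cable
# or backplane reset), just clear the error counters:
zpool clear LogData

# Otherwise replace it in place -- c2t7d0 is a placeholder for
# whichever drive turns out to be bad:
zpool replace LogData c2t7d0

# Then verify the whole pool end to end:
zpool scrub LogData
zpool status -v LogData
```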

--Mike 


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
