SunOS x4500-02.unix 5.10 Generic_127128-11 i86pc i386 i86pc

Admittedly we are not having much luck with the x4500s.

This time it was the new x4500, running Solaris 10 5/08. Drive
"/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd30)"
stopped responding, and even after a hard reset the server would simply
repeat "retryable", "reset", and "fatal" messages forever.

So we were unable to log in on the console. Again we ended up with the
problem of working out which HDD is actually broken. It turned out to be
drive #40. (Has anyone got a map we can print? Since we couldn't boot the
server, any Unix commands needed to do the mapping are a bit useless, and
we do not have an "hd" utility either.)

That a HDD died in the first month of operation is understandable, but 
does it really have to take the whole server with it? Not to mention 
stop it from booting. Eventually the NOC staff guessed the correct drive 
from the blinking of LEDs (no LED was RED), and we were able to boot.

Log outputs:

Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx5: device on port 3 reset: device disconnected or device error
Aug 11 08:47:59 x4500-02.unix sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]:
Aug 11 08:47:59 x4500-02.unix  port 3: device reset
Aug 11 08:47:59 x4500-02.unix sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]:
Aug 11 08:47:59 x4500-02.unix  port 3: link lost
Aug 11 08:47:59 x4500-02.unix sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]:
Aug 11 08:47:59 x4500-02.unix  port 3: link established
Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx5: error on port 3:
Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 517869 kern.info] device error
Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 517869 kern.info] device disconnected
Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 517869 kern.info] device connected
Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 517869 kern.info] EDMA self disabled
Aug 11 08:47:59 x4500-02.unix scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd30):
Aug 11 08:47:59 x4500-02.unix   Error for Command: read     Error Level: Retryable
Aug 11 08:47:59 x4500-02.unix scsi: [ID 107833 kern.notice] Requested Block: 439202                    Error Block: 439202
Aug 11 08:47:59 x4500-02.unix scsi: [ID 107833 kern.notice]     Vendor: ATA                                Serial Number:
Aug 11 08:47:59 x4500-02.unix scsi: [ID 107833 kern.notice]     Sense Key: No Additional Sense
Aug 11 08:47:59 x4500-02.unix scsi: [ID 107833 kern.notice]     ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
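
For what it's worth, a small script can at least tally which controller
and port the marvell88sx driver is complaining about in output like the
above. This is only a rough sketch: the function name count_port_events
and the regular expression are illustrative assumptions based on the two
message shapes seen in this log ("device on port N reset" and "error on
port N"), and it does not translate a controller/port pair into a
physical drive slot.

```python
import re
from collections import Counter

# Assumed message shapes, taken from the log excerpt in this post:
#   "marvell88sx5: device on port 3 reset: ..."
#   "marvell88sx5: error on port 3:"
PORT_EVENT = re.compile(
    r"(marvell88sx\d+):\s+(?:device on|error on)\s+port\s+(\d+)"
)

def count_port_events(log_text):
    """Count marvell88sx port events, keyed by (controller, port)."""
    return Counter(
        (ctrl, int(port)) for ctrl, port in PORT_EVENT.findall(log_text)
    )

# Two entries copied from the log above, still wrapped as in the mail.
sample = """\
Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 670675 kern.info] NOTICE:
marvell88sx5: device on port 3 reset: device disconnected or device error
Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 812950 kern.warning]
WARNING: marvell88sx5: error on port 3:
"""

for (ctrl, port), n in count_port_events(sample).most_common():
    print(f"{ctrl} port {port}: {n} event(s)")
```

The same function could be pointed at a saved copy of the system log
(on Solaris that is normally /var/adm/messages). The pattern matches
within a single line, so it works whether or not the entries have been
wrapped by a mail client.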


scrub: resilver in progress, 10.27% done, 2h14m to go



Perhaps not related, but equally annoying:

# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Aug 11 08:16:32.3925 64da6f29-4dda-44aa-e9ca-ad7054aaeaa1 ZFS-8000-D3
Aug 11 09:08:18.7834 086e6170-e4c7-c66b-c908-e37840db7e96 ZFS-8000-D3

# fmdump -v -u 086e6170-e4c7-c66b-c908-e37840db7e96
TIME                 UUID                                 SUNW-MSG-ID
Aug 11 09:08:18.7834 086e6170-e4c7-c66b-c908-e37840db7e96 ZFS-8000-D3
^C^Z^\

Alas, "kill -9" does not kill fmdump either, and the command appears to
lock up the server as well. I will remove the command for now, as it
definitely hangs the server every time. Another hard reset was needed.

Lund



-- 
Jorgen Lundman       | <[EMAIL PROTECTED]>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)