Hello, I'm fairly sure that this is a firmware issue and not a Linux issue, but I'm hoping someone on this list will know the right person to contact about firmware issues. If you do, please email me off-list with their contact info.
The Dell techs will replace the disk, of course, but that won't solve the real problem that caused the system to become unresponsive when the disk failed. We've been grappling with them on this issue for just over a year, and they have never once put me in touch with a firmware programmer, though they've replaced every component in the system during that time.

We have two identical systems exhibiting these non-reproducible symptoms, which only show up under full production use (ugh). First it was drive ID 1 (the 2nd drive) in both systems. Replaced those. Now it's drive ID 4 (the 5th drive) in both systems. I've replaced it on one system and am now replacing it on the second. The difference between the original drive and the replacement? The original was a QUANTUM ATLAS10K3_36_SCA rev. 120G U160; the replacement was either a Fujitsu U320 or a Seagate U320, depending on what Dell shipped that day. I'm fairly sure the Quantums just have slightly flaky drive firmware that locks up under certain conditions unique to our I/O patterns, but since no new firmware is being developed for those drives, the only option is to replace them with a different brand. At any rate, since the pattern holds on both systems, it most probably points at a misbehaving drive model.

However, the RAID controller is still at fault for the entire system going down, because it didn't mark the drive as failed and return control to the OS within 60 seconds. The system became unresponsive: SNMP graphing showed a load of 495, and all network activity stopped shortly after the disk failure, probably from a build-up of blocked I/O after the kernel kicked the unresponsive storage offline. The specific error message repeating across the console as fast as it would print was:

Assertion failure in do_get_write_access() at :0:
"jh->b_transaction == journal->j_committing_transaction"

The following controller log (Cmd[0x28] is SCSI READ(10); Cmd[0x2a] is WRITE(10)) indicates that the failed-disk detection routine within the controller took too long to determine that the disk had failed:

AFA0> diagnostic show history /old=TRUE
Executing: diagnostic show history /old=TRUE

*** HISTORY BUFFER FROM LAST RUN ***

[00]: ID(0:04:0) Cmd[0x28] Fail: Block Range 15179520 : 15179647
[01]: at 5082220 sec
[02]: ID(0:04:0) Cmd[0x28] Fail: Block Range 23749824 : 23749839
[03]: at 5082220 sec
[04]: ID(0:04:0) Cmd[0x28] Fail: Block Range 41313205 : 41313206
[05]: at 5082220 sec
[06]: ID(0:04:0) Cmd[0x2a] Fail: Block Range 5270011 : 5270012 at
[07]: 5082220 sec
[08]: ID(0:04:0) Cmd[0x28] Fail: Block Range 43269255 : 43269256
[09]: at 5082220 sec
[10]: ID(0:04:0) Cmd[0x28] Fail: Block Range 41310245 : 41310246
[11]: at 5082220 sec
[12]: ID(0:04:0) Cmd[0x28] Fail: Block Range 3144832 : 3144959 at
[13]: 5082220 sec
[14]: ID(0:04:0) Cmd[0x28] Fail: Block Range 48800545 : 48800546
[15]: at 5082220 sec
[16]: ID(0:04:0) Cmd[0x28] Fail: Block Range 24652631 : 24652632
[17]: at 5082220 sec
[18]: ID(0:04:0) Cmd[0x28] Fail: Block Range 8102825 : 8102826 at
[19]: 5082220 sec
[20]: ID(0:04:0) Cmd[0x28] Fail: Block Range 59097920 : 59097951
[21]: at 5082220 sec
[22]: ID(0:04:0) Cmd[0x28] Fail: Block Range 5461313 : 5461318 at
[23]: 5082220 sec
[24]: ID(0:04:0) Cmd[0x28] Fail: Block Range 64466133 : 64466134
[25]: at 5082220 sec
[26]: ID(0:04:0) Cmd[0x2a] Fail: Block Range 3147136 : 3147263 at
[27]: 5082220 sec
[28]: ID(0:04:0) Cmd[0x28] Fail: Block Range 590215 : 590222 at 5
[29]: 082220 sec
[30]: ID(0:04:0) Cmd[0x2a] Fail: Block Range 12283087 : 12283088
[31]: at 5082220 sec
[32]: ID(0:04:0) Cmd[0x2a] Fail: Block Range 3147264 : 3147391 at
[33]: 5082220 sec
[34]: ID(0:04:0) Cmd[0x28] Fail: Block Range 19046144 : 19046271
[35]: at 5082220 sec
[36]: ID(0:04:0) Cmd[0x28] Fail: Block Range 54603697 : 54603698
[37]: at 5082220 sec
[38]: ID(0:04:0) Cmd[0x28] Fail: Block Range 215263 : 215270 at 5
[39]: 082220 sec
[40]: ID(0:04:0) Cmd[0x28] Fail: Block Range 70646759 : 70646764
[41]: at 5082220 sec
[42]: ID(0:04:0) Cmd[0x28] Fail: Block Range 64215 : 64222 at 508
[43]: 2220 sec
[44]: ID(0:04:0) Cmd[0x28] Fail: Block Range 46804736 : 46804751
[45]: at 5082220 sec
[46]: ID(0:04:0) Cmd[0x28] Fail: Block Range 70664653 : 70664654
[47]: at 5082220 sec
[48]: ID(0:04:0) Cmd[0x28] Fail: Block Range 46055040 : 46055167
[49]: at 5082220 sec
[50]: ID(0:04:0) Cmd[0x28] Fail: Block Range 39911821 : 39911830
[51]: at 5082220 sec
[52]: ID(0:04:0) Cmd[0x2a] Fail: Block Range 3146880 : 3147007 at
[53]: 5082220 sec
[54]: ID(0:04:0) Cmd[0x28] Fail: Block Range 5159344 : 5159359 at
[55]: 5082220 sec
[56]: ID(0:04:0) Cmd[0x28] Fail: Block Range 41295373 : 41295374
[57]: at 5082220 sec
[58]: ID(0:04:0) Cmd[0x28] Fail: Block Range 15544320 : 15544447
[59]: at 5082220 sec
[60]: ID(0:04:0) Cmd[0x28] Fail: Block Range 5166793 : 5166794 at
[61]: 5082220 sec
[62]: RAID5 Container 0 Drive 0:4:0 Failure
[63]: ID(0:04:0): Timeout detected on cmd[0x28]
[64]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[65]: ID(0:04:0) Timeout detected on cmd[0x28]
[66]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[67]: ID(0:04:0): Timeout detected on cmd[0x28]
[68]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[69]: ID(0:04:0): Timeout detected on cmd[0x28]
[70]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[71]: ID(0:04:0): Timeout detected on cmd[0x28]
[72]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[73]: ID(0:04:0): Timeout detected on cmd[0x28]
[74]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[75]: ID(0:04:0): Timeout detected on cmd[0x28]
[76]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[77]: ID(0:04:0) Timeout detected on cmd[0x28]
[78]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[79]: ID(0:04:0) Cmd[0x28] Fail: Block Range 0 : 0 at 5082308 sec
[80]: 2 can't read mbr dev_t:4
[81]: <...repeats 1 more times>
[82]: can't read config from slice #[4]
[83]: 2 can't read mbr dev_t:4
[84]: can't read config from slice #[4]
[85]: CT_LogMissingEntry: Log missing entry, container 0, dev 4,
[86]: signature 0x8f950a4d, nvEntry 65
[87]: CtMarkDead: container 0, deadEntry 4, dev 4, signature 0x8f
[88]: 950a4d
[89]: CtMarkDead: container 0, deadEntry 4, dev 4, signature 0x8f
[90]: 950a4d
[91]: CtMarkDead: container 0, deadEntry 4, dev 4, signature 0x8f
[92]: 950a4d
[93]: CtMarkDead: container 0, deadEntry 4, dev 4, signature 0x8f
[94]: 950a4d
[95]: CtMarkDead: container 0, deadEntry 4, dev 4, signature 0x8f
[96]: 950a4d
[97]: RAID5 Failover Container 0 No Failover Assigned
[98]: Drive 0:4:0 returning error

That's 88 seconds to determine that the drive had failed: the command failures begin at 5082220 sec, and the final "Block Range 0 : 0" failure that precedes the dead-drive handling is stamped 5082308 sec. In other words, it took 88 seconds from the time the controller stopped processing commands from the OS until it was ready to continue processing them. The kernel killed the storage at 60 seconds, thus hosing the OS, since that was its only storage device. Though the controller came back, the OS had already given up and couldn't recover.

Am I correct in assessing that the controller's firmware is responsible for this extended delay in detecting the failed disk?
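For anyone who wants to double-check the arithmetic, below is a quick sanity-check script I threw together (Python; the history string is only an excerpt of the buffer, and the variable names are mine, so adjust to taste). It strips the [NN]: buffer-index prefixes, stitches back together the timestamps that wrap across buffer lines, and computes the detection latency:

import re

# Excerpt of the "diagnostic show history" buffer; paste the full dump here.
history = """\
[00]: ID(0:04:0) Cmd[0x28] Fail: Block Range 15179520 : 15179647
[01]: at 5082220 sec
[28]: ID(0:04:0) Cmd[0x28] Fail: Block Range 590215 : 590222 at 5
[29]: 082220 sec
[79]: ID(0:04:0) Cmd[0x28] Fail: Block Range 0 : 0 at 5082308 sec
"""

# Drop the "[NN]:" prefixes and flatten, so entries that wrap across
# buffer lines ("at 5" / "082220 sec") can be rejoined.
flat = " ".join(re.sub(r"^\[\d+\]:\s*", "", ln) for ln in history.splitlines())
flat = re.sub(r"(\d) (\d+ sec)", r"\1\2", flat)  # rejoin split timestamps

stamps = [int(s) for s in re.findall(r"at (\d+) sec", flat)]
drives = set(re.findall(r"ID\(\d+:\d+:\d+\)", flat))

print("drives involved:   ", drives)                            # only ID(0:04:0)
print("first command fail:", min(stamps), "sec")                # 5082220
print("drive marked dead: ", max(stamps), "sec")                # 5082308
print("detection latency: ", max(stamps) - min(stamps), "sec")  # 88

Run against the full buffer, every failing command points at ID(0:04:0), all the read (0x28) and write (0x2a) failures are stamped 5082220 sec, and the final "Block Range 0 : 0" failure is stamped 5082308 sec, which is where the 88 seconds comes from.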
Here's the information on our setup:

PERC 3/Di on Dell PowerEdge 2500
5-disk U160 RAID5

AFA0> controller details
Executing: controller details

Controller Information
----------------------
              Device Name: AFA0
          Controller Type: PERC 3/Di
              Access Mode: READ-WRITE
Controller Serial Number: Last Six Digits = 4C20D2
          Number of Buses: 2
          Devices per Bus: 15
           Controller CPU: i960 R series
     Controller CPU Speed: 100 Mhz
        Controller Memory: 128 Mbytes
            Battery State: Ok

Component Revisions
-------------------
                CLI: 2.8-0 (Build #6076)
                API: 2.8-0 (Build #6076)
    Miniport Driver: 1.1-4 (Build #9999)
Controller Software: 2.8-0 (Build #6092)
    Controller BIOS: 2.8-0 (Build #6092)
Controller Firmware: (Build #6092)

Sincerely,
Andrew Kinney
President and Chief Technology Officer
Advantagecom Networks, Inc.
http://www.advantagecom.net