Hi,

TSM 5.1.9.0 on a Win2K server. The library is a Neo 4100 with 2 x LTO-2 drives (HP) and 60 tapes.

I *think* I have a drive failure, but I'm not sure because the errors are so varied and intermittent. Here is what happens.

1) 16 of my tapes got set to unavailable when the server was unable to read the labels on the tapes:

    06/07/2005 10:49:22 ANR8355E I/O error reading label for volume ITG051L2 in drive MT1.0.0.3 (mt1.0.0.3).

All 16 tapes got marked within a 3-hour period, and all of them failed in drive MT1.0.0.3. Through my own fault, I didn't realize the tapes were being set to unavailable until much later. That gap is now closed: my reporting script tells me how many tapes are marked unavailable.

2) A few days ago, I started noticing more drive/tape/SCSI errors in the logs, such as:

    06/21/2005 08:44:19 ANR8300E I/O error on library LB6.0.0.3 (OP=8401C058, CC=205, KEY=FF, ASC=FF, ASCQ=FF, SENSE=**NONE**, Description=SCSI adapter failure). Refer to Appendix D in the 'Messages' manual for recommended action.

    06/21/2005 09:10:32 ANR8300E I/O error on library LB6.0.0.3 (OP=8401C058, CC=211, KEY=FF, ASC=FF, ASCQ=FF, SENSE=**NONE**, Description=The SCSI bus was busy). Refer to Appendix D in the 'Messages' manual for recommended action.

    06/22/2005 16:02:54 ANR8302E I/O error on drive MT1.0.0.3 (mt1.0.0.3) (OP=TESTREADY, Error Number=1117, CC=305, KEY=06, ASC=29, ASCQ=02, SENSE=70.00.06.00.00.00.00.0E.00.00.00.00.29.02-.00.00.2C.E4.00.00.00.00., Description=Drive failure). Refer to Appendix D in the 'Messages' manual for recommended action.

3) In addition, tapes get stuck in drive MT1.0.0.3, and I have to power cycle the library to get the tape out. I don't think the tape is stuck because of a hardware failure in the drive; rather, the server seems to get so confused by the I/O errors that it reaches a point where it doesn't know what to do.

4) The problem isn't limited to drive MT1.0.0.3. My second drive, MT2.0.0.3, has also had this problem, but much less frequently.

5) I also see errors in the Windows system event log:

    Event Type: Error
    Event Source: AdsmScsi
    Event Category: None
    Event ID: 3
    Date: 6/27/2005
    Time: 9:54:37 AM
    User: N/A
    Computer: TENEDOS
    Description: A check condition error has occurred on device \Device\mt1.0.0.3 during Rewind with completion code DD_DRIVE_FAILURE. Refer to the device's SCSI reference for appropriate action.

6) I doubt it's a tape failure, because how could 16 tapes fail all at once? In addition, I've set the unavailable tapes back to read/write, and they work for a while, but eventually the I/O errors come back.

7) I've also reseated the SCSI controller in my host and unplugged/replugged all SCSI cables, but I still have the problem.

I've got a service call in to Overland, but I haven't heard back from them. Right now our backups are down: as soon as the library tries to read a tape, the I/O errors pop up, the tape is marked unavailable, and I have to restart everything. It doesn't help that our tape library and disk spools are near full capacity.

Anyone have a clue as to the source of my problem?

Thanks!
Alex
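
P.S. One way to check whether the errors cluster on one drive or point at the bus is to tally the ANR83xx messages per device and pull out the KEY/ASC/ASCQ fields. Below is a minimal Python sketch, assuming the activity log has been exported to a text file with one message per line. The names `tally`, `sense_triplet`, and `describe` are hypothetical helpers, not part of TSM, and the lookup table covers only the two sense patterns quoted above.

```python
import re
from collections import Counter

# Matches the device name in messages like the ones quoted in this post, e.g.
#   "ANR8302E I/O error on drive MT1.0.0.3 (mt1.0.0.3) (OP=...)"
#   "ANR8355E I/O error reading label for volume ITG051L2 in drive MT1.0.0.3 ..."
MSG_RE = re.compile(
    r"(?P<msg>ANR83\d\d[EW]) I/O error "
    r"(?:on (?:drive|library) |reading label for volume \S+ in drive )"
    r"(?P<dev>\S+)"
)
FIELD_RE = re.compile(r"\b(KEY|ASC|ASCQ)=([0-9A-F]{2})\b")

# Illustrative lookup, limited to the values seen in the quoted log lines.
KNOWN_SENSE = {
    ("06", "29", "02"): "Unit Attention: SCSI bus reset occurred",
    ("FF", "FF", "FF"): "no sense data returned (SENSE=**NONE**)",
}

def tally(lines):
    """Count I/O error messages per device (drive or library)."""
    counts = Counter()
    for line in lines:
        m = MSG_RE.search(line)
        if m:
            counts[m.group("dev")] += 1
    return counts

def sense_triplet(line):
    """Return (KEY, ASC, ASCQ) from a message, or None if any field is absent."""
    fields = dict(FIELD_RE.findall(line))
    if {"KEY", "ASC", "ASCQ"} <= fields.keys():
        return fields["KEY"], fields["ASC"], fields["ASCQ"]
    return None

def describe(triplet):
    """Look up a sense triplet in the small illustrative table."""
    return KNOWN_SENSE.get(triplet, "not in this sketch's table")

if __name__ == "__main__":
    import sys
    # Usage (assumed export file name): python tally_errors.py actlog.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        lines = fh.readlines()
    for dev, n in tally(lines).most_common():
        print(dev, n)
    for line in lines:
        t = sense_triplet(line)
        if t:
            print(t, describe(t))
```

For the ANR8302E message quoted above, the triplet is 06/29/02, which in the SCSI sense-code tables is a Unit Attention reporting a bus reset. That may suggest bus-level trouble (adapter, cabling, termination) rather than a failed drive mechanism, but it is not conclusive on its own.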