Package: linux-image-2.6.16-2-em64t-p4-smp
Version: 2.6.16-18~bpo.1
Severity: normal

I feel very bad filing a bug against a backports package, but this
backports package is (according to the changelog) an unmodified 2.6.16-18
package, just recompiled for sarge, and people on the mailing lists are
reporting problems with a variety of OSes [1], so I have a feeling it's a
genuine driver bug with this kernel version.

In any case, the problem is that under heavy write load, I get messages
like these in /var/log/kern.log:

Oct 2 14:36:01 localhost kernel: sd 0:2:1:0: megasas: RESET -55455 cmd=2a
Oct 2 14:36:01 localhost kernel: megasas: reset successful
Oct 2 14:36:31 localhost kernel: sd 0:2:1:0: megasas: RESET -70369 cmd=2a
Oct 2 14:36:31 localhost kernel: megasas: reset successful
Oct 2 14:37:02 localhost kernel: sd 0:2:1:0: megasas: RESET -83487 cmd=2a
Oct 2 14:37:02 localhost kernel: megasas: reset successful
Oct 2 14:37:32 localhost kernel: sd 0:2:1:0: megasas: RESET -95079 cmd=2a
Oct 2 14:37:32 localhost kernel: megasas: reset successful
Oct 2 14:38:02 localhost kernel: sd 0:2:1:0: megasas: RESET -105361 cmd=2a
Oct 2 14:38:02 localhost kernel: megasas: reset successful
Oct 2 14:38:33 localhost kernel: sd 0:2:1:0: megasas: RESET -115613 cmd=2a
Oct 2 14:38:33 localhost kernel: megasas: reset successful
Oct 2 14:38:33 localhost kernel: sd 0:2:1:0: SCSI error: return code = 0x6000000
Oct 2 14:38:33 localhost kernel: end_request: I/O error, dev sdb, sector 2927091007
Oct 2 14:38:33 localhost kernel: Buffer I/O error on device sdb1, logical block 731772736
Oct 2 14:38:33 localhost kernel: lost page write due to I/O error on sdb1
Oct 2 14:39:03 localhost kernel: sd 0:2:1:0: megasas: RESET -125667 cmd=2a
Oct 2 14:39:03 localhost kernel: megasas: reset successful
Oct 2 14:39:33 localhost kernel: sd 0:2:1:0: megasas: RESET -135588 cmd=2a
Oct 2 14:39:33 localhost kernel: megasas: [ 0]waiting for 1 commands to complete
Oct 2 14:39:34 localhost kernel: megasas: reset successful

A mailing list posting recommended reducing BLKDEV_MAX_RQ to 8 in
include/linux/blkdev.h as a workaround; I've tried that, and it seems to
work for me.

I suspect that the following patch is the actual fix (from recent changes
to drivers/scsi/megaraid/megaraid_sas.c):

--- a/drivers/scsi/megaraid/megaraid_sas.c 2006-03-20 00:53:29.000000000 -0500
+++ b/drivers/scsi/megaraid/megaraid_sas.c 2006-10-13 12:25:04.000000000 -0400
@@ -1716,6 +1823,12 @@
          * Get various operational parameters from status register
          */
         instance->max_fw_cmds = instance->instancet->read_fw_status_reg(reg_set) & 0x00FFFF;
+        /*
+         * Reduce the max supported cmds by 1. This is to ensure that the
+         * reply_q_sz (1 more than the max cmd that driver may send)
+         * does not exceed max cmds that the FW can support
+         */
+        instance->max_fw_cmds = instance->max_fw_cmds-1;
         instance->max_num_sge = (instance->instancet->read_fw_status_reg(reg_set) & 0xFF0000) >>
                                         0x10;
         /*

... but, of course, I'm not entirely sure what I'm doing.

This is a production server now, but I may be able to do some amount of
testing (like installing an etch or unstable partition to test more recent
Debian kernels) from time to time over weekends or during downtime. If
you'd like me to, let me know and I'll see what I can do. I'm also planning
to test the above patch after consulting with some kernel hackers. I'll let
you know how it goes.

[1] http://lists.us.dell.com/pipermail/linux-poweredge/2006-October/027705.html
    http://lkml.org/lkml/2006/9/6/12
    http://lists.us.dell.com/pipermail/linux-poweredge/2006-August/026821.html

-- System Information:
Debian Release: 3.1
Architecture: amd64 (x86_64)
Kernel: Linux 2.6.16+max-nr-req-8
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)

Versions of packages linux-image-2.6.16-2-em64t-p4-smp depends on:
ii  e2fsprogs                  1.37-2sarge1  ext2 file system utilities and lib
ii  initramfs-tools [linux-ini 0.80~bpo.1    tools for generating an initramfs
ii  module-init-tools          3.2.2-3~bpo.1 tools for managing Linux kernel mo

-- debconf information:
  shared/kernel-image/really-run-bootloader: true
  linux-image-2.6.16-2-em64t-p4-smp/preinst/abort-install-2.6.16-2-em64t-p4-smp:
  linux-image-2.6.16-2-em64t-p4-smp/preinst/bootloader-initrd-2.6.16-2-em64t-p4-smp: true
  linux-image-2.6.16-2-em64t-p4-smp/preinst/initrd-2.6.16-2-em64t-p4-smp:
  linux-image-2.6.16-2-em64t-p4-smp/postinst/old-dir-initrd-link-2.6.16-2-em64t-p4-smp: true
  linux-image-2.6.16-2-em64t-p4-smp/postinst/old-initrd-link-2.6.16-2-em64t-p4-smp: true
  linux-image-2.6.16-2-em64t-p4-smp/preinst/already-running-this-2.6.16-2-em64t-p4-smp:
  linux-image-2.6.16-2-em64t-p4-smp/postinst/bootloader-test-error-2.6.16-2-em64t-p4-smp:
  linux-image-2.6.16-2-em64t-p4-smp/postinst/depmod-error-initrd-2.6.16-2-em64t-p4-smp: false
  linux-image-2.6.16-2-em64t-p4-smp/postinst/kimage-is-a-directory:
  linux-image-2.6.16-2-em64t-p4-smp/postinst/old-system-map-link-2.6.16-2-em64t-p4-smp: true
  linux-image-2.6.16-2-em64t-p4-smp/prerm/would-invalidate-boot-loader-2.6.16-2-em64t-p4-smp: true
  linux-image-2.6.16-2-em64t-p4-smp/preinst/failed-to-move-modules-2.6.16-2-em64t-p4-smp:
  linux-image-2.6.16-2-em64t-p4-smp/postinst/bootloader-error-2.6.16-2-em64t-p4-smp:
  linux-image-2.6.16-2-em64t-p4-smp/postinst/depmod-error-2.6.16-2-em64t-p4-smp: false
  linux-image-2.6.16-2-em64t-p4-smp/preinst/lilo-initrd-2.6.16-2-em64t-p4-smp: true
  linux-image-2.6.16-2-em64t-p4-smp/preinst/elilo-initrd-2.6.16-2-em64t-p4-smp: true
  linux-image-2.6.16-2-em64t-p4-smp/preinst/overwriting-modules-2.6.16-2-em64t-p4-smp: true
  linux-image-2.6.16-2-em64t-p4-smp/preinst/abort-overwrite-2.6.16-2-em64t-p4-smp:
  linux-image-2.6.16-2-em64t-p4-smp/postinst/create-kimage-link-2.6.16-2-em64t-p4-smp: true
  linux-image-2.6.16-2-em64t-p4-smp/prerm/removing-running-kernel-2.6.16-2-em64t-p4-smp: true
  linux-image-2.6.16-2-em64t-p4-smp/preinst/lilo-has-ramdisk: