Summary: Large writes (eg. copying /dev/random) to an mpii device quickly fails and locks up any further IO on the device.
I am not familiar with device drivers but I can do whatever is necessary to probe this driver and/or device because there's nothing on this machine yet so there's nothing on it to lose. This is what I have found out so far. This is the device: mpii0 at pci10 dev 0 function 0 "Symbios Logic SAS3008" rev 0x02: msi mpii0: SAS9300-8i, firmware 5.0.0.0, MPI 2.5 scsibus4 at mpii0: 1024 targets Last month I sent a huge dump of unexamined kernel output attached to a bug report. It seems that the extra output was stripped and the email was left only with the regular dmesg and the binary blobs that sendbug extracts without the failure. Oops. In the meantime I have poked around in the driver and tried to if not figure out what is going wrong, at least figure out where. I have recompiled a kernel with debugging enabled in the mpii driver (and adjusted a few of the debug messages to include the SMID to tie log lines together). When booted into that kernel I mounted a partition on the mpii device and copied a large file into it. First, this theme repeats a lot using SMIDs between 120 and 127. Sometimes one is in flight, sometimes several. nb. The log times are useless beyond ~minute-wise resolution because syslog was lagging so far behind the console. Nov 18 11:29:46 o /bsd: mpii0: mpii_get_ccb 0xffff800003d52530 smid: 120 Nov 18 11:29:46 o /bsd: mpii0: mpii_scsi_cmd Nov 18 11:29:46 o /bsd: mpii0: ccb_smid: 120 xs->flags: 0x1001 Nov 18 11:29:46 o /bsd: mpii0: mpii_start 0x6be1a000 Nov 18 11:29:46 o /bsd: mpii0: MPII_REQ_DESCR_POST_LOW (0x000000c0) write 0x0009000000780000 Nov 18 11:29:46 o /bsd: mpii0: MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 0x00000206 Nov 18 11:29:46 o /bsd: mpii0: mpii_get_ccb 0xffff800003d52580 smid: 121 Nov 18 11:29:46 o /bsd: mpii0: mpii_scsi_cmd Nov 18 11:29:46 o /bsd: mpii0: ccb_smid: 121 xs->flags: 0x1001 Nov 18 11:29:46 o /bsd: mpii0: mpii_start 0x6be1a200 Nov 18 11:29:46 o /bsd: mpii0: MPII_REQ_DESCR_POST_LOW (0x000000c0) write 0x0009000000790000 Nov 18 11:29:46 o /bsd: mpii0: MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 0x00000206 Nov 18 11:29:46 o /bsd: mpii0: mpii_get_ccb 0xffff800003d526c0 smid: 125 Nov 18 11:29:46 o /bsd: mpii0: mpii_scsi_cmd Nov 18 11:29:46 o /bsd: mpii0: ccb_smid: 125 xs->flags: 0x1001 Nov 18 11:29:46 o /bsd: mpii0: mpii_start 0x6be1aa00 Nov 18 11:29:46 o /bsd: mpii0: MPII_REQ_DESCR_POST_LOW (0x000000c0) write 0x00090000007d0000 Nov 18 11:29:46 o /bsd: mpii0: MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 0x00000206 Nov 18 11:29:46 o /bsd: mpii0: mpii_reply Nov 18 11:29:46 o last message repeated 2 times Nov 18 11:29:46 o /bsd: mpii0: mpii_put_ccb 0xffff800003d52530 smid: 120 Nov 18 11:29:46 o /bsd: mpii0: mpii_put_ccb 0xffff800003d52580 smid: 121 Nov 18 11:29:46 o /bsd: mpii0: mpii_put_ccb 0xffff800003d526c0 smid: 125 This goes on for an unpredictable amount of time (here started at 11:28:07, another took 5 minutes to fail) until there are no more replies and the available CCBs are then exhausted: Nov 18 11:29:46 o /bsd: mpii0: mpii_get_ccb 0xffff800003d526c0 smid: 125 Nov 18 11:29:46 o /bsd: mpii0: mpii_scsi_cmd Nov 18 11:29:46 o /bsd: mpii0: ccb_smid: 125 xs->flags: 0x1001 Nov 18 11:29:46 o /bsd: mpii0: mpii_start 0x6be1aa00 Nov 18 11:29:46 o /bsd: mpii0: MPII_REQ_DESCR_POST_LOW (0x000000c0) write 0x00090000007d0000 Nov 18 11:29:46 o /bsd: mpii0: MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 0x00000206 Nov 18 11:29:46 o /bsd: mpii0: mpii_get_ccb 0xffff800003d52580 smid: 121 Nov 18 11:29:46 o /bsd: mpii0: mpii_scsi_cmd Nov 18 11:29:46 o /bsd: mpii0: ccb_smid: 121 xs->flags: 0x1001 Nov 18 11:29:46 o /bsd: mpii0: mpii_start 0x6be1a200 Nov 18 11:29:46 o /bsd: mpii0: MPII_REQ_DESCR_POST_LOW (0x000000c0) write 0x0009000000790000 Nov 18 11:29:46 o /bsd: mpii0: MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 0x00000206 Nov 18 11:29:46 o /bsd: mpii0: mpii_get_ccb 0xffff800003d52530 smid: 120 Nov 18 11:29:46 o /bsd: mpii0: mpii_scsi_cmd Nov 18 11:29:46 o /bsd: mpii0: ccb_smid: 120 xs->flags: 0x1001 Nov 18 11:29:46 o /bsd: mpii0: mpii_start 0x6be1a000 Nov 18 11:29:46 o /bsd: mpii0: MPII_REQ_DESCR_POST_LOW (0x000000c0) write 0x0009000000780000 Nov 18 11:29:46 o /bsd: mpii0: MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 0x00000206 ... Nov 18 11:29:55 o /bsd: mpii0: mpii_get_ccb 0xffff800003d500f0 smid: 4 Nov 18 11:29:55 o /bsd: mpii0: mpii_scsi_cmd Nov 18 11:29:55 o /bsd: mpii0: ccb_smid: 4 xs->flags: 0x1001 Nov 18 11:29:55 o /bsd: mpii0: mpii_start 0x6be0b800 Nov 18 11:29:55 o /bsd: mpii0: MPII_REQ_DESCR_POST_LOW (0x000000c0) write 0x0009000000040000 Nov 18 11:29:55 o /bsd: mpii0: MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 0x00000206 From here on it's not always the same. Sometimes it reaches SMID 1 and sometimes it stops higher. Sometimes the time outs start after a delay and sometimes while the CCBs are still being sent. This is not too surprising because whatever's failed has already failed by this point and so all bets are off. Nov 18 11:30:42 o /bsd: mpii0: mpii_read 0 0x4000265d Nov 18 11:30:42 o /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d) smid: 125 Nov 18 11:30:42 o /bsd: mpii0: mpii_get_ccb 0xffff800003d500a0 smid: 3 Nov 18 11:30:42 o /bsd: mpii0: mpii_start 0x6be0b600 Nov 18 11:30:42 o /bsd: mpii0: MPII_REQ_DESCR_POST_LOW (0x000000c0) write 0x0000000000030006 Nov 18 11:30:42 o /bsd: mpii0: MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 0x00000206 Nov 18 11:30:42 o /bsd: mpii0: mpii_read 0 0x4000265d Nov 18 11:30:42 o /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d) smid: 121 Nov 18 11:30:42 o /bsd: mpii0: mpii_get_ccb 0xffff800003d50050 smid: 2 Nov 18 11:30:42 o /bsd: mpii0: mpii_start 0x6be0b400 Nov 18 11:30:42 o /bsd: mpii0: MPII_REQ_DESCR_POST_LOW (0x000000c0) write 0x0000000000020006 Nov 18 11:30:42 o /bsd: mpii0: MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 0x00000206 Nov 18 11:30:42 o /bsd: mpii0: mpii_read 0 0x4000265d Nov 18 11:30:42 o /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d) smid: 120 Nov 18 11:30:42 o /bsd: mpii0: mpii_get_ccb 0xffff800003d50000 smid: 1 Nov 18 11:30:42 o /bsd: mpii0: mpii_start 0x6be0b200 Nov 18 11:30:42 o /bsd: mpii0: MPII_REQ_DESCR_POST_LOW (0x000000c0) write 0x0000000000010006 Nov 18 11:30:42 o /bsd: mpii0: MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 0x00000206 Sometimes it exhausts all the CCBs but still tries to allocate more: Nov 18 11:30:42 o /bsd: mpii0: mpii_read 0 0x4000265d Nov 18 11:30:42 o /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d) smid: 126 Nov 18 11:30:42 o /bsd: mpii0: mpii_get_ccb 0x0 smid: -1 Nov 18 11:30:42 o /bsd: mpii0: mpii_read 0 0x4000265d Nov 18 11:30:42 o /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d) smid: 127 Nov 18 11:30:42 o /bsd: mpii0: mpii_get_ccb 0x0 smid: -1 Nov 18 11:30:42 o /bsd: mpii0: mpii_read 0 0x4000265d Nov 18 11:30:42 o /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d) smid: 123 Nov 18 11:30:42 o /bsd: mpii0: mpii_get_ccb 0x0 smid: -1 Nov 18 11:30:42 o /bsd: mpii0: mpii_read 0 0x4000265d ... Nov 18 11:30:55 o /bsd: mpii0: mpii_read 0 0x4000265d Nov 18 11:30:55 o /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d) smid: 4 Nov 18 11:30:55 o /bsd: mpii0: mpii_get_ccb 0x0 smid: -1 After hard-rebooting the system fsck usually fails unless I revert to the regular kernel. The output is essentially the same except: only a single CCB is in use at a time and the eventual failure has a different return value: 0x40002623 instead of 0x4000265d above: Nov 18 11:55:09 o /bsd: mpii0: mpii_get_ccb 0xffff800003d52710 smid: 126 Nov 18 11:55:09 o /bsd: mpii0: mpii_scsi_cmd Nov 18 11:55:09 o /bsd: mpii0: ccb_smid: 126 xs->flags: 0x801 Nov 18 11:55:09 o /bsd: mpii0: mpii_start 0x6be1ac00 Nov 18 11:55:09 o /bsd: mpii0: MPII_REQ_DESCR_POST_LOW (0x000000c0) write 0x00090000007e0000 Nov 18 11:55:09 o /bsd: mpii0: MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 0x00000286 Nov 18 11:55:09 o /bsd: mpii0: mpii_reply Nov 18 11:55:09 o /bsd: mpii0: mpii_put_ccb 0xffff800003d52710 smid: 126 Nov 18 11:55:09 o /bsd: mpii0: mpii_get_ccb 0xffff800003d52710 smid: 126 Nov 18 11:55:09 o /bsd: mpii0: mpii_scsi_cmd Nov 18 11:55:09 o /bsd: mpii0: ccb_smid: 126 xs->flags: 0x801 Nov 18 11:55:09 o /bsd: mpii0: mpii_start 0x6be1ac00 Nov 18 11:55:09 o /bsd: mpii0: MPII_REQ_DESCR_POST_LOW (0x000000c0) write 0x00090000007e0000 Nov 18 11:55:09 o /bsd: mpii0: MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 0x00000286 Nov 18 11:56:09 o /bsd: mpii0: mpii_read 0 0x40002623 Nov 18 11:56:09 o /bsd: mpii0: mpii_scsi_cmd_tmo (0x40002623) smid: 126 Nov 18 11:56:09 o /bsd: mpii0: mpii_get_ccb 0xffff800003d52760 smid: 127 Nov 18 11:56:09 o /bsd: mpii0: mpii_start 0x6be1ae00 Nov 18 11:56:09 o /bsd: mpii0: MPII_REQ_DESCR_POST_LOW (0x000000c0) write 0x00000000007f0006 Nov 18 11:56:09 o /bsd: mpii0: MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 0x00000286 20 minutes later there is still no mpii_scsi_cmd_tmo for SMID 127 so that's completely wedged now. I've tried to find out the significance of the error return codes. FreeBSD's driver, written by the manufacturer, has some definitions which look relevant: #define MPI2_IOC_STATE_RESET (0x00000000) #define MPI2_IOC_STATE_READY (0x10000000) #define MPI2_IOC_STATE_OPERATIONAL (0x20000000) #define MPI2_IOC_STATE_FAULT (0x40000000) #define MPI2_IOC_STATE_MASK (0xF0000000) #define MPI2_IOC_STATE_SHIFT (28) /* Fault state range for prodcut specific codes */ #define MPI2_FAULT_PRODUCT_SPECIFIC_MIN (0x0000) #define MPI2_FAULT_PRODUCT_SPECIFIC_MAX (0xEFFF) These seem to be related to these values from OpenBSD's driver: #define MPII_DOORBELL_STATE (0xf<<28) /* ioc state */ #define MPII_DOORBELL_STATE_RESET (0x0<<28) #define MPII_DOORBELL_STATE_READY (0x1<<28) #define MPII_DOORBELL_STATE_OPER (0x2<<28) #define MPII_DOORBELL_STATE_FAULT (0x4<<28) ... #define MPII_DOORBELL_FAULT (0xffff<<0) /* fault code */ MPI2_IOC_STATE_FAULT and MPII_DOORBELL_STATE_FAULT look promising but this is only guesswork based on similar-looking numbers and neither set explains what the value in the lower 2 bytes is, which is what I wanted. Unfortunately there's no part of the driver that says "your bug is here" so I am at a loss. I can reproduce the problem and I can see that it's because a faulty (fawlty?) command is sent to the device but I don't know what is sent, how it differs from what the device is expecting or what return value is telling me. Matthew ps. The patch below fixes up one part of the debug output that confused me until I remembered reading something about 32-bit writes needing a barrier to be completed in two parts: Index: mpii.c =================================================================== RCS file: /src/datum/openbsd/cvs/src/sys/dev/pci/mpii.c,v @@ -2771,16 +2773,18 @@ mpii_start(struct mpii_softc *sc, struct descr.vf_id = sc->sc_vf_id; descr.smid = ccb->ccb_smid; +#if defined(__LP64__) DNPRINTF(MPII_D_RW, "%s: MPII_REQ_DESCR_POST_LOW (0x%08x) write " - "0x%08lx\n", DEVNAME(sc), MPII_REQ_DESCR_POST_LOW, *rdp); - - DNPRINTF(MPII_D_RW, "%s: MPII_REQ_DESCR_POST_HIGH (0x%08x) write " - "0x%08lx\n", DEVNAME(sc), MPII_REQ_DESCR_POST_HIGH, *(rdp+1)); + "0x%016lx\n", DEVNAME(sc), MPII_REQ_DESCR_POST_LOW, *rdp); -#if defined(__LP64__) bus_space_write_raw_8(sc->sc_iot, sc->sc_ioh, MPII_REQ_DESCR_POST_LOW, *rdp); #else + DNPRINTF(MPII_D_RW, "%s: MPII_REQ_DESCR_POST_LOW (0x%08x) write " + "0x%08lx\n", DEVNAME(sc), MPII_REQ_DESCR_POST_LOW, rdp[0]); + DNPRINTF(MPII_D_RW, "%s: MPII_REQ_DESCR_POST_HIGH (0x%08x) write " + "0x%08lx\n", DEVNAME(sc), MPII_REQ_DESCR_POST_HIGH, rdp[1]); + mtx_enter(&sc->sc_req_mtx); bus_space_write_raw_4(sc->sc_iot, sc->sc_ioh, MPII_REQ_DESCR_POST_LOW, rdp[0]);