Summary: Large writes (eg. copying /dev/random) to an mpii device
quickly fails and locks up any further IO on the device.
I am not familiar with device drivers but I can do whatever is
necessary to probe this driver and/or device because there's nothing
on this machine yet so there's nothing on it to lose.

This is what I have found out so far.

This is the device:

        mpii0 at pci10 dev 0 function 0 "Symbios Logic SAS3008" rev 0x02: msi
        mpii0: SAS9300-8i, firmware 5.0.0.0, MPI 2.5
        scsibus4 at mpii0: 1024 targets

Last month I sent a huge dump of unexamined kernel output attached
to a bug report. It seems that the extra output was stripped and
the email was left only with the regular dmesg and the binary blobs
that sendbug extracts without the failure. Oops.

In the meantime I have poked around in the driver and tried to if
not figure out what is going wrong, at least figure out where.

I have recompiled a kernel with debugging enabled in the mpii driver
(and adjusted a few of the debug messages to include the SMID to
tie log lines together). When booted into that kernel I mounted a
partition on the mpii device and copied a large file into it.

First, this theme repeats a lot using SMIDs between 120 and 127.
Sometimes one is in flight, sometimes several.

nb. The log times are useless beyond ~minute-wise resolution because
syslog was lagging so far behind the console.

Nov 18 11:29:46 o /bsd: mpii0: mpii_get_ccb 0xffff800003d52530 smid: 120
Nov 18 11:29:46 o /bsd: mpii0: mpii_scsi_cmd
Nov 18 11:29:46 o /bsd: mpii0: ccb_smid: 120 xs->flags: 0x1001
Nov 18 11:29:46 o /bsd: mpii0: mpii_start 0x6be1a000
Nov 18 11:29:46 o /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x000000c0) write 
0x0009000000780000
Nov 18 11:29:46 o /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 
0x00000206
Nov 18 11:29:46 o /bsd: mpii0: mpii_get_ccb 0xffff800003d52580 smid: 121
Nov 18 11:29:46 o /bsd: mpii0: mpii_scsi_cmd
Nov 18 11:29:46 o /bsd: mpii0: ccb_smid: 121 xs->flags: 0x1001
Nov 18 11:29:46 o /bsd: mpii0: mpii_start 0x6be1a200
Nov 18 11:29:46 o /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x000000c0) write 
0x0009000000790000
Nov 18 11:29:46 o /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 
0x00000206
Nov 18 11:29:46 o /bsd: mpii0: mpii_get_ccb 0xffff800003d526c0 smid: 125
Nov 18 11:29:46 o /bsd: mpii0: mpii_scsi_cmd
Nov 18 11:29:46 o /bsd: mpii0: ccb_smid: 125 xs->flags: 0x1001
Nov 18 11:29:46 o /bsd: mpii0: mpii_start 0x6be1aa00
Nov 18 11:29:46 o /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x000000c0) write 
0x00090000007d0000
Nov 18 11:29:46 o /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 
0x00000206
Nov 18 11:29:46 o /bsd: mpii0: mpii_reply
Nov 18 11:29:46 o last message repeated 2 times
Nov 18 11:29:46 o /bsd: mpii0: mpii_put_ccb 0xffff800003d52530 smid: 120
Nov 18 11:29:46 o /bsd: mpii0: mpii_put_ccb 0xffff800003d52580 smid: 121
Nov 18 11:29:46 o /bsd: mpii0: mpii_put_ccb 0xffff800003d526c0 smid: 125

This goes on for an unpredictable amount of time (here started at
11:28:07, another took 5 minutes to fail) until there are no more
replies and the available CCBs are then exhausted:

Nov 18 11:29:46 o /bsd: mpii0: mpii_get_ccb 0xffff800003d526c0 smid: 125
Nov 18 11:29:46 o /bsd: mpii0: mpii_scsi_cmd
Nov 18 11:29:46 o /bsd: mpii0: ccb_smid: 125 xs->flags: 0x1001
Nov 18 11:29:46 o /bsd: mpii0: mpii_start 0x6be1aa00
Nov 18 11:29:46 o /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x000000c0) write 
0x00090000007d0000
Nov 18 11:29:46 o /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 
0x00000206
Nov 18 11:29:46 o /bsd: mpii0: mpii_get_ccb 0xffff800003d52580 smid: 121
Nov 18 11:29:46 o /bsd: mpii0: mpii_scsi_cmd
Nov 18 11:29:46 o /bsd: mpii0: ccb_smid: 121 xs->flags: 0x1001
Nov 18 11:29:46 o /bsd: mpii0: mpii_start 0x6be1a200
Nov 18 11:29:46 o /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x000000c0) write 
0x0009000000790000
Nov 18 11:29:46 o /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 
0x00000206
Nov 18 11:29:46 o /bsd: mpii0: mpii_get_ccb 0xffff800003d52530 smid: 120
Nov 18 11:29:46 o /bsd: mpii0: mpii_scsi_cmd
Nov 18 11:29:46 o /bsd: mpii0: ccb_smid: 120 xs->flags: 0x1001
Nov 18 11:29:46 o /bsd: mpii0: mpii_start 0x6be1a000
Nov 18 11:29:46 o /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x000000c0) write 
0x0009000000780000
Nov 18 11:29:46 o /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 
0x00000206
...
Nov 18 11:29:55 o /bsd: mpii0: mpii_get_ccb 0xffff800003d500f0 smid: 4
Nov 18 11:29:55 o /bsd: mpii0: mpii_scsi_cmd
Nov 18 11:29:55 o /bsd: mpii0: ccb_smid: 4 xs->flags: 0x1001
Nov 18 11:29:55 o /bsd: mpii0: mpii_start 0x6be0b800
Nov 18 11:29:55 o /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x000000c0) write 
0x0009000000040000
Nov 18 11:29:55 o /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 
0x00000206

From here on it's not always the same. Sometimes it reaches SMID 1
and sometimes it stops higher. Sometimes the time outs start after
a delay and sometimes while the CCBs are still being sent. This is
not too surprising because whatever's failed has already failed by
this point and so all bets are off.

Nov 18 11:30:42 o /bsd: mpii0: mpii_read 0 0x4000265d
Nov 18 11:30:42 o /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d) smid: 125
Nov 18 11:30:42 o /bsd: mpii0: mpii_get_ccb 0xffff800003d500a0 smid: 3
Nov 18 11:30:42 o /bsd: mpii0: mpii_start 0x6be0b600
Nov 18 11:30:42 o /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x000000c0) write 
0x0000000000030006
Nov 18 11:30:42 o /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 
0x00000206
Nov 18 11:30:42 o /bsd: mpii0: mpii_read 0 0x4000265d
Nov 18 11:30:42 o /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d) smid: 121
Nov 18 11:30:42 o /bsd: mpii0: mpii_get_ccb 0xffff800003d50050 smid: 2
Nov 18 11:30:42 o /bsd: mpii0: mpii_start 0x6be0b400
Nov 18 11:30:42 o /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x000000c0) write 
0x0000000000020006
Nov 18 11:30:42 o /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 
0x00000206
Nov 18 11:30:42 o /bsd: mpii0: mpii_read 0 0x4000265d
Nov 18 11:30:42 o /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d) smid: 120
Nov 18 11:30:42 o /bsd: mpii0: mpii_get_ccb 0xffff800003d50000 smid: 1
Nov 18 11:30:42 o /bsd: mpii0: mpii_start 0x6be0b200
Nov 18 11:30:42 o /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x000000c0) write 
0x0000000000010006
Nov 18 11:30:42 o /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 
0x00000206

Sometimes it exhausts all the CCBs but still tries to allocate more:

Nov 18 11:30:42 o /bsd: mpii0: mpii_read 0 0x4000265d
Nov 18 11:30:42 o /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d) smid: 126
Nov 18 11:30:42 o /bsd: mpii0: mpii_get_ccb 0x0 smid: -1
Nov 18 11:30:42 o /bsd: mpii0: mpii_read 0 0x4000265d
Nov 18 11:30:42 o /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d) smid: 127
Nov 18 11:30:42 o /bsd: mpii0: mpii_get_ccb 0x0 smid: -1
Nov 18 11:30:42 o /bsd: mpii0: mpii_read 0 0x4000265d
Nov 18 11:30:42 o /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d) smid: 123
Nov 18 11:30:42 o /bsd: mpii0: mpii_get_ccb 0x0 smid: -1
Nov 18 11:30:42 o /bsd: mpii0: mpii_read 0 0x4000265d
...
Nov 18 11:30:55 o /bsd: mpii0: mpii_read 0 0x4000265d
Nov 18 11:30:55 o /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d) smid: 4
Nov 18 11:30:55 o /bsd: mpii0: mpii_get_ccb 0x0 smid: -1

After hard-rebooting the system fsck usually fails unless I revert
to the regular kernel. The output is essentially the same except:
only a single CCB is in use at a time and the eventual failure has
a different return value: 0x40002623 instead of 0x4000265d above:

Nov 18 11:55:09 o /bsd: mpii0: mpii_get_ccb 0xffff800003d52710 smid: 126
Nov 18 11:55:09 o /bsd: mpii0: mpii_scsi_cmd
Nov 18 11:55:09 o /bsd: mpii0: ccb_smid: 126 xs->flags: 0x801
Nov 18 11:55:09 o /bsd: mpii0: mpii_start 0x6be1ac00
Nov 18 11:55:09 o /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x000000c0) write 
0x00090000007e0000
Nov 18 11:55:09 o /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 
0x00000286
Nov 18 11:55:09 o /bsd: mpii0: mpii_reply
Nov 18 11:55:09 o /bsd: mpii0: mpii_put_ccb 0xffff800003d52710 smid: 126
Nov 18 11:55:09 o /bsd: mpii0: mpii_get_ccb 0xffff800003d52710 smid: 126
Nov 18 11:55:09 o /bsd: mpii0: mpii_scsi_cmd
Nov 18 11:55:09 o /bsd: mpii0: ccb_smid: 126 xs->flags: 0x801
Nov 18 11:55:09 o /bsd: mpii0: mpii_start 0x6be1ac00
Nov 18 11:55:09 o /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x000000c0) write 
0x00090000007e0000
Nov 18 11:55:09 o /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 
0x00000286
Nov 18 11:56:09 o /bsd: mpii0: mpii_read 0 0x40002623
Nov 18 11:56:09 o /bsd: mpii0: mpii_scsi_cmd_tmo (0x40002623) smid: 126
Nov 18 11:56:09 o /bsd: mpii0: mpii_get_ccb 0xffff800003d52760 smid: 127
Nov 18 11:56:09 o /bsd: mpii0: mpii_start 0x6be1ae00
Nov 18 11:56:09 o /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x000000c0) write 
0x00000000007f0006
Nov 18 11:56:09 o /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x000000c4) write 
0x00000286

20 minutes later there is still no mpii_scsi_cmd_tmo for SMID 127
so that's completely wedged now.

I've tried to find out the significance of the error return codes.
FreeBSD's driver, written by the manufacturer, has some definitions
which look relevant:

        #define MPI2_IOC_STATE_RESET               (0x00000000)
        #define MPI2_IOC_STATE_READY               (0x10000000)
        #define MPI2_IOC_STATE_OPERATIONAL         (0x20000000)
        #define MPI2_IOC_STATE_FAULT               (0x40000000)

        #define MPI2_IOC_STATE_MASK                (0xF0000000)
        #define MPI2_IOC_STATE_SHIFT               (28)

        /* Fault state range for prodcut specific codes */
        #define MPI2_FAULT_PRODUCT_SPECIFIC_MIN                 (0x0000)
        #define MPI2_FAULT_PRODUCT_SPECIFIC_MAX                 (0xEFFF)

These seem to be related to these values from OpenBSD's driver:

        #define MPII_DOORBELL_STATE             (0xf<<28) /* ioc state */
        #define  MPII_DOORBELL_STATE_RESET      (0x0<<28)
        #define  MPII_DOORBELL_STATE_READY      (0x1<<28)
        #define  MPII_DOORBELL_STATE_OPER       (0x2<<28)
        #define  MPII_DOORBELL_STATE_FAULT      (0x4<<28)
        ...
        #define MPII_DOORBELL_FAULT             (0xffff<<0) /* fault code */

MPI2_IOC_STATE_FAULT and MPII_DOORBELL_STATE_FAULT look promising
but this is only guesswork based on similar-looking numbers and
neither set explains what the value in the lower 2 bytes is, which
is what I wanted.

Unfortunately there's no part of the driver that says "your bug is
here" so I am at a loss. I can reproduce the problem and I can see
that it's because a faulty (fawlty?) command is sent to the device
but I don't know what is sent, how it differs from what the device
is expecting or what return value is telling me.

Matthew

ps. The patch below fixes up one part of the debug output that
confused me until I remembered reading something about 32-bit writes
needing a barrier to be completed in two parts:

Index: mpii.c
===================================================================
RCS file: /src/datum/openbsd/cvs/src/sys/dev/pci/mpii.c,v
@@ -2771,16 +2773,18 @@ mpii_start(struct mpii_softc *sc, struct
        descr.vf_id = sc->sc_vf_id;
        descr.smid = ccb->ccb_smid;

+#if defined(__LP64__)
        DNPRINTF(MPII_D_RW, "%s:   MPII_REQ_DESCR_POST_LOW (0x%08x) write "
-           "0x%08lx\n", DEVNAME(sc), MPII_REQ_DESCR_POST_LOW, *rdp);
-
-       DNPRINTF(MPII_D_RW, "%s:   MPII_REQ_DESCR_POST_HIGH (0x%08x) write "
-           "0x%08lx\n", DEVNAME(sc), MPII_REQ_DESCR_POST_HIGH, *(rdp+1));
+           "0x%016lx\n", DEVNAME(sc), MPII_REQ_DESCR_POST_LOW, *rdp);

-#if defined(__LP64__)
        bus_space_write_raw_8(sc->sc_iot, sc->sc_ioh,
            MPII_REQ_DESCR_POST_LOW, *rdp);
 #else
+       DNPRINTF(MPII_D_RW, "%s:   MPII_REQ_DESCR_POST_LOW (0x%08x) write "
+           "0x%08lx\n", DEVNAME(sc), MPII_REQ_DESCR_POST_LOW, rdp[0]);
+       DNPRINTF(MPII_D_RW, "%s:   MPII_REQ_DESCR_POST_HIGH (0x%08x) write "
+           "0x%08lx\n", DEVNAME(sc), MPII_REQ_DESCR_POST_HIGH, rdp[1]);
+
        mtx_enter(&sc->sc_req_mtx);
        bus_space_write_raw_4(sc->sc_iot, sc->sc_ioh,
            MPII_REQ_DESCR_POST_LOW, rdp[0]);

Reply via email to