On 2/2/2019 12:02, Karl Denninger wrote:
> I recently started having some really oddball things happening under
> stress. This coincided with the machine being updated to 11.2-STABLE
> (FreeBSD 11.2-STABLE #1 r342918:) from 11.1.
>
> Specifically, I get "errors" like this:
>
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
> length 131072 SMID 269 Aborting command 0xfffffe0001179110
> mps0: Sending reset from mpssas_send_abort for target ID 37
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
> length 131072 SMID 924 terminated ioc 804b loginfo 31140000 scsi 0 state
> c xfer 0
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
> length 131072 SMID 161 terminated ioc 804b loginfo 31140000 scsi 0 state
> c xfer 0
> mps0: Unfreezing devq for target ID 37
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
> (da12:mps0:0:37:0): Retrying command
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: Command timeout
> (da12:mps0:0:37:0): Retrying command
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
> (da12:mps0:0:37:0): Retrying command
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: SCSI Status Error
> (da12:mps0:0:37:0): SCSI status: Check Condition
> (da12:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
> reset, or bus device reset occurred)
> (da12:mps0:0:37:0): Retrying command (per sense data)
>
> The "Unit Attention" implies the drive reset. It only occurs on certain
> drives under very heavy load (e.g. a scrub.) I've managed to provoke it
> on two different brands of disk across multiple firmware and capacities,
> however, which tends to point away from a drive firmware problem.
>
> A look at the pool data shows /no/ errors (e.g. no checksum problems,
> etc) and a look at the disk itself (using smartctl) shows no problems
> either -- whatever is going on here the adapter is recovering from it
> without any data corruption or loss registered on *either end*!
>
> The configuration is an older SuperMicro Xeon board (X8DTL-IF) and shows:
>
> mps0: <Avago Technologies (LSI) SAS2008> port 0xc000-0xc0ff mem
> 0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3
> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
> mps0: IOCCapabilities:
> 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
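
As a reading aid for the log quoted above, the READ(10) CDBs can be decoded by hand or with a few lines of Python. This is only a sketch: the function name decode_read10 is made up for illustration, the field layout is the standard 10-byte READ(10) format (opcode 0x28, 4-byte big-endian LBA in bytes 2-5, 2-byte transfer length in bytes 7-8), and the 512-byte logical block size is an assumption that happens to agree with the "length 131072" the driver prints.

```python
def decode_read10(cdb_hex: str, block_size: int = 512) -> dict:
    """Decode a READ(10) CDB given as space-separated hex bytes.

    block_size is an assumption (512-byte logical blocks); the log
    itself does not state the drive's sector size.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length in blocks
    return {"lba": lba, "blocks": blocks, "bytes": blocks * block_size}


# The three distinct CDBs that appear in the quoted log.
cdbs = [
    "28 00 af 82 ba 08 00 01 00 00",
    "28 00 af 82 bb 08 00 01 00 00",
    "28 00 af 82 bc 08 00 01 00 00",
]

for cdb in cdbs:
    info = decode_read10(cdb)
    print(f"LBA {info['lba']:#010x}  blocks {info['blocks']}  bytes {info['bytes']}")
```

Under that assumption, each command is a 256-block (128 KiB) read, and the three LBAs are exactly 256 blocks apart, i.e. three back-to-back sequential reads, which would be consistent with the load coming from a scrub.
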

After considerable additional work, this looks increasingly like either a missed interrupt or a command getting lost between the host adapter and the expander. I'm going to turn the driver debug level up and see if I can capture more information. Whatever is behind this, however, it is almost certainly related to something that changed between 11.1 and 11.2, as I never saw these errors on the 11.1-STABLE build.

-- 
Karl Denninger
k...@denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/