[Bug 1841132] [NEW] mpt3sas - storage controller resets under heavy disk io

Drew Woodard Thu, 22 Aug 2019 20:50:47 -0700

Public bug reported:

[summary]
when a server running ubuntu 18.04 with an lsi sas controller experiences high 
disk io there is a chance the storage controller will reset
it can take weeks or months for the first reset to occur, but once the 
controller resets it will keep resetting every few seconds to few minutes, 
dramatically degrading disk io
the server must be rebooted to restore the controller to a normal state


[hardware configuration]
server: dell poweredge r7415, purchased 2019-02
cpu/chipset: amd epyc naples
storage controller: "dell hba330 mini" with chipset "lsi sas3008"
drives: 4x samsung 860 pro 2TB ssd

[software configuration]
ubuntu 18.04 server
mdadm raid6
all firmware is fully updated (bios 1.9.3) (hba330 16.17.00.03) (ssd rvm01b6q)

[what happened]
server was operating as a vm host for months without issue
one day the syslog was flooded with messages like "mpt3sas_cm0: sending diag 
reset !!" and "Power-on or device reset occurred", along with unusably-slow 
disk io
the server was removed from production and I looked for a way to reproduce the 
issue
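
To put a number on the flood of messages, a small log scanner can help. The following is a minimal sketch, not a tool used in this report. It assumes the two substrings quoted above appear verbatim in each log line; the surrounding line format, the function names, and the /var/log/syslog default path are assumptions.

```python
import re
from collections import Counter

# Substrings quoted in the report. The "cm" index is matched loosely in case
# a system has more than one controller (an assumption).
PATTERNS = {
    "diag_reset": re.compile(r"mpt3sas_cm\d+: sending diag reset"),
    "device_reset": re.compile(r"Power-on or device reset occurred"),
}


def count_reset_messages(lines):
    """Count how many lines match each reset-related pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def count_in_file(path="/var/log/syslog"):
    """Scan a log file; the default path is an assumption for Ubuntu."""
    with open(path, errors="replace") as log:
        return count_reset_messages(log)
```

Calling count_in_file() before and after a stress run would show whether the controller entered the resetting state during that run.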

[how to reproduce the issue]
there are probably many ways to produce this issue; the hackish way I found to 
reliably reproduce it was:
have the four ssds in a mdadm raid6 with ext4 filesystem
create three 500GB files containing random data
open three terminals. one calculates md5sum of file1 in a loop, another does 
the same for file2, the third does a copy of file3 to file3-temp in a loop
the number of files is arbitrary, the goal is just to generate a lot of disk io 
on files too large to be cached in memory
then initiate an array check with "/usr/share/mdadm/checkarray -a" to cause 
even more drive thrashing
within 1-15min the controller will enter the broken state. the longest I ever 
saw it take was 30min. I reproduced this several times
rebooting the server restores the controller to a normal state
if the server is not rebooted and the controller is left in this broken state 
eventually drives will fall out of the array, and sometimes array/filesystem 
corruption will occur
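
The three-terminal workload above can be approximated in a single script. This is a hedged sketch, not the commands actually used in the report: the function names and the iterations parameter are hypothetical, and the array check is still started separately with "/usr/share/mdadm/checkarray -a".

```python
import hashlib
import shutil
import threading

CHUNK = 1024 * 1024  # read files in 1 MiB chunks


def md5_of_file(path):
    """Return the md5 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_loop(path, stop, iterations=None):
    """Hash one file repeatedly (stands in for the md5sum terminals)."""
    done = 0
    while not stop.is_set() and (iterations is None or done < iterations):
        md5_of_file(path)
        done += 1
    return done


def copy_loop(src, dst, stop, iterations=None):
    """Copy src over dst repeatedly (stands in for the copy terminal)."""
    done = 0
    while not stop.is_set() and (iterations is None or done < iterations):
        shutil.copyfile(src, dst)
        done += 1
    return done


def run_stress(hash_paths, copy_src, copy_dst, iterations=None):
    """Run one hashing thread per path plus one copy thread.

    With iterations=None the loops run until interrupted with Ctrl-C.
    """
    stop = threading.Event()
    workers = [
        threading.Thread(target=hash_loop, args=(p, stop, iterations), daemon=True)
        for p in hash_paths
    ]
    workers.append(
        threading.Thread(
            target=copy_loop, args=(copy_src, copy_dst, stop, iterations), daemon=True
        )
    )
    for w in workers:
        w.start()
    try:
        for w in workers:
            w.join()
    except KeyboardInterrupt:
        # Ask the loops to stop; daemon threads end when the process exits.
        stop.set()
```

For example, run_stress(["file1", "file2"], "file3", "file3-temp") mirrors the layout described above. As in the report, the files need to be too large to fit in the page cache, otherwise most reads never reach the controller.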

[why this is being reported here]
It's unlikely I am exceeding limits of the hardware since this server chassis 
can hold 24 drives and I am only using 4. The controller specs indicate I 
should not hit pcie bandwidth limits until at least 16 drives.
My first thought was that the lsi controller firmware was at fault since it 
has been historically buggy; however I reproduced this with the newest 
firmware "16.17.00.03" and the previous version "15.17.09.06" (versions may be 
dell-specific).
I then tried the most recent motherboard bios "1.9.3", and downgraded to 
"1.9.2", no change.
I then wanted to eliminate the possibility of a bad drive. swapped out all 4 
drives with different ones of the same model, no change.
I then upgraded from the standard 18.04 kernel to the newer backported hwe 
kernel, which also came with a newer mpt3sas driver, no change.
I then ran the same test on the same array but with rhel 8. To my surprise I 
could no longer reproduce the issue.
-
tl;dr version:
ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00) storage controller 
breaks in 1-10min
ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00) storage 
controller breaks in 1-15min, max 30min
rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00) same stress test on same 
array for 19h, no errors
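
When comparing systems like this, it helps to record the kernel and driver versions the same way on each one. The sketch below is one possible way to do that, under the assumption that the loaded mpt3sas module exposes its version at /sys/module/mpt3sas/version; the function names and the sys_root parameter are hypothetical.

```python
import platform
from pathlib import Path


def read_module_version(module="mpt3sas", sys_root="/sys/module"):
    """Return the version string sysfs reports for a module, or None."""
    path = Path(sys_root) / module / "version"
    try:
        return path.read_text().strip()
    except OSError:
        # Module not loaded, or it does not expose a version file.
        return None


def describe_environment():
    """Collect the two values compared in the summary above."""
    return {
        "kernel": platform.release(),
        "mpt3sas": read_module_version(),
    }
```

On a machine without the controller, the "mpt3sas" entry is simply None.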

[caveats]
Server os misconfiguration is possible, however this is a rather basic vm host 
running kvm and no 3rd-party packages.
I can't conclusively prove this isn't a hardware fault since I don't have a 
second unused identical server to test on right now, however the fact that the 
problem can be easily reproduced under ubuntu but not under rhel seems 
noteworthy.
There is another similar bug (LP: #1810781); I didn't post there because it's 
already marked as fixed.
There is also a debian bug (Debian #926202) that encountered this on kernel 
4.19.0, but I'm unable to tell if it's the same issue.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: bionic hba330 mpt3sas sas3008

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1841132

