Public bug reported:

[summary]
When a server running Ubuntu 18.04 with an LSI SAS controller experiences high disk I/O, there is a chance the storage controller will reset. This can take weeks or months to happen, but once the controller resets it will keep resetting every few seconds or few minutes, dramatically degrading disk I/O. The server must be rebooted to restore the controller to a normal state.

[hardware configuration]
server: Dell PowerEdge R7415, purchased 2019-02
cpu/chipset: AMD EPYC Naples
storage controller: "Dell HBA330 Mini" with chipset "LSI SAS3008"
drives: 4x Samsung 860 Pro 2TB SSD

[software configuration]
Ubuntu 18.04 server
mdadm raid6
all firmware is fully updated (bios 1.9.3) (hba330 16.17.00.03) (ssd rvm01b6q)

[what happened]
The server was operating as a VM host for months without issue. One day the syslog was flooded with messages like "mpt3sas_cm0: sending diag reset !!" and "Power-on or device reset occurred", along with unusably slow disk I/O. The server was removed from production and I looked for a way to reproduce the issue.

[how to reproduce the issue]
There are probably many ways to produce this issue. The hackish way I found to reliably reproduce it was:
- Have the four SSDs in an mdadm raid6 with an ext4 filesystem.
- Create three 500GB files containing random data.
- Open three terminals. One calculates the md5sum of file1 in a loop, another does the same for file2, and the third copies file3 to file3-temp in a loop. The number of files is arbitrary; the goal is just to generate a lot of disk I/O on files too large to be cached in memory.
- Then initiate an array check with "/usr/share/mdadm/checkarray -a" to cause even more drive thrashing.

Within 1-15min the controller will enter the broken state. The longest I ever saw it take was 30min. I reproduced this several times.

Rebooting the server restores the controller to a normal state. If the server is not rebooted and the controller is left in this broken state, drives will eventually fall out of the array, and sometimes array/filesystem corruption will occur.

[why this is being reported here]
It's unlikely I am exceeding the limits of the hardware, since this server chassis can hold 24 drives and I am only using 4. The controller specs indicate I should not hit PCIe bandwidth limits until at least 16 drives.
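
For anyone who wants to retry this without juggling three terminals, the load described under [how to reproduce the issue] could be scripted roughly as below. This is only a sketch, not the exact commands I ran: it uses Python threads in place of separate terminals, and the names md5_of_file, loop_at_least_once, run_load and start_array_check are made up for illustration. The three file paths and the run time are whatever you pass in. Only the checkarray command is taken from the steps above, and it needs root.

```python
import hashlib
import shutil
import subprocess
import sys
import threading
import time

CHUNK = 1024 * 1024  # read 1 MiB at a time so huge files never sit in memory


def md5_of_file(path):
    """Stream a file through MD5, roughly the work `md5sum` does."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()


def loop_at_least_once(deadline, action):
    """Run action() once, then keep repeating it until the deadline passes."""
    while True:
        action()
        if time.monotonic() >= deadline:
            break


def start_array_check():
    """Start the md array check named in the report (needs root)."""
    subprocess.run(["/usr/share/mdadm/checkarray", "-a"], check=True)


def run_load(file1, file2, file3, seconds, on_started=None):
    """Hash file1 and file2 and copy file3 to file3-temp, all concurrently.

    on_started, if given, is called once the three workers are running;
    that is the point at which the report starts the array check.
    """
    deadline = time.monotonic() + seconds
    jobs = [
        lambda: md5_of_file(file1),
        lambda: md5_of_file(file2),
        lambda: shutil.copyfile(file3, file3 + "-temp"),
    ]
    workers = [
        threading.Thread(target=loop_at_least_once, args=(deadline, job))
        for job in jobs
    ]
    for worker in workers:
        worker.start()
    if on_started is not None:
        on_started()
    for worker in workers:
        worker.join()


if __name__ == "__main__":
    if len(sys.argv) == 5:
        run_load(sys.argv[1], sys.argv[2], sys.argv[3], float(sys.argv[4]),
                 on_started=start_array_check)
    else:
        print("usage: repro.py FILE1 FILE2 FILE3 SECONDS")
```

Each worker finishes its current pass before stopping, so with 500GB files the script runs somewhat longer than the requested number of seconds.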

My first thought was that the LSI controller firmware was at fault, since LSI firmware has historically been buggy. However, I reproduced this with the newest firmware "16.17.00.03" and the previous version "15.17.09.06" (versions may be Dell-specific).

I then tried the most recent motherboard BIOS "1.9.3" and downgraded to "1.9.2": no change.

I then wanted to eliminate the possibility of a bad drive, so I swapped out all 4 drives with different ones of the same model: no change.

I then upgraded from the standard 18.04 kernel to the newer backported HWE kernel, which also came with a newer mpt3sas driver: no change.

I then ran the same test on the same array but with RHEL 8. To my surprise, I could no longer reproduce the issue.

tl;dr version:
- ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00): storage controller breaks in 1-10min
- ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00): storage controller breaks in 1-15min, max 30min
- rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00): same stress test on same array for 19h, no errors

[caveats]
Server OS misconfiguration is possible; however, this is a rather basic VM host running KVM and no 3rd-party packages.

I can't conclusively prove this isn't a hardware fault since I don't have a second unused identical server to test on right now. However, the fact that the problem can be easily reproduced under Ubuntu but not under RHEL seems noteworthy.

There is another bug (LP: #1810781) similar to this. I didn't post there because it's already marked as fixed. There is also a Debian bug (Debian #926202) that encountered this on kernel 4.19.0, but I'm unable to tell if it's the same issue.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

** Tags: bionic hba330 mpt3sas sas3008
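
To decide whether a given test run hit the broken state, one option is to count the two messages quoted in this report in the captured log. The sketch below assumes only that those substrings appear somewhere in the affected lines; it does not assume any particular syslog layout. The names RESET_MARKERS, count_reset_messages and controller_looks_broken are hypothetical, and the log file path is whatever you pass on the command line.

```python
import sys

# Substrings quoted in the report; the rest of each log line is not assumed.
RESET_MARKERS = (
    "sending diag reset",
    "Power-on or device reset occurred",
)


def count_reset_messages(lines):
    """Return {marker: number of lines that contain it}."""
    counts = {marker: 0 for marker in RESET_MARKERS}
    for line in lines:
        for marker in RESET_MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


def controller_looks_broken(lines):
    """True if at least one line carries one of the reset markers."""
    return any(count_reset_messages(lines).values())


if __name__ == "__main__":
    if len(sys.argv) == 2:
        # Undecodable bytes are replaced rather than aborting the scan.
        with open(sys.argv[1], errors="replace") as log:
            print(count_reset_messages(log))
    else:
        print("usage: count_resets.py LOGFILE")
```

A clean run, like the 19h RHEL 8 test, should report zero for both markers.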

https://bugs.launchpad.net/bugs/1841132

Title:
  mpt3sas - storage controller resets under heavy disk io