Hi! We are running some large (100+ nodes) Kubernetes clusters on bare metal machines, each with an LSI MegaRAID controller, and we have been experiencing these exact problems for about a week (the clusters have been in production since early June).
We run with this controller:

# lspci | grep -i mega
59:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID Tri-Mode SAS3516 (rev 01)

After being in operation and taking production traffic (we are very IO intensive) for less than a month, both clusters started behaving weirdly, with timeouts between processes on different nodes. After much debugging we found that IO performance had essentially ground to a halt. Here is a quick summary of what we found together with our external partner (the developer of our software-defined storage solution). We analyzed a small subset of nodes encountering the issue. Based on our initial diagnosis, the findings are:

- CPU and IO utilization were low, but all systems reported high load averages.
- All disks looked fine - both root and data disks. The IO load was very small, with no visible latency spikes. None of the devices were unresponsive or stuck.
- All user space processes looked fine; none were blocked.
- The systems were slow in general, i.e. installing packages and writing even to the root disk was very slow.
- A kernel thread dump showed a lot of threads in D state, all stuck writing to the page cache.
- We used dd to write a small 1M file on the host (NOT on a px device), requiring no disk IO, and it showed the same symptoms.
- The system dirty threshold settings do not show anything out of the ordinary.
- Another common symptom on all the affected nodes is that the page cache drain appeared to be slow/stuck: the number of pages on the writeback list is high, but no writeback appeared to be happening.

(A rough sketch of the commands behind these checks is included at the end of this message.)

These clusters all run Ubuntu 18.04 LTS but have different kernels depending on their patch level:

Cluster A:
  228 4.15.0-140-generic
   74 4.15.0-142-generic
    7 4.15.0-143-generic

Cluster B:
  418 4.15.0-136-generic
   22 4.15.0-144-generic
   12 4.15.0-143-generic
    9 4.15.0-142-generic

Cluster A is the one with the worst performance. Cluster B has had similar problems, but since the majority of its nodes are on -136 it has not been hit as badly.

Our only remedies so far have been to downgrade to -136 and/or reboot the machines. Rebooting a machine while staying on the problematic kernel works for now, but we will most likely see the same behavior again after a week or so of production traffic.
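For reference, here is roughly how we ran the checks above on an affected node. This is only a sketch, assuming stock sysctl names and a kernel with sysrq enabled; the dd target path /root/ddtest is just an example, not the exact file we used.

Tasks stuck in uninterruptible sleep (D state):
# ps -eo state,pid,comm | awk '$1 == "D"'

Kernel stacks of the blocked tasks (requires sysrq):
# echo w > /proc/sysrq-trigger
# dmesg | tail -n 200

Small buffered write to the root disk; on a healthy node this returns almost instantly since it only dirties the page cache:
# dd if=/dev/zero of=/root/ddtest bs=1M count=1

Dirty threshold settings:
# sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs

Dirty and writeback page counts:
# grep -E '^(Dirty|Writeback)' /proc/meminfo
# grep -E '^nr_(dirty|writeback)' /proc/vmstat

On the affected nodes the dd command above hangs in D state and the writeback counters stay high without draining, which matches the findings in the list.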