Hi!

We are running some large (100+) Kubernetes clusters on bare-metal
machines, each with an LSI MegaRAID controller, and we have been seeing
these exact problems for about a week (the clusters have been in
production since early June).

We run with this controller:

# lspci | grep -i mega
59:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID Tri-Mode SAS3516 (rev 01)
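Since the megaraid_sas driver shipped with these kernels may differ between
point releases, it could be worth comparing the driver version on a good
(-136) and a bad (-142) node; something along these lines shows it (just a
suggestion, the output depends on the kernel build and on the module being
loaded):

# modinfo megaraid_sas | grep -i version
# cat /sys/module/megaraid_sas/version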


After being in operation and taking production traffic (we are very
IO-intensive) for less than a month, both clusters started behaving weirdly,
with timeouts between processes on different nodes. After much debugging we
found that IO performance had essentially ground to a halt. Here is a quick
summary of what we found together with our external partner (the developer of
our software-defined storage solution):

We analyzed a small subset of nodes encountering the issue. Based on our
initial diagnosis, here are the findings:

- CPU and IO utilization were low, but all systems reported high load
averages.
- All disks looked fine, root and data disks alike. The IO load was very small
with no visible spikes in latency, and none of the devices were unresponsive
or stuck.
- All user-space processes looked fine; none were blocked.
- The systems were slow in general, e.g. installing packages and even writing
to the root disk was very slow.
- A kernel thread dump showed a lot of threads in D state, all stuck writing
to the page cache.
- We used dd to write a small 1M file on the host (NOT on the px device),
which only dirties the page cache and should require no immediate disk IO,
and it showed the same symptoms (see the example commands after this list).
- The system dirty threshold settings do not show anything out of the
ordinary.
- Another common symptom on all the affected nodes is that the page cache
drain appeared to be slow or stuck: the number of pages on the writeback list
is high, but no writeback appeared to be happening.
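
For anyone trying to reproduce this, roughly the following checks show the
symptoms described above (the dd path and size here are just examples, not
the exact ones we used):

# dd if=/dev/zero of=/var/tmp/ddtest bs=1M count=1    # small buffered write; stalls on affected nodes
# grep -E 'Dirty|Writeback' /proc/meminfo             # dirty vs. writeback page counts
# sysctl vm.dirty_ratio vm.dirty_background_ratio     # dirty threshold settings
# ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'     # tasks stuck in D state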

These clusters all run Ubuntu 18.04 LTS but have different kernels depending
on their patch level (node counts per kernel version below):

Cluster A:

 228 4.15.0-140-generic
  74 4.15.0-142-generic
   7 4.15.0-143-generic


Cluster B:

 418 4.15.0-136-generic
  22 4.15.0-144-generic
  12 4.15.0-143-generic
   9 4.15.0-142-generic
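
For reference, the per-kernel node counts above can be gathered with something
like this (assuming kubectl access to each cluster):

# kubectl get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kernelVersion}{"\n"}{end}' | sort | uniq -c | sort -rn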


Cluster A is the one with the worst performance. Cluster B has had similar
problems, but since the majority of its nodes are still on -136 it has not
been hit as badly.

Our only remedies so far have been to downgrade to -136 and/or reboot the
machines. Rebooting the machines while staying on the problematic kernel
works for now, but most likely we will see the same behavior again after a
week or so of taking production traffic.
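
In case it helps others, the downgrade itself is just the standard Ubuntu
procedure, roughly as follows (take the exact GRUB entry title from your own
grub.cfg rather than the example here):

# apt-get install linux-image-4.15.0-136-generic linux-modules-extra-4.15.0-136-generic
# grep menuentry /boot/grub/grub.cfg
(then set GRUB_DEFAULT in /etc/default/grub to that entry, e.g.
GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 4.15.0-136-generic")
# update-grub
# reboot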

https://bugs.launchpad.net/bugs/1928744

Title:
  Disk IO very slow on kernel 4.15.0-142-generic
