** Description changed:

  Hello,
  
  In short, we faced an issue with a huge IO wait on a bionic Ubuntu 4.15.0-118.119-generic kernel.
  This is the full list of processes and the kernel functions they were stuck in [0].
  
  The main issue can probably be summarized by these perf reports:
  * first, identify that the CPUs are stuck in idle because of something [1]
  * second, see which kernel function seems to block the kswapd0 and kswapd1 processes [2].
  
  We could see that this seems to be the mutex_lock in the bch_mca_scan
  function [3].
  
  After running the command:
  
   | sudo bash -c "echo 1 > /sys/fs/bcache/f1a1e8cb-3e6b-40ea-852e-583c48d0c2b8/internal/btree_shrinker_disabled"
  
  The server started to respond normally and the IO wait dropped
  significantly.
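
The workaround above targets one cache set by UUID. As a minimal sketch, assuming every cache set under /sys/fs/bcache exposes the same internal/btree_shrinker_disabled file, the helper below applies the setting to all of them. The function name and the optional second argument (an alternative sysfs root, handy for a dry run against a scratch directory) are hypothetical and not part of bcache; the function writes directly, so it has to run as root.

```shell
#!/bin/sh
# Sketch: write 0 or 1 to btree_shrinker_disabled for every cache set.
# The second argument is an assumption added for dry runs; it defaults
# to the real sysfs location.

set_btree_shrinker_disabled() {
    value=$1
    root=${2:-/sys/fs/bcache}

    case "$value" in
        0|1) ;;
        *)
            echo "usage: set_btree_shrinker_disabled 0|1 [sysfs_root]" >&2
            return 2
            ;;
    esac

    found=0
    for knob in "$root"/*/internal/btree_shrinker_disabled; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -f "$knob" ] || continue
        found=1
        printf '%s\n' "$value" > "$knob" || return 1
        echo "$knob = $value"
    done

    if [ "$found" -ne 1 ]; then
        echo "no bcache cache set found under $root" >&2
        return 1
    fi
}
```

Calling `set_btree_shrinker_disabled 1` disables the shrinker on every cache set found, and `set_btree_shrinker_disabled 0` re-enables it.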
  
  Here is a trace of the bcache-related locks in the kernel, obtained
  with some bpfcc-tools [4].
  
  klockstat-bpfcc -c bch_ -i 5 -s 3
  
  The trace was run in parallel with the following command sequence:
  
  echo "Shrinker disabled: $(date)"; sleep 60;
  echo "Enabling shrinker: $(date)"; echo 0 | sudo tee /sys/block/bcache0/bcache/cache/internal/btree_shrinker_disabled; sleep 60;
  echo "Disabling shrinker: $(date)"; echo 1 | sudo tee /sys/block/bcache0/bcache/cache/internal/btree_shrinker_disabled; sleep 60;
  echo "End of test: $(date)"
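
The timed sequence above can also be wrapped in a small function, which makes it easier to repeat with a different pause or to rehearse against a scratch file. This is only a sketch: the function name and its two optional arguments (the path to the knob and the pause in seconds) are hypothetical, and it writes to the file directly instead of going through `sudo tee`, so it needs to run as root on a real system.

```shell
#!/bin/sh
# Sketch of the timed shrinker test. Both arguments are assumptions added
# for illustration; the defaults match the path and 60 s pause used above.

run_shrinker_test() {
    knob=${1:-/sys/block/bcache0/bcache/cache/internal/btree_shrinker_disabled}
    pause=${2:-60}

    # Phase 1: shrinker is expected to be disabled already.
    echo "Shrinker disabled: $(date)"
    sleep "$pause"

    # Phase 2: enable the shrinker (0 = not disabled).
    echo "Enabling shrinker: $(date)"
    echo 0 > "$knob" || return 1
    sleep "$pause"

    # Phase 3: disable it again (1 = disabled).
    echo "Disabling shrinker: $(date)"
    echo 1 > "$knob" || return 1
    sleep "$pause"

    echo "End of test: $(date)"
}
```

Running `run_shrinker_test` with no arguments reproduces the three 60-second phases; the printed timestamps can then be lined up with the klockstat intervals.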
  
+ Digging further, we reduced the memory allocated to a VM on the server by 20 GB.
+ * The bcache btree size fluctuation seems "normal" [5]
+ * I noticed that, when the shrinker was enabled, a lot of time was spent in the locks during "bch_btree_insert_node".
+ 
+ I decided to check whether one of the functions called during
+ bch_btree_insert_node was taking longer than usual when the shrinker was
+ enabled.
+ 
+ I finally found the "funclatency" tool and tried to take the same approach I had with klockstat [7]. However, that was inconclusive: I could see that bch_btree_insert_node was barely called during the whole duration of the test.
+ This made me think that the amount of time it spends in the lock is more likely due to another process acquiring the lock.
+ 
+ I'm going to have another go with perf/klockstat/funclatency,
+ focused on bch_mca_scan and the functions called there.
+ 
+ Also, here are some memory-related metrics [8].
+ 
  
  [0]: https://pastebin.ubuntu.com/p/QYXPdsMCWC/
  [1]: https://pastebin.ubuntu.com/p/BFsvF7H54r/
  [2]: https://pastebin.ubuntu.com/p/35qdsHYHf5/
  [3]: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/md/bcache/btree.c?h=Ubuntu-4.15.0-118.119#n674
  [4]: https://pastebin.ubuntu.com/p/qhyqP35fCw/
+ [5]: https://pastebin.ubuntu.com/p/McjxxqTVjn/
+ [6]: https://pastebin.ubuntu.com/p/KmrnW4Ng8F/
+ [7]: https://pastebin.ubuntu.com/p/fSX4c7tTFV/
+ [8]: https://pastebin.ubuntu.com/p/CZgXkgKhmJ/
  
  ====================
  $ cat /proc/version_signature
  Ubuntu 4.15.0-118.119-generic 4.15.18
  
  ProblemType: Bug
  DistroRelease: Ubuntu 18.04
  Package: linux-image-4.15.0-118-generic 4.15.0-118.119
  ProcVersionSignature: Ubuntu 4.15.0-118.119-generic 4.15.18
  Uname: Linux 4.15.0-118-generic x86_64
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116,  1 Sep 29 10:04 seq
   crw-rw---- 1 root audio 116, 33 Sep 29 10:04 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
  ApportVersion: 2.20.9-0ubuntu7.16
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  Date: Tue Oct  6 20:36:18 2020
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
  MachineType: HP ProLiant DL380 G7
  PciMultimedia:
  
  ProcFB: 0 radeondrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-118-generic root=UUID=c6ad1629-a506-4043-a339-6d57f0708d12 ro console=ttyS1,115200 nosplash
  RelatedPackageVersions:
   linux-restricted-modules-4.15.0-118-generic N/A
   linux-backports-modules-4.15.0-118-generic  N/A
   linux-firmware                              1.173.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
  SourcePackage: linux
  UpgradeStatus: Upgraded to bionic on 2019-09-27 (375 days ago)
  dmi.bios.date: 05/05/2011
  dmi.bios.vendor: HP
  dmi.bios.version: P67
  dmi.chassis.type: 23
  dmi.chassis.vendor: HP
  dmi.modalias: dmi:bvnHP:bvrP67:bd05/05/2011:svnHP:pnProLiantDL380G7:pvr:cvnHP:ct23:cvr:
  dmi.product.family: ProLiant
  dmi.product.name: ProLiant DL380 G7
  dmi.sys.vendor: HP

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  Issue with bcache bch_mca_scan causing huge IO wait

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
