** Changed in: linux (Ubuntu Noble)
       Status: In Progress => Fix Committed

** Changed in: linux (Ubuntu Oracular)
       Status: In Progress => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2089327

Title:
  By always inlining _compound_head(), clone() sees 3%+ performance
  increase

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  BugLink: https://bugs.launchpad.net/bugs/2089327

  [Impact]

  _compound_head() is called frequently during clone()-heavy workloads with
  CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y set, often enough that always inlining
  it is worthwhile for a 3%+ performance improvement during clone().

  Over the lifecycle of Noble and Oracular, this could save significant amounts
  of CPU time during clone(), and with it a large amount of electricity. We
  should always inline _compound_head() and take advantage of the performance
  boost.

  [Fix]

  This was fixed in 6.12-rc1 by:

  commit ef5f379de302884b9b7ad9b62587a942a9f0bb55
  Author: David Hildenbrand <da...@redhat.com>
  Date:  Tue Aug 20 14:22:10 2024 +0200
  Subject: mm: always inline _compound_head() with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
  Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ef5f379de302884b9b7ad9b62587a942a9f0bb55

  This commit is intended to offset the performance loss caused by:

   c0bff412e67b ("mm: allow anon exclusive check over hugetlb tail pages")

  which landed in 6.10-rc1, but the change is generic enough that Noble users
  would benefit from the fix as well. It gives both Noble and Oracular a 3%+
  improvement.
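
  To illustrate what "always inline" means here, the following is a minimal
  user-space sketch, not the kernel code. The struct layout, the
  ALWAYS_INLINE macro, and the names set_compound_head() and
  compound_head_sketch() are assumptions made for illustration. The kernel
  spells the attribute __always_inline, and the real _compound_head() also
  handles the fake-head case used by the vmemmap optimisation, which is
  omitted here.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct page (assumption, not the real layout). */
struct page {
    uintptr_t compound_head; /* bit 0 set => tail page; remaining bits => head pointer */
};

/* Written out for GCC/Clang; the kernel provides a similar __always_inline macro. */
#define ALWAYS_INLINE static inline __attribute__((always_inline))

/* Hypothetical helper for this sketch: mark `tail` as a tail page of `head`. */
static void set_compound_head(struct page *tail, struct page *head)
{
    assert(((uintptr_t)head & 1) == 0); /* bit 0 must be free to carry the tag */
    tail->compound_head = (uintptr_t)head | 1;
}

/*
 * Forced inline: the compiler expands the body at every call site instead of
 * emitting a call, regardless of its own inlining heuristics.
 */
ALWAYS_INLINE const struct page *compound_head_sketch(const struct page *page)
{
    uintptr_t head = page->compound_head;

    if (head & 1)                               /* tail page: strip the tag bit */
        return (const struct page *)(head - 1);
    return page;                                /* head or non-compound page */
}
```

  Calling compound_head_sketch() on a tail page returns its head, and calling
  it on a head page returns the page itself. Without the attribute, the
  compiler is free to keep the function out of line, which costs a call and
  return on every lookup.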

  [Testcase]

  clone()-heavy workloads best demonstrate the performance increase.

  The user who originally requested this runs an Ansible-heavy workload and
  finds that clone() becomes a bottleneck during large Ansible runs against
  thousands of containers and hosts.

  They benchmarked 6.8.0-49-generic against a patched test kernel of the same
  6.8.0-49-generic and found:

  Before:
      08:24:23: Rename subiquity netplan config
      08:36:12: hostendpoint_monitoring: Create log directory (10990)
      = 11m49s

      08:37:59: Rename subiquity netplan config
      08:49:49: hostendpoint_monitoring: Create log directory (10991)
      = 11m50s

  After:
      08:55:16: Rename subiquity netplan config
      09:06:28: hostendpoint_monitoring: Create log directory (10991)
      = 11m12s

      09:08:59: Rename subiquity netplan config
      09:20:22: hostendpoint_monitoring: Create log directory (10991)
      = 11m23s

  Comparing the slowest patched run (11m23s) against the fastest unpatched run
  (11m49s) gives a 3.6%+ performance improvement. This adds up over thousands
  of hosts.
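
  The arithmetic behind that figure can be checked with a small sketch. The
  helper names to_seconds() and percent_saved() are hypothetical; the
  durations are the ones reported above.

```c
#include <assert.h>

/* Convert a wall-clock duration given as minutes and seconds into seconds. */
static int to_seconds(int minutes, int seconds)
{
    return minutes * 60 + seconds;
}

/* Time saved, as a percentage of `before`, when a run drops to `after` seconds. */
static double percent_saved(int before, int after)
{
    assert(before > 0);
    return 100.0 * (before - after) / before;
}
```

  With the values above, percent_saved(to_seconds(11, 49), to_seconds(11, 23))
  is 26 seconds saved out of 709, or roughly 3.67%.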

  I did some basic tests with stress-ng using the clone() stressor.

  I ran:

  $ sudo apt install stress-ng
  $ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics

  Before:
  ubuntu@jammy-test:~$ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics
  stress-ng: info:  [953] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per
  stress-ng: info:  [953]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)
  stress-ng: info:  [953] clone             19919     61.80      2.19    232.84       322.29          84.75        76.06
  stress-ng: info:  [55777] clone             19540     61.17      1.75    229.32       319.42          84.56        75.55
  stress-ng: info:  [107873] clone             19817     62.39      1.92    235.90       317.64          83.33        76.24
  stress-ng: info:  [177572] clone             19763     60.57      0.89    226.55       326.27          86.89        75.10

  After:
  ubuntu@jammy-test:~$ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics
  stress-ng: info:  [914] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per
  stress-ng: info:  [914]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)
  stress-ng: info:  [914] clone             19446     60.67      1.83    229.60       320.50          84.03        76.29
  stress-ng: info:  [67984] clone             19600     60.63      0.90    226.66       323.26          86.13        75.06
  stress-ng: info:  [117843] clone             19665     60.64      0.98    226.97       324.27          86.27        75.18
  stress-ng: info:  [167831] clone             19306     61.22      1.20    227.39       315.38          84.46        74.68

  These numbers are a bit fuzzier, but it is about 3% extra bogo ops.

  There is a test kernel available in the below ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/sf401086-test

  If you install it, you too will get a 3%+ performance improvement on
  clone()-heavy workloads.

  [Where problems could occur]

  We are inlining a hotly used function in the clone() syscall call path. This
  should increase performance by avoiding the function call overhead on each
  call to _compound_head(), without much of a downside apart from slightly
  increased binary size and the inability to livepatch the function.

  I checked on cscope, and _compound_head is called from:

  compound_head()
  page_folio()

  both in page-flags.h as #defines. This is going to have a minuscule footprint
  change.

  The risk of regression is well worth the 3%+ performance gain.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2089327/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
