** Changed in: linux (Ubuntu Noble) Status: In Progress => Fix Committed
** Changed in: linux (Ubuntu Oracular) Status: In Progress => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2089327

Title:
  By always inlining _compound_head(), clone() sees 3%+ performance
  increase

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  BugLink: https://bugs.launchpad.net/bugs/2089327

  [Impact]

  _compound_head() is called frequently during clone() heavy workloads
  with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y set, often enough that
  always inlining it yields a 3%+ performance improvement during
  clone().

  Over the lifecycle of Noble and Oracular, this could save significant
  amounts of CPU time during clone(), and with it a large amount of
  electricity.

  We should always inline _compound_head() and take advantage of the
  performance boost.

  [Fix]

  This was fixed in 6.12-rc1 by:

  commit ef5f379de302884b9b7ad9b62587a942a9f0bb55
  Author: David Hildenbrand <da...@redhat.com>
  Date: Tue Aug 20 14:22:10 2024 +0200
  Subject: mm: always inline _compound_head() with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
  Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ef5f379de302884b9b7ad9b62587a942a9f0bb55

  This commit is intended to offset the performance loss caused by:

  c0bff412e67b ("mm: allow anon exclusive check over hugetlb tail pages")

  which landed in 6.10-rc1, but the change is generic enough that Noble
  users would benefit from the fix as well. The fix gives both Noble and
  Oracular a 3%+ improvement.

  [Testcase]

  clone() heavy workloads are best to show the performance increase.

  The user who originally requested this runs an Ansible-heavy workload
  and found that clone() becomes a bottleneck during large Ansible runs
  against thousands of containers and hosts.
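As a rough illustration of what a clone() heavy workload looks like, here is a minimal Python sketch that forks short-lived children in a loop and reports the rate. It is not part of the original report or its test plan. The function name time_forks and the iteration count are hypothetical, and it assumes a Linux system where os.fork() ends up in the kernel's clone()/fork path (typically the case with glibc).

```python
import os
import time


def time_forks(count: int) -> float:
    """Fork `count` short-lived children and return the elapsed wall time in seconds."""
    start = time.perf_counter()
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            # Child: exit immediately, skipping Python cleanup handlers.
            os._exit(0)
        # Parent: reap the child so zombies do not accumulate.
        os.waitpid(pid, 0)
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 2000  # hypothetical iteration count; tune for your machine
    elapsed = time_forks(n)
    print(f"{n} forks in {elapsed:.3f}s ({n / elapsed:.0f} forks/s)")
```

Running the same script on an unpatched and a patched kernel, several times each, would give a crude comparison. Run-to-run noise can easily be of the same order as a 3% difference, so treat single runs with caution.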

  The user benchmarked 6.8.0-49-generic against a patched test kernel
  of the same 6.8.0-49-generic and found:

  Before:

  08:24:23: Rename subiquity netplan config
  08:36:12: hostendpoint_monitoring: Create log directory (10990)
  = 11m49s

  08:37:59: Rename subiquity netplan config
  08:49:49: hostendpoint_monitoring: Create log directory (10991)
  = 11m50s

  After:

  08:55:16: Rename subiquity netplan config
  09:06:28: hostendpoint_monitoring: Create log directory (10991)
  = 11m12s

  09:08:59: Rename subiquity netplan config
  09:20:22: hostendpoint_monitoring: Create log directory (10991)
  = 11m23s

  Taking the slowest patched run (11m23s, 683 seconds) against the
  fastest unpatched run (11m49s, 709 seconds) gives a 3.6%+ performance
  improvement. This adds up over thousands of hosts.

  I did some basic tests with stress-ng using the clone() stressor.

  I ran:

  $ sudo apt install stress-ng
  $ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics

  Before:

  ubuntu@jammy-test:~$ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics
  stress-ng: info: [953]     stressor  bogo ops  real time  usr time  sys time   bogo ops/s      bogo ops/s  CPU used per
  stress-ng: info: [953]                            (secs)    (secs)    (secs)  (real time)  (usr+sys time)  instance (%)
  stress-ng: info: [953]     clone        19919      61.80      2.19    232.84       322.29           84.75         76.06
  stress-ng: info: [55777]   clone        19540      61.17      1.75    229.32       319.42           84.56         75.55
  stress-ng: info: [107873]  clone        19817      62.39      1.92    235.90       317.64           83.33         76.24
  stress-ng: info: [177572]  clone        19763      60.57      0.89    226.55       326.27           86.89         75.10

  After:

  ubuntu@jammy-test:~$ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics
  stress-ng: info: [914]     stressor  bogo ops  real time  usr time  sys time   bogo ops/s      bogo ops/s  CPU used per
  stress-ng: info: [914]                            (secs)    (secs)    (secs)  (real time)  (usr+sys time)  instance (%)
  stress-ng: info: [914]     clone        19446      60.67      1.83    229.60       320.50           84.03         76.29
  stress-ng: info: [67984]   clone        19600      60.63      0.90    226.66       323.26           86.13         75.06
  stress-ng: info: [117843]  clone        19665      60.64      0.98    226.97       324.27           86.27         75.18
  stress-ng: info: [167831]  clone        19306      61.22      1.20    227.39       315.38           84.46         74.68

  These numbers are a bit more fuzzy, but it's about 3% extra bogo ops.

  There is a test kernel available in the below ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/sf401086-test

  Installing it should give a 3%+ performance improvement on clone()
  heavy workloads.

  [Where problems could occur]

  We are inlining a hot function in the clone() syscall callpath. This
  should improve performance by avoiding the function call overhead on
  each call to _compound_head(), without much of a downside apart from
  slightly increased binary size and the inability to livepatch the
  function.

  I checked with cscope, and _compound_head() is called from:

  compound_head()
  page_folio()

  both of which are #defines in page-flags.h. The change in footprint is
  going to be minuscule.

  The risk of regression is well worth the 3%+ performance gain.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2089327/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp