Re: [PATCH v3 5/5] alloc_tag: config to store page allocation tag refs in page flags
On Mon, Oct 21, 2024 at 8:57 AM Michal Hocko wrote: > > On Mon 21-10-24 08:41:00, Suren Baghdasaryan wrote: > > On Mon, Oct 21, 2024 at 8:34 AM Michal Hocko wrote: > > > > > > On Mon 21-10-24 08:05:16, Suren Baghdasaryan wrote: > > > [...] > > > > Yeah, I thought about adding new values to "mem_profiling" but it's a > > > > bit complicated. Today it's a tristate: > > > > > > > > mem_profiling=0|1|never > > > > > > > > 0/1 means we disable/enable memory profiling by default but the user > > > > can enable it at runtime using a sysctl. This means that we enable > > > > page_ext at boot even when it's set to 0. > > > > "never" means we do not enable page_ext, memory profiling is disabled > > > > and sysctl to enable it will not be exposed. Used when a distribution > > > > has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and does > > > > not want to waste memory on enabling page_ext. > > > > > > > > I can add another option like "pgflags" but then it also needs to > > > > specify whether we should enable or disable profiling by default > > > > (similar to 0|1 for page_ext mode). IOW we will need to encode also > > > > the default state we want. Something like this: > > > > > > > > mem_profiling=0|1|never|pgflags_on|pgflags_off > > > > > > > > Would this be acceptable? > > > > > > Isn't this overcomplicating it? Why cannot you simply go with > > > mem_profiling={0|never|1}[,$YOUR_OPTIONS] > > > > > > While $YOUR_OPTIONS could be compress,fallback,ponies and it would apply > > > or just be ignored if that is not applicable. > > > > Oh, you mean having 2 parts in the parameter with supported options being: > > > > mem_profiling=never > > mem_profiling=0 > > mem_profiling=1 > > mem_profiling=0,pgflags > > mem_profiling=1,pgflags > > > > Did I understand correctly? If so then yes, this should work. > > yes. I would just not call it pgflags because that just doesn't really > tell what the option is to anybody but kernel developers. You could also > have an option to override the default (disable profiling) failure strategy. Ok, how about "compressed" instead? Like this: mem_profiling=0,compressed > -- > Michal Hocko > SUSE Labs
Re: [PATCH v3 5/5] alloc_tag: config to store page allocation tag refs in page flags
On Mon 21-10-24 09:16:14, Suren Baghdasaryan wrote: > On Mon, Oct 21, 2024 at 8:57 AM Michal Hocko wrote: > > > > On Mon 21-10-24 08:41:00, Suren Baghdasaryan wrote: > > > On Mon, Oct 21, 2024 at 8:34 AM Michal Hocko wrote: > > > > > > > > On Mon 21-10-24 08:05:16, Suren Baghdasaryan wrote: > > > > [...] > > > > > Yeah, I thought about adding new values to "mem_profiling" but it's a > > > > > bit complicated. Today it's a tristate: > > > > > > > > > > mem_profiling=0|1|never > > > > > > > > > > 0/1 means we disable/enable memory profiling by default but the user > > > > > can enable it at runtime using a sysctl. This means that we enable > > > > > page_ext at boot even when it's set to 0. > > > > > "never" means we do not enable page_ext, memory profiling is disabled > > > > > and sysctl to enable it will not be exposed. Used when a distribution > > > > > has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and does > > > > > not want to waste memory on enabling page_ext. > > > > > > > > > > I can add another option like "pgflags" but then it also needs to > > > > > specify whether we should enable or disable profiling by default > > > > > (similar to 0|1 for page_ext mode). IOW we will need to encode also > > > > > the default state we want. Something like this: > > > > > > > > > > mem_profiling=0|1|never|pgflags_on|pgflags_off > > > > > > > > > > Would this be acceptable? > > > > > > > > Isn't this overcomplicating it? Why cannot you simply go with > > > > mem_profiling={0|never|1}[,$YOUR_OPTIONS] > > > > > > > > While $YOUR_OPTIONS could be compress,fallback,ponies and it would apply > > > > or just be ignored if that is not applicable. > > > > > > Oh, you mean having 2 parts in the parameter with supported options being: > > > > > > mem_profiling=never > > > mem_profiling=0 > > > mem_profiling=1 > > > mem_profiling=0,pgflags > > > mem_profiling=1,pgflags > > > > > > Did I understand correctly? If so then yes, this should work. > > > > yes. I would just not call it pgflags because that just doesn't really > > tell what the option is to anybody but kernel developers. You could also > > have an option to override the default (disable profiling) failure strategy. > > Ok, how about "compressed" instead? Like this: > > mem_profiling=0,compressed Sounds good to me. And just to repeat, I do not really care about specific name but let's just stay away from something as specific as page flags because that is really not helping to understand the purpose but rather the underlying mechanism which is not telling much to most users outside of kernel developers. -- Michal Hocko SUSE Labs
Re: [PATCH v3 5/5] alloc_tag: config to store page allocation tag refs in page flags
On Mon, Oct 21, 2024 at 2:13 AM David Hildenbrand wrote: > > > > Am 21.10.24 um 09:26 schrieb Michal Hocko: > > On Fri 18-10-24 14:57:26, Suren Baghdasaryan wrote: > >> On Fri, Oct 18, 2024 at 10:45 AM Suren Baghdasaryan > >> wrote: > >>> > >>> On Fri, Oct 18, 2024 at 10:08 AM Michal Hocko wrote: > >>> > >>> Automatic fallback is possible during boot, when we decide whether to > >>> enable page extensions or not. So, if during boot we decide to disable > >>> page extensions and use page flags, we can't go back and re-enable > >>> page extensions after boot is complete. Since there is a possibility > >>> that we run out of page flags at runtime when we load a new module, > >>> this leaves this case when we can't reference the module tags and we > >>> can't fall back to page extensions, so we have to disable memory > >>> profiling. > >>> I could keep page extensions always on just in case this happens but > >>> that's a lot of memory waste to handle a rare case... > >> > >> After thinking more about this, I suggest a couple of changes that > >> IMHO would make configuration simpler: > >> 1. Change the CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to an early boot > >> parameter. > > > > This makes much more sense! > > > >> Today we have a "mem_profiling" parameter to enable/disable > >> memory profiling. I suggest adding "mem_profiling_use_pgflags" to > >> switch the current behavior of using page extensions to use page > >> flags. > > > > I do not want to bikeshed about this but to me it would make more sense > > to have an extension paramater to mem_profiling and call it something > > like compress or similar so that page flags are not really carved into > > naming. The docuemntation then can explain that the copression cannot be > > always guaranteed and it might fail so this is more of a optimistic and > > potentially failing optimization that might need to be dropped in some > > usege scenarios. > > Maybe we can reuse the existing parameter (e.g., tristate). Only makes sense > if > we don't expect too many other modes though :) Yeah, I thought about adding new values to "mem_profiling" but it's a bit complicated. Today it's a tristate: mem_profiling=0|1|never 0/1 means we disable/enable memory profiling by default but the user can enable it at runtime using a sysctl. This means that we enable page_ext at boot even when it's set to 0. "never" means we do not enable page_ext, memory profiling is disabled and sysctl to enable it will not be exposed. Used when a distribution has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and does not want to waste memory on enabling page_ext. I can add another option like "pgflags" but then it also needs to specify whether we should enable or disable profiling by default (similar to 0|1 for page_ext mode). IOW we will need to encode also the default state we want. Something like this: mem_profiling=0|1|never|pgflags_on|pgflags_off Would this be acceptable? > > > > >> We keep the current behavior of using page extensions as > >> default (mem_profiling_use_pgflags=0) because it always works even > >> though it has higher overhead. > > > > Yes this seems to be a safe default. > > Agreed. > > > > >> 2. No auto-fallback. If mem_profiling_use_pgflags=1 and we don't have > >> enough page flags (at boot time or later when we load a module), we > >> simply disable memory profiling with a warning. > > Sounds reasonable to me. > > -- > Cheers, > > David / dhildenb >
Re: [PATCH v3 5/5] alloc_tag: config to store page allocation tag refs in page flags
On Mon 21-10-24 08:41:00, Suren Baghdasaryan wrote: > On Mon, Oct 21, 2024 at 8:34 AM Michal Hocko wrote: > > > > On Mon 21-10-24 08:05:16, Suren Baghdasaryan wrote: > > [...] > > > Yeah, I thought about adding new values to "mem_profiling" but it's a > > > bit complicated. Today it's a tristate: > > > > > > mem_profiling=0|1|never > > > > > > 0/1 means we disable/enable memory profiling by default but the user > > > can enable it at runtime using a sysctl. This means that we enable > > > page_ext at boot even when it's set to 0. > > > "never" means we do not enable page_ext, memory profiling is disabled > > > and sysctl to enable it will not be exposed. Used when a distribution > > > has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and does > > > not want to waste memory on enabling page_ext. > > > > > > I can add another option like "pgflags" but then it also needs to > > > specify whether we should enable or disable profiling by default > > > (similar to 0|1 for page_ext mode). IOW we will need to encode also > > > the default state we want. Something like this: > > > > > > mem_profiling=0|1|never|pgflags_on|pgflags_off > > > > > > Would this be acceptable? > > > > Isn't this overcomplicating it? Why cannot you simply go with > > mem_profiling={0|never|1}[,$YOUR_OPTIONS] > > > > While $YOUR_OPTIONS could be compress,fallback,ponies and it would apply > > or just be ignored if that is not applicable. > > Oh, you mean having 2 parts in the parameter with supported options being: > > mem_profiling=never > mem_profiling=0 > mem_profiling=1 > mem_profiling=0,pgflags > mem_profiling=1,pgflags > > Did I understand correctly? If so then yes, this should work. yes. I would just not call it pgflags because that just doesn't really tell what the option is to anybody but kernel developers. You could also have an option to override the default (disable profiling) failure strategy. -- Michal Hocko SUSE Labs
Re: [PATCH v3 5/5] alloc_tag: config to store page allocation tag refs in page flags
On Fri 18-10-24 10:45:39, Suren Baghdasaryan wrote: > On Fri, Oct 18, 2024 at 10:08 AM Michal Hocko wrote: > > > > On Fri 18-10-24 09:04:24, Suren Baghdasaryan wrote: > > > On Fri, Oct 18, 2024 at 6:03 AM Michal Hocko wrote: > > > > > > > > On Tue 15-10-24 08:58:59, Suren Baghdasaryan wrote: > > > > > On Tue, Oct 15, 2024 at 8:42 AM David Hildenbrand > > > > > wrote: > > > > [...] > > > > > > Right, I think what John is concerned about (and me as well) is that > > > > > > once a new feature really needs a page flag, there will be objection > > > > > > like "no you can't, we need them for allocation tags otherwise that > > > > > > feature will be degraded". > > > > > > > > > > I do understand your concern but IMHO the possibility of degrading a > > > > > feature should not be a reason to always operate at degraded capacity > > > > > (which is what we have today). If one is really concerned about > > > > > possible future regression they can set > > > > > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS=n and keep what we have today. That's > > > > > why I'm strongly advocating that we do need > > > > > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS so that the user has control over how > > > > > this scarce resource is used. > > > > > > > > I really do not think users will know how/why to setup this and I > > > > wouldn't > > > > even bother them thinking about that at all TBH. > > > > > > > > This is an implementation detail. It is fine to reuse unused flags space > > > > as a storage as a performance optimization but why do you want users to > > > > bother with that? Why would they ever want to say N here? > > > > > > In this patch you can find a couple of warnings that look like this: > > > > > > pr_warn("With module %s there are too many tags to fit in %d page flag > > > bits. Memory profiling is disabled!\n", mod->name, > > > NR_UNUSED_PAGEFLAG_BITS); > > > emitted when we run out of page flag bits during a module loading, > > > > > > pr_err("%s: alignment %lu is incompatible with allocation tag > > > indexing, disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS", mod->name, > > > align); > > > emitted when the arch-specific section alignment is incompatible with > > > alloc_tag indexing. > > > > You are asking users to workaround implementation issue by configuration > > which sounds like a really bad idea. Why cannot you make the fallback > > automatic? > > Automatic fallback is possible during boot, when we decide whether to > enable page extensions or not. So, if during boot we decide to disable > page extensions and use page flags, we can't go back and re-enable > page extensions after boot is complete. Since there is a possibility > that we run out of page flags at runtime when we load a new module, > this leaves this case when we can't reference the module tags and we > can't fall back to page extensions, so we have to disable memory > profiling. Right, I do understand (I guess) the challenge. I am just arguing that it makes really no sense to tell user to recompile the kernel with a CONFIG_FOO to workaround this limitation. Please note that many users of this feature will simply use a precompiled (e.g. distribution) kernels. Once you force somebody to recompile with CONFIG_PGALLOC_TAG_USE_PAGEFLAGS=n they are not going back to a more memory optimal implementation. Just my 2cents -- Michal Hocko SUSE Labs
Re: [PATCH v3 5/5] alloc_tag: config to store page allocation tag refs in page flags
On Fri 18-10-24 14:57:26, Suren Baghdasaryan wrote: > On Fri, Oct 18, 2024 at 10:45 AM Suren Baghdasaryan wrote: > > > > On Fri, Oct 18, 2024 at 10:08 AM Michal Hocko wrote: > > > > > > On Fri 18-10-24 09:04:24, Suren Baghdasaryan wrote: > > > > On Fri, Oct 18, 2024 at 6:03 AM Michal Hocko wrote: > > > > > > > > > > On Tue 15-10-24 08:58:59, Suren Baghdasaryan wrote: > > > > > > On Tue, Oct 15, 2024 at 8:42 AM David Hildenbrand > > > > > > wrote: > > > > > [...] > > > > > > > Right, I think what John is concerned about (and me as well) is > > > > > > > that > > > > > > > once a new feature really needs a page flag, there will be > > > > > > > objection > > > > > > > like "no you can't, we need them for allocation tags otherwise > > > > > > > that > > > > > > > feature will be degraded". > > > > > > > > > > > > I do understand your concern but IMHO the possibility of degrading a > > > > > > feature should not be a reason to always operate at degraded > > > > > > capacity > > > > > > (which is what we have today). If one is really concerned about > > > > > > possible future regression they can set > > > > > > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS=n and keep what we have today. > > > > > > That's > > > > > > why I'm strongly advocating that we do need > > > > > > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS so that the user has control over > > > > > > how > > > > > > this scarce resource is used. > > > > > > > > > > I really do not think users will know how/why to setup this and I > > > > > wouldn't > > > > > even bother them thinking about that at all TBH. > > > > > > > > > > This is an implementation detail. It is fine to reuse unused flags > > > > > space > > > > > as a storage as a performance optimization but why do you want users > > > > > to > > > > > bother with that? Why would they ever want to say N here? > > > > > > > > In this patch you can find a couple of warnings that look like this: > > > > > > > > pr_warn("With module %s there are too many tags to fit in %d page flag > > > > bits. Memory profiling is disabled!\n", mod->name, > > > > NR_UNUSED_PAGEFLAG_BITS); > > > > emitted when we run out of page flag bits during a module loading, > > > > > > > > pr_err("%s: alignment %lu is incompatible with allocation tag > > > > indexing, disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS", mod->name, > > > > align); > > > > emitted when the arch-specific section alignment is incompatible with > > > > alloc_tag indexing. > > > > > > You are asking users to workaround implementation issue by configuration > > > which sounds like a really bad idea. Why cannot you make the fallback > > > automatic? > > > > Automatic fallback is possible during boot, when we decide whether to > > enable page extensions or not. So, if during boot we decide to disable > > page extensions and use page flags, we can't go back and re-enable > > page extensions after boot is complete. Since there is a possibility > > that we run out of page flags at runtime when we load a new module, > > this leaves this case when we can't reference the module tags and we > > can't fall back to page extensions, so we have to disable memory > > profiling. > > I could keep page extensions always on just in case this happens but > > that's a lot of memory waste to handle a rare case... > > After thinking more about this, I suggest a couple of changes that > IMHO would make configuration simpler: > 1. Change the CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to an early boot > parameter. This makes much more sense! > Today we have a "mem_profiling" parameter to enable/disable > memory profiling. I suggest adding "mem_profiling_use_pgflags" to > switch the current behavior of using page extensions to use page > flags. I do not want to bikeshed about this but to me it would make more sense to have an extension paramater to mem_profiling and call it something like compress or similar so that page flags are not really carved into naming. The docuemntation then can explain that the copression cannot be always guaranteed and it might fail so this is more of a optimistic and potentially failing optimization that might need to be dropped in some usege scenarios. > We keep the current behavior of using page extensions as > default (mem_profiling_use_pgflags=0) because it always works even > though it has higher overhead. Yes this seems to be a safe default. > 2. No auto-fallback. If mem_profiling_use_pgflags=1 and we don't have > enough page flags (at boot time or later when we load a module), we > simply disable memory profiling with a warning. No strong opinion on this. -- Michal Hocko SUSE Labs
Re: [PATCH v4 1/6] Add AutoFDO support for Clang build
Thanks for the detailed suggestions! My comments are inlined below. Best regards, -Rong On Sun, Oct 20, 2024 at 9:33 AM Masahiro Yamada wrote: > > On Tue, Oct 15, 2024 at 6:33 AM Rong Xu wrote: > > > > +Customization > > += > > + > > +You can enable or disable AutoFDO build for individual file and > > directories by > > +adding a line similar to the following to the respective kernel Makefile: > > > Perhaps, it might be worth mentioning that kernel space objects are > covered by default. > > Then, people would understand ':= y' will be less common than ':= n'. > Good point! How about I change to the following: " The default CONFIG_AUTOFDO_CLANG setting covers kernel space objects for AutoFDO builds. One can, however, enable or disable AutoFDO build for individual file and directories by adding a line similar to the following to the respective kernel Makefile ... > > > > > + > > +- For enabling a single file (e.g. foo.o) :: > > + > > + AUTOFDO_PROFILE_foo.o := y > > + > > +- For enabling all files in one directory :: > > + > > + AUTOFDO_PROFILE := y > > + > > +- For disabling one file :: > > + > > + AUTOFDO_PROFILE_foo.o := n > > + > > +- For disabling all files in one directory :: > > + > > + AUTOFDO_PROFILE := n > > + > > > > > > +3) Run the load tests. The '-c' option in perf specifies the sample > > + event period. We suggest using a suitable prime number, like 59, > > + for this purpose. > > + > > + - For Intel platforms:: > > + > > + $ perf record -e BR_INST_RETIRED.NEAR_TAKEN:k -a -N -b -c -o > > -- > > + > > + - For AMD platforms: For Intel platforms: > > > I guess this is a copy-paste mistake. > > > For AMD platforms: For Intel platforms: > >-> > > For AMD platforms: Thanks for catching this! Will fix this. > > > > > > > > + (https://github.com/google/autofdo), version v0.30.1 or later. > > > Please one space instead of two after the comma. > Will fix it. > > > > > > > > diff --git a/scripts/Makefile.autofdo b/scripts/Makefile.autofdo > > new file mode 100644 > > index ..1c9f224bc221 > > --- /dev/null > > +++ b/scripts/Makefile.autofdo > > @@ -0,0 +1,23 @@ > > +# SPDX-License-Identifier: GPL-2.0 > > + > > +# Enable available and selected Clang AutoFDO features. > > + > > +CFLAGS_AUTOFDO_CLANG := -fdebug-info-for-profiling -mllvm > > -enable-fs-discriminator=true -mllvm -improved-fs-discriminator=true > > + > > +# If CONFIG_DEBUG_INFO is not enabled, set -gmlt option. > > > Meaningless comment. It explains too obvious code. Will remove this line of comment. > > > > +ifndef CONFIG_DEBUG_INFO > > + CFLAGS_AUTOFDO_CLANG += -gmlt > > +endif > > > > > > diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib > > index 01a9f567d5af..e85d6ac31bd9 100644 > > --- a/scripts/Makefile.lib > > +++ b/scripts/Makefile.lib > > @@ -191,6 +191,16 @@ _c_flags += $(if $(patsubst n%,, \ > > -D__KCSAN_INSTRUMENT_BARRIERS__) > > endif > > > > +# > > +# Enable Clang's AutoFDO build flags for a file or directory depending on > > +# variables AUTOFDO_PROFILE_obj.o and AUTOFDO_PROFILE. > > +# > > > This comment would give the wrong understanding that this flag is opt-in. > > > The comment for KASAN correctly describes that it is enabled by default, > and can be opted out using KASAN_SANITIZE_*. > I can change to use KASAN's expression: " # Enable Clang's AutoFDO build flags for kernel except some files or directories # we don't want to enable (depends on variables AUTOFDO_PROFILE_obj.o and AUTOFDO_PROFILE) " > > > > > -- > Best Regards > > > Masahiro Yamada
Re: [PATCH v3 5/5] alloc_tag: config to store page allocation tag refs in page flags
Am 21.10.24 um 09:26 schrieb Michal Hocko: On Fri 18-10-24 14:57:26, Suren Baghdasaryan wrote: On Fri, Oct 18, 2024 at 10:45 AM Suren Baghdasaryan wrote: On Fri, Oct 18, 2024 at 10:08 AM Michal Hocko wrote: Automatic fallback is possible during boot, when we decide whether to enable page extensions or not. So, if during boot we decide to disable page extensions and use page flags, we can't go back and re-enable page extensions after boot is complete. Since there is a possibility that we run out of page flags at runtime when we load a new module, this leaves this case when we can't reference the module tags and we can't fall back to page extensions, so we have to disable memory profiling. I could keep page extensions always on just in case this happens but that's a lot of memory waste to handle a rare case... After thinking more about this, I suggest a couple of changes that IMHO would make configuration simpler: 1. Change the CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to an early boot parameter. This makes much more sense! Today we have a "mem_profiling" parameter to enable/disable memory profiling. I suggest adding "mem_profiling_use_pgflags" to switch the current behavior of using page extensions to use page flags. I do not want to bikeshed about this but to me it would make more sense to have an extension paramater to mem_profiling and call it something like compress or similar so that page flags are not really carved into naming. The docuemntation then can explain that the copression cannot be always guaranteed and it might fail so this is more of a optimistic and potentially failing optimization that might need to be dropped in some usege scenarios. Maybe we can reuse the existing parameter (e.g., tristate). Only makes sense if we don't expect too many other modes though :) We keep the current behavior of using page extensions as default (mem_profiling_use_pgflags=0) because it always works even though it has higher overhead. Yes this seems to be a safe default. Agreed. 2. No auto-fallback. If mem_profiling_use_pgflags=1 and we don't have enough page flags (at boot time or later when we load a module), we simply disable memory profiling with a warning. Sounds reasonable to me. -- Cheers, David / dhildenb
Re: [PATCH v2 5/7] virtio-mem: s390 support
Am 21.10.24 um 08:33 schrieb Christian Borntraeger: Am 15.10.24 um 10:37 schrieb Heiko Carstens: On Mon, Oct 14, 2024 at 09:16:45PM +0200, David Hildenbrand wrote: On 14.10.24 20:48, Heiko Carstens wrote: The cover letter is clearer on that: "One remaining work item is kdump support for virtio-mem memory. This will be sent out separately once initial support landed." I had a prototype, but need to spend some time to clean it up -- or find someone to hand it over to clean it up. I have to chose wisely what I work on nowadays, and cannot spend that time if the basic support won't get ACKed. For many production use cases it certainly needs to exist. But note that virtio-mem can be used with ZONE_MOVABLE, in which case mostly only user data (e.g., pagecache,anon) ends up on hotplugged memory, that would get excluded from makedumpfile in the default configs either way. It's not uncommon to let kdump support be added later (e.g., AMD SNP variants). I'll leave it up to kvm folks to decide if we need kdump support from the beginning or if we are good with the current implementation. If David confirms that he has a plan for this, I am fine with a staged approach for upstream. I do have a plan and a even a semi-working prototype that I am currently improving. In summary, the virtio-mem driver in kdump mode can report ranges with plugged memory to the core so we can include them in the elfcore hdr. That is the easy part. The "challenge" is when the virtio-mem driver is built as a module and gets loaded after building/allocating the elfcore hdr (which happens when creating /proc/vmcore). We have to defer detecting+adding the ranges to the time /proc/vmcore gets opened. Not super complicated, but needs some thought to get it done in a clean way / with minimal churn. -- Cheers, David / dhildenb
Re: [PATCH v2 1/7] s390/kdump: implement is_kdump_kernel()
Hi David, David Hildenbrand writes: > Makes sense, so it boils down to either > > bool is_kdump_kernel(void) > { > return oldmem_data.start; > } > > Which means is_kdump_kernel() can be "false" even though /proc/vmcore is > available or > > bool is_kdump_kernel(void) > { > return dump_available(); > } > > Which means is_kdump_kernel() can never be "false" if /proc/vmcore is > available. There is the chance of is_kdump_kernel() being "true" if > "elfcorehdr_alloc()" fails with -ENODEV. Do you consider is_kdump_kernel() returning "true" in case of zfcpdump or nvme/eckd+ldipl dump (also called NGDump) okay ? Because dump_available() would return "true" in such cases too. If yes then please explain why, i might have missed a previous explanation from you. I'm afraid everyone will make wrong assumptions while reading the name of is_kdump_kernel() and assuming that it only applies to kdump or kdump-alike dumps (like stand-alone kdump), and, therefore, introduce bugs because the name of the function conveys the wrong idea to code readers. I consider dump_available() as a superset of is_kdump_kernel() and, therefore, to me they are not equivalent. I have the feeling you consider is_kdump_kernel() equivalent to "/proc/vmcore" being present and not really saying anything about whether kdump is active ? Regards Alex
Re: [PATCH v3 5/5] alloc_tag: config to store page allocation tag refs in page flags
On 10/21/24 9:32 AM, Suren Baghdasaryan wrote: On Mon, Oct 21, 2024 at 9:23 AM Michal Hocko wrote: On Mon 21-10-24 09:16:14, Suren Baghdasaryan wrote: On Mon, Oct 21, 2024 at 8:57 AM Michal Hocko wrote: On Mon 21-10-24 08:41:00, Suren Baghdasaryan wrote: On Mon, Oct 21, 2024 at 8:34 AM Michal Hocko wrote: On Mon 21-10-24 08:05:16, Suren Baghdasaryan wrote: [...] Yeah, I thought about adding new values to "mem_profiling" but it's a bit complicated. Today it's a tristate: mem_profiling=0|1|never 0/1 means we disable/enable memory profiling by default but the user can enable it at runtime using a sysctl. This means that we enable page_ext at boot even when it's set to 0. "never" means we do not enable page_ext, memory profiling is disabled and sysctl to enable it will not be exposed. Used when a distribution has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and does not want to waste memory on enabling page_ext. I can add another option like "pgflags" but then it also needs to specify whether we should enable or disable profiling by default (similar to 0|1 for page_ext mode). IOW we will need to encode also the default state we want. Something like this: mem_profiling=0|1|never|pgflags_on|pgflags_off Would this be acceptable? Isn't this overcomplicating it? Why cannot you simply go with mem_profiling={0|never|1}[,$YOUR_OPTIONS] While $YOUR_OPTIONS could be compress,fallback,ponies and it would apply or just be ignored if that is not applicable. Oh, you mean having 2 parts in the parameter with supported options being: mem_profiling=never mem_profiling=0 mem_profiling=1 mem_profiling=0,pgflags mem_profiling=1,pgflags Did I understand correctly? If so then yes, this should work. yes. I would just not call it pgflags because that just doesn't really tell what the option is to anybody but kernel developers. You could also have an option to override the default (disable profiling) failure strategy. Ok, how about "compressed" instead? Like this: mem_profiling=0,compressed Yes. The configuration options all fit together nicely now, and the naming seems exactly right as well. And no more "you must rebuild your kernel" messages. Great! thanks, -- John Hubbard Sounds good to me. And just to repeat, I do not really care about specific name but let's just stay away from something as specific as page flags because that is really not helping to understand the purpose but rather the underlying mechanism which is not telling much to most users outside of kernel developers. Understood. Ok, I'll start changing my patchset to incorporate this feedback and will post the new version this week. Thanks for the input everyone! -- Michal Hocko SUSE Labs
Re: [PATCH v3 5/5] alloc_tag: config to store page allocation tag refs in page flags
On Mon 21-10-24 08:05:16, Suren Baghdasaryan wrote: [...] > Yeah, I thought about adding new values to "mem_profiling" but it's a > bit complicated. Today it's a tristate: > > mem_profiling=0|1|never > > 0/1 means we disable/enable memory profiling by default but the user > can enable it at runtime using a sysctl. This means that we enable > page_ext at boot even when it's set to 0. > "never" means we do not enable page_ext, memory profiling is disabled > and sysctl to enable it will not be exposed. Used when a distribution > has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and does > not want to waste memory on enabling page_ext. > > I can add another option like "pgflags" but then it also needs to > specify whether we should enable or disable profiling by default > (similar to 0|1 for page_ext mode). IOW we will need to encode also > the default state we want. Something like this: > > mem_profiling=0|1|never|pgflags_on|pgflags_off > > Would this be acceptable? Isn't this overcomplicating it? Why cannot you simply go with mem_profiling={0|never|1}[,$YOUR_OPTIONS] While $YOUR_OPTIONS could be compress,fallback,ponies and it would apply or just be ignored if that is not applicable. -- Michal Hocko SUSE Labs
Re: [PATCH v3 5/5] alloc_tag: config to store page allocation tag refs in page flags
On Mon, Oct 21, 2024 at 8:34 AM Michal Hocko wrote: > > On Mon 21-10-24 08:05:16, Suren Baghdasaryan wrote: > [...] > > Yeah, I thought about adding new values to "mem_profiling" but it's a > > bit complicated. Today it's a tristate: > > > > mem_profiling=0|1|never > > > > 0/1 means we disable/enable memory profiling by default but the user > > can enable it at runtime using a sysctl. This means that we enable > > page_ext at boot even when it's set to 0. > > "never" means we do not enable page_ext, memory profiling is disabled > > and sysctl to enable it will not be exposed. Used when a distribution > > has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and does > > not want to waste memory on enabling page_ext. > > > > I can add another option like "pgflags" but then it also needs to > > specify whether we should enable or disable profiling by default > > (similar to 0|1 for page_ext mode). IOW we will need to encode also > > the default state we want. Something like this: > > > > mem_profiling=0|1|never|pgflags_on|pgflags_off > > > > Would this be acceptable? > > Isn't this overcomplicating it? Why cannot you simply go with > mem_profiling={0|never|1}[,$YOUR_OPTIONS] > > While $YOUR_OPTIONS could be compress,fallback,ponies and it would apply > or just be ignored if that is not applicable. Oh, you mean having 2 parts in the parameter with supported options being: mem_profiling=never mem_profiling=0 mem_profiling=1 mem_profiling=0,pgflags mem_profiling=1,pgflags Did I understand correctly? If so then yes, this should work. > -- > Michal Hocko > SUSE Labs
Re: [PATCH] docs: Remove redundant word "for"
Thorsten Blum writes: > s/for// > > Signed-off-by: Thorsten Blum > --- > Documentation/maintainer/pull-requests.rst | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/Documentation/maintainer/pull-requests.rst > b/Documentation/maintainer/pull-requests.rst > index 00b200facf67..0d63d9d7e347 100644 > --- a/Documentation/maintainer/pull-requests.rst > +++ b/Documentation/maintainer/pull-requests.rst > @@ -50,7 +50,7 @@ so outline what is contained here, why it should be merged, > and what, if > any, testing has been done. All of this information will end up in the tag > itself, and then in the merge commit that the maintainer makes if/when they > merge the pull request. So write it up well, as it will be in the kernel > -tree for forever. > +tree forever. > Applied, thanks. jon
Re: [PATCH v2 1/7] s390/kdump: implement is_kdump_kernel()
Am 21.10.24 um 14:46 schrieb Alexander Egorenkov: Hi David, David Hildenbrand writes: Makes sense, so it boils down to either bool is_kdump_kernel(void) { return oldmem_data.start; } Which means is_kdump_kernel() can be "false" even though /proc/vmcore is available or bool is_kdump_kernel(void) { return dump_available(); } Which means is_kdump_kernel() can never be "false" if /proc/vmcore is available. There is the chance of is_kdump_kernel() being "true" if "elfcorehdr_alloc()" fails with -ENODEV. Thanks for having another look! Do you consider is_kdump_kernel() returning "true" in case of zfcpdump or nvme/eckd+ldipl dump (also called NGDump) okay ? Because dump_available() would return "true" in such cases too. If yes then please explain why, i might have missed a previous explanation from you. I consider it okay because this is the current behavior after elfcorehdr_alloc() succeeded and set elfcorehdr_addr. Not sure if it is the right think to do, though :) Whatever we do, we should achieve on s390 that the result of is_kdump_kernel() is consistent throughout the booting stages, just like on all other architectures. Right now it goes from false->true when /proc/vmcore gets initialized (and only if it gets initialized properly). I'm afraid everyone will make wrong assumptions while reading the name of is_kdump_kernel() and assuming that it only applies to kdump or kdump-alike dumps (like stand-alone kdump), and, therefore, introduce bugs because the name of the function conveys the wrong idea to code readers. I consider dump_available() as a superset of is_kdump_kernel() and, therefore, to me they are not equivalent. > > I have the feeling you consider is_kdump_kernel() equivalent to "/proc/vmcore" being present and not really saying anything about whether kdump is active ? Yes, but primarily because this is the existing handling. Staring at the powerpc implementation: /* * Return true only when kexec based kernel dump capturing method is used. * This ensures all restritions applied for kdump case are not automatically * applied for fadump case. */ bool is_kdump_kernel(void) { return !is_fadump_active() && elfcorehdr_addr != ELFCORE_ADDR_MAX; } EXPORT_SYMBOL_GPL(is_kdump_kernel); Which was added by commit b098f1c32365304633077d73e4ae21c72d4241b3 Author: Hari Bathini Date: Tue Sep 12 13:59:50 2023 +0530 powerpc/fadump: make is_kdump_kernel() return false when fadump is active Currently, is_kdump_kernel() returns true in crash dump capture kernel for both kdump and fadump crash dump capturing methods, as both these methods set elfcorehdr_addr. Some restrictions enforced for crash dump capture kernel, based on is_kdump_kernel(), are specifically meant for kdump case and not desirable for fadump - eg. IO queues restriction in device drivers. So, define is_kdump_kernel() to return false when f/w assisted dump is active. For my purpose (virtio-mem), it's sufficient to only support "kexec triggered kdump" either way, so I don't care. So for me it's good enough to have bool is_kdump_kernel(void) { return oldmem_data.start; } And trying to document the situation in a comment like powerpc does :) -- Cheers, David / dhildenb
Re: [PATCH v3 5/5] alloc_tag: config to store page allocation tag refs in page flags
On 21.10.24 17:41, Suren Baghdasaryan wrote: On Mon, Oct 21, 2024 at 8:34 AM Michal Hocko wrote: On Mon 21-10-24 08:05:16, Suren Baghdasaryan wrote: [...] Yeah, I thought about adding new values to "mem_profiling" but it's a bit complicated. Today it's a tristate: mem_profiling=0|1|never 0/1 means we disable/enable memory profiling by default but the user can enable it at runtime using a sysctl. This means that we enable page_ext at boot even when it's set to 0. "never" means we do not enable page_ext, memory profiling is disabled and sysctl to enable it will not be exposed. Used when a distribution has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and does not want to waste memory on enabling page_ext. I can add another option like "pgflags" but then it also needs to specify whether we should enable or disable profiling by default (similar to 0|1 for page_ext mode). IOW we will need to encode also the default state we want. Something like this: mem_profiling=0|1|never|pgflags_on|pgflags_off Would this be acceptable? Isn't this overcomplicating it? Why cannot you simply go with mem_profiling={0|never|1}[,$YOUR_OPTIONS] While $YOUR_OPTIONS could be compress,fallback,ponies and it would apply or just be ignored if that is not applicable. Oh, you mean having 2 parts in the parameter with supported options being: mem_profiling=never mem_profiling=0 mem_profiling=1 mem_profiling=0,pgflags mem_profiling=1,pgflags That's also a viable solution indeed. -- Cheers, David / dhildenb
Re: [PATCH v3 5/5] alloc_tag: config to store page allocation tag refs in page flags
On Mon, Oct 21, 2024 at 9:23 AM Michal Hocko wrote: > > On Mon 21-10-24 09:16:14, Suren Baghdasaryan wrote: > > On Mon, Oct 21, 2024 at 8:57 AM Michal Hocko wrote: > > > > > > On Mon 21-10-24 08:41:00, Suren Baghdasaryan wrote: > > > > On Mon, Oct 21, 2024 at 8:34 AM Michal Hocko wrote: > > > > > > > > > > On Mon 21-10-24 08:05:16, Suren Baghdasaryan wrote: > > > > > [...] > > > > > > Yeah, I thought about adding new values to "mem_profiling" but it's > > > > > > a > > > > > > bit complicated. Today it's a tristate: > > > > > > > > > > > > mem_profiling=0|1|never > > > > > > > > > > > > 0/1 means we disable/enable memory profiling by default but the user > > > > > > can enable it at runtime using a sysctl. This means that we enable > > > > > > page_ext at boot even when it's set to 0. > > > > > > "never" means we do not enable page_ext, memory profiling is > > > > > > disabled > > > > > > and sysctl to enable it will not be exposed. Used when a > > > > > > distribution > > > > > > has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and > > > > > > does > > > > > > not want to waste memory on enabling page_ext. > > > > > > > > > > > > I can add another option like "pgflags" but then it also needs to > > > > > > specify whether we should enable or disable profiling by default > > > > > > (similar to 0|1 for page_ext mode). IOW we will need to encode also > > > > > > the default state we want. Something like this: > > > > > > > > > > > > mem_profiling=0|1|never|pgflags_on|pgflags_off > > > > > > > > > > > > Would this be acceptable? > > > > > > > > > > Isn't this overcomplicating it? Why cannot you simply go with > > > > > mem_profiling={0|never|1}[,$YOUR_OPTIONS] > > > > > > > > > > While $YOUR_OPTIONS could be compress,fallback,ponies and it would > > > > > apply > > > > > or just be ignored if that is not applicable. > > > > > > > > Oh, you mean having 2 parts in the parameter with supported options > > > > being: > > > > > > > > mem_profiling=never > > > > mem_profiling=0 > > > > mem_profiling=1 > > > > mem_profiling=0,pgflags > > > > mem_profiling=1,pgflags > > > > > > > > Did I understand correctly? If so then yes, this should work. > > > > > > yes. I would just not call it pgflags because that just doesn't really > > > tell what the option is to anybody but kernel developers. You could also > > > have an option to override the default (disable profiling) failure > > > strategy. > > > > Ok, how about "compressed" instead? Like this: > > > > mem_profiling=0,compressed > > Sounds good to me. And just to repeat, I do not really care about > specific name but let's just stay away from something as specific as > page flags because that is really not helping to understand the purpose > but rather the underlying mechanism which is not telling much to most > users outside of kernel developers. Understood. Ok, I'll start changing my patchset to incorporate this feedback and will post the new version this week. Thanks for the input everyone! > > -- > Michal Hocko > SUSE Labs
Re: [PATCH v4 3/6] Change the symbols order when --ffuntion-sections is enabled
On Sun, Oct 20, 2024 at 7:15 PM Masahiro Yamada wrote: > > On Tue, Oct 15, 2024 at 6:33 AM Rong Xu wrote: > > > > When the -ffunction-sections compiler option is enabled, each function > > is placed in a separate section named .text.function_name rather than > > putting all functions in a single .text section. > > > > However, using -function-sections can cause problems with the > > linker script. The comments included in include/asm-generic/vmlinux.lds.h > > note these issues.: > > “TEXT_MAIN here will match .text.fixup and .text.unlikely if dead > >code elimination is enabled, so these sections should be converted > >to use ".." first.” > > > > It is unclear whether there is a straightforward method for converting > > a suffix to "..". > > > > Why not for ".text.fixup"? > > $ git grep --name-only '\.text\.fixup' | xargs sed -i > 's/\.text\.fixup/.text..fixup/g' > I did not move .text.fixup because it currently groups together with TEXT_MAIN. Yes. For all the kernel annotated sections, we can replace them with a ".." string. For compiler generate strings, like .unlikely, .hot, and .split, we need a compiler change for that (maybe under an option). The process will be long. Or we can use an extra script, like objcopy to change them? > > > I do not know how to rename other sections that are generated by compilers. > > > > > > This patch modifies the order of subsections within the > > text output section when the -ffunction-sections flag is enabled. > > Specifically, it repositions sections with certain fixed patterns (for > > example .text.unlikely) before TEXT_MAIN, ensuring that they are grouped > > and matched together. > > > > Note that the limitation arises because the linker script employs glob > > patterns instead of regular expressions for string matching. While there > > is a method to maintain the current order using complex patterns, this > > significantly complicates the pattern and increases the likelihood of > > errors. > > > > Co-developed-by: Han Shen > > Signed-off-by: Han Shen > > Signed-off-by: Rong Xu > > Suggested-by: Sriraman Tallam > > Suggested-by: Krzysztof Pszeniczny > > --- > > include/asm-generic/vmlinux.lds.h | 17 +++-- > > 1 file changed, 15 insertions(+), 2 deletions(-) > > > > diff --git a/include/asm-generic/vmlinux.lds.h > > b/include/asm-generic/vmlinux.lds.h > > index eeadbaeccf88..5df589c60401 100644 > > --- a/include/asm-generic/vmlinux.lds.h > > +++ b/include/asm-generic/vmlinux.lds.h > > @@ -554,9 +554,21 @@ > > * during second ld run in second ld pass when generating System.map > > * > > * TEXT_MAIN here will match .text.fixup and .text.unlikely if dead > > - * code elimination is enabled, so these sections should be converted > > - * to use ".." first. > > + * code elimination or function-section is enabled. Match these symbols > > + * first when in these builds. > > */ > > +#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) || > > defined(CONFIG_LTO_CLANG) > > +#define TEXT_TEXT \ > > > Why did you do this conditionally? > > You are making this even more unmaintainable. Again, we don't want to change the default build. If you think the change can apply to the default build, we would be happy to remove the condition. > > > > > > > + ALIGN_FUNCTION(); \ > > + *(.text.asan.* .text.tsan.*)\ > > + *(.text.unknown .text.unknown.*)\ > > + *(.text.unlikely .text.unlikely.*) \ > > + . = ALIGN(PAGE_SIZE); \ > > + *(.text.hot .text.hot.*)\ > > + *(TEXT_MAIN .text.fixup)\ > > + NOINSTR_TEXT\ > > + *(.ref.text) > > +#else > > #define TEXT_TEXT \ > > ALIGN_FUNCTION(); \ > > *(.text.hot .text.hot.*)\ > > @@ -566,6 +578,7 @@ > > NOINSTR_TEXT\ > > *(.ref.text)\ > > *(.text.asan.* .text.tsan.*) > > +#endif > > > > > > /* sched.text is aling to function alignment to secure we have same > > -- > > 2.47.0.rc1.288.g06298d1525-goog > > > > > > > -- > Best Regards > Masahiro Yamada
Re: [PATCH v4 4/6] AutoFDO: Enable -ffunction-sections for the AutoFDO build
The answers are the same as the reply in [PATCH v4 5/6] On Sun, Oct 20, 2024 at 7:26 PM Masahiro Yamada wrote: > > On Tue, Oct 15, 2024 at 6:33 AM Rong Xu wrote: > > > > Enable -ffunction-sections by default for the AutoFDO build. > > > > With -ffunction-sections, the compiler places each function in its own > > section named .text.function_name instead of placing all functions in > > the .text section. In the AutoFDO build, this allows the linker to > > utilize profile information to reorganize functions for improved > > utilization of iCache and iTLB. > > > > Co-developed-by: Han Shen > > Signed-off-by: Han Shen > > Signed-off-by: Rong Xu > > Suggested-by: Sriraman Tallam > > --- > > include/asm-generic/vmlinux.lds.h | 37 --- > > scripts/Makefile.autofdo | 2 +- > > 2 files changed, 30 insertions(+), 9 deletions(-) > > > > diff --git a/include/asm-generic/vmlinux.lds.h > > b/include/asm-generic/vmlinux.lds.h > > index 5df589c60401..ace617d1af9b 100644 > > --- a/include/asm-generic/vmlinux.lds.h > > +++ b/include/asm-generic/vmlinux.lds.h > > @@ -95,18 +95,25 @@ > > * With LTO_CLANG, the linker also splits sections by default, so we need > > * these macros to combine the sections during the final link. > > * > > + * With LTO_CLANG, the linker also splits sections by default, so we need > > + * these macros to combine the sections during the final link. > > + * > > * RODATA_MAIN is not used because existing code already defines .rodata.x > > * sections to be brought in with rodata. > > */ > > -#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) || > > defined(CONFIG_LTO_CLANG) > > +#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) || > > defined(CONFIG_LTO_CLANG) || \ > > +defined(CONFIG_AUTOFDO_CLANG) > > #define TEXT_MAIN .text .text.[0-9a-zA-Z_]* > > +#else > > +#define TEXT_MAIN .text > > +#endif > > +#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) || > > defined(CONFIG_LTO_CLANG) > > #define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* > > .data..compoundliteral* .data.$__unnamed_* .data.$L* > > #define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]* > > #define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]* .rodata..L* > > #define BSS_MAIN .bss .bss.[0-9a-zA-Z_]* .bss..L* .bss..compoundliteral* > > #define SBSS_MAIN .sbss .sbss.[0-9a-zA-Z_]* > > #else > > -#define TEXT_MAIN .text > > #define DATA_MAIN .data > > #define SDATA_MAIN .sdata > > #define RODATA_MAIN .rodata > > @@ -549,6 +556,20 @@ > > __cpuidle_text_end = .; \ > > __noinstr_text_end = .; > > > > +#ifdef CONFIG_AUTOFDO_CLANG > > +#define TEXT_HOT \ > > + __hot_text_start = .; \ > > + *(.text.hot .text.hot.*)\ > > + __hot_text_end = .; > > +#define TEXT_UNLIKELY \ > > + __unlikely_text_start = .; \ > > + *(.text.unlikely .text.unlikely.*) \ > > + __unlikely_text_end = .; > > +#else > > +#define TEXT_HOT *(.text.hot .text.hot.*) > > +#define TEXT_UNLIKELY *(.text.unlikely .text.unlikely.*) > > +#endif > > > > Again, why is this conditional? The condition is to ensure that we don't change the default kernel build by any means. The new code will introduce a few new symbols. > > > The only difference is *_start and *_end symbols are defined > when CONFIG_AUTOFDO_CLANG=y. > > And, where are these symbols used? These new symbols are currently unreferenced within the kernel source tree. However, they provide a valuable means of identifying hot and cold sections of text, and how large they are. I think they are useful information. > > > > > > > > > > > > > + > > /* > > * .text section. Map to function alignment to avoid address changes > > * during second ld run in second ld pass when generating System.map > > @@ -557,30 +578,30 @@ > > * code elimination or function-section is enabled. Match these symbols > > * first when in these builds. > > */ > > -#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) || > > defined(CONFIG_LTO_CLANG) > > +#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) || > > defined(CONFIG_LTO_CLANG) || \ > > +defined(CONFIG_AUTOFDO_CLANG) > > #define TEXT_TEXT \ > > ALIGN_FUNCTION(); \ > > *(.text.asan.* .text.tsan.*)\ > > *(.text.unknown .text.unknown.*)\ > > - *(.text.unlikely .text.unlikely.*) \ > > + TEXT_UNLIKELY \ > > . = ALIGN(PAGE_SIZE); \ > > -
Re: [PATCH v4 5/6] AutoFDO: Enable machine function split optimization for AutoFDO
On Sun, Oct 20, 2024 at 8:18 PM Masahiro Yamada wrote: > > On Tue, Oct 15, 2024 at 6:33 AM Rong Xu wrote: > > > > Enable the machine function split optimization for AutoFDO in Clang. > > > > Machine function split (MFS) is a pass in the Clang compiler that > > splits a function into hot and cold parts. The linker groups all > > cold blocks across functions together. This decreases hot code > > fragmentation and improves iCache and iTLB utilization. > > > > MFS requires a profile so this is enabled only for the AutoFDO builds. > > > > Co-developed-by: Han Shen > > Signed-off-by: Han Shen > > Signed-off-by: Rong Xu > > Suggested-by: Sriraman Tallam > > Suggested-by: Krzysztof Pszeniczny > > --- > > include/asm-generic/vmlinux.lds.h | 6 ++ > > scripts/Makefile.autofdo | 2 ++ > > 2 files changed, 8 insertions(+) > > > > diff --git a/include/asm-generic/vmlinux.lds.h > > b/include/asm-generic/vmlinux.lds.h > > index ace617d1af9b..20e46c0917db 100644 > > --- a/include/asm-generic/vmlinux.lds.h > > +++ b/include/asm-generic/vmlinux.lds.h > > @@ -565,9 +565,14 @@ defined(CONFIG_AUTOFDO_CLANG) > > __unlikely_text_start = .; \ > > *(.text.unlikely .text.unlikely.*) \ > > __unlikely_text_end = .; > > +#define TEXT_SPLIT \ > > + __split_text_start = .; \ > > + *(.text.split .text.split.[0-9a-zA-Z_]*)\ > > + __split_text_end = .; > > #else > > #define TEXT_HOT *(.text.hot .text.hot.*) > > #define TEXT_UNLIKELY *(.text.unlikely .text.unlikely.*) > > +#define TEXT_SPLIT > > #endif > > > Why conditional? The condition is to ensure that we don't change the default kernel build by any means. The new code will introduce a few new symbols. > > > Where are __unlikely_text_start and __unlikely_text_end used? These new symbols are currently unreferenced within the kernel source tree. However, they provide a valuable means of identifying hot and cold sections of text, and how large they are. I think they are useful information. > > > > > > > > -- > Best Regards > Masahiro Yamada
Re: [PATCH v4 6/6] Add Propeller configuration for kernel build.
On Sun, Oct 20, 2024 at 10:49 AM Masahiro Yamada wrote: > > Please remove the period at the end of the commit subject. Will fix this. > > > > On Tue, Oct 15, 2024 at 6:34 AM Rong Xu wrote: > > > > Add the build support for using Clang's Propeller optimizer. Like > > AutoFDO, Propeller uses hardware sampling to gather information > > about the frequency of execution of different code paths within a > > binary. This information is then used to guide the compiler's > > optimization decisions, resulting in a more efficient binary. > > > > The support requires a Clang compiler LLVM 19 or later, and the > > create_llvm_prof tool > > (https://github.com/google/autofdo/releases/tag/v0.30.1). This > > submission is limited to x86 platforms that support PMU features > > > "This submission" -> "This commit" Will fix this. > > > > > like LBR on Intel machines and AMD Zen3 BRS. > > > > For Arm, we plan to send patches for SPE-based Propeller when > > AutoFDO for Arm is ready. > > > "we plan to send ..." is not a good description once it is committed. > > This sentence should be moved to the cover letter, or reworked. We will move this sentence to the cover letter. > > > > > > > > > > Here is an example workflow for building an AutoFDO+Propeller > > optimized kernel: > > > > 1) Build the kernel on the HOST machine, with AutoFDO and Propeller > > > Why is the "HOST" capitalized? We will fix this. > > > > >build config > > CONFIG_AUTOFDO_CLANG=y > > CONFIG_PROPELLER_CLANG=y > >then > > $ make LLVM=1 CLANG_AUTOFDO_PROFILE= > > > > “” is the profile collected when doing a non-Propeller > > AutoFDO build. This step builds a kernel that has the same optimization > > level as AutoFDO, plus a metadata section that records basic block > > information. This kernel image runs as fast as an AutoFDO optimized > > kernel. > > > > 2) Install the kernel on test/production machines. > > > > 3) Run the load tests. The '-c' option in perf specifies the sample > >event period. We suggest using a suitable prime number, > >like 59, for this purpose. > >For Intel platforms: > > $ perf record -e BR_INST_RETIRED.NEAR_TAKEN:k -a -N -b -c \ > > -o -- > >For AMD platforms: > > The supported system are: Zen3 with BRS, or Zen4 with amd_lbr_v2 > > # To see if Zen3 support LBR: > > $ cat proc/cpuinfo | grep " brs" > > # To see if Zen4 support LBR: > > $ cat proc/cpuinfo | grep amd_lbr_v2 > > # If the result is yes, then collect the profile using: > > $ perf record --pfm-events RETIRED_TAKEN_BRANCH_INSTRUCTIONS:k -a \ > > -N -b -c -o -- > > > > 4) (Optional) Download the raw perf file to the HOST machine. > > > Same question as above. Will use "host". > > > > > > 5) Generate Propeller profile: > >$ create_llvm_prof --binary= --profile= \ > > --format=propeller --propeller_output_module_name \ > > --out=_cc_profile.txt \ > > --propeller_symorder=_ld_profile.txt > > > >“create_llvm_prof” is the profile conversion tool, and a prebuilt > >binary for linux can be found on > >https://github.com/google/autofdo/releases/tag/v0.30.1 (can also build > >from source). > > > >"" can be something like > >"/home/user/dir/any_string". > > > >This command generates a pair of Propeller profiles: > >"_cc_profile.txt" and > >"_ld_profile.txt". > > > > 6) Rebuild the kernel using the AutoFDO and Propeller profile files. > > CONFIG_AUTOFDO_CLANG=y > > CONFIG_PROPELLER_CLANG=y > >and > > $ make LLVM=1 CLANG_AUTOFDO_PROFILE= \ > > CLANG_PROPELLER_PROFILE_PREFIX= > > > > Co-developed-by: Han Shen > > Signed-off-by: Han Shen > > Signed-off-by: Rong Xu > > Suggested-by: Sriraman Tallam > > Suggested-by: Krzysztof Pszeniczny > > Suggested-by: Nick Desaulniers > > Suggested-by: Stephane Eranian > > > > > > > .. only:: subproject and html > > diff --git a/Documentation/dev-tools/propeller.rst > > b/Documentation/dev-tools/propeller.rst > > new file mode 100644 > > index ..a217354e0f95 > > --- /dev/null > > +++ b/Documentation/dev-tools/propeller.rst > > @@ -0,0 +1,161 @@ > > +.. SPDX-License-Identifier: GPL-2.0 > > + > > += > > +Using Propeller with the Linux kernel > > += > > + > > +This enables Propeller build support for the kernel when using Clang > > +compiler. Propeller is a profile-guided optimization (PGO) method used > > +to optimize binary executables. Like AutoFDO, it utilizes hardware > > +sampling to gather information about the frequency of execution of > > +different code paths within a binary. Unlike AutoFDO, this information > > +is then used right before linking phase to optimize (among others) > > +block layout within and across functions. > > + > > +A few important notes about adopting Propeller optimization: > > + > > +#. Although it can be used as a standalon
Re: [PATCH v4 0/6] Add AutoFDO and Propeller support for Clang build
On Sat, Oct 19, 2024 at 8:25 PM Nathan Chancellor wrote: > > Hi Rong, > > On Fri, Oct 18, 2024 at 11:20:02PM -0700, Rong Xu wrote: > > Thanks to all for the feedback and suggestions! We are ready to make any > > further > > changes needed. Is there anything else we can address for this patch? > > I will reply in a separate thread for visibility but I think one of the > biggest open questions at the moment is trying to find someone to > shepherd this code into mainline. > > > Also, we know it's not easy to test this patch, but if anyone has had a > > chance > > to try building AutoFDO/Propeller kernels with it, we'd really appreciate > > your > > input here. Any confirmation that it works as expected would be very > > helpful. > > I went to take this series for a spin in a virtual machine first as a > smoke test before attempting to boot on bare metal. This was done on a > server with an Intel Xeon Gold 6314U. The kernel booted fine but when I > went to run the command to generate the perf data from the > documentation, I get an error. > > $ perf record -e BR_INST_RETIRED.NEAR_TAKEN:k -a -N -b -c 59 -o > /tmp/perf.data -- make -j$(nproc) O=out mrproper defconfig all > Error: > BR_INST_RETIRED.NEAR_TAKEN:k: PMU Hardware or event type doesn't support > branch stack sampling. > > Do you know if this is expected for a virtual machine setup? I will > attempt to test the series on real hardware here soon, it is currently > tied up with investigating a regression in -next at the moment. We have never tested this patch in a KVM setup. As far as we know, LBR support in KVM is currently limited, and varies depending on the PMU virtualization model: (1) For legacy mode, LBR profiling might work under LBR virtualization (VLBR). However, we have not tested this. (2) For the new "Mediated vPMU passthru' mode, there is no LBR virtualization support at this point. So LBR profiling is not working. I've included Stephance here. He should have more expertise on this topic. > > Cheers, > Nathan