Re: [PATCH v4 6/6] Add Propeller configuration for kernel build.
On Wed, Oct 23, 2024 at 4:25 PM Arnd Bergmann wrote:
>
> On Wed, Oct 23, 2024, at 07:06, Masahiro Yamada wrote:
> > On Tue, Oct 22, 2024 at 9:00 AM Rong Xu wrote:
> >
> >> > > +===
> >> > > +
> >> > > +Configure the kernel with::
> >> > > +
> >> > > +  CONFIG_AUTOFDO_CLANG=y
> >> >
> >> >
> >> > This is automatically met due to "depends on AUTOFDO_CLANG".
> >>
> >> Agreed. But we will remove the dependency from PROPELLER_CLANG to
> >> AUTOFDO_CLANG.
> >> So we will keep the part.
> >
> >
> > You can replace "depends on AUTOFDO_CLANG" with
> > "imply AUTOFDO_CLANG" if it is sensible.
> >
> > Up to you.
>
> I don't think we should ever encourage the use of 'imply'
> because it is almost always used incorrectly.

If we are able to delete the 'imply' keyword, Kconfig would be a bit cleaner.

In most cases, it can be replaced with 'default'.

--
Best Regards
Masahiro Yamada
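For reference, a minimal Kconfig sketch of the two alternatives discussed in this thread. Only the PROPELLER_CLANG and AUTOFDO_CLANG symbols come from the discussion; the prompt strings are illustrative, not the actual Kconfig entries:

	# Variant A: 'imply' on the dependent symbol. PROPELLER_CLANG=y
	# defaults AUTOFDO_CLANG to y, but the user may still turn it off.
	config PROPELLER_CLANG
		bool "Enable Clang's Propeller build"
		imply AUTOFDO_CLANG

	# Variant B: roughly the same effect expressed with 'default', as
	# suggested above; the hint moves onto the implied symbol instead.
	config AUTOFDO_CLANG
		bool "Enable Clang's AutoFDO build"
		default y if PROPELLER_CLANG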
Re: [PATCH v2 1/7] s390/kdump: implement is_kdump_kernel()
On 23.10.24 09:42, Heiko Carstens wrote:
> On Mon, Oct 21, 2024 at 04:45:59PM +0200, David Hildenbrand wrote:
>> For my purpose (virtio-mem), it's sufficient to only support "kexec
>> triggered kdump" either way, so I don't care.
>>
>> So for me it's good enough to have
>>
>> bool is_kdump_kernel(void)
>> {
>>	return oldmem_data.start;
>> }
>>
>> And trying to document the situation in a comment like powerpc does :)
>
> Then let's go forward with this, since as Alexander wrote, this is
> returning what is actually happening. If this is not sufficient or
> something breaks we can still address this.

Yes, I'll send this change separately from the other virtio-mem stuff
out today.

--
Cheers,

David / dhildenb
Re: [PATCH v2 1/7] s390/kdump: implement is_kdump_kernel()
On Mon, Oct 21, 2024 at 04:45:59PM +0200, David Hildenbrand wrote:
> For my purpose (virtio-mem), it's sufficient to only support "kexec
> triggered kdump" either way, so I don't care.
>
> So for me it's good enough to have
>
> bool is_kdump_kernel(void)
> {
>	return oldmem_data.start;
> }
>
> And trying to document the situation in a comment like powerpc does :)

Then let's go forward with this, since as Alexander wrote, this is
returning what is actually happening. If this is not sufficient or
something breaks we can still address this.
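For reference, a sketch of what the helper discussed above could look like with a powerpc-style documentation comment attached. This is illustrative only, not the patch that was eventually sent; the comment wording is an assumption based on the thread:

	/*
	 * Return true only when a kexec-based dump kernel is running.
	 * On s390, oldmem_data.start is only set in that case, so it can
	 * serve as the indicator; the stand-alone dump tools are not
	 * covered by this check.
	 */
	bool is_kdump_kernel(void)
	{
		return oldmem_data.start;
	}
	EXPORT_SYMBOL_GPL(is_kdump_kernel);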
Re: [PATCH net] Documentation: ieee802154: fix grammar
Hi Leo,

leocst...@gmail.com wrote on Tue, 22 Oct 2024 21:12:01 -0700:

> Fix grammar where it improves readability.
>
> Signed-off-by: Leo Stone

Reviewed-by: Miquel Raynal

Thanks,
Miquèl
Re: [PATCH v4 6/6] Add Propeller configuration for kernel build.
On Tue, Oct 22, 2024 at 9:00 AM Rong Xu wrote:
> > > +===
> > > +
> > > +Configure the kernel with::
> > > +
> > > +  CONFIG_AUTOFDO_CLANG=y
> >
> >
> > This is automatically met due to "depends on AUTOFDO_CLANG".
>
> Agreed. But we will remove the dependency from PROPELLER_CLANG to
> AUTOFDO_CLANG.
> So we will keep the part.

You can replace "depends on AUTOFDO_CLANG" with
"imply AUTOFDO_CLANG" if it is sensible.

Up to you.

--
Best Regards
Masahiro Yamada
Re: [PATCH v4 6/6] Add Propeller configuration for kernel build.
On Wed, Oct 23, 2024, at 07:06, Masahiro Yamada wrote:
> On Tue, Oct 22, 2024 at 9:00 AM Rong Xu wrote:
>
>> > > +===
>> > > +
>> > > +Configure the kernel with::
>> > > +
>> > > +  CONFIG_AUTOFDO_CLANG=y
>> >
>> >
>> > This is automatically met due to "depends on AUTOFDO_CLANG".
>>
>> Agreed. But we will remove the dependency from PROPELLER_CLANG to
>> AUTOFDO_CLANG.
>> So we will keep the part.
>
>
> You can replace "depends on AUTOFDO_CLANG" with
> "imply AUTOFDO_CLANG" if it is sensible.
>
> Up to you.

I don't think we should ever encourage the use of 'imply'
because it is almost always used incorrectly.

     Arnd
Re: [PATCH v2 1/7] s390/kdump: implement is_kdump_kernel()
Hi David,

David Hildenbrand writes:

> Staring at the powerpc implementation:
>
> /*
>  * Return true only when kexec based kernel dump capturing method is used.
>  * This ensures all restrictions applied for kdump case are not automatically
>  * applied for fadump case.
>  */
> bool is_kdump_kernel(void)
> {
>	return !is_fadump_active() && elfcorehdr_addr != ELFCORE_ADDR_MAX;
> }
> EXPORT_SYMBOL_GPL(is_kdump_kernel);

Thanks for the pointer. I would say power's version is semantically
equivalent to what I have in mind for s390 :) If a dump kernel is
running, but not a stand-alone one (apart from sa kdump), then it's a
kdump kernel.

Regards
Alex
Re: [PATCH v4 6/6] Add Propeller configuration for kernel build.
While Propeller often works best with AutoFDO (or instrumentation-based
FDO), it's not required. One can use Propeller (or a similar post-link
optimizer, like BOLT) on plain kernel builds.

So I will remove "depends on AUTOFDO_CLANG". I will not use "imply" --
simpler is better here.

-Rong

On Wed, Oct 23, 2024 at 12:29 AM Masahiro Yamada wrote:
>
> On Wed, Oct 23, 2024 at 4:25 PM Arnd Bergmann wrote:
> >
> > On Wed, Oct 23, 2024, at 07:06, Masahiro Yamada wrote:
> > > On Tue, Oct 22, 2024 at 9:00 AM Rong Xu wrote:
> > >
> > >> > > +===
> > >> > > +
> > >> > > +Configure the kernel with::
> > >> > > +
> > >> > > +  CONFIG_AUTOFDO_CLANG=y
> > >> >
> > >> >
> > >> > This is automatically met due to "depends on AUTOFDO_CLANG".
> > >>
> > >> Agreed. But we will remove the dependency from PROPELLER_CLANG to
> > >> AUTOFDO_CLANG.
> > >> So we will keep the part.
> > >
> > >
> > > You can replace "depends on AUTOFDO_CLANG" with
> > > "imply AUTOFDO_CLANG" if it is sensible.
> > >
> > > Up to you.
> >
> > I don't think we should ever encourage the use of 'imply'
> > because it is almost always used incorrectly.
>
> If we are able to delete the 'imply' keyword, Kconfig would be a bit cleaner.
>
> In most cases, it can be replaced with 'default'.
>
>
> --
> Best Regards
> Masahiro Yamada
Re: [PATCH v6 3/6] KVM: arm64: Add support for PSCI v1.2 and v1.3
> On 19 Oct 2024, at 17:15, David Woodhouse wrote: > > From: David Woodhouse > > As with PSCI v1.1 in commit 512865d83fd9 ("KVM: arm64: Bump guest PSCI > version to 1.1"), expose v1.3 to the guest by default. The SYSTEM_OFF2 > call which is exposed by doing so is compatible for userspace because > it's just a new flag in the event that KVM raises, in precisely the same > way that SYSTEM_RESET2 was compatible when v1.1 was enabled by default. > > Signed-off-by: David Woodhouse > --- > arch/arm64/kvm/hypercalls.c | 2 ++ > arch/arm64/kvm/psci.c | 6 +- > include/kvm/arm_psci.h | 4 +++- > 3 files changed, 10 insertions(+), 2 deletions(-) > > diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c > index 5763d979d8ca..9c6267ca2b82 100644 > --- a/arch/arm64/kvm/hypercalls.c > +++ b/arch/arm64/kvm/hypercalls.c > @@ -575,6 +575,8 @@ int kvm_arm_set_fw_reg(struct kvm_vcpu *vcpu, const > struct kvm_one_reg *reg) > case KVM_ARM_PSCI_0_2: > case KVM_ARM_PSCI_1_0: > case KVM_ARM_PSCI_1_1: > + case KVM_ARM_PSCI_1_2: > + case KVM_ARM_PSCI_1_3: > if (!wants_02) > return -EINVAL; > vcpu->kvm->arch.psci_version = val; > diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c > index df834f2e928e..6c24a9252fa3 100644 > --- a/arch/arm64/kvm/psci.c > +++ b/arch/arm64/kvm/psci.c > @@ -328,7 +328,7 @@ static int kvm_psci_1_x_call(struct kvm_vcpu *vcpu, u32 > minor) > > switch(psci_fn) { > case PSCI_0_2_FN_PSCI_VERSION: > - val = minor == 0 ? KVM_ARM_PSCI_1_0 : KVM_ARM_PSCI_1_1; > + val = PSCI_VERSION(1, minor); > break; > case PSCI_1_0_FN_PSCI_FEATURES: > arg = smccc_get_arg1(vcpu); > @@ -493,6 +493,10 @@ int kvm_psci_call(struct kvm_vcpu *vcpu) > } > > switch (version) { > + case KVM_ARM_PSCI_1_3: > + return kvm_psci_1_x_call(vcpu, 3); > + case KVM_ARM_PSCI_1_2: > + return kvm_psci_1_x_call(vcpu, 2); > case KVM_ARM_PSCI_1_1: > return kvm_psci_1_x_call(vcpu, 1); > case KVM_ARM_PSCI_1_0: > diff --git a/include/kvm/arm_psci.h b/include/kvm/arm_psci.h > index e8fb624013d1..cbaec804eb83 100644 > --- a/include/kvm/arm_psci.h > +++ b/include/kvm/arm_psci.h > @@ -14,8 +14,10 @@ > #define KVM_ARM_PSCI_0_2 PSCI_VERSION(0, 2) > #define KVM_ARM_PSCI_1_0 PSCI_VERSION(1, 0) > #define KVM_ARM_PSCI_1_1 PSCI_VERSION(1, 1) > +#define KVM_ARM_PSCI_1_2 PSCI_VERSION(1, 2) > +#define KVM_ARM_PSCI_1_3 PSCI_VERSION(1, 3) > > -#define KVM_ARM_PSCI_LATEST KVM_ARM_PSCI_1_1 > +#define KVM_ARM_PSCI_LATEST KVM_ARM_PSCI_1_3 > Reviewed-by: Miguel Luis > static inline int kvm_psci_version(struct kvm_vcpu *vcpu) > { > -- > 2.44.0 >
Re: [PATCH v6 2/6] KVM: arm64: Add PSCI v1.3 SYSTEM_OFF2 function for hibernation
Hi David, > On 19 Oct 2024, at 17:15, David Woodhouse wrote: > > From: David Woodhouse > > The PSCI v1.3 specification adds support for a SYSTEM_OFF2 function > which is analogous to ACPI S4 state. This will allow hosting > environments to determine that a guest is hibernated rather than just > powered off, and ensure that they preserve the virtual environment > appropriately to allow the guest to resume safely (or bump the > hardware_signature in the FACS to trigger a clean reboot instead). > > This feature is safe to enable unconditionally (in a subsequent commit) > because it is exposed to userspace through the existing > KVM_SYSTEM_EVENT_SHUTDOWN event, just with an additional flag which > userspace can use to know that the instance intended hibernation instead > of a plain power-off. > > As with SYSTEM_RESET2, there is only one type available (in this case > HIBERNATE_OFF), and it is not explicitly reported to userspace through > the event; userspace can get it from the registers if it cares). > > Signed-off-by: David Woodhouse > --- > Documentation/virt/kvm/api.rst| 11 > arch/arm64/include/uapi/asm/kvm.h | 6 + > arch/arm64/kvm/psci.c | 44 +++ > 3 files changed, 61 insertions(+) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index e32471977d0a..1ec076d806e6 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -6855,6 +6855,10 @@ the first `ndata` items (possibly zero) of the data > array are valid. >the guest issued a SYSTEM_RESET2 call according to v1.1 of the PSCI >specification. > > + - for arm64, data[0] is set to KVM_SYSTEM_EVENT_SHUTDOWN_FLAG_PSCI_OFF2 > + if the guest issued a SYSTEM_OFF2 call according to v1.3 of the PSCI > + specification. > + > - for RISC-V, data[0] is set to the value of the second argument of the >``sbi_system_reset`` call. > > @@ -6888,6 +6892,13 @@ either: > - Deny the guest request to suspend the VM. See ARM DEN0022D.b 5.19.2 >"Caller responsibilities" for possible return values. > > +Hibernation using the PSCI SYSTEM_OFF2 call is enabled when PSCI v1.3 > +is enabled. If a guest invokes the PSCI SYSTEM_OFF2 function, KVM will > +exit to userspace with the KVM_SYSTEM_EVENT_SHUTDOWN event type and with > +data[0] set to KVM_SYSTEM_EVENT_SHUTDOWN_FLAG_PSCI_OFF2. The only > +supported hibernate type for the SYSTEM_OFF2 function is HIBERNATE_OFF > +(0x0). I don’t think that ‘0x0’ adds something to what’s already explained before, IMO. > + > :: > > /* KVM_EXIT_IOAPIC_EOI */ > diff --git a/arch/arm64/include/uapi/asm/kvm.h > b/arch/arm64/include/uapi/asm/kvm.h > index 964df31da975..66736ff04011 100644 > --- a/arch/arm64/include/uapi/asm/kvm.h > +++ b/arch/arm64/include/uapi/asm/kvm.h > @@ -484,6 +484,12 @@ enum { > */ > #define KVM_SYSTEM_EVENT_RESET_FLAG_PSCI_RESET2 (1ULL << 0) > > +/* > + * Shutdown caused by a PSCI v1.3 SYSTEM_OFF2 call. > + * Valid only when the system event has a type of KVM_SYSTEM_EVENT_SHUTDOWN. > + */ > +#define KVM_SYSTEM_EVENT_SHUTDOWN_FLAG_PSCI_OFF2 (1ULL << 0) > + > /* run->fail_entry.hardware_entry_failure_reason codes. */ > #define KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED (1ULL << 0) > > diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c > index 1f69b667332b..df834f2e928e 100644 > --- a/arch/arm64/kvm/psci.c > +++ b/arch/arm64/kvm/psci.c > @@ -194,6 +194,12 @@ static void kvm_psci_system_off(struct kvm_vcpu *vcpu) > kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_SHUTDOWN, 0); > } > > +static void kvm_psci_system_off2(struct kvm_vcpu *vcpu) > +{ > + kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_SHUTDOWN, > + KVM_SYSTEM_EVENT_SHUTDOWN_FLAG_PSCI_OFF2); > +} > + > static void kvm_psci_system_reset(struct kvm_vcpu *vcpu) > { > kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_RESET, 0); > @@ -358,6 +364,11 @@ static int kvm_psci_1_x_call(struct kvm_vcpu *vcpu, u32 > minor) > if (minor >= 1) > val = 0; > break; > + case PSCI_1_3_FN_SYSTEM_OFF2: > + case PSCI_1_3_FN64_SYSTEM_OFF2: > + if (minor >= 3) > + val = PSCI_1_3_OFF_TYPE_HIBERNATE_OFF; > + break; > } > break; > case PSCI_1_0_FN_SYSTEM_SUSPEND: > @@ -392,6 +403,39 @@ static int kvm_psci_1_x_call(struct kvm_vcpu *vcpu, u32 > minor) > break; > } > break; > + case PSCI_1_3_FN_SYSTEM_OFF2: > + kvm_psci_narrow_to_32bit(vcpu); > + fallthrough; > + case PSCI_1_3_FN64_SYSTEM_OFF2: > + if (minor < 3) > + break; > + > + arg = smccc_get_arg1(vcpu); > + /* > + * PSCI v1.3 issue F.b requires that zero be accepted to mean > + * HIBERNATE_OFF (in line with pre-publication versions of the > + * spec, and thus some actual implementations in the wild). > + * The second argument must be zero. > + */ > + if ((arg && arg != PSCI_1_3_OFF_TYPE_HIBERNATE_OFF) || > +smccc_get_arg2(vcpu) != 0) { > + val = PSCI_RET_INVALID_PARAMS; > + break; > + } > + kvm_psci_system_off2(vcpu); > + /* > + * We shouldn't be going back to guest VCPU after > + *
Re: [PATCH 0/1] remoteproc documentation changes
anish kumar writes:

> This patch series transitions the documentation
> for remoteproc from the staging directory to the
> mainline kernel. It introduces both kernel and
> user-space APIs, enhancing the overall documentation
> quality.
>
> V4:
> Fixed compilation errors and moved documentation to
> the driver-api directory.
>
> V3:
> Separated out the patches further to make the intention
> clear for each patch.
>
> V2:
> Reported-by: kernel test robot
> Closes: https://lore.kernel.org/oe-kbuild-all/202410161444.jokmsogs-...@intel.com/

So I think you could make better use of kerneldoc comments for a number
of your APIs and structures - a project for the future. I can't judge
the remoteproc aspects of this, but from a documentation mechanics point
of view, this looks about ready to me. In the absence of objections
I'll apply it in the near future.

Thanks,

jon
[PATCH v4 2/6] alloc_tag: introduce shutdown_mem_profiling helper function
Implement a helper function to disable memory allocation profiling and use it when creation of /proc/allocinfo fails. Ensure /proc/allocinfo does not get created when memory allocation profiling is disabled. Signed-off-by: Suren Baghdasaryan --- lib/alloc_tag.c | 33 ++--- 1 file changed, 26 insertions(+), 7 deletions(-) diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c index 81e5f9a70f22..435aa837e550 100644 --- a/lib/alloc_tag.c +++ b/lib/alloc_tag.c @@ -8,6 +8,14 @@ #include #include +#define ALLOCINFO_FILE_NAME"allocinfo" + +#ifdef CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT +static bool mem_profiling_support __meminitdata = true; +#else +static bool mem_profiling_support __meminitdata; +#endif + static struct codetag_type *alloc_tag_cttype; DEFINE_PER_CPU(struct alloc_tag_counters, _shared_alloc_tag); @@ -144,9 +152,26 @@ size_t alloc_tag_top_users(struct codetag_bytes *tags, size_t count, bool can_sl return nr; } +static void __init shutdown_mem_profiling(void) +{ + if (mem_alloc_profiling_enabled()) + static_branch_disable(&mem_alloc_profiling_key); + + if (!mem_profiling_support) + return; + + mem_profiling_support = false; +} + static void __init procfs_init(void) { - proc_create_seq("allocinfo", 0400, NULL, &allocinfo_seq_op); + if (!mem_profiling_support) + return; + + if (!proc_create_seq(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op)) { + pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME); + shutdown_mem_profiling(); + } } static bool alloc_tag_module_unload(struct codetag_type *cttype, @@ -174,12 +199,6 @@ static bool alloc_tag_module_unload(struct codetag_type *cttype, return module_unused; } -#ifdef CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT -static bool mem_profiling_support __meminitdata = true; -#else -static bool mem_profiling_support __meminitdata; -#endif - static int __init setup_early_mem_profiling(char *str) { bool enable; -- 2.47.0.105.g07ac214952-goog
Re: [PATCH v4 1/6] maple_tree: add mas_for_each_rev() helper
On Wed, Oct 23, 2024 at 1:08 PM Suren Baghdasaryan wrote: > > Add mas_for_each_rev() function to iterate maple tree nodes in reverse > order. > > Suggested-by: Liam R. Howlett > Signed-off-by: Suren Baghdasaryan > Reviewed-by: Liam R. Howlett Reviewed-by: Pasha Tatashin > --- > include/linux/maple_tree.h | 14 ++ > 1 file changed, 14 insertions(+) > > diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h > index 61c236850ca8..cbbcd18d4186 100644 > --- a/include/linux/maple_tree.h > +++ b/include/linux/maple_tree.h > @@ -592,6 +592,20 @@ static __always_inline void mas_reset(struct ma_state > *mas) > #define mas_for_each(__mas, __entry, __max) \ > while (((__entry) = mas_find((__mas), (__max))) != NULL) > > +/** > + * mas_for_each_rev() - Iterate over a range of the maple tree in reverse > order. > + * @__mas: Maple Tree operation state (maple_state) > + * @__entry: Entry retrieved from the tree > + * @__min: minimum index to retrieve from the tree > + * > + * When returned, mas->index and mas->last will hold the entire range for the > + * entry. > + * > + * Note: may return the zero entry. > + */ > +#define mas_for_each_rev(__mas, __entry, __min) \ > + while (((__entry) = mas_find_rev((__mas), (__min))) != NULL) > + > #ifdef CONFIG_DEBUG_MAPLE_TREE > enum mt_dump_format { > mt_dump_dec, > -- > 2.47.0.105.g07ac214952-goog >
Re: [PATCH v4 5/6] alloc_tag: introduce pgtag_ref_handle to abstract page tag references
On Wed, Oct 23, 2024 at 1:08 PM Suren Baghdasaryan wrote:
>
> To simplify later changes to page tag references, introduce new
> pgtag_ref_handle type. This allows easy replacement of page_ext
> as a storage of page allocation tags.
>
> Signed-off-by: Suren Baghdasaryan

Reviewed-by: Pasha Tatashin
Re: [PATCH 0/1] remoteproc documentation changes
Jonathan Corbet writes:

> anish kumar writes:
>
>> This patch series transitions the documentation
>> for remoteproc from the staging directory to the
>> mainline kernel. It introduces both kernel and
>> user-space APIs, enhancing the overall documentation
>> quality.
>>
>> V4:
>> Fixed compilation errors and moved documentation to
>> the driver-api directory.
>>
>> V3:
>> Separated out the patches further to make the intention
>> clear for each patch.
>>
>> V2:
>> Reported-by: kernel test robot
>> Closes: https://lore.kernel.org/oe-kbuild-all/202410161444.jokmsogs-...@intel.com/
>
> So I think you could make better use of kerneldoc comments for a number
> of your APIs and structures - a project for the future. I can't judge
> the remoteproc aspects of this, but from a documentation mechanics point
> of view, this looks about ready to me. In the absence of objections
> I'll apply it in the near future.

One other question, actually - what kernel version did you make these
patches against? It looks like something rather old...?

Thanks,

jon
Re: [PATCH v4 2/6] alloc_tag: introduce shutdown_mem_profiling helper function
On Wed, Oct 23, 2024 at 1:08 PM Suren Baghdasaryan wrote:
>
> Implement a helper function to disable memory allocation profiling and
> use it when creation of /proc/allocinfo fails.
> Ensure /proc/allocinfo does not get created when memory allocation
> profiling is disabled.
>
> Signed-off-by: Suren Baghdasaryan

Reviewed-by: Pasha Tatashin
Re: [PATCH v4 3/6] alloc_tag: load module tags into separate contiguous memory
On Wed, Oct 23, 2024 at 1:08 PM Suren Baghdasaryan wrote: > > When a module gets unloaded there is a possibility that some of the > allocations it made are still used and therefore the allocation tags > corresponding to these allocations are still referenced. As such, the > memory for these tags can't be freed. This is currently handled as an > abnormal situation and module's data section is not being unloaded. > To handle this situation without keeping module's data in memory, > allow codetags with longer lifespan than the module to be loaded into > their own separate memory. The in-use memory areas and gaps after > module unloading in this separate memory are tracked using maple trees. > Allocation tags arrange their separate memory so that it is virtually > contiguous and that will allow simple allocation tag indexing later on > in this patchset. The size of this virtually contiguous memory is set > to store up to 10 allocation tags. > > Signed-off-by: Suren Baghdasaryan Reviewed-by: Pasha Tatashin > --- > include/asm-generic/codetag.lds.h | 19 +++ > include/linux/alloc_tag.h | 13 +- > include/linux/codetag.h | 37 - > kernel/module/main.c | 80 ++ > lib/alloc_tag.c | 249 +++--- > lib/codetag.c | 100 +++- > scripts/module.lds.S | 5 +- > 7 files changed, 441 insertions(+), 62 deletions(-) > > diff --git a/include/asm-generic/codetag.lds.h > b/include/asm-generic/codetag.lds.h > index 64f536b80380..372c320c5043 100644 > --- a/include/asm-generic/codetag.lds.h > +++ b/include/asm-generic/codetag.lds.h > @@ -11,4 +11,23 @@ > #define CODETAG_SECTIONS() \ > SECTION_WITH_BOUNDARIES(alloc_tags) > > +/* > + * Module codetags which aren't used after module unload, therefore have the > + * same lifespan as the module and can be safely unloaded with the module. > + */ > +#define MOD_CODETAG_SECTIONS() > + > +#define MOD_SEPARATE_CODETAG_SECTION(_name)\ > + .codetag.##_name : {\ > + SECTION_WITH_BOUNDARIES(_name) \ > + } > + > +/* > + * For codetags which might be used after module unload, therefore might stay > + * longer in memory. Each such codetag type has its own section so that we > can > + * unload them individually once unused. > + */ > +#define MOD_SEPARATE_CODETAG_SECTIONS()\ > + MOD_SEPARATE_CODETAG_SECTION(alloc_tags) > + > #endif /* __ASM_GENERIC_CODETAG_LDS_H */ > diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h > index 1f0a9ff23a2c..7431757999c5 100644 > --- a/include/linux/alloc_tag.h > +++ b/include/linux/alloc_tag.h > @@ -30,6 +30,13 @@ struct alloc_tag { > struct alloc_tag_counters __percpu *counters; > } __aligned(8); > > +struct alloc_tag_module_section { > + unsigned long start_addr; > + unsigned long end_addr; > + /* used size */ > + unsigned long size; > +}; > + > #ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG > > #define CODETAG_EMPTY ((void *)1) > @@ -54,6 +61,8 @@ static inline void set_codetag_empty(union codetag_ref > *ref) {} > > #ifdef CONFIG_MEM_ALLOC_PROFILING > > +#define ALLOC_TAG_SECTION_NAME "alloc_tags" > + > struct codetag_bytes { > struct codetag *ct; > s64 bytes; > @@ -76,7 +85,7 @@ DECLARE_PER_CPU(struct alloc_tag_counters, > _shared_alloc_tag); > > #define DEFINE_ALLOC_TAG(_alloc_tag) > \ > static struct alloc_tag _alloc_tag __used __aligned(8) > \ > - __section("alloc_tags") = { > \ > + __section(ALLOC_TAG_SECTION_NAME) = { > \ > .ct = CODE_TAG_INIT, > \ > .counters = &_shared_alloc_tag }; > > @@ -85,7 +94,7 @@ DECLARE_PER_CPU(struct alloc_tag_counters, > _shared_alloc_tag); > > #define DEFINE_ALLOC_TAG(_alloc_tag) > \ > static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr); > \ > static struct alloc_tag _alloc_tag __used __aligned(8) > \ > - __section("alloc_tags") = { > \ > + __section(ALLOC_TAG_SECTION_NAME) = { > \ > .ct = CODE_TAG_INIT, > \ > .counters = &_alloc_tag_cntr }; > > diff --git a/include/linux/codetag.h b/include/linux/codetag.h > index c2a579ccd455..d10bd9810d32 100644 > --- a/include/linux/codetag.h > +++ b/include/linux/codetag.h > @@ -35,8 +35,15 @@ struct codetag_type_desc { > size_t tag_size; > void (*module_load)(struct codetag_type *cttype, > struct codetag_module *cmod); > - bool (*module_unload)(s
Re: [PATCH v4 6/6] alloc_tag: support for page allocation tag compression
On Wed, Oct 23, 2024 at 1:08 PM Suren Baghdasaryan wrote:
>
> Implement support for storing page allocation tag references directly
> in the page flags instead of page extensions. The sysctl.vm.mem_profiling
> boot parameter is extended to provide a way for a user to request this
> mode. Enabling compression eliminates memory overhead caused by page_ext
> and results in better performance for page allocations. However this
> mode will not work if the number of available page flag bits is
> insufficient to address all kernel allocations. Such a condition can
> happen during boot or when loading a module. If this condition is
> detected, memory allocation profiling gets disabled with an appropriate
> warning. By default compression mode is disabled.
>
> Signed-off-by: Suren Baghdasaryan

Thank you very much Suren for doing this work. This is a very
significant improvement for the fleet users.

Reviewed-by: Pasha Tatashin
Re: [PATCH v4 4/6] alloc_tag: populate memory for module tags as needed
On Wed, Oct 23, 2024 at 1:08 PM Suren Baghdasaryan wrote:
>
> The memory reserved for module tags does not need to be backed by
> physical pages until there are tags to store there. Change the way
> we reserve this memory to allocate only virtual area for the tags
> and populate it with physical pages as needed when we load a module.
>
> Signed-off-by: Suren Baghdasaryan

Reviewed-by: Pasha Tatashin
[PATCH v4 0/6] page allocation tag compression
This patchset implements several improvements:

1. Gracefully handles module unloading while there are used allocations
   allocated from that module;

2. Provides an option to store page allocation tag references in the
   page flags, removing dependency on page extensions and eliminating
   the memory overhead from storing page allocation references (~0.2%
   of total system memory). This also improves page allocation
   performance when CONFIG_MEM_ALLOC_PROFILING is enabled by
   eliminating page extension lookup. Page allocation performance
   overhead is reduced from 41% to 5.5%.

Patch #1 introduces mas_for_each_rev() helper function.

Patch #2 introduces shutdown_mem_profiling() helper function to be used
when disabling memory allocation profiling.

Patch #3 copies module tags into virtually contiguous memory which
serves two purposes:

- Lets us deal with the situation when module is unloaded while there
  are still live allocations from that module. Since we are using a
  copy version of the tags we can safely unload the module. Space and
  gaps in this contiguous memory are managed using a maple tree.

- Enables simple indexing of the tags in the later patches.

Patch #4 changes the way we allocate virtually contiguous memory for
module tags to reserve only virtual area and populate physical pages
only as needed at module load time.

Patch #5 abstracts page allocation tag reference to simplify later
changes.

Patch #6 adds compression option to the sysctl.vm.mem_profiling boot
parameter for storing page allocation tag references inside page flags
if they fit. If the number of available page flag bits is insufficient
to address all kernel allocations, memory allocation profiling gets
disabled with an appropriate warning.

Patchset applies to mm-unstable.

Changes since v3 [1]:
- rebased over Mike's patchset in mm-unstable
- added Reviewed-by, per Liam Howlett
- limited execmem_vmap to work with EXECMEM_MODULE_DATA only, per Mike
  Rapoport
- moved __get_vm_area_node() declaration into mm/internal.h, per Mike
  Rapoport
- split parts of reserve_module_tags() into helper functions to make it
  more readable, per Mike Rapoport
- introduced shutdown_mem_profiling() to be used when disabling memory
  allocation profiling
- replaced CONFIG_PGALLOC_TAG_USE_PAGEFLAGS with a new boot parameter
  option, per Michal Hocko
- minor code cleanups and refactoring to make the code more readable
- added VMALLOC and MODULE SUPPORT reviewers I missed before

[1] https://lore.kernel.org/all/20241014203646.1952505-1-sur...@google.com/

Suren Baghdasaryan (6):
  maple_tree: add mas_for_each_rev() helper
  alloc_tag: introduce shutdown_mem_profiling helper function
  alloc_tag: load module tags into separate contiguous memory
  alloc_tag: populate memory for module tags as needed
  alloc_tag: introduce pgtag_ref_handle to abstract page tag references
  alloc_tag: support for page allocation tag compression

 Documentation/mm/allocation-profiling.rst |   7 +-
 include/asm-generic/codetag.lds.h         |  19 +
 include/linux/alloc_tag.h                 |  21 +-
 include/linux/codetag.h                   |  40 +-
 include/linux/execmem.h                   |  10 +
 include/linux/maple_tree.h                |  14 +
 include/linux/mm.h                        |  25 +-
 include/linux/page-flags-layout.h         |   7 +
 include/linux/pgalloc_tag.h               | 197 +++--
 include/linux/vmalloc.h                   |   3 +
 kernel/module/main.c                      |  80 ++--
 lib/alloc_tag.c                           | 467 --
 lib/codetag.c                             | 104 -
 mm/execmem.c                              |  16 +
 mm/internal.h                             |   6 +
 mm/mm_init.c                              |   5 +-
 mm/vmalloc.c                              |   4 +-
 scripts/module.lds.S                      |   5 +-
 18 files changed, 903 insertions(+), 127 deletions(-)

base-commit: b5d43fad926a3f542cd06f3c9d286f6f489f7129
--
2.47.0.105.g07ac214952-goog
[PATCH v4 1/6] maple_tree: add mas_for_each_rev() helper
Add mas_for_each_rev() function to iterate maple tree nodes in reverse order. Suggested-by: Liam R. Howlett Signed-off-by: Suren Baghdasaryan Reviewed-by: Liam R. Howlett --- include/linux/maple_tree.h | 14 ++ 1 file changed, 14 insertions(+) diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h index 61c236850ca8..cbbcd18d4186 100644 --- a/include/linux/maple_tree.h +++ b/include/linux/maple_tree.h @@ -592,6 +592,20 @@ static __always_inline void mas_reset(struct ma_state *mas) #define mas_for_each(__mas, __entry, __max) \ while (((__entry) = mas_find((__mas), (__max))) != NULL) +/** + * mas_for_each_rev() - Iterate over a range of the maple tree in reverse order. + * @__mas: Maple Tree operation state (maple_state) + * @__entry: Entry retrieved from the tree + * @__min: minimum index to retrieve from the tree + * + * When returned, mas->index and mas->last will hold the entire range for the + * entry. + * + * Note: may return the zero entry. + */ +#define mas_for_each_rev(__mas, __entry, __min) \ + while (((__entry) = mas_find_rev((__mas), (__min))) != NULL) + #ifdef CONFIG_DEBUG_MAPLE_TREE enum mt_dump_format { mt_dump_dec, -- 2.47.0.105.g07ac214952-goog
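A hypothetical usage sketch of the new helper; the tree "mt" and the printout are invented for illustration and are not part of the patch:

	/* Walk all entries from the highest index down to index 0;
	 * mt is assumed to be an existing, populated struct maple_tree. */
	MA_STATE(mas, &mt, ULONG_MAX, ULONG_MAX);
	void *entry;

	rcu_read_lock();
	mas_for_each_rev(&mas, entry, 0)
		pr_info("entry %p spans [%lu, %lu]\n", entry, mas.index, mas.last);
	rcu_read_unlock();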
[PATCH v4 3/6] alloc_tag: load module tags into separate contiguous memory
When a module gets unloaded there is a possibility that some of the allocations it made are still used and therefore the allocation tags corresponding to these allocations are still referenced. As such, the memory for these tags can't be freed. This is currently handled as an abnormal situation and module's data section is not being unloaded. To handle this situation without keeping module's data in memory, allow codetags with longer lifespan than the module to be loaded into their own separate memory. The in-use memory areas and gaps after module unloading in this separate memory are tracked using maple trees. Allocation tags arrange their separate memory so that it is virtually contiguous and that will allow simple allocation tag indexing later on in this patchset. The size of this virtually contiguous memory is set to store up to 10 allocation tags. Signed-off-by: Suren Baghdasaryan --- include/asm-generic/codetag.lds.h | 19 +++ include/linux/alloc_tag.h | 13 +- include/linux/codetag.h | 37 - kernel/module/main.c | 80 ++ lib/alloc_tag.c | 249 +++--- lib/codetag.c | 100 +++- scripts/module.lds.S | 5 +- 7 files changed, 441 insertions(+), 62 deletions(-) diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h index 64f536b80380..372c320c5043 100644 --- a/include/asm-generic/codetag.lds.h +++ b/include/asm-generic/codetag.lds.h @@ -11,4 +11,23 @@ #define CODETAG_SECTIONS() \ SECTION_WITH_BOUNDARIES(alloc_tags) +/* + * Module codetags which aren't used after module unload, therefore have the + * same lifespan as the module and can be safely unloaded with the module. + */ +#define MOD_CODETAG_SECTIONS() + +#define MOD_SEPARATE_CODETAG_SECTION(_name)\ + .codetag.##_name : {\ + SECTION_WITH_BOUNDARIES(_name) \ + } + +/* + * For codetags which might be used after module unload, therefore might stay + * longer in memory. Each such codetag type has its own section so that we can + * unload them individually once unused. + */ +#define MOD_SEPARATE_CODETAG_SECTIONS()\ + MOD_SEPARATE_CODETAG_SECTION(alloc_tags) + #endif /* __ASM_GENERIC_CODETAG_LDS_H */ diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h index 1f0a9ff23a2c..7431757999c5 100644 --- a/include/linux/alloc_tag.h +++ b/include/linux/alloc_tag.h @@ -30,6 +30,13 @@ struct alloc_tag { struct alloc_tag_counters __percpu *counters; } __aligned(8); +struct alloc_tag_module_section { + unsigned long start_addr; + unsigned long end_addr; + /* used size */ + unsigned long size; +}; + #ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG #define CODETAG_EMPTY ((void *)1) @@ -54,6 +61,8 @@ static inline void set_codetag_empty(union codetag_ref *ref) {} #ifdef CONFIG_MEM_ALLOC_PROFILING +#define ALLOC_TAG_SECTION_NAME "alloc_tags" + struct codetag_bytes { struct codetag *ct; s64 bytes; @@ -76,7 +85,7 @@ DECLARE_PER_CPU(struct alloc_tag_counters, _shared_alloc_tag); #define DEFINE_ALLOC_TAG(_alloc_tag) \ static struct alloc_tag _alloc_tag __used __aligned(8) \ - __section("alloc_tags") = { \ + __section(ALLOC_TAG_SECTION_NAME) = { \ .ct = CODE_TAG_INIT, \ .counters = &_shared_alloc_tag }; @@ -85,7 +94,7 @@ DECLARE_PER_CPU(struct alloc_tag_counters, _shared_alloc_tag); #define DEFINE_ALLOC_TAG(_alloc_tag) \ static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr); \ static struct alloc_tag _alloc_tag __used __aligned(8) \ - __section("alloc_tags") = { \ + __section(ALLOC_TAG_SECTION_NAME) = { \ .ct = CODE_TAG_INIT, \ .counters = &_alloc_tag_cntr }; diff --git a/include/linux/codetag.h b/include/linux/codetag.h index c2a579ccd455..d10bd9810d32 100644 --- a/include/linux/codetag.h +++ b/include/linux/codetag.h @@ -35,8 +35,15 @@ struct codetag_type_desc { size_t tag_size; void (*module_load)(struct codetag_type *cttype, struct codetag_module *cmod); - bool (*module_unload)(struct codetag_type *cttype, + void (*module_unload)(struct codetag_type *cttype, struct codetag_module *cmod); +#ifdef CONFIG_MODULES + void (*module_replaced)(struct module *mod, struct module *new_mod); + bool (*needs_section_mem)(struct module *mod, unsigned long size); +
[PATCH v4 5/6] alloc_tag: introduce pgtag_ref_handle to abstract page tag references
To simplify later changes to page tag references, introduce new pgtag_ref_handle type. This allows easy replacement of page_ext as a storage of page allocation tags. Signed-off-by: Suren Baghdasaryan --- include/linux/mm.h | 25 +- include/linux/pgalloc_tag.h | 92 ++--- 2 files changed, 67 insertions(+), 50 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 5cd22303fbc0..8efb4a6a1a70 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4180,37 +4180,38 @@ static inline void pgalloc_tag_split(struct folio *folio, int old_order, int new return; for (i = nr_pages; i < (1 << old_order); i += nr_pages) { - union codetag_ref *ref = get_page_tag_ref(folio_page(folio, i)); + union pgtag_ref_handle handle; + union codetag_ref ref; - if (ref) { + if (get_page_tag_ref(folio_page(folio, i), &ref, &handle)) { /* Set new reference to point to the original tag */ - alloc_tag_ref_set(ref, tag); - put_page_tag_ref(ref); + alloc_tag_ref_set(&ref, tag); + update_page_tag_ref(handle, &ref); + put_page_tag_ref(handle); } } } static inline void pgalloc_tag_copy(struct folio *new, struct folio *old) { + union pgtag_ref_handle handle; + union codetag_ref ref; struct alloc_tag *tag; - union codetag_ref *ref; tag = pgalloc_tag_get(&old->page); if (!tag) return; - ref = get_page_tag_ref(&new->page); - if (!ref) + if (!get_page_tag_ref(&new->page, &ref, &handle)) return; /* Clear the old ref to the original allocation tag. */ clear_page_tag_ref(&old->page); /* Decrement the counters of the tag on get_new_folio. */ - alloc_tag_sub(ref, folio_nr_pages(new)); - - __alloc_tag_ref_set(ref, tag); - - put_page_tag_ref(ref); + alloc_tag_sub(&ref, folio_nr_pages(new)); + __alloc_tag_ref_set(&ref, tag); + update_page_tag_ref(handle, &ref); + put_page_tag_ref(handle); } #else /* !CONFIG_MEM_ALLOC_PROFILING */ static inline void pgalloc_tag_split(struct folio *folio, int old_order, int new_order) diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h index 59a3deb792a8..b13cd3313a88 100644 --- a/include/linux/pgalloc_tag.h +++ b/include/linux/pgalloc_tag.h @@ -11,46 +11,59 @@ #include +union pgtag_ref_handle { + union codetag_ref *ref; /* reference in page extension */ +}; + extern struct page_ext_operations page_alloc_tagging_ops; -static inline union codetag_ref *codetag_ref_from_page_ext(struct page_ext *page_ext) +/* Should be called only if mem_alloc_profiling_enabled() */ +static inline bool get_page_tag_ref(struct page *page, union codetag_ref *ref, + union pgtag_ref_handle *handle) { - return (union codetag_ref *)page_ext_data(page_ext, &page_alloc_tagging_ops); -} + struct page_ext *page_ext; + union codetag_ref *tmp; -static inline struct page_ext *page_ext_from_codetag_ref(union codetag_ref *ref) -{ - return (void *)ref - page_alloc_tagging_ops.offset; + if (!page) + return false; + + page_ext = page_ext_get(page); + if (!page_ext) + return false; + + tmp = (union codetag_ref *)page_ext_data(page_ext, &page_alloc_tagging_ops); + ref->ct = tmp->ct; + handle->ref = tmp; + return true; } -/* Should be called only if mem_alloc_profiling_enabled() */ -static inline union codetag_ref *get_page_tag_ref(struct page *page) +static inline void put_page_tag_ref(union pgtag_ref_handle handle) { - if (page) { - struct page_ext *page_ext = page_ext_get(page); + if (WARN_ON(!handle.ref)) + return; - if (page_ext) - return codetag_ref_from_page_ext(page_ext); - } - return NULL; + page_ext_put((void *)handle.ref - page_alloc_tagging_ops.offset); } -static inline void put_page_tag_ref(union codetag_ref *ref) +static inline void update_page_tag_ref(union pgtag_ref_handle handle, + union codetag_ref *ref) { - if (WARN_ON(!ref)) + if (WARN_ON(!handle.ref || !ref)) return; - page_ext_put(page_ext_from_codetag_ref(ref)); + handle.ref->ct = ref->ct; } static inline void clear_page_tag_ref(struct page *page) { if (mem_alloc_profiling_enabled()) { - union codetag_ref *ref = get_page_tag_ref(page); + union pgtag_ref_handle handle; + union codetag_ref ref; - if (ref) { - set_codetag_empty(ref); - put_pag
[PATCH v4 4/6] alloc_tag: populate memory for module tags as needed
The memory reserved for module tags does not need to be backed by physical pages until there are tags to store there. Change the way we reserve this memory to allocate only virtual area for the tags and populate it with physical pages as needed when we load a module. Signed-off-by: Suren Baghdasaryan --- include/linux/execmem.h | 10 ++ include/linux/vmalloc.h | 3 ++ lib/alloc_tag.c | 73 - mm/execmem.c| 16 + mm/internal.h | 6 mm/vmalloc.c| 4 +-- 6 files changed, 101 insertions(+), 11 deletions(-) diff --git a/include/linux/execmem.h b/include/linux/execmem.h index 1517fa196bf7..5a5e2917f870 100644 --- a/include/linux/execmem.h +++ b/include/linux/execmem.h @@ -139,6 +139,16 @@ void *execmem_alloc(enum execmem_type type, size_t size); */ void execmem_free(void *ptr); +/** + * execmem_vmap - create virtual mapping for EXECMEM_MODULE_DATA memory + * @size: size of the virtual mapping in bytes + * + * Maps virtually contiguous area in the range suitable for EXECMEM_MODULE_DATA. + * + * Return: the area descriptor on success or %NULL on failure. + */ +struct vm_struct *execmem_vmap(size_t size); + /** * execmem_update_copy - copy an update to executable memory * @dst: destination address to update diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 27408f21e501..31e9ffd936e3 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -202,6 +202,9 @@ extern int remap_vmalloc_range_partial(struct vm_area_struct *vma, extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr, unsigned long pgoff); +int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot, +struct page **pages, unsigned int page_shift); + /* * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings() diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c index d9f51169ffeb..061e43196247 100644 --- a/lib/alloc_tag.c +++ b/lib/alloc_tag.c @@ -8,14 +8,15 @@ #include #include #include +#include #define ALLOCINFO_FILE_NAME"allocinfo" #define MODULE_ALLOC_TAG_VMAP_SIZE (10UL * sizeof(struct alloc_tag)) #ifdef CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT -static bool mem_profiling_support __meminitdata = true; +static bool mem_profiling_support = true; #else -static bool mem_profiling_support __meminitdata; +static bool mem_profiling_support; #endif static struct codetag_type *alloc_tag_cttype; @@ -154,7 +155,7 @@ size_t alloc_tag_top_users(struct codetag_bytes *tags, size_t count, bool can_sl return nr; } -static void __init shutdown_mem_profiling(void) +static void shutdown_mem_profiling(void) { if (mem_alloc_profiling_enabled()) static_branch_disable(&mem_alloc_profiling_key); @@ -179,6 +180,7 @@ static void __init procfs_init(void) #ifdef CONFIG_MODULES static struct maple_tree mod_area_mt = MTREE_INIT(mod_area_mt, MT_FLAGS_ALLOC_RANGE); +static struct vm_struct *vm_module_tags; /* A dummy object used to indicate an unloaded module */ static struct module unloaded_mod; /* A dummy object used to indicate a module prepended area */ @@ -252,6 +254,33 @@ static bool find_aligned_area(struct ma_state *mas, unsigned long section_size, return false; } +static int vm_module_tags_populate(void) +{ + unsigned long phys_size = vm_module_tags->nr_pages << PAGE_SHIFT; + + if (phys_size < module_tags.size) { + struct page **next_page = vm_module_tags->pages + vm_module_tags->nr_pages; + unsigned long addr = module_tags.start_addr + phys_size; + unsigned long more_pages; + unsigned long nr; + + more_pages = ALIGN(module_tags.size - phys_size, PAGE_SIZE) >> PAGE_SHIFT; + nr = alloc_pages_bulk_array_node(GFP_KERNEL | __GFP_NOWARN, +NUMA_NO_NODE, more_pages, next_page); + if (nr < more_pages || + vmap_pages_range(addr, addr + (nr << PAGE_SHIFT), PAGE_KERNEL, +next_page, PAGE_SHIFT) < 0) { + /* Clean up and error out */ + for (int i = 0; i < nr; i++) + __free_page(next_page[i]); + return -ENOMEM; + } + vm_module_tags->nr_pages += nr; + } + + return 0; +} + static void *reserve_module_tags(struct module *mod, unsigned long size, unsigned int prepend, unsigned long align) { @@ -310,8 +339,18 @@ static void *reserve_module_tags(struct module *mod, unsigned long size, if (IS_ERR(ret)) return ret; - if (module_tags
Re: [PATCH v6 1/6] firmware/psci: Add definitions for PSCI v1.3 specification
> On 19 Oct 2024, at 17:15, David Woodhouse wrote: > > From: David Woodhouse > > The v1.3 PSCI spec (https://developer.arm.com/documentation/den0022) adds > the SYSTEM_OFF2 function. Add definitions for it and its hibernation type > parameter. > > Signed-off-by: David Woodhouse > --- > include/uapi/linux/psci.h | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/include/uapi/linux/psci.h b/include/uapi/linux/psci.h > index 42a40ad3fb62..81759ff385e6 100644 > --- a/include/uapi/linux/psci.h > +++ b/include/uapi/linux/psci.h > @@ -59,6 +59,7 @@ > #define PSCI_1_1_FN_SYSTEM_RESET2 PSCI_0_2_FN(18) > #define PSCI_1_1_FN_MEM_PROTECT PSCI_0_2_FN(19) > #define PSCI_1_1_FN_MEM_PROTECT_CHECK_RANGE PSCI_0_2_FN(20) > +#define PSCI_1_3_FN_SYSTEM_OFF2 PSCI_0_2_FN(21) > > #define PSCI_1_0_FN64_CPU_DEFAULT_SUSPEND PSCI_0_2_FN64(12) > #define PSCI_1_0_FN64_NODE_HW_STATE PSCI_0_2_FN64(13) > @@ -68,6 +69,7 @@ > > #define PSCI_1_1_FN64_SYSTEM_RESET2 PSCI_0_2_FN64(18) > #define PSCI_1_1_FN64_MEM_PROTECT_CHECK_RANGE PSCI_0_2_FN64(20) > +#define PSCI_1_3_FN64_SYSTEM_OFF2 PSCI_0_2_FN64(21) > > /* PSCI v0.2 power state encoding for CPU_SUSPEND function */ > #define PSCI_0_2_POWER_STATE_ID_MASK 0x > @@ -100,6 +102,9 @@ > #define PSCI_1_1_RESET_TYPE_SYSTEM_WARM_RESET 0 > #define PSCI_1_1_RESET_TYPE_VENDOR_START 0x8000U > > +/* PSCI v1.3 hibernate type for SYSTEM_OFF2 */ > +#define PSCI_1_3_OFF_TYPE_HIBERNATE_OFF BIT(0) > + Reviewed-by: Miguel Luis > /* PSCI version decoding (independent of PSCI version) */ > #define PSCI_VERSION_MAJOR_SHIFT 16 > #define PSCI_VERSION_MINOR_MASK \ > -- > 2.44.0 >
Re: [PATCH 07/12] huge_memory: Allow mappings of PMD sized pages
Alistair Popple writes:

> Alistair Popple wrote:
>> Dan Williams writes:

[...]

>>> +
>>> +	return VM_FAULT_NOPAGE;
>>> +}
>>> +EXPORT_SYMBOL_GPL(dax_insert_pfn_pmd);
>>
>> Like I mentioned before, lets make the exported function
>> vmf_insert_folio() and move the pte, pmd, pud internal private / static
>> details of the implementation. The "dax_" specific aspect of this was
>> removed at the conversion of a dax_pfn to a folio.
>
> Ok, let me try that. Note that vmf_insert_pfn{_pmd|_pud} will have to
> stick around though.

Creating a single vmf_insert_folio() seems somewhat difficult because it
needs to be called from multiple fault paths (either PTE, PMD or PUD
fault) and do something different for each.

Specifically the issue I ran into is that DAX does not downgrade PMD
entries to PTE entries if they are backed by storage. So the PTE fault
handler will get a PMD-sized DAX entry and therefore a PMD-sized folio.

The way I tried implementing vmf_insert_folio() was to look at
folio_order() to determine which internal implementation to call. But
that doesn't work for a PTE fault, because there's no way to determine
if we should PTE map a subpage or PMD map the entire folio.

We could pass down some context as to what type of fault we're handling,
or add it to the vmf struct, but that seems excessive given callers
already know this and could just call a specific
vmf_insert_page_{pte|pmd|pud}.
[PATCH v5 1/7] Add AutoFDO support for Clang build
Add the build support for using Clang's AutoFDO. Building the kernel with AutoFDO does not reduce the optimization level from the compiler. AutoFDO uses hardware sampling to gather information about the frequency of execution of different code paths within a binary. This information is then used to guide the compiler's optimization decisions, resulting in a more efficient binary. Experiments showed that the kernel can improve by up to 10% in latency. The support requires a Clang compiler after LLVM 17. This submission is limited to x86 platforms that support PMU features like LBR on Intel machines and AMD Zen3 BRS. Support for SPE on ARM, and BRBE on ARM, is part of planned future work. Here is an example workflow for an AutoFDO kernel: 1) Build the kernel on the host machine with LLVM enabled, for example, $ make menuconfig LLVM=1 Turn on AutoFDO build config: CONFIG_AUTOFDO_CLANG=y With a configuration that has LLVM enabled, use the following command: scripts/config -e AUTOFDO_CLANG After getting the config, build with $ make LLVM=1 2) Install the kernel on the test machine. 3) Run the load tests. The '-c' option in perf specifies the sample event period. We suggest using a suitable prime number, like 59, for this purpose. For Intel platforms: $ perf record -e BR_INST_RETIRED.NEAR_TAKEN:k -a -N -b -c <count> \ -o <perf_file> -- <loadtest> For AMD platforms: The supported systems are: Zen3 with BRS, or Zen4 with amd_lbr_v2 For Zen3: $ cat /proc/cpuinfo | grep " brs" For Zen4: $ cat /proc/cpuinfo | grep amd_lbr_v2 $ perf record --pfm-events RETIRED_TAKEN_BRANCH_INSTRUCTIONS:k -a \ -N -b -c <count> -o <perf_file> -- <loadtest> 4) (Optional) Download the raw perf file to the host machine. 5) To generate an AutoFDO profile, two offline tools are available: create_llvm_prof and llvm_profgen. The create_llvm_prof tool is part of the AutoFDO project and can be found on GitHub (https://github.com/google/autofdo), version v0.30.1 or later. The llvm_profgen tool is included in the LLVM compiler itself. It's important to note that the version of llvm_profgen doesn't need to match the version of Clang. It needs to be the LLVM 19 release or later, or from the LLVM trunk. $ llvm-profgen --kernel --binary=<vmlinux> --perfdata=<perf_file> \ -o <profile_file> or $ create_llvm_prof --binary=<vmlinux> --profile=<perf_file> \ --format=extbinary --out=<profile_file> Note that multiple AutoFDO profile files can be merged into one via: $ llvm-profdata merge -o... 6) Rebuild the kernel using the AutoFDO profile file with the same config as step 1, (Note CONFIG_AUTOFDO_CLANG needs to be enabled): $ make LLVM=1 CLANG_AUTOFDO_PROFILE=<profile_file> Co-developed-by: Han Shen Signed-off-by: Han Shen Signed-off-by: Rong Xu Suggested-by: Sriraman Tallam Suggested-by: Krzysztof Pszeniczny Suggested-by: Nick Desaulniers Suggested-by: Stephane Eranian Tested-by: Yonghong Song --- Documentation/dev-tools/autofdo.rst | 167 Documentation/dev-tools/index.rst | 1 + MAINTAINERS | 7 ++ Makefile| 1 + arch/Kconfig| 20 arch/x86/Kconfig| 1 + scripts/Makefile.autofdo| 22 scripts/Makefile.lib| 10 ++ tools/objtool/check.c | 1 + 9 files changed, 230 insertions(+) create mode 100644 Documentation/dev-tools/autofdo.rst create mode 100644 scripts/Makefile.autofdo diff --git a/Documentation/dev-tools/autofdo.rst b/Documentation/dev-tools/autofdo.rst new file mode 100644 index ..9d90e6d79781 --- /dev/null +++ b/Documentation/dev-tools/autofdo.rst @@ -0,0 +1,167 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=== +Using AutoFDO with the Linux kernel +=== + +This enables AutoFDO build support for the kernel when using +the Clang compiler. AutoFDO (Auto-Feedback-Directed Optimization) +is a type of profile-guided optimization (PGO) used to enhance the +performance of binary executables. It gathers information about the +frequency of execution of various code paths within a binary using +hardware sampling. This data is then used to guide the compiler's +optimization decisions, resulting in a more efficient binary. AutoFDO +is a powerful optimization technique, and data indicates that it can +significantly improve kernel performance. It's especially beneficial +for workloads affected by front-end stalls. + +For AutoFDO builds, unlike non-FDO builds, the user must supply a +profile. Acquiring an AutoFDO profile can be done in several ways. +AutoFDO profiles are created by converting hardware sampling using +the "perf" tool. It is crucial that the workload used to create these +perf files is representative; they must exhibit runtime +characteristics similar to the workloads that are intended to be +optimized. Failure to do so
Re: [PATCH 07/12] huge_memory: Allow mappings of PMD sized pages
Alistair Popple wrote:
>
> Alistair Popple writes:
>
> > Alistair Popple wrote:
> >> Dan Williams writes:
>
> [...]
>
> >>> +
> >>> +	return VM_FAULT_NOPAGE;
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(dax_insert_pfn_pmd);
> >>
> >> Like I mentioned before, lets make the exported function
> >> vmf_insert_folio() and move the pte, pmd, pud internal private / static
> >> details of the implementation. The "dax_" specific aspect of this was
> >> removed at the conversion of a dax_pfn to a folio.
> >
> > Ok, let me try that. Note that vmf_insert_pfn{_pmd|_pud} will have to
> > stick around though.
>
> Creating a single vmf_insert_folio() seems somewhat difficult because it
> needs to be called from multiple fault paths (either PTE, PMD or PUD
> fault) and do something different for each.
>
> Specifically the issue I ran into is that DAX does not downgrade PMD
> entries to PTE entries if they are backed by storage. So the PTE fault
> handler will get a PMD-sized DAX entry and therefore a PMD-sized folio.
>
> The way I tried implementing vmf_insert_folio() was to look at
> folio_order() to determine which internal implementation to call. But
> that doesn't work for a PTE fault, because there's no way to determine
> if we should PTE map a subpage or PMD map the entire folio.

Ah, that conflict makes sense.

> We could pass down some context as to what type of fault we're handling,
> or add it to the vmf struct, but that seems excessive given callers
> already know this and could just call a specific
> vmf_insert_page_{pte|pmd|pud}.

Ok, I think it is good to capture that "because dax does not downgrade
entries it may satisfy PTE faults with PMD inserts", or something like
that in a comment or changelog.
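To illustrate the shape of the interface being discussed, a hypothetical sketch of such per-level helpers. The signatures are invented for illustration (only the vmf_insert_page_{pte|pmd|pud} naming comes from the thread), not a merged API:

	/*
	 * Each fault handler already knows which page-table level it is
	 * servicing, so it picks the mapping granularity explicitly instead
	 * of having a single vmf_insert_folio() guess from folio_order(),
	 * which is ambiguous when a PTE fault sees a PMD-sized DAX folio.
	 */
	vm_fault_t vmf_insert_page_pte(struct vm_fault *vmf, struct page *page,
				       bool write);
	vm_fault_t vmf_insert_page_pmd(struct vm_fault *vmf, struct folio *folio,
				       bool write);
	vm_fault_t vmf_insert_page_pud(struct vm_fault *vmf, struct folio *folio,
				       bool write);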
[PATCH v5 0/7] Add AutoFDO and Propeller support for Clang build
Hi,

This patch series is to integrate AutoFDO and Propeller support into the
Linux kernel.

AutoFDO is a profile-guided optimization technique that leverages
hardware sampling to enhance binary performance. Unlike
Instrumentation-based FDO (iFDO), AutoFDO offers a user-friendly and
straightforward application process. While iFDO generally yields
superior profile quality and performance, our findings reveal that
AutoFDO achieves remarkable effectiveness, bringing performance close
to iFDO for benchmark applications.

Propeller is a profile-guided, post-link optimizer that improves the
performance of large-scale applications compiled with LLVM. It operates
by relinking the binary based on an additional round of runtime
profiles, enabling precise optimizations that are not possible at
compile time. Similar to AutoFDO, Propeller too utilizes hardware
sampling to collect profiles and apply post-link optimizations to
improve the benchmark’s performance over and above AutoFDO.

Our empirical data demonstrates significant performance improvements
with AutoFDO and Propeller, up to 10% on microbenchmarks and up to 5%
on large warehouse-scale benchmarks. This makes a strong case for their
inclusion as supported features in the upstream kernel.

Background

A significant fraction of fleet processing cycles (excluding idle time)
from data center workloads are attributable to the kernel.
Warehouse-scale workloads maximize performance by optimizing the
production kernel using iFDO (a.k.a instrumented PGO, Profile Guided
Optimization).

iFDO can significantly enhance application performance but its use
within the kernel has raised concerns. AutoFDO is a variant of FDO that
uses the hardware’s Performance Monitoring Unit (PMU) to collect
profiling data. While AutoFDO typically yields smaller performance
gains than iFDO, it presents unique benefits for optimizing kernels.

AutoFDO eliminates the need for instrumented kernels, allowing a single
optimized kernel to serve both execution and profile collection. It
also minimizes slowdown during profile collection, potentially yielding
higher-fidelity profiling, especially for time-sensitive code, compared
to iFDO. Additionally, AutoFDO profiles can be obtained from production
environments via the hardware’s PMU whereas iFDO profiles require
carefully curated load tests that are representative of real-world
traffic.

AutoFDO facilitates profile collection across diverse targets.
Preliminary studies indicate significant variation in kernel hot spots
within Google’s infrastructure, suggesting potential performance gains
through target-specific kernel customization.

Furthermore, other advanced compiler optimization techniques, including
ThinLTO and Propeller can be stacked on top of AutoFDO, similar to
iFDO. ThinLTO achieves better runtime performance through whole-program
analysis and cross module optimizations. The main difference between
traditional LTO and ThinLTO is that the latter is scalable in time and
memory.

This patch series adds AutoFDO and Propeller support to the kernel. The
actual solution comes in six parts:

[P 1] Add the build support for using AutoFDO in Clang
      Add the basic support for AutoFDO build and provide the
      instructions for using AutoFDO.
[P 2] Fix objtool for bogus warnings when -ffunction-sections is enabled
[P 3] Change the subsection ordering when -ffunction-sections is enabled
[P 4] Add markers for text_unlikely and text_hot sections
[P 5] Enable -ffunction-sections for the AutoFDO build
[P 6] Enable Machine Function Split (MFS) optimization for AutoFDO
[P 7] Add Propeller configuration to the kernel build

Patch 1 provides basic AutoFDO build support. Patches 2 to 6 further
enhance the performance of AutoFDO builds and are functionally
dependent on Patch 1. Patch 7 enables support for Propeller and is
dependent on patch 2 to patch 4.

Caveats

AutoFDO is compatible with both GCC and Clang, but the patches in this
series are exclusively applicable to LLVM 17 or newer for AutoFDO and
LLVM 19 or newer for Propeller. For profile conversion, two different
tools could be used, llvm_profgen or create_llvm_prof. llvm_profgen
needs to be the LLVM 19 or newer, or just the LLVM trunk.
Alternatively, create_llvm_prof v0.30.1 or newer can be used instead of
llvm-profgen.

Additionally, the build is only supported on x86 platforms equipped
with PMU capabilities, such as LBR on Intel machines. More
specifically:

* Intel platforms: works on every platform that supports LBR; we have
  tested on Skylake.
* AMD platforms: tested on AMD Zen3 with the BRS feature. The kernel
  needs to be configured with "CONFIG_PERF_EVENTS_AMD_BRS=y". To check,
  use
  $ cat /proc/cpuinfo | grep " brs"
  For the AMD Zen4, AMD LBRV2 is supported, but we suspect a bug with
  AMD LBRv2 implementation in Genoa which blocks the usage.

For ARM, we plan to send patches for SPE-based Propeller when AutoFDO
for Arm is ready.

Experiments and Results

Experiments
[PATCH v5 2/7] objtool: Fix unreachable instruction warnings for weak functions
In the presence of both weak and strong function definitions, the linker drops the weak symbol in favor of the strong symbol, but leaves the code in place. ignore_unreachable_insn() has heuristics to suppress the resulting warnings, but they do not work when -ffunction-sections is enabled.

Suppose function foo has both strong and weak definitions.

Case 1: The strong definition has an annotated section name, like .init.text. Only the weak definition will be placed into .text.foo, but since the section has no symbols, there will be no "hole" in the section.

Case 2: Both definitions are without an annotated section name. Both will be placed into the .text.foo section, but there will be only one symbol (the strong one). If the weak code comes before the strong code, the "hole" is missed because the search fails to find a right-most symbol before the offset.

The fix is to use the first node to compute the hole if hole.sym is empty. If there is no symbol in the section, the first node will be NULL, in which case -1 is returned to skip the whole section.

Co-developed-by: Han Shen
Signed-off-by: Han Shen
Signed-off-by: Rong Xu
Suggested-by: Sriraman Tallam
Suggested-by: Krzysztof Pszeniczny
Tested-by: Yonghong Song
---
 tools/objtool/elf.c | 15 +++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c
index 3d27983dc908..6f64d611faea 100644
--- a/tools/objtool/elf.c
+++ b/tools/objtool/elf.c
@@ -224,12 +224,17 @@ int find_symbol_hole_containing(const struct section *sec, unsigned long offset)
 	if (n)
 		return 0; /* not a hole */
 
-	/* didn't find a symbol for which @offset is after it */
-	if (!hole.sym)
-		return 0; /* not a hole */
+	/*
+	 * @offset >= sym->offset + sym->len, find symbol after it.
+	 * When hole.sym is empty, use the first node to compute the hole.
+	 * If there is no symbol in the section, the first node will be NULL,
+	 * in which case, -1 is returned to skip the whole section.
+	 */
+	if (hole.sym)
+		n = rb_next(&hole.sym->node);
+	else
+		n = rb_first_cached(&sec->symbol_tree);
 
-	/* @offset >= sym->offset + sym->len, find symbol after it */
-	n = rb_next(&hole.sym->node);
 	if (!n)
 		return -1; /* until end of address space */
-- 
2.47.0.105.g07ac214952-goog
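The failure mode is easy to reproduce outside the kernel. A minimal user-space sketch of Case 2 (hypothetical file and function names; clang and ld.lld stand in for the kernel's toolchain and relocatable link):

  $ cat > weak.c <<'EOF'
  __attribute__((weak)) int foo(void) { return 0; }
  EOF
  $ cat > strong.c <<'EOF'
  int foo(void) { return 1; }
  EOF
  $ clang -ffunction-sections -c weak.c strong.c
  $ ld.lld -r weak.o strong.o -o combined.o
  # Only the strong foo survives in the symbol table ...
  $ readelf -sW combined.o | grep ' foo'
  # ... but .text.foo still carries both copies of the code, so the weak
  # copy's bytes form exactly the symbol-less "hole" described above:
  $ objdump -d -j .text.foo combined.o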
[PATCH v5 7/7] Add Propeller configuration for kernel build
Add the build support for using Clang's Propeller optimizer. Like AutoFDO, Propeller uses hardware sampling to gather information about the frequency of execution of different code paths within a binary. This information is then used to guide the compiler's optimization decisions, resulting in a more efficient binary.

The support requires a Clang compiler, LLVM 19 or later, and the create_llvm_prof tool (https://github.com/google/autofdo/releases/tag/v0.30.1). This commit is limited to x86 platforms that support PMU features like LBR on Intel machines and AMD Zen3 BRS.

Here is an example workflow for building an AutoFDO+Propeller optimized kernel:

1) Build the kernel on the host machine with the AutoFDO and Propeller
   build config

      CONFIG_AUTOFDO_CLANG=y
      CONFIG_PROPELLER_CLANG=y

   then

      $ make LLVM=1 CLANG_AUTOFDO_PROFILE=<autofdo_profile>

   "<autofdo_profile>" is the profile collected when doing a
   non-Propeller AutoFDO build. This step builds a kernel that has the
   same optimization level as AutoFDO, plus a metadata section that
   records basic block information. This kernel image runs as fast as an
   AutoFDO optimized kernel.

2) Install the kernel on test/production machines.

3) Run the load tests. The '-c' option in perf specifies the sample
   event period. We suggest using a suitable prime number, like 59, for
   this purpose.

   For Intel platforms:

      $ perf record -e BR_INST_RETIRED.NEAR_TAKEN:k -a -N -b -c <count> \
        -o <perf_file> -- <loadtest>

   For AMD platforms, the supported systems are Zen3 (with BRS) or Zen4
   (with amd_lbr_v2):

      # To see if Zen3 supports BRS:
      $ cat /proc/cpuinfo | grep " brs"
      # To see if Zen4 supports LBRv2:
      $ cat /proc/cpuinfo | grep amd_lbr_v2
      # If the result is yes, then collect the profile using:
      $ perf record --pfm-events RETIRED_TAKEN_BRANCH_INSTRUCTIONS:k -a \
        -N -b -c <count> -o <perf_file> -- <loadtest>

4) (Optional) Download the raw perf file to the host machine.

5) Generate the Propeller profile:

      $ create_llvm_prof --binary=<vmlinux> --profile=<perf_file> \
        --format=propeller --propeller_output_module_name \
        --out=<propeller_prefix>_cc_profile.txt \
        --propeller_symorder=<propeller_prefix>_ld_profile.txt

   "create_llvm_prof" is the profile conversion tool; a prebuilt binary
   for Linux can be found at
   https://github.com/google/autofdo/releases/tag/v0.30.1 (it can also
   be built from source).

   "<propeller_prefix>" can be something like
   "/home/user/dir/any_string". This command generates a pair of
   Propeller profiles: "<propeller_prefix>_cc_profile.txt" and
   "<propeller_prefix>_ld_profile.txt".

6) Rebuild the kernel using the AutoFDO and Propeller profile files.

      CONFIG_AUTOFDO_CLANG=y
      CONFIG_PROPELLER_CLANG=y

   and

      $ make LLVM=1 CLANG_AUTOFDO_PROFILE=<autofdo_profile> \
        CLANG_PROPELLER_PROFILE_PREFIX=<propeller_prefix>

Co-developed-by: Han Shen
Signed-off-by: Han Shen
Signed-off-by: Rong Xu
Suggested-by: Sriraman Tallam
Suggested-by: Krzysztof Pszeniczny
Suggested-by: Nick Desaulniers
Suggested-by: Stephane Eranian
Tested-by: Yonghong Song
---
 Documentation/dev-tools/index.rst     |   1 +
 Documentation/dev-tools/propeller.rst | 162 ++++++++++++++++++++++++++
 MAINTAINERS                           |   7 ++
 Makefile                              |   1 +
 arch/Kconfig                          |  19 +++
 arch/x86/Kconfig                      |   1 +
 arch/x86/kernel/vmlinux.lds.S         |   4 +
 include/asm-generic/vmlinux.lds.h     |   6 +-
 scripts/Makefile.lib                  |  10 ++
 scripts/Makefile.propeller            |  28 +++++
 tools/objtool/check.c                 |   1 +
 11 files changed, 237 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/dev-tools/propeller.rst
 create mode 100644 scripts/Makefile.propeller

diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst
index 6945644f7008..3c0ac08b2709 100644
--- a/Documentation/dev-tools/index.rst
+++ b/Documentation/dev-tools/index.rst
@@ -35,6 +35,7 @@ Documentation/dev-tools/testing-overview.rst
    checkuapi
    gpio-sloppy-logic-analyzer
    autofdo
+   propeller
 
 .. only:: subproject and html

diff --git a/Documentation/dev-tools/propeller.rst b/Documentation/dev-tools/propeller.rst
new file mode 100644
index 000000000000..92195958e3db
--- /dev/null
+++ b/Documentation/dev-tools/propeller.rst
@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+Using Propeller with the Linux kernel
+=====================================
+
+This enables Propeller build support for the kernel when using the
+Clang compiler. Propeller is a profile-guided optimization (PGO) method
+used to optimize binary executables. Like AutoFDO, it utilizes hardware
+sampling to gather information about the frequency of execution of
+different code paths within a binary. Unlike AutoFDO, this information
+is then used right before the linking phase to optimize (among others)
+block layout within and across functions.
+
+A few important notes about adopting Propeller optimization:
+
+#. Although it can be used as a standalone optimization step, i
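A quick way to check that the step-1 kernel image really carries the extra basic-block metadata is to look for the basic-block address map section. This is a hedged sketch: the section name below is LLVM's convention for its basic-block address map, and it assumes an unstripped vmlinux:

  $ readelf -SW vmlinux | grep llvm_bb_addr_map
  # a matching section indicates the Propeller metadata is present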
[PATCH v5 5/7] AutoFDO: Enable -ffunction-sections for the AutoFDO build
Enable -ffunction-sections by default for the AutoFDO build.

With -ffunction-sections, the compiler places each function in its own section named .text.function_name instead of placing all functions in the .text section. In the AutoFDO build, this allows the linker to utilize profile information to reorganize functions for improved utilization of iCache and iTLB.

Co-developed-by: Han Shen
Signed-off-by: Han Shen
Signed-off-by: Rong Xu
Suggested-by: Sriraman Tallam
Tested-by: Yonghong Song
---
 include/asm-generic/vmlinux.lds.h | 11 +++++++++--
 scripts/Makefile.autofdo          |  2 +-
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index e02973f3b418..bd64fdedabd2 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -95,18 +95,25 @@
  * With LTO_CLANG, the linker also splits sections by default, so we need
  * these macros to combine the sections during the final link.
  *
+ * With AUTOFDO_CLANG, by default, the linker splits text sections and
+ * regroups functions into subsections.
+ *
  * RODATA_MAIN is not used because existing code already defines .rodata.x
  * sections to be brought in with rodata.
  */
-#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) || defined(CONFIG_LTO_CLANG)
+#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) || defined(CONFIG_LTO_CLANG) || \
+defined(CONFIG_AUTOFDO_CLANG)
 #define TEXT_MAIN .text .text.[0-9a-zA-Z_]*
+#else
+#define TEXT_MAIN .text
+#endif
+#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) || defined(CONFIG_LTO_CLANG)
 #define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* .data..compoundliteral* .data.$__unnamed_* .data.$L*
 #define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]*
 #define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]* .rodata..L*
 #define BSS_MAIN .bss .bss.[0-9a-zA-Z_]* .bss..L* .bss..compoundliteral*
 #define SBSS_MAIN .sbss .sbss.[0-9a-zA-Z_]*
 #else
-#define TEXT_MAIN .text
 #define DATA_MAIN .data
 #define SDATA_MAIN .sdata
 #define RODATA_MAIN .rodata

diff --git a/scripts/Makefile.autofdo b/scripts/Makefile.autofdo
index ff96a63fea7c..6155d6fc4ca7 100644
--- a/scripts/Makefile.autofdo
+++ b/scripts/Makefile.autofdo
@@ -9,7 +9,7 @@ ifndef CONFIG_DEBUG_INFO
 endif
 
 ifdef CLANG_AUTOFDO_PROFILE
-  CFLAGS_AUTOFDO_CLANG += -fprofile-sample-use=$(CLANG_AUTOFDO_PROFILE)
+  CFLAGS_AUTOFDO_CLANG += -fprofile-sample-use=$(CLANG_AUTOFDO_PROFILE) -ffunction-sections
 endif
 
 ifdef CONFIG_LTO_CLANG_THIN
-- 
2.47.0.105.g07ac214952-goog
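The effect of -ffunction-sections is easy to see on a toy translation unit. A small sketch (hypothetical file and function names):

  $ printf 'void foo(void) {}\nvoid bar(void) {}\n' > fs.c
  $ clang -ffunction-sections -c fs.c
  $ readelf -SW fs.o | grep '\.text\.'
  # expect separate .text.foo and .text.bar sections instead of one .text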
[PATCH v5 3/7] Change the symbols order when -ffunction-sections is enabled
When the -ffunction-sections compiler option is enabled, each function is placed in a separate section named .text.function_name rather than putting all functions in a single .text section.

However, using -ffunction-sections can cause problems with the linker script. The comments included in include/asm-generic/vmlinux.lds.h note the issue:

  "TEXT_MAIN here will match .text.fixup and .text.unlikely if dead
   code elimination is enabled, so these sections should be converted
   to use ".." first."

It is unclear whether there is a straightforward method for converting a suffix to "..".

This patch modifies the order of subsections within the text output section. Specifically, it repositions sections with certain fixed patterns (for example, .text.unlikely) before TEXT_MAIN, ensuring that they are grouped and matched together. It also places the .text.hot section at the beginning of a page to help TLB performance.

Note that the limitation arises because the linker script employs glob patterns instead of regular expressions for string matching. While there is a method to maintain the current order using complex patterns, this significantly complicates the patterns and increases the likelihood of errors.

Co-developed-by: Han Shen
Signed-off-by: Han Shen
Signed-off-by: Rong Xu
Suggested-by: Sriraman Tallam
Suggested-by: Krzysztof Pszeniczny
Tested-by: Yonghong Song
---
 include/asm-generic/vmlinux.lds.h | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index eeadbaeccf88..fd901951549c 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -553,19 +553,24 @@
  * .text section. Map to function alignment to avoid address changes
  * during second ld run in second ld pass when generating System.map
  *
- * TEXT_MAIN here will match .text.fixup and .text.unlikely if dead
- * code elimination is enabled, so these sections should be converted
- * to use ".." first.
+ * TEXT_MAIN here will match symbols with a fixed pattern (for example,
+ * .text.hot or .text.unlikely) if dead code elimination or
+ * function-section is enabled. Match these symbols first before
+ * TEXT_MAIN to ensure they are grouped together.
+ *
+ * Also placing .text.hot section at the beginning of a page, this
+ * would help the TLB performance.
  */
 #define TEXT_TEXT							\
 		ALIGN_FUNCTION();					\
+		*(.text.asan.* .text.tsan.*)				\
+		*(.text.unknown .text.unknown.*)			\
+		*(.text.unlikely .text.unlikely.*)			\
+		. = ALIGN(PAGE_SIZE);					\
 		*(.text.hot .text.hot.*)				\
 		*(TEXT_MAIN .text.fixup)				\
-		*(.text.unlikely .text.unlikely.*)			\
-		*(.text.unknown .text.unknown.*)			\
 		NOINSTR_TEXT						\
-		*(.ref.text)						\
-		*(.text.asan.* .text.tsan.*)
+		*(.ref.text)
 
 /* sched.text is aling to function alignment to secure we have same
-- 
2.47.0.105.g07ac214952-goog
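Why the glob patterns collide can be seen on a toy object. A hedged sketch (hypothetical names; the .text.hot/.text.unlikely prefixes follow the compiler's hot/cold section-naming convention, which may vary by compiler version):

  $ cat > order.c <<'EOF'
  __attribute__((hot))  int fast_path(int x) { return x + 1; }
  __attribute__((cold)) int slow_path(int x) { return x - 1; }
  int normal_path(int x) { return x; }
  EOF
  $ clang -O2 -ffunction-sections -c order.c
  $ readelf -SW order.o | grep '\.text\.'
  # Expect names like .text.hot.fast_path, .text.unlikely.slow_path and
  # .text.normal_path. The TEXT_MAIN glob ".text.[0-9a-zA-Z_]*" matches
  # all three, so the fixed-pattern input sections must be listed before
  # TEXT_MAIN in the linker script or they get swallowed into main text.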
[PATCH v5 6/7] AutoFDO: Enable machine function split optimization for AutoFDO
Enable the machine function split optimization for AutoFDO in Clang.

Machine function split (MFS) is a pass in the Clang compiler that splits a function into hot and cold parts. The linker groups all cold blocks across functions together. This decreases hot code fragmentation and improves iCache and iTLB utilization.

MFS requires a profile, so this is enabled only for the AutoFDO builds.

Co-developed-by: Han Shen
Signed-off-by: Han Shen
Signed-off-by: Rong Xu
Suggested-by: Sriraman Tallam
Suggested-by: Krzysztof Pszeniczny
Tested-by: Yonghong Song
---
 include/asm-generic/vmlinux.lds.h | 7 ++++++-
 scripts/Makefile.autofdo          | 2 ++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index bd64fdedabd2..8a0bb3946cf0 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -556,6 +556,11 @@ defined(CONFIG_AUTOFDO_CLANG)
 	__cpuidle_text_end = .;						\
 	__noinstr_text_end = .;
 
+#define TEXT_SPLIT							\
+	__split_text_start = .;						\
+	*(.text.split .text.split.[0-9a-zA-Z_]*)			\
+	__split_text_end = .;
+
 #define TEXT_UNLIKELY							\
 	__unlikely_text_start = .;					\
 	*(.text.unlikely .text.unlikely.*)				\
@@ -582,6 +587,7 @@ defined(CONFIG_AUTOFDO_CLANG)
 	ALIGN_FUNCTION();						\
 	*(.text.asan.* .text.tsan.*)					\
 	*(.text.unknown .text.unknown.*)				\
+	TEXT_SPLIT							\
 	TEXT_UNLIKELY							\
 	. = ALIGN(PAGE_SIZE);						\
 	TEXT_HOT							\
@@ -589,7 +595,6 @@ defined(CONFIG_AUTOFDO_CLANG)
 	NOINSTR_TEXT							\
 	*(.ref.text)
 
-
 /* sched.text is aling to function alignment to secure we have same
  * address even at second ld pass when generating System.map */

diff --git a/scripts/Makefile.autofdo b/scripts/Makefile.autofdo
index 6155d6fc4ca7..1caf2457e585 100644
--- a/scripts/Makefile.autofdo
+++ b/scripts/Makefile.autofdo
@@ -10,6 +10,7 @@ endif
 
 ifdef CLANG_AUTOFDO_PROFILE
   CFLAGS_AUTOFDO_CLANG += -fprofile-sample-use=$(CLANG_AUTOFDO_PROFILE) -ffunction-sections
+  CFLAGS_AUTOFDO_CLANG += -fsplit-machine-functions
 endif
 
 ifdef CONFIG_LTO_CLANG_THIN
@@ -17,6 +18,7 @@ ifdef CONFIG_LTO_CLANG_THIN
     KBUILD_LDFLAGS += --lto-sample-profile=$(CLANG_AUTOFDO_PROFILE)
   endif
   KBUILD_LDFLAGS += --mllvm=-enable-fs-discriminator=true --mllvm=-improved-fs-discriminator=true -plugin-opt=thinlto
+  KBUILD_LDFLAGS += -plugin-opt=-split-machine-functions
 endif
 
 export CFLAGS_AUTOFDO_CLANG
-- 
2.47.0.105.g07ac214952-goog
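Once MFS is active, the cold halves land between the split-text markers added by this patch, so their aggregate size can be read straight out of the image. A small sketch, assuming an AutoFDO+MFS build of vmlinux:

  $ START=$(nm vmlinux | awk '/__split_text_start/ {print "0x"$1}')
  $ END=$(nm vmlinux | awk '/__split_text_end/ {print "0x"$1}')
  $ printf 'split-out cold text: %d bytes\n' $(( END - START ))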
[PATCH v5 4/7] Add markers for text_unlikely and text_hot sections
Add markers like __hot_text_start, __hot_text_end, __unlikely_text_start, and __unlikely_text_end, which will be included in System.map. These markers indicate how the compiler groups functions, providing valuable information to developers about the layout and optimization of the code.

Co-developed-by: Han Shen
Signed-off-by: Han Shen
Signed-off-by: Rong Xu
Suggested-by: Sriraman Tallam
Tested-by: Yonghong Song
---
 include/asm-generic/vmlinux.lds.h | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index fd901951549c..e02973f3b418 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -549,6 +549,16 @@
 	__cpuidle_text_end = .;						\
 	__noinstr_text_end = .;
 
+#define TEXT_UNLIKELY							\
+	__unlikely_text_start = .;					\
+	*(.text.unlikely .text.unlikely.*)				\
+	__unlikely_text_end = .;
+
+#define TEXT_HOT							\
+	__hot_text_start = .;						\
+	*(.text.hot .text.hot.*)					\
+	__hot_text_end = .;
+
 /*
  * .text section. Map to function alignment to avoid address changes
  * during second ld run in second ld pass when generating System.map
@@ -565,9 +575,9 @@
 	ALIGN_FUNCTION();						\
 	*(.text.asan.* .text.tsan.*)					\
 	*(.text.unknown .text.unknown.*)				\
-	*(.text.unlikely .text.unlikely.*)				\
+	TEXT_UNLIKELY							\
 	. = ALIGN(PAGE_SIZE);						\
-	*(.text.hot .text.hot.*)					\
+	TEXT_HOT							\
 	*(TEXT_MAIN .text.fixup)					\
 	NOINSTR_TEXT							\
 	*(.ref.text)
-- 
2.47.0.105.g07ac214952-goog
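With the markers in place, the size of each group can be read from System.map without touching the image itself. A sketch, assuming the build tree's System.map:

  # the markers land in System.map next to their addresses:
  $ grep -E '__(hot|unlikely)_text_(start|end)' System.map
  # subtracting each group's start address from its end address gives
  # the size of the hot and unlikely text regions the compiler produced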
Re: [PATCH 0/1] remoteproc documentation changes
On Wed, 23 Oct 2024 at 07:53, Jonathan Corbet wrote:
>
> anish kumar writes:
>
> > This patch series transitions the documentation
> > for remoteproc from the staging directory to the
> > mainline kernel. It introduces both kernel and
> > user-space APIs, enhancing the overall documentation
> > quality.
> >
> > V4:
> > Fixed compilation errors and moved documentation to
> > the driver-api directory.
> >
> > V3:
> > Separated out the patches further to make the intention
> > clear for each patch.
> >
> > V2:
> > Reported-by: kernel test robot
> > Closes: https://lore.kernel.org/oe-kbuild-all/202410161444.jokmsogs-...@intel.com/
>
> So I think you could make better use of kerneldoc comments for a number
> of your APIs and structures - a project for the future. I can't judge
> the remoteproc aspects of this, but from a documentation mechanics point
> of view, this looks about ready to me. In the absence of objections
> I'll apply it in the near future.

Please hold off before applying, I will review the content in the coming days.

Thanks,
Mathieu
Re: [PATCH v4 5/6] AutoFDO: Enable machine function split optimization for AutoFDO
On Tue, Oct 22, 2024 at 11:50 PM Masahiro Yamada wrote:
>
> On Tue, Oct 22, 2024 at 8:28 AM Rong Xu wrote:
> >
> > On Sun, Oct 20, 2024 at 8:18 PM Masahiro Yamada wrote:
> > >
> > > On Tue, Oct 15, 2024 at 6:33 AM Rong Xu wrote:
> > > >
> > > > Enable the machine function split optimization for AutoFDO in Clang.
> > > >
> > > > Machine function split (MFS) is a pass in the Clang compiler that
> > > > splits a function into hot and cold parts. The linker groups all
> > > > cold blocks across functions together. This decreases hot code
> > > > fragmentation and improves iCache and iTLB utilization.
> > > >
> > > > MFS requires a profile so this is enabled only for the AutoFDO builds.
> > > >
> > > > Co-developed-by: Han Shen
> > > > Signed-off-by: Han Shen
> > > > Signed-off-by: Rong Xu
> > > > Suggested-by: Sriraman Tallam
> > > > Suggested-by: Krzysztof Pszeniczny
> > > > ---
> > > >  include/asm-generic/vmlinux.lds.h | 6 ++++++
> > > >  scripts/Makefile.autofdo          | 2 ++
> > > >  2 files changed, 8 insertions(+)
> > > >
> > > > diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> > > > index ace617d1af9b..20e46c0917db 100644
> > > > --- a/include/asm-generic/vmlinux.lds.h
> > > > +++ b/include/asm-generic/vmlinux.lds.h
> > > > @@ -565,9 +565,14 @@ defined(CONFIG_AUTOFDO_CLANG)
> > > >  	__unlikely_text_start = .;				\
> > > >  	*(.text.unlikely .text.unlikely.*)			\
> > > >  	__unlikely_text_end = .;
> > > > +#define TEXT_SPLIT						\
> > > > +	__split_text_start = .;					\
> > > > +	*(.text.split .text.split.[0-9a-zA-Z_]*)		\
> > > > +	__split_text_end = .;
> > > >  #else
> > > >  #define TEXT_HOT *(.text.hot .text.hot.*)
> > > >  #define TEXT_UNLIKELY *(.text.unlikely .text.unlikely.*)
> > > > +#define TEXT_SPLIT
> > > >  #endif
> > >
> > > Why conditional?
> >
> > The condition is to ensure that we don't change the default kernel
> > build by any means. The new code will introduce a few new symbols.
>
> Same.
>
> Adding the two __split_text_start and __split_text_end markers
> does not affect anything. It just increases the kallsyms table slightly.
>
> You can do it unconditionally.

Got it.

> > > Where are __unlikely_text_start and __unlikely_text_end used?
> >
> > These new symbols are currently unreferenced within the kernel source
> > tree. However, they provide a valuable means of identifying hot and
> > cold sections of text, and how large they are. I think they are
> > useful information.
>
> Should be explained in the commit description.

Will explain in the commit message.

> --
> Best Regards
> Masahiro Yamada
[PATCH v4 6/6] alloc_tag: support for page allocation tag compression
Implement support for storing page allocation tag references directly in the page flags instead of page extensions. The sysctl.vm.mem_profiling boot parameter is extended to provide a way for a user to request this mode.

Enabling compression eliminates the memory overhead caused by page_ext and results in better performance for page allocations. However, this mode will not work if the number of available page flag bits is insufficient to address all kernel allocations. Such a condition can happen during boot or when loading a module. If this condition is detected, memory allocation profiling gets disabled with an appropriate warning. By default, compression mode is disabled.

Signed-off-by: Suren Baghdasaryan
---
 Documentation/mm/allocation-profiling.rst |   7 +-
 include/linux/alloc_tag.h                 |  10 +-
 include/linux/codetag.h                   |   3 +
 include/linux/page-flags-layout.h         |   7 ++
 include/linux/pgalloc_tag.h               | 145 ++++++++++++++++++---
 lib/alloc_tag.c                           | 142 +++++++++++++++++--
 lib/codetag.c                             |   4 +-
 mm/mm_init.c                              |   5 +-
 8 files changed, 290 insertions(+), 33 deletions(-)

diff --git a/Documentation/mm/allocation-profiling.rst b/Documentation/mm/allocation-profiling.rst
index ffd6655b7be2..316311240e6a 100644
--- a/Documentation/mm/allocation-profiling.rst
+++ b/Documentation/mm/allocation-profiling.rst
@@ -18,12 +18,17 @@ kconfig options:
   missing annotation
 
 Boot parameter:
-  sysctl.vm.mem_profiling=0|1|never
+  sysctl.vm.mem_profiling={0|1|never}[,compressed]
 
   When set to "never", memory allocation profiling overhead is minimized and it
   cannot be enabled at runtime (sysctl becomes read-only).
 
   When CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y, default value is "1".
 
   When CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n, default value is "never".
 
+  "compressed" optional parameter will try to store page tag references in a
+  compact format, avoiding page extensions. This results in improved performance
+  and memory consumption, however it might fail depending on system configuration.
+  If compression fails, a warning is issued and memory allocation profiling gets
+  disabled.
 
 sysctl:
   /proc/sys/vm/mem_profiling

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 7431757999c5..4f811ec0ffe0 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -30,8 +30,16 @@ struct alloc_tag {
 	struct alloc_tag_counters __percpu *counters;
 } __aligned(8);
 
+struct alloc_tag_kernel_section {
+	struct alloc_tag *first_tag;
+	unsigned long count;
+};
+
 struct alloc_tag_module_section {
-	unsigned long start_addr;
+	union {
+		unsigned long start_addr;
+		struct alloc_tag *first_tag;
+	};
 	unsigned long end_addr;
 	/* used size */
 	unsigned long size;

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index d10bd9810d32..d14dbd26b370 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -13,6 +13,9 @@ struct codetag_module;
 struct seq_buf;
 struct module;
 
+#define CODETAG_SECTION_START_PREFIX	"__start_"
+#define CODETAG_SECTION_STOP_PREFIX	"__stop_"
+
 /*
  * An instance of this structure is created in a special ELF section at every
  * code location being tagged.  At runtime, the special section is treated as

diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 7d79818dc065..4f5c9e979bb9 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -111,5 +111,12 @@
 	ZONES_WIDTH - LRU_GEN_WIDTH - SECTIONS_WIDTH - \
 	NODES_WIDTH - KASAN_TAG_WIDTH - LAST_CPUPID_WIDTH)
 
+#define NR_NON_PAGEFLAG_BITS	(SECTIONS_WIDTH + NODES_WIDTH + ZONES_WIDTH + \
+				LAST_CPUPID_SHIFT + KASAN_TAG_WIDTH + \
+				LRU_GEN_WIDTH + LRU_REFS_WIDTH)
+
+#define NR_UNUSED_PAGEFLAG_BITS	(BITS_PER_LONG - \
+				(NR_NON_PAGEFLAG_BITS + NR_PAGEFLAGS))
+
 #endif
 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */

diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index b13cd3313a88..1fe63b52e5e5 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -11,29 +11,118 @@
 
 #include 
 
+extern struct page_ext_operations page_alloc_tagging_ops;
+extern unsigned long alloc_tag_ref_mask;
+extern int alloc_tag_ref_offs;
+extern struct alloc_tag_kernel_section kernel_tags;
+
+DECLARE_STATIC_KEY_FALSE(mem_profiling_compressed);
+
+typedef u16	pgalloc_tag_idx;
+
 union pgtag_ref_handle {
 	union codetag_ref *ref;	/* reference in page extension */
+	struct page *page;	/* reference in page flags */
 };
 
-extern struct page_ext_operations page_alloc_tagging_ops;
+/* Reserved indexes */
+#defi
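The compressed mode is requested at boot and falls back safely; a usage sketch (boot-parameter syntax from the documentation hunk above; the exact warning text is not assumed):

  # kernel command line: enable profiling with compressed page-tag refs
  #   sysctl.vm.mem_profiling=1,compressed
  # after boot, confirm profiling is still enabled:
  $ cat /proc/sys/vm/mem_profiling
  # if the free page-flag bits could not address all kernel allocation
  # tags, profiling was disabled with a warning; search the log:
  $ dmesg | grep -i profiling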
Re: [PATCH v4 5/6] alloc_tag: introduce pgtag_ref_handle to abstract page tag references
On Wed, 23 Oct 2024 10:07:58 -0700 Suren Baghdasaryan wrote:

> To simplify later changes to page tag references, introduce new
> pgtag_ref_handle type. This allows easy replacement of page_ext
> as a storage of page allocation tags.
>
> ...
>
>  static inline void pgalloc_tag_copy(struct folio *new, struct folio *old)
>  {
> +	union pgtag_ref_handle handle;
> +	union codetag_ref ref;
>  	struct alloc_tag *tag;
> -	union codetag_ref *ref;
>
>  	tag = pgalloc_tag_get(&old->page);
>  	if (!tag)
>  		return;
>
> -	ref = get_page_tag_ref(&new->page);
> -	if (!ref)
> +	if (!get_page_tag_ref(&new->page, &ref, &handle))
>  		return;
>
>  	/* Clear the old ref to the original allocation tag. */
>  	clear_page_tag_ref(&old->page);
>  	/* Decrement the counters of the tag on get_new_folio. */
> -	alloc_tag_sub(ref, folio_nr_pages(new));
> -
> -	__alloc_tag_ref_set(ref, tag);
> -
> -	put_page_tag_ref(ref);
> +	alloc_tag_sub(&ref, folio_nr_pages(new));

mm-stable has folio_size(new) here, fixed up.

I think we already discussed this, but there's a crazy amount of
inlining here. pgalloc_tag_split() is huge, and has four callsites.
Re: [PATCH v4 5/6] alloc_tag: introduce pgtag_ref_handle to abstract page tag references
On Wed, Oct 23, 2024 at 2:00 PM Andrew Morton wrote:
>
> On Wed, 23 Oct 2024 10:07:58 -0700 Suren Baghdasaryan wrote:
>
> > To simplify later changes to page tag references, introduce new
> > pgtag_ref_handle type. This allows easy replacement of page_ext
> > as a storage of page allocation tags.
> >
> > ...
> >
> >  static inline void pgalloc_tag_copy(struct folio *new, struct folio *old)
> >  {
> > +	union pgtag_ref_handle handle;
> > +	union codetag_ref ref;
> >  	struct alloc_tag *tag;
> > -	union codetag_ref *ref;
> >
> >  	tag = pgalloc_tag_get(&old->page);
> >  	if (!tag)
> >  		return;
> >
> > -	ref = get_page_tag_ref(&new->page);
> > -	if (!ref)
> > +	if (!get_page_tag_ref(&new->page, &ref, &handle))
> >  		return;
> >
> >  	/* Clear the old ref to the original allocation tag. */
> >  	clear_page_tag_ref(&old->page);
> >  	/* Decrement the counters of the tag on get_new_folio. */
> > -	alloc_tag_sub(ref, folio_nr_pages(new));
> > -
> > -	__alloc_tag_ref_set(ref, tag);
> > -
> > -	put_page_tag_ref(ref);
> > +	alloc_tag_sub(&ref, folio_nr_pages(new));
>
> mm-stable has folio_size(new) here, fixed up.

Oh, right. You merged that patch tonight and I formatted my patchset
yesterday :) Thanks for the fixup.

> I think we already discussed this, but there's a crazy amount of
> inlining here. pgalloc_tag_split() is huge, and has four callsites.

I must have missed that discussion but I am happy to uninline this
function. I think splitting is a heavy enough operation that the
uninlining would not be noticeable.

Thanks!