[PATCH] xentrace: free CPU mask string before overwriting pointer
While multiple -c options may be unexpected, we'd still better deal with them properly. Also restore the blank line that was bogusly zapped by the same commit. Coverity-ID: 1638723 Fixes: e4ad2836842a ("xentrace: Implement cpu mask range parsing of human values (-c)") Signed-off-by: Jan Beulich --- a/tools/xentrace/xentrace.c +++ b/tools/xentrace/xentrace.c @@ -1105,8 +1105,10 @@ static void parse_args(int argc, char ** break; case 'c': /* set new cpu mask for filtering (when xch is set). */ +free(opts.cpu_mask_str); opts.cpu_mask_str = strdup(optarg); break; + case 'e': /* set new event mask for filtering*/ parse_evtmask(optarg); break;
[PATCH] xl: properly dispose of libxl_dominfo struct instances
The ssid_label field requires separate freeing; make sure to call libxl_dominfo_dispose(). And then, for good measure, also libxl_dominfo_init(). Coverity-ID: 1638727 Coverity-ID: 1638728 Fixes: c458c404da16 ("xl: use libxl_domain_info to get the uuid in printf_info") Fixes: 48dab9767d2e ("tools/xl: use libxl_domain_info to get domain type for vcpu-pin") Signed-off-by: Jan Beulich --- I wasn't quite sure about use of libxl_dominfo_init(): vcpuset(), for example, doesn't call it. --- a/tools/xl/xl_sxp.c +++ b/tools/xl/xl_sxp.c @@ -45,8 +45,10 @@ void printf_info_sexp(int domid, libxl_d /* retrieve the UUID from dominfo, since it is probably generated * during parsing and thus does not match the real one */ +libxl_dominfo_init(&info); if (libxl_domain_info(ctx, &info, domid) == 0) { fprintf(fh, "\t(uuid " LIBXL_UUID_FMT ")\n", LIBXL_UUID_BYTES(info.uuid)); +libxl_dominfo_dispose(&info); } else { fprintf(fh, "\t(uuid )\n"); } --- a/tools/xl/xl_vcpu.c +++ b/tools/xl/xl_vcpu.c @@ -286,6 +286,8 @@ int main_vcpupin(int argc, char **argv) if (!ignore_masks && hard) { libxl_dominfo dominfo; +libxl_dominfo_init(&dominfo); + if (libxl_domain_info(ctx, &dominfo, domid)) { fprintf(stderr, "Could not get domain info\n"); goto out; @@ -293,6 +295,8 @@ int main_vcpupin(int argc, char **argv) /* HVM and PVH domains use the same global affinity mask */ apply_global_affinity_masks(dominfo.domain_type, hard, 1); + +libxl_dominfo_dispose(&dominfo); } if (force) {
[PATCH] xl: properly dispose of vTPM struct instance
The backend_domname field requires separate freeing; make sure to call libxl_device_vtpm_dispose() also on respective error paths. Coverity-ID: 1638719 Fixes: dde22055ac3a ("libxl: add vtpm support") Signed-off-by: Jan Beulich --- a/tools/xl/xl_vtpm.c +++ b/tools/xl/xl_vtpm.c @@ -44,12 +44,14 @@ int main_vtpmattach(int argc, char **arg if (MATCH_OPTION("uuid", *argv, oparg)) { if(libxl_uuid_from_string(&(vtpm.uuid), oparg)) { fprintf(stderr, "Invalid uuid specified (%s)\n", oparg); +libxl_device_vtpm_dispose(&vtpm); return 1; } } else if (MATCH_OPTION("backend", *argv, oparg)) { replace_string(&vtpm.backend_domname, oparg); } else { fprintf(stderr, "unrecognized argument `%s'\n", *argv); +libxl_device_vtpm_dispose(&vtpm); return 1; } } @@ -65,6 +67,7 @@ int main_vtpmattach(int argc, char **arg if (libxl_device_vtpm_add(ctx, domid, &vtpm, 0)) { fprintf(stderr, "libxl_device_vtpm_add failed.\n"); +libxl_device_vtpm_dispose(&vtpm); return 1; } libxl_device_vtpm_dispose(&vtpm);
Re: [PATCH v2 1/2] x86/uaccess: rework user access speculative harden guards
On 2025-01-10 09:56, Nicola Vetrini wrote: On 2025-01-10 09:29, Roger Pau Monné wrote: On Thu, Jan 09, 2025 at 03:57:24PM -0800, Stefano Stabellini wrote: On Thu, 9 Jan 2025, Nicola Vetrini wrote: > On 2025-01-04 01:20, Stefano Stabellini wrote: > > Hi Nicola, one question below > > I will update ECLAIR to treat the two forms as the same, so this patch can be > dropped. If you think it's helpful I can send a patch spelling out this - > arbitrary, but reasonable in my opinion - extension to the MISRA rule (which > does not consider the implications related to the use of GNU exensions) so > that contributors have a clear picture of the situation. Thank you Nicola! Yes the patch would be appreciated :-) So unless the proposed adjustment is considered better for code readability patch 1 can be dropped, and patch 2 could be applied after the ECLAIR change is in effect? Yes, exactly How long will it take Nicola to get the ECLAIR change propagated into the Gitlab runner? Thanks, Roger. We're still fixing the false positive upstream, but it shouldn't take too long so I think next week I should be able to refresh the runner. Hi Roger, the runner is updated so, assuming no new violation of Rule 20.7 appeared in the meantime, the rule should now be clean. In the next few days I'll prepare a patch to the docs to document the behaviour. Thanks, Nicola -- Nicola Vetrini, B.Sc. Software Engineer BUGSENG (https://bugseng.com) LinkedIn: https://www.linkedin.com/in/nicola-vetrini-a42471253
Re: [PATCH] xl: properly dispose of libxl_dominfo struct instances
On 14/01/2025 8:12 am, Jan Beulich wrote: > The ssid_label field requires separate freeing; make sure to call > libxl_dominfo_dispose(). And then, for good measure, also > libxl_dominfo_init(). > > Coverity-ID: 1638727 > Coverity-ID: 1638728 > Fixes: c458c404da16 ("xl: use libxl_domain_info to get the uuid in > printf_info") > Fixes: 48dab9767d2e ("tools/xl: use libxl_domain_info to get domain type for > vcpu-pin") > Signed-off-by: Jan Beulich > --- > I wasn't quite sure about use of libxl_dominfo_init(): vcpuset(), for > example, doesn't call it. It's a written requirement (somewhere) that *_init() and *_dispose() do get called. Except everyone's lazy with them and plenty of scenarios function without, which is why it's often Coverity telling us about it (but only for the instances where there's a real malloc()/free() gone missing). I expect it would be better to extend this patch to fix up vcpuset() too. The changes so far LGTM. ~Andrew
Re: [PATCH] xentrace: free CPU mask string before overwriting pointer
On 14/01/2025 8:12 am, Jan Beulich wrote: > While multiple -c options may be unexpected, we'd still better deal with > them properly. > > Also restore the blank line that was bogusly zapped by the same commit. > > Coverity-ID: 1638723 > Fixes: e4ad2836842a ("xentrace: Implement cpu mask range parsing of human > values (-c)") > Signed-off-by: Jan Beulich Reviewed-by: Andrew Cooper
Re: [PATCH] xl: properly dispose of libxl_dominfo struct instances
On 1/14/25 9:12 AM, Jan Beulich wrote: The ssid_label field requires separate freeing; make sure to call libxl_dominfo_dispose(). And then, for good measure, also libxl_dominfo_init(). Coverity-ID: 1638727 Coverity-ID: 1638728 Fixes: c458c404da16 ("xl: use libxl_domain_info to get the uuid in printf_info") Fixes: 48dab9767d2e ("tools/xl: use libxl_domain_info to get domain type for vcpu-pin") Signed-off-by: Jan Beulich Release-Acked-by: Oleksii Kurochko Thanks. ~ Oleksii --- I wasn't quite sure about use of libxl_dominfo_init(): vcpuset(), for example, doesn't call it. --- a/tools/xl/xl_sxp.c +++ b/tools/xl/xl_sxp.c @@ -45,8 +45,10 @@ void printf_info_sexp(int domid, libxl_d /* retrieve the UUID from dominfo, since it is probably generated * during parsing and thus does not match the real one */ +libxl_dominfo_init(&info); if (libxl_domain_info(ctx, &info, domid) == 0) { fprintf(fh, "\t(uuid " LIBXL_UUID_FMT ")\n", LIBXL_UUID_BYTES(info.uuid)); +libxl_dominfo_dispose(&info); } else { fprintf(fh, "\t(uuid )\n"); } --- a/tools/xl/xl_vcpu.c +++ b/tools/xl/xl_vcpu.c @@ -286,6 +286,8 @@ int main_vcpupin(int argc, char **argv) if (!ignore_masks && hard) { libxl_dominfo dominfo; +libxl_dominfo_init(&dominfo); + if (libxl_domain_info(ctx, &dominfo, domid)) { fprintf(stderr, "Could not get domain info\n"); goto out; @@ -293,6 +295,8 @@ int main_vcpupin(int argc, char **argv) /* HVM and PVH domains use the same global affinity mask */ apply_global_affinity_masks(dominfo.domain_type, hard, 1); + +libxl_dominfo_dispose(&dominfo); } if (force) {
Re: [PATCH v2 2/2] automation/eclair: make Misra rule 20.7 blocking for x86 also
Hello Oleksii, This is in principle ready to go in now (I'm currently running a private Eclair scan to ensure the patch is still OK against current staging). I would like to ask for a release Ack. Thanks, Roger. On Tue, Nov 26, 2024 at 10:35:08AM +0100, Roger Pau Monne wrote: > There are no violations left, make the rule globally blocking for both x86 and > ARM. > > Signed-off-by: Roger Pau Monné > Reviewed-by: Andrew Cooper > --- > automation/eclair_analysis/ECLAIR/tagging.ecl | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/automation/eclair_analysis/ECLAIR/tagging.ecl > b/automation/eclair_analysis/ECLAIR/tagging.ecl > index 755ea3271fc9..cb4e233e838d 100644 > --- a/automation/eclair_analysis/ECLAIR/tagging.ecl > +++ b/automation/eclair_analysis/ECLAIR/tagging.ecl > @@ -80,6 +80,7 @@ MC3R1.R20.2|| > MC3R1.R20.3|| > MC3R1.R20.4|| > MC3R1.R20.6|| > +MC3R1.R20.7|| > MC3R1.R20.9|| > MC3R1.R20.11|| > MC3R1.R20.12|| > @@ -116,7 +117,7 @@ if(string_equal(target,"x86_64"), > ) > > if(string_equal(target,"arm64"), > - > service_selector({"additional_clean_guidelines","MC3R1.R2.1||MC3R1.R5.3||MC3.R11.2||MC3R1.R16.6||MC3R1.R20.7"}) > + > service_selector({"additional_clean_guidelines","MC3R1.R2.1||MC3R1.R5.3||MC3.R11.2||MC3R1.R16.6"}) > ) > > > -reports+={clean:added,"service(clean_guidelines_common||additional_clean_guidelines)"} > -- > 2.46.0 >
Re: [PATCH v2 2/2] automation/eclair: make Misra rule 20.7 blocking for x86 also
On 2025-01-14 12:22, Roger Pau Monné wrote: Hello Oleksii, This is in principle ready to go in now (I'm currently running a private Eclair scan to ensure the patch is still OK against current staging). I would like to ask for a release Ack. One nit below, which I overlooked initially Thanks, Roger. On Tue, Nov 26, 2024 at 10:35:08AM +0100, Roger Pau Monne wrote: There are no violations left, make the rule globally blocking for both x86 and ARM. Signed-off-by: Roger Pau Monné Reviewed-by: Andrew Cooper --- automation/eclair_analysis/ECLAIR/tagging.ecl | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/automation/eclair_analysis/ECLAIR/tagging.ecl b/automation/eclair_analysis/ECLAIR/tagging.ecl index 755ea3271fc9..cb4e233e838d 100644 --- a/automation/eclair_analysis/ECLAIR/tagging.ecl +++ b/automation/eclair_analysis/ECLAIR/tagging.ecl @@ -80,6 +80,7 @@ MC3R1.R20.2|| MC3R1.R20.3|| MC3R1.R20.4|| MC3R1.R20.6|| +MC3R1.R20.7|| MC3R1.R20.9|| MC3R1.R20.11|| MC3R1.R20.12|| @@ -116,7 +117,7 @@ if(string_equal(target,"x86_64"), ) this hunk will not apply because it uses MC3R1, rather than MC3R2. Should be an easy fix. if(string_equal(target,"arm64"), - service_selector({"additional_clean_guidelines","MC3R1.R2.1||MC3R1.R5.3||MC3.R11.2||MC3R1.R16.6||MC3R1.R20.7"}) + service_selector({"additional_clean_guidelines","MC3R1.R2.1||MC3R1.R5.3||MC3.R11.2||MC3R1.R16.6"}) ) here as well -reports+={clean:added,"service(clean_guidelines_common||additional_clean_guidelines)"} -- 2.46.0 -- Nicola Vetrini, B.Sc. Software Engineer BUGSENG (https://bugseng.com) LinkedIn: https://www.linkedin.com/in/nicola-vetrini-a42471253
Re: [PATCH] xentrace: free CPU mask string before overwriting pointer
On 1/14/25 9:12 AM, Jan Beulich wrote: While multiple -c options may be unexpected, we'd still better deal with them properly. Also restore the blank line that was bogusly zapped by the same commit. Coverity-ID: 1638723 Fixes: e4ad2836842a ("xentrace: Implement cpu mask range parsing of human values (-c)") Signed-off-by: Jan Beulich R-Acked-by: Oleksii Kurochko Thanks. ~ Oleksii --- a/tools/xentrace/xentrace.c +++ b/tools/xentrace/xentrace.c @@ -1105,8 +1105,10 @@ static void parse_args(int argc, char ** break; case 'c': /* set new cpu mask for filtering (when xch is set). */ +free(opts.cpu_mask_str); opts.cpu_mask_str = strdup(optarg); break; + case 'e': /* set new event mask for filtering*/ parse_evtmask(optarg); break;
Re: [PATCH] xl: properly dispose of vTPM struct instance
On 1/14/25 9:13 AM, Jan Beulich wrote: The backend_domname field requires separate freeing; make sure to call libxl_device_vtpm_dispose() also on respective error paths. Coverity-ID: 1638719 Fixes: dde22055ac3a ("libxl: add vtpm support") Signed-off-by: Jan Beulich R-Acked-By: Oleksii Kurochko Thanks. ~ Oleksii --- a/tools/xl/xl_vtpm.c +++ b/tools/xl/xl_vtpm.c @@ -44,12 +44,14 @@ int main_vtpmattach(int argc, char **arg if (MATCH_OPTION("uuid", *argv, oparg)) { if(libxl_uuid_from_string(&(vtpm.uuid), oparg)) { fprintf(stderr, "Invalid uuid specified (%s)\n", oparg); +libxl_device_vtpm_dispose(&vtpm); return 1; } } else if (MATCH_OPTION("backend", *argv, oparg)) { replace_string(&vtpm.backend_domname, oparg); } else { fprintf(stderr, "unrecognized argument `%s'\n", *argv); +libxl_device_vtpm_dispose(&vtpm); return 1; } } @@ -65,6 +67,7 @@ int main_vtpmattach(int argc, char **arg if (libxl_device_vtpm_add(ctx, domid, &vtpm, 0)) { fprintf(stderr, "libxl_device_vtpm_add failed.\n"); +libxl_device_vtpm_dispose(&vtpm); return 1; } libxl_device_vtpm_dispose(&vtpm);
Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock
On Tue, Jan 14, 2025 at 12:12:03PM +0100, Oleksii Kurochko wrote: > > On 1/13/25 6:52 PM, Roger Pau Monné wrote: > > On Mon, Jan 13, 2025 at 05:07:55PM +0100, Marek Marczykowski-Górecki wrote: > > > On Fri, Sep 13, 2024 at 09:59:06AM +0200, Roger Pau Monne wrote: > > > > Allow setting the used wallclock from the command line. When the > > > > option is set > > > > to a value different than `auto` the probing is bypassed and the > > > > selected > > > > implementation is used (as long as it's available). > > > > > > > > The `xen` and `efi` options require being booted as a Xen guest (with > > > > Xen guest > > > > supported built-in) or from UEFI firmware respectively. > > > > > > > > Signed-off-by: Roger Pau Monné > > > Reviewed-by: Marek Marczykowski-Górecki > > Thanks for the review. > > > > Oleksii, can I get your opinion as Release Manager about whether this > > (and the following patch) would be suitable for committing to staging > > given the current release state? > > > > It's a workaround for broken EFI implementations that many downstreams > > carry on their patch queue. > > Based on your commit message, I understand this as addressing a bug ( but not > very critical > as IIUC downstreams have the similar patch on their side ). Therefore, if it > has been properly > reviewed and tested, we should consider including it in the current release. IIRC at least Qubes, XenServer and XCP-ng have a patch that achieves the same behavior as proposed here. > IIUC, setting the wallclock to EFI should align with the behavior Xen had > previously. > It might be preferable to use that argument as the default for a while, > allowing others to verify the "auto" > value over time. After that, we could consider making "auto" the default. > That said, I am not particularly strict about following this approach. We cannot really set efi as the default, as it would break when booting on legacy BIOS systems. We could take only patch 1 and leave patch 2 after Xen 4.20 has branched, but at that point I would see little benefit in having just patch 1. I don't have a strong opinion, but downstreams have been complaining about Xen behavior regarding the usage of EFI_GET_TIME, so it might be good to not ship yet another release with such allegedly broken behavior. Let me know what you think, as I would need a formal Release-Ack if this is to be committed. Thanks, Roger. > ~ Oleksii > > > > > > Thanks, Roger.
Re: [PATCH] x86/boot: Handle better alignment for 32 bit code
On 14.01.2025 10:59, Frediano Ziglio wrote: > Output file didn't have correct alignment. > Allows alignment into data or code up to 2mb. > > Signed-off-by: Frediano Ziglio Afaic this is way too little of a description. For example, ... > --- a/xen/arch/x86/boot/Makefile > +++ b/xen/arch/x86/boot/Makefile > @@ -40,8 +40,8 @@ LD32 := $(LD) $(subst x86_64,i386,$(LDFLAGS_DIRECT)) > # are affected by both text_diff and text_gap. Ensure the sum of gap and > diff > # is greater than 2^16 so that any 16bit relocations if present in the object > # file turns into a build-time error. > -text_gap := 0x010200 > -text_diff := 0x408020 > +text_gap := 0x01c240 > +text_diff := 0x7e3dc0 ... how is anyone to derive how we ended up with these magic numbers? What parts of them are arbitrary, and what parts are required to be the way they are? > @@ -69,7 +69,6 @@ $(obj)/built-in-32.%.bin: $(obj)/build32.%.lds > $(obj)/built-in-32.tmp.o > $(LD32) $(orphan-handling-y) -N -T $< -o $(@:bin=o) $(filter %.o,$^) > $(NM) -p --format=bsd $(@:bin=o) > $(@:bin=map) > $(OBJCOPY) -j .text -O binary $(@:bin=o) $@ > - rm -f $(@:bin=o) This looks like an unrelated change. It may be okay to have here, but then it needs mentioning in the description. > @@ -80,6 +79,7 @@ cmd_combine = \ >--bin1 $(obj)/built-in-32.base.bin \ >--bin2 $(obj)/built-in-32.offset.bin \ >--map $(obj)/built-in-32.base.map \ > + --align $(shell $(OBJDUMP) -h > $(obj)/built-in-32.base.o|sed '/text.*2\*\*/ {s/.*2\*\*//;p;}; d') \ >--exports cmdline_parse_early,reloc,reloc_trampoline32 \ >--output$@ > > @@ -90,4 +90,4 @@ $(obj)/built-in-32.S: $(obj)/built-in-32.base.bin > $(obj)/built-in-32.offset.bin >$(srctree)/tools/combine_two_binaries.py FORCE > $(call if_changed,combine) > > -clean-files := built-in-32.*.bin built-in-32.*.map build32.*.lds > +clean-files := built-in-32.*.bin built-in-32.*.map built-in-32.*.o > build32.*.lds This looks like it would have been needed already before, if the build process was interrupted before the "rm" that you remove above. Jan
Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock
On 1/14/25 12:27 PM, Roger Pau Monné wrote: On Tue, Jan 14, 2025 at 12:12:03PM +0100, Oleksii Kurochko wrote: On 1/13/25 6:52 PM, Roger Pau Monné wrote: On Mon, Jan 13, 2025 at 05:07:55PM +0100, Marek Marczykowski-Górecki wrote: On Fri, Sep 13, 2024 at 09:59:06AM +0200, Roger Pau Monne wrote: Allow setting the used wallclock from the command line. When the option is set to a value different than `auto` the probing is bypassed and the selected implementation is used (as long as it's available). The `xen` and `efi` options require being booted as a Xen guest (with Xen guest supported built-in) or from UEFI firmware respectively. Signed-off-by: Roger Pau Monné Reviewed-by: Marek Marczykowski-Górecki Thanks for the review. Oleksii, can I get your opinion as Release Manager about whether this (and the following patch) would be suitable for committing to staging given the current release state? It's a workaround for broken EFI implementations that many downstreams carry on their patch queue. Based on your commit message, I understand this as addressing a bug ( but not very critical as IIUC downstreams have the similar patch on their side ). Therefore, if it has been properly reviewed and tested, we should consider including it in the current release. IIRC at least Qubes, XenServer and XCP-ng have a patch that achieves the same behavior as proposed here. IIUC, setting the wallclock to EFI should align with the behavior Xen had previously. It might be preferable to use that argument as the default for a while, allowing others to verify the "auto" value over time. After that, we could consider making "auto" the default. That said, I am not particularly strict about following this approach. We cannot really set efi as the default, as it would break when booting on legacy BIOS systems. We could take only patch 1 and leave patch 2 after Xen 4.20 has branched, but at that point I would see little benefit in having just patch 1. I don't see a lot of benefit of comitting only the one patch either. I don't have a strong opinion, but downstreams have been complaining about Xen behavior regarding the usage of EFI_GET_TIME, so it might be good to not ship yet another release with such allegedly broken behavior. Agree with that. As I mentioned above I consider it as a bug and based on that several mentioned above downstreams have the similar patch I could consider that as tested approach so .. Let me know what you think, as I would need a formal Release-Ack if this is to be committed. ... R-Acked-by: Oleksii Kurochko. Thanks. ~ Oleksii Thanks, Roger. ~ Oleksii Thanks, Roger.
Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock
On 1/14/25 12:40 PM, Oleksii Kurochko wrote: On 1/14/25 12:27 PM, Roger Pau Monné wrote: On Tue, Jan 14, 2025 at 12:12:03PM +0100, Oleksii Kurochko wrote: On 1/13/25 6:52 PM, Roger Pau Monné wrote: On Mon, Jan 13, 2025 at 05:07:55PM +0100, Marek Marczykowski-Górecki wrote: On Fri, Sep 13, 2024 at 09:59:06AM +0200, Roger Pau Monne wrote: Allow setting the used wallclock from the command line. When the option is set to a value different than `auto` the probing is bypassed and the selected implementation is used (as long as it's available). The `xen` and `efi` options require being booted as a Xen guest (with Xen guest supported built-in) or from UEFI firmware respectively. Signed-off-by: Roger Pau Monné Reviewed-by: Marek Marczykowski-Górecki Thanks for the review. Oleksii, can I get your opinion as Release Manager about whether this (and the following patch) would be suitable for committing to staging given the current release state? It's a workaround for broken EFI implementations that many downstreams carry on their patch queue. Based on your commit message, I understand this as addressing a bug ( but not very critical as IIUC downstreams have the similar patch on their side ). Therefore, if it has been properly reviewed and tested, we should consider including it in the current release. IIRC at least Qubes, XenServer and XCP-ng have a patch that achieves the same behavior as proposed here. IIUC, setting the wallclock to EFI should align with the behavior Xen had previously. It might be preferable to use that argument as the default for a while, allowing others to verify the "auto" value over time. After that, we could consider making "auto" the default. That said, I am not particularly strict about following this approach. We cannot really set efi as the default, as it would break when booting on legacy BIOS systems. We could take only patch 1 and leave patch 2 after Xen 4.20 has branched, but at that point I would see little benefit in having just patch 1. I don't see a lot of benefit of comitting only the one patch either. I don't have a strong opinion, but downstreams have been complaining about Xen behavior regarding the usage of EFI_GET_TIME, so it might be good to not ship yet another release with such allegedly broken behavior. Agree with that. As I mentioned above I consider it as a bug and based on that several mentioned above downstreams have the similar patch I could consider that as tested approach so .. Let me know what you think, as I would need a formal Release-Ack if this is to be committed. ... R-Acked-by: Oleksii Kurochko. Sorry for the noise. I missed to add that it would also be nice to mention IMO ( that now we have wallclock parameter ) in CHANGELOG so downstreams will receive "notification" that Xen provides a workaround for the mentioned issue in the patch series. Thanks. ~ Oleksii Thanks. ~ Oleksii Thanks, Roger. ~ Oleksii Thanks, Roger.
Re: [PATCH v2 2/2] automation/eclair: make Misra rule 20.7 blocking for x86 also
Hi Roger, On 1/14/25 12:22 PM, Roger Pau Monné wrote: Hello Oleksii, This is in principle ready to go in now (I'm currently running a private Eclair scan to ensure the patch is still OK against current staging). I would like to ask for a release Ack. R-Acked-by: Oleksii Kurochko Thanks. ~ Oleksii Thanks, Roger. On Tue, Nov 26, 2024 at 10:35:08AM +0100, Roger Pau Monne wrote: There are no violations left, make the rule globally blocking for both x86 and ARM. Signed-off-by: Roger Pau Monné Reviewed-by: Andrew Cooper --- automation/eclair_analysis/ECLAIR/tagging.ecl | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/automation/eclair_analysis/ECLAIR/tagging.ecl b/automation/eclair_analysis/ECLAIR/tagging.ecl index 755ea3271fc9..cb4e233e838d 100644 --- a/automation/eclair_analysis/ECLAIR/tagging.ecl +++ b/automation/eclair_analysis/ECLAIR/tagging.ecl @@ -80,6 +80,7 @@ MC3R1.R20.2|| MC3R1.R20.3|| MC3R1.R20.4|| MC3R1.R20.6|| +MC3R1.R20.7|| MC3R1.R20.9|| MC3R1.R20.11|| MC3R1.R20.12|| @@ -116,7 +117,7 @@ if(string_equal(target,"x86_64"), ) if(string_equal(target,"arm64"), - service_selector({"additional_clean_guidelines","MC3R1.R2.1||MC3R1.R5.3||MC3.R11.2||MC3R1.R16.6||MC3R1.R20.7"}) + service_selector({"additional_clean_guidelines","MC3R1.R2.1||MC3R1.R5.3||MC3.R11.2||MC3R1.R16.6"}) ) -reports+={clean:added,"service(clean_guidelines_common||additional_clean_guidelines)"} -- 2.46.0
[PATCH v2] x86/boot: Handle better alignment for 32 bit code
Output file didn't have correct alignment. Allows alignment into data or code up to 2mb. Intermediate object files are kept in order to copy alignment from object produced by the linker and final object (produced by combine_two_binaries.py script). Signed-off-by: Frediano Ziglio --- xen/arch/x86/boot/Makefile| 12 xen/tools/combine_two_binaries.py | 7 ++- 2 files changed, 14 insertions(+), 5 deletions(-) Changes since v1: - Improve comments and description. diff --git a/xen/arch/x86/boot/Makefile b/xen/arch/x86/boot/Makefile index 13d4583173..a56d8a7e0f 100644 --- a/xen/arch/x86/boot/Makefile +++ b/xen/arch/x86/boot/Makefile @@ -40,8 +40,12 @@ LD32 := $(LD) $(subst x86_64,i386,$(LDFLAGS_DIRECT)) # are affected by both text_diff and text_gap. Ensure the sum of gap and diff # is greater than 2^16 so that any 16bit relocations if present in the object # file turns into a build-time error. -text_gap := 0x010200 -text_diff := 0x408020 +# As gap will affect the output section size it should not be huge to avoid the +# creation of huge files. +# The sum of gap and diff will affect the possible alignment so should be a +# multiple of the possible alignment. +text_gap := 0x01c240 +text_diff := 0x7e3dc0 $(obj)/build32.base.lds: AFLAGS-y += -DGAP=$(text_gap) -DTEXT_DIFF=$(text_diff) $(obj)/build32.offset.lds: AFLAGS-y += -DGAP=$(text_gap) -DTEXT_DIFF=$(text_diff) -DAPPLY_OFFSET @@ -69,7 +73,6 @@ $(obj)/built-in-32.%.bin: $(obj)/build32.%.lds $(obj)/built-in-32.tmp.o $(LD32) $(orphan-handling-y) -N -T $< -o $(@:bin=o) $(filter %.o,$^) $(NM) -p --format=bsd $(@:bin=o) > $(@:bin=map) $(OBJCOPY) -j .text -O binary $(@:bin=o) $@ - rm -f $(@:bin=o) quiet_cmd_combine = GEN $@ cmd_combine = \ @@ -80,6 +83,7 @@ cmd_combine = \ --bin1 $(obj)/built-in-32.base.bin \ --bin2 $(obj)/built-in-32.offset.bin \ --map $(obj)/built-in-32.base.map \ + --align $(shell $(OBJDUMP) -h $(obj)/built-in-32.base.o|sed '/text.*2\*\*/ {s/.*2\*\*//;p;}; d') \ --exports cmdline_parse_early,reloc,reloc_trampoline32 \ --output$@ @@ -90,4 +94,4 @@ $(obj)/built-in-32.S: $(obj)/built-in-32.base.bin $(obj)/built-in-32.offset.bin $(srctree)/tools/combine_two_binaries.py FORCE $(call if_changed,combine) -clean-files := built-in-32.*.bin built-in-32.*.map build32.*.lds +clean-files := built-in-32.*.bin built-in-32.*.map built-in-32.*.o build32.*.lds diff --git a/xen/tools/combine_two_binaries.py b/xen/tools/combine_two_binaries.py index 581e57cbc0..8e587c24fb 100755 --- a/xen/tools/combine_two_binaries.py +++ b/xen/tools/combine_two_binaries.py @@ -26,6 +26,10 @@ parser.add_argument('--text-diff', dest='text_diff', required=True, type=auto_int, help='Difference between code section start') +parser.add_argument('--align', dest='align', +default=2, +type=auto_int, +help='Alignment in power of 2') parser.add_argument('--output', dest='output', help='Output file') parser.add_argument('--map', dest='mapfile', @@ -93,7 +97,7 @@ if size1 > size2: file1, file2 = file2, file1 size1, size2 = size2, size1 if size2 != size1 + gap: -raise Exception('File sizes do not match') +raise Exception('File sizes do not match %d != %d + %d' % (size2, size1, gap)) del size2 file1.seek(0, 0) @@ -219,6 +223,7 @@ print('''/* * File autogenerated by combine_two_binaries.py DO NOT EDIT */''', file=out) print('\t' + args.section_header, file=out) +print('\t.p2align\t' + str(args.align), file=out) print('obj32_start:', file=out) output(out) print('\n\t.section .note.GNU-stack,"",@progbits', file=out) -- 2.34.1
Re: [PATCH] xl: properly dispose of vTPM struct instance
On 14/01/2025 8:13 am, Jan Beulich wrote: > The backend_domname field requires separate freeing; make sure to call > libxl_device_vtpm_dispose() also on respective error paths. > > Coverity-ID: 1638719 > Fixes: dde22055ac3a ("libxl: add vtpm support") > Signed-off-by: Jan Beulich Reviewed-by: Andrew Cooper
[PATCH v2 0/3] xen: fix usage of devices behind a VMD bridge
Hello, The following series should fix the usage of devices behind a VMD bridge when running Linux as a Xen PV hardware domain (dom0). I've only been able to test PV. I think PVH should also work but I don't have hardware capable of testing it right now. I don't expect the first two patches to be problematic, the last patch is likely to be more controversial. I've tested it internally and didn't see any issues, but my testing of PV mode is mostly limited to dom0. Thanks, Roger. Roger Pau Monne (3): xen/pci: do not register devices with segments >= 0x1 vmd: disable MSI remapping bypass under Xen pci/msi: remove pci_msi_ignore_mask arch/x86/pci/xen.c | 8 ++-- drivers/pci/controller/vmd.c | 19 ++ drivers/pci/msi/msi.c| 37 drivers/xen/pci.c| 19 ++ include/linux/msi.h | 3 ++- kernel/irq/msi.c | 2 +- 6 files changed, 64 insertions(+), 24 deletions(-) -- 2.46.0
[PATCH v2 2/3] vmd: disable MSI remapping bypass under Xen
MSI remapping bypass (directly configuring MSI entries for devices on the VMD bus) won't work under Xen, as Xen is not aware of devices in such bus, and hence cannot configure the entries using the pIRQ interface in the PV case, and in the PVH case traps won't be setup for MSI entries for such devices. Until Xen is aware of devices in the VMD bus prevent the VMD_FEAT_CAN_BYPASS_MSI_REMAP capability from being used when running as any kind of Xen guest. The MSI remapping bypass is an optional feature of VMD bridges, and hence when running under Xen it will be masked and devices will be forced to redirect its interrupts from the VMD bridge. That mode of operation must always be supported by VMD bridges and works when Xen is not aware of devices behind the VMD bridge. Signed-off-by: Roger Pau Monné --- Changes since v1: - Add xen header. - Expand comment. --- drivers/pci/controller/vmd.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c index 264a180403a0..33c9514bd926 100644 --- a/drivers/pci/controller/vmd.c +++ b/drivers/pci/controller/vmd.c @@ -17,6 +17,8 @@ #include #include +#include + #include #define VMD_CFGBAR 0 @@ -965,6 +967,23 @@ static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id) struct vmd_dev *vmd; int err; + if (xen_domain()) + /* +* Xen doesn't have knowledge about devices in the VMD bus +* because the config space of devices behind the VMD bridge is +* not known to Xen, and hence Xen cannot discover or configure +* them in any way. +* +* Bypass of MSI remapping won't work in that case as direct +* write by Linux to the MSI entries won't result in functional +* interrupts, as it's Xen the entity that manages the host +* interrupt controller and must configure interrupts. +* However multiplexing of interrupts by the VMD bridge will +* work under Xen, so force the usage of that mode which must +* always be supported by VMD bridges. +*/ + features &= ~VMD_FEAT_CAN_BYPASS_MSI_REMAP; + if (resource_size(&dev->resource[VMD_CFGBAR]) < (1 << 20)) return -ENOMEM; -- 2.46.0
[PATCH v2 1/3] xen/pci: do not register devices with segments >= 0x10000
The current hypercall interface for doing PCI device operations always uses a segment field that has a 16 bit width. However on Linux there are buses like VMD that hook up devices into the PCI hierarchy at segment >= 0x1, after the maximum possible segment enumerated in ACPI. Attempting to register or manage those devices with Xen would result in errors at best, or overlaps with existing devices living on the truncated equivalent segment values. Note also that the VMD segment numbers are arbitrarily assigned by the OS, and hence there would need to be some negotiation between Xen and the OS to agree on how to enumerate VMD segments and devices behind them. Skip notifying Xen about those devices. Given how VMD bridges can multiplex interrupts on behalf of devices behind them there's no need for Xen to be aware of such devices for them to be usable by Linux. Signed-off-by: Roger Pau Monné --- Changes since v1: - Adjust commit message width to 75 columns. - Expand commit message. --- drivers/xen/pci.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c index 416f231809cb..08e82fd1263e 100644 --- a/drivers/xen/pci.c +++ b/drivers/xen/pci.c @@ -43,6 +43,13 @@ static int xen_add_device(struct device *dev) pci_mcfg_reserved = true; } #endif + + if (pci_domain_nr(pci_dev->bus) >> 16) { + dev_info(dev, +"not registering with Xen: invalid PCI segment\n"); + return 0; + } + if (pci_seg_supported) { DEFINE_RAW_FLEX(struct physdev_pci_device_add, add, optarr, 1); @@ -149,6 +156,12 @@ static int xen_remove_device(struct device *dev) int r; struct pci_dev *pci_dev = to_pci_dev(dev); + if (pci_domain_nr(pci_dev->bus) >> 16) { + dev_info(dev, +"not unregistering with Xen: invalid PCI segment\n"); + return 0; + } + if (pci_seg_supported) { struct physdev_pci_device device = { .seg = pci_domain_nr(pci_dev->bus), @@ -182,6 +195,12 @@ int xen_reset_device(const struct pci_dev *dev) .flags = PCI_DEVICE_RESET_FLR, }; + if (pci_domain_nr(dev->bus) >> 16) { + dev_info(&dev->dev, +"unable to notify Xen of device reset: invalid PCI segment\n"); + return 0; + } + return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_reset, &device); } EXPORT_SYMBOL_GPL(xen_reset_device); -- 2.46.0
[PATCH v2 3/3] pci/msi: remove pci_msi_ignore_mask
Setting pci_msi_ignore_mask inhibits the toggling of the mask bit for both MSI and MSI-X entries globally, regardless of the IRQ chip they are using. Only Xen sets the pci_msi_ignore_mask when routing physical interrupts over event channels, to prevent PCI code from attempting to toggle the maskbit, as it's Xen that controls the bit. However, the pci_msi_ignore_mask being global will affect devices that use MSI interrupts but are not routing those interrupts over event channels (not using the Xen pIRQ chip). One example is devices behind a VMD PCI bridge. In that scenario the VMD bridge configures MSI(-X) using the normal IRQ chip (the pIRQ one in the Xen case), and devices behind the bridge configure the MSI entries using indexes into the VMD bridge MSI table. The VMD bridge then demultiplexes such interrupts and delivers to the destination device(s). Having pci_msi_ignore_mask set in that scenario prevents (un)masking of MSI entries for devices behind the VMD bridge. Move the signaling of no entry masking into the MSI domain flags, as that allows setting it on a per-domain basis. Set it for the Xen MSI domain that uses the pIRQ chip, while leaving it unset for the rest of the cases. Remove pci_msi_ignore_mask at once, since it was only used by Xen code, and with Xen dropping usage the variable is unneeded. This fixes using devices behind a VMD bridge on Xen PV hardware domains. Albeit Devices behind a VMD bridge are not known to Xen, that doesn't mean Linux cannot use them. By inhibiting the usage of VMD_FEAT_CAN_BYPASS_MSI_REMAP and the removal of the pci_msi_ignore_mask bodge devices behind a VMD bridge do work fine when use from a Linux Xen hardware domain. That's the whole point of the series. Signed-off-by: Roger Pau Monné --- Changes since v1: - Fix build. - Expand commit message. --- arch/x86/pci/xen.c| 8 ++-- drivers/pci/msi/msi.c | 37 + include/linux/msi.h | 3 ++- kernel/irq/msi.c | 2 +- 4 files changed, 26 insertions(+), 24 deletions(-) diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c index 0f2fe524f60d..b8755cde2419 100644 --- a/arch/x86/pci/xen.c +++ b/arch/x86/pci/xen.c @@ -436,7 +436,8 @@ static struct msi_domain_ops xen_pci_msi_domain_ops = { }; static struct msi_domain_info xen_pci_msi_domain_info = { - .flags = MSI_FLAG_PCI_MSIX | MSI_FLAG_FREE_MSI_DESCS | MSI_FLAG_DEV_SYSFS, + .flags = MSI_FLAG_PCI_MSIX | MSI_FLAG_FREE_MSI_DESCS | + MSI_FLAG_DEV_SYSFS | MSI_FLAG_NO_MASK, .ops= &xen_pci_msi_domain_ops, }; @@ -484,11 +485,6 @@ static __init void xen_setup_pci_msi(void) * in allocating the native domain and never use it. */ x86_init.irqs.create_pci_msi_domain = xen_create_pci_msi_domain; - /* -* With XEN PIRQ/Eventchannels in use PCI/MSI[-X] masking is solely -* controlled by the hypervisor. -*/ - pci_msi_ignore_mask = 1; } #else /* CONFIG_PCI_MSI */ diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c index 3a45879d85db..dcbb4f9ac578 100644 --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -10,12 +10,12 @@ #include #include #include +#include #include "../pci.h" #include "msi.h" int pci_msi_enable = 1; -int pci_msi_ignore_mask; /** * pci_msi_supported - check whether MSI may be enabled on a device @@ -285,6 +285,8 @@ static void pci_msi_set_enable(struct pci_dev *dev, int enable) static int msi_setup_msi_desc(struct pci_dev *dev, int nvec, struct irq_affinity_desc *masks) { + const struct irq_domain *d = dev_get_msi_domain(&dev->dev); + const struct msi_domain_info *info = d->host_data; struct msi_desc desc; u16 control; @@ -295,8 +297,7 @@ static int msi_setup_msi_desc(struct pci_dev *dev, int nvec, /* Lies, damned lies, and MSIs */ if (dev->dev_flags & PCI_DEV_FLAGS_HAS_MSI_MASKING) control |= PCI_MSI_FLAGS_MASKBIT; - /* Respect XEN's mask disabling */ - if (pci_msi_ignore_mask) + if (info->flags & MSI_FLAG_NO_MASK) control &= ~PCI_MSI_FLAGS_MASKBIT; desc.nvec_used = nvec; @@ -600,12 +601,15 @@ static void __iomem *msix_map_region(struct pci_dev *dev, */ void msix_prepare_msi_desc(struct pci_dev *dev, struct msi_desc *desc) { + const struct irq_domain *d = dev_get_msi_domain(&dev->dev); + const struct msi_domain_info *info = d->host_data; + desc->nvec_used = 1; desc->pci.msi_attrib.is_msix= 1; desc->pci.msi_attrib.is_64 = 1; desc->pci.msi_attrib.default_irq= dev->irq; desc->pci.mask_base = dev->msix_base; - desc->pci.msi_attrib.can_mask = !pci_msi_ignore_mask && + desc->pci.msi_attrib.can_mask
Re: [XEN PATCH] intel/msr: Fix handling of MSR_RAPL_POWER_UNIT
On Tue, Jan 14, 2025 at 10:32:25AM +0100, Roger Pau Monné wrote: > On Mon, Jan 13, 2025 at 06:42:44PM +, Teddy Astie wrote: > > Solaris 11.4 tries to access this MSR on some Intel platforms without > > properly > > setting up a proper #GP handler, which leads to a immediate crash. > > > > Emulate the access of this MSR by giving it a legal value (all values set to > > default, as defined by Intel SDM "RAPL Interfaces"). > > > > Fixes: 84e848fd7a1 ('x86/hvm: disallow access to unknown MSRs') Nit: I think we usually use 12 hex character hashes, the above one is 11 characters long. > > Hm, Seems like I've sent this too early. I wanted to say I wasn't convinced this is a fix for the above, but I can see how the change can be seen as a regression if Solaris booted before that change in behavior, so I'm fine with leaving the "Fixes:" tag. Thanks, Roger.
Re: [XEN PATCH] intel/msr: Fix handling of MSR_RAPL_POWER_UNIT
On Mon, Jan 13, 2025 at 06:42:44PM +, Teddy Astie wrote: > Solaris 11.4 tries to access this MSR on some Intel platforms without properly > setting up a proper #GP handler, which leads to a immediate crash. > > Emulate the access of this MSR by giving it a legal value (all values set to > default, as defined by Intel SDM "RAPL Interfaces"). > > Fixes: 84e848fd7a1 ('x86/hvm: disallow access to unknown MSRs') Hm, > Signed-off-by: Teddy Astie > --- > Does it have a risk of negatively affecting other operating systems expecting > this MSR read to fail ? > --- > xen/arch/x86/include/asm/msr-index.h | 2 ++ > xen/arch/x86/msr.c | 16 > 2 files changed, 18 insertions(+) > > diff --git a/xen/arch/x86/include/asm/msr-index.h > b/xen/arch/x86/include/asm/msr-index.h > index 9cdb5b2625..2adcdf344f 100644 > --- a/xen/arch/x86/include/asm/msr-index.h > +++ b/xen/arch/x86/include/asm/msr-index.h > @@ -144,6 +144,8 @@ > #define MSR_RTIT_ADDR_A(n) (0x0580 + (n) * 2) > #define MSR_RTIT_ADDR_B(n) (0x0581 + (n) * 2) > > +#define MSR_RAPL_POWER_UNIT 0x0606 > + > #define MSR_U_CET 0x06a0 > #define MSR_S_CET 0x06a2 > #define CET_SHSTK_EN (_AC(1, ULL) << 0) > diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c > index 289cf10b78..b14d42dacf 100644 > --- a/xen/arch/x86/msr.c > +++ b/xen/arch/x86/msr.c > @@ -169,6 +169,22 @@ int guest_rdmsr(struct vcpu *v, uint32_t msr, uint64_t > *val) > if ( likely(!is_cpufreq_controller(d)) || rdmsr_safe(msr, *val) == 0 > ) > break; > goto gp_fault; > + Trailing spaces in the added newline. > +/* > + * Solaris 11.4 DomU tries to use read this MSR without setting up a > + * proper #GP handler leading to a crash. Emulate this MSR by giving > a > + * legal value. > + */ The comment should be after (inside) the case statement IMO (but not strong opinion. Could you also raise a bug with Solaris and put a link to the bug report here, so that we have a reference to it? > +case MSR_RAPL_POWER_UNIT: > +if ( !(cp->x86_vendor & (X86_VENDOR_INTEL | X86_VENDOR_CENTAUR)) ) Has Centaur ever released a CPU with RAPL? > +goto gp_fault; > + > +/* > + * Return a legal register content with all default values defined in > + * Intel Architecture Software Developer Manual 16.10.1 RAPL > Interfaces > + */ > +*val = 0xA1003; The SPR Specification defines the default as 000A0E03h: * SDM: Energy Status Units (bits 12:8): Energy related information (in Joules) is based on the multiplier, 1/2^ESU; where ESU is an unsigned integer represented by bits 12:8. Default value is 1b, indicating energy status unit is in 15.3 micro-Joules increment. * SPR: Energy Units (ENERGY_UNIT): Energy Units used for power control registers. The actual unit value is calculated by 1 J / Power(2,ENERGY_UNIT). The default value of 14 corresponds to Ux.14 number. Note that KVM just returns all 0s [0], so we might consider doing the same, as otherwise that could lead OSes to poke at further RAPL related MSRs if the returned value from MSR_RAPL_POWER_UNIT looks plausible. [0] https://elixir.bootlin.com/linux/v6.12.6/source/arch/x86/kvm/x86.c#L4236 Thanks.
Re: [PATCH v4 0/4] Add/enable stack protector
On 14/01/2025 4:25 am, Volodymyr Babchuk wrote: > Volodymyr Babchuk (4): > common: remove -fno-stack-protector from EMBEDDED_EXTRA_CFLAGS > xen: common: add ability to enable stack protector > xen: arm: enable stack protector feature > CHANGELOG.md: Mention stack-protector feature Reviewed-by: Andrew Cooper There's one minor formatting error which can be fixed on commit. ~Andrew
Re: [PATCH 2/3] vmd: disable MSI remapping bypass under Xen
On Mon, Jan 13, 2025 at 09:53:21AM -0700, Keith Busch wrote: > On Mon, Jan 13, 2025 at 05:45:20PM +0100, Roger Pau Monné wrote: > > On Mon, Jan 13, 2025 at 08:11:19AM -0700, Keith Busch wrote: > > > On Mon, Jan 13, 2025 at 11:03:58AM +0100, Roger Pau Monné wrote: > > > > > > > > Hm, OK, but isn't the limit 80 columns according to the kernel coding > > > > style (Documentation/process/coding-style.rst)? > > > > > > That's the coding style. The commit message style is described in a > > > different doc: > > > > > > > > > https://docs.kernel.org/process/submitting-patches.html#the-canonical-patch-format > > > > It's quite confusing to have different line length for code vs commit > > messages, but my fault for not reading those. Will adjust to 75 > > columns them. > > The output of 'git log' prepends spaces to the subject and body of the > listed commits. The lower limit for commit messages vs. code makes the > change log look readable in an 80-char terminal. Oh, I see, thanks for explaining. The 80 column limit for code mean the resulting diff (and `git log` output) could have a maximum width of 81 characters (because of the prepended +,-, ), but since Linux uses hard tabs for indentation this is not really an issue as it would be if using spaces. Regards, Roger.
Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock
On 1/13/25 6:52 PM, Roger Pau Monné wrote: On Mon, Jan 13, 2025 at 05:07:55PM +0100, Marek Marczykowski-Górecki wrote: On Fri, Sep 13, 2024 at 09:59:06AM +0200, Roger Pau Monne wrote: Allow setting the used wallclock from the command line. When the option is set to a value different than `auto` the probing is bypassed and the selected implementation is used (as long as it's available). The `xen` and `efi` options require being booted as a Xen guest (with Xen guest supported built-in) or from UEFI firmware respectively. Signed-off-by: Roger Pau Monné Reviewed-by: Marek Marczykowski-Górecki Thanks for the review. Oleksii, can I get your opinion as Release Manager about whether this (and the following patch) would be suitable for committing to staging given the current release state? It's a workaround for broken EFI implementations that many downstreams carry on their patch queue. Based on your commit message, I understand this as addressing a bug ( but not very critical as IIUC downstreams have the similar patch on their side ). Therefore, if it has been properly reviewed and tested, we should consider including it in the current release. IIUC, setting the wallclock to EFI should align with the behavior Xen had previously. It might be preferable to use that argument as the default for a while, allowing others to verify the "auto" value over time. After that, we could consider making "auto" the default. That said, I am not particularly strict about following this approach. ~ Oleksii Thanks, Roger.
Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock
On Tue, Jan 14, 2025 at 12:44:39PM +0100, Oleksii Kurochko wrote: > > On 1/14/25 12:40 PM, Oleksii Kurochko wrote: > > > > > > On 1/14/25 12:27 PM, Roger Pau Monné wrote: > > > On Tue, Jan 14, 2025 at 12:12:03PM +0100, Oleksii Kurochko wrote: > > > > On 1/13/25 6:52 PM, Roger Pau Monné wrote: > > > > > On Mon, Jan 13, 2025 at 05:07:55PM +0100, Marek Marczykowski-Górecki > > > > > wrote: > > > > > > On Fri, Sep 13, 2024 at 09:59:06AM +0200, Roger Pau Monne wrote: > > > > > > > Allow setting the used wallclock from the command line. When the > > > > > > > option is set > > > > > > > to a value different than `auto` the probing is bypassed and the > > > > > > > selected > > > > > > > implementation is used (as long as it's available). > > > > > > > > > > > > > > The `xen` and `efi` options require being booted as a Xen guest > > > > > > > (with Xen guest > > > > > > > supported built-in) or from UEFI firmware respectively. > > > > > > > > > > > > > > Signed-off-by: Roger Pau Monné > > > > > > Reviewed-by: Marek > > > > > > Marczykowski-Górecki > > > > > Thanks for the review. > > > > > > > > > > Oleksii, can I get your opinion as Release Manager about whether this > > > > > (and the following patch) would be suitable for committing to staging > > > > > given the current release state? > > > > > > > > > > It's a workaround for broken EFI implementations that many downstreams > > > > > carry on their patch queue. > > > > Based on your commit message, I understand this as addressing a bug ( > > > > but not very critical > > > > as IIUC downstreams have the similar patch on their side ). Therefore, > > > > if it has been properly > > > > reviewed and tested, we should consider including it in the current > > > > release. > > > IIRC at least Qubes, XenServer and XCP-ng have a patch that achieves > > > the same behavior as proposed here. > > > > > > > IIUC, setting the wallclock to EFI should align with the behavior Xen > > > > had previously. > > > > It might be preferable to use that argument as the default for a while, > > > > allowing others to verify the "auto" > > > > value over time. After that, we could consider making "auto" the > > > > default. > > > > That said, I am not particularly strict about following this approach. > > > We cannot really set efi as the default, as it would break when > > > booting on legacy BIOS systems. > > > > > > We could take only patch 1 and leave patch 2 after Xen 4.20 has > > > branched, but at that point I would see little benefit in having just > > > patch 1. > > I don't see a lot of benefit of comitting only the one patch either. > > > > > > > I don't have a strong opinion, but downstreams have been complaining > > > about Xen behavior regarding the usage of EFI_GET_TIME, so it might be > > > good to not ship yet another release with such allegedly broken > > > behavior. > > Agree with that. As I mentioned above I consider it as a bug and based on > > that several mentioned above downstreams have the similar patch I could > > consider > > that as tested approach so .. > > > Let me know what you think, as I would need a formal Release-Ack if > > > this is to be committed. > > ... R-Acked-by: Oleksii Kurochko. > > Sorry for the noise. > > I missed to add that it would also be nice to mention IMO ( that now we have > wallclock parameter ) > in CHANGELOG so downstreams will receive "notification" that Xen provides a > workaround for the mentioned > issue in the patch series. Indeed. Would you be OK with adding the following chunk to patch 2? diff --git a/CHANGELOG.md b/CHANGELOG.md index 8507e6556a56..1de1d1eca17f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -12,6 +12,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) leaving this to the guest kernel to do in guest context. - On x86: - Prefer ACPI reboot over UEFI ResetSystem() run time service call. + - Prefer CMOS over EFI_GET_TIME as time source. - Switched the xAPIC flat driver to use physical destination mode for external interrupts instead of logical destination mode. @@ -24,6 +25,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) - Support for LLC (Last Level Cache) coloring. - On x86: - xl suspend/resume subcommands. + - `wallclock` command line option to select time source. ### Removed - On x86:
Re: [PATCH v4 2/4] xen: common: add ability to enable stack protector
On 14/01/2025 4:25 am, Volodymyr Babchuk wrote: > diff --git a/xen/common/stack-protector.c b/xen/common/stack-protector.c > new file mode 100644 > index 00..8fa9f6147f > --- /dev/null > +++ b/xen/common/stack-protector.c > @@ -0,0 +1,51 @@ > +/* SPDX-License-Identifier: GPL-2.0-only */ > +#include > +#include > +#include > +#include > + > +/* > + * Initial value is chosen by a fair dice roll. > + * It will be updated during boot process. > + */ > +#if BITS_PER_LONG == 32 > +unsigned long __ro_after_init __stack_chk_guard = 0xdd2cc927UL; > +#else > +unsigned long __ro_after_init __stack_chk_guard = 0x2d853605a4d9a09cUL; > +#endif > + > +/* > + * This function should be called from early asm or from a C function > + * that escapes stack canary tracking (by calling > + * reset_stack_and_jump() for example). > + */ > +void __init asmlinkage boot_stack_chk_guard_setup(void) > +{ > +/* > + * Linear congruent generator (X_n+1 = X_n * a + c). > + * > + * Constant is taken from "Tables Of Linear Congruential > + * Generators Of Different Sizes And Good Lattice Structure" by > + * Pierre L’Ecuyer. > + */ > +#if BITS_PER_LONG == 32 > +const unsigned long a = 2891336453UL; > +#else > +const unsigned long a = 2862933555777941757UL; > +#endif > +const unsigned long c = 1; > + > +unsigned long cycles = get_cycles(); > + > +/* Use the initial value if we can't generate random one */ > +if ( !cycles ) > +return; Indentation. Can be fixed on commit. Everything else LGTM. ~Andrew
[PATCH] x86/boot: Handle better alignment for 32 bit code
Output file didn't have correct alignment. Allows alignment into data or code up to 2mb. Signed-off-by: Frediano Ziglio --- xen/arch/x86/boot/Makefile| 8 xen/tools/combine_two_binaries.py | 7 ++- 2 files changed, 10 insertions(+), 5 deletions(-) diff --git a/xen/arch/x86/boot/Makefile b/xen/arch/x86/boot/Makefile index 13d4583173..9a8ecba7aa 100644 --- a/xen/arch/x86/boot/Makefile +++ b/xen/arch/x86/boot/Makefile @@ -40,8 +40,8 @@ LD32 := $(LD) $(subst x86_64,i386,$(LDFLAGS_DIRECT)) # are affected by both text_diff and text_gap. Ensure the sum of gap and diff # is greater than 2^16 so that any 16bit relocations if present in the object # file turns into a build-time error. -text_gap := 0x010200 -text_diff := 0x408020 +text_gap := 0x01c240 +text_diff := 0x7e3dc0 $(obj)/build32.base.lds: AFLAGS-y += -DGAP=$(text_gap) -DTEXT_DIFF=$(text_diff) $(obj)/build32.offset.lds: AFLAGS-y += -DGAP=$(text_gap) -DTEXT_DIFF=$(text_diff) -DAPPLY_OFFSET @@ -69,7 +69,6 @@ $(obj)/built-in-32.%.bin: $(obj)/build32.%.lds $(obj)/built-in-32.tmp.o $(LD32) $(orphan-handling-y) -N -T $< -o $(@:bin=o) $(filter %.o,$^) $(NM) -p --format=bsd $(@:bin=o) > $(@:bin=map) $(OBJCOPY) -j .text -O binary $(@:bin=o) $@ - rm -f $(@:bin=o) quiet_cmd_combine = GEN $@ cmd_combine = \ @@ -80,6 +79,7 @@ cmd_combine = \ --bin1 $(obj)/built-in-32.base.bin \ --bin2 $(obj)/built-in-32.offset.bin \ --map $(obj)/built-in-32.base.map \ + --align $(shell $(OBJDUMP) -h $(obj)/built-in-32.base.o|sed '/text.*2\*\*/ {s/.*2\*\*//;p;}; d') \ --exports cmdline_parse_early,reloc,reloc_trampoline32 \ --output$@ @@ -90,4 +90,4 @@ $(obj)/built-in-32.S: $(obj)/built-in-32.base.bin $(obj)/built-in-32.offset.bin $(srctree)/tools/combine_two_binaries.py FORCE $(call if_changed,combine) -clean-files := built-in-32.*.bin built-in-32.*.map build32.*.lds +clean-files := built-in-32.*.bin built-in-32.*.map built-in-32.*.o build32.*.lds diff --git a/xen/tools/combine_two_binaries.py b/xen/tools/combine_two_binaries.py index 581e57cbc0..8e587c24fb 100755 --- a/xen/tools/combine_two_binaries.py +++ b/xen/tools/combine_two_binaries.py @@ -26,6 +26,10 @@ parser.add_argument('--text-diff', dest='text_diff', required=True, type=auto_int, help='Difference between code section start') +parser.add_argument('--align', dest='align', +default=2, +type=auto_int, +help='Alignment in power of 2') parser.add_argument('--output', dest='output', help='Output file') parser.add_argument('--map', dest='mapfile', @@ -93,7 +97,7 @@ if size1 > size2: file1, file2 = file2, file1 size1, size2 = size2, size1 if size2 != size1 + gap: -raise Exception('File sizes do not match') +raise Exception('File sizes do not match %d != %d + %d' % (size2, size1, gap)) del size2 file1.seek(0, 0) @@ -219,6 +223,7 @@ print('''/* * File autogenerated by combine_two_binaries.py DO NOT EDIT */''', file=out) print('\t' + args.section_header, file=out) +print('\t.p2align\t' + str(args.align), file=out) print('obj32_start:', file=out) output(out) print('\n\t.section .note.GNU-stack,"",@progbits', file=out) -- 2.34.1
Re: [XEN PATCH] intel/msr: Fix handling of MSR_RAPL_POWER_UNIT
On 13/01/2025 6:42 pm, Teddy Astie wrote: > Solaris 11.4 tries Is it only Solaris 11.4, or is the simply the one repro you had? Have you reported a bug? > to access this MSR on some Intel platforms without properly > setting up a proper #GP handler, which leads to a immediate crash. Minor grammar note. Either "without a proper #GP handler" or "without properly setting up a #GP handler", but having two proper(ly)'s in there is less than ideal. > Emulate the access of this MSR by giving it a legal value (all values set to > default, as defined by Intel SDM "RAPL Interfaces"). > > Fixes: 84e848fd7a1 ('x86/hvm: disallow access to unknown MSRs') > Signed-off-by: Teddy Astie > --- > Does it have a risk of negatively affecting other operating systems expecting > this MSR read to fail? It's Complicated. RAPL is a non-architectural feature (on Intel; AMD did it properly). It does not have a CPUID bit to announce the presence of the MSRs. Therefore OSes use a mixture of model numbers and {wr,rd}msr_safe() to probe. I expect this will change the behaviour of Linux. ~Andrew
Re: [PATCH v2 2/2] automation/eclair: make Misra rule 20.7 blocking for x86 also
On Tue, Jan 14, 2025 at 12:24:30PM +0100, Nicola Vetrini wrote: > On 2025-01-14 12:22, Roger Pau Monné wrote: > > Hello Oleksii, > > > > This is in principle ready to go in now (I'm currently running a > > private Eclair scan to ensure the patch is still OK against current > > staging). I would like to ask for a release Ack. > > > > One nit below, which I overlooked initially > > > Thanks, Roger. > > > > On Tue, Nov 26, 2024 at 10:35:08AM +0100, Roger Pau Monne wrote: > > > There are no violations left, make the rule globally blocking for > > > both x86 and > > > ARM. > > > > > > Signed-off-by: Roger Pau Monné > > > Reviewed-by: Andrew Cooper > > > --- > > > automation/eclair_analysis/ECLAIR/tagging.ecl | 3 ++- > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > diff --git a/automation/eclair_analysis/ECLAIR/tagging.ecl > > > b/automation/eclair_analysis/ECLAIR/tagging.ecl > > > index 755ea3271fc9..cb4e233e838d 100644 > > > --- a/automation/eclair_analysis/ECLAIR/tagging.ecl > > > +++ b/automation/eclair_analysis/ECLAIR/tagging.ecl > > > @@ -80,6 +80,7 @@ MC3R1.R20.2|| > > > MC3R1.R20.3|| > > > MC3R1.R20.4|| > > > MC3R1.R20.6|| > > > +MC3R1.R20.7|| > > > MC3R1.R20.9|| > > > MC3R1.R20.11|| > > > MC3R1.R20.12|| > > > @@ -116,7 +117,7 @@ if(string_equal(target,"x86_64"), > > > ) > > this hunk will not apply because it uses MC3R1, rather than MC3R2. Should be > an easy fix. > > > > > > > if(string_equal(target,"arm64"), > > > - > > > service_selector({"additional_clean_guidelines","MC3R1.R2.1||MC3R1.R5.3||MC3.R11.2||MC3R1.R16.6||MC3R1.R20.7"}) > > > + > > > service_selector({"additional_clean_guidelines","MC3R1.R2.1||MC3R1.R5.3||MC3.R11.2||MC3R1.R16.6"}) > > > ) > > here as well Yeah indeed, I had to rebase the patch: https://gitlab.com/xen-project/people/royger/xen/-/commit/538439d59dc338ee3861bf1bc056783671ba1fc2 Let's see if Eclair is happy with it, currently running a pipeline. Thanks, Roger.
Re: [PATCH for 4.21 v1 1/1] xen/riscv: identify specific ISA supported by cpu
On 1/14/25 8:33 AM, Jan Beulich wrote: +RISCV_ISA_EXT_DATA(i, RISCV_ISA_EXT_i), +RISCV_ISA_EXT_DATA(m, RISCV_ISA_EXT_m), +RISCV_ISA_EXT_DATA(a, RISCV_ISA_EXT_a), +RISCV_ISA_EXT_DATA(f, RISCV_ISA_EXT_f), +RISCV_ISA_EXT_DATA(d, RISCV_ISA_EXT_d), +RISCV_ISA_EXT_DATA(q, RISCV_ISA_EXT_q), +RISCV_ISA_EXT_DATA(h, RISCV_ISA_EXT_h), +RISCV_ISA_EXT_DATA(zicntr, RISCV_ISA_EXT_ZICNTR), +RISCV_ISA_EXT_DATA(zicsr, RISCV_ISA_EXT_ZICSR), +RISCV_ISA_EXT_DATA(zifencei, RISCV_ISA_EXT_ZIFENCEI), +RISCV_ISA_EXT_DATA(zihintpause, RISCV_ISA_EXT_ZIHINTPAUSE), +RISCV_ISA_EXT_DATA(zihpm, RISCV_ISA_EXT_ZIHPM), +RISCV_ISA_EXT_DATA(zbb, RISCV_ISA_EXT_ZBB), Isn't it kind of implied that with the presence of Zbb, B should also be present? My interpretation of the RISC-V Bitmanip Extension spec is that the 'B' extension is essentially a collection of the Zba, Zbb, Zbs, and other extensions, but it isn't an extension by itself. The following is mentioned in the spec: The bit-manipulation (bitmanip) extension collection is comprised of several component extensions to the base RISC-V architecture that are intended to provide some combination of code size reduction, performance improvement, and energy reduction. While the instructions are intended to have general use, some instructions are more useful in some domains than others. Hence, several smaller bitmanip extensions are provided, rather than one large extension. Each of these smaller extensions is grouped by common function and use case, and each of which has its own Zb*-extension name. Still the doc has '"B" Extension for Bit Manipulation' as the title of the chapter. And gas accepts B as an extension (e.g. ".option arch, +b"). I think it is fine. B, in this case, just represents Zba, Zbb, Zbc, Zbs and that all of them are supported at the same time. But I see chips that doesn't have B because it doesn't have one of those extensions. +/* + * Workaround for invalid single-letter 's' & 'u' (QEMU). + * No need to set the bit in riscv_isa as 's' & 'u' are + * not valid ISA extensions. It works unless the first + * multi-letter extension in the ISA string begins with + * "Su" and is not prefixed with an underscore. + */ +if ( ext[-1] != '_' && ext[1] == 'u' ) +{ +++isa; +ext_err = true; +break; +} I'm afraid I don't understand this; the comment raises more questions than it answers. Some details could be found here about these QEMU workaround from LK view: https://lore.kernel.org/linux-riscv/ae93358e-e117-b43d-faad-772c529f8...@irq.a4lg.com/#t This leads to the following fix in QEMU: https://patchwork.kernel.org/project/qemu-devel/patch/dee09d708405075420b29115c1e9e87910b8da55.1648270894.git.research_tra...@irq.a4lg.com/#24792587 Considering QEMU's patch, these workaround isn't needed anymore since QEMU 7.1 ( it has been released30 Aug 2022 ) probably we could update the QEMU version on our CI and just drop these changes. Or, at least, update the comment with the links mentioned above and add a message that these changes are needed only for QEMU < 7.1. Am I right that we don't have something like GCC_VERSION in Xen but for QEMU? How could there be? At the time of building Xen we know what compiler version is in use, but we clearly don't know under what qemu versions it might later be run. Agree with that, there is no any sense for having something similar as GCC_VERSIOB but for QEMU. Then I will just update the comment around this workaround with some clarifications. +riscv_isa_parse_string(isa, this_isa); + +if ( bitmap_empty(riscv_isa, RISCV_ISA_EXT_MAX) ) +bitmap_copy(riscv_isa, this_isa, RISCV_ISA_EXT_MAX); +else +bitmap_and(riscv_isa, riscv_isa, this_isa, RISCV_ISA_EXT_MAX); What if the first instance had no extensions at all? You'll then copy what the second instance say, ending up with extensions not supported by one of the CPUs. I think that it's impossible that there is no extensions at all and it should be considered as a bug of provided riscv,isa property. Thereby it should be enough to add BUG_ON(!bitmap_empty(this_isa, RISCV_ISA_EXT_MAX)) before if-condition. Well, you can of course make such an assumption. I don't think though that it's technically impossible to have an extension-less environment. Xen won't be able to run there, though (we'll require H at the very least aiui, and I'm sure we really also require Zicsr). I would like to clarify some things. I think we are counting by word 'extension' different things. I am including to this `Base ISA` ( and likely it is incorrect to do so ( or,at least, confusing ) and I will try not to do that in the future. I am using it in this manner because `Base ISA` is included to the table 74. Standard ISA
Re: [PATCH v3 0/3] Add stack protector
On 12.12.2024 02:17, Andrew Cooper wrote: > On 12/12/2024 12:13 am, Volodymyr Babchuk wrote: >> Hello Jan, >> >> Jan Beulich writes: >> >>> On 11.12.2024 03:04, Volodymyr Babchuk wrote: Both GCC and Clang support -fstack-protector feature, which add stack canaries to functions where stack corruption is possible. This series makes possible to use this feature in Xen. I tested this on ARM64 and it is working as intended. Tested both with GCC and Clang. It is hard to enable this feature on x86, as GCC stores stack canary in %fs:40 by default, but Xen can't use %fs for various reasons. It is possibly to change stack canary location new newer GCC versions, but this will change minimal GCC requirement, which is also hard due to various reasons. So, this series focus mostly on ARM and RISCV. >>> Why exactly would it not be possible to offer the feature when new enough >>> gcc is in use? >> It is possible to use this feature with a modern enough GCC, yes. Are >> you suggesting to make HAS_STACK_PROTECTOR dependent on GCC_VERSION for >> x86 platform? > > (With the knowledge that this is a disputed Kconfig pattern, and will > need rebasing), the way I want this to work is simply: > > diff --git a/xen/Makefile b/xen/Makefile > index 0de0101fd0bf..5d0a88fb3c3f 100644 > --- a/xen/Makefile > +++ b/xen/Makefile > @@ -434,6 +434,9 @@ endif > > ifeq ($(CONFIG_STACK_PROTECTOR),y) > CFLAGS += -fstack-protector > +ifeq ($(CONFIG_X86),y) > +CFLAGS += -mstack-protector-guard=global > +endif > else > CFLAGS += -fno-stack-protector > endif > diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig > index 9cdd04721afa..7951ca908b36 100644 > --- a/xen/arch/x86/Kconfig > +++ b/xen/arch/x86/Kconfig > @@ -28,6 +28,7 @@ config X86 > select HAS_PCI_MSI > select HAS_PIRQ > select HAS_SCHED_GRANULARITY > + select HAS_STACK_PROTECTOR if > $(cc-option,-mstack-protector-guard=global) > select HAS_UBSAN > select HAS_VMAP > select HAS_VPCI if HVM > > > > Sadly, it doesn't build. I get a handful of: > > prelink.o: in function `cmdline_parse': > /home/andrew/xen.git/xen/common/kernel.c:216:(.init.text+0x20f2): failed > to convert GOTPCREL relocation against '__stack_chk_guard'; relink with > --no-relax > /home/andrew/xen.git/xen/common/kernel.c:230:(.init.text+0x246f): failed > to convert GOTPCREL relocation against '__stack_chk_guard'; relink with > --no-relax > > which is more toolchain-whispering than I feel like doing tonight. For reference: https://sourceware.org/pipermail/binutils/2025-January/138631.html You didn't enter a gcc bug report yet, did you? Jan
Re: [XEN PATCH] MAINTAINERS: Change reviewer of the ECLAIR integration
On Tue, Jan 14, 2025 at 03:48:54PM +0100, Nicola Vetrini wrote: > Simone Ballarin is no longer actively involved in reviewing > the ECLAIR integration for Xen. I am stepping up as a reviewer. > > Signed-off-by: Nicola Vetrini Acked-by: Roger Pau Monné Adding Simone to the Cc list, it would be helpful if he can also provide an Ack to signal he is OK with begin removed. Thanks, Roger.
Re: [PATCH v2 02/18] x86/domain: limit window where curr_vcpu != current on context switch
On 09.01.2025 18:33, Roger Pau Monné wrote: > On Thu, Jan 09, 2025 at 09:59:58AM +0100, Jan Beulich wrote: >> On 08.01.2025 15:26, Roger Pau Monne wrote: >>> @@ -2048,8 +2060,6 @@ static void __context_switch(void) >>> if ( pd != nd ) >>> cpumask_clear_cpu(cpu, pd->dirty_cpumask); >>> write_atomic(&p->dirty_cpu, VCPU_CPU_CLEAN); >>> - >>> -per_cpu(curr_vcpu, cpu) = n; >>> } >>> >>> void context_switch(struct vcpu *prev, struct vcpu *next) >>> @@ -2081,16 +2091,36 @@ void context_switch(struct vcpu *prev, struct vcpu >>> *next) >>> >>> local_irq_disable(); >>> >>> -set_current(next); >>> - >>> if ( (per_cpu(curr_vcpu, cpu) == next) || >>> (is_idle_domain(nextd) && cpu_online(cpu)) ) >>> { >>> +/* >>> + * Lazy context switch to the idle vCPU, set current == idle. Full >>> + * context switch happens if/when sync_local_execstate() is called. >>> + */ >>> +set_current(next); >>> local_irq_enable(); >> >> The comment is misleading as far as the first half of the if() condition >> goes: >> No further switching is going to happen in that case, aiui. > > Right, I should clarify that comment: this is either a lazy context > switch, or the return from a lazy state to the previously running > vCPU. > >>> } >>> else >>> { >>> -__context_switch(); >>> +/* >>> + * curr_vcpu will always point to the currently loaded vCPU >>> context, as >>> + * it's not updated when doing a lazy switch to the idle vCPU. >>> + */ >>> +struct vcpu *prev_ctx = per_cpu(curr_vcpu, cpu); >>> + >>> +if ( prev_ctx != current ) >>> +{ >>> +/* >>> + * Doing a full context switch to a non-idle vCPU from a lazy >>> + * context switched state. Adjust current to point to the >>> + * currently loaded vCPU context. >>> + */ >>> +ASSERT(current == idle_vcpu[cpu]); >>> +ASSERT(!is_idle_vcpu(next)); >>> +set_current(prev_ctx); >> >> This feels wrong, as in "current" then not representing what it should >> represent, >> for a certain time window. I may be dense, but neither comment not >> description >> clarify to me why this might be needed. I can see that it's needed to please >> the >> ASSERT() you add to __context_switch(), yet then I might ask why that >> assertion >> is put there. > > This is done so that when calling __context_switch() current == > curr_vcpu, and map_domain_page() can be used without getting into an > infinite sync_local_execstate() recursion loop. Yet it's the purpose of __context_switch() to bring curr_vcpu in sync with current. IOW both matching up is supposed to be an exit condition of the function, not an entry one. Plus, as indicated when we were talking this through yesterday, the set_current() here make "current" no longer point at what - from the scheduler's perspective - is (supposed to be) the current vCPU. Aiui this adjustment is the reason for ... >>> --- a/xen/arch/x86/traps.c >>> +++ b/xen/arch/x86/traps.c >>> @@ -2232,8 +2232,6 @@ void __init trap_init(void) >>> >>> void activate_debugregs(const struct vcpu *curr) >>> { >>> -ASSERT(curr == current); >>> - >>> write_debugreg(0, curr->arch.dr[0]); >>> write_debugreg(1, curr->arch.dr[1]); >>> write_debugreg(2, curr->arch.dr[2]); >> >> Why would this assertion go away? If it suddenly triggers, the parameter name >> would now end up being wrong. > > Well, at the point where activate_debugregs() gets called (in > paravirt_ctxt_switch_to()), current == previous as a result of this > change, so the assert is no longer true on purpose on that call > path. ... this behavior. Which, as said, feels wrong the latest when "curr" was renamed to no longer suggest it actually is cached "current". At that point it'll be dubious whose ->arch.dr[] are actually written into the CPU registers. Also let's not forget that there's a 2nd call here, where I very much hope it continues to be "current" that's being passed in. Jan
Re: [PATCH v2.1 06/18] x86/pv: set/clear guest GDT mappings using {populate,destroy}_perdomain_mapping()
On 08.01.2025 16:11, Roger Pau Monne wrote: > The pv_{set,destroy}_gdt() functions rely on the L1 table(s) that contain such > mappings being stashed in the domain structure, and thus such mappings being > modified by merely updating the L1 entries. > > Switch both pv_{set,destroy}_gdt() to instead use > {populate,destory}_perdomain_mapping(). Like for an earlier patch it doesn't really become clear why what is being done wants / needs doing. I might guess that it's the "stashed in the domain structure" that you ultimately want to get rid of? > --- a/xen/arch/x86/pv/descriptor-tables.c > +++ b/xen/arch/x86/pv/descriptor-tables.c > @@ -49,23 +49,20 @@ bool pv_destroy_ldt(struct vcpu *v) > > void pv_destroy_gdt(struct vcpu *v) > { > -l1_pgentry_t *pl1e = pv_gdt_ptes(v); > -mfn_t zero_mfn = _mfn(virt_to_mfn(zero_page)); > -l1_pgentry_t zero_l1e = l1e_from_mfn(zero_mfn, __PAGE_HYPERVISOR_RO); > unsigned int i; > > ASSERT(v == current || !vcpu_cpu_dirty(v)); > > -v->arch.pv.gdt_ents = 0; How can this validly go away? > -for ( i = 0; i < FIRST_RESERVED_GDT_PAGE; i++ ) > -{ > -mfn_t mfn = l1e_get_mfn(pl1e[i]); > +if ( v->arch.cr3 ) > +destroy_perdomain_mapping(v, GDT_VIRT_START(v), > + ARRAY_SIZE(v->arch.pv.gdt_frames)); How is v->arch.cr3 being non-zero related to the GDT area needing destroying? > -if ( (l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) && > - !mfn_eq(mfn, zero_mfn) ) > -put_page_and_type(mfn_to_page(mfn)); > +for ( i = 0; i < ARRAY_SIZE(v->arch.pv.gdt_frames); i++) > +{ > +if ( !v->arch.pv.gdt_frames[i] ) > +break; > > -l1e_write(&pl1e[i], zero_l1e); > +put_page_and_type(mfn_to_page(_mfn(v->arch.pv.gdt_frames[i]))); > v->arch.pv.gdt_frames[i] = 0; > } > } > @@ -74,8 +71,8 @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[], > unsigned int entries) > { > struct domain *d = v->domain; > -l1_pgentry_t *pl1e; > unsigned int i, nr_frames = DIV_ROUND_UP(entries, 512); > +mfn_t mfns[ARRAY_SIZE(v->arch.pv.gdt_frames)]; Having this array is kind of odd - it'll hold all the same values as frames[], just under a different type. Considering the further copying done ... > @@ -90,6 +87,8 @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[], > if ( !mfn_valid(mfn) || > !get_page_and_type(mfn_to_page(mfn), d, PGT_seg_desc_page) ) > goto fail; > + > +mfns[i] = mfn; > } > > /* Tear down the old GDT. */ > @@ -97,12 +96,9 @@ int pv_set_gdt(struct vcpu *v, const unsigned long > frames[], > > /* Install the new GDT. */ > v->arch.pv.gdt_ents = entries; > -pl1e = pv_gdt_ptes(v); > for ( i = 0; i < nr_frames; i++ ) > -{ > v->arch.pv.gdt_frames[i] = frames[i]; ... here, would it perhaps be an option to change ->arch.pv.gdt_frames[] to mfn_t[], thus allowing ... > -l1e_write(&pl1e[i], l1e_from_pfn(frames[i], __PAGE_HYPERVISOR_RW)); > -} > +populate_perdomain_mapping(v, GDT_VIRT_START(v), mfns, nr_frames); ... that array to be passed into here? Jan
Re: [PATCH v2 00/18] x86: adventures in Address Space Isolation
On 08.01.2025 15:26, Roger Pau Monne wrote: > Hello, > > The aim of this series is to introduce the functionality required to > create linear mappings visible to a single pCPU. > > Doing so requires having a per-vCPU root page-table (L4), and hence > requires shadowing the guest selected L4 on PV guests. As follow ups > (and partially to ensure the per-CPU mappings work fine) the CPU stacks > are switched to use per-CPU mappings, so that remote stack contents are > not by default mapped on all page-tables (note: for this to be true the > directmap entries for the stack pages would need to be removed also). > > There's one known shortcoming with the presented code: migration of PV > guests using per-vCPU root page-tables is not working. I need to > introduce extra logic to deal with PV shadow mode when using unique root > page-tables. I don't think this should block the series however, such > missing functionality can always be added as follow up work. > paging_domctl() is adjusted to reflect this restriction. > > The main differences compared to v1 are the usage of per-vCPU root page > tables (as opposed to per-pCPU), and the usage of the existing perdomain > family of functions to manage the mappings in the per-domain slot, that > now becomes per-vCPU. > > All patches until 17 are mostly preparatory, I think there's a nice > cleanup and generalization of the creation and managing of per-domain > mappings, by no longer storing references to L1 page-tables in the vCPU > or domain struct. Since you referred me to the cover letter, I've looked back here after making some more progress with the series. Along with my earlier comment towards the need or ultimate goal, ... > Patch 13 introduces the command line option, and would need discussion > and integration with the sparse direct map series. IMO we should get > consensus on how we want the command line to look ASAP, so that we can > basic parsing logic in place to be used by both the work here and the > direct map removal series. > > As part of this series the map_domain_page() helpers are also switched > to create per-vCPU mappings (see patch 15), which converts an existing > interface into creating per-vCPU mappings. Such interface can be used > to hide (map per-vCPU) further data that we don't want to be part of the > direct map, or even shared between vCPUs of the same domain. Also all > existing users of the interface will already create per-vCPU mappings > without needing additional changes. > > Note that none of the logic introduced in the series removes entries for > the directmap, so even when creating the per-CPU mappings the underlying > physical addresses are fully accessible when using it's direct map > entries. > > I also haven't done any benchmarking. Doesn't seem to cripple > performance up to the point that XenRT jobs would timeout before > finishing, that the only objective reference I can provide at the > moment. > > The series has been extensively tested on XenRT, but that doesn't cover > all possible use-cases, so it's likely to still have some rough edges, > handle with care. > > Thanks, Roger. > > Roger Pau Monne (18): > x86/mm: purge unneeded destroy_perdomain_mapping() > x86/domain: limit window where curr_vcpu != current on context switch > x86/mm: introduce helper to detect per-domain L1 entries that need > freeing > x86/pv: introduce function to populate perdomain area and use it to > map Xen GDT > x86/mm: switch destroy_perdomain_mapping() parameter from domain to > vCPU > x86/pv: set/clear guest GDT mappings using > {populate,destroy}_perdomain_mapping() > x86/pv: update guest LDT mappings using the linear entries > x86/pv: remove stashing of GDT/LDT L1 page-tables > x86/mm: simplify create_perdomain_mapping() interface > x86/mm: switch {create,destroy}_perdomain_mapping() domain parameter > to vCPU > x86/pv: untie issuing FLUSH_ROOT_PGTBL from XPTI > x86/mm: move FLUSH_ROOT_PGTBL handling before TLB flush > x86/spec-ctrl: introduce Address Space Isolation command line option > x86/mm: introduce per-vCPU L3 page-table > x86/mm: introduce a per-vCPU mapcache when using ASI > x86/pv: allow using a unique per-pCPU root page table (L4) > x86/mm: switch to a per-CPU mapped stack when using ASI > x86/mm: zero stack on context switch > > docs/misc/xen-command-line.pandoc| 24 +++ > xen/arch/x86/cpu/mcheck/mce.c| 4 + > xen/arch/x86/domain.c| 157 +++ > xen/arch/x86/domain_page.c | 105 ++ > xen/arch/x86/flushtlb.c | 28 ++- > xen/arch/x86/hvm/hvm.c | 6 - > xen/arch/x86/include/asm/config.h| 16 +- > xen/arch/x86/include/asm/current.h | 58 +- > xen/arch/x86/include/asm/desc.h | 6 +- > xen/arch/x86/include/asm/domain.h| 50 +++-- > xen/arch/x86/include/asm/flushtlb.h | 2 +- > xen/arch/x86/include/asm/mm.h| 15 +- > xen/arch/x86/incl
[PATCH] riscv: Add initial Xen guest support for RISC-V
From: Slavisa Petrovic This patch introduces initial support for running RISC-V as a Xen guest. It provides the necessary infrastructure and stubs for Xen-specific operations. Key changes include: - Modifications to the RISC-V kernel to integrate Xen-specific hypercalls and interfaces, with function implementations stubbed for future work. - Introduction of Xen-specific headers for RISC-V, including event handling, hypervisor interaction, and page management. Functions are defined but not yet implemented. - Stub implementations for memory management, grant tables, and context switching in the Xen environment, allowing further development and integration. Signed-off-by: Milan Djokic Signed-off-by: Slavisa Petrovic --- arch/riscv/Kbuild| 1 + arch/riscv/Kconfig | 19 +++ arch/riscv/include/asm/cpu.h | 1 + arch/riscv/include/asm/hypervisor.h | 9 ++ arch/riscv/include/asm/irq.h | 5 + arch/riscv/include/asm/sync_bitops.h | 27 arch/riscv/include/asm/xen/events.h | 28 arch/riscv/include/asm/xen/hypercall.h | 2 + arch/riscv/include/asm/xen/hypervisor.h | 2 + arch/riscv/include/asm/xen/interface.h | 2 + arch/riscv/include/asm/xen/page.h| 3 + arch/riscv/include/asm/xen/swiotlb-xen.h | 2 + arch/riscv/xen/Makefile | 2 + arch/riscv/xen/enlighten.c | 164 +++ arch/riscv/xen/grant-table.c | 57 arch/riscv/xen/hypercall.S | 71 ++ arch/riscv/xen/p2m.c | 76 +++ include/xen/interface/io/protocols.h | 3 + include/xen/riscv/hypercall.h| 71 ++ include/xen/riscv/hypervisor.h | 26 include/xen/riscv/interface.h| 85 include/xen/riscv/page.h | 106 +++ include/xen/riscv/swiotlb-xen.h | 13 ++ test.txt | 21 +++ 24 files changed, 796 insertions(+) create mode 100644 arch/riscv/include/asm/hypervisor.h create mode 100644 arch/riscv/include/asm/sync_bitops.h create mode 100644 arch/riscv/include/asm/xen/events.h create mode 100644 arch/riscv/include/asm/xen/hypercall.h create mode 100644 arch/riscv/include/asm/xen/hypervisor.h create mode 100644 arch/riscv/include/asm/xen/interface.h create mode 100644 arch/riscv/include/asm/xen/page.h create mode 100644 arch/riscv/include/asm/xen/swiotlb-xen.h create mode 100644 arch/riscv/xen/Makefile create mode 100644 arch/riscv/xen/enlighten.c create mode 100644 arch/riscv/xen/grant-table.c create mode 100644 arch/riscv/xen/hypercall.S create mode 100644 arch/riscv/xen/p2m.c create mode 100644 include/xen/riscv/hypercall.h create mode 100644 include/xen/riscv/hypervisor.h create mode 100644 include/xen/riscv/interface.h create mode 100644 include/xen/riscv/page.h create mode 100644 include/xen/riscv/swiotlb-xen.h create mode 100644 test.txt diff --git a/arch/riscv/Kbuild b/arch/riscv/Kbuild index 2c585f7a0b6e..d9b71deed2cd 100644 --- a/arch/riscv/Kbuild +++ b/arch/riscv/Kbuild @@ -5,6 +5,7 @@ obj-$(CONFIG_BUILTIN_DTB) += boot/dts/ obj-$(CONFIG_CRYPTO) += crypto/ obj-y += errata/ obj-$(CONFIG_KVM) += kvm/ +obj-$(CONFIG_XEN) += xen/ obj-$(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY) += purgatory/ diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index d4a7ca0388c0..13ea75221524 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -1071,6 +1071,25 @@ config PARAVIRT_TIME_ACCOUNTING If in doubt, say N here. +config XEN_DOM0 + def_bool y + depends on XEN + +config XEN + bool "Xen guest support on RISCV" + depends on RISCV && OF + select PARAVIRT + help + Enables support for running Linux as a Xen guest on RISC-V. + + Xen is a type-1 hypervisor that allows multiple operating systems + to run on the same hardware. Enabling this option allows the kernel + to function as a guest under the Xen hypervisor on RISC-V platforms. + + Say Y if you want to run this kernel as a guest under Xen on RISC-V. + + If unsure, say N. + config RELOCATABLE bool "Build a relocatable kernel" depends on MMU && 64BIT && !XIP_KERNEL diff --git a/arch/riscv/include/asm/cpu.h b/arch/riscv/include/asm/cpu.h index 28d45a6678ce..fb2aac6a068e 100644 --- a/arch/riscv/include/asm/cpu.h +++ b/arch/riscv/include/asm/cpu.h @@ -4,5 +4,6 @@ #define _ASM_CPU_H /* This header is required unconditionally by the ACPI core */ +#include #endif /* _ASM_CPU_H */ diff --git a/arch/riscv/include/asm/hypervisor.h b/arch/riscv/include/asm/hypervisor.h new file mode 100644 index ..3a117afe57f0 --- /dev/null +++ b/arch/riscv/include/asm/hypervisor.h @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_RISCV_HYPERVISOR_H +#define _ASM_RISCV_HYPERVISOR_H + +#incl
Re: [RFC PATCH v2 01/10] x86: Add architectural LBR definitions
On 02.01.2025 09:45, Tu Dinh wrote: > --- a/xen/arch/x86/include/asm/msr-index.h > +++ b/xen/arch/x86/include/asm/msr-index.h > @@ -112,6 +112,8 @@ > #define MCU_OPT_CTRL_GDS_MIT_DIS (_AC(1, ULL) << 4) > #define MCU_OPT_CTRL_GDS_MIT_LOCK (_AC(1, ULL) << 5) > > +#define MSR_LER_INFO0x01e0 > + > #define MSR_RTIT_OUTPUT_BASE0x0560 > #define MSR_RTIT_OUTPUT_MASK0x0561 > #define MSR_RTIT_CTL0x0570 > @@ -193,6 +195,16 @@ > #define MSR_UARCH_MISC_CTRL 0x1b01 > #define UARCH_CTRL_DOITM (_AC(1, ULL) << 0) > > +/* Architectural LBR state MSRs */ > +#define MSR_LBR_INFO(n) (0x1200 + (n)) > +#define MSR_LBR_CTL 0x14ce > +#define LBR_CTL_VALID _AC(0x7f000f, ULL) While I can see that such a value may be useful at some point, I think it wants introducing when needed and composing of definitions for the individual bits. Jan
Re: [PATCH v2 10/18] x86/mm: switch {create,destroy}_perdomain_mapping() domain parameter to vCPU
On 08.01.2025 15:26, Roger Pau Monne wrote: > In preparation for the per-domain area being per-vCPU. This requires moving > some of the {create,destroy}_perdomain_mapping() calls to the domain > initialization and tear down paths into vCPU initialization and tear down. Am I confused or DYM "s/ to / from /"? Jan
Re: [RFC PATCH v2 03/10] tools: Add arch LBR feature bits
On 02.01.2025 09:45, Tu Dinh wrote: > Signed-off-by: Tu Dinh > --- > tools/libs/light/libxl_cpuid.c | 3 +++ > tools/misc/xen-cpuid.c | 3 +++ > 2 files changed, 6 insertions(+) > > diff --git a/tools/libs/light/libxl_cpuid.c b/tools/libs/light/libxl_cpuid.c > index 063fe86eb7..05be36f258 100644 > --- a/tools/libs/light/libxl_cpuid.c > +++ b/tools/libs/light/libxl_cpuid.c > @@ -342,6 +342,9 @@ int libxl_cpuid_parse_config(libxl_cpuid_policy_list > *policy, const char* str) > CPUID_ENTRY(0x0007, 1, CPUID_REG_EDX), > MSR_ENTRY(0x10a, CPUID_REG_EAX), > MSR_ENTRY(0x10a, CPUID_REG_EDX), > +CPUID_ENTRY(0x001C, NA, CPUID_REG_EAX), > +CPUID_ENTRY(0x001C, NA, CPUID_REG_EBX), > +CPUID_ENTRY(0x001C, NA, CPUID_REG_ECX), > #undef MSR_ENTRY > #undef CPUID_ENTRY > }; > diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c > index 4c4593528d..4f0fb0a6ea 100644 > --- a/tools/misc/xen-cpuid.c > +++ b/tools/misc/xen-cpuid.c > @@ -37,6 +37,9 @@ static const struct { > { "CPUID 0x0007:1.edx", "7d1" }, > { "MSR_ARCH_CAPS.lo", "m10Al" }, > { "MSR_ARCH_CAPS.hi", "m10Ah" }, > +{ "CPUID 0x001c.eax", "1Ca" }, > +{ "CPUID 0x001c.ebx", "1Cb" }, > +{ "CPUID 0x001c.ecx", "1Cc" }, > }; > > #define COL_ALIGN "24" I think this needs folding into patch 2. These tables want to stay in sync at all times with what the public interface has. Jan
Re: [RFC PATCH v2 02/10] x86: Define arch LBR feature bits
On 02.01.2025 09:45, Tu Dinh wrote: > --- a/xen/arch/x86/include/asm/cpufeature.h > +++ b/xen/arch/x86/include/asm/cpufeature.h > @@ -219,6 +219,11 @@ static inline bool boot_cpu_has(unsigned int feat) > #define cpu_has_rfds_no boot_cpu_has(X86_FEATURE_RFDS_NO) > #define cpu_has_rfds_clear boot_cpu_has(X86_FEATURE_RFDS_CLEAR) > > +/* CPUID level 0x001c.eax */ > + > +#define current_cpu_has_lbr_lip cpu_has(¤t_cpu_data, \ > +X86_FEATURE_LBR_LIP); Why would this, unlike all other similar constructs, need to use current_cpu_data rather than boot_cpu_has()? If there is a reason, it almost certainly wants naming in the description. > --- a/xen/include/public/arch-x86/cpufeatureset.h > +++ b/xen/include/public/arch-x86/cpufeatureset.h > @@ -284,7 +284,7 @@ XEN_CPUFEATURE(SERIALIZE, 9*32+14) /*A SERIALIZE > insn */ > XEN_CPUFEATURE(HYBRID,9*32+15) /* Heterogeneous platform */ > XEN_CPUFEATURE(TSXLDTRK, 9*32+16) /*a TSX load tracking suspend/resume > insns */ > XEN_CPUFEATURE(PCONFIG, 9*32+18) /* PCONFIG instruction */ > -XEN_CPUFEATURE(ARCH_LBR, 9*32+19) /* Architectural Last Branch Record > */ > +XEN_CPUFEATURE(ARCH_LBR, 9*32+19) /*s Architectural Last Branch Record > */ The 's' here (and below) may only be added once all of the respective handling is complete. > --- a/xen/include/xen/lib/x86/cpu-policy.h > +++ b/xen/include/xen/lib/x86/cpu-policy.h > @@ -22,6 +22,9 @@ > #define FEATURESET_7d1 15 /* 0x0007:1.edx*/ > #define FEATURESET_m10Al 16 /* 0x010a.eax */ > #define FEATURESET_m10Ah 17 /* 0x010a.edx */ > +#define FEATURESET_1Ca 18 /* 0x001c.eax */ > +#define FEATURESET_1Cb 19 /* 0x001c.ebx */ > +#define FEATURESET_1Cc 20 /* 0x001c.ecx */ > > struct cpuid_leaf > { > @@ -85,7 +88,7 @@ unsigned int x86_cpuid_lookup_vendor(uint32_t ebx, uint32_t > ecx, uint32_t edx); > */ > const char *x86_cpuid_vendor_to_str(unsigned int vendor); > > -#define CPUID_GUEST_NR_BASIC (0xdu + 1) > +#define CPUID_GUEST_NR_BASIC (0x1cu + 1) Are you sure this can be done with no other prep work? I've been sitting on AMX and AVX10 patches where I need to bump this, too. Yet I continue to think that something along the lines of the 3-patch series at [1] is necessary up front. > @@ -158,6 +161,52 @@ struct cpu_policy > uint64_t :64, :64; /* Leaf 0xb - Topology. */ > uint64_t :64, :64; /* Leaf 0xc - rsvd */ > uint64_t :64, :64; /* Leaf 0xd - XSTATE. */ > + > +uint64_t :64, :64; /* Leaf 0xe - rsvd */ > +uint64_t :64, :64; /* Leaf 0xf - rsvd */ > +uint64_t :64, :64; /* Leaf 0x10 - rsvd */ > +uint64_t :64, :64; /* Leaf 0x11 - rsvd */ > +uint64_t :64, :64; /* Leaf 0x12 - rsvd */ > +uint64_t :64, :64; /* Leaf 0x13 - rsvd */ > +uint64_t :64, :64; /* Leaf 0x14 - rsvd */ > +uint64_t :64, :64; /* Leaf 0x15 - rsvd */ > +uint64_t :64, :64; /* Leaf 0x16 - rsvd */ > +uint64_t :64, :64; /* Leaf 0x17 - rsvd */ > +uint64_t :64, :64; /* Leaf 0x18 - rsvd */ > +uint64_t :64, :64; /* Leaf 0x19 - rsvd */ > +uint64_t :64, :64; /* Leaf 0x1a - rsvd */ > +uint64_t :64, :64; /* Leaf 0x1b - rsvd */ > + > +union { > +uint32_t _1Ca; > +struct { > +uint32_t supported_depths:8; According to XEN_CPUFEATURE(LBR_DEPTH_...) further up these are 8 individual bits. Further, is there a reason you don't use here what the additions there produce in the generated header, but you rather re-define the fields from scratch? Jan [1] https://lists.xen.org/archives/html/xen-devel/2024-08/msg00591.html
Re: [RFC PATCH v2 04/10] x86: Calculate arch LBR CPUID policies
On 02.01.2025 09:45, Tu Dinh wrote: > --- a/xen/arch/x86/cpu-policy.c > +++ b/xen/arch/x86/cpu-policy.c > @@ -190,6 +190,16 @@ static void sanitise_featureset(uint32_t *fs) > } > } > > +static void recalculate_arch_lbr(struct cpu_policy *p) > +{ > +if ( p->basic.max_leaf < 0x1c || > + !(cpu_policy_xstates(&host_cpu_policy) & X86_XSS_LBR) || A reference to the host policy looks questionable here. If you ... > + p->basic.lbr_1Ca.supported_depths == 0) > +p->feat.arch_lbr = 0; > +if ( !p->feat.arch_lbr ) > +p->basic.raw[0x1c] = EMPTY_LEAF; > +} > + > static void recalculate_xstate(struct cpu_policy *p) > { > uint64_t xstates = XSTATE_FP_SSE; > @@ -219,6 +229,9 @@ static void recalculate_xstate(struct cpu_policy *p) > if ( p->feat.amx_tile ) > xstates |= X86_XCR0_TILE_CFG | X86_XCR0_TILE_DATA; > > +if ( p->feat.arch_lbr ) > +xstates |= X86_XSS_LBR; > + > /* Subleaf 0 */ > p->xstate.max_size = > xstate_uncompressed_size(xstates & ~XSTATE_XSAVES_ONLY); > @@ -271,6 +284,8 @@ static void recalculate_misc(struct cpu_policy *p) > > p->basic.raw[0xc] = EMPTY_LEAF; > > +zero_leaves(p->basic.raw, 0xe, 0x1b); > + > p->extd.e1d &= ~CPUID_COMMON_1D_FEATURES; > > /* Most of Power/RAS hidden from guests. */ > @@ -630,6 +645,7 @@ static void __init calculate_pv_max_policy(void) > > sanitise_featureset(fs); > x86_cpu_featureset_to_policy(fs, p); > +recalculate_arch_lbr(p); > recalculate_xstate(p); ... flipped the order of these calls, couldn't you use the respective bit from the correct policy? Then again you need p->feat.arch_lbr to actually set X86_XSS_LBR. Still the aspect in recalculate_arch_lbr() wants sorting somehow. Jan
Re: [RFC PATCH v2 05/10] x86: Keep a copy of XSAVE area size
On 02.01.2025 09:45, Tu Dinh wrote: > Signed-off-by: Tu Dinh This needs a non-empty description to clarify why this would be needed. Jan > --- a/xen/arch/x86/include/asm/domain.h > +++ b/xen/arch/x86/include/asm/domain.h > @@ -638,6 +638,7 @@ struct arch_vcpu > * #NM handler, we XRSTOR the states we XSAVE-ed; > */ > struct xsave_struct *xsave_area; > +unsigned int xsave_area_size; > uint64_t xcr0; > /* Accumulated eXtended features mask for using XSAVE/XRESTORE by Xen > * itself, as we can never know whether guest OS depends on content > diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c > index af9e345a7a..baae8e1a13 100644 > --- a/xen/arch/x86/xstate.c > +++ b/xen/arch/x86/xstate.c > @@ -550,6 +550,7 @@ int xstate_alloc_save_area(struct vcpu *v) > save_area->fpu_sse.mxcsr = MXCSR_DEFAULT; > > v->arch.xsave_area = save_area; > +v->arch.xsave_area_size = size; > v->arch.xcr0 = 0; > v->arch.xcr0_accum = 0; >
Re: [PATCH] riscv: Add initial Xen guest support for RISC-V
On Tue, Jan 14, 2025 at 9:41 PM Milan Djokic wrote: > > From: Slavisa Petrovic > > This patch introduces initial support for running RISC-V as a Xen guest. > It provides the necessary infrastructure and stubs for Xen-specific > operations. Key changes include: > > - Modifications to the RISC-V kernel to integrate Xen-specific hypercalls > and interfaces, with function implementations stubbed for future work. > - Introduction of Xen-specific headers for RISC-V, including event > handling, hypervisor interaction, and page management. Functions are > defined but not yet implemented. > - Stub implementations for memory management, grant tables, and context > switching in the Xen environment, allowing further development and > integration. > > Signed-off-by: Milan Djokic > Signed-off-by: Slavisa Petrovic A single patch with many changes is hard to review. Please convert this patch into a series with smaller patches. Also, include a cover letter in the series explaining how to test Xen on RISC-V. Regards, Anup > --- > arch/riscv/Kbuild| 1 + > arch/riscv/Kconfig | 19 +++ > arch/riscv/include/asm/cpu.h | 1 + > arch/riscv/include/asm/hypervisor.h | 9 ++ > arch/riscv/include/asm/irq.h | 5 + > arch/riscv/include/asm/sync_bitops.h | 27 > arch/riscv/include/asm/xen/events.h | 28 > arch/riscv/include/asm/xen/hypercall.h | 2 + > arch/riscv/include/asm/xen/hypervisor.h | 2 + > arch/riscv/include/asm/xen/interface.h | 2 + > arch/riscv/include/asm/xen/page.h| 3 + > arch/riscv/include/asm/xen/swiotlb-xen.h | 2 + > arch/riscv/xen/Makefile | 2 + > arch/riscv/xen/enlighten.c | 164 +++ > arch/riscv/xen/grant-table.c | 57 > arch/riscv/xen/hypercall.S | 71 ++ > arch/riscv/xen/p2m.c | 76 +++ > include/xen/interface/io/protocols.h | 3 + > include/xen/riscv/hypercall.h| 71 ++ > include/xen/riscv/hypervisor.h | 26 > include/xen/riscv/interface.h| 85 > include/xen/riscv/page.h | 106 +++ > include/xen/riscv/swiotlb-xen.h | 13 ++ > test.txt | 21 +++ > 24 files changed, 796 insertions(+) > create mode 100644 arch/riscv/include/asm/hypervisor.h > create mode 100644 arch/riscv/include/asm/sync_bitops.h > create mode 100644 arch/riscv/include/asm/xen/events.h > create mode 100644 arch/riscv/include/asm/xen/hypercall.h > create mode 100644 arch/riscv/include/asm/xen/hypervisor.h > create mode 100644 arch/riscv/include/asm/xen/interface.h > create mode 100644 arch/riscv/include/asm/xen/page.h > create mode 100644 arch/riscv/include/asm/xen/swiotlb-xen.h > create mode 100644 arch/riscv/xen/Makefile > create mode 100644 arch/riscv/xen/enlighten.c > create mode 100644 arch/riscv/xen/grant-table.c > create mode 100644 arch/riscv/xen/hypercall.S > create mode 100644 arch/riscv/xen/p2m.c > create mode 100644 include/xen/riscv/hypercall.h > create mode 100644 include/xen/riscv/hypervisor.h > create mode 100644 include/xen/riscv/interface.h > create mode 100644 include/xen/riscv/page.h > create mode 100644 include/xen/riscv/swiotlb-xen.h > create mode 100644 test.txt > > diff --git a/arch/riscv/Kbuild b/arch/riscv/Kbuild > index 2c585f7a0b6e..d9b71deed2cd 100644 > --- a/arch/riscv/Kbuild > +++ b/arch/riscv/Kbuild > @@ -5,6 +5,7 @@ obj-$(CONFIG_BUILTIN_DTB) += boot/dts/ > obj-$(CONFIG_CRYPTO) += crypto/ > obj-y += errata/ > obj-$(CONFIG_KVM) += kvm/ > +obj-$(CONFIG_XEN) += xen/ > > obj-$(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY) += purgatory/ > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig > index d4a7ca0388c0..13ea75221524 100644 > --- a/arch/riscv/Kconfig > +++ b/arch/riscv/Kconfig > @@ -1071,6 +1071,25 @@ config PARAVIRT_TIME_ACCOUNTING > > If in doubt, say N here. > > +config XEN_DOM0 > + def_bool y > + depends on XEN > + > +config XEN > + bool "Xen guest support on RISCV" > + depends on RISCV && OF > + select PARAVIRT > + help > + Enables support for running Linux as a Xen guest on RISC-V. > + > + Xen is a type-1 hypervisor that allows multiple operating systems > + to run on the same hardware. Enabling this option allows the kernel > + to function as a guest under the Xen hypervisor on RISC-V platforms. > + > + Say Y if you want to run this kernel as a guest under Xen on RISC-V. > + > + If unsure, say N. > + > config RELOCATABLE > bool "Build a relocatable kernel" > depends on MMU && 64BIT && !XIP_KERNEL > diff --git a/arch/riscv/include/asm/cpu.h b/arch/riscv/include/asm/cpu.h > index 28d45a6678ce..fb2aac6a068e 100644 > --- a/arch/riscv/include/asm/cpu.h > +++ b/arch/riscv/includ
Re: [XEN PATCH] MAINTAINERS: Change reviewer of the ECLAIR integration
On 2025-01-14 16:00, Roger Pau Monné wrote: On Tue, Jan 14, 2025 at 03:48:54PM +0100, Nicola Vetrini wrote: Simone Ballarin is no longer actively involved in reviewing the ECLAIR integration for Xen. I am stepping up as a reviewer. Signed-off-by: Nicola Vetrini Acked-by: Roger Pau Monné Acked-by: Simone Ballarin Adding Simone to the Cc list, it would be helpful if he can also provide an Ack to signal he is OK with begin removed. Thanks, Roger.
Re: [PATCH v3 0/3] Add stack protector
On 14/01/2025 1:22 pm, Jan Beulich wrote: > On 12.12.2024 02:17, Andrew Cooper wrote: >> On 12/12/2024 12:13 am, Volodymyr Babchuk wrote: >>> Hello Jan, >>> >>> Jan Beulich writes: >>> On 11.12.2024 03:04, Volodymyr Babchuk wrote: > Both GCC and Clang support -fstack-protector feature, which add stack > canaries to functions where stack corruption is possible. This series > makes possible to use this feature in Xen. I tested this on ARM64 and > it is working as intended. Tested both with GCC and Clang. > > It is hard to enable this feature on x86, as GCC stores stack canary > in %fs:40 by default, but Xen can't use %fs for various reasons. It is > possibly to change stack canary location new newer GCC versions, but > this will change minimal GCC requirement, which is also hard due to > various reasons. So, this series focus mostly on ARM and RISCV. Why exactly would it not be possible to offer the feature when new enough gcc is in use? >>> It is possible to use this feature with a modern enough GCC, yes. Are >>> you suggesting to make HAS_STACK_PROTECTOR dependent on GCC_VERSION for >>> x86 platform? >> (With the knowledge that this is a disputed Kconfig pattern, and will >> need rebasing), the way I want this to work is simply: >> >> diff --git a/xen/Makefile b/xen/Makefile >> index 0de0101fd0bf..5d0a88fb3c3f 100644 >> --- a/xen/Makefile >> +++ b/xen/Makefile >> @@ -434,6 +434,9 @@ endif >> >> ifeq ($(CONFIG_STACK_PROTECTOR),y) >> CFLAGS += -fstack-protector >> +ifeq ($(CONFIG_X86),y) >> +CFLAGS += -mstack-protector-guard=global >> +endif >> else >> CFLAGS += -fno-stack-protector >> endif >> diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig >> index 9cdd04721afa..7951ca908b36 100644 >> --- a/xen/arch/x86/Kconfig >> +++ b/xen/arch/x86/Kconfig >> @@ -28,6 +28,7 @@ config X86 >> select HAS_PCI_MSI >> select HAS_PIRQ >> select HAS_SCHED_GRANULARITY >> + select HAS_STACK_PROTECTOR if >> $(cc-option,-mstack-protector-guard=global) >> select HAS_UBSAN >> select HAS_VMAP >> select HAS_VPCI if HVM >> >> >> >> Sadly, it doesn't build. I get a handful of: >> >> prelink.o: in function `cmdline_parse': >> /home/andrew/xen.git/xen/common/kernel.c:216:(.init.text+0x20f2): failed >> to convert GOTPCREL relocation against '__stack_chk_guard'; relink with >> --no-relax >> /home/andrew/xen.git/xen/common/kernel.c:230:(.init.text+0x246f): failed >> to convert GOTPCREL relocation against '__stack_chk_guard'; relink with >> --no-relax >> >> which is more toolchain-whispering than I feel like doing tonight. > For reference: > https://sourceware.org/pipermail/binutils/2025-January/138631.html > > You didn't enter a gcc bug report yet, did you? No, not yet. I'm afraid I've not had the time. ~Andrew
[PATCH v2] xl: properly dispose of libxl_dominfo struct instances
The ssid_label field requires separate freeing; make sure to call libxl_dominfo_dispose() as well as libxl_dominfo_init(). Since vcpuset() calls only the former, add a call to the latter there at the same time. Coverity-ID: 1638727 Coverity-ID: 1638728 Fixes: c458c404da16 ("xl: use libxl_domain_info to get the uuid in printf_info") Fixes: 48dab9767d2e ("tools/xl: use libxl_domain_info to get domain type for vcpu-pin") Signed-off-by: Jan Beulich Release-Acked-by: Oleksii Kurochko --- v2: Add call to libxl_dominfo_init() to vcpuset(). --- a/tools/xl/xl_sxp.c +++ b/tools/xl/xl_sxp.c @@ -45,8 +45,10 @@ void printf_info_sexp(int domid, libxl_d /* retrieve the UUID from dominfo, since it is probably generated * during parsing and thus does not match the real one */ +libxl_dominfo_init(&info); if (libxl_domain_info(ctx, &info, domid) == 0) { fprintf(fh, "\t(uuid " LIBXL_UUID_FMT ")\n", LIBXL_UUID_BYTES(info.uuid)); +libxl_dominfo_dispose(&info); } else { fprintf(fh, "\t(uuid )\n"); } --- a/tools/xl/xl_vcpu.c +++ b/tools/xl/xl_vcpu.c @@ -286,6 +286,8 @@ int main_vcpupin(int argc, char **argv) if (!ignore_masks && hard) { libxl_dominfo dominfo; +libxl_dominfo_init(&dominfo); + if (libxl_domain_info(ctx, &dominfo, domid)) { fprintf(stderr, "Could not get domain info\n"); goto out; @@ -293,6 +295,8 @@ int main_vcpupin(int argc, char **argv) /* HVM and PVH domains use the same global affinity mask */ apply_global_affinity_masks(dominfo.domain_type, hard, 1); + +libxl_dominfo_dispose(&dominfo); } if (force) { @@ -348,6 +352,7 @@ static int vcpuset(uint32_t domid, const unsigned int online_vcpus, host_cpu = libxl_get_max_cpus(ctx); libxl_dominfo dominfo; +libxl_dominfo_init(&dominfo); if (libxl_domain_info(ctx, &dominfo, domid)) return 1;
Re: [PATCH v2 15/25] drm/omapdrm: Compute dumb-buffer sizes with drm_mode_size_dumb()
Hi, On 09/01/2025 16:57, Thomas Zimmermann wrote: Call drm_mode_size_dumb() to compute dumb-buffer scanline pitch and buffer size. Align the pitch to a multiple of 8. Signed-off-by: Thomas Zimmermann Cc: Tomi Valkeinen --- drivers/gpu/drm/omapdrm/omap_gem.c | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/omapdrm/omap_gem.c b/drivers/gpu/drm/omapdrm/omap_gem.c index b9c67e4ca360..b8413a2dcdeb 100644 --- a/drivers/gpu/drm/omapdrm/omap_gem.c +++ b/drivers/gpu/drm/omapdrm/omap_gem.c @@ -11,6 +11,7 @@ #include #include +#include #include #include @@ -583,15 +584,13 @@ static int omap_gem_object_mmap(struct drm_gem_object *obj, struct vm_area_struc int omap_gem_dumb_create(struct drm_file *file, struct drm_device *dev, struct drm_mode_create_dumb *args) { - union omap_gem_size gsize; - - args->pitch = DIV_ROUND_UP(args->width * args->bpp, 8); - - args->size = PAGE_ALIGN(args->pitch * args->height); + union omap_gem_size gsize = { }; + int ret; - gsize = (union omap_gem_size){ - .bytes = args->size, - }; + ret = drm_mode_size_dumb(dev, args, SZ_8, 0); + if (ret) + return ret; + gsize.bytes = args->size; return omap_gem_new_handle(dev, file, gsize, OMAP_BO_SCANOUT | OMAP_BO_WC, &args->handle); Tested on dra76 evm. Reviewed-by: Tomi Valkeinen Tomi
Re: [PATCH v3 0/3] Add stack protector
On 14/01/2025 1:47 pm, Jan Beulich wrote: > On 14.01.2025 14:28, Andrew Cooper wrote: >> On 14/01/2025 1:22 pm, Jan Beulich wrote: >>> On 12.12.2024 02:17, Andrew Cooper wrote: On 12/12/2024 12:13 am, Volodymyr Babchuk wrote: > Hello Jan, > > Jan Beulich writes: > >> On 11.12.2024 03:04, Volodymyr Babchuk wrote: >>> Both GCC and Clang support -fstack-protector feature, which add stack >>> canaries to functions where stack corruption is possible. This series >>> makes possible to use this feature in Xen. I tested this on ARM64 and >>> it is working as intended. Tested both with GCC and Clang. >>> >>> It is hard to enable this feature on x86, as GCC stores stack canary >>> in %fs:40 by default, but Xen can't use %fs for various reasons. It is >>> possibly to change stack canary location new newer GCC versions, but >>> this will change minimal GCC requirement, which is also hard due to >>> various reasons. So, this series focus mostly on ARM and RISCV. >> Why exactly would it not be possible to offer the feature when new enough >> gcc is in use? > It is possible to use this feature with a modern enough GCC, yes. Are > you suggesting to make HAS_STACK_PROTECTOR dependent on GCC_VERSION for > x86 platform? (With the knowledge that this is a disputed Kconfig pattern, and will need rebasing), the way I want this to work is simply: diff --git a/xen/Makefile b/xen/Makefile index 0de0101fd0bf..5d0a88fb3c3f 100644 --- a/xen/Makefile +++ b/xen/Makefile @@ -434,6 +434,9 @@ endif ifeq ($(CONFIG_STACK_PROTECTOR),y) CFLAGS += -fstack-protector +ifeq ($(CONFIG_X86),y) +CFLAGS += -mstack-protector-guard=global +endif else CFLAGS += -fno-stack-protector endif diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig index 9cdd04721afa..7951ca908b36 100644 --- a/xen/arch/x86/Kconfig +++ b/xen/arch/x86/Kconfig @@ -28,6 +28,7 @@ config X86 select HAS_PCI_MSI select HAS_PIRQ select HAS_SCHED_GRANULARITY + select HAS_STACK_PROTECTOR if $(cc-option,-mstack-protector-guard=global) select HAS_UBSAN select HAS_VMAP select HAS_VPCI if HVM Sadly, it doesn't build. I get a handful of: prelink.o: in function `cmdline_parse': /home/andrew/xen.git/xen/common/kernel.c:216:(.init.text+0x20f2): failed to convert GOTPCREL relocation against '__stack_chk_guard'; relink with --no-relax /home/andrew/xen.git/xen/common/kernel.c:230:(.init.text+0x246f): failed to convert GOTPCREL relocation against '__stack_chk_guard'; relink with --no-relax which is more toolchain-whispering than I feel like doing tonight. >>> For reference: >>> https://sourceware.org/pipermail/binutils/2025-January/138631.html >>> >>> You didn't enter a gcc bug report yet, did you? >> No, not yet. I'm afraid I've not had the time. > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118473 Thankyou. ~Andrew
[XEN PATCH] MAINTAINERS: Change reviewer of the ECLAIR integration
Simone Ballarin is no longer actively involved in reviewing the ECLAIR integration for Xen. I am stepping up as a reviewer. Signed-off-by: Nicola Vetrini --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 392f780f7617..c11b82eca98f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -297,7 +297,7 @@ F: xen/include/xen/device_tree.h F: xen/drivers/passthrough/device_tree.c ECLAIR -R: Simone Ballarin +R: Nicola Vetrini S: Supported F: automation/eclair_analysis/ F: automation/scripts/eclair -- 2.43.0
Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock
On Tue, Jan 14, 2025 at 03:23:21PM +0100, Oleksii Kurochko wrote: > > On 1/14/25 1:13 PM, Roger Pau Monné wrote: > > On Tue, Jan 14, 2025 at 12:44:39PM +0100, Oleksii Kurochko wrote: > > > On 1/14/25 12:40 PM, Oleksii Kurochko wrote: > > > > > > > > On 1/14/25 12:27 PM, Roger Pau Monné wrote: > > > > > On Tue, Jan 14, 2025 at 12:12:03PM +0100, Oleksii Kurochko wrote: > > > > > > On 1/13/25 6:52 PM, Roger Pau Monné wrote: > > > > > > > On Mon, Jan 13, 2025 at 05:07:55PM +0100, Marek > > > > > > > Marczykowski-Górecki wrote: > > > > > > > > On Fri, Sep 13, 2024 at 09:59:06AM +0200, Roger Pau Monne wrote: > > > > > > > > > Allow setting the used wallclock from the command line. When > > > > > > > > > the option is set > > > > > > > > > to a value different than `auto` the probing is bypassed and > > > > > > > > > the selected > > > > > > > > > implementation is used (as long as it's available). > > > > > > > > > > > > > > > > > > The `xen` and `efi` options require being booted as a Xen > > > > > > > > > guest (with Xen guest > > > > > > > > > supported built-in) or from UEFI firmware respectively. > > > > > > > > > > > > > > > > > > Signed-off-by: Roger Pau Monné > > > > > > > > Reviewed-by: Marek > > > > > > > > Marczykowski-Górecki > > > > > > > Thanks for the review. > > > > > > > > > > > > > > Oleksii, can I get your opinion as Release Manager about whether > > > > > > > this > > > > > > > (and the following patch) would be suitable for committing to > > > > > > > staging > > > > > > > given the current release state? > > > > > > > > > > > > > > It's a workaround for broken EFI implementations that many > > > > > > > downstreams > > > > > > > carry on their patch queue. > > > > > > Based on your commit message, I understand this as addressing a bug > > > > > > ( but not very critical > > > > > > as IIUC downstreams have the similar patch on their side ). > > > > > > Therefore, if it has been properly > > > > > > reviewed and tested, we should consider including it in the current > > > > > > release. > > > > > IIRC at least Qubes, XenServer and XCP-ng have a patch that achieves > > > > > the same behavior as proposed here. > > > > > > > > > > > IIUC, setting the wallclock to EFI should align with the behavior > > > > > > Xen had previously. > > > > > > It might be preferable to use that argument as the default for a > > > > > > while, allowing others to verify the "auto" > > > > > > value over time. After that, we could consider making "auto" the > > > > > > default. > > > > > > That said, I am not particularly strict about following this > > > > > > approach. > > > > > We cannot really set efi as the default, as it would break when > > > > > booting on legacy BIOS systems. > > > > > > > > > > We could take only patch 1 and leave patch 2 after Xen 4.20 has > > > > > branched, but at that point I would see little benefit in having just > > > > > patch 1. > > > > I don't see a lot of benefit of comitting only the one patch either. > > > > > > > > > > > > > I don't have a strong opinion, but downstreams have been complaining > > > > > about Xen behavior regarding the usage of EFI_GET_TIME, so it might be > > > > > good to not ship yet another release with such allegedly broken > > > > > behavior. > > > > Agree with that. As I mentioned above I consider it as a bug and based > > > > on > > > > that several mentioned above downstreams have the similar patch I could > > > > consider > > > > that as tested approach so .. > > > > > Let me know what you think, as I would need a formal Release-Ack if > > > > > this is to be committed. > > > > ... R-Acked-by: Oleksii Kurochko. > > > Sorry for the noise. > > > > > > I missed to add that it would also be nice to mention IMO ( that now we > > > have wallclock parameter ) > > > in CHANGELOG so downstreams will receive "notification" that Xen provides > > > a workaround for the mentioned > > > issue in the patch series. > > Indeed. Would you be OK with adding the following chunk to patch 2? > > > > diff --git a/CHANGELOG.md b/CHANGELOG.md > > index 8507e6556a56..1de1d1eca17f 100644 > > --- a/CHANGELOG.md > > +++ b/CHANGELOG.md > > @@ -12,6 +12,7 @@ The format is based on [Keep a > > Changelog](https://keepachangelog.com/en/1.0.0/) > > leaving this to the guest kernel to do in guest context. > >- On x86: > > - Prefer ACPI reboot over UEFI ResetSystem() run time service call. > > + - Prefer CMOS over EFI_GET_TIME as time source. > > - Switched the xAPIC flat driver to use physical destination mode for > > external > >interrupts instead of logical destination mode. > > @@ -24,6 +25,7 @@ The format is based on [Keep a > > Changelog](https://keepachangelog.com/en/1.0.0/) > > - Support for LLC (Last Level Cache) coloring. > >- On x86: > > - xl suspend/resume subcommands. > > + - `wallclock` command line option to select time source. > > What about of adding word 'Add' before `wallclock`? It's i
Re: [PATCH v2] xl: properly dispose of libxl_dominfo struct instances
On 14/01/2025 1:29 pm, Jan Beulich wrote: > The ssid_label field requires separate freeing; make sure to call > libxl_dominfo_dispose() as well as libxl_dominfo_init(). Since vcpuset() > calls only the former, add a call to the latter there at the same time. > > Coverity-ID: 1638727 > Coverity-ID: 1638728 > Fixes: c458c404da16 ("xl: use libxl_domain_info to get the uuid in > printf_info") > Fixes: 48dab9767d2e ("tools/xl: use libxl_domain_info to get domain type for > vcpu-pin") > Signed-off-by: Jan Beulich > Release-Acked-by: Oleksii Kurochko > --- > v2: Add call to libxl_dominfo_init() to vcpuset(). Reviewed-by: Andrew Cooper
Re: [PATCH v3 0/3] Add stack protector
On 14.01.2025 14:28, Andrew Cooper wrote: > On 14/01/2025 1:22 pm, Jan Beulich wrote: >> On 12.12.2024 02:17, Andrew Cooper wrote: >>> On 12/12/2024 12:13 am, Volodymyr Babchuk wrote: Hello Jan, Jan Beulich writes: > On 11.12.2024 03:04, Volodymyr Babchuk wrote: >> Both GCC and Clang support -fstack-protector feature, which add stack >> canaries to functions where stack corruption is possible. This series >> makes possible to use this feature in Xen. I tested this on ARM64 and >> it is working as intended. Tested both with GCC and Clang. >> >> It is hard to enable this feature on x86, as GCC stores stack canary >> in %fs:40 by default, but Xen can't use %fs for various reasons. It is >> possibly to change stack canary location new newer GCC versions, but >> this will change minimal GCC requirement, which is also hard due to >> various reasons. So, this series focus mostly on ARM and RISCV. > Why exactly would it not be possible to offer the feature when new enough > gcc is in use? It is possible to use this feature with a modern enough GCC, yes. Are you suggesting to make HAS_STACK_PROTECTOR dependent on GCC_VERSION for x86 platform? >>> (With the knowledge that this is a disputed Kconfig pattern, and will >>> need rebasing), the way I want this to work is simply: >>> >>> diff --git a/xen/Makefile b/xen/Makefile >>> index 0de0101fd0bf..5d0a88fb3c3f 100644 >>> --- a/xen/Makefile >>> +++ b/xen/Makefile >>> @@ -434,6 +434,9 @@ endif >>> >>> ifeq ($(CONFIG_STACK_PROTECTOR),y) >>> CFLAGS += -fstack-protector >>> +ifeq ($(CONFIG_X86),y) >>> +CFLAGS += -mstack-protector-guard=global >>> +endif >>> else >>> CFLAGS += -fno-stack-protector >>> endif >>> diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig >>> index 9cdd04721afa..7951ca908b36 100644 >>> --- a/xen/arch/x86/Kconfig >>> +++ b/xen/arch/x86/Kconfig >>> @@ -28,6 +28,7 @@ config X86 >>> select HAS_PCI_MSI >>> select HAS_PIRQ >>> select HAS_SCHED_GRANULARITY >>> + select HAS_STACK_PROTECTOR if >>> $(cc-option,-mstack-protector-guard=global) >>> select HAS_UBSAN >>> select HAS_VMAP >>> select HAS_VPCI if HVM >>> >>> >>> >>> Sadly, it doesn't build. I get a handful of: >>> >>> prelink.o: in function `cmdline_parse': >>> /home/andrew/xen.git/xen/common/kernel.c:216:(.init.text+0x20f2): failed >>> to convert GOTPCREL relocation against '__stack_chk_guard'; relink with >>> --no-relax >>> /home/andrew/xen.git/xen/common/kernel.c:230:(.init.text+0x246f): failed >>> to convert GOTPCREL relocation against '__stack_chk_guard'; relink with >>> --no-relax >>> >>> which is more toolchain-whispering than I feel like doing tonight. >> For reference: >> https://sourceware.org/pipermail/binutils/2025-January/138631.html >> >> You didn't enter a gcc bug report yet, did you? > > No, not yet. I'm afraid I've not had the time. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118473 Jan
Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock
On 1/14/25 1:13 PM, Roger Pau Monné wrote: On Tue, Jan 14, 2025 at 12:44:39PM +0100, Oleksii Kurochko wrote: On 1/14/25 12:40 PM, Oleksii Kurochko wrote: On 1/14/25 12:27 PM, Roger Pau Monné wrote: On Tue, Jan 14, 2025 at 12:12:03PM +0100, Oleksii Kurochko wrote: On 1/13/25 6:52 PM, Roger Pau Monné wrote: On Mon, Jan 13, 2025 at 05:07:55PM +0100, Marek Marczykowski-Górecki wrote: On Fri, Sep 13, 2024 at 09:59:06AM +0200, Roger Pau Monne wrote: Allow setting the used wallclock from the command line. When the option is set to a value different than `auto` the probing is bypassed and the selected implementation is used (as long as it's available). The `xen` and `efi` options require being booted as a Xen guest (with Xen guest supported built-in) or from UEFI firmware respectively. Signed-off-by: Roger Pau Monné Reviewed-by: Marek Marczykowski-Górecki Thanks for the review. Oleksii, can I get your opinion as Release Manager about whether this (and the following patch) would be suitable for committing to staging given the current release state? It's a workaround for broken EFI implementations that many downstreams carry on their patch queue. Based on your commit message, I understand this as addressing a bug ( but not very critical as IIUC downstreams have the similar patch on their side ). Therefore, if it has been properly reviewed and tested, we should consider including it in the current release. IIRC at least Qubes, XenServer and XCP-ng have a patch that achieves the same behavior as proposed here. IIUC, setting the wallclock to EFI should align with the behavior Xen had previously. It might be preferable to use that argument as the default for a while, allowing others to verify the "auto" value over time. After that, we could consider making "auto" the default. That said, I am not particularly strict about following this approach. We cannot really set efi as the default, as it would break when booting on legacy BIOS systems. We could take only patch 1 and leave patch 2 after Xen 4.20 has branched, but at that point I would see little benefit in having just patch 1. I don't see a lot of benefit of comitting only the one patch either. I don't have a strong opinion, but downstreams have been complaining about Xen behavior regarding the usage of EFI_GET_TIME, so it might be good to not ship yet another release with such allegedly broken behavior. Agree with that. As I mentioned above I consider it as a bug and based on that several mentioned above downstreams have the similar patch I could consider that as tested approach so .. Let me know what you think, as I would need a formal Release-Ack if this is to be committed. ... R-Acked-by: Oleksii Kurochko. Sorry for the noise. I missed to add that it would also be nice to mention IMO ( that now we have wallclock parameter ) in CHANGELOG so downstreams will receive "notification" that Xen provides a workaround for the mentioned issue in the patch series. Indeed. Would you be OK with adding the following chunk to patch 2? diff --git a/CHANGELOG.md b/CHANGELOG.md index 8507e6556a56..1de1d1eca17f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -12,6 +12,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) leaving this to the guest kernel to do in guest context. - On x86: - Prefer ACPI reboot over UEFI ResetSystem() run time service call. + - Prefer CMOS over EFI_GET_TIME as time source. - Switched the xAPIC flat driver to use physical destination mode for external interrupts instead of logical destination mode. @@ -24,6 +25,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) - Support for LLC (Last Level Cache) coloring. - On x86: - xl suspend/resume subcommands. + - `wallclock` command line option to select time source. What about of adding word 'Add' before `wallclock`? Other LGTM: Acked-By: Oleksii Kurochko Thanks. ~ Oleksii ### Removed - On x86:
Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock
On 1/14/25 3:58 PM, Roger Pau Monné wrote: On Tue, Jan 14, 2025 at 03:23:21PM +0100, Oleksii Kurochko wrote: On 1/14/25 1:13 PM, Roger Pau Monné wrote: On Tue, Jan 14, 2025 at 12:44:39PM +0100, Oleksii Kurochko wrote: On 1/14/25 12:40 PM, Oleksii Kurochko wrote: On 1/14/25 12:27 PM, Roger Pau Monné wrote: On Tue, Jan 14, 2025 at 12:12:03PM +0100, Oleksii Kurochko wrote: On 1/13/25 6:52 PM, Roger Pau Monné wrote: On Mon, Jan 13, 2025 at 05:07:55PM +0100, Marek Marczykowski-Górecki wrote: On Fri, Sep 13, 2024 at 09:59:06AM +0200, Roger Pau Monne wrote: Allow setting the used wallclock from the command line. When the option is set to a value different than `auto` the probing is bypassed and the selected implementation is used (as long as it's available). The `xen` and `efi` options require being booted as a Xen guest (with Xen guest supported built-in) or from UEFI firmware respectively. Signed-off-by: Roger Pau Monné Reviewed-by: Marek Marczykowski-Górecki Thanks for the review. Oleksii, can I get your opinion as Release Manager about whether this (and the following patch) would be suitable for committing to staging given the current release state? It's a workaround for broken EFI implementations that many downstreams carry on their patch queue. Based on your commit message, I understand this as addressing a bug ( but not very critical as IIUC downstreams have the similar patch on their side ). Therefore, if it has been properly reviewed and tested, we should consider including it in the current release. IIRC at least Qubes, XenServer and XCP-ng have a patch that achieves the same behavior as proposed here. IIUC, setting the wallclock to EFI should align with the behavior Xen had previously. It might be preferable to use that argument as the default for a while, allowing others to verify the "auto" value over time. After that, we could consider making "auto" the default. That said, I am not particularly strict about following this approach. We cannot really set efi as the default, as it would break when booting on legacy BIOS systems. We could take only patch 1 and leave patch 2 after Xen 4.20 has branched, but at that point I would see little benefit in having just patch 1. I don't see a lot of benefit of comitting only the one patch either. I don't have a strong opinion, but downstreams have been complaining about Xen behavior regarding the usage of EFI_GET_TIME, so it might be good to not ship yet another release with such allegedly broken behavior. Agree with that. As I mentioned above I consider it as a bug and based on that several mentioned above downstreams have the similar patch I could consider that as tested approach so .. Let me know what you think, as I would need a formal Release-Ack if this is to be committed. ... R-Acked-by: Oleksii Kurochko. Sorry for the noise. I missed to add that it would also be nice to mention IMO ( that now we have wallclock parameter ) in CHANGELOG so downstreams will receive "notification" that Xen provides a workaround for the mentioned issue in the patch series. Indeed. Would you be OK with adding the following chunk to patch 2? diff --git a/CHANGELOG.md b/CHANGELOG.md index 8507e6556a56..1de1d1eca17f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -12,6 +12,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) leaving this to the guest kernel to do in guest context. - On x86: - Prefer ACPI reboot over UEFI ResetSystem() run time service call. + - Prefer CMOS over EFI_GET_TIME as time source. - Switched the xAPIC flat driver to use physical destination mode for external interrupts instead of logical destination mode. @@ -24,6 +25,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) - Support for LLC (Last Level Cache) coloring. - On x86: - xl suspend/resume subcommands. + - `wallclock` command line option to select time source. What about of adding word 'Add' before `wallclock`? It's in the "Added" section, so I thought it would be redundant. Other LGTM: Acked-By: Oleksii Kurochko Let me know if you would still like me to prepend "Add" to the above item. Oh, then it makes sense. We can go without "Add" then. ~ Oleksii Thanks, Roger.
Re: [PATCH v2 07/18] x86/pv: update guest LDT mappings using the linear entries
On 08.01.2025 15:26, Roger Pau Monne wrote: > The pv_map_ldt_shadow_page() and pv_destroy_ldt() functions rely on the L1 > table(s) that contain such mappings being stashed in the domain structure, and > thus such mappings being modified by merely updating the require L1 entries. > > Switch pv_map_ldt_shadow_page() to unconditionally use the linear recursive, > as > that logic is always called while the vCPU is running on the current pCPU. > > For pv_destroy_ldt() use the linear mappings if the vCPU is the one currently > running on the pCPU, otherwise use destroy_mappings(). > > Note this requires keeping an array with the pages currently mapped at the LDT > area, as that allows dropping the extra taken page reference when removing the > mappings. I'm confused by the wording of this paragraph: It reads as if you were changing reference obtaining / dropping, yet it all looks to stay the same. If I'm not mistaken you use the array to replace the acquiring of the MFNs in question from the L1 page table entries. If so, I think it would be nice if this could be described in a more direct way. Perhaps first and foremost by replacing "allows" and getting rid of "extra". Jan
[PATCH for-4.20] automation/gitlab: disable coverage from clang randconfig
If randconfig enables coverage support the build times out due to GNU LD taking too long. For the time being prevent coverage from being enabled in clang randconfig job. Signed-off-by: Roger Pau Monné --- Cc: Oleksii Kurochko --- I will fix the orphaned section stuff separately, as I'm considering just removing LLVM coverage support because the llvm coverage format is not stable, and the code to dump it has already become stale. However I need to think about it, and in the short term disabling coverage support from randconfig is more straightforward. --- automation/gitlab-ci/build.yaml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml index cb84f379b754..bc4a8a5ad20c 100644 --- a/automation/gitlab-ci/build.yaml +++ b/automation/gitlab-ci/build.yaml @@ -556,6 +556,8 @@ debian-12-x86_64-clang-randconfig: variables: CONTAINER: debian:12-x86_64 RANDCONFIG: y +EXTRA_FIXED_RANDCONFIG: | + CONFIG_COVERAGE=n # Disable coverage otherwise build times out. debian-12-x86_64-gcc: extends: .gcc-x86-64-build -- 2.46.0
Re: [PATCH for-4.20] automation/gitlab: disable coverage from clang randconfig
On 14/01/2025 5:43 pm, Roger Pau Monne wrote: > If randconfig enables coverage support the build times out due to GNU LD > taking too long. For the time being prevent coverage from being enabled in > clang randconfig job. > > Signed-off-by: Roger Pau Monné Acked-by: Andrew Cooper > --- > Cc: Oleksii Kurochko > --- > I will fix the orphaned section stuff separately, as I'm considering just > removing LLVM coverage support because the llvm coverage format is not > stable, and the code to dump it has already become stale. However I need > to think about it, and in the short term disabling coverage support from > randconfig is more straightforward. Oh, so it's broken too? Unless the fix is trivial, we should have a Kconfig level disable on it. No point letting people turn on something that's broken. ~Andrew
Re: [PATCH] riscv: Add initial Xen guest support for RISC-V
Le 14/01/2025 à 17:13, Milan Djokic a écrit : > diff --git a/test.txt b/test.txt > new file mode 100644 > index ..e54267998982 > --- /dev/null > +++ b/test.txt > @@ -0,0 +1,21 @@ > +WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? > +#120: > +new file mode 100644 > + > +WARNING: do not add new typedefs > +#808: FILE: include/xen/riscv/interface.h:15: > ++typedef struct { union { type * p; uint64_aligned_t q; }; } \ > + > +WARNING: please, no spaces at the start of a line > +#1006: FILE: include/xen/riscv/swiotlb-xen.h:10: > ++return 0;$ > + > +total: 0 errors, 3 warnings, 810 lines checked > + > +NOTE: For some of the reported defects, checkpatch may be able to > + mechanically convert to the typical style using --fix or --fix-inplace. > + > +0001-riscv-Add-initial-Xen-guest-support-for-RISC-V.patch has style > problems, please review. > + > +NOTE: If any of the errors are false positives, please report > + them to the maintainer, see CHECKPATCH in MAINTAINERS. I am not sure you want this here. > diff --git a/arch/riscv/include/asm/hypervisor.h b/arch/riscv/include/> asm/hypervisor.h > new file mode 100644 > index ..3a117afe57f0 > --- /dev/null > +++ b/arch/riscv/include/asm/hypervisor.h > @@ -0,0 +1,9 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef _ASM_RISCV_HYPERVISOR_H > +#define _ASM_RISCV_HYPERVISOR_H > + > +#include > + > +void kvm_init_hyp_services(void); > + > +#endif kvm_init_hyp_services seems KVM-specific and doesn't seem to exist (yet) for RISC-V. Teddy Teddy Astie | Vates XCP-ng Developer XCP-ng & Xen Orchestra - Vates solutions web: https://vates.tech
[PATCH v4 06/30] static_call: Add read-only-after-init static calls
From: Josh Poimboeuf Deferring a code patching IPI is unsafe if the patched code is in a noinstr region. In that case the text poke code must trigger an immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ CPU running in userspace. If a noinstr static call only needs to be patched during boot, its key can be made ro-after-init to ensure it will never be patched at runtime. Signed-off-by: Josh Poimboeuf --- include/linux/static_call.h | 16 1 file changed, 16 insertions(+) diff --git a/include/linux/static_call.h b/include/linux/static_call.h index 78a77a4ae0ea8..ea6ca57e2a829 100644 --- a/include/linux/static_call.h +++ b/include/linux/static_call.h @@ -192,6 +192,14 @@ extern long __static_call_return0(void); }; \ ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func) +#define DEFINE_STATIC_CALL_RO(name, _func) \ + DECLARE_STATIC_CALL(name, _func); \ + struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\ + .func = _func, \ + .type = 1, \ + }; \ + ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func) + #define DEFINE_STATIC_CALL_NULL(name, _func) \ DECLARE_STATIC_CALL(name, _func); \ struct static_call_key STATIC_CALL_KEY(name) = {\ @@ -200,6 +208,14 @@ extern long __static_call_return0(void); }; \ ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name) +#define DEFINE_STATIC_CALL_NULL_RO(name, _func) \ + DECLARE_STATIC_CALL(name, _func); \ + struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\ + .func = NULL, \ + .type = 1, \ + }; \ + ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name) + #define DEFINE_STATIC_CALL_RET0(name, _func) \ DECLARE_STATIC_CALL(name, _func); \ struct static_call_key STATIC_CALL_KEY(name) = {\ -- 2.43.0
[PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition
Context === We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a pure-userspace application get regularly interrupted by IPIs sent from housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs leading to various on_each_cpu() calls, e.g.: 64359.052209596NetworkManager 01405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush) smp_call_function_many_cond+0x1 smp_call_function+0x39 on_each_cpu+0x2a flush_tlb_kernel_range+0x7b __purge_vmap_area_lazy+0x70 _vm_unmap_aliases.part.42+0xdf change_page_attr_set_clr+0x16a set_memory_ro+0x26 bpf_int_jit_compile+0x2f9 bpf_prog_select_runtime+0xc6 bpf_prepare_filter+0x523 sk_attach_filter+0x13 sock_setsockopt+0x92c __sys_setsockopt+0x16a __x64_sys_setsockopt+0x20 do_syscall_64+0x87 entry_SYSCALL_64_after_hwframe+0x65 The heart of this series is the thought that while we cannot remove NOHZ_FULL CPUs from the list of CPUs targeted by these IPIs, they may not have to execute the callbacks immediately. Anything that only affects kernelspace can wait until the next user->kernel transition, providing it can be executed "early enough" in the entry code. The original implementation is from Peter [1]. Nicolas then added kernel TLB invalidation deferral to that [2], and I picked it up from there. Deferral approach = Storing each and every callback, like a secondary call_single_queue turned out to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in userspace for as long as possible - no signal of any form would be sent when deferring an IPI. This means that any form of queuing for deferred callbacks would end up as a convoluted memory leak. Deferred IPIs must thus be coalesced, which this series achieves by assigning IPIs a "type" and having a mapping of IPI type to callback, leveraged upon kernel entry. What about IPIs whose callback take a parameter, you may ask? Peter suggested during OSPM23 [3] that since on_each_cpu() targets housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or housekeeping-CPU-local state to "reconstruct" the data that would have been sent via the IPI. This series does not affect any IPI callback that requires an argument, but the approach would remain the same (one coalescable callback executed on kernel entry). Kernel entry vs execution of the deferred operation === This is what I've referred to as the "Danger Zone" during my LPC24 talk [4]. There is a non-zero length of code that is executed upon kernel entry before the deferred operation can be itself executed (i.e. before we start getting into context_tracking.c proper), i.e.: idtentry_func_foo()<--- we're in the kernel irqentry_enter() enter_from_user_mode() __ct_user_exit() ct_kernel_enter_state() ct_work_flush()<--- deferred operation is executed here This means one must take extra care to what can happen in the early entry code, and that cannot happen. For instance, we really don't want to hit instructions that have been modified by a remote text_poke() while we're on our way to execute a deferred sync_core(). Patches doing the actual deferral have more detail on this. Patches === o Patches 1-2 are standalone objtool cleanups. o Patches 3-4 add an RCU testing feature. o Patches 5-6 add infrastructure for annotating static keys and static calls that may be used in noinstr code (courtesy of Josh). o Patches 7-19 use said annotations on relevant keys / calls. o Patch 20 enforces proper usage of said annotations (courtesy of Josh). o Patches 21-23 fiddle with CT_STATE* within context tracking o Patches 24-29 add the actual IPI deferral faff o Patch 30 adds a freebie: deferring IPIs for NOHZ_IDLE. Not tested that much! if you care about battery-powered devices and energy consumption, go give it a try! Patches are also available at: https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v4 Stuff I'd like eyes and neurons on == Context-tracking vs idle. Patch 22 "improves" the situation by adding an IDLE->KERNEL transition when getting an IRQ while idle, but it leaves the following window: ~> IRQ ct_nmi_enter() state = state + CT_STATE_KERNEL - CT_STATE_IDLE [...] ct_nmi_exit() state = state - CT_STATE_KERNEL + CT_STATE_IDLE [...] /!\ CT_STATE_IDLE here while we're really in kernelspace! /!\ ct_cpuidle_exit() state = state + CT_STATE_KERNEL - CT_STATE_IDLE Said window is contained within cpu_idle_poll() and the cpuidle call within cpuidle_enter_state(), both being noinstr (the former is __cpuidle which is noinstr itself). Thus objtool will consider it as early entry and will warn accordingly of any static key / call misuse, so the damage is somewhat contained, but it's not ideal. I trie
[PATCH v4 12/30] arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
The static call is only ever updated in __init pv_time_init() __init xen_time_setup_guest() so mark it appropriately as __ro_after_init. Signed-off-by: Valentin Schneider --- arch/arm64/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/kernel/paravirt.c b/arch/arm64/kernel/paravirt.c index aa718d6a9274a..ad28fa23c9228 100644 --- a/arch/arm64/kernel/paravirt.c +++ b/arch/arm64/kernel/paravirt.c @@ -32,7 +32,7 @@ static u64 native_steal_clock(int cpu) return 0; } -DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock); struct pv_time_stolen_time_region { struct pvclock_vcpu_stolen_time __rcu *kaddr; -- 2.43.0
[PATCH v4 07/30] x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
Later commits will cause objtool to warn about static calls being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs. pv_sched_clock is updated in: o __init vmware_paravirt_ops_setup() o __init xen_init_time_common() o kvm_sched_clock_init() <- __init kvmclock_init() o hv_setup_sched_clock() <- __init hv_init_tsc_clocksource() IOW purely init context, and can thus be marked as __ro_after_init. Reported-by: Josh Poimboeuf Signed-off-by: Valentin Schneider --- arch/x86/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index fec3815335558..ae6675167877b 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -73,7 +73,7 @@ static u64 native_steal_clock(int cpu) } DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); -DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock); +DEFINE_STATIC_CALL_RO(pv_sched_clock, native_sched_clock); void paravirt_set_sched_clock(u64 (*func)(void)) { -- 2.43.0
[PATCH v4 01/30] objtool: Make validate_call() recognize indirect calls to pv_ops[]
call_dest_name() does not get passed the file pointer of validate_call(), which means its invocation of insn_reloc() will always return NULL. Make it take a file pointer. While at it, make sure call_dest_name() uses arch_dest_reloc_offset(), otherwise it gets the pv_ops[] offset wrong. Fabricating an intentional warning shows the change; previously: vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section now: vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[1]() leaves .noinstr.text section Signed-off-by: Valentin Schneider Acked-by: Josh Poimboeuf --- tools/objtool/check.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 76060da755b5c..b35763f05a548 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -3448,7 +3448,7 @@ static inline bool func_uaccess_safe(struct symbol *func) return false; } -static inline const char *call_dest_name(struct instruction *insn) +static inline const char *call_dest_name(struct objtool_file *file, struct instruction *insn) { static char pvname[19]; struct reloc *reloc; @@ -3457,9 +3457,9 @@ static inline const char *call_dest_name(struct instruction *insn) if (insn_call_dest(insn)) return insn_call_dest(insn)->name; - reloc = insn_reloc(NULL, insn); + reloc = insn_reloc(file, insn); if (reloc && !strcmp(reloc->sym->name, "pv_ops")) { - idx = (reloc_addend(reloc) / sizeof(void *)); + idx = (arch_dest_reloc_offset(reloc_addend(reloc)) / sizeof(void *)); snprintf(pvname, sizeof(pvname), "pv_ops[%d]", idx); return pvname; } @@ -3538,17 +3538,19 @@ static int validate_call(struct objtool_file *file, { if (state->noinstr && state->instr <= 0 && !noinstr_call_dest(file, insn, insn_call_dest(insn))) { - WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(insn)); + WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(file, insn)); return 1; } if (state->uaccess && !func_uaccess_safe(insn_call_dest(insn))) { - WARN_INSN(insn, "call to %s() with UACCESS enabled", call_dest_name(insn)); + WARN_INSN(insn, "call to %s() with UACCESS enabled", + call_dest_name(file, insn)); return 1; } if (state->df) { - WARN_INSN(insn, "call to %s() with DF set", call_dest_name(insn)); + WARN_INSN(insn, "call to %s() with DF set", + call_dest_name(file, insn)); return 1; } -- 2.43.0
[PATCH v4 03/30] rcu: Add a small-width RCU watching counter debug option
A later commit will reduce the size of the RCU watching counter to free up some bits for another purpose. Paul suggested adding a config option to test the extreme case where the counter is reduced to its minimum usable width for rcutorture to poke at, so do that. Make it only configurable under RCU_EXPERT. While at it, add a comment to explain the layout of context_tracking->state. Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop Suggested-by: Paul E. McKenney Signed-off-by: Valentin Schneider Reviewed-by: Paul E. McKenney --- include/linux/context_tracking_state.h | 44 ++ kernel/rcu/Kconfig.debug | 15 + 2 files changed, 52 insertions(+), 7 deletions(-) diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h index 7b8433d5a8efe..0b81248aa03e2 100644 --- a/include/linux/context_tracking_state.h +++ b/include/linux/context_tracking_state.h @@ -18,12 +18,6 @@ enum ctx_state { CT_STATE_MAX= 4, }; -/* Odd value for watching, else even. */ -#define CT_RCU_WATCHING CT_STATE_MAX - -#define CT_STATE_MASK (CT_STATE_MAX - 1) -#define CT_RCU_WATCHING_MASK (~CT_STATE_MASK) - struct context_tracking { #ifdef CONFIG_CONTEXT_TRACKING_USER /* @@ -44,9 +38,45 @@ struct context_tracking { #endif }; +/* + * We cram two different things within the same atomic variable: + * + * CT_RCU_WATCHING_START CT_STATE_START + *|| + *vv + * MSB [ RCU watching counter ][ context_state ] LSB + * ^ ^ + * | | + * CT_RCU_WATCHING_ENDCT_STATE_END + * + * Bits are used from the LSB upwards, so unused bits (if any) will always be in + * upper bits of the variable. + */ #ifdef CONFIG_CONTEXT_TRACKING +#define CT_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE) + +#define CT_STATE_WIDTH bits_per(CT_STATE_MAX - 1) +#define CT_STATE_START 0 +#define CT_STATE_END (CT_STATE_START + CT_STATE_WIDTH - 1) + +#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH) +#define CT_RCU_WATCHING_WIDTH (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH) +#define CT_RCU_WATCHING_START (CT_STATE_END + 1) +#define CT_RCU_WATCHING_END (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1) +#define CT_RCU_WATCHING BIT(CT_RCU_WATCHING_START) + +#define CT_STATE_MASKGENMASK(CT_STATE_END,CT_STATE_START) +#define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START) + +#define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH) + +static_assert(CT_STATE_WIDTH+ + CT_RCU_WATCHING_WIDTH + + CT_UNUSED_WIDTH == + CT_SIZE); + DECLARE_PER_CPU(struct context_tracking, context_tracking); -#endif +#endif /* CONFIG_CONTEXT_TRACKING */ #ifdef CONFIG_CONTEXT_TRACKING_USER static __always_inline int __ct_state(void) diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug index 9b0b52e1836fa..ea36953803a1e 100644 --- a/kernel/rcu/Kconfig.debug +++ b/kernel/rcu/Kconfig.debug @@ -168,4 +168,19 @@ config RCU_STRICT_GRACE_PERIOD when looking for certain types of RCU usage bugs, for example, too-short RCU read-side critical sections. + +config RCU_DYNTICKS_TORTURE + bool "Minimize RCU dynticks counter size" + depends on RCU_EXPERT && !COMPILE_TEST + default n + help + This option sets the width of the dynticks counter to its + minimum usable value. This minimum width greatly increases + the probability of flushing out bugs involving counter wrap, + but it also increases the probability of extending grace period + durations. This Kconfig option should therefore be avoided in + production due to the consequent increased probability of OOMs. + + This has no value for production and is only for testing. + endmenu # "RCU Debugging" -- 2.43.0
[PATCH v4 09/30] x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
The static call is only ever updated in __init pv_time_init() __init xen_init_time_common() __init vmware_paravirt_ops_setup() __init xen_time_setup_guest( so mark it appropriately as __ro_after_init. Signed-off-by: Valentin Schneider --- arch/x86/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index ae6675167877b..a19b0002b18d8 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -72,7 +72,7 @@ static u64 native_steal_clock(int cpu) return 0; } -DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock); DEFINE_STATIC_CALL_RO(pv_sched_clock, native_sched_clock); void paravirt_set_sched_clock(u64 (*func)(void)) -- 2.43.0
[PATCH v4 14/30] perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
Later commits will cause objtool to warn about static calls being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs. perf_lopwr_cb is used in .noinstr code, but is only ever updated in __init amd_brs_lopwr_init(), and can thus be marked as __ro_after_init. Reported-by: Josh Poimboeuf Signed-off-by: Valentin Schneider --- arch/x86/events/amd/brs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c index 780acd3dff22a..e2ff03af15d82 100644 --- a/arch/x86/events/amd/brs.c +++ b/arch/x86/events/amd/brs.c @@ -422,7 +422,7 @@ void noinstr perf_amd_brs_lopwr_cb(bool lopwr_in) } } -DEFINE_STATIC_CALL_NULL(perf_lopwr_cb, perf_amd_brs_lopwr_cb); +DEFINE_STATIC_CALL_NULL_RO(perf_lopwr_cb, perf_amd_brs_lopwr_cb); EXPORT_STATIC_CALL_TRAMP_GPL(perf_lopwr_cb); void __init amd_brs_lopwr_init(void) -- 2.43.0
[PATCH v4 05/30] jump_label: Add annotations for validating noinstr usage
From: Josh Poimboeuf Deferring a code patching IPI is unsafe if the patched code is in a noinstr region. In that case the text poke code must trigger an immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ CPU running in userspace. Some noinstr static branches may really need to be patched at runtime, despite the resulting disruption. Add DEFINE_STATIC_KEY_*_NOINSTR() variants for those. They don't do anything special yet; that will come later. Signed-off-by: Josh Poimboeuf --- include/linux/jump_label.h | 17 + 1 file changed, 17 insertions(+) diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h index f5a2727ca4a9a..88bb6e32fdcbc 100644 --- a/include/linux/jump_label.h +++ b/include/linux/jump_label.h @@ -385,6 +385,23 @@ struct static_key_false { #define DEFINE_STATIC_KEY_FALSE_RO(name) \ struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT +/* + * The _NOINSTR variants are used to tell objtool the static key is allowed to + * be used in noinstr code. + * + * They should almost never be used, as they prevent code patching IPIs from + * being deferred, which can be problematic for isolated NOHZ_FULL CPUs running + * in pure userspace. + * + * If using one of these _NOINSTR variants, please add a comment above the + * definition with the rationale. + */ +#define DEFINE_STATIC_KEY_TRUE_NOINSTR(name) \ + DEFINE_STATIC_KEY_TRUE(name) + +#define DEFINE_STATIC_KEY_FALSE_NOINSTR(name) \ + DEFINE_STATIC_KEY_FALSE(name) + #define DECLARE_STATIC_KEY_FALSE(name) \ extern struct static_key_false name -- 2.43.0
[PATCH v4 11/30] loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
The static call is only ever updated in __init pv_time_init() __init xen_time_setup_guest() so mark it appropriately as __ro_after_init. Signed-off-by: Valentin Schneider --- arch/loongarch/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c index e5a39bbad0780..b011578d3e931 100644 --- a/arch/loongarch/kernel/paravirt.c +++ b/arch/loongarch/kernel/paravirt.c @@ -20,7 +20,7 @@ static u64 native_steal_clock(int cpu) return 0; } -DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock); static bool steal_acc = true; -- 2.43.0
[PATCH v4 02/30] objtool: Flesh out warning related to pv_ops[] calls
I had to look into objtool itself to understand what this warning was about; make it more explicit. Signed-off-by: Valentin Schneider Acked-by: Josh Poimboeuf --- tools/objtool/check.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/objtool/check.c b/tools/objtool/check.c index b35763f05a548..6aa9259fc9940 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -3486,7 +3486,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn) list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) { if (!target->sec->noinstr) { - WARN("pv_ops[%d]: %s", idx, target->name); + WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name); file->pv_ops[idx].clean = false; } } -- 2.43.0
[PATCH v4 10/30] riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
The static call is only ever updated in: __init pv_time_init() __init xen_time_setup_guest() so mark it appropriately as __ro_after_init. Signed-off-by: Valentin Schneider --- arch/riscv/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/riscv/kernel/paravirt.c b/arch/riscv/kernel/paravirt.c index fa6b0339a65de..dfe8808016fd8 100644 --- a/arch/riscv/kernel/paravirt.c +++ b/arch/riscv/kernel/paravirt.c @@ -30,7 +30,7 @@ static u64 native_steal_clock(int cpu) return 0; } -DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock); static bool steal_acc = true; static int __init parse_no_stealacc(char *arg) -- 2.43.0
[PATCH v4 08/30] x86/idle: Mark x86_idle static call as __ro_after_init
Later commits will cause objtool to warn about static calls being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs. x86_idle is updated in: o xen_set_default_idle() <- __init xen_arch_setup() o __init select_idle_routine() IOW purely init context, and can thus be marked as __ro_after_init. Reported-by: Josh Poimboeuf Signed-off-by: Valentin Schneider --- arch/x86/kernel/process.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index f63f8fd00a91f..85cd62f61d633 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -746,7 +746,7 @@ void __cpuidle default_idle(void) EXPORT_SYMBOL(default_idle); #endif -DEFINE_STATIC_CALL_NULL(x86_idle, default_idle); +DEFINE_STATIC_CALL_NULL_RO(x86_idle, default_idle); static bool x86_idle_set(void) { -- 2.43.0
[PATCH v4 04/30] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
We now have an RCU_EXPERT config for testing small-sized RCU dynticks counter: CONFIG_RCU_DYNTICKS_TORTURE. Modify scenario TREE04 to exercise to use this config in order to test a ridiculously small counter (2 bits). Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop Suggested-by: Paul E. McKenney Signed-off-by: Valentin Schneider Reviewed-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/configs/rcu/TREE04 | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04 index dc4985064b3ad..67caf4276bb01 100644 --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04 @@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n CONFIG_RCU_EXPERT=y CONFIG_RCU_EQS_DEBUG=y CONFIG_RCU_LAZY=y +CONFIG_RCU_DYNTICKS_TORTURE=y -- 2.43.0
[PATCH v4 20/30] objtool: Add noinstr validation for static branches/calls
From: Josh Poimboeuf Warn about static branches/calls in noinstr regions, unless the corresponding key is RO-after-init or has been manually whitelisted with DEFINE_STATIC_KEY_*_NOINSTR((). Signed-off-by: Josh Poimboeuf [Added NULL check for insn_call_dest() return value] Signed-off-by: Valentin Schneider --- include/linux/jump_label.h | 17 +++-- include/linux/objtool.h | 7 ++ include/linux/static_call.h | 3 + tools/objtool/Documentation/objtool.txt | 34 + tools/objtool/check.c | 92 ++--- tools/objtool/include/objtool/check.h | 1 + tools/objtool/include/objtool/elf.h | 1 + tools/objtool/include/objtool/special.h | 1 + tools/objtool/special.c | 18 - 9 files changed, 156 insertions(+), 18 deletions(-) diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h index 88bb6e32fdcbc..dc8a82a62c109 100644 --- a/include/linux/jump_label.h +++ b/include/linux/jump_label.h @@ -75,6 +75,7 @@ #include #include +#include extern bool static_key_initialized; @@ -373,8 +374,9 @@ struct static_key_false { #define DEFINE_STATIC_KEY_TRUE(name) \ struct static_key_true name = STATIC_KEY_TRUE_INIT -#define DEFINE_STATIC_KEY_TRUE_RO(name)\ - struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT +#define DEFINE_STATIC_KEY_TRUE_RO(name) \ + struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT; \ + ANNOTATE_NOINSTR_ALLOWED(name) #define DECLARE_STATIC_KEY_TRUE(name) \ extern struct static_key_true name @@ -382,8 +384,9 @@ struct static_key_false { #define DEFINE_STATIC_KEY_FALSE(name) \ struct static_key_false name = STATIC_KEY_FALSE_INIT -#define DEFINE_STATIC_KEY_FALSE_RO(name) \ - struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT +#define DEFINE_STATIC_KEY_FALSE_RO(name) \ + struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT; \ + ANNOTATE_NOINSTR_ALLOWED(name) /* * The _NOINSTR variants are used to tell objtool the static key is allowed to @@ -397,10 +400,12 @@ struct static_key_false { * definition with the rationale. */ #define DEFINE_STATIC_KEY_TRUE_NOINSTR(name) \ - DEFINE_STATIC_KEY_TRUE(name) + DEFINE_STATIC_KEY_TRUE(name); \ + ANNOTATE_NOINSTR_ALLOWED(name) #define DEFINE_STATIC_KEY_FALSE_NOINSTR(name) \ - DEFINE_STATIC_KEY_FALSE(name) + DEFINE_STATIC_KEY_FALSE(name); \ + ANNOTATE_NOINSTR_ALLOWED(name) #define DECLARE_STATIC_KEY_FALSE(name) \ extern struct static_key_false name diff --git a/include/linux/objtool.h b/include/linux/objtool.h index b3b8d3dab52d5..1a7389f273063 100644 --- a/include/linux/objtool.h +++ b/include/linux/objtool.h @@ -34,6 +34,12 @@ static void __used __section(".discard.func_stack_frame_non_standard") \ *__func_stack_frame_non_standard_##func = func +#define __ANNOTATE_NOINSTR_ALLOWED(key) \ + static void __used __section(".discard.noinstr_allowed") \ + *__annotate_noinstr_allowed_##key = &key + +#define ANNOTATE_NOINSTR_ALLOWED(key) __ANNOTATE_NOINSTR_ALLOWED(key) + /* * STACK_FRAME_NON_STANDARD_FP() is a frame-pointer-specific function ignore * for the case where a function is intentionally missing frame pointer setup, @@ -157,6 +163,7 @@ #define STACK_FRAME_NON_STANDARD_FP(func) #define ANNOTATE_NOENDBR #define ASM_REACHABLE +#define ANNOTATE_NOINSTR_ALLOWED(key) #else #define ANNOTATE_INTRA_FUNCTION_CALL .macro UNWIND_HINT type:req sp_reg=0 sp_offset=0 signal=0 diff --git a/include/linux/static_call.h b/include/linux/static_call.h index ea6ca57e2a829..0d4b16d348501 100644 --- a/include/linux/static_call.h +++ b/include/linux/static_call.h @@ -133,6 +133,7 @@ #include #include +#include #include #ifdef CONFIG_HAVE_STATIC_CALL @@ -198,6 +199,7 @@ extern long __static_call_return0(void); .func = _func, \ .type = 1, \ }; \ + ANNOTATE_NOINSTR_ALLOWED(STATIC_CALL_TRAMP(name)); \ ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func) #define DEFINE_STATIC_CALL_NULL(name, _func) \ @@ -214,6 +216,7 @@ extern long __static_call_return0(void); .func = NULL, \ .type = 1, \ }; \ + ANNOTATE_NOINSTR_ALLOWED(STATIC_CA
[PATCH v4 19/30] stackleack: Mark stack_erasing_bypass key as allowed in .noinstr
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs. stack_erasing_bypass is used in .noinstr code, and can be modified at runtime (proc/sys/kernel/stack_erasing write). However it is not expected that it will be flipped during latency-sensitive operations, and thus shouldn't be a source of interference wrt the text patching IPI. Mark it to let objtool know not to warn about it. Reported-by: Josh Poimboeuf Signed-off-by: Valentin Schneider --- kernel/stackleak.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/kernel/stackleak.c b/kernel/stackleak.c index 39fd620a7db6f..a4f07fbc13f61 100644 --- a/kernel/stackleak.c +++ b/kernel/stackleak.c @@ -18,7 +18,11 @@ #include #include -static DEFINE_STATIC_KEY_FALSE(stack_erasing_bypass); +/* + * This static key can only be modified via its sysctl interface. It is + * expected it will remain stable during latency-senstive operations. + */ +static DEFINE_STATIC_KEY_FALSE_NOINSTR(stack_erasing_bypass); #ifdef CONFIG_SYSCTL static int stack_erasing_sysctl(const struct ctl_table *table, int write, -- 2.43.0
[PATCH v4 27/30] x86/tlb: Make __flush_tlb_local() noinstr-compliant
Later patches will require issuing a __flush_tlb_all() from noinstr code. This requires making both __flush_tlb_local() and __flush_tlb_global() noinstr-compliant. For __flush_tlb_local(), xen_flush_tlb() has already been made noinstr, so it's just native_flush_tlb_global(), and simply __always_inline'ing invalidate_user_asid() gets us there Signed-off-by: Valentin Schneider --- arch/x86/include/asm/paravirt.h | 2 +- arch/x86/mm/tlb.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index b3daee3d46677..0c0dd186c03e6 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -70,7 +70,7 @@ void native_flush_tlb_one_user(unsigned long addr); void native_flush_tlb_multi(const struct cpumask *cpumask, const struct flush_tlb_info *info); -static inline void __flush_tlb_local(void) +static __always_inline void __flush_tlb_local(void) { PVOP_VCALL0(mmu.flush_tlb_user); } diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 2d2ab3e221f0c..18b40bbc2fa15 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -257,7 +257,7 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen, * * See SWITCH_TO_USER_CR3. */ -static inline void invalidate_user_asid(u16 asid) +static __always_inline void invalidate_user_asid(u16 asid) { /* There is no user ASID if address space separation is off */ if (!IS_ENABLED(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION)) @@ -1206,7 +1206,7 @@ STATIC_NOPV noinstr void native_flush_tlb_global(void) /* * Flush the entire current user mapping */ -STATIC_NOPV void native_flush_tlb_local(void) +STATIC_NOPV noinstr void native_flush_tlb_local(void) { /* * Preemption or interrupts must be disabled to protect the access -- 2.43.0
[PATCH v4 18/30] x86/kvm/vmx: Mark vmx_l1d_should flush and vmx_l1d_flush_cond keys as allowed in .noinstr
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs. These keys are used in .noinstr code, and can be modified at runtime (/proc/kernel/vmx* write). However it is not expected that they will be flipped during latency-sensitive operations, and thus shouldn't be a source of interference wrt the text patching IPI. Mark it to let objtool know not to warn about it. Reported-by: Josh Poimboeuf Signed-off-by: Valentin Schneider --- arch/x86/kvm/vmx/vmx.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 893366e537322..a028c38f44e02 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -225,8 +225,15 @@ module_param(pt_mode, int, S_IRUGO); struct x86_pmu_lbr __ro_after_init vmx_lbr_caps; -static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush); -static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond); +/* + * Both of these static keys end up being used in .noinstr sections, however + * they are only modified: + * - at init + * - from a /proc/kernel/vmx* write + * thus during latency-sensitive operations they should remain stable. + */ +static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_should_flush); +static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_flush_cond); static DEFINE_MUTEX(vmx_l1d_flush_mutex); /* Storage for pre module init parameter parsing */ -- 2.43.0
[PATCH v4 16/30] x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs. mds_idle_clear is used in .noinstr code, and can be modified at runtime (SMT hotplug). Suppressing the text_poke_sync() IPI has little benefits for this key, as hotplug implies eventually going through takedown_cpu() -> stop_machine_cpuslocked() which is going to cause interference on all online CPUs anyway. Mark it to let objtool know not to warn about it. Signed-off-by: Valentin Schneider --- arch/x86/kernel/cpu/bugs.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 47a01d4028f60..acad84dcfc3cd 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -113,8 +113,13 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_ibpb); /* Control unconditional IBPB in switch_mm() */ DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb); -/* Control MDS CPU buffer clear before idling (halt, mwait) */ -DEFINE_STATIC_KEY_FALSE(mds_idle_clear); +/* + * Control MDS CPU buffer clear before idling (halt, mwait) + * + * Allowed in .noinstr as this key is updated during hotplug which comes with + * more interference than just the text patching IPI. + */ +DEFINE_STATIC_KEY_FALSE_NOINSTR(mds_idle_clear); EXPORT_SYMBOL_GPL(mds_idle_clear); /* -- 2.43.0
[PATCH v4 17/30] sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs. __sched_clock_stable is used in .noinstr code, and can be modified at runtime (e.g. time_cpufreq_notifier()). Suppressing the text_poke_sync() IPI has little benefits for this key, as NOHZ_FULL is incompatible with an unstable TSC anyway. Mark it to let objtool know not to warn about it. Signed-off-by: Valentin Schneider --- kernel/sched/clock.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c index 200e5568b9894..e59986bc14a43 100644 --- a/kernel/sched/clock.c +++ b/kernel/sched/clock.c @@ -75,8 +75,11 @@ static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running); * * Similarly we start with __sched_clock_stable_early, thereby assuming we * will become stable, such that there's only a single 1 -> 0 transition. + * + * Allowed in .noinstr as an unstable TLC is incompatible with NOHZ_FULL, + * thus the text patching IPI would be the least of our concerns. */ -static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable); +static DEFINE_STATIC_KEY_FALSE_NOINSTR(__sched_clock_stable); static int __sched_clock_stable_early = 1; /* -- 2.43.0
[PATCH v4 21/30] context_tracking: Explicitely use CT_STATE_KERNEL where it is missing
CT_STATE_KERNEL being zero means it can be (and is) omitted in a handful of places. A later patch will change CT_STATE_KERNEL into a non-zero value, prepare that by using it where it should be: o In the initial CT state o At kernel entry / exit No change in functionality intended. Signed-off-by: Valentin Schneider --- kernel/context_tracking.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 938c48952d265..a61498a8425e2 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -31,7 +31,7 @@ DEFINE_PER_CPU(struct context_tracking, context_tracking) = { .nesting = 1, .nmi_nesting = CT_NESTING_IRQ_NONIDLE, #endif - .state = ATOMIC_INIT(CT_RCU_WATCHING), + .state = ATOMIC_INIT(CT_RCU_WATCHING | CT_STATE_KERNEL), }; EXPORT_SYMBOL_GPL(context_tracking); @@ -147,7 +147,7 @@ static void noinstr ct_kernel_exit(bool user, int offset) instrumentation_end(); WRITE_ONCE(ct->nesting, 0); /* Avoid irq-access tearing. */ // RCU is watching here ... - ct_kernel_exit_state(offset); + ct_kernel_exit_state(offset - CT_STATE_KERNEL); // ... but is no longer watching here. rcu_task_exit(); } @@ -175,7 +175,7 @@ static void noinstr ct_kernel_enter(bool user, int offset) } rcu_task_enter(); // RCU is not watching here ... - ct_kernel_enter_state(offset); + ct_kernel_enter_state(offset + CT_STATE_KERNEL); // ... but is watching here. instrumentation_begin(); @@ -537,7 +537,7 @@ void noinstr __ct_user_enter(enum ctx_state state) * RCU only requires CT_RCU_WATCHING increments to be fully * ordered. */ - raw_atomic_add(state, &ct->state); + raw_atomic_add(state - CT_STATE_KERNEL, &ct->state); } } } @@ -647,7 +647,7 @@ void noinstr __ct_user_exit(enum ctx_state state) * RCU only requires CT_RCU_WATCHING increments to be fully * ordered. */ - raw_atomic_sub(state, &ct->state); + raw_atomic_sub(state - CT_STATE_KERNEL, &ct->state); } } } -- 2.43.0
[PATCH v4 15/30] sched/clock: Mark sched_clock_running key as __ro_after_init
sched_clock_running is only ever enabled in the __init functions sched_clock_init() and sched_clock_init_late(), and is never disabled. Mark it __ro_after_init. Signed-off-by: Valentin Schneider --- kernel/sched/clock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c index a09655b481402..200e5568b9894 100644 --- a/kernel/sched/clock.c +++ b/kernel/sched/clock.c @@ -66,7 +66,7 @@ notrace unsigned long long __weak sched_clock(void) } EXPORT_SYMBOL_GPL(sched_clock); -static DEFINE_STATIC_KEY_FALSE(sched_clock_running); +static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running); #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK /* -- 2.43.0
[PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
vunmap()'s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as they are hit by the flush_tlb_kernel_range() IPIs. Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry. Deferral vs early entry danger zone === This requires a guarantee that nothing in the vmalloc range can be vunmap'd and then accessed in early entry code. Vmalloc uses are, as reported by vmallocinfo: $ cat /proc/vmallocinfo | awk '{ print $3 }' | sort | uniq __pci_enable_msix_range+0x32b/0x560 acpi_os_map_iomem+0x22d/0x250 bpf_prog_alloc_no_stats+0x34/0x140 fork_idle+0x79/0x120 gen_pool_add_owner+0x3e/0xb0 ? hpet_enable+0xbf/0x470 irq_init_percpu_irqstack+0x129/0x160 kernel_clone+0xab/0x3b0 memremap+0x164/0x330 n_tty_open+0x18/0xa0 pcpu_create_chunk+0x4e/0x1b0 pcpu_create_chunk+0x75/0x1b0 pcpu_get_vm_areas+0x0/0x1100 unpurged vp_modern_map_capability+0x140/0x270 zisofs_init+0x16/0x30 I've categorized these as: a) Device or percpu mappings For these to be unmapped, the device (or CPU) has to be removed and an eventual IRQ freed. Even if the IRQ isn't freed, context tracking entry happens before handling the IRQ itself, per irqentry_enter() usage. __pci_enable_msix_range() acpi_os_map_iomem() irq_init_percpu_irqstack() (not even unmapped when CPU is hot-unplugged!) memremap() n_tty_open() pcpu_create_chunk() pcpu_get_vm_areas() vp_modern_map_capability() b) CONFIG_VMAP_STACK fork_idle() & kernel_clone() vmalloc'd kernel stacks are AFAICT a safe example, as a task running in userspace needs to enter kernelspace to execute do_exit() before its stack can be vfree'd. c) Non-issues bpf_prog_alloc_no_stats() - early entry is noinstr, no BPF! hpet_enable() - hpet_clear_mapping() is only called if __init function fails, no runtime worries zisofs_init () - used for zisofs block decompression, that's way past context tracking entry d) I'm not sure, have a look? gen_pool_add_owner() - AIUI this is mainly for PCI / DMA stuff, which again I wouldn't expect to be accessed before context tracking entry. Changes == Blindly deferring any and all flush of the kernel mappings is a risky move, so introduce a variant of flush_tlb_kernel_range() that explicitly allows deferral. Use it for vunmap flushes. Note that while flush_tlb_kernel_range() may end up issuing a full flush (including user mappings), this only happens when reaching a invalidation range threshold where it is cheaper to do a full flush than to individually invalidate each page in the range via INVLPG. IOW, it doesn't *require* invalidating user mappings, and thus remains safe to defer until a later kernel entry. Signed-off-by: Valentin Schneider --- arch/x86/include/asm/context_tracking_work.h | 4 +++ arch/x86/include/asm/tlbflush.h | 1 + arch/x86/mm/tlb.c| 23 +++-- include/linux/context_tracking_work.h| 4 ++- mm/vmalloc.c | 35 +--- 5 files changed, 58 insertions(+), 9 deletions(-) diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h index 485b32881fde5..da3d270289836 100644 --- a/arch/x86/include/asm/context_tracking_work.h +++ b/arch/x86/include/asm/context_tracking_work.h @@ -3,6 +3,7 @@ #define _ASM_X86_CONTEXT_TRACKING_WORK_H #include +#include static __always_inline void arch_context_tracking_work(enum ct_work work) { @@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(enum ct_work work) case CT_WORK_SYNC: sync_core(); break; + case CT_WORK_TLBI: + __flush_tlb_all(); + break; case CT_WORK_MAX: WARN_ON_ONCE(true); } diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 4d11396250999..6e690acc35e53 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -248,6 +248,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned int stride_shift, bool freed_tables); extern void flush_tlb_kernel_range(unsigned long start, unsigned long end); +extern void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end); static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a) { diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 119765772ab11..47fb437acf52a 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -1042,6 +1043,11 @@
[PATCH v4 30/30] context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE
With NO_HZ_IDLE, we get CONTEXT_TRACKING_IDLE, so we get these transitions: ct_idle_enter() ct_kernel_exit() ct_state_inc_clear_work() ct_idle_exit() ct_kernel_enter() ct_work_flush() With just CONTEXT_TRACKING_IDLE, ct_state_inc_clear_work() is just ct_state_inc() and ct_work_flush() is a no-op. However, making them be functional as if under CONTEXT_TRACKING_WORK would allow NO_HZ_IDLE to leverage IPI deferral to keep idle CPUs idle longer. Having this enabled for NO_HZ_IDLE is a different argument than for having it for NO_HZ_FULL (power savings vs latency/performance), but the backing mechanism is identical. Add a default-no option to enable IPI deferral with NO_HZ_IDLE. Signed-off-by: Valentin Schneider --- kernel/time/Kconfig | 16 +++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig index 7e8106a0d981f..c7398fe5382a0 100644 --- a/kernel/time/Kconfig +++ b/kernel/time/Kconfig @@ -183,9 +183,23 @@ config CONTEXT_TRACKING_USER_FORCE config CONTEXT_TRACKING_WORK bool - depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER + depends on HAVE_CONTEXT_TRACKING_WORK && (CONTEXT_TRACKING_USER || CONTEXT_TRACKING_WORK_IDLE) default y +config CONTEXT_TRACKING_WORK_IDLE + bool + depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_IDLE && !CONTEXT_TRACKING_USER + default n + help +This option enables deferral of some IPIs when they are targeted at CPUs +that are idle. This can help keep CPUs idle longer, but induces some +extra overhead to idle <-> kernel transitions and to IPI sending. + +Say Y if the power improvements are worth more to you than the added +overheads. + +Say N otherwise. + config NO_HZ bool "Old Idle dynticks config" help -- 2.43.0
[PATCH v4 28/30] x86/tlb: Make __flush_tlb_all() noinstr
Later patches will require issuing a __flush_tlb_all() from noinstr code. Both __flush_tlb_local() and __flush_tlb_global() are now noinstr-compliant, so __flush_tlb_all() can be made noinstr itself. Signed-off-by: Valentin Schneider --- arch/x86/include/asm/tlbflush.h | 2 +- arch/x86/mm/tlb.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 69e79fff41b80..4d11396250999 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -17,7 +17,7 @@ DECLARE_PER_CPU(u64, tlbstate_untag_mask); -void __flush_tlb_all(void); +noinstr void __flush_tlb_all(void); #define TLB_FLUSH_ALL -1UL #define TLB_GENERATION_INVALID 0 diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 18b40bbc2fa15..119765772ab11 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -1229,7 +1229,7 @@ void flush_tlb_local(void) /* * Flush everything */ -void __flush_tlb_all(void) +noinstr void __flush_tlb_all(void) { /* * This is to catch users with enabled preemption and the PGE feature @@ -1243,7 +1243,7 @@ void __flush_tlb_all(void) /* * !PGE -> !PCID (setup_pcid()), thus every flush is total. */ - flush_tlb_local(); + __flush_tlb_local(); } } EXPORT_SYMBOL_GPL(__flush_tlb_all); -- 2.43.0
[PATCH v4 13/30] arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
The static call is only ever updated in __init xen_time_setup_guest() so mark it appropriately as __ro_after_init. Signed-off-by: Valentin Schneider --- arch/arm/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm/kernel/paravirt.c b/arch/arm/kernel/paravirt.c index 7dd9806369fb0..632d8d5e06db3 100644 --- a/arch/arm/kernel/paravirt.c +++ b/arch/arm/kernel/paravirt.c @@ -20,4 +20,4 @@ static u64 native_steal_clock(int cpu) return 0; } -DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock); -- 2.43.0
[PATCH v4 22/30] context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry
ct_nmi_{enter, exit}() only touches the RCU watching counter and doesn't modify the actual CT state part context_tracking.state. This means that upon receiving an IRQ when idle, the CT_STATE_IDLE->CT_STATE_KERNEL transition only happens in ct_idle_exit(). One can note that ct_nmi_enter() can only ever be entered with the CT state as either CT_STATE_KERNEL or CT_STATE_IDLE, as an IRQ/NMI happenning in the CT_STATE_USER or CT_STATE_GUEST states will be routed down to ct_user_exit(). Add/remove CT_STATE_IDLE from the context tracking state as needed in ct_nmi_{enter, exit}(). Note that this leaves the following window where the CPU is executing code in kernelspace, but the context tracking state is CT_STATE_IDLE: ~> IRQ ct_nmi_enter() state = state + CT_STATE_KERNEL - CT_STATE_IDLE [...] ct_nmi_exit() state = state - CT_STATE_KERNEL + CT_STATE_IDLE [...] /!\ CT_STATE_IDLE here while we're really in kernelspace! /!\ ct_cpuidle_exit() state = state + CT_STATE_KERNEL - CT_STATE_IDLE Signed-off-by: Valentin Schneider --- kernel/context_tracking.c | 22 +++--- 1 file changed, 19 insertions(+), 3 deletions(-) diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index a61498a8425e2..15f10ddec8cbe 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -236,7 +236,9 @@ void noinstr ct_nmi_exit(void) instrumentation_end(); // RCU is watching here ... - ct_kernel_exit_state(CT_RCU_WATCHING); + ct_kernel_exit_state(CT_RCU_WATCHING - +CT_STATE_KERNEL + +CT_STATE_IDLE); // ... but is no longer watching here. if (!in_nmi()) @@ -259,6 +261,7 @@ void noinstr ct_nmi_enter(void) { long incby = 2; struct context_tracking *ct = this_cpu_ptr(&context_tracking); + int curr_state; /* Complain about underflow. */ WARN_ON_ONCE(ct_nmi_nesting() < 0); @@ -271,13 +274,26 @@ void noinstr ct_nmi_enter(void) * to be in the outermost NMI handler that interrupted an RCU-idle * period (observation due to Andy Lutomirski). */ - if (!rcu_is_watching_curr_cpu()) { + curr_state = raw_atomic_read(this_cpu_ptr(&context_tracking.state)); + if (!(curr_state & CT_RCU_WATCHING)) { if (!in_nmi()) rcu_task_enter(); + /* +* RCU isn't watching, so we're one of +* CT_STATE_IDLE +* CT_STATE_USER +* CT_STATE_GUEST +* guest/user entry is handled by ct_user_enter(), so this has +* to be idle entry. +*/ + WARN_ON_ONCE((curr_state & CT_STATE_MASK) != CT_STATE_IDLE); + // RCU is not watching here ... - ct_kernel_enter_state(CT_RCU_WATCHING); + ct_kernel_enter_state(CT_RCU_WATCHING + + CT_STATE_KERNEL - + CT_STATE_IDLE); // ... but is watching here. instrumentation_begin(); -- 2.43.0
[PATCH v4 23/30] context_tracking: Turn CT_STATE_* into bits
A later patch will require to easily exclude CT_STATE_KERNEL from a genuine a ct->state read CT_STATE_KERNEL, which requires that value being non-zero and exclusive with the other CT_STATE_* values. This increases the size of the CT_STATE region of the ct->state variable by two bits. Signed-off-by: Valentin Schneider --- include/linux/context_tracking_state.h | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h index 0b81248aa03e2..eb2149b20baef 100644 --- a/include/linux/context_tracking_state.h +++ b/include/linux/context_tracking_state.h @@ -11,11 +11,11 @@ enum ctx_state { CT_STATE_DISABLED = -1, /* returned by ct_state() if unknown */ - CT_STATE_KERNEL = 0, - CT_STATE_IDLE = 1, - CT_STATE_USER = 2, - CT_STATE_GUEST = 3, - CT_STATE_MAX= 4, + CT_STATE_KERNEL = 1, + CT_STATE_IDLE = 2, + CT_STATE_USER = 4, + CT_STATE_GUEST = 8, + CT_STATE_MAX= 9, }; struct context_tracking { -- 2.43.0
[PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs
text_poke_bp_batch() sends IPIs to all online CPUs to synchronize them vs the newly patched instruction. CPUs that are executing in userspace do not need this synchronization to happen immediately, and this is actually harmful interference for NOHZ_FULL CPUs. As the synchronization IPIs are sent using a blocking call, returning from text_poke_bp_batch() implies all CPUs will observe the patched instruction(s), and this should be preserved even if the IPI is deferred. In other words, to safely defer this synchronization, any kernel instruction leading to the execution of the deferred instruction sync (ct_work_flush()) must *not* be mutable (patchable) at runtime. This means we must pay attention to mutable instructions in the early entry code: - alternatives - static keys - static calls - all sorts of probes (kprobes/ftrace/bpf/???) The early entry code leading to ct_work_flush() is noinstr, which gets rid of the probes. Alternatives are safe, because it's boot-time patching (before SMP is even brought up) which is before any IPI deferral can happen. This leaves us with static keys and static calls. Any static key used in early entry code should be only forever-enabled at boot time, IOW __ro_after_init (pretty much like alternatives). Exceptions are explicitly marked as allowed in .noinstr and will always generate an IPI when flipped. The same applies to static calls - they should be only updated at boot time, or manually marked as an exception. Objtool is now able to point at static keys/calls that don't respect this, and all static keys/calls used in early entry code have now been verified as behaving appropriately. Leverage the new context_tracking infrastructure to defer sync_core() IPIs to a target CPU's next kernel entry. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Nicolas Saenz Julienne Signed-off-by: Valentin Schneider --- arch/x86/include/asm/context_tracking_work.h | 6 ++-- arch/x86/include/asm/text-patching.h | 1 + arch/x86/kernel/alternative.c| 38 arch/x86/kernel/kprobes/core.c | 4 +-- arch/x86/kernel/kprobes/opt.c| 4 +-- arch/x86/kernel/module.c | 2 +- include/asm-generic/sections.h | 15 include/linux/context_tracking_work.h| 4 +-- 8 files changed, 59 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h index 5f3b2d0977235..485b32881fde5 100644 --- a/arch/x86/include/asm/context_tracking_work.h +++ b/arch/x86/include/asm/context_tracking_work.h @@ -2,11 +2,13 @@ #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H #define _ASM_X86_CONTEXT_TRACKING_WORK_H +#include + static __always_inline void arch_context_tracking_work(enum ct_work work) { switch (work) { - case CT_WORK_n: - // Do work... + case CT_WORK_SYNC: + sync_core(); break; case CT_WORK_MAX: WARN_ON_ONCE(true); diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index ab9e143ec9fea..9dfa46f721c1d 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -33,6 +33,7 @@ extern void apply_relocation(u8 *buf, const u8 * const instr, size_t instrlen, u */ extern void *text_poke(void *addr, const void *opcode, size_t len); extern void text_poke_sync(void); +extern void text_poke_sync_deferrable(void); extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len); extern void *text_poke_copy(void *addr, const void *opcode, size_t len); #define text_poke_copy text_poke_copy diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 243843e44e89d..633deea8a89cb 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include @@ -2109,9 +2110,24 @@ static void do_sync_core(void *info) sync_core(); } +static bool do_sync_core_defer_cond(int cpu, void *info) +{ + return !ct_set_cpu_work(cpu, CT_WORK_SYNC); +} + +static void __text_poke_sync(smp_cond_func_t cond_func) +{ + on_each_cpu_cond(cond_func, do_sync_core, NULL, 1); +} + void text_poke_sync(void) { - on_each_cpu(do_sync_core, NULL, 1); + __text_poke_sync(NULL); +} + +void text_poke_sync_deferrable(void) +{ + __text_poke_sync(do_sync_core_defer_cond); } /* @@ -2282,6 +2298,7 @@ static int tp_vec_nr; */ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries) { + smp_cond_func_t cond = do_sync_core_defer_cond; unsigned char int3 = INT3_INSN_OPCODE; unsigned int i; int do_sync; @@ -2317,11 +2334,20 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries * First step: add a int3 trap to the address that will
Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider wrote: > vunmap()'s issued from housekeeping CPUs are a relatively common source of > interference for isolated NOHZ_FULL CPUs, as they are hit by the > flush_tlb_kernel_range() IPIs. > > Given that CPUs executing in userspace do not access data in the vmalloc > range, these IPIs could be deferred until their next kernel entry. > > Deferral vs early entry danger zone > === > > This requires a guarantee that nothing in the vmalloc range can be vunmap'd > and then accessed in early entry code. In other words, it needs a guarantee that no vmalloc allocations that have been created in the vmalloc region while the CPU was idle can then be accessed during early entry, right?
Re: [PATCH v4 26/30] x86,tlb: Make __flush_tlb_global() noinstr-compliant
On 1/14/25 09:51, Valentin Schneider wrote: > + cr4 = this_cpu_read(cpu_tlbstate.cr4); > + asm volatile("mov %0,%%cr4": : "r" (cr4 ^ X86_CR4_PGE) : "memory"); > + asm volatile("mov %0,%%cr4": : "r" (cr4) : "memory"); > + /* > + * In lieu of not having the pinning crap, hard fail if CR4 doesn't > + * match the expected value. This ensures that anybody doing dodgy gets > + * the fallthrough check. > + */ > + BUG_ON(cr4 != this_cpu_read(cpu_tlbstate.cr4)); Let's say someone managed to write to cpu_tlbstate.cr4 where they cleared one of the pinned bits. Before this patch, CR4 pinning would WARN_ONCE() about it pretty quickly and also reset the cleared bits. After this patch, the first native_flush_tlb_global() can clear pinned bits, at least until native_write_cr4() gets called the next time. That seems like it'll undermine CR4 pinning at least somewhat. What keeps native_write_cr4() from being noinstr-compliant now? Is it just the WARN_ONCE()? If so, I'd kinda rather have a native_write_cr4_nowarn() that's noinstr-compliant but retains all the other CR4 pinning behavior. Would something like the attached patch be _worse_?diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 3e9037690814..2044d516f06f 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -423,24 +423,40 @@ void native_write_cr0(unsigned long val) } EXPORT_SYMBOL(native_write_cr0); -void __no_profile native_write_cr4(unsigned long val) +void __no_profile __native_write_cr4(unsigned long val, unsigned long *bits_changed) { - unsigned long bits_changed = 0; - set_register: asm volatile("mov %0,%%cr4": "+r" (val) : : "memory"); if (static_branch_likely(&cr_pinning)) { if (unlikely((val & cr4_pinned_mask) != cr4_pinned_bits)) { - bits_changed = (val & cr4_pinned_mask) ^ cr4_pinned_bits; + *bits_changed = (val & cr4_pinned_mask) ^ cr4_pinned_bits; val = (val & ~cr4_pinned_mask) | cr4_pinned_bits; goto set_register; } - /* Warn after we've corrected the changed bits. */ - WARN_ONCE(bits_changed, "pinned CR4 bits changed: 0x%lx!?\n", - bits_changed); } } + +void __no_profile native_write_cr4(unsigned long val) +{ + unsigned long bits_changed = 0; + + __native_write_cr4(val, &bits_changed); + + if (!bits_changed) + return + + WARN_ONCE(bits_changed, "pinned CR4 bits changed: 0x%lx!?\n", + bits_changed); +} + +void __no_profile native_write_cr4_nowarn(unsigned long val) +{ + unsigned long bits_changed = 0; + + __native_write_cr4(val, &bits_changed); +} + #if IS_MODULE(CONFIG_LKDTM) EXPORT_SYMBOL_GPL(native_write_cr4); #endif
[PATCH v4 24/30] context-tracking: Introduce work deferral infrastructure
smp_call_function() & friends have the unfortunate habit of sending IPIs to isolated, NOHZ_FULL, in-userspace CPUs, as they blindly target all online CPUs. Some callsites can be bent into doing the right, such as done by commit: cc9e303c91f5 ("x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs") Unfortunately, not all SMP callbacks can be omitted in this fashion. However, some of them only affect execution in kernelspace, which means they don't have to be executed *immediately* if the target CPU is in userspace: stashing the callback and executing it upon the next kernel entry would suffice. x86 kernel instruction patching or kernel TLB invalidation are prime examples of it. Reduce the RCU dynticks counter width to free up some bits to be used as a deferred callback bitmask. Add some build-time checks to validate that setup. Presence of CT_STATE_KERNEL in the ct_state prevents queuing deferred work. Later commits introduce the bit:callback mappings. Link: https://lore.kernel.org/all/20210929151723.162004...@infradead.org/ Signed-off-by: Nicolas Saenz Julienne Signed-off-by: Valentin Schneider --- arch/Kconfig | 9 +++ arch/x86/Kconfig | 1 + arch/x86/include/asm/context_tracking_work.h | 16 + include/linux/context_tracking.h | 21 +++ include/linux/context_tracking_state.h | 30 ++--- include/linux/context_tracking_work.h| 26 kernel/context_tracking.c| 66 +++- kernel/time/Kconfig | 5 ++ 8 files changed, 162 insertions(+), 12 deletions(-) create mode 100644 arch/x86/include/asm/context_tracking_work.h create mode 100644 include/linux/context_tracking_work.h diff --git a/arch/Kconfig b/arch/Kconfig index 6682b2a53e342..b637f20f0fc68 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -952,6 +952,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK - No use of instrumentation, unless instrumentation_begin() got called. +config HAVE_CONTEXT_TRACKING_WORK + bool + help + Architecture supports deferring work while not in kernel context. + This is especially useful on setups with isolated CPUs that might + want to avoid being interrupted to perform housekeeping tasks (for + ex. TLB invalidation or icache invalidation). The housekeeping + operations are performed upon re-entering the kernel. + config HAVE_TIF_NOHZ bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 9d7bd0ae48c42..ca5dd4cfc354b 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -216,6 +216,7 @@ config X86 select HAVE_CMPXCHG_LOCAL select HAVE_CONTEXT_TRACKING_USER if X86_64 select HAVE_CONTEXT_TRACKING_USER_OFFSTACK if HAVE_CONTEXT_TRACKING_USER + select HAVE_CONTEXT_TRACKING_WORK if X86_64 select HAVE_C_RECORDMCOUNT select HAVE_OBJTOOL_MCOUNT if HAVE_OBJTOOL select HAVE_OBJTOOL_NOP_MCOUNT if HAVE_OBJTOOL_MCOUNT diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h new file mode 100644 index 0..5f3b2d0977235 --- /dev/null +++ b/arch/x86/include/asm/context_tracking_work.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H +#define _ASM_X86_CONTEXT_TRACKING_WORK_H + +static __always_inline void arch_context_tracking_work(enum ct_work work) +{ + switch (work) { + case CT_WORK_n: + // Do work... + break; + case CT_WORK_MAX: + WARN_ON_ONCE(true); + } +} + +#endif diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index af9fe87a09225..0b0faa040e9b5 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -5,6 +5,7 @@ #include #include #include +#include #include #include @@ -137,6 +138,26 @@ static __always_inline unsigned long ct_state_inc(int incby) return raw_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state)); } +#ifdef CONFIG_CONTEXT_TRACKING_WORK +static __always_inline unsigned long ct_state_inc_clear_work(int incby) +{ + struct context_tracking *ct = this_cpu_ptr(&context_tracking); + unsigned long new, old, state; + + state = arch_atomic_read(&ct->state); + do { + old = state; + new = old & ~CT_WORK_MASK; + new += incby; + state = arch_atomic_cmpxchg(&ct->state, old, new); + } while (old != state); + + return new; +} +#else +#define ct_state_inc_clear_work(x) ct_state_inc(x) +#endif + static __always_inline bool warn_rcu_enter(void) { bool ret = false; diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h index eb2
[PATCH v4 26/30] x86,tlb: Make __flush_tlb_global() noinstr-compliant
From: Peter Zijlstra Later patches will require issuing a __flush_tlb_all() from noinstr code. This requires making both __flush_tlb_local() and __flush_tlb_global() noinstr-compliant. For __flush_tlb_global(), both native_flush_tlb_global() and xen_flush_tlb() need to be made noinstr. Forgo using __native_flush_tlb_global() / native_write_cr4() and have the ASM directly inlined in the native function. For the Xen stuff, __always_inline a handful of helpers. Not-signed-off-by: Peter Zijlstra [Changelog faff] Signed-off-by: Valentin Schneider --- arch/x86/include/asm/invpcid.h | 13 ++--- arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/xen/hypercall.h | 11 +-- arch/x86/mm/tlb.c| 15 +++ arch/x86/xen/mmu_pv.c| 10 +- arch/x86/xen/xen-ops.h | 12 6 files changed, 40 insertions(+), 23 deletions(-) diff --git a/arch/x86/include/asm/invpcid.h b/arch/x86/include/asm/invpcid.h index 734482afbf81d..ff26136fcd9c6 100644 --- a/arch/x86/include/asm/invpcid.h +++ b/arch/x86/include/asm/invpcid.h @@ -2,7 +2,7 @@ #ifndef _ASM_X86_INVPCID #define _ASM_X86_INVPCID -static inline void __invpcid(unsigned long pcid, unsigned long addr, +static __always_inline void __invpcid(unsigned long pcid, unsigned long addr, unsigned long type) { struct { u64 d[2]; } desc = { { pcid, addr } }; @@ -13,7 +13,7 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr, * mappings, we don't want the compiler to reorder any subsequent * memory accesses before the TLB flush. */ - asm volatile("invpcid %[desc], %[type]" + asm_inline volatile("invpcid %[desc], %[type]" :: [desc] "m" (desc), [type] "r" (type) : "memory"); } @@ -23,26 +23,25 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr, #define INVPCID_TYPE_ALL_NON_GLOBAL3 /* Flush all mappings for a given pcid and addr, not including globals. */ -static inline void invpcid_flush_one(unsigned long pcid, -unsigned long addr) +static __always_inline void invpcid_flush_one(unsigned long pcid, unsigned long addr) { __invpcid(pcid, addr, INVPCID_TYPE_INDIV_ADDR); } /* Flush all mappings for a given PCID, not including globals. */ -static inline void invpcid_flush_single_context(unsigned long pcid) +static __always_inline void invpcid_flush_single_context(unsigned long pcid) { __invpcid(pcid, 0, INVPCID_TYPE_SINGLE_CTXT); } /* Flush all mappings, including globals, for all PCIDs. */ -static inline void invpcid_flush_all(void) +static __always_inline void invpcid_flush_all(void) { __invpcid(0, 0, INVPCID_TYPE_ALL_INCL_GLOBAL); } /* Flush all mappings for all PCIDs except globals. */ -static inline void invpcid_flush_all_nonglobals(void) +static __always_inline void invpcid_flush_all_nonglobals(void) { __invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL); } diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index d4eb9e1d61b8e..b3daee3d46677 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -75,7 +75,7 @@ static inline void __flush_tlb_local(void) PVOP_VCALL0(mmu.flush_tlb_user); } -static inline void __flush_tlb_global(void) +static __always_inline void __flush_tlb_global(void) { PVOP_VCALL0(mmu.flush_tlb_kernel); } diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h index 97771b9d33af3..291e9f8006f62 100644 --- a/arch/x86/include/asm/xen/hypercall.h +++ b/arch/x86/include/asm/xen/hypercall.h @@ -365,8 +365,8 @@ MULTI_mmu_update(struct multicall_entry *mcl, struct mmu_update *req, trace_xen_mc_entry(mcl, 4); } -static inline void -MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count, +static __always_inline void +__MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count, int *success_count, domid_t domid) { mcl->op = __HYPERVISOR_mmuext_op; @@ -374,6 +374,13 @@ MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count, mcl->args[1] = count; mcl->args[2] = (unsigned long)success_count; mcl->args[3] = domid; +} + +static inline void +MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count, + int *success_count, domid_t domid) +{ + __MULTI_mmuext_op(mcl, op, count, success_count, domid); trace_xen_mc_entry(mcl, 4); } diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index a2becb85bea79..2d2ab3e221f0c 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -1169,9 +1169,10 @@ void flush_tlb_one_user(unsigned long addr) /* * Flush everything */ -STATIC_NOPV void native_flush_tlb_global(void) +STATIC_NOPV noinstr void native_flush_tlb_global(void)
Re: [XEN PATCH] MAINTAINERS: Change reviewer of the ECLAIR integration
On Tue, 14 Jan 2025, Simone Ballarin wrote: > On 2025-01-14 16:00, Roger Pau Monné wrote: > > On Tue, Jan 14, 2025 at 03:48:54PM +0100, Nicola Vetrini wrote: > > > Simone Ballarin is no longer actively involved in reviewing > > > the ECLAIR integration for Xen. I am stepping up as a reviewer. > > > > > > Signed-off-by: Nicola Vetrini > > > > Acked-by: Roger Pau Monné > > > > Acked-by: Simone Ballarin Acked-by: Stefano Stabellini
Re: [PATCH for-4.20] automation/gitlab: disable coverage from clang randconfig
On Tue, 14 Jan 2025, Andrew Cooper wrote: > On 14/01/2025 5:43 pm, Roger Pau Monne wrote: > > If randconfig enables coverage support the build times out due to GNU LD > > taking too long. For the time being prevent coverage from being enabled in > > clang randconfig job. > > > > Signed-off-by: Roger Pau Monné > > Acked-by: Andrew Cooper Acked-by: Stefano Stabellini
Re: [PATCH] riscv: Add initial Xen guest support for RISC-V
+Oleksii On Tue, 14 Jan 2025, Milan Djokic wrote: > From: Slavisa Petrovic > > This patch introduces initial support for running RISC-V as a Xen guest. > It provides the necessary infrastructure and stubs for Xen-specific > operations. Key changes include: > > - Modifications to the RISC-V kernel to integrate Xen-specific hypercalls > and interfaces, with function implementations stubbed for future work. > - Introduction of Xen-specific headers for RISC-V, including event > handling, hypervisor interaction, and page management. Functions are > defined but not yet implemented. > - Stub implementations for memory management, grant tables, and context > switching in the Xen environment, allowing further development and > integration. > > Signed-off-by: Milan Djokic > Signed-off-by: Slavisa Petrovic Hi Milan, Slavisa, Thank you very much for your contribution! Which Xen tree are you using for development? I am asking because RISC-V support in Xen is still in the process of being upstreamed, and [1] is the tree that consolidates all patches currently on the to-be-upstreamed list. While the specific Xen tree you are using is not particularly important at this stage, and using [1] is not a requirement, I am asking because it is essential to align on the hypervisor ABI, especially regarding any details that have not yet been upstreamed. Specifically, is there anything in this patch series that relies on features not yet upstream in Xen? [1] https://gitlab.com/xen-project/people/olkur/xen/-/tree/latest?ref_type=heads
RE: [PATCH v1 02/11] xen/x86: introduce new sub-hypercall to get CPPC data
[AMD Official Use Only - AMD Internal Distribution Only] Hi, > -Original Message- > From: Jan Beulich > Sent: Thursday, January 9, 2025 5:46 PM > To: Penny, Zheng > Cc: Stabellini, Stefano ; Huang, Ray > ; Ragiadakou, Xenia ; > Andryuk, Jason ; Andrew Cooper > ; Roger Pau Monné ; Julien > Grall ; Stefano Stabellini ; xen- > de...@lists.xenproject.org > Subject: Re: [PATCH v1 02/11] xen/x86: introduce new sub-hypercall to get CPPC > data > > On 03.12.2024 09:11, Penny Zheng wrote: > > In order to provide backward compatibility with existing governors > > that represent performance as frequencies, like ondemand, the _CPC > > table can optionally provide processor frequency range values, Lowest > > frequency and Norminal frequency, to let OS use Lowest Frequency/ > > Performance and Nominal Frequency/Performance as anchor points to > > create linear mapping of CPPC abstract performance to CPU frequency. > > > > As Xen is uncapable of parsing the ACPI dynamic table, this commit > > introduces a new sub-hypercall to get required CPPC data from > > dom0 kernel. > > "get" as used both here and in the title is, to me, something an entity does > actively. > Xen is entirely passive here, though. (Reading the title I was first assuming > this is > about a sub-op to get certain data out of > Xen.) How about "a new sub-hypercall to pass/deliver CPPC to Xen"? or any better suggestions? > > > --- a/xen/arch/x86/platform_hypercall.c > > +++ b/xen/arch/x86/platform_hypercall.c > > @@ -572,6 +572,12 @@ ret_t do_platform_op( > > break; > > } > > > > +case XEN_PM_CPPC: > > +{ > > +ret = set_cppc_pminfo(op->u.set_pminfo.id, &op- > >u.set_pminfo.u.cppc_data); > > +} > > +break; > > No such unnecessary figure braces please, which - once dropped - will also > call for > "break" to be indented one level deeper. Understood. > > > --- a/xen/arch/x86/x86_64/cpufreq.c > > +++ b/xen/arch/x86/x86_64/cpufreq.c > > @@ -54,3 +54,21 @@ int compat_set_px_pminfo(uint32_t acpi_id, > > > > return set_px_pminfo(acpi_id, xen_perf); } > > + > > +int compat_set_cppc_pminfo(uint32_t acpi_id, > > + struct compat_processor_cppc *cppc_data) { > > +struct xen_processor_cppc *xen_cppc; > > +unsigned long xlat_page_current; > > + > > +xlat_malloc_init(xlat_page_current); > > + > > +xen_cppc = xlat_malloc_array(xlat_page_current, > > +struct xen_processor_cppc, 1); > > +if ( unlikely(xen_cppc == NULL) ) > > +return -EFAULT; > > + > > +XLAT_processor_cppc(xen_cppc, cppc_data); > > + > > +return set_cppc_pminfo(acpi_id, xen_cppc); } > > Why's this needed? The structure - for now at least - consists of only > uint32_t-s, > and hence has identical layout for compat callers. > Understood. Not familiar with the compat framework > > --- a/xen/drivers/cpufreq/cpufreq.c > > +++ b/xen/drivers/cpufreq/cpufreq.c > > @@ -458,6 +458,56 @@ static void print_PPC(unsigned int platform_limit) > > printk("\t_PPC: %d\n", platform_limit); } > > > > +static void print_CPPC(struct xen_processor_cppc *cppc_data) > > Pointer-to-const? > Sure. > > +{ > > +printk("\t_CPC: highest_perf=%u, lowest_perf=%u, " > > + "nominal_perf=%u, lowest_nonlinear_perf=%u, " > > + "nominal_freq=%uMhz, lowest_freq=%uMhz\n", > > + cppc_data->highest_perf, cppc_data->lowest_perf, > > + cppc_data->nominal_perf, cppc_data->lowest_nonlinear_perf, > > + cppc_data->nominal_freq, cppc_data->lowest_freq); } > > + > > +int set_cppc_pminfo(uint32_t acpi_id, struct xen_processor_cppc > > +*cppc_data) > > Pointer-to-const? > Sure. > > +{ > > +int ret = 0, cpuid; > > +struct processor_pminfo *pm_info; > > + > > +cpuid = get_cpu_id(acpi_id); > > +if ( cpuid < 0 || !cppc_data ) > > +{ > > +ret = -EINVAL; > > +goto out; > > +} > > +if ( cpufreq_verbose ) > > +printk("Set CPU acpi_id(%d) cpuid(%d) CPPC State info:\n", > > + acpi_id, cpuid); > > + > > +pm_info = processor_pminfo[cpuid]; > > +if ( !pm_info ) > > +{ > > +pm_info = xzalloc(struct processor_pminfo); > > Please be aware that new code is supposed to be using xvmalloc(). Thanks for the update. > > > +if ( !pm_info ) > > +{ > > +ret = -ENOMEM; > > +goto out; > > +} > > +processor_pminfo[cpuid] = pm_info; > > +} > > +pm_info->acpi_id = acpi_id; > > +pm_info->id = cpuid; > > + > > +memcpy ((void *)&pm_info->cppc_data, > > +(void *)cppc_data, > > +sizeof(struct xen_processor_cppc)); > > What use are these casts? Also please no blank before the opening parenthesis > of > a function call, and please sizeof(*cppc_data). Yet then - why memcpy() in > the first > place? This can be a (type safe) structure assignment, can't it? > Yes,
Re: [PATCH v2 11/25] drm/loongson: Compute dumb-buffer sizes with drm_mode_size_dumb()
Hi, On 2025/1/9 22:57, Thomas Zimmermann wrote: Call drm_mode_size_dumb() to compute dumb-buffer scanline pitch and buffer size. Align the pitch according to hardware requirements. Signed-off-by: Thomas Zimmermann Cc: Sui Jingfeng Cc: Sui Jingfeng Reviewed-by: Sui Jingfeng --- drivers/gpu/drm/loongson/lsdc_gem.c | 29 - 1 file changed, 8 insertions(+), 21 deletions(-) diff --git a/drivers/gpu/drm/loongson/lsdc_gem.c b/drivers/gpu/drm/loongson/lsdc_gem.c index a720d8f53209..9f982b85301f 100644 --- a/drivers/gpu/drm/loongson/lsdc_gem.c +++ b/drivers/gpu/drm/loongson/lsdc_gem.c @@ -6,6 +6,7 @@ #include #include +#include #include #include #include @@ -204,45 +205,31 @@ int lsdc_dumb_create(struct drm_file *file, struct drm_device *ddev, const struct lsdc_desc *descp = ldev->descp; u32 domain = LSDC_GEM_DOMAIN_VRAM; struct drm_gem_object *gobj; - size_t size; - u32 pitch; - u32 handle; int ret; - if (!args->width || !args->height) - return -EINVAL; - - if (args->bpp != 32 && args->bpp != 16) - return -EINVAL; - - pitch = args->width * args->bpp / 8; - pitch = ALIGN(pitch, descp->pitch_align); - size = pitch * args->height; - size = ALIGN(size, PAGE_SIZE); + ret = drm_mode_size_dumb(ddev, args, descp->pitch_align, 0); + if (ret) + return ret; /* Maximum single bo size allowed is the half vram size available */ - if (size > ldev->vram_size / 2) { - drm_err(ddev, "Requesting(%zuMiB) failed\n", size >> 20); + if (args->size > ldev->vram_size / 2) { + drm_err(ddev, "Requesting(%zuMiB) failed\n", (size_t)(args->size >> PAGE_SHIFT)); return -ENOMEM; } - gobj = lsdc_gem_object_create(ddev, domain, size, false, NULL, NULL); + gobj = lsdc_gem_object_create(ddev, domain, args->size, false, NULL, NULL); if (IS_ERR(gobj)) { drm_err(ddev, "Failed to create gem object\n"); return PTR_ERR(gobj); } - ret = drm_gem_handle_create(file, gobj, &handle); + ret = drm_gem_handle_create(file, gobj, &args->handle); /* drop reference from allocate, handle holds it now */ drm_gem_object_put(gobj); if (ret) return ret; - args->pitch = pitch; - args->size = size; - args->handle = handle; - return 0; } -- Best regards, Sui
Re: [PATCH v4 10/30] riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
On Tue, Jan 14, 2025 at 06:51:23PM +0100, Valentin Schneider wrote: > The static call is only ever updated in: > > __init pv_time_init() > __init xen_time_setup_guest() > > so mark it appropriately as __ro_after_init. > > Signed-off-by: Valentin Schneider > --- > arch/riscv/kernel/paravirt.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/riscv/kernel/paravirt.c b/arch/riscv/kernel/paravirt.c > index fa6b0339a65de..dfe8808016fd8 100644 > --- a/arch/riscv/kernel/paravirt.c > +++ b/arch/riscv/kernel/paravirt.c > @@ -30,7 +30,7 @@ static u64 native_steal_clock(int cpu) > return 0; > } > > -DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); > +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock); > > static bool steal_acc = true; > static int __init parse_no_stealacc(char *arg) > -- > 2.43.0 > Reviewed-by: Andrew Jones