[PATCH] xentrace: free CPU mask string before overwriting pointer

2025-01-14 Thread Jan Beulich
While multiple -c options may be unexpected, we'd still better deal with
them properly.

Also restore the blank line that was bogusly zapped by the same commit.

Coverity-ID: 1638723
Fixes: e4ad2836842a ("xentrace: Implement cpu mask range parsing of human values (-c)")
Signed-off-by: Jan Beulich 

--- a/tools/xentrace/xentrace.c
+++ b/tools/xentrace/xentrace.c
@@ -1105,8 +1105,10 @@ static void parse_args(int argc, char **
 break;
 
 case 'c': /* set new cpu mask for filtering (when xch is set). */
+free(opts.cpu_mask_str);
 opts.cpu_mask_str = strdup(optarg);
 break;
+
 case 'e': /* set new event mask for filtering*/
 parse_evtmask(optarg);
 break;
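
In isolation, the pattern being fixed looks like this (a minimal standalone
sketch, not the actual xentrace code):

    #include <stdlib.h>
    #include <string.h>

    static char *cpu_mask_str;

    /* Called once per -c occurrence: free the previous copy before taking a
     * new one; free(NULL) is a no-op, so the first -c is fine too. */
    static void set_cpu_mask_str(const char *arg)
    {
        free(cpu_mask_str);
        cpu_mask_str = strdup(arg);
    }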



[PATCH] xl: properly dispose of libxl_dominfo struct instances

2025-01-14 Thread Jan Beulich
The ssid_label field requires separate freeing; make sure to call
libxl_dominfo_dispose(). And then, for good measure, also
libxl_dominfo_init().

Coverity-ID: 1638727
Coverity-ID: 1638728
Fixes: c458c404da16 ("xl: use libxl_domain_info to get the uuid in printf_info")
Fixes: 48dab9767d2e ("tools/xl: use libxl_domain_info to get domain type for vcpu-pin")
Signed-off-by: Jan Beulich 
---
I wasn't quite sure about use of libxl_dominfo_init(): vcpuset(), for
example, doesn't call it.

--- a/tools/xl/xl_sxp.c
+++ b/tools/xl/xl_sxp.c
@@ -45,8 +45,10 @@ void printf_info_sexp(int domid, libxl_d
 /* retrieve the UUID from dominfo, since it is probably generated
  * during parsing and thus does not match the real one
  */
+libxl_dominfo_init(&info);
 if (libxl_domain_info(ctx, &info, domid) == 0) {
 fprintf(fh, "\t(uuid " LIBXL_UUID_FMT ")\n", LIBXL_UUID_BYTES(info.uuid));
+libxl_dominfo_dispose(&info);
 } else {
 fprintf(fh, "\t(uuid )\n");
 }
--- a/tools/xl/xl_vcpu.c
+++ b/tools/xl/xl_vcpu.c
@@ -286,6 +286,8 @@ int main_vcpupin(int argc, char **argv)
 if (!ignore_masks && hard) {
 libxl_dominfo dominfo;
 
+libxl_dominfo_init(&dominfo);
+
 if (libxl_domain_info(ctx, &dominfo, domid)) {
 fprintf(stderr, "Could not get domain info\n");
 goto out;
@@ -293,6 +295,8 @@ int main_vcpupin(int argc, char **argv)
 
 /* HVM and PVH domains use the same global affinity mask */
 apply_global_affinity_masks(dominfo.domain_type, hard, 1);
+
+libxl_dominfo_dispose(&dominfo);
 }
 
 if (force) {



[PATCH] xl: properly dispose of vTPM struct instance

2025-01-14 Thread Jan Beulich
The backend_domname field requires separate freeing; make sure to call
libxl_device_vtpm_dispose() also on respective error paths.

Coverity-ID: 1638719
Fixes: dde22055ac3a ("libxl: add vtpm support")
Signed-off-by: Jan Beulich 

--- a/tools/xl/xl_vtpm.c
+++ b/tools/xl/xl_vtpm.c
@@ -44,12 +44,14 @@ int main_vtpmattach(int argc, char **arg
 if (MATCH_OPTION("uuid", *argv, oparg)) {
 if(libxl_uuid_from_string(&(vtpm.uuid), oparg)) {
 fprintf(stderr, "Invalid uuid specified (%s)\n", oparg);
+libxl_device_vtpm_dispose(&vtpm);
 return 1;
 }
 } else if (MATCH_OPTION("backend", *argv, oparg)) {
 replace_string(&vtpm.backend_domname, oparg);
 } else {
 fprintf(stderr, "unrecognized argument `%s'\n", *argv);
+libxl_device_vtpm_dispose(&vtpm);
 return 1;
 }
 }
@@ -65,6 +67,7 @@ int main_vtpmattach(int argc, char **arg
 
 if (libxl_device_vtpm_add(ctx, domid, &vtpm, 0)) {
 fprintf(stderr, "libxl_device_vtpm_add failed.\n");
+libxl_device_vtpm_dispose(&vtpm);
 return 1;
 }
 libxl_device_vtpm_dispose(&vtpm);
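
An alternative that avoids adding a dispose call to every error path is a
single exit point (a sketch only, reusing the same libxl calls; not what this
patch does):

    int rc = 1;

    /* ... option parsing as above, with "goto out" on errors ... */
    if (libxl_device_vtpm_add(ctx, domid, &vtpm, 0)) {
        fprintf(stderr, "libxl_device_vtpm_add failed.\n");
        goto out;
    }
    rc = 0;
 out:
    libxl_device_vtpm_dispose(&vtpm);
    return rc;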



Re: [PATCH v2 1/2] x86/uaccess: rework user access speculative harden guards

2025-01-14 Thread Nicola Vetrini

On 2025-01-10 09:56, Nicola Vetrini wrote:

On 2025-01-10 09:29, Roger Pau Monné wrote:

On Thu, Jan 09, 2025 at 03:57:24PM -0800, Stefano Stabellini wrote:

On Thu, 9 Jan 2025, Nicola Vetrini wrote:
> On 2025-01-04 01:20, Stefano Stabellini wrote:
> > Hi Nicola, one question below



>
> I will update ECLAIR to treat the two forms as the same, so this patch can be
> dropped. If you think it's helpful I can send a patch spelling out this -
> arbitrary, but reasonable in my opinion - extension to the MISRA rule (which
> does not consider the implications related to the use of GNU extensions) so
> that contributors have a clear picture of the situation.

Thank you Nicola! Yes the patch would be appreciated :-)


So unless the proposed adjustment is considered better for code
readability, patch 1 can be dropped, and patch 2 could be applied after
the ECLAIR change is in effect?



Yes, exactly


How long will it take Nicola to get the ECLAIR change propagated into
the Gitlab runner?

Thanks, Roger.


We're still fixing the false positive upstream, but it shouldn't take 
too long so I think next week I should be able to refresh the runner.


Hi Roger,

the runner is updated so, assuming no new violation of Rule 20.7 
appeared in the meantime, the rule should now be clean. In the next few 
days I'll prepare a patch to the docs to document the behaviour.


Thanks,
 Nicola

--
Nicola Vetrini, B.Sc.
Software Engineer
BUGSENG (https://bugseng.com)
LinkedIn: https://www.linkedin.com/in/nicola-vetrini-a42471253



Re: [PATCH] xl: properly dispose of libxl_dominfo struct instances

2025-01-14 Thread Andrew Cooper
On 14/01/2025 8:12 am, Jan Beulich wrote:
> The ssid_label field requires separate freeing; make sure to call
> libxl_dominfo_dispose(). And then, for good measure, also
> libxl_dominfo_init().
>
> Coverity-ID: 1638727
> Coverity-ID: 1638728
> Fixes: c458c404da16 ("xl: use libxl_domain_info to get the uuid in printf_info")
> Fixes: 48dab9767d2e ("tools/xl: use libxl_domain_info to get domain type for vcpu-pin")
> Signed-off-by: Jan Beulich 
> ---
> I wasn't quite sure about use of libxl_dominfo_init(): vcpuset(), for
> example, doesn't call it.

It's a written requirement (somewhere) that *_init() and *_dispose() do
get called.
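
For instance, the expected shape is (a minimal sketch reusing the calls from
this patch):

    libxl_dominfo info;

    libxl_dominfo_init(&info);
    if (libxl_domain_info(ctx, &info, domid) == 0) {
        /* ... use info ... */
    }
    libxl_dominfo_dispose(&info);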

Except everyone's lazy with them and plenty of scenarios function
without, which is why it's often Coverity telling us about it (but only
for the instances where there's a real malloc()/free() gone missing).

I expect it would be better to extend this patch to fix up vcpuset()
too.  The changes so far LGTM.

~Andrew



Re: [PATCH] xentrace: free CPU mask string before overwriting pointer

2025-01-14 Thread Andrew Cooper
On 14/01/2025 8:12 am, Jan Beulich wrote:
> While multiple -c options may be unexpected, we'd still better deal with
> them properly.
>
> Also restore the blank line that was bogusly zapped by the same commit.
>
> Coverity-ID: 1638723
> Fixes: e4ad2836842a ("xentrace: Implement cpu mask range parsing of human values (-c)")
> Signed-off-by: Jan Beulich 

Reviewed-by: Andrew Cooper 



Re: [PATCH] xl: properly dispose of libxl_dominfo struct instances

2025-01-14 Thread Oleksii Kurochko


On 1/14/25 9:12 AM, Jan Beulich wrote:

The ssid_label field requires separate freeing; make sure to call
libxl_dominfo_dispose(). And then, for good measure, also
libxl_dominfo_init().

Coverity-ID: 1638727
Coverity-ID: 1638728
Fixes: c458c404da16 ("xl: use libxl_domain_info to get the uuid in printf_info")
Fixes: 48dab9767d2e ("tools/xl: use libxl_domain_info to get domain type for vcpu-pin")
Signed-off-by: Jan Beulich


Release-Acked-by: Oleksii Kurochko

Thanks.

~ Oleksii


---
I wasn't quite sure about use of libxl_dominfo_init(): vcpuset(), for
example, doesn't call it.

--- a/tools/xl/xl_sxp.c
+++ b/tools/xl/xl_sxp.c
@@ -45,8 +45,10 @@ void printf_info_sexp(int domid, libxl_d
  /* retrieve the UUID from dominfo, since it is probably generated
   * during parsing and thus does not match the real one
   */
+libxl_dominfo_init(&info);
  if (libxl_domain_info(ctx, &info, domid) == 0) {
  fprintf(fh, "\t(uuid " LIBXL_UUID_FMT ")\n", LIBXL_UUID_BYTES(info.uuid));
+libxl_dominfo_dispose(&info);
  } else {
  fprintf(fh, "\t(uuid )\n");
  }
--- a/tools/xl/xl_vcpu.c
+++ b/tools/xl/xl_vcpu.c
@@ -286,6 +286,8 @@ int main_vcpupin(int argc, char **argv)
  if (!ignore_masks && hard) {
  libxl_dominfo dominfo;
  
+libxl_dominfo_init(&dominfo);
+
  if (libxl_domain_info(ctx, &dominfo, domid)) {
  fprintf(stderr, "Could not get domain info\n");
  goto out;
@@ -293,6 +295,8 @@ int main_vcpupin(int argc, char **argv)
  
  /* HVM and PVH domains use the same global affinity mask */
  apply_global_affinity_masks(dominfo.domain_type, hard, 1);
+
+libxl_dominfo_dispose(&dominfo);
  }
  
  if (force) {

Re: [PATCH v2 2/2] automation/eclair: make Misra rule 20.7 blocking for x86 also

2025-01-14 Thread Roger Pau Monné
Hello Oleksii,

This is in principle ready to go in now (I'm currently running a
private Eclair scan to ensure the patch is still OK against current
staging).  I would like to ask for a release Ack.

Thanks, Roger.

On Tue, Nov 26, 2024 at 10:35:08AM +0100, Roger Pau Monne wrote:
> There are no violations left, make the rule globally blocking for both x86 and
> ARM.
> 
> Signed-off-by: Roger Pau Monné 
> Reviewed-by: Andrew Cooper 
> ---
>  automation/eclair_analysis/ECLAIR/tagging.ecl | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/automation/eclair_analysis/ECLAIR/tagging.ecl b/automation/eclair_analysis/ECLAIR/tagging.ecl
> index 755ea3271fc9..cb4e233e838d 100644
> --- a/automation/eclair_analysis/ECLAIR/tagging.ecl
> +++ b/automation/eclair_analysis/ECLAIR/tagging.ecl
> @@ -80,6 +80,7 @@ MC3R1.R20.2||
>  MC3R1.R20.3||
>  MC3R1.R20.4||
>  MC3R1.R20.6||
> +MC3R1.R20.7||
>  MC3R1.R20.9||
>  MC3R1.R20.11||
>  MC3R1.R20.12||
> @@ -116,7 +117,7 @@ if(string_equal(target,"x86_64"),
>  )
>  
>  if(string_equal(target,"arm64"),
> -service_selector({"additional_clean_guidelines","MC3R1.R2.1||MC3R1.R5.3||MC3.R11.2||MC3R1.R16.6||MC3R1.R20.7"})
> +service_selector({"additional_clean_guidelines","MC3R1.R2.1||MC3R1.R5.3||MC3.R11.2||MC3R1.R16.6"})
>  )
>  
>  
> -reports+={clean:added,"service(clean_guidelines_common||additional_clean_guidelines)"}
> -- 
> 2.46.0
> 



Re: [PATCH v2 2/2] automation/eclair: make Misra rule 20.7 blocking for x86 also

2025-01-14 Thread Nicola Vetrini

On 2025-01-14 12:22, Roger Pau Monné wrote:

Hello Oleksii,

This is in principle ready to go in now (I'm currently running a
private Eclair scan to ensure the patch is still OK against current
staging).  I would like to ask for a release Ack.



One nit below, which I overlooked initially


Thanks, Roger.

On Tue, Nov 26, 2024 at 10:35:08AM +0100, Roger Pau Monne wrote:
There are no violations left, make the rule globally blocking for both
x86 and ARM.

Signed-off-by: Roger Pau Monné 
Reviewed-by: Andrew Cooper 
---
 automation/eclair_analysis/ECLAIR/tagging.ecl | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/automation/eclair_analysis/ECLAIR/tagging.ecl b/automation/eclair_analysis/ECLAIR/tagging.ecl

index 755ea3271fc9..cb4e233e838d 100644
--- a/automation/eclair_analysis/ECLAIR/tagging.ecl
+++ b/automation/eclair_analysis/ECLAIR/tagging.ecl
@@ -80,6 +80,7 @@ MC3R1.R20.2||
 MC3R1.R20.3||
 MC3R1.R20.4||
 MC3R1.R20.6||
+MC3R1.R20.7||
 MC3R1.R20.9||
 MC3R1.R20.11||
 MC3R1.R20.12||
@@ -116,7 +117,7 @@ if(string_equal(target,"x86_64"),
 )


this hunk will not apply because it uses MC3R1, rather than MC3R2. 
Should be an easy fix.




 if(string_equal(target,"arm64"),
-service_selector({"additional_clean_guidelines","MC3R1.R2.1||MC3R1.R5.3||MC3.R11.2||MC3R1.R16.6||MC3R1.R20.7"})
+service_selector({"additional_clean_guidelines","MC3R1.R2.1||MC3R1.R5.3||MC3.R11.2||MC3R1.R16.6"})

 )


here as well



 
-reports+={clean:added,"service(clean_guidelines_common||additional_clean_guidelines)"}

--
2.46.0



--
Nicola Vetrini, B.Sc.
Software Engineer
BUGSENG (https://bugseng.com)
LinkedIn: https://www.linkedin.com/in/nicola-vetrini-a42471253



Re: [PATCH] xentrace: free CPU mask string before overwriting pointer

2025-01-14 Thread Oleksii Kurochko


On 1/14/25 9:12 AM, Jan Beulich wrote:

While multiple -c options may be unexpected, we'd still better deal with
them properly.

Also restore the blank line that was bogusly zapped by the same commit.

Coverity-ID: 1638723
Fixes: e4ad2836842a ("xentrace: Implement cpu mask range parsing of human values (-c)")
Signed-off-by: Jan Beulich


R-Acked-by: Oleksii Kurochko

Thanks.

~ Oleksii



--- a/tools/xentrace/xentrace.c
+++ b/tools/xentrace/xentrace.c
@@ -1105,8 +1105,10 @@ static void parse_args(int argc, char **
  break;
  
  case 'c': /* set new cpu mask for filtering (when xch is set). */
+free(opts.cpu_mask_str);
  opts.cpu_mask_str = strdup(optarg);
  break;
+
  case 'e': /* set new event mask for filtering*/
  parse_evtmask(optarg);
  break;

Re: [PATCH] xl: properly dispose of vTPM struct instance

2025-01-14 Thread Oleksii Kurochko



On 1/14/25 9:13 AM, Jan Beulich wrote:

The backend_domname field requires separate freeing; make sure to call
libxl_device_vtpm_dispose() also on respective error paths.

Coverity-ID: 1638719
Fixes: dde22055ac3a ("libxl: add vtpm support")
Signed-off-by: Jan Beulich 


R-Acked-By: Oleksii Kurochko 

Thanks.

~ Oleksii



--- a/tools/xl/xl_vtpm.c
+++ b/tools/xl/xl_vtpm.c
@@ -44,12 +44,14 @@ int main_vtpmattach(int argc, char **arg
  if (MATCH_OPTION("uuid", *argv, oparg)) {
  if(libxl_uuid_from_string(&(vtpm.uuid), oparg)) {
  fprintf(stderr, "Invalid uuid specified (%s)\n", oparg);
+libxl_device_vtpm_dispose(&vtpm);
  return 1;
  }
  } else if (MATCH_OPTION("backend", *argv, oparg)) {
  replace_string(&vtpm.backend_domname, oparg);
  } else {
  fprintf(stderr, "unrecognized argument `%s'\n", *argv);
+libxl_device_vtpm_dispose(&vtpm);
  return 1;
  }
  }
@@ -65,6 +67,7 @@ int main_vtpmattach(int argc, char **arg
  
  if (libxl_device_vtpm_add(ctx, domid, &vtpm, 0)) {
  fprintf(stderr, "libxl_device_vtpm_add failed.\n");
+libxl_device_vtpm_dispose(&vtpm);
  return 1;
  }
  libxl_device_vtpm_dispose(&vtpm);




Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock

2025-01-14 Thread Roger Pau Monné
On Tue, Jan 14, 2025 at 12:12:03PM +0100, Oleksii Kurochko wrote:
> 
> On 1/13/25 6:52 PM, Roger Pau Monné wrote:
> > On Mon, Jan 13, 2025 at 05:07:55PM +0100, Marek Marczykowski-Górecki wrote:
> > > On Fri, Sep 13, 2024 at 09:59:06AM +0200, Roger Pau Monne wrote:
> > > > Allow setting the used wallclock from the command line.  When the 
> > > > option is set
> > > > to a value different than `auto` the probing is bypassed and the 
> > > > selected
> > > > implementation is used (as long as it's available).
> > > > 
> > > > The `xen` and `efi` options require being booted as a Xen guest (with 
> > > > Xen guest
> > > > supported built-in) or from UEFI firmware respectively.
> > > > 
> > > > Signed-off-by: Roger Pau Monné
> > > Reviewed-by: Marek Marczykowski-Górecki
> > Thanks for the review.
> > 
> > Oleksii, can I get your opinion as Release Manager about whether this
> > (and the following patch) would be suitable for committing to staging
> > given the current release state?
> > 
> > It's a workaround for broken EFI implementations that many downstreams
> > carry on their patch queue.
> 
> Based on your commit message, I understand this as addressing a bug (but not
> very critical, as IIUC downstreams have a similar patch on their side).
> Therefore, if it has been properly reviewed and tested, we should consider
> including it in the current release.

IIRC at least Qubes, XenServer and XCP-ng have a patch that achieves
the same behavior as proposed here.

> IIUC, setting the wallclock to EFI should align with the behavior Xen had
> previously. It might be preferable to use that argument as the default for a
> while, allowing others to verify the "auto" value over time. After that, we
> could consider making "auto" the default. That said, I am not particularly
> strict about following this approach.

We cannot really set efi as the default, as it would break when
booting on legacy BIOS systems.

We could take only patch 1 and leave patch 2 after Xen 4.20 has
branched, but at that point I would see little benefit in having just
patch 1.

I don't have a strong opinion, but downstreams have been complaining
about Xen behavior regarding the usage of EFI_GET_TIME, so it might be
good to not ship yet another release with such allegedly broken
behavior.

Let me know what you think, as I would need a formal Release-Ack if
this is to be committed.

Thanks, Roger.

> ~ Oleksii
> 
> 
> > 
> > Thanks, Roger.



Re: [PATCH] x86/boot: Handle better alignment for 32 bit code

2025-01-14 Thread Jan Beulich
On 14.01.2025 10:59, Frediano Ziglio wrote:
> Output file didn't have correct alignment.
> Allows alignment into data or code up to 2mb.
> 
> Signed-off-by: Frediano Ziglio 

Afaic this is way too little of a description. For example, ...

> --- a/xen/arch/x86/boot/Makefile
> +++ b/xen/arch/x86/boot/Makefile
> @@ -40,8 +40,8 @@ LD32 := $(LD) $(subst x86_64,i386,$(LDFLAGS_DIRECT))
>  # are affected by both text_diff and text_gap.  Ensure the sum of gap and 
> diff
>  # is greater than 2^16 so that any 16bit relocations if present in the object
>  # file turns into a build-time error.
> -text_gap := 0x010200
> -text_diff := 0x408020
> +text_gap := 0x01c240
> +text_diff := 0x7e3dc0

... how is anyone to derive how we ended up with these magic numbers?
What parts of them are arbitrary, and what parts are required to be the
way they are?

> @@ -69,7 +69,6 @@ $(obj)/built-in-32.%.bin: $(obj)/build32.%.lds 
> $(obj)/built-in-32.tmp.o
>   $(LD32) $(orphan-handling-y) -N -T $< -o $(@:bin=o) $(filter %.o,$^)
>   $(NM) -p --format=bsd $(@:bin=o) > $(@:bin=map)
>   $(OBJCOPY) -j .text -O binary $(@:bin=o) $@
> - rm -f $(@:bin=o)

This looks like an unrelated change. It may be okay to have here, but
then it needs mentioning in the description.

> @@ -80,6 +79,7 @@ cmd_combine = \
>--bin1  $(obj)/built-in-32.base.bin \
>--bin2  $(obj)/built-in-32.offset.bin \
>--map   $(obj)/built-in-32.base.map \
> +  --align $(shell $(OBJDUMP) -h $(obj)/built-in-32.base.o|sed '/text.*2\*\*/ {s/.*2\*\*//;p;}; d') \
>--exports   cmdline_parse_early,reloc,reloc_trampoline32 \
>--output$@
>  
> @@ -90,4 +90,4 @@ $(obj)/built-in-32.S: $(obj)/built-in-32.base.bin 
> $(obj)/built-in-32.offset.bin
>$(srctree)/tools/combine_two_binaries.py FORCE
>   $(call if_changed,combine)
>  
> -clean-files := built-in-32.*.bin built-in-32.*.map build32.*.lds
> +clean-files := built-in-32.*.bin built-in-32.*.map built-in-32.*.o build32.*.lds

This looks like it would have been needed already before, if the build
process was interrupted before the "rm" that you remove above.

Jan



Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock

2025-01-14 Thread Oleksii Kurochko


On 1/14/25 12:27 PM, Roger Pau Monné wrote:

On Tue, Jan 14, 2025 at 12:12:03PM +0100, Oleksii Kurochko wrote:

On 1/13/25 6:52 PM, Roger Pau Monné wrote:

On Mon, Jan 13, 2025 at 05:07:55PM +0100, Marek Marczykowski-Górecki wrote:

On Fri, Sep 13, 2024 at 09:59:06AM +0200, Roger Pau Monne wrote:

Allow setting the used wallclock from the command line.  When the option is set
to a value different than `auto` the probing is bypassed and the selected
implementation is used (as long as it's available).

The `xen` and `efi` options require being booted as a Xen guest (with Xen guest
supported built-in) or from UEFI firmware respectively.

Signed-off-by: Roger Pau Monné

Reviewed-by: Marek Marczykowski-Górecki

Thanks for the review.

Oleksii, can I get your opinion as Release Manager about whether this
(and the following patch) would be suitable for committing to staging
given the current release state?

It's a workaround for broken EFI implementations that many downstreams
carry on their patch queue.

Based on your commit message, I understand this as addressing a bug (but not
very critical, as IIUC downstreams have a similar patch on their side).
Therefore, if it has been properly reviewed and tested, we should consider
including it in the current release.

IIRC at least Qubes, XenServer and XCP-ng have a patch that achieves
the same behavior as proposed here.


IIUC, setting the wallclock to EFI should align with the behavior Xen had
previously. It might be preferable to use that argument as the default for a
while, allowing others to verify the "auto" value over time. After that, we
could consider making "auto" the default. That said, I am not particularly
strict about following this approach.

We cannot really set efi as the default, as it would break when
booting on legacy BIOS systems.

We could take only patch 1 and leave patch 2 after Xen 4.20 has
branched, but at that point I would see little benefit in having just
patch 1.


I don't see a lot of benefit in committing only the one patch either.




I don't have a strong opinion, but downstreams have been complaining
about Xen behavior regarding the usage of EFI_GET_TIME, so it might be
good to not ship yet another release with such allegedly broken
behavior.


Agree with that. As I mentioned above, I consider it a bug, and since several
of the downstreams mentioned above carry a similar patch, I'd consider the
approach tested, so ..



Let me know what you think, as I would need a formal Release-Ack if
this is to be committed.


... R-Acked-by: Oleksii Kurochko.

Thanks.

~ Oleksii



Thanks, Roger.


~ Oleksii



Thanks, Roger.

Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock

2025-01-14 Thread Oleksii Kurochko


On 1/14/25 12:40 PM, Oleksii Kurochko wrote:



On 1/14/25 12:27 PM, Roger Pau Monné wrote:

On Tue, Jan 14, 2025 at 12:12:03PM +0100, Oleksii Kurochko wrote:

On 1/13/25 6:52 PM, Roger Pau Monné wrote:

On Mon, Jan 13, 2025 at 05:07:55PM +0100, Marek Marczykowski-Górecki wrote:

On Fri, Sep 13, 2024 at 09:59:06AM +0200, Roger Pau Monne wrote:

Allow setting the used wallclock from the command line.  When the option is set
to a value different than `auto` the probing is bypassed and the selected
implementation is used (as long as it's available).

The `xen` and `efi` options require being booted as a Xen guest (with Xen guest
supported built-in) or from UEFI firmware respectively.

Signed-off-by: Roger Pau Monné

Reviewed-by: Marek Marczykowski-Górecki

Thanks for the review.

Oleksii, can I get your opinion as Release Manager about whether this
(and the following patch) would be suitable for committing to staging
given the current release state?

It's a workaround for broken EFI implementations that many downstreams
carry on their patch queue.

Based on your commit message, I understand this as addressing a bug (but not
very critical, as IIUC downstreams have a similar patch on their side).
Therefore, if it has been properly reviewed and tested, we should consider
including it in the current release.

IIRC at least Qubes, XenServer and XCP-ng have a patch that achieves
the same behavior as proposed here.


IIUC, setting the wallclock to EFI should align with the behavior Xen had
previously. It might be preferable to use that argument as the default for a
while, allowing others to verify the "auto" value over time. After that, we
could consider making "auto" the default. That said, I am not particularly
strict about following this approach.

We cannot really set efi as the default, as it would break when
booting on legacy BIOS systems.

We could take only patch 1 and leave patch 2 after Xen 4.20 has
branched, but at that point I would see little benefit in having just
patch 1.

I don't see a lot of benefit in committing only the one patch either.



I don't have a strong opinion, but downstreams have been complaining
about Xen behavior regarding the usage of EFI_GET_TIME, so it might be
good to not ship yet another release with such allegedly broken
behavior.

Agree with that. As I mentioned above, I consider it a bug, and since several
of the downstreams mentioned above carry a similar patch, I'd consider the
approach tested, so ..

Let me know what you think, as I would need a formal Release-Ack if
this is to be committed.

... R-Acked-by: Oleksii Kurochko.


Sorry for the noise.

I forgot to add that IMO it would also be nice to mention in the CHANGELOG
that we now have the wallclock parameter, so downstreams will receive
"notification" that Xen provides a workaround for the issue mentioned in the
patch series.

Thanks.

~ Oleksii



Thanks.
~ Oleksii

Thanks, Roger.


~ Oleksii



Thanks, Roger.

Re: [PATCH v2 2/2] automation/eclair: make Misra rule 20.7 blocking for x86 also

2025-01-14 Thread Oleksii Kurochko

Hi Roger,

On 1/14/25 12:22 PM, Roger Pau Monné wrote:

Hello Oleksii,

This is in principle ready to go in now (I'm currently running a
private Eclair scan to ensure the patch is still OK against current
staging).  I would like to ask for a release Ack.


R-Acked-by: Oleksii Kurochko 

Thanks.

~ Oleksii




Thanks, Roger.

On Tue, Nov 26, 2024 at 10:35:08AM +0100, Roger Pau Monne wrote:

There are no violations left, make the rule globally blocking for both x86 and
ARM.

Signed-off-by: Roger Pau Monné 
Reviewed-by: Andrew Cooper 
---
  automation/eclair_analysis/ECLAIR/tagging.ecl | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/automation/eclair_analysis/ECLAIR/tagging.ecl b/automation/eclair_analysis/ECLAIR/tagging.ecl
index 755ea3271fc9..cb4e233e838d 100644
--- a/automation/eclair_analysis/ECLAIR/tagging.ecl
+++ b/automation/eclair_analysis/ECLAIR/tagging.ecl
@@ -80,6 +80,7 @@ MC3R1.R20.2||
  MC3R1.R20.3||
  MC3R1.R20.4||
  MC3R1.R20.6||
+MC3R1.R20.7||
  MC3R1.R20.9||
  MC3R1.R20.11||
  MC3R1.R20.12||
@@ -116,7 +117,7 @@ if(string_equal(target,"x86_64"),
  )
  
  if(string_equal(target,"arm64"),

-service_selector({"additional_clean_guidelines","MC3R1.R2.1||MC3R1.R5.3||MC3.R11.2||MC3R1.R16.6||MC3R1.R20.7"})
+service_selector({"additional_clean_guidelines","MC3R1.R2.1||MC3R1.R5.3||MC3.R11.2||MC3R1.R16.6"})
  )
  
  -reports+={clean:added,"service(clean_guidelines_common||additional_clean_guidelines)"}

--
2.46.0





[PATCH v2] x86/boot: Handle better alignment for 32 bit code

2025-01-14 Thread Frediano Ziglio
The output file didn't have the correct alignment.
Allow alignment of data or code up to 2 MiB.
Intermediate object files are kept in order to copy the alignment
from the object produced by the linker to the final object (produced
by the combine_two_binaries.py script).

Signed-off-by: Frediano Ziglio 
---
 xen/arch/x86/boot/Makefile| 12 
 xen/tools/combine_two_binaries.py |  7 ++-
 2 files changed, 14 insertions(+), 5 deletions(-)

Changes since v1:
- Improve comments and description.
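
As a quick sanity check of the new constants: text_gap + text_diff =
0x01c240 + 0x7e3dc0 = 0x800000 (8 MiB), which is greater than 2^16 (so 16bit
relocations still turn into a build-time error) and a multiple of the maximum
supported 2 MiB (0x200000) alignment.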

diff --git a/xen/arch/x86/boot/Makefile b/xen/arch/x86/boot/Makefile
index 13d4583173..a56d8a7e0f 100644
--- a/xen/arch/x86/boot/Makefile
+++ b/xen/arch/x86/boot/Makefile
@@ -40,8 +40,12 @@ LD32 := $(LD) $(subst x86_64,i386,$(LDFLAGS_DIRECT))
 # are affected by both text_diff and text_gap.  Ensure the sum of gap and diff
 # is greater than 2^16 so that any 16bit relocations if present in the object
 # file turns into a build-time error.
-text_gap := 0x010200
-text_diff := 0x408020
+# As gap will affect the output section size it should not be huge to avoid the
+# creation of huge files.
+# The sum of gap and diff will affect the possible alignment so should be a
+# multiple of the possible alignment.
+text_gap := 0x01c240
+text_diff := 0x7e3dc0
 
 $(obj)/build32.base.lds: AFLAGS-y += -DGAP=$(text_gap) -DTEXT_DIFF=$(text_diff)
$(obj)/build32.offset.lds: AFLAGS-y += -DGAP=$(text_gap) -DTEXT_DIFF=$(text_diff) -DAPPLY_OFFSET
@@ -69,7 +73,6 @@ $(obj)/built-in-32.%.bin: $(obj)/build32.%.lds $(obj)/built-in-32.tmp.o
$(LD32) $(orphan-handling-y) -N -T $< -o $(@:bin=o) $(filter %.o,$^)
$(NM) -p --format=bsd $(@:bin=o) > $(@:bin=map)
$(OBJCOPY) -j .text -O binary $(@:bin=o) $@
-   rm -f $(@:bin=o)
 
 quiet_cmd_combine = GEN $@
 cmd_combine = \
@@ -80,6 +83,7 @@ cmd_combine = \
   --bin1  $(obj)/built-in-32.base.bin \
   --bin2  $(obj)/built-in-32.offset.bin \
   --map   $(obj)/built-in-32.base.map \
+  --align $(shell $(OBJDUMP) -h $(obj)/built-in-32.base.o|sed '/text.*2\*\*/ {s/.*2\*\*//;p;}; d') \
   --exports   cmdline_parse_early,reloc,reloc_trampoline32 \
   --output$@
 
@@ -90,4 +94,4 @@ $(obj)/built-in-32.S: $(obj)/built-in-32.base.bin $(obj)/built-in-32.offset.bin
   $(srctree)/tools/combine_two_binaries.py FORCE
$(call if_changed,combine)
 
-clean-files := built-in-32.*.bin built-in-32.*.map build32.*.lds
+clean-files := built-in-32.*.bin built-in-32.*.map built-in-32.*.o build32.*.lds
diff --git a/xen/tools/combine_two_binaries.py b/xen/tools/combine_two_binaries.py
index 581e57cbc0..8e587c24fb 100755
--- a/xen/tools/combine_two_binaries.py
+++ b/xen/tools/combine_two_binaries.py
@@ -26,6 +26,10 @@ parser.add_argument('--text-diff', dest='text_diff',
 required=True,
 type=auto_int,
 help='Difference between code section start')
+parser.add_argument('--align', dest='align',
+default=2,
+type=auto_int,
+help='Alignment in power of 2')
 parser.add_argument('--output', dest='output',
 help='Output file')
 parser.add_argument('--map', dest='mapfile',
@@ -93,7 +97,7 @@ if size1 > size2:
 file1, file2 = file2, file1
 size1, size2 = size2, size1
 if size2 != size1 + gap:
-raise Exception('File sizes do not match')
+raise Exception('File sizes do not match %d != %d + %d' % (size2, size1, gap))
 del size2
 
 file1.seek(0, 0)
@@ -219,6 +223,7 @@ print('''/*
  * File autogenerated by combine_two_binaries.py DO NOT EDIT
  */''', file=out)
 print('\t' + args.section_header, file=out)
+print('\t.p2align\t' + str(args.align), file=out)
 print('obj32_start:', file=out)
 output(out)
 print('\n\t.section .note.GNU-stack,"",@progbits', file=out)
-- 
2.34.1




Re: [PATCH] xl: properly dispose of vTPM struct instance

2025-01-14 Thread Andrew Cooper
On 14/01/2025 8:13 am, Jan Beulich wrote:
> The backend_domname field requires separate freeing; make sure to call
> libxl_device_vtpm_dispose() also on respective error paths.
>
> Coverity-ID: 1638719
> Fixes: dde22055ac3a ("libxl: add vtpm support")
> Signed-off-by: Jan Beulich 

Reviewed-by: Andrew Cooper 



[PATCH v2 0/3] xen: fix usage of devices behind a VMD bridge

2025-01-14 Thread Roger Pau Monne
Hello,

The following series should fix the usage of devices behind a VMD bridge
when running Linux as a Xen PV hardware domain (dom0).  I've only been
able to test PV. I think PVH should also work but I don't have hardware
capable of testing it right now.

I don't expect the first two patches to be problematic, the last patch
is likely to be more controversial.  I've tested it internally and
didn't see any issues, but my testing of PV mode is mostly limited to
dom0.

Thanks, Roger.

Roger Pau Monne (3):
  xen/pci: do not register devices with segments >= 0x10000
  vmd: disable MSI remapping bypass under Xen
  pci/msi: remove pci_msi_ignore_mask

 arch/x86/pci/xen.c   |  8 ++--
 drivers/pci/controller/vmd.c | 19 ++
 drivers/pci/msi/msi.c| 37 
 drivers/xen/pci.c| 19 ++
 include/linux/msi.h  |  3 ++-
 kernel/irq/msi.c |  2 +-
 6 files changed, 64 insertions(+), 24 deletions(-)

-- 
2.46.0




[PATCH v2 2/3] vmd: disable MSI remapping bypass under Xen

2025-01-14 Thread Roger Pau Monne
MSI remapping bypass (directly configuring MSI entries for devices on the
VMD bus) won't work under Xen, as Xen is not aware of devices in such bus,
and hence cannot configure the entries using the pIRQ interface in the PV
case, and in the PVH case traps won't be setup for MSI entries for such
devices.

Until Xen is aware of devices in the VMD bus prevent the
VMD_FEAT_CAN_BYPASS_MSI_REMAP capability from being used when running as
any kind of Xen guest.

The MSI remapping bypass is an optional feature of VMD bridges, and hence
when running under Xen it will be masked and devices will be forced to
redirect its interrupts from the VMD bridge.  That mode of operation must
always be supported by VMD bridges and works when Xen is not aware of
devices behind the VMD bridge.

Signed-off-by: Roger Pau Monné 
---
Changes since v1:
 - Add xen header.
 - Expand comment.
---
 drivers/pci/controller/vmd.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 264a180403a0..33c9514bd926 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -17,6 +17,8 @@
 #include 
 #include 
 
+#include 
+
 #include 
 
 #define VMD_CFGBAR 0
@@ -965,6 +967,23 @@ static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id)
struct vmd_dev *vmd;
int err;
 
+   if (xen_domain())
+   /*
+* Xen doesn't have knowledge about devices in the VMD bus
+* because the config space of devices behind the VMD bridge is
+* not known to Xen, and hence Xen cannot discover or configure
+* them in any way.
+*
+* Bypass of MSI remapping won't work in that case as direct
+* write by Linux to the MSI entries won't result in functional
+* interrupts, as it's Xen the entity that manages the host
+* interrupt controller and must configure interrupts.
+* However multiplexing of interrupts by the VMD bridge will
+* work under Xen, so force the usage of that mode which must
+* always be supported by VMD bridges.
+*/
+   features &= ~VMD_FEAT_CAN_BYPASS_MSI_REMAP;
+
if (resource_size(&dev->resource[VMD_CFGBAR]) < (1 << 20))
return -ENOMEM;
 
-- 
2.46.0




[PATCH v2 1/3] xen/pci: do not register devices with segments >= 0x10000

2025-01-14 Thread Roger Pau Monne
The current hypercall interface for doing PCI device operations always uses
a segment field that has a 16 bit width.  However on Linux there are buses
like VMD that hook up devices into the PCI hierarchy at segment >= 0x10000,
after the maximum possible segment enumerated in ACPI.

Attempting to register or manage those devices with Xen would result in
errors at best, or overlaps with existing devices living on the truncated
equivalent segment values.  Note also that the VMD segment numbers are
arbitrarily assigned by the OS, and hence there would need to be some
negotiation between Xen and the OS to agree on how to enumerate VMD
segments and devices behind them.

Skip notifying Xen about those devices.  Given how VMD bridges can
multiplex interrupts on behalf of devices behind them there's no need for
Xen to be aware of such devices for them to be usable by Linux.

Signed-off-by: Roger Pau Monné 
---
Changes since v1:
 - Adjust commit message width to 75 columns.
 - Expand commit message.
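
For reference, the truncation hazard with the current interface (a sketch;
the hypercall structs used below carry a 16-bit seg field):

    struct physdev_pci_device device = {
        .seg = pci_domain_nr(pci_dev->bus), /* 0x10000 truncates to 0 */
        .bus = pci_dev->bus->number,
        .devfn = pci_dev->devfn,
    };
    /* Without the new check this would alias a device on segment 0. */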
---
 drivers/xen/pci.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
index 416f231809cb..08e82fd1263e 100644
--- a/drivers/xen/pci.c
+++ b/drivers/xen/pci.c
@@ -43,6 +43,13 @@ static int xen_add_device(struct device *dev)
pci_mcfg_reserved = true;
}
 #endif
+
+   if (pci_domain_nr(pci_dev->bus) >> 16) {
+   dev_info(dev,
+"not registering with Xen: invalid PCI segment\n");
+   return 0;
+   }
+
if (pci_seg_supported) {
DEFINE_RAW_FLEX(struct physdev_pci_device_add, add, optarr, 1);
 
@@ -149,6 +156,12 @@ static int xen_remove_device(struct device *dev)
int r;
struct pci_dev *pci_dev = to_pci_dev(dev);
 
+   if (pci_domain_nr(pci_dev->bus) >> 16) {
+   dev_info(dev,
+"not unregistering with Xen: invalid PCI segment\n");
+   return 0;
+   }
+
if (pci_seg_supported) {
struct physdev_pci_device device = {
.seg = pci_domain_nr(pci_dev->bus),
@@ -182,6 +195,12 @@ int xen_reset_device(const struct pci_dev *dev)
.flags = PCI_DEVICE_RESET_FLR,
};
 
+   if (pci_domain_nr(dev->bus) >> 16) {
+   dev_info(&dev->dev,
+"unable to notify Xen of device reset: invalid PCI 
segment\n");
+   return 0;
+   }
+
return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_reset, &device);
 }
 EXPORT_SYMBOL_GPL(xen_reset_device);
-- 
2.46.0




[PATCH v2 3/3] pci/msi: remove pci_msi_ignore_mask

2025-01-14 Thread Roger Pau Monne
Setting pci_msi_ignore_mask inhibits the toggling of the mask bit for both
MSI and MSI-X entries globally, regardless of the IRQ chip they are using.
Only Xen sets the pci_msi_ignore_mask when routing physical interrupts over
event channels, to prevent PCI code from attempting to toggle the maskbit,
as it's Xen that controls the bit.

However, the pci_msi_ignore_mask being global will affect devices that use
MSI interrupts but are not routing those interrupts over event channels
(not using the Xen pIRQ chip).  One example is devices behind a VMD PCI
bridge.  In that scenario the VMD bridge configures MSI(-X) using the
normal IRQ chip (the pIRQ one in the Xen case), and devices behind the
bridge configure the MSI entries using indexes into the VMD bridge MSI
table.  The VMD bridge then demultiplexes such interrupts and delivers to
the destination device(s).  Having pci_msi_ignore_mask set in that scenario
prevents (un)masking of MSI entries for devices behind the VMD bridge.

Move the signaling of no entry masking into the MSI domain flags, as that
allows setting it on a per-domain basis.  Set it for the Xen MSI domain
that uses the pIRQ chip, while leaving it unset for the rest of the
cases.

Remove pci_msi_ignore_mask at once, since it was only used by Xen code, and
with Xen dropping usage the variable is unneeded.

This fixes using devices behind a VMD bridge on Xen PV hardware domains.

Albeit devices behind a VMD bridge are not known to Xen, that doesn't mean
Linux cannot use them.  By inhibiting the usage of
VMD_FEAT_CAN_BYPASS_MSI_REMAP and removing the pci_msi_ignore_mask bodge,
devices behind a VMD bridge do work fine when used from a Linux Xen
hardware domain.  That's the whole point of the series.

Signed-off-by: Roger Pau Monné 
---
Changes since v1:
 - Fix build.
 - Expand commit message.
---
 arch/x86/pci/xen.c|  8 ++--
 drivers/pci/msi/msi.c | 37 +
 include/linux/msi.h   |  3 ++-
 kernel/irq/msi.c  |  2 +-
 4 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 0f2fe524f60d..b8755cde2419 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -436,7 +436,8 @@ static struct msi_domain_ops xen_pci_msi_domain_ops = {
 };
 
 static struct msi_domain_info xen_pci_msi_domain_info = {
-   .flags  = MSI_FLAG_PCI_MSIX | MSI_FLAG_FREE_MSI_DESCS | MSI_FLAG_DEV_SYSFS,
+   .flags  = MSI_FLAG_PCI_MSIX | MSI_FLAG_FREE_MSI_DESCS |
+ MSI_FLAG_DEV_SYSFS | MSI_FLAG_NO_MASK,
.ops= &xen_pci_msi_domain_ops,
 };
 
@@ -484,11 +485,6 @@ static __init void xen_setup_pci_msi(void)
 * in allocating the native domain and never use it.
 */
x86_init.irqs.create_pci_msi_domain = xen_create_pci_msi_domain;
-   /*
-* With XEN PIRQ/Eventchannels in use PCI/MSI[-X] masking is solely
-* controlled by the hypervisor.
-*/
-   pci_msi_ignore_mask = 1;
 }
 
 #else /* CONFIG_PCI_MSI */
diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
index 3a45879d85db..dcbb4f9ac578 100644
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -10,12 +10,12 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "../pci.h"
 #include "msi.h"
 
 int pci_msi_enable = 1;
-int pci_msi_ignore_mask;
 
 /**
  * pci_msi_supported - check whether MSI may be enabled on a device
@@ -285,6 +285,8 @@ static void pci_msi_set_enable(struct pci_dev *dev, int enable)
 static int msi_setup_msi_desc(struct pci_dev *dev, int nvec,
  struct irq_affinity_desc *masks)
 {
+   const struct irq_domain *d = dev_get_msi_domain(&dev->dev);
+   const struct msi_domain_info *info = d->host_data;
struct msi_desc desc;
u16 control;
 
@@ -295,8 +297,7 @@ static int msi_setup_msi_desc(struct pci_dev *dev, int nvec,
/* Lies, damned lies, and MSIs */
if (dev->dev_flags & PCI_DEV_FLAGS_HAS_MSI_MASKING)
control |= PCI_MSI_FLAGS_MASKBIT;
-   /* Respect XEN's mask disabling */
-   if (pci_msi_ignore_mask)
+   if (info->flags & MSI_FLAG_NO_MASK)
control &= ~PCI_MSI_FLAGS_MASKBIT;
 
desc.nvec_used  = nvec;
@@ -600,12 +601,15 @@ static void __iomem *msix_map_region(struct pci_dev *dev,
  */
 void msix_prepare_msi_desc(struct pci_dev *dev, struct msi_desc *desc)
 {
+   const struct irq_domain *d = dev_get_msi_domain(&dev->dev);
+   const struct msi_domain_info *info = d->host_data;
+
desc->nvec_used = 1;
desc->pci.msi_attrib.is_msix= 1;
desc->pci.msi_attrib.is_64  = 1;
desc->pci.msi_attrib.default_irq= dev->irq;
desc->pci.mask_base = dev->msix_base;
-   desc->pci.msi_attrib.can_mask   = !pci_msi_ignore_mask &&
+   desc->pci.msi_attrib.can_mask

Re: [XEN PATCH] intel/msr: Fix handling of MSR_RAPL_POWER_UNIT

2025-01-14 Thread Roger Pau Monné
On Tue, Jan 14, 2025 at 10:32:25AM +0100, Roger Pau Monné wrote:
> On Mon, Jan 13, 2025 at 06:42:44PM +, Teddy Astie wrote:
> > Solaris 11.4 tries to access this MSR on some Intel platforms without 
> > properly
> > setting up a proper #GP handler, which leads to a immediate crash.
> > 
> > Emulate the access of this MSR by giving it a legal value (all values set to
> > default, as defined by Intel SDM "RAPL Interfaces").
> > 
> > Fixes: 84e848fd7a1 ('x86/hvm: disallow access to unknown MSRs')

Nit: I think we usually use 12 hex character hashes, the above one is
11 characters long.

> 
> Hm, 

Seems like I've sent this too early.

I wanted to say I wasn't convinced this is a fix for the above, but I
can see how the change can be seen as a regression if Solaris booted
before that change in behavior, so I'm fine with leaving the "Fixes:" tag.

Thanks, Roger.



Re: [XEN PATCH] intel/msr: Fix handling of MSR_RAPL_POWER_UNIT

2025-01-14 Thread Roger Pau Monné
On Mon, Jan 13, 2025 at 06:42:44PM +, Teddy Astie wrote:
> Solaris 11.4 tries to access this MSR on some Intel platforms without properly
> setting up a proper #GP handler, which leads to a immediate crash.
> 
> Emulate the access of this MSR by giving it a legal value (all values set to
> default, as defined by Intel SDM "RAPL Interfaces").
> 
> Fixes: 84e848fd7a1 ('x86/hvm: disallow access to unknown MSRs')

Hm, 

> Signed-off-by: Teddy Astie 
> ---
> Does it have a risk of negatively affecting other operating systems expecting
> this MSR read to fail ?
> ---
>  xen/arch/x86/include/asm/msr-index.h |  2 ++
>  xen/arch/x86/msr.c   | 16 
>  2 files changed, 18 insertions(+)
> 
> diff --git a/xen/arch/x86/include/asm/msr-index.h 
> b/xen/arch/x86/include/asm/msr-index.h
> index 9cdb5b2625..2adcdf344f 100644
> --- a/xen/arch/x86/include/asm/msr-index.h
> +++ b/xen/arch/x86/include/asm/msr-index.h
> @@ -144,6 +144,8 @@
>  #define MSR_RTIT_ADDR_A(n) (0x0580 + (n) * 2)
>  #define MSR_RTIT_ADDR_B(n) (0x0581 + (n) * 2)
>  
> +#define MSR_RAPL_POWER_UNIT 0x0606
> +
>  #define MSR_U_CET   0x06a0
>  #define MSR_S_CET   0x06a2
>  #define  CET_SHSTK_EN   (_AC(1, ULL) <<  0)
> diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c
> index 289cf10b78..b14d42dacf 100644
> --- a/xen/arch/x86/msr.c
> +++ b/xen/arch/x86/msr.c
> @@ -169,6 +169,22 @@ int guest_rdmsr(struct vcpu *v, uint32_t msr, uint64_t 
> *val)
>  if ( likely(!is_cpufreq_controller(d)) || rdmsr_safe(msr, *val) == 0 
> )
>  break;
>  goto gp_fault;
> +

Trailing spaces in the added newline.

> +/*
> + * Solaris 11.4 DomU tries to use read this MSR without setting up a
> + * proper #GP handler leading to a crash. Emulate this MSR by giving 
> a
> + * legal value.
> + */

The comment should be after (inside) the case statement IMO (but not a
strong opinion).  Could you also raise a bug with Solaris and put a
link to the bug report here, so that we have a reference to it?

> +case MSR_RAPL_POWER_UNIT:
> +if ( !(cp->x86_vendor & (X86_VENDOR_INTEL | X86_VENDOR_CENTAUR)) )

Has Centaur ever released a CPU with RAPL?

> +goto gp_fault;
> +
> +/*
> + * Return a legal register content with all default values defined in
> + * Intel Architecture Software Developer Manual 16.10.1 RAPL 
> Interfaces
> + */
> +*val = 0xA1003;

The SPR Specification defines the default as 000A0E03h:

* SDM:

Energy Status Units (bits 12:8): Energy related information (in
Joules) is based on the multiplier, 1/2^ESU; where ESU is an unsigned
integer represented by bits 12:8. Default value is 10000b, indicating
energy status unit is in 15.3 micro-Joules increment.

* SPR:

Energy Units (ENERGY_UNIT):
Energy Units used for power control registers.
The actual unit value is calculated by 1 J / Power(2,ENERGY_UNIT).
The default value of 14 corresponds to Ux.14 number.

Note that KVM just returns all 0s [0], so we might consider doing the
same, as otherwise that could lead OSes to poke at further RAPL
related MSRs if the returned value from MSR_RAPL_POWER_UNIT looks
plausible.

[0] https://elixir.bootlin.com/linux/v6.12.6/source/arch/x86/kvm/x86.c#L4236
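
FWIW, 0xA1003 does decode to those SDM defaults (a sketch of the layout as I
read the SDM; RAPL being non-architectural, there's no CPUID bit guarding it):

    unsigned long val = 0xA1003;
    unsigned int pu  = val & 0xf;          /* bits  3:0  power units:  1/2^3 W */
    unsigned int esu = (val >> 8) & 0x1f;  /* bits 12:8  energy units: 1/2^16 J (~15.3 uJ) */
    unsigned int tu  = (val >> 16) & 0xf;  /* bits 19:16 time units:   1/2^10 s (~977 us) */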

Thanks.



Re: [PATCH v4 0/4] Add/enable stack protector

2025-01-14 Thread Andrew Cooper
On 14/01/2025 4:25 am, Volodymyr Babchuk wrote:
> Volodymyr Babchuk (4):
>   common: remove -fno-stack-protector from EMBEDDED_EXTRA_CFLAGS
>   xen: common: add ability to enable stack protector
>   xen: arm: enable stack protector feature
>   CHANGELOG.md: Mention stack-protector feature

Reviewed-by: Andrew Cooper 

There's one minor formatting error which can be fixed on commit.

~Andrew



Re: [PATCH 2/3] vmd: disable MSI remapping bypass under Xen

2025-01-14 Thread Roger Pau Monné
On Mon, Jan 13, 2025 at 09:53:21AM -0700, Keith Busch wrote:
> On Mon, Jan 13, 2025 at 05:45:20PM +0100, Roger Pau Monné wrote:
> > On Mon, Jan 13, 2025 at 08:11:19AM -0700, Keith Busch wrote:
> > > On Mon, Jan 13, 2025 at 11:03:58AM +0100, Roger Pau Monné wrote:
> > > > 
> > > > Hm, OK, but isn't the limit 80 columns according to the kernel coding
> > > > style (Documentation/process/coding-style.rst)?
> > > 
> > > That's the coding style. The commit message style is described in a
> > > different doc:
> > > 
> > >   
> > > https://docs.kernel.org/process/submitting-patches.html#the-canonical-patch-format
> > 
> > It's quite confusing to have different line length for code vs commit
> > messages, but my fault for not reading those. Will adjust to 75
> > columns then.
> 
> The output of 'git log' prepends spaces to the subject and body of the
> listed commits. The lower limit for commit messages vs. code makes the
> change log look readable in an 80-char terminal.

Oh, I see, thanks for explaining.

The 80 column limit for code means the resulting diff (and `git log`
output) could have a maximum width of 81 characters (because of the
prepended +,-, ), but since Linux uses hard tabs for indentation this
is not really an issue as it would be if using spaces.

Regards, Roger.



Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock

2025-01-14 Thread Oleksii Kurochko


On 1/13/25 6:52 PM, Roger Pau Monné wrote:

On Mon, Jan 13, 2025 at 05:07:55PM +0100, Marek Marczykowski-Górecki wrote:

On Fri, Sep 13, 2024 at 09:59:06AM +0200, Roger Pau Monne wrote:

Allow setting the used wallclock from the command line.  When the option is set
to a value different than `auto` the probing is bypassed and the selected
implementation is used (as long as it's available).

The `xen` and `efi` options require being booted as a Xen guest (with Xen guest
supported built-in) or from UEFI firmware respectively.

Signed-off-by: Roger Pau Monné

Reviewed-by: Marek Marczykowski-Górecki

Thanks for the review.

Oleksii, can I get your opinion as Release Manager about whether this
(and the following patch) would be suitable for committing to staging
given the current release state?

It's a workaround for broken EFI implementations that many downstreams
carry on their patch queue.


Based on your commit message, I understand this as addressing a bug (but not
very critical, as IIUC downstreams have a similar patch on their side).
Therefore, if it has been properly reviewed and tested, we should consider
including it in the current release.

IIUC, setting the wallclock to EFI should align with the behavior Xen had
previously. It might be preferable to use that argument as the default for a
while, allowing others to verify the "auto" value over time. After that, we
could consider making "auto" the default. That said, I am not particularly
strict about following this approach.

~ Oleksii




Thanks, Roger.

Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock

2025-01-14 Thread Roger Pau Monné
On Tue, Jan 14, 2025 at 12:44:39PM +0100, Oleksii Kurochko wrote:
> 
> On 1/14/25 12:40 PM, Oleksii Kurochko wrote:
> > 
> > 
> > On 1/14/25 12:27 PM, Roger Pau Monné wrote:
> > > On Tue, Jan 14, 2025 at 12:12:03PM +0100, Oleksii Kurochko wrote:
> > > > On 1/13/25 6:52 PM, Roger Pau Monné wrote:
> > > > > On Mon, Jan 13, 2025 at 05:07:55PM +0100, Marek Marczykowski-Górecki 
> > > > > wrote:
> > > > > > On Fri, Sep 13, 2024 at 09:59:06AM +0200, Roger Pau Monne wrote:
> > > > > > > Allow setting the used wallclock from the command line.  When the 
> > > > > > > option is set
> > > > > > > to a value different than `auto` the probing is bypassed and the 
> > > > > > > selected
> > > > > > > implementation is used (as long as it's available).
> > > > > > > 
> > > > > > > The `xen` and `efi` options require being booted as a Xen guest 
> > > > > > > (with Xen guest
> > > > > > > supported built-in) or from UEFI firmware respectively.
> > > > > > > 
> > > > > > > Signed-off-by: Roger Pau Monné
> > > > > > Reviewed-by: Marek 
> > > > > > Marczykowski-Górecki
> > > > > Thanks for the review.
> > > > > 
> > > > > Oleksii, can I get your opinion as Release Manager about whether this
> > > > > (and the following patch) would be suitable for committing to staging
> > > > > given the current release state?
> > > > > 
> > > > > It's a workaround for broken EFI implementations that many downstreams
> > > > > carry on their patch queue.
> > > > Based on your commit message, I understand this as addressing a bug ( 
> > > > but not very critical
> > > > as IIUC downstreams have the similar patch on their side ). Therefore, 
> > > > if it has been properly
> > > > reviewed and tested, we should consider including it in the current 
> > > > release.
> > > IIRC at least Qubes, XenServer and XCP-ng have a patch that achieves
> > > the same behavior as proposed here.
> > > 
> > > > IIUC, setting the wallclock to EFI should align with the behavior Xen 
> > > > had previously.
> > > > It might be preferable to use that argument as the default for a while, 
> > > > allowing others to verify the "auto"
> > > > value over time. After that, we could consider making "auto" the 
> > > > default.
> > > > That said, I am not particularly strict about following this approach.
> > > We cannot really set efi as the default, as it would break when
> > > booting on legacy BIOS systems.
> > > 
> > > We could take only patch 1 and leave patch 2 after Xen 4.20 has
> > > branched, but at that point I would see little benefit in having just
> > > patch 1.
> > I don't see a lot of benefit in committing only the one patch either.
> > 
> > 
> > > I don't have a strong opinion, but downstreams have been complaining
> > > about Xen behavior regarding the usage of EFI_GET_TIME, so it might be
> > > good to not ship yet another release with such allegedly broken
> > > behavior.
> > Agree with that. As  I mentioned above I consider it as a bug and based on
> > that several mentioned above downstreams have the similar patch I could 
> > consider
> > that as tested approach so ..
> > > Let me know what you think, as I would need a formal Release-Ack if
> > > this is to be committed.
> > ... R-Acked-by: Oleksii Kurochko.
> 
> Sorry for the noise.
> 
> I forgot to add that IMO it would also be nice to mention in the CHANGELOG
> that we now have the wallclock parameter, so downstreams will receive
> "notification" that Xen provides a workaround for the issue mentioned in the
> patch series.

Indeed.  Would you be OK with adding the following chunk to patch 2?

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 8507e6556a56..1de1d1eca17f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -12,6 +12,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
leaving this to the guest kernel to do in guest context.
  - On x86:
- Prefer ACPI reboot over UEFI ResetSystem() run time service call.
+   - Prefer CMOS over EFI_GET_TIME as time source.
   - Switched the xAPIC flat driver to use physical destination mode for external
  interrupts instead of logical destination mode.
 
@@ -24,6 +25,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
- Support for LLC (Last Level Cache) coloring.
  - On x86:
- xl suspend/resume subcommands.
+   - `wallclock` command line option to select time source.
 
 ### Removed
  - On x86:
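
(For completeness: per patch 1 the new option is spelled e.g. `wallclock=xen`
or `wallclock=efi` on the Xen command line, with `auto` selecting the probed
default.)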



Re: [PATCH v4 2/4] xen: common: add ability to enable stack protector

2025-01-14 Thread Andrew Cooper
On 14/01/2025 4:25 am, Volodymyr Babchuk wrote:
> diff --git a/xen/common/stack-protector.c b/xen/common/stack-protector.c
> new file mode 100644
> index 00..8fa9f6147f
> --- /dev/null
> +++ b/xen/common/stack-protector.c
> @@ -0,0 +1,51 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/*
> + * Initial value is chosen by a fair dice roll.
> + * It will be updated during boot process.
> + */
> +#if BITS_PER_LONG == 32
> +unsigned long __ro_after_init __stack_chk_guard = 0xdd2cc927UL;
> +#else
> +unsigned long __ro_after_init __stack_chk_guard = 0x2d853605a4d9a09cUL;
> +#endif
> +
> +/*
> + * This function should be called from early asm or from a C function
> + * that escapes stack canary tracking (by calling
> + * reset_stack_and_jump() for example).
> + */
> +void __init asmlinkage boot_stack_chk_guard_setup(void)
> +{
> +/*
> + * Linear congruent generator (X_n+1 = X_n * a + c).
> + *
> + * Constant is taken from "Tables Of Linear Congruential
> + * Generators Of Different Sizes And Good Lattice Structure" by
> + * Pierre L’Ecuyer.
> + */
> +#if BITS_PER_LONG == 32
> +const unsigned long a = 2891336453UL;
> +#else
> +const unsigned long a = 2862933555777941757UL;
> +#endif
> +const unsigned long c = 1;
> +
> +unsigned long cycles = get_cycles();
> +
> +/* Use the initial value if we can't generate random one */
> +if ( !cycles )
> +return;

Indentation.  Can be fixed on commit.

Everything else LGTM.

~Andrew



[PATCH] x86/boot: Handle better alignment for 32 bit code

2025-01-14 Thread Frediano Ziglio
Output file didn't have correct alignment.
Allows alignment into data or code up to 2mb.

Signed-off-by: Frediano Ziglio 
---
 xen/arch/x86/boot/Makefile| 8 
 xen/tools/combine_two_binaries.py | 7 ++-
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/boot/Makefile b/xen/arch/x86/boot/Makefile
index 13d4583173..9a8ecba7aa 100644
--- a/xen/arch/x86/boot/Makefile
+++ b/xen/arch/x86/boot/Makefile
@@ -40,8 +40,8 @@ LD32 := $(LD) $(subst x86_64,i386,$(LDFLAGS_DIRECT))
 # are affected by both text_diff and text_gap.  Ensure the sum of gap and diff
 # is greater than 2^16 so that any 16bit relocations if present in the object
 # file turns into a build-time error.
-text_gap := 0x010200
-text_diff := 0x408020
+text_gap := 0x01c240
+text_diff := 0x7e3dc0
 
 $(obj)/build32.base.lds: AFLAGS-y += -DGAP=$(text_gap) -DTEXT_DIFF=$(text_diff)
$(obj)/build32.offset.lds: AFLAGS-y += -DGAP=$(text_gap) -DTEXT_DIFF=$(text_diff) -DAPPLY_OFFSET
@@ -69,7 +69,6 @@ $(obj)/built-in-32.%.bin: $(obj)/build32.%.lds $(obj)/built-in-32.tmp.o
$(LD32) $(orphan-handling-y) -N -T $< -o $(@:bin=o) $(filter %.o,$^)
$(NM) -p --format=bsd $(@:bin=o) > $(@:bin=map)
$(OBJCOPY) -j .text -O binary $(@:bin=o) $@
-   rm -f $(@:bin=o)
 
 quiet_cmd_combine = GEN $@
 cmd_combine = \
@@ -80,6 +79,7 @@ cmd_combine = \
   --bin1  $(obj)/built-in-32.base.bin \
   --bin2  $(obj)/built-in-32.offset.bin \
   --map   $(obj)/built-in-32.base.map \
+  --align $(shell $(OBJDUMP) -h $(obj)/built-in-32.base.o|sed '/text.*2\*\*/ {s/.*2\*\*//;p;}; d') \
   --exports   cmdline_parse_early,reloc,reloc_trampoline32 \
   --output$@
 
@@ -90,4 +90,4 @@ $(obj)/built-in-32.S: $(obj)/built-in-32.base.bin $(obj)/built-in-32.offset.bin
   $(srctree)/tools/combine_two_binaries.py FORCE
$(call if_changed,combine)
 
-clean-files := built-in-32.*.bin built-in-32.*.map build32.*.lds
+clean-files := built-in-32.*.bin built-in-32.*.map built-in-32.*.o build32.*.lds
diff --git a/xen/tools/combine_two_binaries.py b/xen/tools/combine_two_binaries.py
index 581e57cbc0..8e587c24fb 100755
--- a/xen/tools/combine_two_binaries.py
+++ b/xen/tools/combine_two_binaries.py
@@ -26,6 +26,10 @@ parser.add_argument('--text-diff', dest='text_diff',
 required=True,
 type=auto_int,
 help='Difference between code section start')
+parser.add_argument('--align', dest='align',
+default=2,
+type=auto_int,
+help='Alignment in power of 2')
 parser.add_argument('--output', dest='output',
 help='Output file')
 parser.add_argument('--map', dest='mapfile',
@@ -93,7 +97,7 @@ if size1 > size2:
 file1, file2 = file2, file1
 size1, size2 = size2, size1
 if size2 != size1 + gap:
-raise Exception('File sizes do not match')
+raise Exception('File sizes do not match %d != %d + %d' % (size2, size1, gap))
 del size2
 
 file1.seek(0, 0)
@@ -219,6 +223,7 @@ print('''/*
  * File autogenerated by combine_two_binaries.py DO NOT EDIT
  */''', file=out)
 print('\t' + args.section_header, file=out)
+print('\t.p2align\t' + str(args.align), file=out)
 print('obj32_start:', file=out)
 output(out)
 print('\n\t.section .note.GNU-stack,"",@progbits', file=out)
-- 
2.34.1




Re: [XEN PATCH] intel/msr: Fix handling of MSR_RAPL_POWER_UNIT

2025-01-14 Thread Andrew Cooper
On 13/01/2025 6:42 pm, Teddy Astie wrote:
> Solaris 11.4 tries

Is it only Solaris 11.4, or is this simply the one repro you had?

Have you reported a bug?

>  to access this MSR on some Intel platforms without properly
> setting up a proper #GP handler, which leads to a immediate crash.

Minor grammar note.  Either "without a proper #GP handler" or "without
properly setting up a #GP handler", but having two proper(ly)'s in there
is less than ideal.

> Emulate the access of this MSR by giving it a legal value (all values set to
> default, as defined by Intel SDM "RAPL Interfaces").
>
> Fixes: 84e848fd7a1 ('x86/hvm: disallow access to unknown MSRs')
> Signed-off-by: Teddy Astie 
> ---
> Does it have a risk of negatively affecting other operating systems expecting
> this MSR read to fail?

It's Complicated.

RAPL is a non-architectural feature (on Intel; AMD did it properly).  It
does not have a CPUID bit to announce the presence of the MSRs. 
Therefore OSes use a mixture of model numbers and {wr,rd}msr_safe() to
probe.
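
To illustrate the probing pattern (a sketch only, not Solaris or Linux code;
MSR_RAPL_POWER_UNIT being 0x606 and a Xen-style rdmsr_safe() returning
non-zero on #GP are assumed):

/* Sketch: probe a non-architectural MSR via a model check + safe read. */
static bool probe_rapl_power_unit(void)
{
    uint64_t val;

    /* Non-architectural: no CPUID bit, so gate on vendor/model first. */
    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL )
        return false;

    /* rdmsr_safe() catches the #GP a missing MSR would raise. */
    return rdmsr_safe(MSR_RAPL_POWER_UNIT, val) == 0;
}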

I expect this will change the behaviour of Linux.

~Andrew



Re: [PATCH v2 2/2] automation/eclair: make Misra rule 20.7 blocking for x86 also

2025-01-14 Thread Roger Pau Monné
On Tue, Jan 14, 2025 at 12:24:30PM +0100, Nicola Vetrini wrote:
> On 2025-01-14 12:22, Roger Pau Monné wrote:
> > Hello Oleksii,
> > 
> > This is in principle ready to go in now (I'm currently running a
> > private Eclair scan to ensure the patch is still OK against current
> > staging).  I would like to ask for a release Ack.
> > 
> 
> One nit below, which I overlooked initially
> 
> > Thanks, Roger.
> > 
> > On Tue, Nov 26, 2024 at 10:35:08AM +0100, Roger Pau Monne wrote:
> > > There are no violations left, make the rule globally blocking for
> > > both x86 and
> > > ARM.
> > > 
> > > Signed-off-by: Roger Pau Monné 
> > > Reviewed-by: Andrew Cooper 
> > > ---
> > >  automation/eclair_analysis/ECLAIR/tagging.ecl | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/automation/eclair_analysis/ECLAIR/tagging.ecl
> > > b/automation/eclair_analysis/ECLAIR/tagging.ecl
> > > index 755ea3271fc9..cb4e233e838d 100644
> > > --- a/automation/eclair_analysis/ECLAIR/tagging.ecl
> > > +++ b/automation/eclair_analysis/ECLAIR/tagging.ecl
> > > @@ -80,6 +80,7 @@ MC3R1.R20.2||
> > >  MC3R1.R20.3||
> > >  MC3R1.R20.4||
> > >  MC3R1.R20.6||
> > > +MC3R1.R20.7||
> > >  MC3R1.R20.9||
> > >  MC3R1.R20.11||
> > >  MC3R1.R20.12||
> > > @@ -116,7 +117,7 @@ if(string_equal(target,"x86_64"),
> > >  )
> 
> this hunk will not apply because it uses MC3R1, rather than MC3R2. Should be
> an easy fix.
> 
> > > 
> > >  if(string_equal(target,"arm64"),
> > > -
> > > service_selector({"additional_clean_guidelines","MC3R1.R2.1||MC3R1.R5.3||MC3.R11.2||MC3R1.R16.6||MC3R1.R20.7"})
> > > +
> > > service_selector({"additional_clean_guidelines","MC3R1.R2.1||MC3R1.R5.3||MC3.R11.2||MC3R1.R16.6"})
> > >  )
> 
> here as well

Yeah indeed, I had to rebase the patch:

https://gitlab.com/xen-project/people/royger/xen/-/commit/538439d59dc338ee3861bf1bc056783671ba1fc2

Let's see if Eclair is happy with it, currently running a pipeline.

Thanks, Roger.
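
For context, MISRA C:2012 Rule 20.7 requires that expressions resulting from
the expansion of macro parameters be enclosed in parentheses. An illustrative
(made-up) example of what the rule flags:

/* Non-compliant: 'x' expands into an expression unparenthesised. */
#define TWICE_BAD(x)  (x + x)

/* Compliant: every expansion of the parameter is parenthesised. */
#define TWICE_OK(x)   ((x) + (x))

TWICE_BAD(a | b) would expand to (a | b + a | b), silently changing the
intended precedence; the compliant form avoids this.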



Re: [PATCH for 4.21 v1 1/1] xen/riscv: identify specific ISA supported by cpu

2025-01-14 Thread Oleksii Kurochko


On 1/14/25 8:33 AM, Jan Beulich wrote:

+RISCV_ISA_EXT_DATA(i, RISCV_ISA_EXT_i),
+RISCV_ISA_EXT_DATA(m, RISCV_ISA_EXT_m),
+RISCV_ISA_EXT_DATA(a, RISCV_ISA_EXT_a),
+RISCV_ISA_EXT_DATA(f, RISCV_ISA_EXT_f),
+RISCV_ISA_EXT_DATA(d, RISCV_ISA_EXT_d),
+RISCV_ISA_EXT_DATA(q, RISCV_ISA_EXT_q),
+RISCV_ISA_EXT_DATA(h, RISCV_ISA_EXT_h),
+RISCV_ISA_EXT_DATA(zicntr, RISCV_ISA_EXT_ZICNTR),
+RISCV_ISA_EXT_DATA(zicsr, RISCV_ISA_EXT_ZICSR),
+RISCV_ISA_EXT_DATA(zifencei, RISCV_ISA_EXT_ZIFENCEI),
+RISCV_ISA_EXT_DATA(zihintpause, RISCV_ISA_EXT_ZIHINTPAUSE),
+RISCV_ISA_EXT_DATA(zihpm, RISCV_ISA_EXT_ZIHPM),
+RISCV_ISA_EXT_DATA(zbb, RISCV_ISA_EXT_ZBB),

Isn't it kind of implied that with the presence of Zbb, B should also be
present?

My interpretation of the RISC-V Bitmanip Extension spec is that the 'B' 
extension is essentially a collection of
the Zba, Zbb, Zbs, and other extensions, but it isn't an extension by itself.
The following is mentioned in the spec:
The bit-manipulation (bitmanip) extension collection is comprised of 
several component extensions to the base
RISC-V architecture that are intended to provide some combination of code 
size reduction, performance
improvement, and energy reduction. While the instructions are intended to 
have general use, some instructions
are more useful in some domains than others. Hence, several smaller 
bitmanip extensions are provided, rather
than one large extension. Each of these smaller extensions is grouped by 
common function and use case, and
each of which has its own Zb*-extension name.

Still the doc has '"B" Extension for Bit Manipulation' as the title of the
chapter.
And gas accepts B as an extension (e.g. ".option arch, +b").


I think it is fine.
B, in this case, just represents Zba, Zbb, Zbc and Zbs, and that all of them
are supported at the same time.
But I see chips that don't have B because they don't have one of those
extensions.
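
I.e., conceptually (a sketch following the interpretation above; the
RISCV_ISA_EXT_* constants for Zba/Zbc/Zbs are assumed to exist alongside the
Zbb one visible in the patch):

/* B is only shorthand for all of its component extensions being present. */
static bool riscv_isa_has_b(const unsigned long *riscv_isa)
{
    return test_bit(RISCV_ISA_EXT_ZBA, riscv_isa) &&
           test_bit(RISCV_ISA_EXT_ZBB, riscv_isa) &&
           test_bit(RISCV_ISA_EXT_ZBC, riscv_isa) &&
           test_bit(RISCV_ISA_EXT_ZBS, riscv_isa);
}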




+/*
+ * Workaround for invalid single-letter 's' & 'u' (QEMU).
+ * No need to set the bit in riscv_isa as 's' & 'u' are
+ * not valid ISA extensions. It works unless the first
+ * multi-letter extension in the ISA string begins with
+ * "Su" and is not prefixed with an underscore.
+ */
+if ( ext[-1] != '_' && ext[1] == 'u' )
+{
+++isa;
+ext_err = true;
+break;
+}

I'm afraid I don't understand this; the comment raises more questions
than it answers.

Some details about this QEMU workaround, from the Linux kernel's point of view, can be found here:
https://lore.kernel.org/linux-riscv/ae93358e-e117-b43d-faad-772c529f8...@irq.a4lg.com/#t

This leads to the following fix in QEMU:
https://patchwork.kernel.org/project/qemu-devel/patch/dee09d708405075420b29115c1e9e87910b8da55.1648270894.git.research_tra...@irq.a4lg.com/#24792587

Considering QEMU's patch, this workaround isn't needed anymore since QEMU 7.1
(released 30 Aug 2022), so we could probably update the QEMU version on our CI
and just drop these changes.
Or, at least, update the comment with the links mentioned above and add a note
that these changes are only needed for QEMU < 7.1.
Am I right that we don't have something like GCC_VERSION in Xen, but for QEMU?

How could there be? At the time of building Xen we know what compiler
version is in use, but we clearly don't know under what qemu versions
it might later be run.


Agree with that; there is no sense in having something similar to GCC_VERSION
but for QEMU. Then I will just update the comment around this workaround with
some clarifications.




+riscv_isa_parse_string(isa, this_isa);
+
+if ( bitmap_empty(riscv_isa, RISCV_ISA_EXT_MAX) )
+bitmap_copy(riscv_isa, this_isa, RISCV_ISA_EXT_MAX);
+else
+bitmap_and(riscv_isa, riscv_isa, this_isa, RISCV_ISA_EXT_MAX);

What if the first instance had no extensions at all? You'll then copy what
the second instance says, ending up with extensions not supported by one of
the CPUs.

I think it's impossible that there are no extensions at all; that should be
considered a bug in the provided riscv,isa property. Therefore it should be
enough to add BUG_ON(bitmap_empty(this_isa, RISCV_ISA_EXT_MAX)) before the
if-condition.
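
I.e. something along these lines (a sketch only, reusing the names from the
patch):

riscv_isa_parse_string(isa, this_isa);

/* An empty extension set can only mean a malformed riscv,isa property. */
BUG_ON(bitmap_empty(this_isa, RISCV_ISA_EXT_MAX));

if ( bitmap_empty(riscv_isa, RISCV_ISA_EXT_MAX) )
    bitmap_copy(riscv_isa, this_isa, RISCV_ISA_EXT_MAX);
else
    bitmap_and(riscv_isa, riscv_isa, this_isa, RISCV_ISA_EXT_MAX);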

Well, you can of course make such an assumption. I don't think though that
it's technically impossible to have an extension-less environment. Xen
won't be able to run there, though (we'll require H at the very least aiui,
and I'm sure we really also require Zicsr).
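
(A sketch of such a minimum-requirement check, reusing the RISCV_ISA_EXT_*
names from the patch:)

/* Xen cannot run without these; fail early if the common set lacks them. */
if ( !test_bit(RISCV_ISA_EXT_h, riscv_isa) ||
     !test_bit(RISCV_ISA_EXT_ZICSR, riscv_isa) )
    panic("CPU lacks mandatory H / Zicsr extensions\n");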


I would like to clarify some things. I think we mean different things by the
word 'extension'.
I am including the `Base ISA` in this (which is likely incorrect, or at least
confusing, and I will try not to do that in the future). I am using it in this
manner because the `Base ISA` is included in Table 74, Standard ISA

Re: [PATCH v3 0/3] Add stack protector

2025-01-14 Thread Jan Beulich
On 12.12.2024 02:17, Andrew Cooper wrote:
> On 12/12/2024 12:13 am, Volodymyr Babchuk wrote:
>> Hello Jan,
>>
>> Jan Beulich  writes:
>>
>>> On 11.12.2024 03:04, Volodymyr Babchuk wrote:
 Both GCC and Clang support -fstack-protector feature, which add stack
 canaries to functions where stack corruption is possible. This series
 makes possible to use this feature in Xen. I tested this on ARM64 and
 it is working as intended. Tested both with GCC and Clang.

 It is hard to enable this feature on x86, as GCC stores stack canary
 in %fs:40 by default, but Xen can't use %fs for various reasons. It is
 possible to change the stack canary location with newer GCC versions, but
 this will change minimal GCC requirement, which is also hard due to
 various reasons. So, this series focus mostly on ARM and RISCV.
>>> Why exactly would it not be possible to offer the feature when new enough
>>> gcc is in use?
>> It is possible to use this feature with a modern enough GCC, yes. Are
>> you suggesting to make HAS_STACK_PROTECTOR dependent on GCC_VERSION for
>> x86 platform?
> 
> (With the knowledge that this is a disputed Kconfig pattern, and will
> need rebasing), the way I want this to work is simply:
> 
> diff --git a/xen/Makefile b/xen/Makefile
> index 0de0101fd0bf..5d0a88fb3c3f 100644
> --- a/xen/Makefile
> +++ b/xen/Makefile
> @@ -434,6 +434,9 @@ endif
>  
>  ifeq ($(CONFIG_STACK_PROTECTOR),y)
>  CFLAGS += -fstack-protector
> +ifeq ($(CONFIG_X86),y)
> +CFLAGS += -mstack-protector-guard=global
> +endif
>  else
>  CFLAGS += -fno-stack-protector
>  endif
> diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
> index 9cdd04721afa..7951ca908b36 100644
> --- a/xen/arch/x86/Kconfig
> +++ b/xen/arch/x86/Kconfig
> @@ -28,6 +28,7 @@ config X86
>     select HAS_PCI_MSI
>     select HAS_PIRQ
>     select HAS_SCHED_GRANULARITY
> +   select HAS_STACK_PROTECTOR if
> $(cc-option,-mstack-protector-guard=global)
>     select HAS_UBSAN
>     select HAS_VMAP
>     select HAS_VPCI if HVM
> 
> 
> 
> Sadly, it doesn't build.  I get a handful of:
> 
> prelink.o: in function `cmdline_parse':
> /home/andrew/xen.git/xen/common/kernel.c:216:(.init.text+0x20f2): failed
> to convert GOTPCREL relocation against '__stack_chk_guard'; relink with
> --no-relax
> /home/andrew/xen.git/xen/common/kernel.c:230:(.init.text+0x246f): failed
> to convert GOTPCREL relocation against '__stack_chk_guard'; relink with
> --no-relax
> 
> which is more toolchain-whispering than I feel like doing tonight.

For reference:
https://sourceware.org/pipermail/binutils/2025-January/138631.html

You didn't enter a gcc bug report yet, did you?

Jan
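
For context: with -mstack-protector-guard=global the compiler references a
single global symbol instead of %fs:40, so the hypervisor must provide
roughly the following (a sketch; the series' actual definitions may differ):

/* Canary storage referenced by every instrumented function. */
unsigned long __stack_chk_guard;

/* Called by instrumented function epilogues on canary mismatch. */
void __stack_chk_fail(void)
{
    panic("Stack canary mismatch detected\n");
}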



Re: [XEN PATCH] MAINTAINERS: Change reviewer of the ECLAIR integration

2025-01-14 Thread Roger Pau Monné
On Tue, Jan 14, 2025 at 03:48:54PM +0100, Nicola Vetrini wrote:
> Simone Ballarin is no longer actively involved in reviewing
> the ECLAIR integration for Xen. I am stepping up as a reviewer.
> 
> Signed-off-by: Nicola Vetrini 

Acked-by: Roger Pau Monné 

Adding Simone to the Cc list, it would be helpful if he can also
provide an Ack to signal he is OK with being removed.

Thanks, Roger.



Re: [PATCH v2 02/18] x86/domain: limit window where curr_vcpu != current on context switch

2025-01-14 Thread Jan Beulich
On 09.01.2025 18:33, Roger Pau Monné wrote:
> On Thu, Jan 09, 2025 at 09:59:58AM +0100, Jan Beulich wrote:
>> On 08.01.2025 15:26, Roger Pau Monne wrote:
>>> @@ -2048,8 +2060,6 @@ static void __context_switch(void)
>>>  if ( pd != nd )
>>>  cpumask_clear_cpu(cpu, pd->dirty_cpumask);
>>>  write_atomic(&p->dirty_cpu, VCPU_CPU_CLEAN);
>>> -
>>> -per_cpu(curr_vcpu, cpu) = n;
>>>  }
>>>  
>>>  void context_switch(struct vcpu *prev, struct vcpu *next)
>>> @@ -2081,16 +2091,36 @@ void context_switch(struct vcpu *prev, struct vcpu 
>>> *next)
>>>  
>>>  local_irq_disable();
>>>  
>>> -set_current(next);
>>> -
>>>  if ( (per_cpu(curr_vcpu, cpu) == next) ||
>>>   (is_idle_domain(nextd) && cpu_online(cpu)) )
>>>  {
>>> +/*
>>> + * Lazy context switch to the idle vCPU, set current == idle.  Full
>>> + * context switch happens if/when sync_local_execstate() is called.
>>> + */
>>> +set_current(next);
>>>  local_irq_enable();
>>
>> The comment is misleading as far as the first half of the if() condition 
>> goes:
>> No further switching is going to happen in that case, aiui.
> 
> Right, I should clarify that comment: this is either a lazy context
> switch, or the return from a lazy state to the previously running
> vCPU.
> 
>>>  }
>>>  else
>>>  {
>>> -__context_switch();
>>> +/*
>>> + * curr_vcpu will always point to the currently loaded vCPU 
>>> context, as
>>> + * it's not updated when doing a lazy switch to the idle vCPU.
>>> + */
>>> +struct vcpu *prev_ctx = per_cpu(curr_vcpu, cpu);
>>> +
>>> +if ( prev_ctx != current )
>>> +{
>>> +/*
>>> + * Doing a full context switch to a non-idle vCPU from a lazy
>>> + * context switched state.  Adjust current to point to the
>>> + * currently loaded vCPU context.
>>> + */
>>> +ASSERT(current == idle_vcpu[cpu]);
>>> +ASSERT(!is_idle_vcpu(next));
>>> +set_current(prev_ctx);
>>
>> This feels wrong, as in "current" then not representing what it should 
>> represent,
>> for a certain time window. I may be dense, but neither comment not 
>> description
>> clarify to me why this might be needed. I can see that it's needed to please 
>> the
>> ASSERT() you add to __context_switch(), yet then I might ask why that 
>> assertion
>> is put there.
> 
> This is done so that when calling __context_switch() current ==
> curr_vcpu, and map_domain_page() can be used without getting into an
> infinite sync_local_execstate() recursion loop.

Yet it's the purpose of __context_switch() to bring curr_vcpu in sync
with current. IOW both matching up is supposed to be an exit condition
of the function, not an entry one.

Plus, as indicated when we were talking this through yesterday, the
set_current() here makes "current" no longer point at what - from the
scheduler's perspective - is (supposed to be) the current vCPU.

Aiui this adjustment is the reason for ...

>>> --- a/xen/arch/x86/traps.c
>>> +++ b/xen/arch/x86/traps.c
>>> @@ -2232,8 +2232,6 @@ void __init trap_init(void)
>>>  
>>>  void activate_debugregs(const struct vcpu *curr)
>>>  {
>>> -ASSERT(curr == current);
>>> -
>>>  write_debugreg(0, curr->arch.dr[0]);
>>>  write_debugreg(1, curr->arch.dr[1]);
>>>  write_debugreg(2, curr->arch.dr[2]);
>>
>> Why would this assertion go away? If it suddenly triggers, the parameter name
>> would now end up being wrong.
> 
> Well, at the point where activate_debugregs() gets called (in
> paravirt_ctxt_switch_to()), current == previous as a result of this
> change, so the assert is no longer true on purpose on that call
> path.

... this behavior. Which, as said, feels wrong, at the latest once "curr" is
renamed to no longer suggest it actually is the cached "current". At that
point it'll be dubious whose ->arch.dr[] are actually written into the CPU
registers.

Also let's not forget that there's a 2nd call here, where I very much hope
it continues to be "current" that's being passed in.

Jan



Re: [PATCH v2.1 06/18] x86/pv: set/clear guest GDT mappings using {populate,destroy}_perdomain_mapping()

2025-01-14 Thread Jan Beulich
On 08.01.2025 16:11, Roger Pau Monne wrote:
> The pv_{set,destroy}_gdt() functions rely on the L1 table(s) that contain such
> mappings being stashed in the domain structure, and thus such mappings being
> modified by merely updating the L1 entries.
> 
> Switch both pv_{set,destroy}_gdt() to instead use
> {populate,destory}_perdomain_mapping().

Like for an earlier patch it doesn't really become clear why what is being done
wants / needs doing. I might guess that it's the "stashed in the domain 
structure"
that you ultimately want to get rid of?

> --- a/xen/arch/x86/pv/descriptor-tables.c
> +++ b/xen/arch/x86/pv/descriptor-tables.c
> @@ -49,23 +49,20 @@ bool pv_destroy_ldt(struct vcpu *v)
>  
>  void pv_destroy_gdt(struct vcpu *v)
>  {
> -l1_pgentry_t *pl1e = pv_gdt_ptes(v);
> -mfn_t zero_mfn = _mfn(virt_to_mfn(zero_page));
> -l1_pgentry_t zero_l1e = l1e_from_mfn(zero_mfn, __PAGE_HYPERVISOR_RO);
>  unsigned int i;
>  
>  ASSERT(v == current || !vcpu_cpu_dirty(v));
>  
> -v->arch.pv.gdt_ents = 0;

How can this validly go away?

> -for ( i = 0; i < FIRST_RESERVED_GDT_PAGE; i++ )
> -{
> -mfn_t mfn = l1e_get_mfn(pl1e[i]);
> +if ( v->arch.cr3 )
> +destroy_perdomain_mapping(v, GDT_VIRT_START(v),
> +  ARRAY_SIZE(v->arch.pv.gdt_frames));

How is v->arch.cr3 being non-zero related to the GDT area needing
destroying?

> -if ( (l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) &&
> - !mfn_eq(mfn, zero_mfn) )
> -put_page_and_type(mfn_to_page(mfn));
> +for ( i = 0; i < ARRAY_SIZE(v->arch.pv.gdt_frames); i++)
> +{
> +if ( !v->arch.pv.gdt_frames[i] )
> +break;
>  
> -l1e_write(&pl1e[i], zero_l1e);
> +put_page_and_type(mfn_to_page(_mfn(v->arch.pv.gdt_frames[i])));
>  v->arch.pv.gdt_frames[i] = 0;
>  }
>  }
> @@ -74,8 +71,8 @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[],
> unsigned int entries)
>  {
>  struct domain *d = v->domain;
> -l1_pgentry_t *pl1e;
>  unsigned int i, nr_frames = DIV_ROUND_UP(entries, 512);
> +mfn_t mfns[ARRAY_SIZE(v->arch.pv.gdt_frames)];

Having this array is kind of odd - it'll hold all the same values as
frames[], just under a different type. Considering the further copying
done ...

> @@ -90,6 +87,8 @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[],
>  if ( !mfn_valid(mfn) ||
>   !get_page_and_type(mfn_to_page(mfn), d, PGT_seg_desc_page) )
>  goto fail;
> +
> +mfns[i] = mfn;
>  }
>  
>  /* Tear down the old GDT. */
> @@ -97,12 +96,9 @@ int pv_set_gdt(struct vcpu *v, const unsigned long 
> frames[],
>  
>  /* Install the new GDT. */
>  v->arch.pv.gdt_ents = entries;
> -pl1e = pv_gdt_ptes(v);
>  for ( i = 0; i < nr_frames; i++ )
> -{
>  v->arch.pv.gdt_frames[i] = frames[i];

... here, would it perhaps be an option to change ->arch.pv.gdt_frames[]
to mfn_t[], thus allowing ...

> -l1e_write(&pl1e[i], l1e_from_pfn(frames[i], __PAGE_HYPERVISOR_RW));
> -}
> +populate_perdomain_mapping(v, GDT_VIRT_START(v), mfns, nr_frames);

... that array to be passed into here?
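
I.e. roughly (a sketch only, assuming the suggested field type change):

/* With v->arch.pv.gdt_frames[] declared as mfn_t[], no local array needed. */
v->arch.pv.gdt_ents = entries;
for ( i = 0; i < nr_frames; i++ )
    v->arch.pv.gdt_frames[i] = _mfn(frames[i]);
populate_perdomain_mapping(v, GDT_VIRT_START(v), v->arch.pv.gdt_frames,
                           nr_frames);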

Jan



Re: [PATCH v2 00/18] x86: adventures in Address Space Isolation

2025-01-14 Thread Jan Beulich
On 08.01.2025 15:26, Roger Pau Monne wrote:
> Hello,
> 
> The aim of this series is to introduce the functionality required to
> create linear mappings visible to a single pCPU.
> 
> Doing so requires having a per-vCPU root page-table (L4), and hence
> requires shadowing the guest selected L4 on PV guests.  As follow ups
> (and partially to ensure the per-CPU mappings work fine) the CPU stacks
> are switched to use per-CPU mappings, so that remote stack contents are
> not by default mapped on all page-tables (note: for this to be true the
> directmap entries for the stack pages would need to be removed also).
> 
> There's one known shortcoming with the presented code: migration of PV
> guests using per-vCPU root page-tables is not working.  I need to
> introduce extra logic to deal with PV shadow mode when using unique root
> page-tables.  I don't think this should block the series however, such
> missing functionality can always be added as follow up work.
> paging_domctl() is adjusted to reflect this restriction.
> 
> The main differences compared to v1 are the usage of per-vCPU root page
> tables (as opposed to per-pCPU), and the usage of the existing perdomain
> family of functions to manage the mappings in the per-domain slot, that
> now becomes per-vCPU.
> 
> All patches until 17 are mostly preparatory, I think there's a nice
> cleanup and generalization of the creation and managing of per-domain
> mappings, by no longer storing references to L1 page-tables in the vCPU
> or domain struct.

Since you referred me to the cover letter, I've looked back here after
making some more progress with the series. Along with my earlier comment
towards the need or ultimate goal, ...

> Patch 13 introduces the command line option, and would need discussion
> and integration with the sparse direct map series.  IMO we should get
> consensus on how we want the command line to look ASAP, so that we can
> basic parsing logic in place to be used by both the work here and the
> direct map removal series.
> 
> As part of this series the map_domain_page() helpers are also switched
> to create per-vCPU mappings (see patch 15), which converts an existing
> interface into creating per-vCPU mappings.  Such interface can be used
> to hide (map per-vCPU) further data that we don't want to be part of the
> direct map, or even shared between vCPUs of the same domain.  Also all
> existing users of the interface will already create per-vCPU mappings
> without needing additional changes.
> 
> Note that none of the logic introduced in the series removes entries for
> the directmap, so even when creating the per-CPU mappings the underlying
> physical addresses are fully accessible when using it's direct map
> entries.
> 
> I also haven't done any benchmarking.  Doesn't seem to cripple
> performance up to the point that XenRT jobs would timeout before
> finishing, that's the only objective reference I can provide at the
> moment.
> 
> The series has been extensively tested on XenRT, but that doesn't cover
> all possible use-cases, so it's likely to still have some rough edges,
> handle with care.
> 
> Thanks, Roger.
> 
> Roger Pau Monne (18):
>   x86/mm: purge unneeded destroy_perdomain_mapping()
>   x86/domain: limit window where curr_vcpu != current on context switch
>   x86/mm: introduce helper to detect per-domain L1 entries that need
> freeing
>   x86/pv: introduce function to populate perdomain area and use it to
> map Xen GDT
>   x86/mm: switch destroy_perdomain_mapping() parameter from domain to
> vCPU
>   x86/pv: set/clear guest GDT mappings using
> {populate,destroy}_perdomain_mapping()
>   x86/pv: update guest LDT mappings using the linear entries
>   x86/pv: remove stashing of GDT/LDT L1 page-tables
>   x86/mm: simplify create_perdomain_mapping() interface
>   x86/mm: switch {create,destroy}_perdomain_mapping() domain parameter
> to vCPU
>   x86/pv: untie issuing FLUSH_ROOT_PGTBL from XPTI
>   x86/mm: move FLUSH_ROOT_PGTBL handling before TLB flush
>   x86/spec-ctrl: introduce Address Space Isolation command line option
>   x86/mm: introduce per-vCPU L3 page-table
>   x86/mm: introduce a per-vCPU mapcache when using ASI
>   x86/pv: allow using a unique per-pCPU root page table (L4)
>   x86/mm: switch to a per-CPU mapped stack when using ASI
>   x86/mm: zero stack on context switch
> 
>  docs/misc/xen-command-line.pandoc|  24 +++
>  xen/arch/x86/cpu/mcheck/mce.c|   4 +
>  xen/arch/x86/domain.c| 157 +++
>  xen/arch/x86/domain_page.c   | 105 ++
>  xen/arch/x86/flushtlb.c  |  28 ++-
>  xen/arch/x86/hvm/hvm.c   |   6 -
>  xen/arch/x86/include/asm/config.h|  16 +-
>  xen/arch/x86/include/asm/current.h   |  58 +-
>  xen/arch/x86/include/asm/desc.h  |   6 +-
>  xen/arch/x86/include/asm/domain.h|  50 +++--
>  xen/arch/x86/include/asm/flushtlb.h  |   2 +-
>  xen/arch/x86/include/asm/mm.h|  15 +-
>  xen/arch/x86/incl

[PATCH] riscv: Add initial Xen guest support for RISC-V

2025-01-14 Thread Milan Djokic
From: Slavisa Petrovic 

This patch introduces initial support for running RISC-V as a Xen guest.
It provides the necessary infrastructure and stubs for Xen-specific
operations. Key changes include:

- Modifications to the RISC-V kernel to integrate Xen-specific hypercalls
  and interfaces, with function implementations stubbed for future work.
- Introduction of Xen-specific headers for RISC-V, including event
  handling, hypervisor interaction, and page management. Functions are
  defined but not yet implemented.
- Stub implementations for memory management, grant tables, and context
  switching in the Xen environment, allowing further development and
  integration.

Signed-off-by: Milan Djokic 
Signed-off-by: Slavisa Petrovic 
---
 arch/riscv/Kbuild|   1 +
 arch/riscv/Kconfig   |  19 +++
 arch/riscv/include/asm/cpu.h |   1 +
 arch/riscv/include/asm/hypervisor.h  |   9 ++
 arch/riscv/include/asm/irq.h |   5 +
 arch/riscv/include/asm/sync_bitops.h |  27 
 arch/riscv/include/asm/xen/events.h  |  28 
 arch/riscv/include/asm/xen/hypercall.h   |   2 +
 arch/riscv/include/asm/xen/hypervisor.h  |   2 +
 arch/riscv/include/asm/xen/interface.h   |   2 +
 arch/riscv/include/asm/xen/page.h|   3 +
 arch/riscv/include/asm/xen/swiotlb-xen.h |   2 +
 arch/riscv/xen/Makefile  |   2 +
 arch/riscv/xen/enlighten.c   | 164 +++
 arch/riscv/xen/grant-table.c |  57 
 arch/riscv/xen/hypercall.S   |  71 ++
 arch/riscv/xen/p2m.c |  76 +++
 include/xen/interface/io/protocols.h |   3 +
 include/xen/riscv/hypercall.h|  71 ++
 include/xen/riscv/hypervisor.h   |  26 
 include/xen/riscv/interface.h|  85 
 include/xen/riscv/page.h | 106 +++
 include/xen/riscv/swiotlb-xen.h  |  13 ++
 test.txt |  21 +++
 24 files changed, 796 insertions(+)
 create mode 100644 arch/riscv/include/asm/hypervisor.h
 create mode 100644 arch/riscv/include/asm/sync_bitops.h
 create mode 100644 arch/riscv/include/asm/xen/events.h
 create mode 100644 arch/riscv/include/asm/xen/hypercall.h
 create mode 100644 arch/riscv/include/asm/xen/hypervisor.h
 create mode 100644 arch/riscv/include/asm/xen/interface.h
 create mode 100644 arch/riscv/include/asm/xen/page.h
 create mode 100644 arch/riscv/include/asm/xen/swiotlb-xen.h
 create mode 100644 arch/riscv/xen/Makefile
 create mode 100644 arch/riscv/xen/enlighten.c
 create mode 100644 arch/riscv/xen/grant-table.c
 create mode 100644 arch/riscv/xen/hypercall.S
 create mode 100644 arch/riscv/xen/p2m.c
 create mode 100644 include/xen/riscv/hypercall.h
 create mode 100644 include/xen/riscv/hypervisor.h
 create mode 100644 include/xen/riscv/interface.h
 create mode 100644 include/xen/riscv/page.h
 create mode 100644 include/xen/riscv/swiotlb-xen.h
 create mode 100644 test.txt

diff --git a/arch/riscv/Kbuild b/arch/riscv/Kbuild
index 2c585f7a0b6e..d9b71deed2cd 100644
--- a/arch/riscv/Kbuild
+++ b/arch/riscv/Kbuild
@@ -5,6 +5,7 @@ obj-$(CONFIG_BUILTIN_DTB) += boot/dts/
 obj-$(CONFIG_CRYPTO) += crypto/
 obj-y += errata/
 obj-$(CONFIG_KVM) += kvm/
+obj-$(CONFIG_XEN) += xen/

 obj-$(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY) += purgatory/

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index d4a7ca0388c0..13ea75221524 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -1071,6 +1071,25 @@ config PARAVIRT_TIME_ACCOUNTING

  If in doubt, say N here.

+config XEN_DOM0
+   def_bool y
+   depends on XEN
+
+config XEN
+   bool "Xen guest support on RISCV"
+   depends on RISCV && OF
+   select PARAVIRT
+   help
+ Enables support for running Linux as a Xen guest on RISC-V.
+
+ Xen is a type-1 hypervisor that allows multiple operating systems
+ to run on the same hardware. Enabling this option allows the kernel
+ to function as a guest under the Xen hypervisor on RISC-V platforms.
+
+ Say Y if you want to run this kernel as a guest under Xen on RISC-V.
+
+ If unsure, say N.
+
 config RELOCATABLE
bool "Build a relocatable kernel"
depends on MMU && 64BIT && !XIP_KERNEL
diff --git a/arch/riscv/include/asm/cpu.h b/arch/riscv/include/asm/cpu.h
index 28d45a6678ce..fb2aac6a068e 100644
--- a/arch/riscv/include/asm/cpu.h
+++ b/arch/riscv/include/asm/cpu.h
@@ -4,5 +4,6 @@
 #define _ASM_CPU_H

 /* This header is required unconditionally by the ACPI core */
+#include 

 #endif /* _ASM_CPU_H */
diff --git a/arch/riscv/include/asm/hypervisor.h b/arch/riscv/include/asm/hypervisor.h
new file mode 100644
index ..3a117afe57f0
--- /dev/null
+++ b/arch/riscv/include/asm/hypervisor.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_RISCV_HYPERVISOR_H
+#define _ASM_RISCV_HYPERVISOR_H
+
+#incl

Re: [RFC PATCH v2 01/10] x86: Add architectural LBR definitions

2025-01-14 Thread Jan Beulich
On 02.01.2025 09:45, Tu Dinh wrote:
> --- a/xen/arch/x86/include/asm/msr-index.h
> +++ b/xen/arch/x86/include/asm/msr-index.h
> @@ -112,6 +112,8 @@
>  #define  MCU_OPT_CTRL_GDS_MIT_DIS   (_AC(1, ULL) <<  4)
>  #define  MCU_OPT_CTRL_GDS_MIT_LOCK  (_AC(1, ULL) <<  5)
>  
> +#define MSR_LER_INFO                0x01e0
> +
>  #define MSR_RTIT_OUTPUT_BASE        0x0560
>  #define MSR_RTIT_OUTPUT_MASK        0x0561
>  #define MSR_RTIT_CTL                0x0570
> @@ -193,6 +195,16 @@
>  #define MSR_UARCH_MISC_CTRL 0x1b01
>  #define  UARCH_CTRL_DOITM   (_AC(1, ULL) <<  0)
>  
> +/* Architectural LBR state MSRs */
> +#define MSR_LBR_INFO(n) (0x1200 + (n))
> +#define MSR_LBR_CTL 0x14ce
> +#define  LBR_CTL_VALID  _AC(0x7f000f, ULL)

While I can see that such a value may be useful at some point, I think it wants
introducing when needed and composing of definitions for the individual bits.
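
E.g. (a sketch; the individual bit names are illustrative rather than taken
verbatim from the SDM):

#define  LBR_CTL_LBREN       (_AC(1, ULL) <<  0)
#define  LBR_CTL_OS          (_AC(1, ULL) <<  1)
#define  LBR_CTL_USR         (_AC(1, ULL) <<  2)
#define  LBR_CTL_CALL_STACK  (_AC(1, ULL) <<  3)
#define  LBR_CTL_FILTER_MASK (_AC(0x7f, ULL) << 16)
#define  LBR_CTL_VALID       (LBR_CTL_LBREN | LBR_CTL_OS | LBR_CTL_USR | \
                              LBR_CTL_CALL_STACK | LBR_CTL_FILTER_MASK)

which composes to the same 0x7f000f value.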

Jan



Re: [PATCH v2 10/18] x86/mm: switch {create,destroy}_perdomain_mapping() domain parameter to vCPU

2025-01-14 Thread Jan Beulich
On 08.01.2025 15:26, Roger Pau Monne wrote:
> In preparation for the per-domain area being per-vCPU.  This requires moving
> some of the {create,destroy}_perdomain_mapping() calls to the domain
> initialization and tear down paths into vCPU initialization and tear down.

Am I confused or DYM "s/ to / from /"?

Jan



Re: [RFC PATCH v2 03/10] tools: Add arch LBR feature bits

2025-01-14 Thread Jan Beulich
On 02.01.2025 09:45, Tu Dinh wrote:
> Signed-off-by: Tu Dinh 
> ---
>  tools/libs/light/libxl_cpuid.c | 3 +++
>  tools/misc/xen-cpuid.c | 3 +++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/tools/libs/light/libxl_cpuid.c b/tools/libs/light/libxl_cpuid.c
> index 063fe86eb7..05be36f258 100644
> --- a/tools/libs/light/libxl_cpuid.c
> +++ b/tools/libs/light/libxl_cpuid.c
> @@ -342,6 +342,9 @@ int libxl_cpuid_parse_config(libxl_cpuid_policy_list 
> *policy, const char* str)
>  CPUID_ENTRY(0x0007,  1, CPUID_REG_EDX),
>  MSR_ENTRY(0x10a, CPUID_REG_EAX),
>  MSR_ENTRY(0x10a, CPUID_REG_EDX),
> +CPUID_ENTRY(0x001C, NA, CPUID_REG_EAX),
> +CPUID_ENTRY(0x001C, NA, CPUID_REG_EBX),
> +CPUID_ENTRY(0x001C, NA, CPUID_REG_ECX),
>  #undef MSR_ENTRY
>  #undef CPUID_ENTRY
>  };
> diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
> index 4c4593528d..4f0fb0a6ea 100644
> --- a/tools/misc/xen-cpuid.c
> +++ b/tools/misc/xen-cpuid.c
> @@ -37,6 +37,9 @@ static const struct {
>  { "CPUID 0x0007:1.edx", "7d1" },
>  { "MSR_ARCH_CAPS.lo", "m10Al" },
>  { "MSR_ARCH_CAPS.hi", "m10Ah" },
> +{ "CPUID 0x001c.eax",   "1Ca" },
> +{ "CPUID 0x001c.ebx",   "1Cb" },
> +{ "CPUID 0x001c.ecx",   "1Cc" },
>  };
>  
>  #define COL_ALIGN "24"

I think this needs folding into patch 2. These tables want to stay in sync at 
all
times with what the public interface has.

Jan



Re: [RFC PATCH v2 02/10] x86: Define arch LBR feature bits

2025-01-14 Thread Jan Beulich
On 02.01.2025 09:45, Tu Dinh wrote:
> --- a/xen/arch/x86/include/asm/cpufeature.h
> +++ b/xen/arch/x86/include/asm/cpufeature.h
> @@ -219,6 +219,11 @@ static inline bool boot_cpu_has(unsigned int feat)
>  #define cpu_has_rfds_no boot_cpu_has(X86_FEATURE_RFDS_NO)
>  #define cpu_has_rfds_clear  boot_cpu_has(X86_FEATURE_RFDS_CLEAR)
>  
> +/* CPUID level 0x001c.eax */
> +
> +#define current_cpu_has_lbr_lip cpu_has(&current_cpu_data, \
> +X86_FEATURE_LBR_LIP);

Why would this, unlike all other similar constructs, need to use
current_cpu_data rather than boot_cpu_has()? If there is a reason, it almost
certainly wants naming in the description.

> --- a/xen/include/public/arch-x86/cpufeatureset.h
> +++ b/xen/include/public/arch-x86/cpufeatureset.h
> @@ -284,7 +284,7 @@ XEN_CPUFEATURE(SERIALIZE, 9*32+14) /*A  SERIALIZE 
> insn */
>  XEN_CPUFEATURE(HYBRID,9*32+15) /*   Heterogeneous platform */
>  XEN_CPUFEATURE(TSXLDTRK,  9*32+16) /*a  TSX load tracking suspend/resume 
> insns */
>  XEN_CPUFEATURE(PCONFIG,   9*32+18) /*   PCONFIG instruction */
> -XEN_CPUFEATURE(ARCH_LBR,  9*32+19) /*   Architectural Last Branch Record 
> */
> +XEN_CPUFEATURE(ARCH_LBR,  9*32+19) /*s  Architectural Last Branch Record 
> */

The 's' here (and below) may only be added once all of the respective handling 
is
complete.

> --- a/xen/include/xen/lib/x86/cpu-policy.h
> +++ b/xen/include/xen/lib/x86/cpu-policy.h
> @@ -22,6 +22,9 @@
>  #define FEATURESET_7d1   15 /* 0x0007:1.edx*/
>  #define FEATURESET_m10Al 16 /* 0x010a.eax  */
>  #define FEATURESET_m10Ah 17 /* 0x010a.edx  */
> +#define FEATURESET_1Ca   18 /* 0x001c.eax  */
> +#define FEATURESET_1Cb   19 /* 0x001c.ebx  */
> +#define FEATURESET_1Cc   20 /* 0x001c.ecx  */
>  
>  struct cpuid_leaf
>  {
> @@ -85,7 +88,7 @@ unsigned int x86_cpuid_lookup_vendor(uint32_t ebx, uint32_t 
> ecx, uint32_t edx);
>   */
>  const char *x86_cpuid_vendor_to_str(unsigned int vendor);
>  
> -#define CPUID_GUEST_NR_BASIC  (0xdu + 1)
> +#define CPUID_GUEST_NR_BASIC  (0x1cu + 1)

Are you sure this can be done with no other prep work? I've been sitting
on AMX and AVX10 patches where I need to bump this, too. Yet I continue
to think that something along the lines of the 3-patch series at [1] is
necessary up front.

> @@ -158,6 +161,52 @@ struct cpu_policy
>  uint64_t :64, :64; /* Leaf 0xb - Topology. */
>  uint64_t :64, :64; /* Leaf 0xc - rsvd */
>  uint64_t :64, :64; /* Leaf 0xd - XSTATE. */
> +
> +uint64_t :64, :64; /* Leaf 0xe - rsvd */
> +uint64_t :64, :64; /* Leaf 0xf - rsvd */
> +uint64_t :64, :64; /* Leaf 0x10 - rsvd */
> +uint64_t :64, :64; /* Leaf 0x11 - rsvd */
> +uint64_t :64, :64; /* Leaf 0x12 - rsvd */
> +uint64_t :64, :64; /* Leaf 0x13 - rsvd */
> +uint64_t :64, :64; /* Leaf 0x14 - rsvd */
> +uint64_t :64, :64; /* Leaf 0x15 - rsvd */
> +uint64_t :64, :64; /* Leaf 0x16 - rsvd */
> +uint64_t :64, :64; /* Leaf 0x17 - rsvd */
> +uint64_t :64, :64; /* Leaf 0x18 - rsvd */
> +uint64_t :64, :64; /* Leaf 0x19 - rsvd */
> +uint64_t :64, :64; /* Leaf 0x1a - rsvd */
> +uint64_t :64, :64; /* Leaf 0x1b - rsvd */
> +
> +union {
> +uint32_t _1Ca;
> +struct {
> +uint32_t supported_depths:8;

According to XEN_CPUFEATURE(LBR_DEPTH_...) further up these are 8 individual
bits. Further, is there a reason you don't use here what the additions there
produce in the generated header, but rather re-define the fields from
scratch?

Jan

[1] https://lists.xen.org/archives/html/xen-devel/2024-08/msg00591.html



Re: [RFC PATCH v2 04/10] x86: Calculate arch LBR CPUID policies

2025-01-14 Thread Jan Beulich
On 02.01.2025 09:45, Tu Dinh wrote:
> --- a/xen/arch/x86/cpu-policy.c
> +++ b/xen/arch/x86/cpu-policy.c
> @@ -190,6 +190,16 @@ static void sanitise_featureset(uint32_t *fs)
>  }
>  }
>  
> +static void recalculate_arch_lbr(struct cpu_policy *p)
> +{
> +if ( p->basic.max_leaf < 0x1c ||
> + !(cpu_policy_xstates(&host_cpu_policy) & X86_XSS_LBR) ||

A reference to the host policy looks questionable here. If you ...

> + p->basic.lbr_1Ca.supported_depths == 0)
> +p->feat.arch_lbr = 0;
> +if ( !p->feat.arch_lbr )
> +p->basic.raw[0x1c] = EMPTY_LEAF;
> +}
> +
>  static void recalculate_xstate(struct cpu_policy *p)
>  {
>  uint64_t xstates = XSTATE_FP_SSE;
> @@ -219,6 +229,9 @@ static void recalculate_xstate(struct cpu_policy *p)
>  if ( p->feat.amx_tile )
>  xstates |= X86_XCR0_TILE_CFG | X86_XCR0_TILE_DATA;
>  
> +if ( p->feat.arch_lbr )
> +xstates |= X86_XSS_LBR;
> +
>  /* Subleaf 0 */
>  p->xstate.max_size =
>  xstate_uncompressed_size(xstates & ~XSTATE_XSAVES_ONLY);
> @@ -271,6 +284,8 @@ static void recalculate_misc(struct cpu_policy *p)
>  
>  p->basic.raw[0xc] = EMPTY_LEAF;
>  
> +zero_leaves(p->basic.raw, 0xe, 0x1b);
> +
>  p->extd.e1d &= ~CPUID_COMMON_1D_FEATURES;
>  
>  /* Most of Power/RAS hidden from guests. */
> @@ -630,6 +645,7 @@ static void __init calculate_pv_max_policy(void)
>  
>  sanitise_featureset(fs);
>  x86_cpu_featureset_to_policy(fs, p);
> +recalculate_arch_lbr(p);
>  recalculate_xstate(p);

... flipped the order of these calls, couldn't you use the respective bit
from the correct policy? Then again you need p->feat.arch_lbr to actually
set X86_XSS_LBR. Still the aspect in recalculate_arch_lbr() wants sorting
somehow.

Jan



Re: [RFC PATCH v2 05/10] x86: Keep a copy of XSAVE area size

2025-01-14 Thread Jan Beulich
On 02.01.2025 09:45, Tu Dinh wrote:
> Signed-off-by: Tu Dinh 

This needs a non-empty description to clarify why this would be needed.

Jan

> --- a/xen/arch/x86/include/asm/domain.h
> +++ b/xen/arch/x86/include/asm/domain.h
> @@ -638,6 +638,7 @@ struct arch_vcpu
>   * #NM handler, we XRSTOR the states we XSAVE-ed;
>   */
>  struct xsave_struct *xsave_area;
> +unsigned int xsave_area_size;
>  uint64_t xcr0;
>  /* Accumulated eXtended features mask for using XSAVE/XRESTORE by Xen
>   * itself, as we can never know whether guest OS depends on content
> diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
> index af9e345a7a..baae8e1a13 100644
> --- a/xen/arch/x86/xstate.c
> +++ b/xen/arch/x86/xstate.c
> @@ -550,6 +550,7 @@ int xstate_alloc_save_area(struct vcpu *v)
>  save_area->fpu_sse.mxcsr = MXCSR_DEFAULT;
>  
>  v->arch.xsave_area = save_area;
> +v->arch.xsave_area_size = size;
>  v->arch.xcr0 = 0;
>  v->arch.xcr0_accum = 0;
>  




Re: [PATCH] riscv: Add initial Xen guest support for RISC-V

2025-01-14 Thread Anup Patel
On Tue, Jan 14, 2025 at 9:41 PM Milan Djokic  wrote:
>
> From: Slavisa Petrovic 
>
> This patch introduces initial support for running RISC-V as a Xen guest.
> It provides the necessary infrastructure and stubs for Xen-specific
> operations. Key changes include:
>
> - Modifications to the RISC-V kernel to integrate Xen-specific hypercalls
>   and interfaces, with function implementations stubbed for future work.
> - Introduction of Xen-specific headers for RISC-V, including event
>   handling, hypervisor interaction, and page management. Functions are
>   defined but not yet implemented.
> - Stub implementations for memory management, grant tables, and context
>   switching in the Xen environment, allowing further development and
>   integration.
>
> Signed-off-by: Milan Djokic 
> Signed-off-by: Slavisa Petrovic 

A single patch with many changes is hard to review.

Please convert this patch into a series with smaller patches.
Also, include a cover letter in the series explaining how to
test Xen on RISC-V.

Regards,
Anup

> ---
>  arch/riscv/Kbuild|   1 +
>  arch/riscv/Kconfig   |  19 +++
>  arch/riscv/include/asm/cpu.h |   1 +
>  arch/riscv/include/asm/hypervisor.h  |   9 ++
>  arch/riscv/include/asm/irq.h |   5 +
>  arch/riscv/include/asm/sync_bitops.h |  27 
>  arch/riscv/include/asm/xen/events.h  |  28 
>  arch/riscv/include/asm/xen/hypercall.h   |   2 +
>  arch/riscv/include/asm/xen/hypervisor.h  |   2 +
>  arch/riscv/include/asm/xen/interface.h   |   2 +
>  arch/riscv/include/asm/xen/page.h|   3 +
>  arch/riscv/include/asm/xen/swiotlb-xen.h |   2 +
>  arch/riscv/xen/Makefile  |   2 +
>  arch/riscv/xen/enlighten.c   | 164 +++
>  arch/riscv/xen/grant-table.c |  57 
>  arch/riscv/xen/hypercall.S   |  71 ++
>  arch/riscv/xen/p2m.c |  76 +++
>  include/xen/interface/io/protocols.h |   3 +
>  include/xen/riscv/hypercall.h|  71 ++
>  include/xen/riscv/hypervisor.h   |  26 
>  include/xen/riscv/interface.h|  85 
>  include/xen/riscv/page.h | 106 +++
>  include/xen/riscv/swiotlb-xen.h  |  13 ++
>  test.txt |  21 +++
>  24 files changed, 796 insertions(+)
>  create mode 100644 arch/riscv/include/asm/hypervisor.h
>  create mode 100644 arch/riscv/include/asm/sync_bitops.h
>  create mode 100644 arch/riscv/include/asm/xen/events.h
>  create mode 100644 arch/riscv/include/asm/xen/hypercall.h
>  create mode 100644 arch/riscv/include/asm/xen/hypervisor.h
>  create mode 100644 arch/riscv/include/asm/xen/interface.h
>  create mode 100644 arch/riscv/include/asm/xen/page.h
>  create mode 100644 arch/riscv/include/asm/xen/swiotlb-xen.h
>  create mode 100644 arch/riscv/xen/Makefile
>  create mode 100644 arch/riscv/xen/enlighten.c
>  create mode 100644 arch/riscv/xen/grant-table.c
>  create mode 100644 arch/riscv/xen/hypercall.S
>  create mode 100644 arch/riscv/xen/p2m.c
>  create mode 100644 include/xen/riscv/hypercall.h
>  create mode 100644 include/xen/riscv/hypervisor.h
>  create mode 100644 include/xen/riscv/interface.h
>  create mode 100644 include/xen/riscv/page.h
>  create mode 100644 include/xen/riscv/swiotlb-xen.h
>  create mode 100644 test.txt
>
> diff --git a/arch/riscv/Kbuild b/arch/riscv/Kbuild
> index 2c585f7a0b6e..d9b71deed2cd 100644
> --- a/arch/riscv/Kbuild
> +++ b/arch/riscv/Kbuild
> @@ -5,6 +5,7 @@ obj-$(CONFIG_BUILTIN_DTB) += boot/dts/
>  obj-$(CONFIG_CRYPTO) += crypto/
>  obj-y += errata/
>  obj-$(CONFIG_KVM) += kvm/
> +obj-$(CONFIG_XEN) += xen/
>
>  obj-$(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY) += purgatory/
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index d4a7ca0388c0..13ea75221524 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -1071,6 +1071,25 @@ config PARAVIRT_TIME_ACCOUNTING
>
>   If in doubt, say N here.
>
> +config XEN_DOM0
> +   def_bool y
> +   depends on XEN
> +
> +config XEN
> +   bool "Xen guest support on RISCV"
> +   depends on RISCV && OF
> +   select PARAVIRT
> +   help
> + Enables support for running Linux as a Xen guest on RISC-V.
> +
> + Xen is a type-1 hypervisor that allows multiple operating systems
> + to run on the same hardware. Enabling this option allows the kernel
> + to function as a guest under the Xen hypervisor on RISC-V platforms.
> +
> + Say Y if you want to run this kernel as a guest under Xen on RISC-V.
> +
> + If unsure, say N.
> +
>  config RELOCATABLE
> bool "Build a relocatable kernel"
> depends on MMU && 64BIT && !XIP_KERNEL
> diff --git a/arch/riscv/include/asm/cpu.h b/arch/riscv/include/asm/cpu.h
> index 28d45a6678ce..fb2aac6a068e 100644
> --- a/arch/riscv/include/asm/cpu.h
> +++ b/arch/riscv/includ

Re: [XEN PATCH] MAINTAINERS: Change reviewer of the ECLAIR integration

2025-01-14 Thread Simone Ballarin

On 2025-01-14 16:00, Roger Pau Monné wrote:

On Tue, Jan 14, 2025 at 03:48:54PM +0100, Nicola Vetrini wrote:

Simone Ballarin is no longer actively involved in reviewing
the ECLAIR integration for Xen. I am stepping up as a reviewer.

Signed-off-by: Nicola Vetrini 


Acked-by: Roger Pau Monné 



Acked-by: Simone Ballarin 


Adding Simone to the Cc list, it would be helpful if he can also
provide an Ack to signal he is OK with being removed.

Thanks, Roger.





Re: [PATCH v3 0/3] Add stack protector

2025-01-14 Thread Andrew Cooper
On 14/01/2025 1:22 pm, Jan Beulich wrote:
> On 12.12.2024 02:17, Andrew Cooper wrote:
>> On 12/12/2024 12:13 am, Volodymyr Babchuk wrote:
>>> Hello Jan,
>>>
>>> Jan Beulich  writes:
>>>
 On 11.12.2024 03:04, Volodymyr Babchuk wrote:
> Both GCC and Clang support -fstack-protector feature, which add stack
> canaries to functions where stack corruption is possible. This series
> makes possible to use this feature in Xen. I tested this on ARM64 and
> it is working as intended. Tested both with GCC and Clang.
>
> It is hard to enable this feature on x86, as GCC stores stack canary
> in %fs:40 by default, but Xen can't use %fs for various reasons. It is
> possible to change the stack canary location with newer GCC versions, but
> this will change minimal GCC requirement, which is also hard due to
> various reasons. So, this series focus mostly on ARM and RISCV.
 Why exactly would it not be possible to offer the feature when new enough
 gcc is in use?
>>> It is possible to use this feature with a modern enough GCC, yes. Are
>>> you suggesting to make HAS_STACK_PROTECTOR dependent on GCC_VERSION for
>>> x86 platform?
>> (With the knowledge that this is a disputed Kconfig pattern, and will
>> need rebasing), the way I want this to work is simply:
>>
>> diff --git a/xen/Makefile b/xen/Makefile
>> index 0de0101fd0bf..5d0a88fb3c3f 100644
>> --- a/xen/Makefile
>> +++ b/xen/Makefile
>> @@ -434,6 +434,9 @@ endif
>>  
>>  ifeq ($(CONFIG_STACK_PROTECTOR),y)
>>  CFLAGS += -fstack-protector
>> +ifeq ($(CONFIG_X86),y)
>> +CFLAGS += -mstack-protector-guard=global
>> +endif
>>  else
>>  CFLAGS += -fno-stack-protector
>>  endif
>> diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
>> index 9cdd04721afa..7951ca908b36 100644
>> --- a/xen/arch/x86/Kconfig
>> +++ b/xen/arch/x86/Kconfig
>> @@ -28,6 +28,7 @@ config X86
>>     select HAS_PCI_MSI
>>     select HAS_PIRQ
>>     select HAS_SCHED_GRANULARITY
>> +   select HAS_STACK_PROTECTOR if
>> $(cc-option,-mstack-protector-guard=global)
>>     select HAS_UBSAN
>>     select HAS_VMAP
>>     select HAS_VPCI if HVM
>>
>>
>>
>> Sadly, it doesn't build.  I get a handful of:
>>
>> prelink.o: in function `cmdline_parse':
>> /home/andrew/xen.git/xen/common/kernel.c:216:(.init.text+0x20f2): failed
>> to convert GOTPCREL relocation against '__stack_chk_guard'; relink with
>> --no-relax
>> /home/andrew/xen.git/xen/common/kernel.c:230:(.init.text+0x246f): failed
>> to convert GOTPCREL relocation against '__stack_chk_guard'; relink with
>> --no-relax
>>
>> which is more toolchain-whispering than I feel like doing tonight.
> For reference:
> https://sourceware.org/pipermail/binutils/2025-January/138631.html
>
> You didn't enter a gcc bug report yet, did you?

No, not yet.  I'm afraid I've not had the time.

~Andrew



[PATCH v2] xl: properly dispose of libxl_dominfo struct instances

2025-01-14 Thread Jan Beulich
The ssid_label field requires separate freeing; make sure to call
libxl_dominfo_dispose() as well as libxl_dominfo_init(). Since vcpuset()
calls only the former, add a call to the latter there at the same time.

Coverity-ID: 1638727
Coverity-ID: 1638728
Fixes: c458c404da16 ("xl: use libxl_domain_info to get the uuid in printf_info")
Fixes: 48dab9767d2e ("tools/xl: use libxl_domain_info to get domain type for vcpu-pin")
Signed-off-by: Jan Beulich 
Release-Acked-by: Oleksii Kurochko
---
v2: Add call to libxl_dominfo_init() to vcpuset().

--- a/tools/xl/xl_sxp.c
+++ b/tools/xl/xl_sxp.c
@@ -45,8 +45,10 @@ void printf_info_sexp(int domid, libxl_d
 /* retrieve the UUID from dominfo, since it is probably generated
  * during parsing and thus does not match the real one
  */
+libxl_dominfo_init(&info);
 if (libxl_domain_info(ctx, &info, domid) == 0) {
 fprintf(fh, "\t(uuid " LIBXL_UUID_FMT ")\n", LIBXL_UUID_BYTES(info.uuid));
+libxl_dominfo_dispose(&info);
 } else {
 fprintf(fh, "\t(uuid )\n");
 }
--- a/tools/xl/xl_vcpu.c
+++ b/tools/xl/xl_vcpu.c
@@ -286,6 +286,8 @@ int main_vcpupin(int argc, char **argv)
 if (!ignore_masks && hard) {
 libxl_dominfo dominfo;
 
+libxl_dominfo_init(&dominfo);
+
 if (libxl_domain_info(ctx, &dominfo, domid)) {
 fprintf(stderr, "Could not get domain info\n");
 goto out;
@@ -293,6 +295,8 @@ int main_vcpupin(int argc, char **argv)
 
 /* HVM and PVH domains use the same global affinity mask */
 apply_global_affinity_masks(dominfo.domain_type, hard, 1);
+
+libxl_dominfo_dispose(&dominfo);
 }
 
 if (force) {
@@ -348,6 +352,7 @@ static int vcpuset(uint32_t domid, const
 unsigned int online_vcpus, host_cpu = libxl_get_max_cpus(ctx);
 libxl_dominfo dominfo;
 
+libxl_dominfo_init(&dominfo);
 if (libxl_domain_info(ctx, &dominfo, domid))
 return 1;
 



Re: [PATCH v2 15/25] drm/omapdrm: Compute dumb-buffer sizes with drm_mode_size_dumb()

2025-01-14 Thread Tomi Valkeinen

Hi,

On 09/01/2025 16:57, Thomas Zimmermann wrote:

Call drm_mode_size_dumb() to compute dumb-buffer scanline pitch and
buffer size. Align the pitch to a multiple of 8.

Signed-off-by: Thomas Zimmermann 
Cc: Tomi Valkeinen 
---
  drivers/gpu/drm/omapdrm/omap_gem.c | 15 +++
  1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/omapdrm/omap_gem.c b/drivers/gpu/drm/omapdrm/omap_gem.c
index b9c67e4ca360..b8413a2dcdeb 100644
--- a/drivers/gpu/drm/omapdrm/omap_gem.c
+++ b/drivers/gpu/drm/omapdrm/omap_gem.c
@@ -11,6 +11,7 @@
  #include 
  #include 
  
+#include 

  #include 
  #include 
  
@@ -583,15 +584,13 @@ static int omap_gem_object_mmap(struct drm_gem_object *obj, struct vm_area_struc

  int omap_gem_dumb_create(struct drm_file *file, struct drm_device *dev,
struct drm_mode_create_dumb *args)
  {
-   union omap_gem_size gsize;
-
-   args->pitch = DIV_ROUND_UP(args->width * args->bpp, 8);
-
-   args->size = PAGE_ALIGN(args->pitch * args->height);
+   union omap_gem_size gsize = { };
+   int ret;
  
-	gsize = (union omap_gem_size){
-		.bytes = args->size,
-	};
+   ret = drm_mode_size_dumb(dev, args, SZ_8, 0);
+   if (ret)
+   return ret;
+   gsize.bytes = args->size;
  
 	return omap_gem_new_handle(dev, file, gsize,
 				   OMAP_BO_SCANOUT | OMAP_BO_WC, &args->handle);


Tested on dra76 evm.

Reviewed-by: Tomi Valkeinen 

 Tomi




Re: [PATCH v3 0/3] Add stack protector

2025-01-14 Thread Andrew Cooper
On 14/01/2025 1:47 pm, Jan Beulich wrote:
> On 14.01.2025 14:28, Andrew Cooper wrote:
>> On 14/01/2025 1:22 pm, Jan Beulich wrote:
>>> On 12.12.2024 02:17, Andrew Cooper wrote:
 On 12/12/2024 12:13 am, Volodymyr Babchuk wrote:
> Hello Jan,
>
> Jan Beulich  writes:
>
>> On 11.12.2024 03:04, Volodymyr Babchuk wrote:
>>> Both GCC and Clang support -fstack-protector feature, which add stack
>>> canaries to functions where stack corruption is possible. This series
>>> makes possible to use this feature in Xen. I tested this on ARM64 and
>>> it is working as intended. Tested both with GCC and Clang.
>>>
>>> It is hard to enable this feature on x86, as GCC stores stack canary
>>> in %fs:40 by default, but Xen can't use %fs for various reasons. It is
>>> possible to change the stack canary location with newer GCC versions, but
>>> this will change minimal GCC requirement, which is also hard due to
>>> various reasons. So, this series focus mostly on ARM and RISCV.
>> Why exactly would it not be possible to offer the feature when new enough
>> gcc is in use?
> It is possible to use this feature with a modern enough GCC, yes. Are
> you suggesting to make HAS_STACK_PROTECTOR dependent on GCC_VERSION for
> x86 platform?
 (With the knowledge that this is a disputed Kconfig pattern, and will
 need rebasing), the way I want this to work is simply:

 diff --git a/xen/Makefile b/xen/Makefile
 index 0de0101fd0bf..5d0a88fb3c3f 100644
 --- a/xen/Makefile
 +++ b/xen/Makefile
 @@ -434,6 +434,9 @@ endif
  
  ifeq ($(CONFIG_STACK_PROTECTOR),y)
  CFLAGS += -fstack-protector
 +ifeq ($(CONFIG_X86),y)
 +CFLAGS += -mstack-protector-guard=global
 +endif
  else
  CFLAGS += -fno-stack-protector
  endif
 diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
 index 9cdd04721afa..7951ca908b36 100644
 --- a/xen/arch/x86/Kconfig
 +++ b/xen/arch/x86/Kconfig
 @@ -28,6 +28,7 @@ config X86
     select HAS_PCI_MSI
     select HAS_PIRQ
     select HAS_SCHED_GRANULARITY
 +   select HAS_STACK_PROTECTOR if
 $(cc-option,-mstack-protector-guard=global)
     select HAS_UBSAN
     select HAS_VMAP
     select HAS_VPCI if HVM



 Sadly, it doesn't build.  I get a handful of:

 prelink.o: in function `cmdline_parse':
 /home/andrew/xen.git/xen/common/kernel.c:216:(.init.text+0x20f2): failed
 to convert GOTPCREL relocation against '__stack_chk_guard'; relink with
 --no-relax
 /home/andrew/xen.git/xen/common/kernel.c:230:(.init.text+0x246f): failed
 to convert GOTPCREL relocation against '__stack_chk_guard'; relink with
 --no-relax

 which is more toolchain-whispering than I feel like doing tonight.
>>> For reference:
>>> https://sourceware.org/pipermail/binutils/2025-January/138631.html
>>>
>>> You didn't enter a gcc bug report yet, did you?
>> No, not yet.  I'm afraid I've not had the time.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118473

Thankyou.

~Andrew



[XEN PATCH] MAINTAINERS: Change reviewer of the ECLAIR integration

2025-01-14 Thread Nicola Vetrini
Simone Ballarin is no longer actively involved in reviewing
the ECLAIR integration for Xen. I am stepping up as a reviewer.

Signed-off-by: Nicola Vetrini 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 392f780f7617..c11b82eca98f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -297,7 +297,7 @@ F:  xen/include/xen/device_tree.h
 F: xen/drivers/passthrough/device_tree.c
 
 ECLAIR
-R: Simone Ballarin 
+R: Nicola Vetrini 
 S: Supported
 F: automation/eclair_analysis/
 F: automation/scripts/eclair
-- 
2.43.0




Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock

2025-01-14 Thread Roger Pau Monné
On Tue, Jan 14, 2025 at 03:23:21PM +0100, Oleksii Kurochko wrote:
> 
> On 1/14/25 1:13 PM, Roger Pau Monné wrote:
> > On Tue, Jan 14, 2025 at 12:44:39PM +0100, Oleksii Kurochko wrote:
> > > On 1/14/25 12:40 PM, Oleksii Kurochko wrote:
> > > > 
> > > > On 1/14/25 12:27 PM, Roger Pau Monné wrote:
> > > > > On Tue, Jan 14, 2025 at 12:12:03PM +0100, Oleksii Kurochko wrote:
> > > > > > On 1/13/25 6:52 PM, Roger Pau Monné wrote:
> > > > > > > On Mon, Jan 13, 2025 at 05:07:55PM +0100, Marek 
> > > > > > > Marczykowski-Górecki wrote:
> > > > > > > > On Fri, Sep 13, 2024 at 09:59:06AM +0200, Roger Pau Monne wrote:
> > > > > > > > > Allow setting the used wallclock from the command line.  When 
> > > > > > > > > the option is set
> > > > > > > > > to a value different than `auto` the probing is bypassed and 
> > > > > > > > > the selected
> > > > > > > > > implementation is used (as long as it's available).
> > > > > > > > > 
> > > > > > > > > The `xen` and `efi` options require being booted as a Xen 
> > > > > > > > > guest (with Xen guest
> > > > > > > > > supported built-in) or from UEFI firmware respectively.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Roger Pau Monné
> > > > > > > > Reviewed-by: Marek 
> > > > > > > > Marczykowski-Górecki
> > > > > > > Thanks for the review.
> > > > > > > 
> > > > > > > Oleksii, can I get your opinion as Release Manager about whether 
> > > > > > > this
> > > > > > > (and the following patch) would be suitable for committing to 
> > > > > > > staging
> > > > > > > given the current release state?
> > > > > > > 
> > > > > > > It's a workaround for broken EFI implementations that many 
> > > > > > > downstreams
> > > > > > > carry on their patch queue.
> > > > > > Based on your commit message, I understand this as addressing a bug 
> > > > > > ( but not very critical
> > > > > > as IIUC downstreams have a similar patch on their side ). 
> > > > > > Therefore, if it has been properly
> > > > > > reviewed and tested, we should consider including it in the current 
> > > > > > release.
> > > > > IIRC at least Qubes, XenServer and XCP-ng have a patch that achieves
> > > > > the same behavior as proposed here.
> > > > > 
> > > > > > IIUC, setting the wallclock to EFI should align with the behavior 
> > > > > > Xen had previously.
> > > > > > It might be preferable to use that argument as the default for a 
> > > > > > while, allowing others to verify the "auto"
> > > > > > value over time. After that, we could consider making "auto" the 
> > > > > > default.
> > > > > > That said, I am not particularly strict about following this 
> > > > > > approach.
> > > > > We cannot really set efi as the default, as it would break when
> > > > > booting on legacy BIOS systems.
> > > > > 
> > > > > We could take only patch 1 and leave patch 2 after Xen 4.20 has
> > > > > branched, but at that point I would see little benefit in having just
> > > > > patch 1.
> > > > I don't see a lot of benefit of committing only the one patch either.
> > > > 
> > > > 
> > > > > I don't have a strong opinion, but downstreams have been complaining
> > > > > about Xen behavior regarding the usage of EFI_GET_TIME, so it might be
> > > > > good to not ship yet another release with such allegedly broken
> > > > > behavior.
> > > > Agree with that. As I mentioned above, I consider it a bug, and given
> > > > that several of the downstreams mentioned above carry a similar patch,
> > > > I could consider it a tested approach, so ..
> > > > > Let me know what you think, as I would need a formal Release-Ack if
> > > > > this is to be committed.
> > > > ... R-Acked-by: Oleksii Kurochko.
> > > Sorry for the noise.
> > > 
> > > I forgot to add that, IMO, it would also be nice to mention in the
> > > CHANGELOG that we now have the wallclock parameter, so downstreams will
> > > receive "notification" that Xen provides a workaround for the issue
> > > mentioned in the patch series.
> > Indeed.  Would you be OK with adding the following chunk to patch 2?
> > 
> > diff --git a/CHANGELOG.md b/CHANGELOG.md
> > index 8507e6556a56..1de1d1eca17f 100644
> > --- a/CHANGELOG.md
> > +++ b/CHANGELOG.md
> > @@ -12,6 +12,7 @@ The format is based on [Keep a 
> > Changelog](https://keepachangelog.com/en/1.0.0/)
> >  leaving this to the guest kernel to do in guest context.
> >- On x86:
> >  - Prefer ACPI reboot over UEFI ResetSystem() run time service call.
> > +   - Prefer CMOS over EFI_GET_TIME as time source.
> >  - Switched the xAPIC flat driver to use physical destination mode for 
> > external
> >interrupts instead of logical destination mode.
> > @@ -24,6 +25,7 @@ The format is based on [Keep a 
> > Changelog](https://keepachangelog.com/en/1.0.0/)
> >  - Support for LLC (Last Level Cache) coloring.
> >- On x86:
> >  - xl suspend/resume subcommands.
> > +   - `wallclock` command line option to select time source.
> 
> What about of adding word 'Add' before `wallclock`?

It's i

Re: [PATCH v2] xl: properly dispose of libxl_dominfo struct instances

2025-01-14 Thread Andrew Cooper
On 14/01/2025 1:29 pm, Jan Beulich wrote:
> The ssid_label field requires separate freeing; make sure to call
> libxl_dominfo_dispose() as well as libxl_dominfo_init(). Since vcpuset()
> calls only the former, add a call to the latter there at the same time.
>
> Coverity-ID: 1638727
> Coverity-ID: 1638728
> Fixes: c458c404da16 ("xl: use libxl_domain_info to get the uuid in 
> printf_info")
> Fixes: 48dab9767d2e ("tools/xl: use libxl_domain_info to get domain type for 
> vcpu-pin")
> Signed-off-by: Jan Beulich 
> Release-Acked-by: Oleksii Kurochko
> ---
> v2: Add call to libxl_dominfo_init() to vcpuset().

Reviewed-by: Andrew Cooper 



Re: [PATCH v3 0/3] Add stack protector

2025-01-14 Thread Jan Beulich
On 14.01.2025 14:28, Andrew Cooper wrote:
> On 14/01/2025 1:22 pm, Jan Beulich wrote:
>> On 12.12.2024 02:17, Andrew Cooper wrote:
>>> On 12/12/2024 12:13 am, Volodymyr Babchuk wrote:
 Hello Jan,

 Jan Beulich  writes:

> On 11.12.2024 03:04, Volodymyr Babchuk wrote:
>> Both GCC and Clang support -fstack-protector feature, which add stack
>> canaries to functions where stack corruption is possible. This series
>> makes possible to use this feature in Xen. I tested this on ARM64 and
>> it is working as intended. Tested both with GCC and Clang.
>>
>> It is hard to enable this feature on x86, as GCC stores stack canary
>> in %fs:40 by default, but Xen can't use %fs for various reasons. It is
>> possible to change the stack canary location with newer GCC versions, but
>> this will change the minimal GCC requirement, which is also hard due to
>> various reasons. So, this series focuses mostly on ARM and RISCV.
> Why exactly would it not be possible to offer the feature when new enough
> gcc is in use?
 It is possible to use this feature with a modern enough GCC, yes. Are
 you suggesting to make HAS_STACK_PROTECTOR dependent on GCC_VERSION for
 x86 platform?
>>> (With the knowledge that this is a disputed Kconfig pattern, and will
>>> need rebasing), the way I want this to work is simply:
>>>
>>> diff --git a/xen/Makefile b/xen/Makefile
>>> index 0de0101fd0bf..5d0a88fb3c3f 100644
>>> --- a/xen/Makefile
>>> +++ b/xen/Makefile
>>> @@ -434,6 +434,9 @@ endif
>>>  
>>>  ifeq ($(CONFIG_STACK_PROTECTOR),y)
>>>  CFLAGS += -fstack-protector
>>> +ifeq ($(CONFIG_X86),y)
>>> +CFLAGS += -mstack-protector-guard=global
>>> +endif
>>>  else
>>>  CFLAGS += -fno-stack-protector
>>>  endif
>>> diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
>>> index 9cdd04721afa..7951ca908b36 100644
>>> --- a/xen/arch/x86/Kconfig
>>> +++ b/xen/arch/x86/Kconfig
>>> @@ -28,6 +28,7 @@ config X86
>>>     select HAS_PCI_MSI
>>>     select HAS_PIRQ
>>>     select HAS_SCHED_GRANULARITY
>>> +   select HAS_STACK_PROTECTOR if
>>> $(cc-option,-mstack-protector-guard=global)
>>>     select HAS_UBSAN
>>>     select HAS_VMAP
>>>     select HAS_VPCI if HVM
>>>
>>>
>>>
>>> Sadly, it doesn't build.  I get a handful of:
>>>
>>> prelink.o: in function `cmdline_parse':
>>> /home/andrew/xen.git/xen/common/kernel.c:216:(.init.text+0x20f2): failed
>>> to convert GOTPCREL relocation against '__stack_chk_guard'; relink with
>>> --no-relax
>>> /home/andrew/xen.git/xen/common/kernel.c:230:(.init.text+0x246f): failed
>>> to convert GOTPCREL relocation against '__stack_chk_guard'; relink with
>>> --no-relax
>>>
>>> which is more toolchain-whispering than I feel like doing tonight.
>> For reference:
>> https://sourceware.org/pipermail/binutils/2025-January/138631.html
>>
>> You didn't enter a gcc bug report yet, did you?
> 
> No, not yet.  I'm afraid I've not had the time.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118473

Jan




Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock

2025-01-14 Thread Oleksii Kurochko


On 1/14/25 1:13 PM, Roger Pau Monné wrote:

On Tue, Jan 14, 2025 at 12:44:39PM +0100, Oleksii Kurochko wrote:

On 1/14/25 12:40 PM, Oleksii Kurochko wrote:


On 1/14/25 12:27 PM, Roger Pau Monné wrote:

On Tue, Jan 14, 2025 at 12:12:03PM +0100, Oleksii Kurochko wrote:

On 1/13/25 6:52 PM, Roger Pau Monné wrote:

On Mon, Jan 13, 2025 at 05:07:55PM +0100, Marek Marczykowski-Górecki wrote:

On Fri, Sep 13, 2024 at 09:59:06AM +0200, Roger Pau Monne wrote:

Allow setting the used wallclock from the command line.  When the option is set
to a value different than `auto` the probing is bypassed and the selected
implementation is used (as long as it's available).

The `xen` and `efi` options require being booted as a Xen guest (with Xen guest
supported built-in) or from UEFI firmware respectively.

Signed-off-by: Roger Pau Monné

Reviewed-by: Marek Marczykowski-Górecki

Thanks for the review.

Oleksii, can I get your opinion as Release Manager about whether this
(and the following patch) would be suitable for committing to staging
given the current release state?

It's a workaround for broken EFI implementations that many downstreams
carry on their patch queue.

Based on your commit message, I understand this as addressing a bug ( but not
very critical, as IIUC downstreams have a similar patch on their side ).
Therefore, if it has been properly reviewed and tested, we should consider
including it in the current release.

IIRC at least Qubes, XenServer and XCP-ng have a patch that achieves
the same behavior as proposed here.


IIUC, setting the wallclock to EFI should align with the behavior Xen had 
previously.
It might be preferable to use that argument as the default for a while, allowing others 
to verify the "auto"
value over time. After that, we could consider making "auto" the default.
That said, I am not particularly strict about following this approach.

We cannot really set efi as the default, as it would break when
booting on legacy BIOS systems.

We could take only patch 1 and leave patch 2 until after Xen 4.20 has
branched, but at that point I would see little benefit in having just
patch 1.

I don't see a lot of benefit in committing only the one patch either.



I don't have a strong opinion, but downstreams have been complaining
about Xen behavior regarding the usage of EFI_GET_TIME, so it might be
good to not ship yet another release with such allegedly broken
behavior.

Agree with that. As I mentioned above, I consider it a bug, and since several
of the downstreams mentioned above carry a similar patch, I would consider the
approach tested, so ..

Let me know what you think, as I would need a formal Release-Ack if
this is to be committed.

... R-Acked-by: Oleksii Kurochko.

Sorry for the noise.

I forgot to add that IMO it would also be nice to mention ( that we now have
a wallclock parameter ) in the CHANGELOG, so downstreams will receive a
"notification" that Xen provides a workaround for the issue mentioned in the
patch series.

Indeed.  Would you be OK with adding the following chunk to patch 2?

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 8507e6556a56..1de1d1eca17f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -12,6 +12,7 @@ The format is based on [Keep a 
Changelog](https://keepachangelog.com/en/1.0.0/)
 leaving this to the guest kernel to do in guest context.
   - On x86:
 - Prefer ACPI reboot over UEFI ResetSystem() run time service call.
+   - Prefer CMOS over EFI_GET_TIME as time source.
 - Switched the xAPIC flat driver to use physical destination mode for 
external
   interrupts instead of logical destination mode.
  
@@ -24,6 +25,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)

 - Support for LLC (Last Level Cache) coloring.
   - On x86:
 - xl suspend/resume subcommands.
+   - `wallclock` command line option to select time source.


What about adding the word 'Add' before `wallclock`?

Otherwise LGTM: Acked-by: Oleksii Kurochko

Thanks.

~ Oleksii

  
  ### Removed

   - On x86:

Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock

2025-01-14 Thread Oleksii Kurochko



On 1/14/25 3:58 PM, Roger Pau Monné wrote:

On Tue, Jan 14, 2025 at 03:23:21PM +0100, Oleksii Kurochko wrote:

On 1/14/25 1:13 PM, Roger Pau Monné wrote:

On Tue, Jan 14, 2025 at 12:44:39PM +0100, Oleksii Kurochko wrote:

On 1/14/25 12:40 PM, Oleksii Kurochko wrote:

On 1/14/25 12:27 PM, Roger Pau Monné wrote:

On Tue, Jan 14, 2025 at 12:12:03PM +0100, Oleksii Kurochko wrote:

On 1/13/25 6:52 PM, Roger Pau Monné wrote:

On Mon, Jan 13, 2025 at 05:07:55PM +0100, Marek Marczykowski-Górecki wrote:

On Fri, Sep 13, 2024 at 09:59:06AM +0200, Roger Pau Monne wrote:

Allow setting the used wallclock from the command line.  When the option is set
to a value different than `auto` the probing is bypassed and the selected
implementation is used (as long as it's available).

The `xen` and `efi` options require being booted as a Xen guest (with Xen guest
supported built-in) or from UEFI firmware respectively.

Signed-off-by: Roger Pau Monné

Reviewed-by: Marek Marczykowski-Górecki

Thanks for the review.

Oleksii, can I get your opinion as Release Manager about whether this
(and the following patch) would be suitable for committing to staging
given the current release state?

It's a workaround for broken EFI implementations that many downstreams
carry on their patch queue.

Based on your commit message, I understand this as addressing a bug ( but not
very critical, as IIUC downstreams have a similar patch on their side ).
Therefore, if it has been properly reviewed and tested, we should consider
including it in the current release.

IIRC at least Qubes, XenServer and XCP-ng have a patch that achieves
the same behavior as proposed here.


IIUC, setting the wallclock to EFI should align with the behavior Xen had 
previously.
It might be preferable to use that argument as the default for a while, allowing others 
to verify the "auto"
value over time. After that, we could consider making "auto" the default.
That said, I am not particularly strict about following this approach.

We cannot really set efi as the default, as it would break when
booting on legacy BIOS systems.

We could take only patch 1 and leave patch 2 until after Xen 4.20 has
branched, but at that point I would see little benefit in having just
patch 1.

I don't see a lot of benefit in committing only the one patch either.



I don't have a strong opinion, but downstreams have been complaining
about Xen behavior regarding the usage of EFI_GET_TIME, so it might be
good to not ship yet another release with such allegedly broken
behavior.

Agree with that. As I mentioned above, I consider it a bug, and since several
of the downstreams mentioned above carry a similar patch, I would consider the
approach tested, so ..

Let me know what you think, as I would need a formal Release-Ack if
this is to be committed.

... R-Acked-by: Oleksii Kurochko.

Sorry for the noise.

I forgot to add that IMO it would also be nice to mention ( that we now have
a wallclock parameter ) in the CHANGELOG, so downstreams will receive a
"notification" that Xen provides a workaround for the issue mentioned in the
patch series.

Indeed.  Would you be OK with adding the following chunk to patch 2?

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 8507e6556a56..1de1d1eca17f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -12,6 +12,7 @@ The format is based on [Keep a 
Changelog](https://keepachangelog.com/en/1.0.0/)
  leaving this to the guest kernel to do in guest context.
- On x86:
  - Prefer ACPI reboot over UEFI ResetSystem() run time service call.
+   - Prefer CMOS over EFI_GET_TIME as time source.
  - Switched the xAPIC flat driver to use physical destination mode for 
external
interrupts instead of logical destination mode.
@@ -24,6 +25,7 @@ The format is based on [Keep a 
Changelog](https://keepachangelog.com/en/1.0.0/)
  - Support for LLC (Last Level Cache) coloring.
- On x86:
  - xl suspend/resume subcommands.
+   - `wallclock` command line option to select time source.

What about adding the word 'Add' before `wallclock`?

It's in the "Added" section, so I thought it would be redundant.


Otherwise LGTM: Acked-by: Oleksii Kurochko

Let me know if you would still like me to prepend "Add" to the above
item.


Oh, then it makes sense. We can go without "Add" then.


~ Oleksii



Thanks, Roger.




Re: [PATCH v2 07/18] x86/pv: update guest LDT mappings using the linear entries

2025-01-14 Thread Jan Beulich
On 08.01.2025 15:26, Roger Pau Monne wrote:
> The pv_map_ldt_shadow_page() and pv_destroy_ldt() functions rely on the L1
> table(s) that contain such mappings being stashed in the domain structure, and
> thus such mappings being modified by merely updating the required L1 entries.
> 
> Switch pv_map_ldt_shadow_page() to unconditionally use the linear recursive
> mappings, as
> that logic is always called while the vCPU is running on the current pCPU.
> 
> For pv_destroy_ldt() use the linear mappings if the vCPU is the one currently
> running on the pCPU, otherwise use destroy_mappings().
> 
> Note this requires keeping an array with the pages currently mapped at the LDT
> area, as that allows dropping the extra taken page reference when removing the
> mappings.

I'm confused by the wording of this paragraph: It reads as if you were
changing reference obtaining / dropping, yet it all looks to stay the
same. If I'm not mistaken you use the array to replace the acquiring of
the MFNs in question from the L1 page table entries. If so, I think it
would be nice if this could be described in a more direct way. Perhaps
first and foremost by replacing "allows" and getting rid of "extra".

Jan



[PATCH for-4.20] automation/gitlab: disable coverage from clang randconfig

2025-01-14 Thread Roger Pau Monne
If randconfig enables coverage support the build times out due to GNU LD
taking too long.  For the time being prevent coverage from being enabled in
clang randconfig job.

Signed-off-by: Roger Pau Monné 
---
Cc: Oleksii Kurochko 
---
I will fix the orphaned section stuff separately, as I'm considering just
removing LLVM coverage support because the llvm coverage format is not
stable, and the code to dump it has already become stale.  However I need
to think about it, and in the short term disabling coverage support from
randconfig is more straightforward.
---
 automation/gitlab-ci/build.yaml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml
index cb84f379b754..bc4a8a5ad20c 100644
--- a/automation/gitlab-ci/build.yaml
+++ b/automation/gitlab-ci/build.yaml
@@ -556,6 +556,8 @@ debian-12-x86_64-clang-randconfig:
   variables:
 CONTAINER: debian:12-x86_64
 RANDCONFIG: y
+EXTRA_FIXED_RANDCONFIG: |
+  CONFIG_COVERAGE=n # Disable coverage otherwise build times out.
 
 debian-12-x86_64-gcc:
   extends: .gcc-x86-64-build
-- 
2.46.0




Re: [PATCH for-4.20] automation/gitlab: disable coverage from clang randconfig

2025-01-14 Thread Andrew Cooper
On 14/01/2025 5:43 pm, Roger Pau Monne wrote:
> If randconfig enables coverage support the build times out due to GNU LD
> taking too long.  For the time being prevent coverage from being enabled in
> clang randconfig job.
>
> Signed-off-by: Roger Pau Monné 

Acked-by: Andrew Cooper 

> ---
> Cc: Oleksii Kurochko 
> ---
> I will fix the orphaned section stuff separately, as I'm considering just
> removing LLVM coverage support because the llvm coverage format is not
> stable, and the code to dump it has already become stale.  However I need
> to think about it, and in the short term disabling coverage support from
> randconfig is more straightforward.

Oh, so it's broken too?  Unless the fix is trivial, we should have a
Kconfig level disable on it.  No point letting people turn on something
that's broken.

~Andrew



Re: [PATCH] riscv: Add initial Xen guest support for RISC-V

2025-01-14 Thread Teddy Astie
On 14/01/2025 17:13, Milan Djokic wrote:
> diff --git a/test.txt b/test.txt
> new file mode 100644
> index ..e54267998982
> --- /dev/null
> +++ b/test.txt
> @@ -0,0 +1,21 @@
> +WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
> +#120:
> +new file mode 100644
> +
> +WARNING: do not add new typedefs
> +#808: FILE: include/xen/riscv/interface.h:15:
> ++typedef struct { union { type * p; uint64_aligned_t q; }; }  \
> +
> +WARNING: please, no spaces at the start of a line
> +#1006: FILE: include/xen/riscv/swiotlb-xen.h:10:
> ++return 0;$
> +
> +total: 0 errors, 3 warnings, 810 lines checked
> +
> +NOTE: For some of the reported defects, checkpatch may be able to
> +  mechanically convert to the typical style using --fix or --fix-inplace.
> +
> +0001-riscv-Add-initial-Xen-guest-support-for-RISC-V.patch has style 
> problems, please review.
> +
> +NOTE: If any of the errors are false positives, please report
> +  them to the maintainer, see CHECKPATCH in MAINTAINERS.

I am not sure you want this here.

 > diff --git a/arch/riscv/include/asm/hypervisor.h b/arch/riscv/include/asm/hypervisor.h
 > new file mode 100644
 > index ..3a117afe57f0
 > --- /dev/null
 > +++ b/arch/riscv/include/asm/hypervisor.h
 > @@ -0,0 +1,9 @@
 > +/* SPDX-License-Identifier: GPL-2.0 */
 > +#ifndef _ASM_RISCV_HYPERVISOR_H
 > +#define _ASM_RISCV_HYPERVISOR_H
 > +
 > +#include 
 > +
 > +void kvm_init_hyp_services(void);
 > +
 > +#endif

kvm_init_hyp_services seems KVM-specific and doesn't seem to exist (yet)
for RISC-V.

Teddy


Teddy Astie | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech




[PATCH v4 06/30] static_call: Add read-only-after-init static calls

2025-01-14 Thread Valentin Schneider
From: Josh Poimboeuf 

Deferring a code patching IPI is unsafe if the patched code is in a
noinstr region.  In that case the text poke code must trigger an
immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ
CPU running in userspace.

If a noinstr static call only needs to be patched during boot, its key
can be made ro-after-init to ensure it will never be patched at runtime.
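
As an illustrative sketch (all names below are invented for this example and
are not part of the patch): a hook selected once during boot and never
repatched could use the new variant like so, keeping it legal to call from
noinstr code:

/* Hypothetical example -- my_idle_hook and its helpers are not real APIs. */
static void default_idle_impl(void)
{
	/* architecture default implementation */
}

static void fancy_idle_impl(void)
{
	/* alternative implementation picked at boot */
}

DEFINE_STATIC_CALL_RO(my_idle_hook, default_idle_impl);

static int __init pick_idle_hook(void)
{
	/* The key is still writable here; __ro_after_init seals it post-init. */
	if (IS_ENABLED(CONFIG_SMP))	/* stand-in for a real capability check */
		static_call_update(my_idle_hook, fancy_idle_impl);
	return 0;
}
early_initcall(pick_idle_hook);

noinstr void my_idle(void)
{
	/* Safe in noinstr code: the key can never be patched at runtime. */
	static_call(my_idle_hook)();
}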

Signed-off-by: Josh Poimboeuf 
---
 include/linux/static_call.h | 16 
 1 file changed, 16 insertions(+)

diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index 78a77a4ae0ea8..ea6ca57e2a829 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -192,6 +192,14 @@ extern long __static_call_return0(void);
};  \
ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
 
+#define DEFINE_STATIC_CALL_RO(name, _func) \
+   DECLARE_STATIC_CALL(name, _func);   \
+   struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\
+   .func = _func,  \
+   .type = 1,  \
+   };  \
+   ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
+
 #define DEFINE_STATIC_CALL_NULL(name, _func)   \
DECLARE_STATIC_CALL(name, _func);   \
struct static_call_key STATIC_CALL_KEY(name) = {\
@@ -200,6 +208,14 @@ extern long __static_call_return0(void);
};  \
ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
 
+#define DEFINE_STATIC_CALL_NULL_RO(name, _func)
\
+   DECLARE_STATIC_CALL(name, _func);   \
+   struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\
+   .func = NULL,   \
+   .type = 1,  \
+   };  \
+   ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
+
 #define DEFINE_STATIC_CALL_RET0(name, _func)   \
DECLARE_STATIC_CALL(name, _func);   \
struct static_call_key STATIC_CALL_KEY(name) = {\
-- 
2.43.0




[PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition

2025-01-14 Thread Valentin Schneider
Context
===

We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:

  64359.052209596   NetworkManager   01405   smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
smp_call_function_many_cond+0x1
smp_call_function+0x39
on_each_cpu+0x2a
flush_tlb_kernel_range+0x7b
__purge_vmap_area_lazy+0x70
_vm_unmap_aliases.part.42+0xdf
change_page_attr_set_clr+0x16a
set_memory_ro+0x26
bpf_int_jit_compile+0x2f9
bpf_prog_select_runtime+0xc6
bpf_prepare_filter+0x523
sk_attach_filter+0x13
sock_setsockopt+0x92c
__sys_setsockopt+0x16a
__x64_sys_setsockopt+0x20
do_syscall_64+0x87
entry_SYSCALL_64_after_hwframe+0x65

The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, providing it can be executed "early
enough" in the entry code.

The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.

Deferral approach
=

Storing each and every callback, like a secondary call_single_queue, turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.

Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.

What about IPIs whose callback takes a parameter, you may ask?

Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.

This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
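
To make the mechanism concrete, here is a rough sketch (identifiers are
invented for illustration and are not the series' actual API) of coalescing
deferred work into a per-CPU mask that is flushed on kernel entry:

#include <linux/atomic.h>
#include <linux/bitops.h>
#include <linux/percpu.h>

/* One bit per deferrable IPI type; duplicate requests coalesce into it. */
enum ct_work_bits {
	CT_WORK_SYNC_CORE,
	CT_WORK_TLB_FLUSH,
	CT_WORK_MAX,
};

static DEFINE_PER_CPU(atomic_t, ct_pending_work);

static void do_sync_core_cb(void)	{ /* e.g. sync_core() */ }
static void do_tlb_flush_cb(void)	{ /* e.g. flush kernel TLB */ }

static void (*const ct_work_fns[CT_WORK_MAX])(void) = {
	[CT_WORK_SYNC_CORE] = do_sync_core_cb,
	[CT_WORK_TLB_FLUSH] = do_tlb_flush_cb,
};

/*
 * Housekeeping CPU: set a pending bit instead of sending an IPI. The real
 * series additionally checks the target's context-tracking state and falls
 * back to a genuine IPI if the CPU is currently executing in the kernel.
 */
static void ct_defer_work(int cpu, enum ct_work_bits bit)
{
	atomic_or(BIT(bit), per_cpu_ptr(&ct_pending_work, cpu));
}

/* Target CPU, early on its next user->kernel transition. */
static void ct_work_flush(void)
{
	unsigned long pending = atomic_xchg(this_cpu_ptr(&ct_pending_work), 0);
	unsigned int bit;

	for_each_set_bit(bit, &pending, CT_WORK_MAX)
		ct_work_fns[bit]();
}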

Kernel entry vs execution of the deferred operation
===

This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].

There is a non-zero length of code that is executed upon kernel entry before the
deferred operation can be itself executed (i.e. before we start getting into
context_tracking.c proper), i.e.:

  idtentry_func_foo()                  <--- we're in the kernel
    irqentry_enter()
      enter_from_user_mode()
        __ct_user_exit()
          ct_kernel_enter_state()
            ct_work_flush()            <--- deferred operation is executed here

This means one must take extra care with what can happen in the early entry
code, and make sure nothing problematic can happen there. For instance, we
really don't want to hit
instructions that have been modified by a remote text_poke() while we're on our
way to execute a deferred sync_core(). Patches doing the actual deferral have
more detail on this.

Patches
===

o Patches 1-2 are standalone objtool cleanups.

o Patches 3-4 add an RCU testing feature.

o Patches 5-6 add infrastructure for annotating static keys and static calls
  that may be used in noinstr code (courtesy of Josh).
o Patches 7-19 use said annotations on relevant keys / calls.
o Patch 20 enforces proper usage of said annotations (courtesy of Josh).

o Patches 21-23 fiddle with CT_STATE* within context tracking

o Patches 24-29 add the actual IPI deferral faff

o Patch 30 adds a freebie: deferring IPIs for NOHZ_IDLE. Not tested that much!
  If you care about battery-powered devices and energy consumption, go give it
  a try!

Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v4

Stuff I'd like eyes and neurons on
==

Context-tracking vs idle. Patch 22 "improves" the situation by adding an
IDLE->KERNEL transition when getting an IRQ while idle, but it leaves the
following window:

  ~> IRQ
  ct_nmi_enter()
state = state + CT_STATE_KERNEL - CT_STATE_IDLE

  [...]

  ct_nmi_exit()
state = state - CT_STATE_KERNEL + CT_STATE_IDLE

  [...] /!\ CT_STATE_IDLE here while we're really in kernelspace! /!\

  ct_cpuidle_exit()
state = state + CT_STATE_KERNEL - CT_STATE_IDLE

Said window is contained within cpu_idle_poll() and the cpuidle call within
cpuidle_enter_state(), both being noinstr (the former is __cpuidle which is
noinstr itself). Thus objtool will consider it as early entry and will warn
accordingly of any static key / call misuse, so the damage is somewhat
contained, but it's not ideal.

I trie

[PATCH v4 12/30] arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init

2025-01-14 Thread Valentin Schneider
The static call is only ever updated in

  __init pv_time_init()
  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider 
---
 arch/arm64/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/paravirt.c b/arch/arm64/kernel/paravirt.c
index aa718d6a9274a..ad28fa23c9228 100644
--- a/arch/arm64/kernel/paravirt.c
+++ b/arch/arm64/kernel/paravirt.c
@@ -32,7 +32,7 @@ static u64 native_steal_clock(int cpu)
return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 
 struct pv_time_stolen_time_region {
struct pvclock_vcpu_stolen_time __rcu *kaddr;
-- 
2.43.0




[PATCH v4 07/30] x86/paravirt: Mark pv_sched_clock static call as __ro_after_init

2025-01-14 Thread Valentin Schneider
Later commits will cause objtool to warn about static calls being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

pv_sched_clock is updated in:
o __init vmware_paravirt_ops_setup()
o __init xen_init_time_common()
o kvm_sched_clock_init() <- __init kvmclock_init()
o hv_setup_sched_clock() <- __init hv_init_tsc_clocksource()

IOW purely init context, and can thus be marked as __ro_after_init.

Reported-by: Josh Poimboeuf 
Signed-off-by: Valentin Schneider 
---
 arch/x86/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index fec3815335558..ae6675167877b 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -73,7 +73,7 @@ static u64 native_steal_clock(int cpu)
 }
 
 DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
-DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
+DEFINE_STATIC_CALL_RO(pv_sched_clock, native_sched_clock);
 
 void paravirt_set_sched_clock(u64 (*func)(void))
 {
-- 
2.43.0




[PATCH v4 01/30] objtool: Make validate_call() recognize indirect calls to pv_ops[]

2025-01-14 Thread Valentin Schneider
call_dest_name() does not get passed the file pointer of validate_call(),
which means its invocation of insn_reloc() will always return NULL. Make it
take a file pointer.

While at it, make sure call_dest_name() uses arch_dest_reloc_offset(),
otherwise it gets the pv_ops[] offset wrong.

Fabricating an intentional warning shows the change; previously:

  vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() 
leaves .noinstr.text section

now:

  vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[1]() 
leaves .noinstr.text section

Signed-off-by: Valentin Schneider 
Acked-by: Josh Poimboeuf 
---
 tools/objtool/check.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 76060da755b5c..b35763f05a548 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3448,7 +3448,7 @@ static inline bool func_uaccess_safe(struct symbol *func)
return false;
 }
 
-static inline const char *call_dest_name(struct instruction *insn)
+static inline const char *call_dest_name(struct objtool_file *file, struct 
instruction *insn)
 {
static char pvname[19];
struct reloc *reloc;
@@ -3457,9 +3457,9 @@ static inline const char *call_dest_name(struct 
instruction *insn)
if (insn_call_dest(insn))
return insn_call_dest(insn)->name;
 
-   reloc = insn_reloc(NULL, insn);
+   reloc = insn_reloc(file, insn);
if (reloc && !strcmp(reloc->sym->name, "pv_ops")) {
-   idx = (reloc_addend(reloc) / sizeof(void *));
+   idx = (arch_dest_reloc_offset(reloc_addend(reloc)) / 
sizeof(void *));
snprintf(pvname, sizeof(pvname), "pv_ops[%d]", idx);
return pvname;
}
@@ -3538,17 +3538,19 @@ static int validate_call(struct objtool_file *file,
 {
if (state->noinstr && state->instr <= 0 &&
!noinstr_call_dest(file, insn, insn_call_dest(insn))) {
-   WARN_INSN(insn, "call to %s() leaves .noinstr.text section", 
call_dest_name(insn));
+   WARN_INSN(insn, "call to %s() leaves .noinstr.text section", 
call_dest_name(file, insn));
return 1;
}
 
if (state->uaccess && !func_uaccess_safe(insn_call_dest(insn))) {
-   WARN_INSN(insn, "call to %s() with UACCESS enabled", 
call_dest_name(insn));
+   WARN_INSN(insn, "call to %s() with UACCESS enabled",
+ call_dest_name(file, insn));
return 1;
}
 
if (state->df) {
-   WARN_INSN(insn, "call to %s() with DF set", 
call_dest_name(insn));
+   WARN_INSN(insn, "call to %s() with DF set",
+ call_dest_name(file, insn));
return 1;
}
 
-- 
2.43.0




[PATCH v4 03/30] rcu: Add a small-width RCU watching counter debug option

2025-01-14 Thread Valentin Schneider
A later commit will reduce the size of the RCU watching counter to free up
some bits for another purpose. Paul suggested adding a config option to
test the extreme case where the counter is reduced to its minimum usable
width for rcutorture to poke at, so do that.

Make it only configurable under RCU_EXPERT. While at it, add a comment to
explain the layout of context_tracking->state.

Link: 
http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney 
Signed-off-by: Valentin Schneider 
Reviewed-by: Paul E. McKenney 
---
 include/linux/context_tracking_state.h | 44 ++
 kernel/rcu/Kconfig.debug   | 15 +
 2 files changed, 52 insertions(+), 7 deletions(-)

diff --git a/include/linux/context_tracking_state.h 
b/include/linux/context_tracking_state.h
index 7b8433d5a8efe..0b81248aa03e2 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -18,12 +18,6 @@ enum ctx_state {
CT_STATE_MAX= 4,
 };
 
-/* Odd value for watching, else even. */
-#define CT_RCU_WATCHING CT_STATE_MAX
-
-#define CT_STATE_MASK (CT_STATE_MAX - 1)
-#define CT_RCU_WATCHING_MASK (~CT_STATE_MASK)
-
 struct context_tracking {
 #ifdef CONFIG_CONTEXT_TRACKING_USER
/*
@@ -44,9 +38,45 @@ struct context_tracking {
 #endif
 };
 
+/*
+ * We cram two different things within the same atomic variable:
+ *
+ * CT_RCU_WATCHING_START  CT_STATE_START
+ *||
+ *vv
+ * MSB [ RCU watching counter ][ context_state ] LSB
+ * ^   ^
+ * |   |
+ * CT_RCU_WATCHING_ENDCT_STATE_END
+ *
+ * Bits are used from the LSB upwards, so unused bits (if any) will always be 
in
+ * upper bits of the variable.
+ */
 #ifdef CONFIG_CONTEXT_TRACKING
+#define CT_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
+
+#define CT_STATE_WIDTH bits_per(CT_STATE_MAX - 1)
+#define CT_STATE_START 0
+#define CT_STATE_END   (CT_STATE_START + CT_STATE_WIDTH - 1)
+
+#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH)
+#define CT_RCU_WATCHING_WIDTH (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 
: CT_RCU_WATCHING_MAX_WIDTH)
+#define CT_RCU_WATCHING_START (CT_STATE_END + 1)
+#define CT_RCU_WATCHING_END   (CT_RCU_WATCHING_START + 
CT_RCU_WATCHING_WIDTH - 1)
+#define CT_RCU_WATCHING   BIT(CT_RCU_WATCHING_START)
+
+#define CT_STATE_MASKGENMASK(CT_STATE_END,CT_STATE_START)
+#define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, 
CT_RCU_WATCHING_START)
+
+#define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH)
+
+static_assert(CT_STATE_WIDTH+
+ CT_RCU_WATCHING_WIDTH +
+ CT_UNUSED_WIDTH   ==
+ CT_SIZE);
+
 DECLARE_PER_CPU(struct context_tracking, context_tracking);
-#endif
+#endif /* CONFIG_CONTEXT_TRACKING */
 
 #ifdef CONFIG_CONTEXT_TRACKING_USER
 static __always_inline int __ct_state(void)
diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index 9b0b52e1836fa..ea36953803a1e 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -168,4 +168,19 @@ config RCU_STRICT_GRACE_PERIOD
  when looking for certain types of RCU usage bugs, for example,
  too-short RCU read-side critical sections.
 
+
+config RCU_DYNTICKS_TORTURE
+   bool "Minimize RCU dynticks counter size"
+   depends on RCU_EXPERT && !COMPILE_TEST
+   default n
+   help
+ This option sets the width of the dynticks counter to its
+ minimum usable value.  This minimum width greatly increases
+ the probability of flushing out bugs involving counter wrap,
+ but it also increases the probability of extending grace period
+ durations.  This Kconfig option should therefore be avoided in
+ production due to the consequent increased probability of OOMs.
+
+ This has no value for production and is only for testing.
+
 endmenu # "RCU Debugging"
-- 
2.43.0




[PATCH v4 09/30] x86/paravirt: Mark pv_steal_clock static call as __ro_after_init

2025-01-14 Thread Valentin Schneider
The static call is only ever updated in

  __init pv_time_init()
  __init xen_init_time_common()
  __init vmware_paravirt_ops_setup()
  __init xen_time_setup_guest(

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider 
---
 arch/x86/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ae6675167877b..a19b0002b18d8 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -72,7 +72,7 @@ static u64 native_steal_clock(int cpu)
return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 DEFINE_STATIC_CALL_RO(pv_sched_clock, native_sched_clock);
 
 void paravirt_set_sched_clock(u64 (*func)(void))
-- 
2.43.0




[PATCH v4 14/30] perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init

2025-01-14 Thread Valentin Schneider
Later commits will cause objtool to warn about static calls being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

perf_lopwr_cb is used in .noinstr code, but is only ever updated in __init
amd_brs_lopwr_init(), and can thus be marked as __ro_after_init.

Reported-by: Josh Poimboeuf 
Signed-off-by: Valentin Schneider 
---
 arch/x86/events/amd/brs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c
index 780acd3dff22a..e2ff03af15d82 100644
--- a/arch/x86/events/amd/brs.c
+++ b/arch/x86/events/amd/brs.c
@@ -422,7 +422,7 @@ void noinstr perf_amd_brs_lopwr_cb(bool lopwr_in)
}
 }
 
-DEFINE_STATIC_CALL_NULL(perf_lopwr_cb, perf_amd_brs_lopwr_cb);
+DEFINE_STATIC_CALL_NULL_RO(perf_lopwr_cb, perf_amd_brs_lopwr_cb);
 EXPORT_STATIC_CALL_TRAMP_GPL(perf_lopwr_cb);
 
 void __init amd_brs_lopwr_init(void)
-- 
2.43.0




[PATCH v4 05/30] jump_label: Add annotations for validating noinstr usage

2025-01-14 Thread Valentin Schneider
From: Josh Poimboeuf 

Deferring a code patching IPI is unsafe if the patched code is in a
noinstr region.  In that case the text poke code must trigger an
immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ
CPU running in userspace.

Some noinstr static branches may really need to be patched at runtime,
despite the resulting disruption.  Add DEFINE_STATIC_KEY_*_NOINSTR()
variants for those.  They don't do anything special yet; that will come
later.
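
For illustration, a hypothetical key that genuinely needs runtime patching
despite being read from noinstr code might look as follows (all names are
invented for this sketch):

/*
 * Runtime-writable (e.g. via a sysctl), yet read from noinstr entry code;
 * the _NOINSTR variant tells objtool this usage is deliberate.
 */
static DEFINE_STATIC_KEY_FALSE_NOINSTR(my_mitigation_enabled);

static __always_inline void my_clear_cpu_buffers(void)
{
	/* stand-in for the actual mitigation action */
}

noinstr void my_entry_mitigation(void)
{
	if (static_branch_unlikely(&my_mitigation_enabled))
		my_clear_cpu_buffers();
}

/* Runtime toggle, expected to happen outside latency-sensitive windows. */
static void my_mitigation_set(bool on)
{
	if (on)
		static_branch_enable(&my_mitigation_enabled);
	else
		static_branch_disable(&my_mitigation_enabled);
}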

Signed-off-by: Josh Poimboeuf 
---
 include/linux/jump_label.h | 17 +
 1 file changed, 17 insertions(+)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index f5a2727ca4a9a..88bb6e32fdcbc 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -385,6 +385,23 @@ struct static_key_false {
 #define DEFINE_STATIC_KEY_FALSE_RO(name)   \
struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT
 
+/*
+ * The _NOINSTR variants are used to tell objtool the static key is allowed to
+ * be used in noinstr code.
+ *
+ * They should almost never be used, as they prevent code patching IPIs from
+ * being deferred, which can be problematic for isolated NOHZ_FULL CPUs running
+ * in pure userspace.
+ *
+ * If using one of these _NOINSTR variants, please add a comment above the
+ * definition with the rationale.
+ */
+#define DEFINE_STATIC_KEY_TRUE_NOINSTR(name)   
\
+   DEFINE_STATIC_KEY_TRUE(name)
+
+#define DEFINE_STATIC_KEY_FALSE_NOINSTR(name)  
\
+   DEFINE_STATIC_KEY_FALSE(name)
+
 #define DECLARE_STATIC_KEY_FALSE(name) \
extern struct static_key_false name
 
-- 
2.43.0




[PATCH v4 11/30] loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init

2025-01-14 Thread Valentin Schneider
The static call is only ever updated in

  __init pv_time_init()
  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider 
---
 arch/loongarch/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c
index e5a39bbad0780..b011578d3e931 100644
--- a/arch/loongarch/kernel/paravirt.c
+++ b/arch/loongarch/kernel/paravirt.c
@@ -20,7 +20,7 @@ static u64 native_steal_clock(int cpu)
return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 
 static bool steal_acc = true;
 
-- 
2.43.0




[PATCH v4 02/30] objtool: Flesh out warning related to pv_ops[] calls

2025-01-14 Thread Valentin Schneider
I had to look into objtool itself to understand what this warning was
about; make it more explicit.

Signed-off-by: Valentin Schneider 
Acked-by: Josh Poimboeuf 
---
 tools/objtool/check.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index b35763f05a548..6aa9259fc9940 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3486,7 +3486,7 @@ static bool pv_call_dest(struct objtool_file *file, 
struct instruction *insn)
 
list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
if (!target->sec->noinstr) {
-   WARN("pv_ops[%d]: %s", idx, target->name);
+   WARN("pv_ops[%d]: indirect call to %s() leaves 
.noinstr.text section", idx, target->name);
file->pv_ops[idx].clean = false;
}
}
-- 
2.43.0




[PATCH v4 10/30] riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init

2025-01-14 Thread Valentin Schneider
The static call is only ever updated in:

  __init pv_time_init()
  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider 
---
 arch/riscv/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/riscv/kernel/paravirt.c b/arch/riscv/kernel/paravirt.c
index fa6b0339a65de..dfe8808016fd8 100644
--- a/arch/riscv/kernel/paravirt.c
+++ b/arch/riscv/kernel/paravirt.c
@@ -30,7 +30,7 @@ static u64 native_steal_clock(int cpu)
return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 
 static bool steal_acc = true;
 static int __init parse_no_stealacc(char *arg)
-- 
2.43.0




[PATCH v4 08/30] x86/idle: Mark x86_idle static call as __ro_after_init

2025-01-14 Thread Valentin Schneider
Later commits will cause objtool to warn about static calls being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

x86_idle is updated in:
o xen_set_default_idle() <- __init xen_arch_setup()
o __init select_idle_routine()

IOW purely init context, and can thus be marked as __ro_after_init.

Reported-by: Josh Poimboeuf 
Signed-off-by: Valentin Schneider 
---
 arch/x86/kernel/process.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index f63f8fd00a91f..85cd62f61d633 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -746,7 +746,7 @@ void __cpuidle default_idle(void)
 EXPORT_SYMBOL(default_idle);
 #endif
 
-DEFINE_STATIC_CALL_NULL(x86_idle, default_idle);
+DEFINE_STATIC_CALL_NULL_RO(x86_idle, default_idle);
 
 static bool x86_idle_set(void)
 {
-- 
2.43.0




[PATCH v4 04/30] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE

2025-01-14 Thread Valentin Schneider
We now have an RCU_EXPERT config for testing small-sized RCU dynticks
counter:  CONFIG_RCU_DYNTICKS_TORTURE.

Modify scenario TREE04 to exercise to use this config in order to test a
ridiculously small counter (2 bits).

Link: 
http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney 
Signed-off-by: Valentin Schneider 
Reviewed-by: Paul E. McKenney 
---
 tools/testing/selftests/rcutorture/configs/rcu/TREE04 | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 
b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index dc4985064b3ad..67caf4276bb01 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
 CONFIG_RCU_EXPERT=y
 CONFIG_RCU_EQS_DEBUG=y
 CONFIG_RCU_LAZY=y
+CONFIG_RCU_DYNTICKS_TORTURE=y
-- 
2.43.0




[PATCH v4 20/30] objtool: Add noinstr validation for static branches/calls

2025-01-14 Thread Valentin Schneider
From: Josh Poimboeuf 

Warn about static branches/calls in noinstr regions, unless the
corresponding key is RO-after-init or has been manually whitelisted with
DEFINE_STATIC_KEY_*_NOINSTR().

Signed-off-by: Josh Poimboeuf 
[Added NULL check for insn_call_dest() return value]
Signed-off-by: Valentin Schneider 
---
 include/linux/jump_label.h  | 17 +++--
 include/linux/objtool.h |  7 ++
 include/linux/static_call.h |  3 +
 tools/objtool/Documentation/objtool.txt | 34 +
 tools/objtool/check.c   | 92 ++---
 tools/objtool/include/objtool/check.h   |  1 +
 tools/objtool/include/objtool/elf.h |  1 +
 tools/objtool/include/objtool/special.h |  1 +
 tools/objtool/special.c | 18 -
 9 files changed, 156 insertions(+), 18 deletions(-)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 88bb6e32fdcbc..dc8a82a62c109 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -75,6 +75,7 @@
 
 #include 
 #include 
+#include 
 
 extern bool static_key_initialized;
 
@@ -373,8 +374,9 @@ struct static_key_false {
 #define DEFINE_STATIC_KEY_TRUE(name)   \
struct static_key_true name = STATIC_KEY_TRUE_INIT
 
-#define DEFINE_STATIC_KEY_TRUE_RO(name)\
-   struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT
+#define DEFINE_STATIC_KEY_TRUE_RO(name)
\
+   struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT; 
\
+   ANNOTATE_NOINSTR_ALLOWED(name)
 
 #define DECLARE_STATIC_KEY_TRUE(name)  \
extern struct static_key_true name
@@ -382,8 +384,9 @@ struct static_key_false {
 #define DEFINE_STATIC_KEY_FALSE(name)  \
struct static_key_false name = STATIC_KEY_FALSE_INIT
 
-#define DEFINE_STATIC_KEY_FALSE_RO(name)   \
-   struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT
+#define DEFINE_STATIC_KEY_FALSE_RO(name)   
\
+   struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT;   
\
+   ANNOTATE_NOINSTR_ALLOWED(name)
 
 /*
  * The _NOINSTR variants are used to tell objtool the static key is allowed to
@@ -397,10 +400,12 @@ struct static_key_false {
  * definition with the rationale.
  */
 #define DEFINE_STATIC_KEY_TRUE_NOINSTR(name)   
\
-   DEFINE_STATIC_KEY_TRUE(name)
+   DEFINE_STATIC_KEY_TRUE(name);   
\
+   ANNOTATE_NOINSTR_ALLOWED(name)
 
 #define DEFINE_STATIC_KEY_FALSE_NOINSTR(name)  
\
-   DEFINE_STATIC_KEY_FALSE(name)
+   DEFINE_STATIC_KEY_FALSE(name);  
\
+   ANNOTATE_NOINSTR_ALLOWED(name)
 
 #define DECLARE_STATIC_KEY_FALSE(name) \
extern struct static_key_false name
diff --git a/include/linux/objtool.h b/include/linux/objtool.h
index b3b8d3dab52d5..1a7389f273063 100644
--- a/include/linux/objtool.h
+++ b/include/linux/objtool.h
@@ -34,6 +34,12 @@
static void __used __section(".discard.func_stack_frame_non_standard") \
*__func_stack_frame_non_standard_##func = func
 
+#define __ANNOTATE_NOINSTR_ALLOWED(key) \
+   static void __used __section(".discard.noinstr_allowed") \
+   *__annotate_noinstr_allowed_##key = &key
+
+#define ANNOTATE_NOINSTR_ALLOWED(key) __ANNOTATE_NOINSTR_ALLOWED(key)
+
 /*
  * STACK_FRAME_NON_STANDARD_FP() is a frame-pointer-specific function ignore
  * for the case where a function is intentionally missing frame pointer setup,
@@ -157,6 +163,7 @@
 #define STACK_FRAME_NON_STANDARD_FP(func)
 #define ANNOTATE_NOENDBR
 #define ASM_REACHABLE
+#define ANNOTATE_NOINSTR_ALLOWED(key)
 #else
 #define ANNOTATE_INTRA_FUNCTION_CALL
 .macro UNWIND_HINT type:req sp_reg=0 sp_offset=0 signal=0
diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index ea6ca57e2a829..0d4b16d348501 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -133,6 +133,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 #ifdef CONFIG_HAVE_STATIC_CALL
@@ -198,6 +199,7 @@ extern long __static_call_return0(void);
.func = _func,  \
.type = 1,  \
};  \
+   ANNOTATE_NOINSTR_ALLOWED(STATIC_CALL_TRAMP(name));  \
ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
 
 #define DEFINE_STATIC_CALL_NULL(name, _func)   \
@@ -214,6 +216,7 @@ extern long __static_call_return0(void);
.func = NULL,   \
.type = 1,  \
};  \
+   ANNOTATE_NOINSTR_ALLOWED(STATIC_CALL_TRAMP(name));  \

[PATCH v4 19/30] stackleak: Mark stack_erasing_bypass key as allowed in .noinstr

2025-01-14 Thread Valentin Schneider
Later commits will cause objtool to warn about static keys being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

stack_erasing_bypass is used in .noinstr code, and can be modified at runtime
(/proc/sys/kernel/stack_erasing write). However it is not expected that it
will be flipped during latency-sensitive operations, and thus shouldn't be
a source of interference wrt the text patching IPI.

Mark it to let objtool know not to warn about it.

Reported-by: Josh Poimboeuf 
Signed-off-by: Valentin Schneider 
---
 kernel/stackleak.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/stackleak.c b/kernel/stackleak.c
index 39fd620a7db6f..a4f07fbc13f61 100644
--- a/kernel/stackleak.c
+++ b/kernel/stackleak.c
@@ -18,7 +18,11 @@
 #include 
 #include 
 
-static DEFINE_STATIC_KEY_FALSE(stack_erasing_bypass);
+/*
+ * This static key can only be modified via its sysctl interface. It is
+ * expected it will remain stable during latency-sensitive operations.
+ */
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(stack_erasing_bypass);
 
 #ifdef CONFIG_SYSCTL
 static int stack_erasing_sysctl(const struct ctl_table *table, int write,
-- 
2.43.0




[PATCH v4 27/30] x86/tlb: Make __flush_tlb_local() noinstr-compliant

2025-01-14 Thread Valentin Schneider
Later patches will require issuing a __flush_tlb_all() from noinstr code.
This requires making both __flush_tlb_local() and __flush_tlb_global()
noinstr-compliant.

For __flush_tlb_local(), xen_flush_tlb() has already been made noinstr, so
it's just native_flush_tlb_local(), and simply __always_inline'ing
invalidate_user_asid() gets us there.

Signed-off-by: Valentin Schneider 
---
 arch/x86/include/asm/paravirt.h | 2 +-
 arch/x86/mm/tlb.c   | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index b3daee3d46677..0c0dd186c03e6 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -70,7 +70,7 @@ void native_flush_tlb_one_user(unsigned long addr);
 void native_flush_tlb_multi(const struct cpumask *cpumask,
 const struct flush_tlb_info *info);
 
-static inline void __flush_tlb_local(void)
+static __always_inline void __flush_tlb_local(void)
 {
PVOP_VCALL0(mmu.flush_tlb_user);
 }
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2d2ab3e221f0c..18b40bbc2fa15 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -257,7 +257,7 @@ static void choose_new_asid(struct mm_struct *next, u64 
next_tlb_gen,
  *
  * See SWITCH_TO_USER_CR3.
  */
-static inline void invalidate_user_asid(u16 asid)
+static __always_inline void invalidate_user_asid(u16 asid)
 {
/* There is no user ASID if address space separation is off */
if (!IS_ENABLED(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION))
@@ -1206,7 +1206,7 @@ STATIC_NOPV noinstr void native_flush_tlb_global(void)
 /*
  * Flush the entire current user mapping
  */
-STATIC_NOPV void native_flush_tlb_local(void)
+STATIC_NOPV noinstr void native_flush_tlb_local(void)
 {
/*
 * Preemption or interrupts must be disabled to protect the access
-- 
2.43.0




[PATCH v4 18/30] x86/kvm/vmx: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys as allowed in .noinstr

2025-01-14 Thread Valentin Schneider
Later commits will cause objtool to warn about static keys being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

These keys are used in .noinstr code, and can be modified at runtime
(/proc/kernel/vmx* write). However it is not expected that they will be
flipped during latency-sensitive operations, and thus shouldn't be a source
of interference wrt the text patching IPI.

Mark them to let objtool know not to warn about them.

Reported-by: Josh Poimboeuf 
Signed-off-by: Valentin Schneider 
---
 arch/x86/kvm/vmx/vmx.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 893366e537322..a028c38f44e02 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -225,8 +225,15 @@ module_param(pt_mode, int, S_IRUGO);
 
 struct x86_pmu_lbr __ro_after_init vmx_lbr_caps;
 
-static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
-static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond);
+/*
+ * Both of these static keys end up being used in .noinstr sections, however
+ * they are only modified:
+ * - at init
+ * - from a /proc/kernel/vmx* write
+ * thus during latency-sensitive operations they should remain stable.
+ */
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_should_flush);
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_flush_cond);
 static DEFINE_MUTEX(vmx_l1d_flush_mutex);
 
 /* Storage for pre module init parameter parsing */
-- 
2.43.0




[PATCH v4 16/30] x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr

2025-01-14 Thread Valentin Schneider
Later commits will cause objtool to warn about static keys being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

mds_idle_clear is used in .noinstr code, and can be modified at
runtime (SMT hotplug). Suppressing the text_poke_sync() IPI has little
benefits for this key, as hotplug implies eventually going through
takedown_cpu() -> stop_machine_cpuslocked() which is going to cause
interference on all online CPUs anyway.

Mark it to let objtool know not to warn about it.

Signed-off-by: Valentin Schneider 
---
 arch/x86/kernel/cpu/bugs.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 47a01d4028f60..acad84dcfc3cd 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -113,8 +113,13 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
 /* Control unconditional IBPB in switch_mm() */
 DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
 
-/* Control MDS CPU buffer clear before idling (halt, mwait) */
-DEFINE_STATIC_KEY_FALSE(mds_idle_clear);
+/*
+ * Control MDS CPU buffer clear before idling (halt, mwait)
+ *
+ * Allowed in .noinstr as this key is updated during hotplug which comes with
+ * more interference than just the text patching IPI.
+ */
+DEFINE_STATIC_KEY_FALSE_NOINSTR(mds_idle_clear);
 EXPORT_SYMBOL_GPL(mds_idle_clear);
 
 /*
-- 
2.43.0




[PATCH v4 17/30] sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr

2025-01-14 Thread Valentin Schneider
Later commits will cause objtool to warn about static keys being used in
.noinstr sections in order to safely defer instruction patching IPIs
targeted at NOHZ_FULL CPUs.

__sched_clock_stable is used in .noinstr code, and can be modified at
runtime (e.g. time_cpufreq_notifier()). Suppressing the text_poke_sync()
IPI has little benefits for this key, as NOHZ_FULL is incompatible with an
unstable TSC anyway.

Mark it to let objtool know not to warn about it.

Signed-off-by: Valentin Schneider 
---
 kernel/sched/clock.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 200e5568b9894..e59986bc14a43 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -75,8 +75,11 @@ static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);
  *
  * Similarly we start with __sched_clock_stable_early, thereby assuming we
  * will become stable, such that there's only a single 1 -> 0 transition.
+ *
+ * Allowed in .noinstr as an unstable TSC is incompatible with NOHZ_FULL,
+ * thus the text patching IPI would be the least of our concerns.
  */
-static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(__sched_clock_stable);
 static int __sched_clock_stable_early = 1;
 
 /*
-- 
2.43.0




[PATCH v4 21/30] context_tracking: Explicitly use CT_STATE_KERNEL where it is missing

2025-01-14 Thread Valentin Schneider
CT_STATE_KERNEL being zero means it can be (and is) omitted in a handful of
places. A later patch will change CT_STATE_KERNEL into a non-zero value,
so prepare for that by using it where it should be:

o In the initial CT state
o At kernel entry / exit

No change in functionality intended.
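
A worked sketch of the delta form (using the current enum ctx_state values,
where CT_STATE_KERNEL is still 0 but becomes non-zero in a later patch):

/*
 * Entering userspace must move the low "context_state" bits from
 * CT_STATE_KERNEL to CT_STATE_USER.  Today this is done as:
 *
 *	raw_atomic_add(CT_STATE_USER, &ct->state);
 *
 * which is only correct because CT_STATE_KERNEL == 0.  Written as an
 * explicit delta, it stays correct for any CT_STATE_KERNEL value:
 *
 *	raw_atomic_add(CT_STATE_USER - CT_STATE_KERNEL, &ct->state);
 */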

Signed-off-by: Valentin Schneider 
---
 kernel/context_tracking.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 938c48952d265..a61498a8425e2 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -31,7 +31,7 @@ DEFINE_PER_CPU(struct context_tracking, context_tracking) = {
.nesting = 1,
.nmi_nesting = CT_NESTING_IRQ_NONIDLE,
 #endif
-   .state = ATOMIC_INIT(CT_RCU_WATCHING),
+   .state = ATOMIC_INIT(CT_RCU_WATCHING | CT_STATE_KERNEL),
 };
 EXPORT_SYMBOL_GPL(context_tracking);
 
@@ -147,7 +147,7 @@ static void noinstr ct_kernel_exit(bool user, int offset)
instrumentation_end();
WRITE_ONCE(ct->nesting, 0); /* Avoid irq-access tearing. */
// RCU is watching here ...
-   ct_kernel_exit_state(offset);
+   ct_kernel_exit_state(offset - CT_STATE_KERNEL);
// ... but is no longer watching here.
rcu_task_exit();
 }
@@ -175,7 +175,7 @@ static void noinstr ct_kernel_enter(bool user, int offset)
}
rcu_task_enter();
// RCU is not watching here ...
-   ct_kernel_enter_state(offset);
+   ct_kernel_enter_state(offset + CT_STATE_KERNEL);
// ... but is watching here.
instrumentation_begin();
 
@@ -537,7 +537,7 @@ void noinstr __ct_user_enter(enum ctx_state state)
 * RCU only requires CT_RCU_WATCHING increments 
to be fully
 * ordered.
 */
-   raw_atomic_add(state, &ct->state);
+   raw_atomic_add(state - CT_STATE_KERNEL, 
&ct->state);
}
}
}
@@ -647,7 +647,7 @@ void noinstr __ct_user_exit(enum ctx_state state)
 * RCU only requires CT_RCU_WATCHING increments 
to be fully
 * ordered.
 */
-   raw_atomic_sub(state, &ct->state);
+   raw_atomic_sub(state - CT_STATE_KERNEL, 
&ct->state);
}
}
}
-- 
2.43.0




[PATCH v4 15/30] sched/clock: Mark sched_clock_running key as __ro_after_init

2025-01-14 Thread Valentin Schneider
sched_clock_running is only ever enabled in the __init functions
sched_clock_init() and sched_clock_init_late(), and is never disabled. Mark
it __ro_after_init.

Signed-off-by: Valentin Schneider 
---
 kernel/sched/clock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index a09655b481402..200e5568b9894 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -66,7 +66,7 @@ notrace unsigned long long __weak sched_clock(void)
 }
 EXPORT_SYMBOL_GPL(sched_clock);
 
-static DEFINE_STATIC_KEY_FALSE(sched_clock_running);
+static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);
 
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 /*
-- 
2.43.0




[PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs

2025-01-14 Thread Valentin Schneider
vunmap()'s issued from housekeeping CPUs are a relatively common source of
interference for isolated NOHZ_FULL CPUs, as they are hit by the
flush_tlb_kernel_range() IPIs.

Given that CPUs executing in userspace do not access data in the vmalloc
range, these IPIs could be deferred until their next kernel entry.

Deferral vs early entry danger zone
===================================

This requires a guarantee that nothing in the vmalloc range can be vunmap'd
and then accessed in early entry code.

Vmalloc uses are, as reported by vmallocinfo:

  $ cat /proc/vmallocinfo | awk '{ print $3 }' | sort | uniq
  __pci_enable_msix_range+0x32b/0x560
  acpi_os_map_iomem+0x22d/0x250
  bpf_prog_alloc_no_stats+0x34/0x140
  fork_idle+0x79/0x120
  gen_pool_add_owner+0x3e/0xb0  ?
  hpet_enable+0xbf/0x470
  irq_init_percpu_irqstack+0x129/0x160
  kernel_clone+0xab/0x3b0
  memremap+0x164/0x330
  n_tty_open+0x18/0xa0
  pcpu_create_chunk+0x4e/0x1b0
  pcpu_create_chunk+0x75/0x1b0
  pcpu_get_vm_areas+0x0/0x1100
  unpurged
  vp_modern_map_capability+0x140/0x270
  zisofs_init+0x16/0x30

I've categorized these as:

a) Device or percpu mappings

   For these to be unmapped, the device (or CPU) has to be removed and an
   eventual IRQ freed. Even if the IRQ isn't freed, context tracking entry
   happens before handling the IRQ itself, per irqentry_enter() usage.

   __pci_enable_msix_range()
   acpi_os_map_iomem()
   irq_init_percpu_irqstack() (not even unmapped when CPU is hot-unplugged!)
   memremap()
   n_tty_open()
   pcpu_create_chunk()
   pcpu_get_vm_areas()
   vp_modern_map_capability()

b) CONFIG_VMAP_STACK

  fork_idle() & kernel_clone()

  vmalloc'd kernel stacks are AFAICT a safe example, as a task running in
  userspace needs to enter kernelspace to execute do_exit() before its
  stack can be vfree'd.

c) Non-issues

  bpf_prog_alloc_no_stats() - early entry is noinstr, no BPF!
  hpet_enable() - hpet_clear_mapping() is only called if __init function
  fails, no runtime worries
  zisofs_init () - used for zisofs block decompression, that's way past
   context tracking entry

d) I'm not sure, have a look?

  gen_pool_add_owner() - AIUI this is mainly for PCI / DMA stuff, which
 again I wouldn't expect to be accessed before
 context tracking entry.

Changes
=======

Blindly deferring any and all flush of the kernel mappings is a risky move,
so introduce a variant of flush_tlb_kernel_range() that explicitly allows
deferral. Use it for vunmap flushes.

Note that while flush_tlb_kernel_range() may end up issuing a full
flush (including user mappings), this only happens when reaching an
invalidation range threshold where it is cheaper to do a full flush than to
individually invalidate each page in the range via INVLPG. IOW, it doesn't
*require* invalidating user mappings, and thus remains safe to defer until
a later kernel entry.
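
As a hedged sketch of the intended use (function names per the hunks below;
the callsite itself is illustrative, not from this patch):

    /* Sketch: a vunmap-path flush opts in to deferral. CPUs that are out
     * of kernel context pick it up as CT_WORK_TLBI (a full flush via
     * __flush_tlb_all()) on their next kernel entry. */
    static void flush_after_vunmap(unsigned long start, unsigned long end)
    {
            flush_tlb_kernel_range_deferrable(start, end);
    }

All other callers keep the fully synchronous flush_tlb_kernel_range().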

Signed-off-by: Valentin Schneider 
---
 arch/x86/include/asm/context_tracking_work.h |  4 +++
 arch/x86/include/asm/tlbflush.h  |  1 +
 arch/x86/mm/tlb.c| 23 +++--
 include/linux/context_tracking_work.h|  4 ++-
 mm/vmalloc.c | 35 +---
 5 files changed, 58 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 485b32881fde5..da3d270289836 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
 #include 
+#include 
 
 static __always_inline void arch_context_tracking_work(enum ct_work work)
 {
@@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(enum ct_work work)
case CT_WORK_SYNC:
sync_core();
break;
+   case CT_WORK_TLBI:
+   __flush_tlb_all();
+   break;
case CT_WORK_MAX:
WARN_ON_ONCE(true);
}
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4d11396250999..6e690acc35e53 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -248,6 +248,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned int stride_shift,
bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 119765772ab11..47fb437acf52a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1042,6 +1043,11 @@

[PATCH v4 30/30] context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE

2025-01-14 Thread Valentin Schneider
With NO_HZ_IDLE, we get CONTEXT_TRACKING_IDLE, so we get these
transitions:

  ct_idle_enter()
ct_kernel_exit()
  ct_state_inc_clear_work()

  ct_idle_exit()
ct_kernel_enter()
  ct_work_flush()

With just CONTEXT_TRACKING_IDLE, ct_state_inc_clear_work() is just
ct_state_inc() and ct_work_flush() is a no-op. However, making them
functional as if under CONTEXT_TRACKING_WORK would allow NO_HZ_IDLE to
leverage IPI deferral to keep idle CPUs idle longer.

Having this enabled for NO_HZ_IDLE is a different argument than for having
it for NO_HZ_FULL (power savings vs latency/performance), but the backing
mechanism is identical.

Add a default-no option to enable IPI deferral with NO_HZ_IDLE.

Signed-off-by: Valentin Schneider 
---
 kernel/time/Kconfig | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 7e8106a0d981f..c7398fe5382a0 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -183,9 +183,23 @@ config CONTEXT_TRACKING_USER_FORCE
 
 config CONTEXT_TRACKING_WORK
bool
-   depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
+   depends on HAVE_CONTEXT_TRACKING_WORK && (CONTEXT_TRACKING_USER || CONTEXT_TRACKING_WORK_IDLE)
default y
 
+config CONTEXT_TRACKING_WORK_IDLE
+   bool
+   depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_IDLE && !CONTEXT_TRACKING_USER
+   default n
+   help
+This option enables deferral of some IPIs when they are targeted at CPUs
+that are idle. This can help keep CPUs idle longer, but induces some
+extra overhead to idle <-> kernel transitions and to IPI sending.
+
+Say Y if the power improvements are worth more to you than the added
+overheads.
+
+Say N otherwise.
+
 config NO_HZ
bool "Old Idle dynticks config"
help
-- 
2.43.0




[PATCH v4 28/30] x86/tlb: Make __flush_tlb_all() noinstr

2025-01-14 Thread Valentin Schneider
Later patches will require issuing a __flush_tlb_all() from noinstr code.
Both __flush_tlb_local() and __flush_tlb_global() are now
noinstr-compliant, so __flush_tlb_all() can be made noinstr itself.

Signed-off-by: Valentin Schneider 
---
 arch/x86/include/asm/tlbflush.h | 2 +-
 arch/x86/mm/tlb.c   | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 69e79fff41b80..4d11396250999 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -17,7 +17,7 @@
 
 DECLARE_PER_CPU(u64, tlbstate_untag_mask);
 
-void __flush_tlb_all(void);
+noinstr void __flush_tlb_all(void);
 
 #define TLB_FLUSH_ALL  -1UL
 #define TLB_GENERATION_INVALID 0
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 18b40bbc2fa15..119765772ab11 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1229,7 +1229,7 @@ void flush_tlb_local(void)
 /*
  * Flush everything
  */
-void __flush_tlb_all(void)
+noinstr void __flush_tlb_all(void)
 {
/*
 * This is to catch users with enabled preemption and the PGE feature
@@ -1243,7 +1243,7 @@ void __flush_tlb_all(void)
/*
 * !PGE -> !PCID (setup_pcid()), thus every flush is total.
 */
-   flush_tlb_local();
+   __flush_tlb_local();
}
 }
 EXPORT_SYMBOL_GPL(__flush_tlb_all);
-- 
2.43.0




[PATCH v4 13/30] arm/paravirt: Mark pv_steal_clock static call as __ro_after_init

2025-01-14 Thread Valentin Schneider
The static call is only ever updated in

  __init xen_time_setup_guest()

so mark it appropriately as __ro_after_init.

Signed-off-by: Valentin Schneider 
---
 arch/arm/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/kernel/paravirt.c b/arch/arm/kernel/paravirt.c
index 7dd9806369fb0..632d8d5e06db3 100644
--- a/arch/arm/kernel/paravirt.c
+++ b/arch/arm/kernel/paravirt.c
@@ -20,4 +20,4 @@ static u64 native_steal_clock(int cpu)
return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
-- 
2.43.0




[PATCH v4 22/30] context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry

2025-01-14 Thread Valentin Schneider
ct_nmi_{enter, exit}() only touches the RCU watching counter and doesn't
modify the actual CT state part of context_tracking.state. This means that
upon receiving an IRQ when idle, the CT_STATE_IDLE->CT_STATE_KERNEL
transition only happens in ct_idle_exit().

One can note that ct_nmi_enter() can only ever be entered with the CT state
as either CT_STATE_KERNEL or CT_STATE_IDLE, as an IRQ/NMI happenning in the
CT_STATE_USER or CT_STATE_GUEST states will be routed down to ct_user_exit().

Add/remove CT_STATE_IDLE from the context tracking state as needed in
ct_nmi_{enter, exit}().

Note that this leaves the following window where the CPU is executing code
in kernelspace, but the context tracking state is CT_STATE_IDLE:

  ~> IRQ
  ct_nmi_enter()
state = state + CT_STATE_KERNEL - CT_STATE_IDLE

  [...]

  ct_nmi_exit()
state = state - CT_STATE_KERNEL + CT_STATE_IDLE

  [...] /!\ CT_STATE_IDLE here while we're really in kernelspace! /!\

  ct_cpuidle_exit()
state = state + CT_STATE_KERNEL - CT_STATE_IDLE

Signed-off-by: Valentin Schneider 
---
 kernel/context_tracking.c | 22 +++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index a61498a8425e2..15f10ddec8cbe 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -236,7 +236,9 @@ void noinstr ct_nmi_exit(void)
instrumentation_end();
 
// RCU is watching here ...
-   ct_kernel_exit_state(CT_RCU_WATCHING);
+   ct_kernel_exit_state(CT_RCU_WATCHING -
+CT_STATE_KERNEL +
+CT_STATE_IDLE);
// ... but is no longer watching here.
 
if (!in_nmi())
@@ -259,6 +261,7 @@ void noinstr ct_nmi_enter(void)
 {
long incby = 2;
struct context_tracking *ct = this_cpu_ptr(&context_tracking);
+   int curr_state;
 
/* Complain about underflow. */
WARN_ON_ONCE(ct_nmi_nesting() < 0);
@@ -271,13 +274,26 @@ void noinstr ct_nmi_enter(void)
 * to be in the outermost NMI handler that interrupted an RCU-idle
 * period (observation due to Andy Lutomirski).
 */
-   if (!rcu_is_watching_curr_cpu()) {
+   curr_state = raw_atomic_read(this_cpu_ptr(&context_tracking.state));
+   if (!(curr_state & CT_RCU_WATCHING)) {
 
if (!in_nmi())
rcu_task_enter();
 
+   /*
+* RCU isn't watching, so we're one of
+* CT_STATE_IDLE
+* CT_STATE_USER
+* CT_STATE_GUEST
+* guest/user entry is handled by ct_user_enter(), so this has
+* to be idle entry.
+*/
+   WARN_ON_ONCE((curr_state & CT_STATE_MASK) != CT_STATE_IDLE);
+
// RCU is not watching here ...
-   ct_kernel_enter_state(CT_RCU_WATCHING);
+   ct_kernel_enter_state(CT_RCU_WATCHING +
+ CT_STATE_KERNEL -
+ CT_STATE_IDLE);
// ... but is watching here.
 
instrumentation_begin();
-- 
2.43.0




[PATCH v4 23/30] context_tracking: Turn CT_STATE_* into bits

2025-01-14 Thread Valentin Schneider
A later patch will require easily excluding CT_STATE_KERNEL from a genuine
ct->state read, which requires that value being non-zero and exclusive
with the other CT_STATE_* values.

This increases the size of the CT_STATE region of the ct->state variable by
two bits.
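
One illustrative consequence (not part of the diff) is that testing for a
group of states becomes a single mask operation:

    /* Sketch: with CT_STATE_* as disjoint bits, curr_state being
     * ct->state & CT_STATE_MASK */
    if (curr_state & (CT_STATE_USER | CT_STATE_GUEST))
            ; /* one test covers both user and guest */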

Signed-off-by: Valentin Schneider 
---
 include/linux/context_tracking_state.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 0b81248aa03e2..eb2149b20baef 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -11,11 +11,11 @@
 
 enum ctx_state {
CT_STATE_DISABLED   = -1,   /* returned by ct_state() if unknown */
-   CT_STATE_KERNEL = 0,
-   CT_STATE_IDLE   = 1,
-   CT_STATE_USER   = 2,
-   CT_STATE_GUEST  = 3,
-   CT_STATE_MAX= 4,
+   CT_STATE_KERNEL = 1,
+   CT_STATE_IDLE   = 2,
+   CT_STATE_USER   = 4,
+   CT_STATE_GUEST  = 8,
+   CT_STATE_MAX= 9,
 };
 
 struct context_tracking {
-- 
2.43.0




[PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs

2025-01-14 Thread Valentin Schneider
text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
them vs the newly patched instruction. CPUs that are executing in userspace
do not need this synchronization to happen immediately, and this is
actually harmful interference for NOHZ_FULL CPUs.

As the synchronization IPIs are sent using a blocking call, returning from
text_poke_bp_batch() implies all CPUs will observe the patched
instruction(s), and this should be preserved even if the IPI is deferred.
In other words, to safely defer this synchronization, any kernel
instruction leading to the execution of the deferred instruction
sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.

This means we must pay attention to mutable instructions in the early entry
code:
- alternatives
- static keys
- static calls
- all sorts of probes (kprobes/ftrace/bpf/???)

The early entry code leading to ct_work_flush() is noinstr, which gets rid
of the probes.

Alternatives are safe, because it's boot-time patching (before SMP is
even brought up) which is before any IPI deferral can happen.

This leaves us with static keys and static calls.

Any static key used in early entry code should be only forever-enabled at
boot time, IOW __ro_after_init (pretty much like alternatives). Exceptions
are explicitly marked as allowed in .noinstr and will always generate an
IPI when flipped.

The same applies to static calls - they should be only updated at boot
time, or manually marked as an exception.

Objtool is now able to point at static keys/calls that don't respect this,
and all static keys/calls used in early entry code have now been verified
as behaving appropriately.

Leverage the new context_tracking infrastructure to defer sync_core() IPIs
to a target CPU's next kernel entry.
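
As a rough sketch of the deferred path (helper names per this series), a CPU
that had CT_WORK_SYNC queued while out of kernel context executes the sync on
its next kernel entry:

  ct_kernel_enter()
    ct_work_flush()
      arch_context_tracking_work(CT_WORK_SYNC)
        sync_core()

CPUs for which the work cannot be queued (ct_set_cpu_work() fails) still get
the IPI, preserving the existing guarantee.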

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Nicolas Saenz Julienne 
Signed-off-by: Valentin Schneider 
---
 arch/x86/include/asm/context_tracking_work.h |  6 ++--
 arch/x86/include/asm/text-patching.h |  1 +
 arch/x86/kernel/alternative.c| 38 
 arch/x86/kernel/kprobes/core.c   |  4 +--
 arch/x86/kernel/kprobes/opt.c|  4 +--
 arch/x86/kernel/module.c |  2 +-
 include/asm-generic/sections.h   | 15 
 include/linux/context_tracking_work.h|  4 +--
 8 files changed, 59 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 5f3b2d0977235..485b32881fde5 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -2,11 +2,13 @@
 #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
+#include 
+
 static __always_inline void arch_context_tracking_work(enum ct_work work)
 {
switch (work) {
-   case CT_WORK_n:
-   // Do work...
+   case CT_WORK_SYNC:
+   sync_core();
break;
case CT_WORK_MAX:
WARN_ON_ONCE(true);
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index ab9e143ec9fea..9dfa46f721c1d 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -33,6 +33,7 @@ extern void apply_relocation(u8 *buf, const u8 * const instr, size_t instrlen, u
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void text_poke_sync(void);
+extern void text_poke_sync_deferrable(void);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
 #define text_poke_copy text_poke_copy
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 243843e44e89d..633deea8a89cb 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2109,9 +2110,24 @@ static void do_sync_core(void *info)
sync_core();
 }
 
+static bool do_sync_core_defer_cond(int cpu, void *info)
+{
+   return !ct_set_cpu_work(cpu, CT_WORK_SYNC);
+}
+
+static void __text_poke_sync(smp_cond_func_t cond_func)
+{
+   on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
+}
+
 void text_poke_sync(void)
 {
-   on_each_cpu(do_sync_core, NULL, 1);
+   __text_poke_sync(NULL);
+}
+
+void text_poke_sync_deferrable(void)
+{
+   __text_poke_sync(do_sync_core_defer_cond);
 }
 
 /*
@@ -2282,6 +2298,7 @@ static int tp_vec_nr;
  */
 static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
 {
+   smp_cond_func_t cond = do_sync_core_defer_cond;
unsigned char int3 = INT3_INSN_OPCODE;
unsigned int i;
int do_sync;
@@ -2317,11 +2334,20 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 * First step: add a int3 trap to the address that will

Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs

2025-01-14 Thread Jann Horn
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider  wrote:
> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> flush_tlb_kernel_range() IPIs.
>
> Given that CPUs executing in userspace do not access data in the vmalloc
> range, these IPIs could be deferred until their next kernel entry.
>
> Deferral vs early entry danger zone
> ===
>
> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> and then accessed in early entry code.

In other words, it needs a guarantee that no vmalloc allocations that
have been created in the vmalloc region while the CPU was idle can
then be accessed during early entry, right?



Re: [PATCH v4 26/30] x86,tlb: Make __flush_tlb_global() noinstr-compliant

2025-01-14 Thread Dave Hansen
On 1/14/25 09:51, Valentin Schneider wrote:
> + cr4 = this_cpu_read(cpu_tlbstate.cr4);
> + asm volatile("mov %0,%%cr4": : "r" (cr4 ^ X86_CR4_PGE) : "memory");
> + asm volatile("mov %0,%%cr4": : "r" (cr4) : "memory");
> + /*
> +  * In lieu of not having the pinning crap, hard fail if CR4 doesn't
> +  * match the expected value. This ensures that anybody doing dodgy gets
> +  * the fallthrough check.
> +  */
> + BUG_ON(cr4 != this_cpu_read(cpu_tlbstate.cr4));

Let's say someone managed to write to cpu_tlbstate.cr4 where they
cleared one of the pinned bits.

Before this patch, CR4 pinning would WARN_ONCE() about it pretty quickly
and also reset the cleared bits.

After this patch, the first native_flush_tlb_global() can clear pinned
bits, at least until native_write_cr4() gets called the next time. That
seems like it'll undermine CR4 pinning at least somewhat.

What keeps native_write_cr4() from being noinstr-compliant now? Is it
just the WARN_ONCE()?

If so, I'd kinda rather have a native_write_cr4_nowarn() that's
noinstr-compliant but retains all the other CR4 pinning behavior. Would
something like the attached patch be _worse_?

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3e9037690814..2044d516f06f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -423,24 +423,40 @@ void native_write_cr0(unsigned long val)
 }
 EXPORT_SYMBOL(native_write_cr0);
 
-void __no_profile native_write_cr4(unsigned long val)
+void __no_profile __native_write_cr4(unsigned long val, unsigned long *bits_changed)
 {
-	unsigned long bits_changed = 0;
-
 set_register:
 	asm volatile("mov %0,%%cr4": "+r" (val) : : "memory");
 
 	if (static_branch_likely(&cr_pinning)) {
 		if (unlikely((val & cr4_pinned_mask) != cr4_pinned_bits)) {
-			bits_changed = (val & cr4_pinned_mask) ^ cr4_pinned_bits;
+			*bits_changed = (val & cr4_pinned_mask) ^ cr4_pinned_bits;
 			val = (val & ~cr4_pinned_mask) | cr4_pinned_bits;
 			goto set_register;
 		}
-		/* Warn after we've corrected the changed bits. */
-		WARN_ONCE(bits_changed, "pinned CR4 bits changed: 0x%lx!?\n",
-			  bits_changed);
 	}
 }
+
+void __no_profile native_write_cr4(unsigned long val)
+{
+	unsigned long bits_changed = 0;
+
+	__native_write_cr4(val, &bits_changed);
+
+	if (!bits_changed)
+		return;
+
+	WARN_ONCE(bits_changed, "pinned CR4 bits changed: 0x%lx!?\n",
+		  bits_changed);
+}
+
+void __no_profile native_write_cr4_nowarn(unsigned long val)
+{
+	unsigned long bits_changed = 0;
+
+	__native_write_cr4(val, &bits_changed);
+}
+
 #if IS_MODULE(CONFIG_LKDTM)
 EXPORT_SYMBOL_GPL(native_write_cr4);
 #endif


[PATCH v4 24/30] context-tracking: Introduce work deferral infrastructure

2025-01-14 Thread Valentin Schneider
smp_call_function() & friends have the unfortunate habit of sending IPIs to
isolated, NOHZ_FULL, in-userspace CPUs, as they blindly target all online
CPUs.

Some callsites can be bent into doing the right thing, such as done by commit:

  cc9e303c91f5 ("x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs")

Unfortunately, not all SMP callbacks can be omitted in this
fashion. However, some of them only affect execution in kernelspace, which
means they don't have to be executed *immediately* if the target CPU is in
userspace: stashing the callback and executing it upon the next kernel entry
would suffice. x86 kernel instruction patching or kernel TLB invalidation
are prime examples of it.

Reduce the RCU dynticks counter width to free up some bits to be used as a
deferred callback bitmask. Add some build-time checks to validate that
setup.

Presence of CT_STATE_KERNEL in the ct_state prevents queuing deferred work.

Later commits introduce the bit:callback mappings.
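
For orientation, a sketch of the resulting split of the ct->state word
(field ordering and widths illustrative; see CT_WORK_MASK and friends in the
diff for the exact encoding):

  /*
   * ct->state, a single atomic word (sketch):
   *
   *   [ RCU watching counter (narrowed) | CT_WORK_* bits | CT_STATE_* bits ]
   *
   * ct_state_inc_clear_work() advances the counter while atomically
   * clearing the CT_WORK_* bits.
   */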

Link: https://lore.kernel.org/all/20210929151723.162004...@infradead.org/
Signed-off-by: Nicolas Saenz Julienne 
Signed-off-by: Valentin Schneider 
---
 arch/Kconfig |  9 +++
 arch/x86/Kconfig |  1 +
 arch/x86/include/asm/context_tracking_work.h | 16 +
 include/linux/context_tracking.h | 21 +++
 include/linux/context_tracking_state.h   | 30 ++---
 include/linux/context_tracking_work.h| 26 
 kernel/context_tracking.c| 66 +++-
 kernel/time/Kconfig  |  5 ++
 8 files changed, 162 insertions(+), 12 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h

diff --git a/arch/Kconfig b/arch/Kconfig
index 6682b2a53e342..b637f20f0fc68 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -952,6 +952,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK
  - No use of instrumentation, unless instrumentation_begin() got
called.
 
+config HAVE_CONTEXT_TRACKING_WORK
+   bool
+   help
+ Architecture supports deferring work while not in kernel context.
+ This is especially useful on setups with isolated CPUs that might
+ want to avoid being interrupted to perform housekeeping tasks (for
+ ex. TLB invalidation or icache invalidation). The housekeeping
+ operations are performed upon re-entering the kernel.
+
 config HAVE_TIF_NOHZ
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9d7bd0ae48c42..ca5dd4cfc354b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -216,6 +216,7 @@ config X86
select HAVE_CMPXCHG_LOCAL
select HAVE_CONTEXT_TRACKING_USER   if X86_64
	select HAVE_CONTEXT_TRACKING_USER_OFFSTACK  if HAVE_CONTEXT_TRACKING_USER
+   select HAVE_CONTEXT_TRACKING_WORK   if X86_64
select HAVE_C_RECORDMCOUNT
select HAVE_OBJTOOL_MCOUNT  if HAVE_OBJTOOL
select HAVE_OBJTOOL_NOP_MCOUNT  if HAVE_OBJTOOL_MCOUNT
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
new file mode 100644
index 0..5f3b2d0977235
--- /dev/null
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
+#define _ASM_X86_CONTEXT_TRACKING_WORK_H
+
+static __always_inline void arch_context_tracking_work(enum ct_work work)
+{
+   switch (work) {
+   case CT_WORK_n:
+   // Do work...
+   break;
+   case CT_WORK_MAX:
+   WARN_ON_ONCE(true);
+   }
+}
+
+#endif
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index af9fe87a09225..0b0faa040e9b5 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -137,6 +138,26 @@ static __always_inline unsigned long ct_state_inc(int incby)
	return raw_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state));
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static __always_inline unsigned long ct_state_inc_clear_work(int incby)
+{
+   struct context_tracking *ct = this_cpu_ptr(&context_tracking);
+   unsigned long new, old, state;
+
+   state = arch_atomic_read(&ct->state);
+   do {
+   old = state;
+   new = old & ~CT_WORK_MASK;
+   new += incby;
+   state = arch_atomic_cmpxchg(&ct->state, old, new);
+   } while (old != state);
+
+   return new;
+}
+#else
+#define ct_state_inc_clear_work(x) ct_state_inc(x)
+#endif
+
 static __always_inline bool warn_rcu_enter(void)
 {
bool ret = false;
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index eb2

[PATCH v4 26/30] x86,tlb: Make __flush_tlb_global() noinstr-compliant

2025-01-14 Thread Valentin Schneider
From: Peter Zijlstra 

Later patches will require issuing a __flush_tlb_all() from noinstr code.
This requires making both __flush_tlb_local() and __flush_tlb_global()
noinstr-compliant.

For __flush_tlb_global(), both native_flush_tlb_global() and xen_flush_tlb()
need to be made noinstr.

Forgo using __native_flush_tlb_global() / native_write_cr4() and have the
ASM directly inlined in the native function. For the Xen stuff,
__always_inline a handful of helpers.

Not-signed-off-by: Peter Zijlstra 
[Changelog faff]
Signed-off-by: Valentin Schneider 
---
 arch/x86/include/asm/invpcid.h   | 13 ++---
 arch/x86/include/asm/paravirt.h  |  2 +-
 arch/x86/include/asm/xen/hypercall.h | 11 +--
 arch/x86/mm/tlb.c| 15 +++
 arch/x86/xen/mmu_pv.c| 10 +-
 arch/x86/xen/xen-ops.h   | 12 
 6 files changed, 40 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/invpcid.h b/arch/x86/include/asm/invpcid.h
index 734482afbf81d..ff26136fcd9c6 100644
--- a/arch/x86/include/asm/invpcid.h
+++ b/arch/x86/include/asm/invpcid.h
@@ -2,7 +2,7 @@
 #ifndef _ASM_X86_INVPCID
 #define _ASM_X86_INVPCID
 
-static inline void __invpcid(unsigned long pcid, unsigned long addr,
+static __always_inline void __invpcid(unsigned long pcid, unsigned long addr,
 unsigned long type)
 {
struct { u64 d[2]; } desc = { { pcid, addr } };
@@ -13,7 +13,7 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr,
 * mappings, we don't want the compiler to reorder any subsequent
 * memory accesses before the TLB flush.
 */
-   asm volatile("invpcid %[desc], %[type]"
+   asm_inline volatile("invpcid %[desc], %[type]"
 :: [desc] "m" (desc), [type] "r" (type) : "memory");
 }
 
@@ -23,26 +23,25 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr,
 #define INVPCID_TYPE_ALL_NON_GLOBAL3
 
 /* Flush all mappings for a given pcid and addr, not including globals. */
-static inline void invpcid_flush_one(unsigned long pcid,
-unsigned long addr)
+static __always_inline void invpcid_flush_one(unsigned long pcid, unsigned long addr)
 {
__invpcid(pcid, addr, INVPCID_TYPE_INDIV_ADDR);
 }
 
 /* Flush all mappings for a given PCID, not including globals. */
-static inline void invpcid_flush_single_context(unsigned long pcid)
+static __always_inline void invpcid_flush_single_context(unsigned long pcid)
 {
__invpcid(pcid, 0, INVPCID_TYPE_SINGLE_CTXT);
 }
 
 /* Flush all mappings, including globals, for all PCIDs. */
-static inline void invpcid_flush_all(void)
+static __always_inline void invpcid_flush_all(void)
 {
__invpcid(0, 0, INVPCID_TYPE_ALL_INCL_GLOBAL);
 }
 
 /* Flush all mappings for all PCIDs except globals. */
-static inline void invpcid_flush_all_nonglobals(void)
+static __always_inline void invpcid_flush_all_nonglobals(void)
 {
__invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d4eb9e1d61b8e..b3daee3d46677 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -75,7 +75,7 @@ static inline void __flush_tlb_local(void)
PVOP_VCALL0(mmu.flush_tlb_user);
 }
 
-static inline void __flush_tlb_global(void)
+static __always_inline void __flush_tlb_global(void)
 {
PVOP_VCALL0(mmu.flush_tlb_kernel);
 }
diff --git a/arch/x86/include/asm/xen/hypercall.h 
b/arch/x86/include/asm/xen/hypercall.h
index 97771b9d33af3..291e9f8006f62 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -365,8 +365,8 @@ MULTI_mmu_update(struct multicall_entry *mcl, struct mmu_update *req,
trace_xen_mc_entry(mcl, 4);
 }
 
-static inline void
-MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
+static __always_inline void
+__MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
int *success_count, domid_t domid)
 {
mcl->op = __HYPERVISOR_mmuext_op;
@@ -374,6 +374,13 @@ MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
mcl->args[1] = count;
mcl->args[2] = (unsigned long)success_count;
mcl->args[3] = domid;
+}
+
+static inline void
+MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
+   int *success_count, domid_t domid)
+{
+   __MULTI_mmuext_op(mcl, op, count, success_count, domid);
 
trace_xen_mc_entry(mcl, 4);
 }
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a2becb85bea79..2d2ab3e221f0c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1169,9 +1169,10 @@ void flush_tlb_one_user(unsigned long addr)
 /*
  * Flush everything
  */
-STATIC_NOPV void native_flush_tlb_global(void)
+STATIC_NOPV noinstr void native_flush_tlb_global(void)

Re: [XEN PATCH] MAINTAINERS: Change reviewer of the ECLAIR integration

2025-01-14 Thread Stefano Stabellini
On Tue, 14 Jan 2025, Simone Ballarin wrote:
> On 2025-01-14 16:00, Roger Pau Monné wrote:
> > On Tue, Jan 14, 2025 at 03:48:54PM +0100, Nicola Vetrini wrote:
> > > Simone Ballarin is no longer actively involved in reviewing
> > > the ECLAIR integration for Xen. I am stepping up as a reviewer.
> > > 
> > > Signed-off-by: Nicola Vetrini 
> > 
> > Acked-by: Roger Pau Monné 
> > 
> 
> Acked-by: Simone Ballarin 

Acked-by: Stefano Stabellini 

Re: [PATCH for-4.20] automation/gitlab: disable coverage from clang randconfig

2025-01-14 Thread Stefano Stabellini
On Tue, 14 Jan 2025, Andrew Cooper wrote:
> On 14/01/2025 5:43 pm, Roger Pau Monne wrote:
> > If randconfig enables coverage support the build times out due to GNU LD
> > taking too long.  For the time being prevent coverage from being enabled in
> > clang randconfig job.
> >
> > Signed-off-by: Roger Pau Monné 
> 
> Acked-by: Andrew Cooper 

Acked-by: Stefano Stabellini 

Re: [PATCH] riscv: Add initial Xen guest support for RISC-V

2025-01-14 Thread Stefano Stabellini
+Oleksii

On Tue, 14 Jan 2025, Milan Djokic wrote:
> From: Slavisa Petrovic 
> 
> This patch introduces initial support for running RISC-V as a Xen guest.
> It provides the necessary infrastructure and stubs for Xen-specific
> operations. Key changes include:
> 
> - Modifications to the RISC-V kernel to integrate Xen-specific hypercalls
>   and interfaces, with function implementations stubbed for future work.
> - Introduction of Xen-specific headers for RISC-V, including event
>   handling, hypervisor interaction, and page management. Functions are
>   defined but not yet implemented.
> - Stub implementations for memory management, grant tables, and context
>   switching in the Xen environment, allowing further development and
>   integration.
> 
> Signed-off-by: Milan Djokic 
> Signed-off-by: Slavisa Petrovic 

Hi Milan, Slavisa,

Thank you very much for your contribution! Which Xen tree are you using
for development? I am asking because RISC-V support in Xen is still in
the process of being upstreamed, and [1] is the tree that consolidates
all patches currently on the to-be-upstreamed list.  

While the specific Xen tree you are using is not particularly important
at this stage, and using [1] is not a requirement, I am asking because
it is essential to align on the hypervisor ABI, especially regarding any
details that have not yet been upstreamed. Specifically, is there
anything in this patch series that relies on features not yet upstream
in Xen?  

[1] https://gitlab.com/xen-project/people/olkur/xen/-/tree/latest?ref_type=heads



RE: [PATCH v1 02/11] xen/x86: introduce new sub-hypercall to get CPPC data

2025-01-14 Thread Penny, Zheng

Hi,

> -Original Message-
> From: Jan Beulich 
> Sent: Thursday, January 9, 2025 5:46 PM
> To: Penny, Zheng 
> Cc: Stabellini, Stefano ; Huang, Ray
> ; Ragiadakou, Xenia ;
> Andryuk, Jason ; Andrew Cooper
> ; Roger Pau Monné ; Julien
> Grall ; Stefano Stabellini ; xen-
> de...@lists.xenproject.org
> Subject: Re: [PATCH v1 02/11] xen/x86: introduce new sub-hypercall to get CPPC
> data
>
> On 03.12.2024 09:11, Penny Zheng wrote:
> > In order to provide backward compatibility with existing governors
> > that represent performance as frequencies, like ondemand, the _CPC
> > table can optionally provide processor frequency range values, Lowest
> > frequency and Nominal frequency, to let OS use Lowest Frequency/
> > Performance and Nominal Frequency/Performance as anchor points to
> > create linear mapping of CPPC abstract performance to CPU frequency.
> >
> > As Xen is incapable of parsing the ACPI dynamic table, this commit
> > introduces a new sub-hypercall to get required CPPC data from
> > dom0 kernel.
>
> "get" as used both here and in the title is, to me, something an entity does 
> actively.
> Xen is entirely passive here, though. (Reading the title I was first assuming 
> this is
> about a sub-op to get certain data out of
> Xen.)

How about "a new sub-hypercall to pass/deliver CPPC data to Xen"? Or any better
suggestions?

>
> > --- a/xen/arch/x86/platform_hypercall.c
> > +++ b/xen/arch/x86/platform_hypercall.c
> > @@ -572,6 +572,12 @@ ret_t do_platform_op(
> >  break;
> >  }
> >
> > +case XEN_PM_CPPC:
> > +{
> > +ret = set_cppc_pminfo(op->u.set_pminfo.id, &op-
> >u.set_pminfo.u.cppc_data);
> > +}
> > +break;
>
> No such unnecessary figure braces please, which - once dropped - will also 
> call for
> "break" to be indented one level deeper.

Understood.

>
> > --- a/xen/arch/x86/x86_64/cpufreq.c
> > +++ b/xen/arch/x86/x86_64/cpufreq.c
> > @@ -54,3 +54,21 @@ int compat_set_px_pminfo(uint32_t acpi_id,
> >
> >  return set_px_pminfo(acpi_id, xen_perf);  }
> > +
> > +int compat_set_cppc_pminfo(uint32_t acpi_id,
> > +   struct compat_processor_cppc *cppc_data) {
> > +struct xen_processor_cppc *xen_cppc;
> > +unsigned long xlat_page_current;
> > +
> > +xlat_malloc_init(xlat_page_current);
> > +
> > +xen_cppc = xlat_malloc_array(xlat_page_current,
> > +struct xen_processor_cppc, 1);
> > +if ( unlikely(xen_cppc == NULL) )
> > +return -EFAULT;
> > +
> > +XLAT_processor_cppc(xen_cppc, cppc_data);
> > +
> > +return set_cppc_pminfo(acpi_id, xen_cppc); }
>
> Why's this needed? The structure - for now at least - consists of only 
> uint32_t-s,
> and hence has identical layout for compat callers.
>

Understood. I'm not familiar with the compat framework.

> > --- a/xen/drivers/cpufreq/cpufreq.c
> > +++ b/xen/drivers/cpufreq/cpufreq.c
> > @@ -458,6 +458,56 @@ static void print_PPC(unsigned int platform_limit)
> >  printk("\t_PPC: %d\n", platform_limit);  }
> >
> > +static void print_CPPC(struct xen_processor_cppc *cppc_data)
>
> Pointer-to-const?
>

Sure.

> > +{
> > +printk("\t_CPC: highest_perf=%u, lowest_perf=%u, "
> > +   "nominal_perf=%u, lowest_nonlinear_perf=%u, "
> > +   "nominal_freq=%uMhz, lowest_freq=%uMhz\n",
> > +   cppc_data->highest_perf, cppc_data->lowest_perf,
> > +   cppc_data->nominal_perf, cppc_data->lowest_nonlinear_perf,
> > +   cppc_data->nominal_freq, cppc_data->lowest_freq); }
> > +
> > +int set_cppc_pminfo(uint32_t acpi_id, struct xen_processor_cppc
> > +*cppc_data)
>
> Pointer-to-const?
>

Sure.

> > +{
> > +int ret = 0, cpuid;
> > +struct processor_pminfo *pm_info;
> > +
> > +cpuid = get_cpu_id(acpi_id);
> > +if ( cpuid < 0 || !cppc_data )
> > +{
> > +ret = -EINVAL;
> > +goto out;
> > +}
> > +if ( cpufreq_verbose )
> > +printk("Set CPU acpi_id(%d) cpuid(%d) CPPC State info:\n",
> > +   acpi_id, cpuid);
> > +
> > +pm_info = processor_pminfo[cpuid];
> > +if ( !pm_info )
> > +{
> > +pm_info = xzalloc(struct processor_pminfo);
>
> Please be aware that new code is supposed to be using xvmalloc().

Thanks for the update.

>
> > +if ( !pm_info )
> > +{
> > +ret = -ENOMEM;
> > +goto out;
> > +}
> > +processor_pminfo[cpuid] = pm_info;
> > +}
> > +pm_info->acpi_id = acpi_id;
> > +pm_info->id = cpuid;
> > +
> > +memcpy ((void *)&pm_info->cppc_data,
> > +(void *)cppc_data,
> > +sizeof(struct xen_processor_cppc));
>
> What use are these casts? Also please no blank before the opening parenthesis 
> of
> a function call, and please sizeof(*cppc_data). Yet then - why memcpy() in 
> the first
> place? This can be a (type safe) structure assignment, can't it?
>

Yes, a plain structure assignment works here.
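
Something like this (sketch, assuming the field types already match):

    pm_info->cppc_data = *cppc_data;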

Re: [PATCH v2 11/25] drm/loongson: Compute dumb-buffer sizes with drm_mode_size_dumb()

2025-01-14 Thread Sui Jingfeng

Hi,

On 2025/1/9 22:57, Thomas Zimmermann wrote:

Call drm_mode_size_dumb() to compute dumb-buffer scanline pitch and
buffer size. Align the pitch according to hardware requirements.

Signed-off-by: Thomas Zimmermann 
Cc: Sui Jingfeng 
Cc: Sui Jingfeng 



Reviewed-by: Sui Jingfeng 



---
  drivers/gpu/drm/loongson/lsdc_gem.c | 29 -
  1 file changed, 8 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/loongson/lsdc_gem.c b/drivers/gpu/drm/loongson/lsdc_gem.c
index a720d8f53209..9f982b85301f 100644
--- a/drivers/gpu/drm/loongson/lsdc_gem.c
+++ b/drivers/gpu/drm/loongson/lsdc_gem.c
@@ -6,6 +6,7 @@
  #include 
  
  #include 

+#include 
  #include 
  #include 
  #include 
@@ -204,45 +205,31 @@ int lsdc_dumb_create(struct drm_file *file, struct drm_device *ddev,
const struct lsdc_desc *descp = ldev->descp;
u32 domain = LSDC_GEM_DOMAIN_VRAM;
struct drm_gem_object *gobj;
-   size_t size;
-   u32 pitch;
-   u32 handle;
int ret;
  
-	if (!args->width || !args->height)
-   return -EINVAL;
-
-   if (args->bpp != 32 && args->bpp != 16)
-   return -EINVAL;
-
-   pitch = args->width * args->bpp / 8;
-   pitch = ALIGN(pitch, descp->pitch_align);
-   size = pitch * args->height;
-   size = ALIGN(size, PAGE_SIZE);
+   ret = drm_mode_size_dumb(ddev, args, descp->pitch_align, 0);
+   if (ret)
+   return ret;
  
  	/* Maximum single bo size allowed is the half vram size available */

-   if (size > ldev->vram_size / 2) {
-   drm_err(ddev, "Requesting(%zuMiB) failed\n", size >> 20);
+   if (args->size > ldev->vram_size / 2) {
+   drm_err(ddev, "Requesting(%zuMiB) failed\n", (size_t)(args->size >> PAGE_SHIFT));
return -ENOMEM;
}
  
-	gobj = lsdc_gem_object_create(ddev, domain, size, false, NULL, NULL);
+   gobj = lsdc_gem_object_create(ddev, domain, args->size, false, NULL, NULL);
if (IS_ERR(gobj)) {
drm_err(ddev, "Failed to create gem object\n");
return PTR_ERR(gobj);
}
  
-	ret = drm_gem_handle_create(file, gobj, &handle);
+   ret = drm_gem_handle_create(file, gobj, &args->handle);
  
  	/* drop reference from allocate, handle holds it now */

drm_gem_object_put(gobj);
if (ret)
return ret;
  
-	args->pitch = pitch;
-   args->size = size;
-   args->handle = handle;
-
return 0;
  }
  


--
Best regards,
Sui




Re: [PATCH v4 10/30] riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init

2025-01-14 Thread Andrew Jones
On Tue, Jan 14, 2025 at 06:51:23PM +0100, Valentin Schneider wrote:
> The static call is only ever updated in:
> 
>   __init pv_time_init()
>   __init xen_time_setup_guest()
> 
> so mark it appropriately as __ro_after_init.
> 
> Signed-off-by: Valentin Schneider 
> ---
>  arch/riscv/kernel/paravirt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/riscv/kernel/paravirt.c b/arch/riscv/kernel/paravirt.c
> index fa6b0339a65de..dfe8808016fd8 100644
> --- a/arch/riscv/kernel/paravirt.c
> +++ b/arch/riscv/kernel/paravirt.c
> @@ -30,7 +30,7 @@ static u64 native_steal_clock(int cpu)
>   return 0;
>  }
>  
> -DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
> +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
>  
>  static bool steal_acc = true;
>  static int __init parse_no_stealacc(char *arg)
> -- 
> 2.43.0
>

Reviewed-by: Andrew Jones 


