Re: [PATCH v4 4/6] xen/arm: mpu: Create boot-time MPU protection regions

2024-10-30 Thread Luca Fancellu
Hi Ayan,

While I rebased the branch on top of your patches, I saw you’ve changed the 
number of regions
mapped at boot time, can I ask why? 

Compared to 
https://patchwork.kernel.org/project/xen-devel/patch/20230626033443.2943270-25-penny.zh...@arm.com/:

> +FUNC(enable_boot_cpu_mm)
> +
> +/* Get the number of regions specified in MPUIR_EL2 */
> +mrs   x5, MPUIR_EL2
> +
> +/* x0: region sel */
> +mov   x0, xzr
> +/* Xen text section. */
> +ldr   x1, =_stext
> +ldr   x2, =_etext
> +prepare_xen_region x0, x1, x2, x3, x4, x5, attr_prbar=REGION_TEXT_PRBAR
> +
> +/* Xen read-only data section. */
> +ldr   x1, =_srodata
> +ldr   x2, =_erodata
> +prepare_xen_region x0, x1, x2, x3, x4, x5, attr_prbar=REGION_RO_PRBAR
> +
> +/* Xen read-only after init and data section. (RW data) */
> +ldr   x1, =__ro_after_init_start
> +ldr   x2, =__init_begin
> +prepare_xen_region x0, x1, x2, x3, x4, x5

 ^— this, for example, will prevent Xen from calling init_done(void) later. I 
understand this is earlyboot,
   but I guess we don’t want to make subsequent changes to this 
part when introducing the
   patches to support start_xen()

> +
> +/* Xen code section. */
> +ldr   x1, =__init_begin
> +ldr   x2, =__init_data_begin
> +prepare_xen_region x0, x1, x2, x3, x4, x5, attr_prbar=REGION_TEXT_PRBAR
> +
> +/* Xen data and BSS section. */
> +ldr   x1, =__init_data_begin
> +ldr   x2, =__bss_end
> +prepare_xen_region x0, x1, x2, x3, x4, x5
> +
> +ret
> +
> +END(enable_boot_cpu_mm)

I suggest keeping exactly the regions as they are in Penny’s patch.

Cheers,
Luca

Re: [PATCH v1] xen/common: move gic_preinit() to common code

2024-10-30 Thread oleksii . kurochko
On Tue, 2024-10-29 at 17:57 +0100, Jan Beulich wrote:
> On 29.10.2024 17:47, Oleksii Kurochko wrote:
> > --- a/xen/common/device.c
> > +++ b/xen/common/device.c
> > @@ -4,10 +4,14 @@
> >    *   xen/arch/arm/device.c
> >    */
> >   
> > +#include 
> >   #include 
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> > +#include 
> > +#include 
> 
> I don't think non-ACPI configs should include xen/acpi.h here. Surely
> this
> can be moved down into the ACPI-only code section?
xen/acpi.h was included as acpi_disabled is defined there and it is
needed in common/device.c for:
```
void __init ic_preinit(void)
{
    if ( acpi_disabled )
        ic_dt_preinit();
    else
        ic_acpi_preinit();
}
```
It seems to me that ic_preinit() could also live in generic code, with just
ic_acpi_preinit() defined in architecture-specific code.

~ Oleksii



Re: [RFC PATCH 0/6] xen/abi: On wide bitfields inside primitive types

2024-10-30 Thread Jan Beulich
On 29.10.2024 19:16, Alejandro Vallejo wrote:
> Non-boolean bitfields in the hypercall ABI make it fairly inconvenient to
> create bindings for any language because (a) they are always ad-hoc and are
> subject to restrictions regular fields are not (b) require boilerplate that
> regular fields do not and (c) might not even be part of the core language,
> forcing avoidable external libraries into any sort of generic library.
> 
> This patch (it's a series merely to split roughly by maintainer) is one such
> case that I happened to spot while playing around. It's the grant_version
> field, buried under an otherwise empty grant_opts.
> 
> The invariant I'd like to (slowly) introduce and discuss is that fields may
> have bitflags (e.g: a packed array of booleans indexed by some enumerated
> type), but not be mixed with wider fields in the same primitive type. This
> ensures any field containing an integer of any kind can be referred by pointer
> and treated the same way as any other with regards to sizeof() and the like.

While I don't strictly mind, I'm also not really seeing why taking addresses
or applying sizeof() would be commonly necessary. Can you perhaps provide a
concrete example of where the present way of dealing with grant max version
is getting in the way? After all your use of the term "bitfield" doesn't
really mean C's understanding of it, so especially (c) above escapes me to a
fair degree.

> I'd like to have a certain consensus about this general point before
> establishing this restriction in the IDL system I'm working on.
> 
> My preference would be to fold everything into a single patch if we decide to
> follow through with this particular case. As I said before, the split is
> artificial for review.

That's not just a preference, but a requirement, or else the build will break
in the middle of the series (harming bisection at the very least).

Jan



Re: [PATCH v3] x86/io-apic: fix directed EOI when using AMD-Vi interrupt remapping

2024-10-30 Thread Jan Beulich
On 29.10.2024 18:48, Roger Pau Monné wrote:
> On Tue, Oct 29, 2024 at 05:43:24PM +0100, Jan Beulich wrote:
>> On 29.10.2024 12:03, Roger Pau Monne wrote:
>> Plus with what you said
>> earlier about vector vs EOI handle, and with the code using "vector" all 
>> over the
>> place, their (non-)relationship could also do with clarifying (perhaps 
>> better in
>> a code comment in __io_apic_eoi()).
> 
> I've attempted to clarify the relation between vector vs EOI handle in
> the first paragraph, and how that applies to AMD-Vi.  I can move
> (part?) of that into the comment in __ioapic_write_entry(), maybe:
> 
> /*
>  * Might be called before io_apic_pin_eoi is allocated.  Entry will be
>  * updated once the array is allocated and there's a write against the
>  * pin.
>  *
>  * Note that the vector field is only cached for raw RTE writes when
>  * using IR.  In that case the vector field might have been repurposed
>  * to store something different than the target vector, and hence need
>  * to be cached for performing EOI.
>  */

Sounds okay to me, yet I'd prefer a comment in __io_apic_eoi(), where it
may want wording a little differently.

>>> @@ -273,6 +293,13 @@ void __ioapic_write_entry(
>>>  {
>>>  __io_apic_write(apic, 0x11 + 2 * pin, eu.w2);
>>>  __io_apic_write(apic, 0x10 + 2 * pin, eu.w1);
>>> +/*
>>> + * Called in clear_IO_APIC_pin() before io_apic_pin_eoi is 
>>> allocated.
>>> + * Entry will be updated once the array is allocated and there's a
>>> + * write against the pin.
>>> + */
>>> +if ( io_apic_pin_eoi )
>>> +io_apic_pin_eoi[apic][pin] = e.vector;
>>
>> The comment here looks a little misleading to me. clear_IO_APIC_pin() calls
>> here to, in particular, set the mask bit. With the mask bit the vector isn't
>> meaningful anyway (and indeed clear_IO_APIC_pin() sets it to zero, at which
>> point recording IRQ_VECTOR_UNASSIGNED might be better than the bogus vector
>> 0x00).
> 
> Note that clear_IO_APIC_pin() performs the call to
> __ioapic_write_entry() with raw == false, at which point
> __ioapic_write_entry() will call iommu_update_ire_from_apic() if IOMMU
> IR is enabled.  The cached 'vector' value will be the IOMMU entry
> offset for the AMD-Vi case, as the IOMMU code will perform the call to
> __ioapic_write_entry() with raw == true.
> 
> What matters is that the cached value matches what's written in the
> IO-APIC RTE, and the current logic ensures this.
> 
> What's the benefit of using IRQ_VECTOR_UNASSIGNED if the result is
> reading the RTE and finding that vector == 0?

It's not specifically the vector == 0 case alone. Shouldn't we leave
the latched vector alone when writing an RTE with the mask bit set?
Any still pending EOI (there should be none aiui) can't possibly
target the meaningless vector / index in such an RTE. Perhaps it was
wrong to suggest to overwrite (with IRQ_VECTOR_UNASSIGNED) what we
have on record.

Yet at the same time there ought to be a case where the recorded vector
indeed moves back to IRQ_VECTOR_UNASSIGNED.

> Looking at clear_IO_APIC_pin() - I think the function is slightly
> bogus.  If entry.trigger is not set, the logic to switch the entry to
> level triggered  will fetch the entry contents without requesting a
> raw RTE, at which point the entry.vector field can not be used as
> the EOI handle since it will contain the vector, not the IR table
> offset.  I will need to make a further patch to fix this corner
> case.

Is there actually a reason not to pass IRQ_VECTOR_UNASSIGNED there,
to have __io_apic_eoi() determine the vector? (But of course we can
also latch entry.vector from the earlier raw read.)

>>> @@ -298,9 +325,17 @@ static void __io_apic_eoi(unsigned int apic, unsigned 
>>> int vector, unsigned int p
>>>  /* Prefer the use of the EOI register if available */
>>>  if ( ioapic_has_eoi_reg(apic) )
>>>  {
>>> +if ( io_apic_pin_eoi )
>>> +vector = io_apic_pin_eoi[apic][pin];
>>> +
>>>  /* If vector is unknown, read it from the IO-APIC */
>>>  if ( vector == IRQ_VECTOR_UNASSIGNED )
>>> +{
>>>  vector = __ioapic_read_entry(apic, pin, true).vector;
>>
>> Related to my comment higher up regarding vector vs EOI handle: Here we're
>> doing a raw read, i.e. we don't really fetch the vector but the EOI handle
>> in the AMD case. Why is it that this isn't sufficient for directed EOI to
>> work (perhaps with the conditional adjusted)?
> 
> It is enough, but we don't want to be doing such read for each EOI,
> hence why we cache it in io_apic_pin_eoi.

Yet then the patch is to a fair part about improving performance, when the
functionality issue could be addressed with a far less intrusive change.
Which may in particular make a difference with backporting in mind. Plus
that may want at least mentioning in the description.

>> Then again - are we ever taking this path? Certainly not when coming from
>> clear_IO_APIC_pin(), hence

Re: [PATCH] x86/cpu-policy: Extend the guest max policy max leaf/subleaves

2024-10-30 Thread Andrew Cooper
On 30/10/2024 6:51 am, Jan Beulich wrote:
> On 29.10.2024 18:55, Andrew Cooper wrote:
>> We already have one migration case opencoded (feat.max_subleaf).  A more
>> recent discovery is that we advertise x2APIC to guests without ensuring that
>> we provide max_leaf >= 0xb.
>>
>> In general, any leaf known to Xen can be safely configured by the toolstack 
>> if
>> it doesn't violate other constraints.
>>
>> Therefore, introduce guest_common_{max,default}_leaves() to generalise the
>> special case we currently have for feat.max_subleaf, in preparation to be 
>> able
>> to provide x2APIC topology in leaf 0xb even on older hardware.
>>
>> Signed-off-by: Andrew Cooper 
> Reviewed-by: Jan Beulich 
>
> I'll have to update the AMX logic accordingly (maybe also the AVX10 one).

Yeah - I need to get back to your shrinking series too.

> I'd like to point out that this highlights a naming anomaly in
> x86_cpu_policies_are_compatible(): update_domain_cpu_policy() passes in
> the respective max policy as first argument. Imo the first parameter of
> the function would better be named "max" there.

That's covered in the documentation.  It made sense when I first planned
things, but that was many many iterations ago.

>
>> --- a/xen/arch/x86/cpu-policy.c
>> +++ b/xen/arch/x86/cpu-policy.c
>> @@ -391,6 +391,27 @@ static void __init calculate_host_policy(void)
>>  p->platform_info.cpuid_faulting = cpu_has_cpuid_faulting;
>>  }
>>  
>> +/*
>> + * Guest max policies can have any max leaf/subleaf within bounds.
>> + *
>> + * - Some incoming VMs have a larger-than-necessary feat max_subleaf.
>> + * - Some VMs we'd like to synthesise leaves not present on the host.
>> + */
>> +static void __init guest_common_max_leaves(struct cpu_policy *p)
>> +{
>> +p->basic.max_leaf   = ARRAY_SIZE(p->basic.raw) - 1;
>> +p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
>> +p->extd.max_leaf= 0x8000U + ARRAY_SIZE(p->extd.raw) - 1;
>> +}
>> +
>> +/* Guest default policies inherit the host max leaf/subleaf settings. */
>> +static void __init guest_common_default_leaves(struct cpu_policy *p)
>> +{
>> +p->basic.max_leaf   = host_cpu_policy.basic.max_leaf;
>> +p->feat.max_subleaf = host_cpu_policy.feat.max_subleaf;
>> +p->extd.max_leaf= host_cpu_policy.extd.max_leaf;
>> +}
> Which sadly still leaves open how to suitably shrink the max values,
> when they're larger than necessary (for the guest).

Only the toolstack can do the shrinking, and only at about the final
step after optional features have been activated.

~Andrew



Re: [PATCH v2 1/3] xen/riscv: introduce setup_mm()

2024-10-30 Thread Jan Beulich
On 23.10.2024 17:50, Oleksii Kurochko wrote:
> Introduce the implementation of setup_mm(), which includes:
> 1. Adding all free regions to the boot allocator, as memory is needed
>to allocate page tables used for frame table mapping.
> 2. Calculating RAM size and the RAM end address.
> 3. Setting up direct map mappings from each RAM bank and initialize
>directmap_virt_start (also introduce XENHEAP_VIRT_START which is
>defined as directmap_virt_start) to be properly aligned with RAM
>start to use more superpages to reduce pressure on the TLB.
> 4. Setting up frame table mappings from physical address 0 to ram_end
>to simplify mfn_to_page() and page_to_mfn() conversions.
> 5. Setting up total_pages and max_page.
> 
> Update virt_to_maddr() to use introduced XENHEAP_VIRT_START.
> 
> Implement maddr_to_virt() function to convert a machine address
> to a virtual address. This function is specifically designed to be used
> only for the DIRECTMAP region, so a check has been added to ensure that
> the address does not exceed DIRECTMAP_SIZE.

I'm unconvinced by this. Conceivably the function could be used on
"imaginary" addresses, just to calculate abstract positions or e.g.
deltas. At the same time I'm also not going to insist on the removal of
that assertion, so long as it doesn't trigger.

> After the introduction of maddr_to_virt() the following linkage error starts
> to occur, and to avoid it a share_xen_page_with_guest() stub is added:
>   riscv64-linux-gnu-ld: prelink.o: in function `tasklet_kill':
>   /build/xen/common/tasklet.c:176: undefined reference to
>  `share_xen_page_with_guest'
>   riscv64-linux-gnu-ld: ./.xen-syms.0: hidden symbol 
> `share_xen_page_with_guest'
> isn't defined riscv64-linux-gnu-ld: final link failed: bad value
> 
> Despite the linker fingering tasklet.c, it's trace.o which has the undefined
> reference:
>   $ find . -name \*.o | while read F; do nm $F | grep 
> share_xen_page_with_guest &&
> echo $F; done
>  U share_xen_page_with_guest
> ./xen/common/built_in.o
>  U share_xen_page_with_guest
> ./xen/common/trace.o
>  U share_xen_page_with_guest
> ./xen/prelink.o
> 
> Looking at trace.i, there is a call of share_xen_page_with_guest(), but when
> maddr_to_virt() is defined as "return NULL" the compiler optimizes out the
> part of common/trace.c code where share_xen_page_with_priviliged_guest() is
> called (there is no code in the disassembled common/trace.o), so there is no
> real call of share_xen_page_with_priviliged_guest().

I don't think it's the "return NULL", but rather BUG_ON()'s (really BUG()'s)
unreachable(). Not the least because the function can't validly return NULL,
and hence callers have no need to check for NULL.

> @@ -25,8 +27,11 @@
>  
>  static inline void *maddr_to_virt(paddr_t ma)
>  {
> -BUG_ON("unimplemented");
> -return NULL;
> +unsigned long va_offset = maddr_to_directmapoff(ma);
> +
> +ASSERT(va_offset < DIRECTMAP_SIZE);
> +
> +return (void *)(XENHEAP_VIRT_START + va_offset);
>  }

I'm afraid I'm not following why this uses XENHEAP_VIRT_START, when
it's all about the directmap. I'm in trouble with XENHEAP_VIRT_START
in the first place: You don't have a separate "heap" virtual address
range, do you?

> @@ -37,9 +42,9 @@ static inline void *maddr_to_virt(paddr_t ma)
>   */
>  static inline unsigned long virt_to_maddr(unsigned long va)
>  {
> -if ((va >= DIRECTMAP_VIRT_START) &&
> +if ((va >= XENHEAP_VIRT_START) &&
>  (va < (DIRECTMAP_VIRT_START + DIRECTMAP_SIZE)))
> -return directmapoff_to_maddr(va - DIRECTMAP_VIRT_START);
> +return directmapoff_to_maddr(va - XENHEAP_VIRT_START);

Same concern here then.

> @@ -423,3 +424,123 @@ void * __init early_fdt_map(paddr_t fdt_paddr)
>  
>  return fdt_virt;
>  }
> +
> +#ifndef CONFIG_RISCV_32

I'd like to ask that you be more selective with this #ifdef (or omit it
altogether here). setup_mm() itself, for example, looks good for any mode.
Like does ...

> +#define ROUNDDOWN(addr, size)  ((addr) & ~((size) - 1))

... this #define. Then again this macro may better be placed in
xen/macros.h anyway, next to ROUNDUP().

> +/* Map a frame table to cover physical addresses ps through pe */
> +static void __init setup_frametable_mappings(paddr_t ps, paddr_t pe)
> +{
> +paddr_t aligned_ps = ROUNDDOWN(ps, PAGE_SIZE);
> +paddr_t aligned_pe = ROUNDUP(pe, PAGE_SIZE);
> +unsigned long nr_mfns = PFN_DOWN(aligned_pe - aligned_ps);
> +unsigned long frametable_size = nr_mfns * sizeof(struct page_info);

Nit: Better sizeof(*frame_table).

> +mfn_t base_mfn;
> +
> +if ( frametable_size > FRAMETABLE_SIZE )
> +panic("The frametable cannot cover the physical region [%#"PRIpaddr" 
> - %#"PRIpaddr")\n",
> +  ps, pe);

As per prior comments of mine: Imo the message is too verbose (and too
long). "frametable cannot cover [%#"PRIpaddr", %#"PR

Re: [PATCH v1] xen/common: move gic_preinit() to common code

2024-10-30 Thread Jan Beulich
On 30.10.2024 10:50, oleksii.kuroc...@gmail.com wrote:
> On Tue, 2024-10-29 at 17:57 +0100, Jan Beulich wrote:
>> On 29.10.2024 17:47, Oleksii Kurochko wrote:
>>> --- a/xen/common/device.c
>>> +++ b/xen/common/device.c
>>> @@ -4,10 +4,14 @@
>>>    *   xen/arch/arm/device.c
>>>    */
>>>   
>>> +#include 
>>>   #include 
>>>   #include 
>>>   #include 
>>>   #include 
>>> +#include 
>>> +#include 
>>> +#include 
>>
>> I don't think non-ACPI configs should include xen/acpi.h here. Surely
>> this
>> can be moved down into the ACPI-only code section?
> xen/acpi.h was included as acpi_disabled is defined there and it is
> needed in common/device.c for:
> ```
> void __init ic_preinit(void)
> {
>     if ( acpi_disabled )
>         ic_dt_preinit();
>     else
>         ic_acpi_preinit();
> }
> ```

Oh, I see.

Jan



Re: [PATCH] x86/cpu-policy: Extend the guest max policy max leaf/subleaves

2024-10-30 Thread Andrew Cooper
On 30/10/2024 8:59 am, Roger Pau Monné wrote:
> On Tue, Oct 29, 2024 at 05:55:05PM +, Andrew Cooper wrote:
>> diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
>> index b6d9fad56773..78bc9872b09a 100644
>> --- a/xen/arch/x86/cpu-policy.c
>> +++ b/xen/arch/x86/cpu-policy.c
>> @@ -391,6 +391,27 @@ static void __init calculate_host_policy(void)
>>  p->platform_info.cpuid_faulting = cpu_has_cpuid_faulting;
>>  }
>>  
>> +/*
>> + * Guest max policies can have any max leaf/subleaf within bounds.
>> + *
>> + * - Some incoming VMs have a larger-than-necessary feat max_subleaf.
>> + * - Some VMs we'd like to synthesise leaves not present on the host.
>> + */
>> +static void __init guest_common_max_leaves(struct cpu_policy *p)
>> +{
>> +p->basic.max_leaf   = ARRAY_SIZE(p->basic.raw) - 1;
>> +p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
>> +p->extd.max_leaf= 0x8000U + ARRAY_SIZE(p->extd.raw) - 1;
>> +}
>> +
>> +/* Guest default policies inherit the host max leaf/subleaf settings. */
>> +static void __init guest_common_default_leaves(struct cpu_policy *p)
>> +{
>> +p->basic.max_leaf   = host_cpu_policy.basic.max_leaf;
>> +p->feat.max_subleaf = host_cpu_policy.feat.max_subleaf;
>> +p->extd.max_leaf= host_cpu_policy.extd.max_leaf;
>> +}
> I think what I'm going to ask is future work.  After the
> modifications done to the host policy by max functions
> (calculate_{hvm,pv}_max_policy()) won't the max {sub,}leaf adjustments
> better be done taking into account the contents of the policy, rather
> than capping to the host values?
>
> (note this comment is strictly for guest_common_default_leaves(), the
> max version is fine using ARRAY_SIZE).

I'm afraid I don't follow.

calculate_{pv,hvm}_max_policy() don't modify the host policy.

~Andrew



[PATCH] x86/setup: Make setup.h header self contained

2024-10-30 Thread Frediano Ziglio
The header uses the rangeset structure typedef, whose definition
is not included.

Signed-off-by: Frediano Ziglio 
---
 xen/arch/x86/include/asm/setup.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/xen/arch/x86/include/asm/setup.h b/xen/arch/x86/include/asm/setup.h
index 4874ee8936..e4e96b36bd 100644
--- a/xen/arch/x86/include/asm/setup.h
+++ b/xen/arch/x86/include/asm/setup.h
@@ -2,6 +2,7 @@
 #define __X86_SETUP_H_
 
 #include 
+#include 
 #include 
 
 extern const char __2M_text_start[], __2M_text_end[];
-- 
2.34.1




Re: [RFC PATCH 1/6] xen/domctl: Refine grant_opts into grant_version

2024-10-30 Thread Jan Beulich
On 29.10.2024 19:16, Alejandro Vallejo wrote:
> grant_opts is overoptimizing for space packing in a hypercall that
> doesn't warrant the effort. Tweak the ABI without breaking it in order
> to remove the bitfield by extending it to 8 bits.
> 
> Xen only supports little-endian systems, so the transformation from
> uint32_t to uint8_t followed by 3 octets worth of padding is not an ABI
> breakage.
> 
> No functional change
> 
> Signed-off-by: Alejandro Vallejo 
> ---
>  xen/include/public/domctl.h | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)

This isn't a complete patch, is it? I expect it'll break the build without
users of the field also adjusted.

> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -90,11 +90,18 @@ struct xen_domctl_createdomain {
>  int32_t max_grant_frames;
>  int32_t max_maptrack_frames;
>  
> -/* Grant version, use low 4 bits. */
> -#define XEN_DOMCTL_GRANT_version_mask0xf
> -#define XEN_DOMCTL_GRANT_version(v)  ((v) & 
> XEN_DOMCTL_GRANT_version_mask)
> +/*
> + * Maximum grant table version the domain can be configured with.
> + *
> + * Domains always start with v1 (if CONFIG_GRANT_TABLE) and can be bumped
> + * to use up to `max_grant_version` via GNTTABOP_set_version.
> + *
> + * Must be zero iff !CONFIG_GRANT_TABLE.
> + */
> +uint8_t max_grant_version;
>  
> -uint32_t grant_opts;
> +/* Unused */
> +uint8_t rsvd0[3];
>  
>  /*
>   * Enable altp2m mixed mode.

Just to mention it: I think while binary compatible, this is still on the edge
of needing an interface version bump. We may get away without one, as users of
the removed identifiers will still notice by way of observing build failures.

Jan



Re: [PATCH v4 4/6] xen/arm: mpu: Create boot-time MPU protection regions

2024-10-30 Thread Julien Grall

On 30/10/2024 10:08, Luca Fancellu wrote:
> Hi Julien,
> 
>> On 30 Oct 2024, at 09:52, Julien Grall  wrote:
>> 
>> On Wed, 30 Oct 2024 at 09:17, Luca Fancellu  wrote:
>>> 
>>> Hi Ayan,
>>> 
>>> While I rebased the branch on top of your patches, I saw you’ve changed the 
>>> number of regions
>>> mapped at boot time, can I ask why?
>> 
>> I have asked the change. If you look at the layout...
> 
> Apologies, I didn’t see you’ve asked the change

No need to apologies! I think I asked a few revisions ago.

>>> Compared to 
>>> https://patchwork.kernel.org/project/xen-devel/patch/20230626033443.2943270-25-penny.zh...@arm.com/:
>> 
>> ... you have two sections with the same permissions:
>> 
>> xen_mpumap[1] : Xen read-only data
>> xen_mpumap[2] : Xen read-only after init data
>> xen_mpumap[3] : Xen read-write data
>> 
>> During boot, [2] and [3] will share the same permissions. After boot,
>> this will be [1] and [2]. Given the number of MPU regions is limited,
>> this is a bit of a waste.
>> 
>> We also don't want to have a hole in the middle of Xen sections. So
>> folding seemed to be a good idea.
>> 
>>>> +FUNC(enable_boot_cpu_mm)
>>>> +
>>>> +/* Get the number of regions specified in MPUIR_EL2 */
>>>> +mrs   x5, MPUIR_EL2
>>>> +
>>>> +/* x0: region sel */
>>>> +mov   x0, xzr
>>>> +/* Xen text section. */
>>>> +ldr   x1, =_stext
>>>> +ldr   x2, =_etext
>>>> +prepare_xen_region x0, x1, x2, x3, x4, x5, attr_prbar=REGION_TEXT_PRBAR
>>>> +
>>>> +/* Xen read-only data section. */
>>>> +ldr   x1, =_srodata
>>>> +ldr   x2, =_erodata
>>>> +prepare_xen_region x0, x1, x2, x3, x4, x5, attr_prbar=REGION_RO_PRBAR
>>>> +
>>>> +/* Xen read-only after init and data section. (RW data) */
>>>> +ldr   x1, =__ro_after_init_start
>>>> +ldr   x2, =__init_begin
>>>> +prepare_xen_region x0, x1, x2, x3, x4, x5
>>> 
>>>  ^— this, for example, will block Xen to call init_done(void) later, I 
>>> understand this is earlyboot,
>>>    but I guess we don’t want to make subsequent changes to this 
>>> part when introducing the
>>>    patches to support start_xen()
>> 
>> Can you be a bit more descriptive... What will block?
> 
> This call in setup.c:
>  rc = modify_xen_mappings((unsigned long)&__ro_after_init_start,
>   (unsigned long)&__ro_after_init_end,
>   PAGE_HYPERVISOR_RO);
> 
> Cannot work anymore because xen_mpumap[2] is wider than only 
> (__ro_after_init_start, __ro_after_init_end).

Is this because the implementation of modify_xen_mappings() is only able 
to modify a full region? IOW, it would not be able to split regions 
and/or merge them?

> If that is what we want, then we could wrap the above call into something MMU 
> specific that will execute the above call and
> something MPU specific that will modify xen_mpumap[1] from (_srodata, _erodata) 
> to (_srodata, __ro_after_init_end)
> and xen_mpumap[2] from (__ro_after_init_start, __init_begin) to 
> (__ro_after_init_end, __init_begin).

I think it would make sense to have the call mmu/mpu specific. This 
would allow to limit the number of MPU regions used by Xen itself.

The only problem is IIRC the region is not fixed because we will skip 
empty regions during earlyboot.

Cheers,

--
Julien Grall




[PATCH] x86/mm: Use standard C types for sized integers

2024-10-30 Thread Frediano Ziglio
The header is already using these types.

Signed-off-by: Frediano Ziglio 
---
 xen/arch/x86/include/asm/mm.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index 2a837f3d59..71a29b2cb3 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -230,7 +230,7 @@ struct page_info
  * Only valid for: a) free pages, and b) pages with zero type count
  * (except page table pages when the guest is in shadow mode).
  */
-u32 tlbflush_timestamp;
+uint32_t tlbflush_timestamp;
 
 /*
  * When PGT_partial is true then the first two fields are valid and
@@ -284,8 +284,8 @@ struct page_info
  *   in use.
  */
 struct {
-u16 nr_validated_ptes:PAGETABLE_ORDER + 1;
-u16 :16 - PAGETABLE_ORDER - 1 - 1;
+uint16_t nr_validated_ptes:PAGETABLE_ORDER + 1;
+uint16_t :16 - PAGETABLE_ORDER - 1 - 1;
 uint16_t partial_flags:1;
 int16_t linear_pt_count;
 };
-- 
2.34.1




Re: [PATCH v4 4/6] xen/arm: mpu: Create boot-time MPU protection regions

2024-10-30 Thread Luca Fancellu
Hi Julien,

> On 30 Oct 2024, at 10:32, Julien Grall  wrote:
> 
> On 30/10/2024 10:08, Luca Fancellu wrote:
>> Hi Julien,
>>> On 30 Oct 2024, at 09:52, Julien Grall  wrote:
>>> 
>>> On Wed, 30 Oct 2024 at 09:17, Luca Fancellu  wrote:
 
 Hi Ayan,
 
 While I rebased the branch on top of your patches, I saw you’ve changed 
 the number of regions
 mapped at boot time, can I ask why?
>>> 
>>> I have asked the change. If you look at the layout...
>> Apologies, I didn’t see you’ve asked the change
> 
> No need to apologies! I think I asked a few revisions ago.
> 
>>> 
 
 Compared to 
 https://patchwork.kernel.org/project/xen-devel/patch/20230626033443.2943270-25-penny.zh...@arm.com/:
>>> 
>>> 
>>> ... you have two sections with the same permissions:
>>> 
>>> xen_mpumap[1] : Xen read-only data
>>> xen_mpumap[2] : Xen read-only after init data
>>> xen_mpumap[3] : Xen read-write data
>>> 
>>> During boot, [2] and [3] will share the same permissions. After boot,
>>> this will be [1] and [2]. Given the number of MPU regions is limited,
>>> this is a bit of a waste.
>>> 
>>> We also don't want to have a hole in the middle of Xen sections. So
>>> folding seemed to be a good idea.
>>> 
 
> +FUNC(enable_boot_cpu_mm)
> +
> +/* Get the number of regions specified in MPUIR_EL2 */
> +mrs   x5, MPUIR_EL2
> +
> +/* x0: region sel */
> +mov   x0, xzr
> +/* Xen text section. */
> +ldr   x1, =_stext
> +ldr   x2, =_etext
> +prepare_xen_region x0, x1, x2, x3, x4, x5, 
> attr_prbar=REGION_TEXT_PRBAR
> +
> +/* Xen read-only data section. */
> +ldr   x1, =_srodata
> +ldr   x2, =_erodata
> +prepare_xen_region x0, x1, x2, x3, x4, x5, attr_prbar=REGION_RO_PRBAR
> +
> +/* Xen read-only after init and data section. (RW data) */
> +ldr   x1, =__ro_after_init_start
> +ldr   x2, =__init_begin
> +prepare_xen_region x0, x1, x2, x3, x4, x5
 
 ^— this, for example, will block Xen to call init_done(void) 
 later, I understand this is earlyboot,
   but I guess we don’t want to make subsequent changes to this 
 part when introducing the
   patches to support start_xen()
>>> 
>>> Can you be a bit more descriptive... What will block?
>> This call in setup.c:
>> rc = modify_xen_mappings((unsigned long)&__ro_after_init_start,
>>  (unsigned long)&__ro_after_init_end,
>>  PAGE_HYPERVISOR_RO);
>> Cannot work anymore because xen_mpumap[2] is wider than only 
>> (__ro_after_init_start, __ro_after_init_end).
> 
> Is this because the implementation of modify_xen_mappings() is only able to 
> modify a full region? IOW, it would not be able to split regions and/or merge 
> them?

Yes, the code is, at the moment, not smart enough to do that; it will only 
modify a full region.

> 
>> If that is what we want, then we could wrap the above call into something 
>> MMU specific that will execute the above call and
>> something MPU specific that will modify xen_mpumap[1] from (_srodata, 
>> _erodata) to (_srodata, __ro_after_init_end)
>> and xen_mpumap[2] from (__ro_after_init_start, __init_begin) to 
>> (__ro_after_init_end, __init_begin).
> 
> I think it would make sense to have the call mmu/mpu specific. This would 
> allow to limit the number of MPU regions used by Xen itself.
> 
> The only problem is IIRC the region is not fixed because we will skip empty 
> regions during earlyboot.

Yes, but I think we can assume that X(_srodata, _erodata) and 
Y(__ro_after_init_start, __init_begin) won’t ever be empty for Xen?

In that case, the call to mpumap_contain_region() should be able to retrieve 
the full region X and the partial region Y and
a specific function could modify the ranges of both given the respective 
indexes.

Code in my branch: 
https://gitlab.com/xen-project/people/lucafancellu/xen/-/blob/r82_rebased/xen/arch/arm/mpu/mm.c#L382




Re: [PATCH v4 4/6] xen/arm: mpu: Create boot-time MPU protection regions

2024-10-30 Thread Luca Fancellu
Hi Julien,

> On 30 Oct 2024, at 09:52, Julien Grall  wrote:
> 
> On Wed, 30 Oct 2024 at 09:17, Luca Fancellu  wrote:
>> 
>> Hi Ayan,
>> 
>> While I rebased the branch on top of your patches, I saw you’ve changed the 
>> number of regions
>> mapped at boot time, can I ask why?
> 
> I have asked the change. If you look at the layout...

Apologies, I didn’t see you’ve asked the change

> 
>> 
>> Compared to 
>> https://patchwork.kernel.org/project/xen-devel/patch/20230626033443.2943270-25-penny.zh...@arm.com/:
> 
> 
> ... you have two sections with the same permissions:
> 
> xen_mpumap[1] : Xen read-only data
> xen_mpumap[2] : Xen read-only after init data
> xen_mpumap[3] : Xen read-write data
> 
> During boot, [2] and [3] will share the same permissions. After boot,
> this will be [1] and [2]. Given the number of MPU regions is limited,
> this is a bit of a waste.
> 
> We also don't want to have a hole in the middle of Xen sections. So
> folding seemed to be a good idea.
> 
>> 
>>> +FUNC(enable_boot_cpu_mm)
>>> +
>>> +/* Get the number of regions specified in MPUIR_EL2 */
>>> +mrs   x5, MPUIR_EL2
>>> +
>>> +/* x0: region sel */
>>> +mov   x0, xzr
>>> +/* Xen text section. */
>>> +ldr   x1, =_stext
>>> +ldr   x2, =_etext
>>> +prepare_xen_region x0, x1, x2, x3, x4, x5, attr_prbar=REGION_TEXT_PRBAR
>>> +
>>> +/* Xen read-only data section. */
>>> +ldr   x1, =_srodata
>>> +ldr   x2, =_erodata
>>> +prepare_xen_region x0, x1, x2, x3, x4, x5, attr_prbar=REGION_RO_PRBAR
>>> +
>>> +/* Xen read-only after init and data section. (RW data) */
>>> +ldr   x1, =__ro_after_init_start
>>> +ldr   x2, =__init_begin
>>> +prepare_xen_region x0, x1, x2, x3, x4, x5
>> 
>> ^— this, for example, will block Xen to call init_done(void) later, 
>> I understand this is earlyboot,
>>   but I guess we don’t want to make subsequent changes to this 
>> part when introducing the
>>   patches to support start_xen()
> 
> Can you be a bit more descriptive... What will block?

This call in setup.c:
rc = modify_xen_mappings((unsigned long)&__ro_after_init_start,
 (unsigned long)&__ro_after_init_end,
 PAGE_HYPERVISOR_RO);

That call can no longer work, because xen_mpumap[2] is wider than just 
(__ro_after_init_start, __ro_after_init_end).

If that is what we want, then we could wrap the above call in something MMU-specific 
that executes it, plus something MPU-specific that modifies xen_mpumap[1] from 
(_srodata, _erodata) to (_srodata, __ro_after_init_end) and xen_mpumap[2] from 
(__ro_after_init_start, __init_begin) to (__ro_after_init_end, __init_begin).

Please, let me know your thoughts.

Cheers,
Luca



Re: [PATCH v3] x86/io-apic: fix directed EOI when using AMD-Vi interrupt remapping

2024-10-30 Thread Roger Pau Monné
On Wed, Oct 30, 2024 at 10:41:40AM +0100, Jan Beulich wrote:
> On 29.10.2024 18:48, Roger Pau Monné wrote:
> > On Tue, Oct 29, 2024 at 05:43:24PM +0100, Jan Beulich wrote:
> >> On 29.10.2024 12:03, Roger Pau Monne wrote:
> >> Plus with what you said
> >> earlier about vector vs EOI handle, and with the code using "vector" all 
> >> over the
> >> place, their (non-)relationship could also do with clarifying (perhaps 
> >> better in
> >> a code comment in __io_apic_eoi()).
> > 
> > I've attempted to clarify the relation between vector vs EOI handle in
> > the first paragraph, and how that applies to AMD-Vi.  I can move
> > (part?) of that into the comment in __ioapic_write_entry(), maybe:
> > 
> > /*
> >  * Might be called before io_apic_pin_eoi is allocated.  Entry will be
> >  * updated once the array is allocated and there's a write against the
> >  * pin.
> >  *
> >  * Note that the vector field is only cached for raw RTE writes when
> >  * using IR.  In that case the vector field might have been repurposed
> >  * to store something different than the target vector, and hence need
> >  * to be cached for performing EOI.
> >  */
> 
> Sounds okay to me, yet I'd prefer a comment in __io_apic_eoi(), where it
> may want wording a little differently.

OK, let me try to add another comment for __io_apic_eoi() in v4 then.

> >>> @@ -273,6 +293,13 @@ void __ioapic_write_entry(
> >>>  {
> >>>  __io_apic_write(apic, 0x11 + 2 * pin, eu.w2);
> >>>  __io_apic_write(apic, 0x10 + 2 * pin, eu.w1);
> >>> +/*
> >>> + * Called in clear_IO_APIC_pin() before io_apic_pin_eoi is 
> >>> allocated.
> >>> + * Entry will be updated once the array is allocated and there's 
> >>> a
> >>> + * write against the pin.
> >>> + */
> >>> +if ( io_apic_pin_eoi )
> >>> +io_apic_pin_eoi[apic][pin] = e.vector;
> >>
> >> The comment here looks a little misleading to me. clear_IO_APIC_pin() calls
> >> here to, in particular, set the mask bit. With the mask bit the vector 
> >> isn't
> >> meaningful anyway (and indeed clear_IO_APIC_pin() sets it to zero, at which
> >> point recording IRQ_VECTOR_UNASSIGNED might be better than the bogus vector
> >> 0x00).
> > 
> > Note that clear_IO_APIC_pin() performs the call to
> > __ioapic_write_entry() with raw == false, at which point
> > __ioapic_write_entry() will call iommu_update_ire_from_apic() if IOMMU
> > IR is enabled.  The cached 'vector' value will be the IOMMU entry
> > offset for the AMD-Vi case, as the IOMMU code will perform the call to
> > __ioapic_write_entry() with raw == true.
> > 
> > What matters is that the cached value matches what's written in the
> > IO-APIC RTE, and the current logic ensures this.
> > 
> > What's the benefit of using IRQ_VECTOR_UNASSIGNED if the result is
> > reading the RTE and finding that vector == 0?
> 
> It's not specifically the vector == 0 case alone. Shouldn't we leave
> the latched vector alone when writing an RTE with the mask bit set?

I'm not sure what's the benefit of the extra logic to detect such
cases, just to avoid a write to the io_apic_pin_eoi matrix.

> Any still pending EOI (there should be none aiui) can't possibly
> target the meaningless vector / index in such an RTE. Perhaps it was
> wrong to suggest to overwrite (with IRQ_VECTOR_UNASSIGNED) what we
> have on record.
> 
> Yet at the same time there ought to be a case where the recorded
> indeed moves back to IRQ_VECTOR_UNASSIGNED.

The only purpose of the io_apic_pin_eoi matrix is to cache what's
currently in the RTE entry 'vector' field.  I don't think we should
attempt to add extra logic as to whether the entry is valid, or
masked.  Higher level layers should already take care of that.  The
only purpose of the logic added in this patch is to ensure the EOI is
performed using what's in the RTE vector field for the requested pin.
Anything else is out of scope IMO.

Another option, which would allow the matrix to store uint8_t
elements, would be to initialize it at allocation time with the RTE
vector fields currently present, IOW: do a raw read of every RTE and
store the fetched vector field in io_apic_pin_eoi.  Would that seem
better to you, as it also removes the need to ever store
IRQ_VECTOR_UNASSIGNED?

> > Looking at clear_IO_APIC_pin() - I think the function is slightly
> > bogus.  If entry.trigger is not set, the logic to switch the entry to
> > level triggered  will fetch the entry contents without requesting a
> > raw RTE, at which point the entry.vector field can not be used as
> > the EOI handle since it will contain the vector, not the IR table
> > offset.  I will need to make a further patch to fix this corner
> > case.
> 
> Is there actually a reason not to pass IRQ_VECTOR_UNASSIGNED there,
> to have __io_apic_eoi() determine the vector? (But of course we can
> also latch entry.vector from the earlier raw read.)

Yes, it should pass IRQ_VECTOR_UNASSIGNED IMO.  The extra cost of
doing the RTE read is not an issue.

Re: [PATCH v4 4/6] xen/arm: mpu: Create boot-time MPU protection regions

2024-10-30 Thread Julien Grall
On Wed, 30 Oct 2024 at 09:17, Luca Fancellu  wrote:
>
> Hi Ayan,
>
> While I rebased the branch on top of your patches, I saw you’ve changed the 
> number of regions
> mapped at boot time, can I ask why?

I asked for the change. If you look at the layout...

>
> Compared to 
> https://patchwork.kernel.org/project/xen-devel/patch/20230626033443.2943270-25-penny.zh...@arm.com/:


... you have two sections with the same permissions:

xen_mpumap[1] : Xen read-only data
xen_mpumap[2] : Xen read-only after init data
xen_mpumap[3] : Xen read-write data

During boot, [2] and [3] will share the same permissions. After boot,
this will be [1] and [2]. Given the number of MPU regions is limited,
this is a bit of a waste.

We also don't want to have a hole in the middle of Xen sections. So
folding seemed to be a good idea.

>
> > +FUNC(enable_boot_cpu_mm)
> > +
> > +/* Get the number of regions specified in MPUIR_EL2 */
> > +mrs   x5, MPUIR_EL2
> > +
> > +/* x0: region sel */
> > +mov   x0, xzr
> > +/* Xen text section. */
> > +ldr   x1, =_stext
> > +ldr   x2, =_etext
> > +prepare_xen_region x0, x1, x2, x3, x4, x5, attr_prbar=REGION_TEXT_PRBAR
> > +
> > +/* Xen read-only data section. */
> > +ldr   x1, =_srodata
> > +ldr   x2, =_erodata
> > +prepare_xen_region x0, x1, x2, x3, x4, x5, attr_prbar=REGION_RO_PRBAR
> > +
> > +/* Xen read-only after init and data section. (RW data) */
> > +ldr   x1, =__ro_after_init_start
> > +ldr   x2, =__init_begin
> > +prepare_xen_region x0, x1, x2, x3, x4, x5
>
>  ^— this, for example, will block Xen to call init_done(void) later, 
> I understand this is earlyboot,
>but I guess we don’t want to make subsequent changes to this 
> part when introducing the
>patches to support start_xen()

Can you be a bit more descriptive... What will block?

>
> > +
> > +/* Xen code section. */
> > +ldr   x1, =__init_begin
> > +ldr   x2, =__init_data_begin
> > +prepare_xen_region x0, x1, x2, x3, x4, x5, attr_prbar=REGION_TEXT_PRBAR
> > +
> > +/* Xen data and BSS section. */
> > +ldr   x1, =__init_data_begin
> > +ldr   x2, =__bss_end
> > +prepare_xen_region x0, x1, x2, x3, x4, x5
> > +
> > +ret
> > +
> > +END(enable_boot_cpu_mm)
>
> I suggest to keep exactly the regions as are in Penny’s patch.

See above. Without any details on the exact problem, it is difficult
to agree on your suggestion.

Cheers,



Re: [RFC PATCH 0/6] xen/abi: On wide bitfields inside primitive types

2024-10-30 Thread Christian Lindig



> On 29 Oct 2024, at 18:16, Alejandro Vallejo  
> wrote:
> 
> 
> The invariant I'd like to (slowly) introduce and discuss is that fields may
> have bitflags (e.g: a packed array of booleans indexed by some enumerated
> type), but not be mixed with wider fields in the same primitive type. This
> ensures any field containing an integer of any kind can be referred by pointer
> and treated the same way as any other with regards to sizeof() and the like.

Acked-by: Christian Lindig 


Fine with me but the OCaml part is not very exposed to this.

— C


Re: [PATCH v3] x86/io-apic: fix directed EOI when using AMD-Vi interrupt remapping

2024-10-30 Thread Jan Beulich
On 30.10.2024 11:09, Roger Pau Monné wrote:
> On Wed, Oct 30, 2024 at 10:41:40AM +0100, Jan Beulich wrote:
>> On 29.10.2024 18:48, Roger Pau Monné wrote:
>>> On Tue, Oct 29, 2024 at 05:43:24PM +0100, Jan Beulich wrote:
 On 29.10.2024 12:03, Roger Pau Monne wrote:
> @@ -273,6 +293,13 @@ void __ioapic_write_entry(
>  {
>  __io_apic_write(apic, 0x11 + 2 * pin, eu.w2);
>  __io_apic_write(apic, 0x10 + 2 * pin, eu.w1);
> +/*
> + * Called in clear_IO_APIC_pin() before io_apic_pin_eoi is 
> allocated.
> + * Entry will be updated once the array is allocated and there's 
> a
> + * write against the pin.
> + */
> +if ( io_apic_pin_eoi )
> +io_apic_pin_eoi[apic][pin] = e.vector;

 The comment here looks a little misleading to me. clear_IO_APIC_pin() calls
 here to, in particular, set the mask bit. With the mask bit the vector 
 isn't
 meaningful anyway (and indeed clear_IO_APIC_pin() sets it to zero, at which
 point recording IRQ_VECTOR_UNASSIGNED might be better than the bogus vector
 0x00).
>>>
>>> Note that clear_IO_APIC_pin() performs the call to
>>> __ioapic_write_entry() with raw == false, at which point
>>> __ioapic_write_entry() will call iommu_update_ire_from_apic() if IOMMU
>>> IR is enabled.  The cached 'vector' value will be the IOMMU entry
>>> offset for the AMD-Vi case, as the IOMMU code will perform the call to
>>> __ioapic_write_entry() with raw == true.
>>>
>>> What matters is that the cached value matches what's written in the
>>> IO-APIC RTE, and the current logic ensures this.
>>>
>>> What's the benefit of using IRQ_VECTOR_UNASSIGNED if the result is
>>> reading the RTE and finding that vector == 0?
>>
>> It's not specifically the vector == 0 case alone. Shouldn't we leave
>> the latched vector alone when writing an RTE with the mask bit set?
> 
> I'm not sure what's the benefit of the extra logic to detect such
> cases, just to avoid a write to the io_apic_pin_eoi matrix.

Perhaps the largely theoretical concern towards having stale data
somewhere. Yet ...

>> Any still pending EOI (there should be none aiui) can't possibly
>> target the meaningless vector / index in such an RTE. Perhaps it was
>> wrong to suggest to overwrite (with IRQ_VECTOR_UNASSIGNED) what we
>> have on record.
>>
>> Yet at the same time there ought to be a case where the recorded
>> indeed moves back to IRQ_VECTOR_UNASSIGNED.
> 
> The only purpose of the io_apic_pin_eoi matrix is to cache what's
> currently in the RTE entry 'vector' field.  I don't think we should
> attempt to add extra logic as to whether the entry is valid, or
> masked.  Higher level layers should already take care of that.  The
> only purpose of the logic added in this patch is to ensure the EOI is
> performed using what's in the RTE vector field for the requested pin.
> Anything else is out of scope IMO.
> 
> Another option, which would allow to make the matrix store uint8_t
> elements would be to initialize it at allocation with the RTE vector
> fields currently present, IOW: do a raw read of every RTE and set the
> fetched vector field in io_apic_pin_eoi.  Would that be better to you,
> as also removing the need to ever store IRQ_VECTOR_UNASSIGNED?

... yes, that may make sense (and eliminate my concern there).

I wonder whether the allocation of the array then wouldn't better be
moved earlier, to enable_IO_APIC(), such that clear_IO_APIC_pin()
already can suitably update it. In fact, since that function writes
zero[1], no extra reads would then be needed at all, and the array could
simply start out all zeroed.

Jan

[1] With the exception of RTEs saying SMI, where - for having fully
correct data on record - we may then need to update the array slot.



Re: [PATCH] x86/setup: Make setup.h header self contained

2024-10-30 Thread Jan Beulich
On 30.10.2024 11:44, Frediano Ziglio wrote:
> The header uses rangeset structure typedef which definition
> is not included.

And it doesn't need to be. For

int remove_xen_ranges(struct rangeset *r);

we don't need ...

> --- a/xen/arch/x86/include/asm/setup.h
> +++ b/xen/arch/x86/include/asm/setup.h
> @@ -2,6 +2,7 @@
>  #define __X86_SETUP_H_
>  
>  #include 
> +#include 
>  #include 
>  
>  extern const char __2M_text_start[], __2M_text_end[];

... this, a mere

struct rangeset;

forward decl will suffice.

Jan




Re: [PATCH] x86/cpu-policy: Extend the guest max policy max leaf/subleaves

2024-10-30 Thread Roger Pau Monné
On Wed, Oct 30, 2024 at 10:39:12AM +, Andrew Cooper wrote:
> On 30/10/2024 8:59 am, Roger Pau Monné wrote:
> > On Tue, Oct 29, 2024 at 05:55:05PM +, Andrew Cooper wrote:
> >> diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
> >> index b6d9fad56773..78bc9872b09a 100644
> >> --- a/xen/arch/x86/cpu-policy.c
> >> +++ b/xen/arch/x86/cpu-policy.c
> >> @@ -391,6 +391,27 @@ static void __init calculate_host_policy(void)
> >>  p->platform_info.cpuid_faulting = cpu_has_cpuid_faulting;
> >>  }
> >>  
> >> +/*
> >> + * Guest max policies can have any max leaf/subleaf within bounds.
> >> + *
> >> + * - Some incoming VMs have a larger-than-necessary feat max_subleaf.
> >> + * - Some VMs we'd like to synthesise leaves not present on the host.
> >> + */
> >> +static void __init guest_common_max_leaves(struct cpu_policy *p)
> >> +{
> >> +p->basic.max_leaf   = ARRAY_SIZE(p->basic.raw) - 1;
> >> +p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
> >> +p->extd.max_leaf= 0x8000U + ARRAY_SIZE(p->extd.raw) - 1;
> >> +}
> >> +
> >> +/* Guest default policies inherit the host max leaf/subleaf settings. */
> >> +static void __init guest_common_default_leaves(struct cpu_policy *p)
> >> +{
> >> +p->basic.max_leaf   = host_cpu_policy.basic.max_leaf;
> >> +p->feat.max_subleaf = host_cpu_policy.feat.max_subleaf;
> >> +p->extd.max_leaf= host_cpu_policy.extd.max_leaf;
> >> +}
> > I think this what I'm going to ask is future work.  After the
> > modifications done to the host policy by max functions
> > (calculate_{hvm,pv}_max_policy()) won't the max {sub,}leaf adjustments
> > better be done taking into account the contents of the policy, rather
> > than capping to the host values?
> >
> > (note this comment is strictly for guest_common_default_leaves(), the
> > max version is fine using ARRAY_SIZE).
> 
> I'm afraid I don't follow.
> 
> calculate_{pv,hvm}_max_policy() don't modify the host policy.

Hm, I don't think I've expressed myself clearly, sorry.  Let me try
again.

calculate_{hvm,pv}_max_policy() extends the host policy by possibly
setting new features, and such extended policy is then used as the
base for the PV/HVM default policies.

Won't the resulting policy in calculate_{hvm,pv}_def_policy() risks
having bits set past the max {sub,}leaf in the host policy, as it's
based in {hvm,pv}_def_cpu_policy that might have such bits set?

Thanks, Roger.



Re: [PATCH] x86/mm: Use standard C types for sized integers

2024-10-30 Thread Frediano Ziglio
On Wed, Oct 30, 2024 at 11:02 AM Jan Beulich  wrote:
>
> On 30.10.2024 11:44, Frediano Ziglio wrote:
> > The header is already using these types.
> >
> > Signed-off-by: Frediano Ziglio 
>
> Acked-by: Jan Beulich 
>

Thanks

> Nevertheless I wonder whether ...
>
> > --- a/xen/arch/x86/include/asm/mm.h
> > +++ b/xen/arch/x86/include/asm/mm.h
> > @@ -230,7 +230,7 @@ struct page_info
> >   * Only valid for: a) free pages, and b) pages with zero type count
> >   * (except page table pages when the guest is in shadow mode).
> >   */
> > -u32 tlbflush_timestamp;
> > +uint32_t tlbflush_timestamp;
> >
> >  /*
> >   * When PGT_partial is true then the first two fields are valid and
> > @@ -284,8 +284,8 @@ struct page_info
> >   *   in use.
> >   */
> >  struct {
> > -u16 nr_validated_ptes:PAGETABLE_ORDER + 1;
> > -u16 :16 - PAGETABLE_ORDER - 1 - 1;
> > +uint16_t nr_validated_ptes:PAGETABLE_ORDER + 1;
> > +uint16_t :16 - PAGETABLE_ORDER - 1 - 1;
> >  uint16_t partial_flags:1;
>
> ... fixed width types are really needed here; afaict unsigned int ought to do.
>

Not for gcc/clang. Other compilers (like the MS one) differ... in our
case this is not a public header and we only support gcc/clang, so
unsigned int would behave the same.

> >  int16_t linear_pt_count;
>
> It's only here where the fixed width type is largely needed (or alternatively
>
> signed int linear_pt_count:16;
>

That would be different: compilers do not allow taking the address of a bit-field.

> ).
>
> Jan

Frediano



Re: [PATCH] x86/mm: Use standard C types for sized integers

2024-10-30 Thread Jan Beulich
On 30.10.2024 12:10, Frediano Ziglio wrote:
> On Wed, Oct 30, 2024 at 11:02 AM Jan Beulich  wrote:
>> It's only here where the fixed width type is largely needed (or alternatively
>>
>> signed int linear_pt_count:16;
> 
> That would be different. Compilers do not allow to take addresses of 
> bit-fields.

Oh, right, I forgot we take the address of this field in a few places.

Jan



Re: [PATCH] x86/setup: Make setup.h header self contained

2024-10-30 Thread Frediano Ziglio
On Wed, Oct 30, 2024 at 10:59 AM Jan Beulich  wrote:
>
> On 30.10.2024 11:44, Frediano Ziglio wrote:
> > The header uses rangeset structure typedef which definition
> > is not included.
>
> And it doesn't need to be. For
>
> int remove_xen_ranges(struct rangeset *r);
>
> we don't need ...
>
> > --- a/xen/arch/x86/include/asm/setup.h
> > +++ b/xen/arch/x86/include/asm/setup.h
> > @@ -2,6 +2,7 @@
> >  #define __X86_SETUP_H_
> >
> >  #include 
> > +#include 
> >  #include 
> >
> >  extern const char __2M_text_start[], __2M_text_end[];
>
> ... this, a mere
>
> struct rangeset;
>
> forward decl will suffice.
>
> Jan
>

That's true, but by the same reasoning we could avoid including
"xen/multiboot.h" and use "struct module" instead of "module_t".

Frediano



Re: [PATCH v7 04/10] tools/hvmloader: Retrieve (x2)APIC IDs from the APs themselves

2024-10-30 Thread Andrew Cooper
On 21/10/2024 4:45 pm, Alejandro Vallejo wrote:
> diff --git a/tools/firmware/hvmloader/config.h 
> b/tools/firmware/hvmloader/config.h
> index cd716bf39245..04cab1e59f08 100644
> --- a/tools/firmware/hvmloader/config.h
> +++ b/tools/firmware/hvmloader/config.h
> @@ -4,6 +4,8 @@
>  #include 
>  #include 
>  
> +#include 
> +
>  enum virtual_vga { VGA_none, VGA_std, VGA_cirrus, VGA_pt };
>  extern enum virtual_vga virtual_vga;
>  
> @@ -48,8 +50,9 @@ extern uint8_t ioapic_version;
>  
>  #define IOAPIC_ID   0x01
>  
> +extern uint32_t cpu_to_x2apicid[HVM_MAX_VCPUS];

Just cpu_to_apic_id[] please.   The distinction between x or x2 isn't
interesting here.

HVM_MAX_VCPUS is a constant that should never have existed in the first
place, *and* it's the limit we're looking to finally break when this
series is accepted.

This array needs to be hvm_info->nr_vcpus entries long, and will want to
be more than 128 entries very soon.  Just scratch_alloc() the array. 
Then you can avoid the include.

> diff --git a/tools/firmware/hvmloader/mp_tables.c 
> b/tools/firmware/hvmloader/mp_tables.c
> index 77d3010406d0..539260365e1e 100644
> --- a/tools/firmware/hvmloader/mp_tables.c
> +++ b/tools/firmware/hvmloader/mp_tables.c
> @@ -198,8 +198,10 @@ static void fill_mp_config_table(struct mp_config_table 
> *mpct, int length)
>  /* fills in an MP processor entry for VCPU 'vcpu_id' */
>  static void fill_mp_proc_entry(struct mp_proc_entry *mppe, int vcpu_id)
>  {
> +ASSERT(cpu_to_x2apicid[vcpu_id] < 0xFF );

This is just going to break when we hit 256 vCPUs in a VM.

What do real systems do?

They'll either wrap around 255 like the CPUID xAPIC_ID does, or they'll
not write out MP tables at all.

> diff --git a/tools/firmware/hvmloader/smp.c b/tools/firmware/hvmloader/smp.c
> index 1b940cefd071..d63536f14f00 100644
> --- a/tools/firmware/hvmloader/smp.c
> +++ b/tools/firmware/hvmloader/smp.c
> @@ -90,10 +120,11 @@ static void boot_cpu(unsigned int cpu)
>  BUG();
>  
>  /*
> - * Wait for the secondary processor to complete initialisation.
> + * Wait for the secondary processor to complete initialisation,
> + * which is signaled by its x2APIC ID being written to the LUT.

Technically all arrays are a lookup table, but I'm not sure LUT is a
common enough term to be used unqualified like this.

Just say "... signalled by writing its APIC_ID out."  The where is very
apparent by the code.

> @@ -104,6 +135,12 @@ static void boot_cpu(unsigned int cpu)
>  void smp_initialise(void)
>  {
>  unsigned int i, nr_cpus = hvm_info->nr_vcpus;
> +uint32_t ecx;
> +
> +cpuid(1, NULL, NULL, &ecx, NULL);
> +has_x2apic = (ecx >> 21) & 1;
> +if ( has_x2apic )
> +printf("x2APIC supported\n");

You need to check max_leaf >= 0xb too.  Remember Xen might not give you
leaf 0xb yet, and then you'll hit the assert for finding 0.

And has_x2apic wants to be a simple boolean.  Nothing good can come from
confusing -1 with "x2apic available".


I recommend splitting this patch into three.  Several aspects are quite
subtle.

1) Collect the APIC_IDs on APs
2) Change how callin is signalled.
3) Replace LAPIC_ID() with the collected apic_id.

but AFAICT, it can be done as a standalone series, independently of the
other Xen/toolstack work.

~Andrew



Re: Xen 4.20 release schedule

2024-10-30 Thread Andrew Cooper
On 21/10/2024 1:02 pm, oleksii.kuroc...@gmail.com wrote:
> Hello everyone,
>
> As there were no objections to the proposed release schedule
> (https://lore.kernel.org/xen-devel/CAMacjJxEi6PThwH2=nwg3he8eqn39aiaxzcw3bqf7i4ycmj...@mail.gmail.com/
> ), I've updated the wiki with the schedule for Xen 4.20 release
> (https://wiki.xenproject.org/wiki/Xen_Project_X.YY_Release_Notes), and
> it is now accessible from
> https://xenbits.xen.org/docs/unstable-staging/support-matrix.html.

I have a blocker to raise (against myself...) and no good idea of how to
proceed.

The for_each_bit work has a unexpected bug.

    for_each_bit ( ... )
    {
    if ( ... )
            break;
    }

will fall into an infinite loop.  This is caused by for_each_bit()
hiding a double for() loop, in order to declare two scope-local
variables of different types.

The two variables are one copy of the source expression (really quite
important to keep), and one unsigned int iterator (improved optimisation
capability by not using a wider-scope variable).

Options are (off the top of my head)

1) Always take the iterator from outer scope
2) Iterator always the same type as the source expression
3) Figure out some way of expressing "once" in the outer loop

Or anything else that I've missed.

~Andrew



[QUESTION] tools/xenstored: Best way to proceed with the protocol modification

2024-10-30 Thread Andrii Sultanov
Hello,

(CC-ing Jürgen as the original author of the xenstored partial directory
patches:
https://lore.kernel.org/xen-devel/20161205074853.13268-1-jgr...@suse.com/)

I'm investigating implementing XS_DIRECTORY_PART support in Oxenstored, and
have come by a possible issue - the protocol specifies that the 'offset'
parameter for each call is a "byte offset into the list of children", and
so should be calculated on the user side. This makes sense for the C side
as children are stored in a single char array separated by null characters,
but OCaml stores children in a different structure, which can't be indexed
into this way (but is searched more efficiently, etc.)

What's the best way to proceed here?

Could the protocol be redefined to:
1) turn the 'offset' into an opaque id that needs to be re-sent as-is on
the next iteration? (it would keep being interpreted as an index into an
array on the C side, and as something else on the OCaml side)
2) return the opaque 'offset' on each call alongside the string and
generation id so that it wouldn't be calculated from strlen() on the user
side anymore?

Thank you,
Andrii


[PATCH v2] x86/setup: Make setup.h header self contained

2024-10-30 Thread Frediano Ziglio
The header uses rangeset structure pointer.
Forward declare the structure.

Signed-off-by: Frediano Ziglio 
---
Changes since v1:
- use structure forward declaration instead of including header.
---
 xen/arch/x86/include/asm/setup.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/xen/arch/x86/include/asm/setup.h b/xen/arch/x86/include/asm/setup.h
index 4874ee8936..d7ed4f4002 100644
--- a/xen/arch/x86/include/asm/setup.h
+++ b/xen/arch/x86/include/asm/setup.h
@@ -43,6 +43,7 @@ void *bootstrap_map_bm(const struct boot_module *bm);
 void *bootstrap_map(const module_t *mod);
 void bootstrap_unmap(void);
 
+struct rangeset;
 int remove_xen_ranges(struct rangeset *r);
 
 int cf_check stub_selftest(void);
-- 
2.34.1




Re: [PATCH] x86/cpu-policy: Extend the guest max policy max leaf/subleaves

2024-10-30 Thread Roger Pau Monné
On Wed, Oct 30, 2024 at 02:45:19PM +, Andrew Cooper wrote:
> On 30/10/2024 11:03 am, Roger Pau Monné wrote:
> > On Wed, Oct 30, 2024 at 10:39:12AM +, Andrew Cooper wrote:
> >> On 30/10/2024 8:59 am, Roger Pau Monné wrote:
> >>> On Tue, Oct 29, 2024 at 05:55:05PM +, Andrew Cooper wrote:
>  diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
>  index b6d9fad56773..78bc9872b09a 100644
>  --- a/xen/arch/x86/cpu-policy.c
>  +++ b/xen/arch/x86/cpu-policy.c
>  @@ -391,6 +391,27 @@ static void __init calculate_host_policy(void)
>   p->platform_info.cpuid_faulting = cpu_has_cpuid_faulting;
>   }
>   
>  +/*
>  + * Guest max policies can have any max leaf/subleaf within bounds.
>  + *
>  + * - Some incoming VMs have a larger-than-necessary feat max_subleaf.
>  + * - Some VMs we'd like to synthesise leaves not present on the host.
>  + */
>  +static void __init guest_common_max_leaves(struct cpu_policy *p)
>  +{
>  +p->basic.max_leaf   = ARRAY_SIZE(p->basic.raw) - 1;
>  +p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
>  +p->extd.max_leaf= 0x8000U + ARRAY_SIZE(p->extd.raw) - 1;
>  +}
>  +
>  +/* Guest default policies inherit the host max leaf/subleaf settings. */
>  +static void __init guest_common_default_leaves(struct cpu_policy *p)
>  +{
>  +p->basic.max_leaf   = host_cpu_policy.basic.max_leaf;
>  +p->feat.max_subleaf = host_cpu_policy.feat.max_subleaf;
>  +p->extd.max_leaf= host_cpu_policy.extd.max_leaf;
>  +}
> >>> I think this what I'm going to ask is future work.  After the
> >>> modifications done to the host policy by max functions
> >>> (calculate_{hvm,pv}_max_policy()) won't the max {sub,}leaf adjustments
> >>> better be done taking into account the contents of the policy, rather
> >>> than capping to the host values?
> >>>
> >>> (note this comment is strictly for guest_common_default_leaves(), the
> >>> max version is fine using ARRAY_SIZE).
> >> I'm afraid I don't follow.
> >>
> >> calculate_{pv,hvm}_max_policy() don't modify the host policy.
> > Hm, I don't think I've expressed myself clearly, sorry.  Let me try
> > again.
> >
> > calculate_{hvm,pv}_max_policy() extends the host policy by possibly
> > setting new features, and such extended policy is then used as the
> > base for the PV/HVM default policies.
> >
> > Won't the resulting policy in calculate_{hvm,pv}_def_policy() risks
> > having bits set past the max {sub,}leaf in the host policy, as it's
> > based in {hvm,pv}_def_cpu_policy that might have such bits set?
> 
> Oh, right.
> 
> This patch doesn't change anything WRT that.

Indeed, didn't intend my comment to block it, just that I think at
some point the logic in guest_common_default_leaves() will need to be
expanded.

> But I think you're right that we do risk getting into that case (in
> principle at least) because of how guest_common_*_feature_adjustment() work.
> 
> Furthermore, the bug will typically get hidden because we serialise
> based on the max_leaf/subleaf, and will discard feature words outside of
> the max_leaf/subleaf bounds.

Yes, once we serialize it for toolstack consumption the leafs will be
implicitly zeroed.

> I suppose we probably want a variation of x86_cpu_featureset_to_policy()
> which extends the max_leaf/subleaf based on non-zero values in leaves. 
> (This already feels like it's going to be an ugly algorithm.)

Hm, I was thinking that we would need to adjust
guest_common_default_leaves() to properly shrink the max {sub,}leaf
fields from the max policies.

Thanks, Roger.



Re: [RFC PATCH 0/6] xen/abi: On wide bitfields inside primitive types

2024-10-30 Thread Alejandro Vallejo


In the course of preparing this answer I noticed that altp2m_opts suffers
from the exact same annoyance, with the exact same fix; I spotted it while
rebasing my Rust branch.

On Wed Oct 30, 2024 at 9:14 AM GMT, Jan Beulich wrote:
> On 29.10.2024 19:16, Alejandro Vallejo wrote:
> > Non-boolean bitfields in the hypercall ABI make it fairly inconvenient to
> > create bindings for any language because (a) they are always ad-hoc and are
> > subject to restrictions regular fields are not (b) require boilerplate that
> > regular fields do not and (c) might not even be part of the core language,
> > forcing avoidable external libraries into any sort of generic library.
> > 
> > This patch (it's a series merely to split roughly by maintainer) is one such
> > case that I happened to spot while playing around. It's the grant_version
> > field, buried under an otherwise empty grant_opts.
> > 
> > The invariant I'd like to (slowly) introduce and discuss is that fields may
> > have bitflags (e.g: a packed array of booleans indexed by some enumerated
> > type), but not be mixed with wider fields in the same primitive type. This
> > ensures any field containing an integer of any kind can be referred by 
> > pointer
> > and treated the same way as any other with regards to sizeof() and the like.
>
> While I don't strictly mind, I'm also not really seeing why taking addresses
> or applying sizeof() would be commonly necessary. Can you perhaps provide a
> concrete example of where the present way of dealing with grant max version
> is getting in the way? After all your use of the term "bitfield" doesn't
> really mean C's understanding of it, so especially (c) above escapes me to a
> fair degree.

Wall of text ahead, but I'll try to stay on point. The rationale should become
a lot clearer after I send an RFC series with initial code to autogenerate some
hypercall payloads from markup. The biggest question is: Can I create a
definition language such that (a) it precisely represents the Xen ABI and (b)
is fully type-safe under modern strongly-typed languages?

I already have a backbone I can define the ABI in, so my options when I hit
some impedance mismatch are:

  1. Change the ABI so it matches better my means of defining it.
  2. Change the means to define so it captures the existing ABI better.

Most of the work I've done has moved in the (2) direction so far, but I found a
number of pain points when mapping the existing ABI to Rust that, while not
impossible to work around, are quite annoying for no clear benefit. If
possible, I'd like to simplify the cognitive load involved in defining, using
and updating hypercalls rather than bending over backwards to support a
construct that provides no real benefit. IOW: If I can define an ABI that is
_simpler_, it follows that it's also easier to not make mistakes and it's
easier to generate code for it.

The use of packed fields is one such case. Even in C, we create extra macros
for creating a field, modifying it, fetching it, etc. Patches 2-6 are strict
code removals. And even in the most extreme cases the space savings are largely
irrelevant because the hypercall has a fixed size. We do want to pack _flags_
as otherwise the payload size would explode pretty quickly on hypercalls with
tons of boolean options, but I'm not aware of that being problematic for wider
subfields (like the grant max version).

Now, being more concrete...

##
# IDL is simpler if the size is a property of the type
##

Consider the definition of the (new) max_grant_version type under the IDL I'm
working on (it's TOML, but I don't particularly care about which markup we end
up using).

  [[enums]]
  name = "xen_domaincreate_max_grant_version"
  description = "Content of the `max_grant_version` field of the domain 
creation hypercall."
  typ = { tag = "u8" }

  [[enums.variants]]
  name = "off"
  description = "Must be used with gnttab support compiled out"
  value = 0

  [[enums.variants]]
  name = "v1"
  description = "Allow the domain to use up to gnttab_v1"
  value = 1

  [[enums.variants]]
  name = "v2"
  description = "Allow the domain to use up to gnttab_v2"
  value = 2

Note that I can define a type being enumerated, can choose its specific
variants and its width is a property of the type itself. With bitfields you're
always in a weird position of the width not being part of the type that goes
into it.

Should I need it as a field somewhere, then...

  [[structs.fields]]
  name = "max_grant_version"
  description = "Maximum grant table version the domain may be bumped to"
  typ = { tag = "enum", args = "xen_domaincreate_max_grant_version" }

... at which point the size of the field is given by an intrinsic property of
the type (the typ property on the enums table) I previously defined. It's
extensible, composable and allows me to generate readable code in both C and

Re: [RFC PATCH 1/6] xen/domctl: Refine grant_opts into grant_version

2024-10-30 Thread Alejandro Vallejo
Hi,

On Wed Oct 30, 2024 at 9:08 AM GMT, Jan Beulich wrote:
> On 29.10.2024 19:16, Alejandro Vallejo wrote:
> > grant_opts is overoptimizing for space packing in a hypercall that
> > doesn't warrant the effort. Tweak the ABI without breaking it in order
> > to remove the bitfield by extending it to 8 bits.
> > 
> > Xen only supports little-endian systems, so the transformation from
> > uint32_t to uint8_t followed by 3 octets worth of padding is not an ABI
> > breakage.
> > 
> > No functional change
> > 
> > Signed-off-by: Alejandro Vallejo 
> > ---
> >  xen/include/public/domctl.h | 15 +++
> >  1 file changed, 11 insertions(+), 4 deletions(-)
>
> This isn't a complete patch, is it? I expect it'll break the build without
> users of the field also adjusted.

Indeed. The non-RFC version would have everything folded in one. I just wanted
to avoid Cc-ing everyone in MAINTAINERS for the same single RFC patch. It's
split by (rough) maintained area.

>
> > --- a/xen/include/public/domctl.h
> > +++ b/xen/include/public/domctl.h
> > @@ -90,11 +90,18 @@ struct xen_domctl_createdomain {
> >  int32_t max_grant_frames;
> >  int32_t max_maptrack_frames;
> >  
> > -/* Grant version, use low 4 bits. */
> > -#define XEN_DOMCTL_GRANT_version_mask0xf
> > -#define XEN_DOMCTL_GRANT_version(v)  ((v) & 
> > XEN_DOMCTL_GRANT_version_mask)
> > +/*
> > + * Maximum grant table version the domain can be configured with.
> > + *
> > + * Domains always start with v1 (if CONFIG_GRANT_TABLE) and can be 
> > bumped
> > + * to use up to `max_grant_version` via GNTTABOP_set_version.
> > + *
> > + * Must be zero iff !CONFIG_GRANT_TABLE.
> > + */
> > +uint8_t max_grant_version;
> >  
> > -uint32_t grant_opts;
> > +/* Unused */
> > +uint8_t rsvd0[3];
> >  
> >  /*
> >   * Enable altp2m mixed mode.
>
> Just to mention it: I think while binary compatible, this is still on the edge
> of needing an interface version bump. We may get away without as users of the
> removed identifiers will still notice by way of observing build failures.
>
> Jan

If users are forced to rebuild either way, might as well prevent existing
binaries from breaking. There ought to be a strict distinction between ABI and
API compatibility because, while they typically move in lockstep, they don't
always (and this is one such example).

Regardless, this is a discussion for the final patch if we get there.

Cheers,
Alejandro



Re: [PATCH] x86/cpu-policy: Extend the guest max policy max leaf/subleaves

2024-10-30 Thread Roger Pau Monné
On Wed, Oct 30, 2024 at 03:59:12PM +, Alejandro Vallejo wrote:
> On Wed Oct 30, 2024 at 3:13 PM GMT, Roger Pau Monné wrote:
> > On Wed, Oct 30, 2024 at 02:45:19PM +, Andrew Cooper wrote:
> > > On 30/10/2024 11:03 am, Roger Pau Monné wrote:
> > > > On Wed, Oct 30, 2024 at 10:39:12AM +, Andrew Cooper wrote:
> > > >> On 30/10/2024 8:59 am, Roger Pau Monné wrote:
> > > >>> On Tue, Oct 29, 2024 at 05:55:05PM +, Andrew Cooper wrote:
> > >  diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
> > >  index b6d9fad56773..78bc9872b09a 100644
> > >  --- a/xen/arch/x86/cpu-policy.c
> > >  +++ b/xen/arch/x86/cpu-policy.c
> > >  @@ -391,6 +391,27 @@ static void __init calculate_host_policy(void)
> > >   p->platform_info.cpuid_faulting = cpu_has_cpuid_faulting;
> > >   }
> > >   
> > >  +/*
> > >  + * Guest max policies can have any max leaf/subleaf within bounds.
> > >  + *
> > >  + * - Some incoming VMs have a larger-than-necessary feat 
> > >  max_subleaf.
> > >  + * - Some VMs we'd like to synthesise leaves not present on the 
> > >  host.
> > >  + */
> > >  +static void __init guest_common_max_leaves(struct cpu_policy *p)
> > >  +{
> > >  +p->basic.max_leaf   = ARRAY_SIZE(p->basic.raw) - 1;
> > >  +p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
> > >  +p->extd.max_leaf= 0x8000U + ARRAY_SIZE(p->extd.raw) 
> > >  - 1;
> > >  +}
> > >  +
> > >  +/* Guest default policies inherit the host max leaf/subleaf 
> > >  settings. */
> > >  +static void __init guest_common_default_leaves(struct cpu_policy *p)
> > >  +{
> > >  +p->basic.max_leaf   = host_cpu_policy.basic.max_leaf;
> > >  +p->feat.max_subleaf = host_cpu_policy.feat.max_subleaf;
> > >  +p->extd.max_leaf= host_cpu_policy.extd.max_leaf;
> > >  +}
> > > >>> I think this what I'm going to ask is future work.  After the
> > > >>> modifications done to the host policy by max functions
> > > >>> (calculate_{hvm,pv}_max_policy()) won't the max {sub,}leaf adjustments
> > > >>> better be done taking into account the contents of the policy, rather
> > > >>> than capping to the host values?
> > > >>>
> > > >>> (note this comment is strictly for guest_common_default_leaves(), the
> > > >>> max version is fine using ARRAY_SIZE).
> > > >> I'm afraid I don't follow.
> > > >>
> > > >> calculate_{pv,hvm}_max_policy() don't modify the host policy.
> > > > Hm, I don't think I've expressed myself clearly, sorry.  Let me try
> > > > again.
> > > >
> > > > calculate_{hvm,pv}_max_policy() extends the host policy by possibly
> > > > setting new features, and such extended policy is then used as the
> > > > base for the PV/HVM default policies.
> > > >
> > > > Won't the resulting policy in calculate_{hvm,pv}_def_policy() risks
> > > > having bits set past the max {sub,}leaf in the host policy, as it's
> > > > based in {hvm,pv}_def_cpu_policy that might have such bits set?
> > > 
> > > Oh, right.
> > > 
> > > This patch doesn't change anything WRT that.
> >
> > Indeed, didn't intend my comment to block it, just that I think at
> > some point the logic in guest_common_default_leaves() will need to be
> > expanded.
> >
> > > But I think you're right that we do risk getting into that case (in
> > > principle at least) because of how guest_common_*_feature_adjustment() 
> > > work.
> > > 
> > > Furthermore, the bug will typically get hidden because we serialise
> > > based on the max_leaf/subleaf, and will discard feature words outside of
> > > the max_leaf/subleaf bounds.
> >
> > Yes, once we serialize it for toolstack consumption the leafs will be
> > implicitly zeroed.
> >
> > > I suppose we probably want a variation of x86_cpu_featureset_to_policy()
> > > which extends the max_leaf/subleaf based on non-zero values in leaves. 
> > > (This already feels like it's going to be an ugly algorithm.)
> >
> > Hm, I was thinking that we would need to adjust
> > guest_common_default_leaves() to properly shrink the max {sub,}leaf
> > fields from the max policies.
> 
> That would be tricky in case we end up with subleafs that are strictly
> populated at runtime.

Maybe we need a way to expose {sub,}leaf minimum value requirements in
the gen-cpuid.py logic, so we can tie minimum required {sub,}leaf
values to features?

I would like to think that those run-time populated leafs will be tied
to features, as to have a way to account for them.

> Xen would have no way of knowing whether that's meant to
> be implemented or not. It seems safer to raise the max if we find a non-zero
> leaves higher than the current max.

Raising might be better, TBH I didn't give the exact solution much
thought.  But we need to be aware that setting it to the host value is
likely something we should look into fixing, otherwise subtle bugs
might occur.

Maybe add a comment to guest_common_default_leaves() in 

Re: [PATCH] x86/cpu-policy: Extend the guest max policy max leaf/subleaves

2024-10-30 Thread Andrew Cooper
On 30/10/2024 3:13 pm, Roger Pau Monné wrote:
> On Wed, Oct 30, 2024 at 02:45:19PM +, Andrew Cooper wrote:
>> On 30/10/2024 11:03 am, Roger Pau Monné wrote:
>>> On Wed, Oct 30, 2024 at 10:39:12AM +, Andrew Cooper wrote:
 On 30/10/2024 8:59 am, Roger Pau Monné wrote:
> On Tue, Oct 29, 2024 at 05:55:05PM +, Andrew Cooper wrote:
>> diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
>> index b6d9fad56773..78bc9872b09a 100644
>> --- a/xen/arch/x86/cpu-policy.c
>> +++ b/xen/arch/x86/cpu-policy.c
>> @@ -391,6 +391,27 @@ static void __init calculate_host_policy(void)
>>  p->platform_info.cpuid_faulting = cpu_has_cpuid_faulting;
>>  }
>>  
>> +/*
>> + * Guest max policies can have any max leaf/subleaf within bounds.
>> + *
>> + * - Some incoming VMs have a larger-than-necessary feat max_subleaf.
>> + * - Some VMs we'd like to synthesise leaves not present on the host.
>> + */
>> +static void __init guest_common_max_leaves(struct cpu_policy *p)
>> +{
>> +p->basic.max_leaf   = ARRAY_SIZE(p->basic.raw) - 1;
>> +p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
>> +p->extd.max_leaf= 0x8000U + ARRAY_SIZE(p->extd.raw) - 1;
>> +}
>> +
>> +/* Guest default policies inherit the host max leaf/subleaf settings. */
>> +static void __init guest_common_default_leaves(struct cpu_policy *p)
>> +{
>> +p->basic.max_leaf   = host_cpu_policy.basic.max_leaf;
>> +p->feat.max_subleaf = host_cpu_policy.feat.max_subleaf;
>> +p->extd.max_leaf= host_cpu_policy.extd.max_leaf;
>> +}
> I think this what I'm going to ask is future work.  After the
> modifications done to the host policy by max functions
> (calculate_{hvm,pv}_max_policy()) won't the max {sub,}leaf adjustments
> better be done taking into account the contents of the policy, rather
> than capping to the host values?
>
> (note this comment is strictly for guest_common_default_leaves(), the
> max version is fine using ARRAY_SIZE).
 I'm afraid I don't follow.

 calculate_{pv,hvm}_max_policy() don't modify the host policy.
>>> Hm, I don't think I've expressed myself clearly, sorry.  Let me try
>>> again.
>>>
>>> calculate_{hvm,pv}_max_policy() extends the host policy by possibly
>>> setting new features, and such extended policy is then used as the
>>> base for the PV/HVM default policies.
>>>
>>> Won't the resulting policy in calculate_{hvm,pv}_def_policy() risks
>>> having bits set past the max {sub,}leaf in the host policy, as it's
>>> based in {hvm,pv}_def_cpu_policy that might have such bits set?
>> Oh, right.
>>
>> This patch doesn't change anything WRT that.
> Indeed, didn't intend my comment to block it, just that I think at
> some point the logic in guest_common_default_leaves() will need to be
> expanded.
>
>> But I think you're right that we do risk getting into that case (in
>> principle at least) because of how guest_common_*_feature_adjustment() work.
>>
>> Furthermore, the bug will typically get hidden because we serialise
>> based on the max_leaf/subleaf, and will discard feature words outside of
>> the max_leaf/subleaf bounds.
> Yes, once we serialize it for toolstack consumption the leafs will be
> implicitly zeroed.
>
>> I suppose we probably want a variation of x86_cpu_featureset_to_policy()
>> which extends the max_leaf/subleaf based on non-zero values in leaves. 
>> (This already feels like it's going to be an ugly algorithm.)
> Hm, I was thinking that we would need to adjust
> guest_common_default_leaves() to properly shrink the max {sub,}leaf
> fields from the max policies.

Hmm.  What we'd do is have default inherit max's ARRAY_SIZES(), then do
all the existing logic, then as the final step, shrink the default
policies, vaguely per Jan's plan.

i.e. we'd end up deleting guest_common_default_leaves()

That way we don't need to encode any knowledge of which feature bit
means what WRT max_leaf/subleaf.

~Andrew



[PATCH v2] xen/common: Move gic_preinit() to common code

2024-10-30 Thread Oleksii Kurochko
Introduce ic_preinit() in the common codebase, as it is not
architecture-specific and can be reused by both PPC and RISC-V.
This function identifies the node with the interrupt-controller property
in the device tree and calls device_init() to handle architecture-specific
initialization of the interrupt controller.

Additionally, rename gic_acpi_preinit() to ic_acpi_preinit() as it is called
by ic_preinit(), while keeping it defined in architecture-specific code, as
this part is architecture-specific. In the case of CONFIG_ACPI=n, a stub for
ic_acpi_preinit() is provided. `inline` is added to the declaration/definition
of ic_acpi_preinit() to deal with the compilation issue:
  error: 'ic_acpi_preinit' defined but not used [-Werror=unused-function]

Make minor adjustments compared to the original ARM implementation of
gic_dt_preinit():
 - Remove the local rc variable in gic_dt_preinit() since it is only used once.
 - Change the prefix from gic to ic to clarify that the function is not
   specific to ARM’s GIC, making it suitable for other architectures as well.

Signed-off-by: Oleksii Kurochko 
---
Changes in v2:
 - Revert the changes that moved gic_acpi_preinit() to common code, as it
   isn't really an architecture-independent part.
 - Update the commit message.
 - Move stub of ic_acpi_preinit() to  for the case when
   CONFIG_ACPI=n.
---
 xen/arch/arm/gic.c   | 45 +--
 xen/arch/arm/setup.c |  3 ++-
 xen/common/device.c  | 46 
 xen/include/asm-generic/device.h | 10 +++
 4 files changed, 59 insertions(+), 45 deletions(-)

diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
index 3eaf670fd7..b4a1e769df 100644
--- a/xen/arch/arm/gic.c
+++ b/xen/arch/arm/gic.c
@@ -214,38 +214,8 @@ int gic_map_hwdom_extra_mappings(struct domain *d)
 return 0;
 }
 
-static void __init gic_dt_preinit(void)
-{
-int rc;
-struct dt_device_node *node;
-uint8_t num_gics = 0;
-
-dt_for_each_device_node( dt_host, node )
-{
-if ( !dt_get_property(node, "interrupt-controller", NULL) )
-continue;
-
-if ( !dt_get_parent(node) )
-continue;
-
-rc = device_init(node, DEVICE_INTERRUPT_CONTROLLER, NULL);
-if ( !rc )
-{
-/* NOTE: Only one GIC is supported */
-num_gics = 1;
-break;
-}
-}
-if ( !num_gics )
-panic("Unable to find compatible GIC in the device tree\n");
-
-/* Set the GIC as the primary interrupt controller */
-dt_interrupt_controller = node;
-dt_device_set_used_by(node, DOMID_XEN);
-}
-
 #ifdef CONFIG_ACPI
-static void __init gic_acpi_preinit(void)
+void __init ic_acpi_preinit(void)
 {
 struct acpi_subtable_header *header;
 struct acpi_madt_generic_distributor *dist;
@@ -259,21 +229,8 @@ static void __init gic_acpi_preinit(void)
 if ( acpi_device_init(DEVICE_INTERRUPT_CONTROLLER, NULL, dist->version) )
 panic("Unable to find compatible GIC in the ACPI table\n");
 }
-#else
-static void __init gic_acpi_preinit(void) { }
 #endif
 
-/* Find the interrupt controller and set up the callback to translate
- * device tree or ACPI IRQ.
- */
-void __init gic_preinit(void)
-{
-if ( acpi_disabled )
-gic_dt_preinit();
-else
-gic_acpi_preinit();
-}
-
 /* Set up the GIC */
 void __init gic_init(void)
 {
diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index 71ebaa77ca..1ea7db0bd4 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -359,7 +360,7 @@ void asmlinkage __init start_xen(unsigned long fdt_paddr)
 
 preinit_xen_time();
 
-gic_preinit();
+ic_preinit();
 
 arm_uart_init();
 console_init_preirq();
diff --git a/xen/common/device.c b/xen/common/device.c
index 33e0d58f2f..f440d8be88 100644
--- a/xen/common/device.c
+++ b/xen/common/device.c
@@ -4,10 +4,14 @@
  *   xen/arch/arm/device.c
  */
 
+#include 
 #include 
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 #include 
 
@@ -56,6 +60,40 @@ enum device_class device_get_class(const struct 
dt_device_node *dev)
 return DEVICE_UNKNOWN;
 }
 
+static void __init ic_dt_preinit(void)
+{
+struct dt_device_node *node;
+uint8_t num_gics = 0;
+
+dt_for_each_device_node( dt_host, node )
+{
+if ( !dt_get_property(node, "interrupt-controller", NULL) )
+continue;
+
+if ( !dt_get_parent(node) )
+continue;
+
+if ( !device_init(node, DEVICE_INTERRUPT_CONTROLLER, NULL) )
+{
+/* NOTE: Only one GIC is supported */
+num_gics = 1;
+break;
+}
+}
+
+if ( !num_gics )
+panic("Unable to find compatible interrupt controller "
+  "in the device tree\n");
+
+/* Set the interrupt controller as the primary interrupt controller */
+dt_interrupt_

Re: [RFC PATCH 0/6] xen/abi: On wide bitfields inside primitive types

2024-10-30 Thread Alejandro Vallejo
On Wed Oct 30, 2024 at 8:45 AM GMT, Christian Lindig wrote:
>
>
> > On 29 Oct 2024, at 18:16, Alejandro Vallejo  
> > wrote:
> > 
> > 
> > The invariant I'd like to (slowly) introduce and discuss is that fields may
> > have bitflags (e.g: a packed array of booleans indexed by some enumerated
> > type), but not be mixed with wider fields in the same primitive type. This
> > ensures any field containing an integer of any kind can be referred by 
> > pointer
> > and treated the same way as any other with regards to sizeof() and the like.
>
> Acked-by: Christian Lindig 

Thanks.

>
>
> Fine with me but the OCaml part is not very exposed to this.

Yeah, OCaml is pretty far from interacting with these details at all.
>
> — C

Cheers,
Alejandro



[PATCH for-4.19] Config: Update MiniOS revision

2024-10-30 Thread Andrew Cooper
A new branch from xen-RELEASE-4.19.0, for now containing commit
a400dd517068 ("Add missing symbol exports for grub-pv").

Signed-off-by: Andrew Cooper 
---
CC: Juergen Gross 
CC: Jan Beulich 
CC: Stefano Stabellini 
CC: Julien Grall 
---
 Config.mk | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Config.mk b/Config.mk
index 03a89624c77f..aa3d5843f1ed 100644
--- a/Config.mk
+++ b/Config.mk
@@ -224,7 +224,7 @@ QEMU_UPSTREAM_URL ?= 
https://xenbits.xen.org/git-http/qemu-xen.git
 QEMU_UPSTREAM_REVISION ?= qemu-xen-4.19.0
 
 MINIOS_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/mini-os.git
-MINIOS_UPSTREAM_REVISION ?= xen-RELEASE-4.19.0
+MINIOS_UPSTREAM_REVISION ?= xen-stable-4.19
 
 SEABIOS_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/seabios.git
 SEABIOS_UPSTREAM_REVISION ?= rel-1.16.3

base-commit: fadbc7e32e42f1a4199b854a895744f026803320
-- 
2.39.5




[PATCH] Config: Update MiniOS revision

2024-10-30 Thread Andrew Cooper
Commit 6d5159e8410b ("Add missing symbol exports for grub-pv")

Signed-off-by: Andrew Cooper 
---
CC: Juergen Gross 
CC: Jan Beulich 
CC: Stefano Stabellini 
CC: Julien Grall 
---
 Config.mk | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Config.mk b/Config.mk
index 6dd2f9439cfc..f1eab9a20a66 100644
--- a/Config.mk
+++ b/Config.mk
@@ -224,7 +224,7 @@ QEMU_UPSTREAM_URL ?= 
https://xenbits.xen.org/git-http/qemu-xen.git
 QEMU_UPSTREAM_REVISION ?= master
 
 MINIOS_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/mini-os.git
-MINIOS_UPSTREAM_REVISION ?= 80ef70f92cb7b95ef48deea1157f2194b10b8c05
+MINIOS_UPSTREAM_REVISION ?= 6d5159e8410be16a47433bac1627e63f8adc7cd9
 
 SEABIOS_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/seabios.git
 SEABIOS_UPSTREAM_REVISION ?= rel-1.16.3

base-commit: bb7296d77f171c7bfbafab30ed51e9be29660e39
-- 
2.39.5




Re: [PATCH] x86/cpu-policy: Extend the guest max policy max leaf/subleaves

2024-10-30 Thread Andrew Cooper
On 30/10/2024 5:10 pm, Roger Pau Monné wrote:
> On Wed, Oct 30, 2024 at 04:51:34PM +, Andrew Cooper wrote:
>> On 30/10/2024 3:13 pm, Roger Pau Monné wrote:
>>> On Wed, Oct 30, 2024 at 02:45:19PM +, Andrew Cooper wrote:
 On 30/10/2024 11:03 am, Roger Pau Monné wrote:
> On Wed, Oct 30, 2024 at 10:39:12AM +, Andrew Cooper wrote:
>> On 30/10/2024 8:59 am, Roger Pau Monné wrote:
>>> On Tue, Oct 29, 2024 at 05:55:05PM +, Andrew Cooper wrote:
 diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
 index b6d9fad56773..78bc9872b09a 100644
 --- a/xen/arch/x86/cpu-policy.c
 +++ b/xen/arch/x86/cpu-policy.c
 @@ -391,6 +391,27 @@ static void __init calculate_host_policy(void)
  p->platform_info.cpuid_faulting = cpu_has_cpuid_faulting;
  }
  
 +/*
 + * Guest max policies can have any max leaf/subleaf within bounds.
 + *
 + * - Some incoming VMs have a larger-than-necessary feat max_subleaf.
 + * - Some VMs we'd like to synthesise leaves not present on the host.
 + */
 +static void __init guest_common_max_leaves(struct cpu_policy *p)
 +{
 +p->basic.max_leaf   = ARRAY_SIZE(p->basic.raw) - 1;
 +p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
 +p->extd.max_leaf= 0x8000U + ARRAY_SIZE(p->extd.raw) - 
 1;
 +}
 +
 +/* Guest default policies inherit the host max leaf/subleaf settings. 
 */
 +static void __init guest_common_default_leaves(struct cpu_policy *p)
 +{
 +p->basic.max_leaf   = host_cpu_policy.basic.max_leaf;
 +p->feat.max_subleaf = host_cpu_policy.feat.max_subleaf;
 +p->extd.max_leaf= host_cpu_policy.extd.max_leaf;
 +}
>>> I think this what I'm going to ask is future work.  After the
>>> modifications done to the host policy by max functions
>>> (calculate_{hvm,pv}_max_policy()) won't the max {sub,}leaf adjustments
>>> better be done taking into account the contents of the policy, rather
>>> than capping to the host values?
>>>
>>> (note this comment is strictly for guest_common_default_leaves(), the
>>> max version is fine using ARRAY_SIZE).
>> I'm afraid I don't follow.
>>
>> calculate_{pv,hvm}_max_policy() don't modify the host policy.
> Hm, I don't think I've expressed myself clearly, sorry.  Let me try
> again.
>
> calculate_{hvm,pv}_max_policy() extends the host policy by possibly
> setting new features, and such extended policy is then used as the
> base for the PV/HVM default policies.
>
> Won't the resulting policy in calculate_{hvm,pv}_def_policy() risks
> having bits set past the max {sub,}leaf in the host policy, as it's
> based in {hvm,pv}_def_cpu_policy that might have such bits set?
 Oh, right.

 This patch doesn't change anything WRT that.
>>> Indeed, didn't intend my comment to block it, just that I think at
>>> some point the logic in guest_common_default_leaves() will need to be
>>> expanded.
>>>
 But I think you're right that we do risk getting into that case (in
 principle at least) because of how guest_common_*_feature_adjustment() 
 work.

 Furthermore, the bug will typically get hidden because we serialise
 based on the max_leaf/subleaf, and will discard feature words outside of
 the max_leaf/subleaf bounds.
>>> Yes, once we serialize it for toolstack consumption the leafs will be
>>> implicitly zeroed.
>>>
 I suppose we probably want a variation of x86_cpu_featureset_to_policy()
 which extends the max_leaf/subleaf based on non-zero values in leaves. 
 (This already feels like it's going to be an ugly algorithm.)
>>> Hm, I was thinking that we would need to adjust
>>> guest_common_default_leaves() to properly shrink the max {sub,}leaf
>>> fields from the max policies.
>> Hmm.  What we'd do is have default inherit max's ARRAY_SIZES(), then do
>> all the existing logic, then as the final step, shrink the default
>> policies, vaguely per Jan's plan.
>>
>> i.e. we'd end up deleting guest_common_default_leaves()
>>
>> That way we don't need to encode any knowledge of which feature bit
>> means what WRT max_leaf/subleaf.
> What about Alejandro's concern about runtime populated {sub,}leafs,
> won't we risk shrinking too much if the last leaf intended to be kept
> happens to be a fully runtime populated one?
>
> Do we need some kind of special magic for fully run-time populated
> leafs (like the topology ones IIRC?) in order to account for them when
> doing those max calculations?

No.

Xen shrinks the default policies only, as part of calculating them on
boot, in order to make them look more plausible.

The toolstack shrinks the guest policy as part of domain construction.


In both cases, shrinking is probably the final act

[PATCH] scripts: Fix git-checkout.sh to work with branches other than master

2024-10-30 Thread Andrew Cooper
Xen uses master for QEMU_UPSTREAM_REVISION, and has done for other trees too
in the past.  Apparently we've never specified a different branch, because the
git-clone rune only pulls in the master branch; it does not pull in diverging
branches.  Fix this by stating which branch/tag is wanted.

$TAG is really a committish, and git-clone's -b/--branch takes a committish
too.

Signed-off-by: Andrew Cooper 
---
CC: Anthony PERARD 
CC: Juergen Gross 
CC: Jan Beulich 
CC: Stefano Stabellini 
CC: Julien Grall 

Found while trying to test the MiniOS fixes in Gitlab.  This is a example run
with most stubdom builds passing:

  https://gitlab.com/xen-project/people/andyhhp/xen/-/pipelines/1520611847

and all of them successfully cloned.
---
 scripts/git-checkout.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/git-checkout.sh b/scripts/git-checkout.sh
index fd4425ac4ee8..3796cbfe39a7 100755
--- a/scripts/git-checkout.sh
+++ b/scripts/git-checkout.sh
@@ -14,7 +14,7 @@ set -e
 if test \! -d $DIR-remote; then
rm -rf $DIR-remote $DIR-remote.tmp
mkdir -p $DIR-remote.tmp; rmdir $DIR-remote.tmp
-   $GIT clone $TREE $DIR-remote.tmp
+   $GIT clone -b $TAG $TREE $DIR-remote.tmp
if test "$TAG" ; then
cd $DIR-remote.tmp
$GIT branch -D dummy >/dev/null 2>&1 ||:

base-commit: bb7296d77f171c7bfbafab30ed51e9be29660e39
-- 
2.39.5




Re: [PATCH] x86/cpu-policy: Extend the guest max policy max leaf/subleaves

2024-10-30 Thread Alejandro Vallejo
On Wed Oct 30, 2024 at 3:13 PM GMT, Roger Pau Monné wrote:
> On Wed, Oct 30, 2024 at 02:45:19PM +, Andrew Cooper wrote:
> > On 30/10/2024 11:03 am, Roger Pau Monné wrote:
> > > On Wed, Oct 30, 2024 at 10:39:12AM +, Andrew Cooper wrote:
> > >> On 30/10/2024 8:59 am, Roger Pau Monné wrote:
> > >>> On Tue, Oct 29, 2024 at 05:55:05PM +, Andrew Cooper wrote:
> >  diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
> >  index b6d9fad56773..78bc9872b09a 100644
> >  --- a/xen/arch/x86/cpu-policy.c
> >  +++ b/xen/arch/x86/cpu-policy.c
> >  @@ -391,6 +391,27 @@ static void __init calculate_host_policy(void)
> >   p->platform_info.cpuid_faulting = cpu_has_cpuid_faulting;
> >   }
> >   
> >  +/*
> >  + * Guest max policies can have any max leaf/subleaf within bounds.
> >  + *
> >  + * - Some incoming VMs have a larger-than-necessary feat max_subleaf.
> >  + * - Some VMs we'd like to synthesise leaves not present on the host.
> >  + */
> >  +static void __init guest_common_max_leaves(struct cpu_policy *p)
> >  +{
> >  +p->basic.max_leaf   = ARRAY_SIZE(p->basic.raw) - 1;
> >  +p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
> >  +p->extd.max_leaf= 0x8000U + ARRAY_SIZE(p->extd.raw) - 
> >  1;
> >  +}
> >  +
> >  +/* Guest default policies inherit the host max leaf/subleaf settings. 
> >  */
> >  +static void __init guest_common_default_leaves(struct cpu_policy *p)
> >  +{
> >  +p->basic.max_leaf   = host_cpu_policy.basic.max_leaf;
> >  +p->feat.max_subleaf = host_cpu_policy.feat.max_subleaf;
> >  +p->extd.max_leaf= host_cpu_policy.extd.max_leaf;
> >  +}
> > >>> I think this what I'm going to ask is future work.  After the
> > >>> modifications done to the host policy by max functions
> > >>> (calculate_{hvm,pv}_max_policy()) won't the max {sub,}leaf adjustments
> > >>> better be done taking into account the contents of the policy, rather
> > >>> than capping to the host values?
> > >>>
> > >>> (note this comment is strictly for guest_common_default_leaves(), the
> > >>> max version is fine using ARRAY_SIZE).
> > >> I'm afraid I don't follow.
> > >>
> > >> calculate_{pv,hvm}_max_policy() don't modify the host policy.
> > > Hm, I don't think I've expressed myself clearly, sorry.  Let me try
> > > again.
> > >
> > > calculate_{hvm,pv}_max_policy() extends the host policy by possibly
> > > setting new features, and such extended policy is then used as the
> > > base for the PV/HVM default policies.
> > >
> > > Won't the resulting policy in calculate_{hvm,pv}_def_policy() risks
> > > having bits set past the max {sub,}leaf in the host policy, as it's
> > > based in {hvm,pv}_def_cpu_policy that might have such bits set?
> > 
> > Oh, right.
> > 
> > This patch doesn't change anything WRT that.
>
> Indeed, didn't intend my comment to block it, just that I think at
> some point the logic in guest_common_default_leaves() will need to be
> expanded.
>
> > But I think you're right that we do risk getting into that case (in
> > principle at least) because of how guest_common_*_feature_adjustment() work.
> > 
> > Furthermore, the bug will typically get hidden because we serialise
> > based on the max_leaf/subleaf, and will discard feature words outside of
> > the max_leaf/subleaf bounds.
>
> Yes, once we serialize it for toolstack consumption the leafs will be
> implicitly zeroed.
>
> > I suppose we probably want a variation of x86_cpu_featureset_to_policy()
> > which extends the max_leaf/subleaf based on non-zero values in leaves. 
> > (This already feels like it's going to be an ugly algorithm.)
>
> Hm, I was thinking that we would need to adjust
> guest_common_default_leaves() to properly shrink the max {sub,}leaf
> fields from the max policies.

That would be tricky in case we end up with subleafs that are strictly
populated at runtime. Xen would have no way of knowing whether that's meant to
be implemented or not. It seems safer to raise the max if we find a non-zero
leaf higher than the current max.

The algorithm is probably quite simple for static data, as it's merely
traversing the raw arrays and keeping track of the last non-zero leaf.

>
> Thanks, Roger.

Cheers,
Alejandro



Re: [PATCH] x86/cpu-policy: Extend the guest max policy max leaf/subleaves

2024-10-30 Thread Roger Pau Monné
On Wed, Oct 30, 2024 at 04:51:34PM +, Andrew Cooper wrote:
> On 30/10/2024 3:13 pm, Roger Pau Monné wrote:
> > On Wed, Oct 30, 2024 at 02:45:19PM +, Andrew Cooper wrote:
> >> On 30/10/2024 11:03 am, Roger Pau Monné wrote:
> >>> On Wed, Oct 30, 2024 at 10:39:12AM +, Andrew Cooper wrote:
>  On 30/10/2024 8:59 am, Roger Pau Monné wrote:
> > On Tue, Oct 29, 2024 at 05:55:05PM +, Andrew Cooper wrote:
> >> diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
> >> index b6d9fad56773..78bc9872b09a 100644
> >> --- a/xen/arch/x86/cpu-policy.c
> >> +++ b/xen/arch/x86/cpu-policy.c
> >> @@ -391,6 +391,27 @@ static void __init calculate_host_policy(void)
> >>  p->platform_info.cpuid_faulting = cpu_has_cpuid_faulting;
> >>  }
> >>  
> >> +/*
> >> + * Guest max policies can have any max leaf/subleaf within bounds.
> >> + *
> >> + * - Some incoming VMs have a larger-than-necessary feat max_subleaf.
> >> + * - Some VMs we'd like to synthesise leaves not present on the host.
> >> + */
> >> +static void __init guest_common_max_leaves(struct cpu_policy *p)
> >> +{
> >> +p->basic.max_leaf   = ARRAY_SIZE(p->basic.raw) - 1;
> >> +p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
> >> +p->extd.max_leaf= 0x8000U + ARRAY_SIZE(p->extd.raw) - 
> >> 1;
> >> +}
> >> +
> >> +/* Guest default policies inherit the host max leaf/subleaf settings. 
> >> */
> >> +static void __init guest_common_default_leaves(struct cpu_policy *p)
> >> +{
> >> +p->basic.max_leaf   = host_cpu_policy.basic.max_leaf;
> >> +p->feat.max_subleaf = host_cpu_policy.feat.max_subleaf;
> >> +p->extd.max_leaf= host_cpu_policy.extd.max_leaf;
> >> +}
> > I think this what I'm going to ask is future work.  After the
> > modifications done to the host policy by max functions
> > (calculate_{hvm,pv}_max_policy()) won't the max {sub,}leaf adjustments
> > better be done taking into account the contents of the policy, rather
> > than capping to the host values?
> >
> > (note this comment is strictly for guest_common_default_leaves(), the
> > max version is fine using ARRAY_SIZE).
>  I'm afraid I don't follow.
> 
>  calculate_{pv,hvm}_max_policy() don't modify the host policy.
> >>> Hm, I don't think I've expressed myself clearly, sorry.  Let me try
> >>> again.
> >>>
> >>> calculate_{hvm,pv}_max_policy() extends the host policy by possibly
> >>> setting new features, and such extended policy is then used as the
> >>> base for the PV/HVM default policies.
> >>>
> >>> Won't the resulting policy in calculate_{hvm,pv}_def_policy() risks
> >>> having bits set past the max {sub,}leaf in the host policy, as it's
> >>> based in {hvm,pv}_def_cpu_policy that might have such bits set?
> >> Oh, right.
> >>
> >> This patch doesn't change anything WRT that.
> > Indeed, didn't intend my comment to block it, just that I think at
> > some point the logic in guest_common_default_leaves() will need to be
> > expanded.
> >
> >> But I think you're right that we do risk getting into that case (in
> >> principle at least) because of how guest_common_*_feature_adjustment() 
> >> work.
> >>
> >> Furthermore, the bug will typically get hidden because we serialise
> >> based on the max_leaf/subleaf, and will discard feature words outside of
> >> the max_leaf/subleaf bounds.
> > Yes, once we serialize it for toolstack consumption the leafs will be
> > implicitly zeroed.
> >
> >> I suppose we probably want a variation of x86_cpu_featureset_to_policy()
> >> which extends the max_leaf/subleaf based on non-zero values in leaves. 
> >> (This already feels like it's going to be an ugly algorithm.)
> > Hm, I was thinking that we would need to adjust
> > guest_common_default_leaves() to properly shrink the max {sub,}leaf
> > fields from the max policies.
> 
> Hmm.  What we'd do is have default inherit max's ARRAY_SIZES(), then do
> all the existing logic, then as the final step, shrink the default
> policies, vaguely per Jan's plan.
> 
> i.e. we'd end up deleting guest_common_default_leaves()
> 
> That way we don't need to encode any knowledge of which feature bit
> means what WRT max_leaf/subleaf.

What about Alejandro's concern about runtime-populated {sub,}leaves:
won't we risk shrinking too much if the last leaf intended to be kept
happens to be a fully runtime-populated one?

Do we need some kind of special magic for fully runtime-populated
leaves (like the topology ones IIRC?) in order to account for them when
doing those max calculations?

Thanks, Roger.



Re: [PATCH v2 1/3] xen/riscv: introduce setup_mm()

2024-10-30 Thread oleksii . kurochko
On Wed, 2024-10-30 at 11:25 +0100, Jan Beulich wrote:
> On 23.10.2024 17:50, Oleksii Kurochko wrote:
> > Introduce the implementation of setup_mm(), which includes:
> > 1. Adding all free regions to the boot allocator, as memory is
> > needed
> >    to allocate page tables used for frame table mapping.
> > 2. Calculating RAM size and the RAM end address.
> > 3. Setting up direct map mappings from each RAM bank and initialize
> >    directmap_virt_start (also introduce XENHEAP_VIRT_START which is
> >    defined as directmap_virt_start) to be properly aligned with RAM
> >    start to use more superpages to reduce pressure on the TLB.
> > 4. Setting up frame table mappings from physical address 0 to
> > ram_end
> >    to simplify mfn_to_page() and page_to_mfn() conversions.
> > 5. Setting up total_pages and max_page.
> > 
> > Update virt_to_maddr() to use introduced XENHEAP_VIRT_START.
> > 
> > Implement maddr_to_virt() function to convert a machine address
> > to a virtual address. This function is specifically designed to be
> > used
> > only for the DIRECTMAP region, so a check has been added to ensure
> > that
> > the address does not exceed DIRECTMAP_SIZE.
> 
> I'm unconvinced by this. Conceivably the function could be used on
> "imaginary" addresses, just to calculate abstract positions or e.g.
> deltas. At the same time I'm also not going to insist on the removal
> of
> that assertion, so long as it doesn't trigger.
> 
> > After the introduction of maddr_to_virt() the following linkage
> > error starts
> > to occur and to avoid it share_xen_page_with_guest() stub is added:
> >   riscv64-linux-gnu-ld: prelink.o: in function `tasklet_kill':
> >   /build/xen/common/tasklet.c:176: undefined reference to
> >  `share_xen_page_with_guest'
> >   riscv64-linux-gnu-ld: ./.xen-syms.0: hidden symbol
> > `share_xen_page_with_guest'
> >     isn't defined riscv64-linux-gnu-ld: final link failed: bad
> > value
> > 
> > Despite the linker fingering tasklet.c, it's trace.o which has the
> > undefined reference:
> >   $ find . -name \*.o | while read F; do nm $F | grep
> > share_xen_page_with_guest &&
> >     echo $F; done
> >  U share_xen_page_with_guest
> >     ./xen/common/built_in.o
> >  U share_xen_page_with_guest
> >     ./xen/common/trace.o
> >  U share_xen_page_with_guest
> >     ./xen/prelink.o
> > 
> > Looking at trace.i, there is a call of share_xen_page_with_guest(),
> > but when maddr_to_virt() is defined as "return NULL" the compiler
> > optimizes away the part of common/trace.c code where
> > share_xen_page_with_privileged_guest() is called (there is no code in
> > the disassembled common/trace.o), so there is no real call of
> > share_xen_page_with_privileged_guest().
> 
> I don't think it's the "return NULL", but rather BUG_ON()'s (really
> BUG()'s)
> unreachable(). Not the least because the function can't validly
> return NULL,
> and hence callers have no need to check for NULL.
> 
> > @@ -25,8 +27,11 @@
> >  
> >  static inline void *maddr_to_virt(paddr_t ma)
> >  {
> > -    BUG_ON("unimplemented");
> > -    return NULL;
> > +    unsigned long va_offset = maddr_to_directmapoff(ma);
> > +
> > +    ASSERT(va_offset < DIRECTMAP_SIZE);
> > +
> > +    return (void *)(XENHEAP_VIRT_START + va_offset);
> >  }
> 
> I'm afraid I'm not following why this uses XENHEAP_VIRT_START, when
> it's all about the directmap. I'm in trouble with XENHEAP_VIRT_START
> in the first place: You don't have a separate "heap" virtual address
> range, do you?
The name may not be ideal for RISC-V. I borrowed it from Arm, intending
to account for cases where the directmap virtual start might not align
with DIRECTMAP_VIRT_START due to potential adjustments for superpage
mapping.
And my understanding is that XENHEAP == DIRECTMAP in case of Arm64.

Let's discuss below whether XENHEAP_VIRT_START is necessary, as there
are related questions connected to it.

> 
> > @@ -37,9 +42,9 @@ static inline void *maddr_to_virt(paddr_t ma)
> >   */
> >  static inline unsigned long virt_to_maddr(unsigned long va)
> >  {
> > -    if ((va >= DIRECTMAP_VIRT_START) &&
> > +    if ((va >= XENHEAP_VIRT_START) &&
> >  (va < (DIRECTMAP_VIRT_START + DIRECTMAP_SIZE)))
> > -    return directmapoff_to_maddr(va - DIRECTMAP_VIRT_START);
> > +    return directmapoff_to_maddr(va - XENHEAP_VIRT_START);
> 
> Same concern here then.
> 
> > @@ -423,3 +424,123 @@ void * __init early_fdt_map(paddr_t
> > fdt_paddr)
> >  
> >  return fdt_virt;
> >  }
> > +
> > +#ifndef CONFIG_RISCV_32
> 
> I'd like to ask that you be more selective with this #ifdef (or omit
> it
> altogether here). setup_mm() itself, for example, looks good for any
> mode.
Regarding setup_mm(): the 32-bit and 64-bit versions have pretty
different implementations.

> Like does ...
> 
> > +#define ROUNDDOWN(addr, size)  ((addr) & ~((size) - 1))
> 
> ... this #define. Then again this macro may better be placed i

Re: [PATCH v7 02/10] xen/x86: Add initial x2APIC ID to the per-vLAPIC save area

2024-10-30 Thread Alejandro Vallejo
Hi,

On Wed Oct 30, 2024 at 6:37 AM GMT, Jan Beulich wrote:
> On 29.10.2024 21:30, Andrew Cooper wrote:
> > On 21/10/2024 4:45 pm, Alejandro Vallejo wrote:
> >> @@ -310,19 +309,16 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
> >>  break;
> >>  
> >>  case 0xb:
> >> -/*
> >> - * In principle, this leaf is Intel-only.  In practice, it is 
> >> tightly
> >> - * coupled with x2apic, and we offer an x2apic-capable APIC 
> >> emulation
> >> - * to guests on AMD hardware as well.
> >> - *
> >> - * TODO: Rework topology logic.
> >> - */
> >>  if ( p->basic.x2apic )
> >>  {
> >>  *(uint8_t *)&res->c = subleaf;
> >>  
> >> -/* Fix the x2APIC identifier. */
> >> -res->d = v->vcpu_id * 2;
> >> +/*
> >> + * Fix the x2APIC identifier. The PV side is nonsensical, but
> >> + * we've always shown it like this so it's kept for compat.
> >> + */
> > 
> > In hindsight I should changed "Fix the x2APIC identifier." when I
> > reworked this logic, but oh well - better late than never.
> > 
> > /* The x2APIC_ID is per-vCPU, and fixed irrespective of the requested
> > subleaf. */
>
> Can we perhaps avoid "fix" in this comment? "Adjusted", "overwritten", or
> some such ought to do, without carrying a hint towards some bug somewhere.

I understood "fix" there as "pin" rather than "unbreak". Regardless I can also
rewrite it as "The x2APIC ID is per-vCPU and shown on all subleafs"

>
> >> --- a/xen/include/public/arch-x86/hvm/save.h
> >> +++ b/xen/include/public/arch-x86/hvm/save.h
> >> @@ -394,6 +394,8 @@ struct hvm_hw_lapic {
> >>  uint32_t disabled; /* VLAPIC_xx_DISABLED */
> >>  uint32_t timer_divisor;
> >>  uint64_t tdt_msr;
> >> +uint32_t x2apic_id;
> >> +uint32_t rsvd_zero;
> > 
> > ... we do normally spell it _rsvd; to make it extra extra clear that
> > people shouldn't be doing anything with it.
>
> Alternatively, to carry the "zero" in the name, how about _mbz?
>
> Jan

I'd prefer that to _rsvd, if anything to make it patently clear that leaving
rubble is not ok.

Cheers,
Alejandro



Re: [PATCH v7 04/10] tools/hvmloader: Retrieve (x2)APIC IDs from the APs themselves

2024-10-30 Thread Jan Beulich
On 30.10.2024 12:31, Andrew Cooper wrote:
> On 21/10/2024 4:45 pm, Alejandro Vallejo wrote:
>> --- a/tools/firmware/hvmloader/mp_tables.c
>> +++ b/tools/firmware/hvmloader/mp_tables.c
>> @@ -198,8 +198,10 @@ static void fill_mp_config_table(struct mp_config_table 
>> *mpct, int length)
>>  /* fills in an MP processor entry for VCPU 'vcpu_id' */
>>  static void fill_mp_proc_entry(struct mp_proc_entry *mppe, int vcpu_id)
>>  {
>> +ASSERT(cpu_to_x2apicid[vcpu_id] < 0xFF );
> 
> This is just going to break when we hit 256 vCPUs in a VM.
> 
> What do real systems do?
> 
> They'll either wrap around 255 like the CPUID xAPIC_ID does, or they'll
> not write out MP tables at all.

"at all" may be going a little far. They may simply not advertise CPUs with
too wide APIC IDs there, while still allowing others to be discovered this
legacy way.

Jan



Re: [PATCH v7 02/10] xen/x86: Add initial x2APIC ID to the per-vLAPIC save area

2024-10-30 Thread Alejandro Vallejo
I'm fine with all suggestions, with one exception that needs a bit more
explanation...

On Tue Oct 29, 2024 at 8:30 PM GMT, Andrew Cooper wrote:
> On 21/10/2024 4:45 pm, Alejandro Vallejo wrote:
> > This allows the initial x2APIC ID to be sent on the migration stream.
> > This allows further changes to topology and APIC ID assignment without
> > breaking existing hosts. Given the vlapic data is zero-extended on
> > restore, fix up migrations from hosts without the field by setting it to
> > the old convention if zero.
> >
> > The hardcoded mapping x2apic_id=2*vcpu_id is kept for the time being,
> > but it's meant to be overriden by toolstack on a later patch with
> > appropriate values.
> >
> > Signed-off-by: Alejandro Vallejo 
>
> I'm going to request some changes, but I think they're only comment
> changes. [edit, no sadly, one non-comment change.]
>
> It's unfortunate that Xen uses an instance of hvm_hw_lapic for it's
> internal state, but one swamp at a time.
>
>
> In the subject, there's no such thing as the "initial" x2APIC ID. 
> There's just "the x2APIC ID" and it's not mutable state as far as the
guest is concerned.  (This is different to the xAPIC ID, where there is
> an architectural concept of the initial xAPIC ID, from the days when
> OSes were permitted to edit it).  Also, it's x86/hvm, seeing as this is
> an HVM specific change you're making.
>
> Next, while it's true that this allows the value to move in the
> migration stream, the more important point is that this allows the
> toolstack to configure the x2APIC ID for each vCPU.
>
> So, for the commit message, I recommend:
>
> ---%<---
> Today, Xen hard-codes x2APIC_ID = vcpu_id * 2, but this is unwise and
> interferes with providing accurate topology information to the guest.
>
> Introduce a new x2apic_id field into hvm_hw_lapic.  This is immutable
> state from the guest's point of view, but it allows the toolstack to
> configure the value, and for the value to move on migrate.
>
> For backwards compatibility, we treat incoming zeroes as if they were
> the old hardcoded scheme.
> ---%<---
>
> > diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
> > index 2a777436ee27..e2489ff8e346 100644
> > --- a/xen/arch/x86/cpuid.c
> > +++ b/xen/arch/x86/cpuid.c
> > @@ -138,10 +138,9 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
> >  const struct cpu_user_regs *regs;
> >  
> >  case 0x1:
> > -/* TODO: Rework topology logic. */
> >  res->b &= 0x00ffu;
> >  if ( is_hvm_domain(d) )
> > -res->b |= (v->vcpu_id * 2) << 24;
> > +res->b |= vlapic_x2apic_id(vcpu_vlapic(v)) << 24;
>
> There wants to be some kind of note here, especially as you're feeding
> vlapic_x2apic_id() into a field called xAPIC ID.  Perhaps
>
> /* Large systems do wrap around 255 in the xAPIC_ID field. */
>
> ?
>
>
> >  
> >  /* TODO: Rework vPMU control in terms of toolstack choices. */
> >  if ( vpmu_available(v) &&
> > @@ -310,19 +309,16 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
> >  break;
> >  
> >  case 0xb:
> > -/*
> > - * In principle, this leaf is Intel-only.  In practice, it is 
> > tightly
> > - * coupled with x2apic, and we offer an x2apic-capable APIC 
> > emulation
> > - * to guests on AMD hardware as well.
> > - *
> > - * TODO: Rework topology logic.
> > - */
> >  if ( p->basic.x2apic )
> >  {
> >  *(uint8_t *)&res->c = subleaf;
> >  
> > -/* Fix the x2APIC identifier. */
> > -res->d = v->vcpu_id * 2;
> > +/*
> > + * Fix the x2APIC identifier. The PV side is nonsensical, but
> > + * we've always shown it like this so it's kept for compat.
> > + */
>
> In hindsight I should changed "Fix the x2APIC identifier." when I
> reworked this logic, but oh well - better late than never.
>
> /* The x2APIC_ID is per-vCPU, and fixed irrespective of the requested
> subleaf. */
>
> I'd also put a little more context in the PV side:
>
> /* Xen 4.18 and earlier leaked x2APIC into PV guests.  The value shown
> is nonsensical but kept as-was for compatibility. */
>
> > diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
> > index 3363926b487b..33b463925f4e 100644
> > --- a/xen/arch/x86/hvm/vlapic.c
> > +++ b/xen/arch/x86/hvm/vlapic.c
> > @@ -1538,6 +1538,16 @@ static void lapic_load_fixup(struct vlapic *vlapic)
> >  const struct vcpu *v = vlapic_vcpu(vlapic);
> >  uint32_t good_ldr = x2apic_ldr_from_id(vlapic->loaded.id);
> >  
> > +/*
> > + * Loading record without hw.x2apic_id in the save stream, calculate 
> > using
> > + * the traditional "vcpu_id * 2" relation. There's an implicit 
> > assumption
> > + * that vCPU0 always has x2APIC0, which is true for the old relation, 
> > and
> > + * still holds under the new x2APIC generation algorithm. While that 
> > case
> > + *

Re: [PATCH v7 02/10] xen/x86: Add initial x2APIC ID to the per-vLAPIC save area

2024-10-30 Thread Jan Beulich
On 30.10.2024 13:03, Alejandro Vallejo wrote:
> On Wed Oct 30, 2024 at 6:37 AM GMT, Jan Beulich wrote:
>> On 29.10.2024 21:30, Andrew Cooper wrote:
>>> On 21/10/2024 4:45 pm, Alejandro Vallejo wrote:
 @@ -310,19 +309,16 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
  break;
  
  case 0xb:
 -/*
 - * In principle, this leaf is Intel-only.  In practice, it is 
 tightly
 - * coupled with x2apic, and we offer an x2apic-capable APIC 
 emulation
 - * to guests on AMD hardware as well.
 - *
 - * TODO: Rework topology logic.
 - */
  if ( p->basic.x2apic )
  {
  *(uint8_t *)&res->c = subleaf;
  
 -/* Fix the x2APIC identifier. */
 -res->d = v->vcpu_id * 2;
 +/*
 + * Fix the x2APIC identifier. The PV side is nonsensical, but
 + * we've always shown it like this so it's kept for compat.
 + */
>>>
>>> In hindsight I should changed "Fix the x2APIC identifier." when I
>>> reworked this logic, but oh well - better late than never.
>>>
>>> /* The x2APIC_ID is per-vCPU, and fixed irrespective of the requested
>>> subleaf. */
>>
>> Can we perhaps avoid "fix" in this comment? "Adjusted", "overwritten", or
>> some such ought to do, without carrying a hint towards some bug somewhere.
> 
> I understood "fix" there as "pin" rather than "unbreak".

Oh, right - that possible meaning escaped me.

Jan



Re: [QUESTION] tools/xenstored: Best way to proceed with the protocol modification

2024-10-30 Thread Jürgen Groß

On 30.10.24 12:58, Andrii Sultanov wrote:

Hello,

(CC-ing Jürgen as the original author of the xenstored partial directory
patches: https://lore.kernel.org/xen-devel/20161205074853.13268-1-jgr...@suse.com/ )


I'm investigating implementing XS_DIRECTORY_PART support in Oxenstored, and have 
come by a possible issue - the protocol specifies that the 'offset' parameter 
for each call is a "byte offset into the list of children", and so should be 
calculated on the user side. This makes sense for the C side as children are 
stored in a single char array separated by null characters, but OCaml stores 
children in a different structure, which can't be indexed into this way (but is 
searched more efficiently, etc.)


What's the best way to proceed here?

Could the protocol be redefined to:
1) turn the 'offset' into an opaque id that needs to be re-sent as-is on the 
next iteration? (it would keep being interpreted as an index into an array on 
the C side, and as something else on the OCaml side)
2) return the opaque 'offset' on each call alongside the string and generation 
id so that it wouldn't be calculated from strlen() on the user side anymore?


As libxenstore is calculating the offset based on the length of the
returned result of the previous call and then using this value for the
next iteration, there is no way "offset" could be redefined to be an
opaque handle with a different value than today.

Modifying the output format isn't possible either, as we have no control
over all clients using the current format.

So you need to comply with the current interface.


Juergen




Re: [PATCH v7 02/10] xen/x86: Add initial x2APIC ID to the per-vLAPIC save area

2024-10-30 Thread Andrew Cooper
On 30/10/2024 6:37 am, Jan Beulich wrote:
> On 29.10.2024 21:30, Andrew Cooper wrote:
>> On 21/10/2024 4:45 pm, Alejandro Vallejo wrote:
>>> @@ -310,19 +309,16 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
>>>  break;
>>>  
>>>  case 0xb:
>>> -/*
>>> - * In principle, this leaf is Intel-only.  In practice, it is 
>>> tightly
>>> - * coupled with x2apic, and we offer an x2apic-capable APIC 
>>> emulation
>>> - * to guests on AMD hardware as well.
>>> - *
>>> - * TODO: Rework topology logic.
>>> - */
>>>  if ( p->basic.x2apic )
>>>  {
>>>  *(uint8_t *)&res->c = subleaf;
>>>  
>>> -/* Fix the x2APIC identifier. */
>>> -res->d = v->vcpu_id * 2;
>>> +/*
>>> + * Fix the x2APIC identifier. The PV side is nonsensical, but
>>> + * we've always shown it like this so it's kept for compat.
>>> + */
>> In hindsight I should changed "Fix the x2APIC identifier." when I
>> reworked this logic, but oh well - better late than never.
>>
>> /* The x2APIC_ID is per-vCPU, and fixed irrespective of the requested
>> subleaf. */
> Can we perhaps avoid "fix" in this comment? "Adjusted", "overwritten", or
> some such ought to do, without carrying a hint towards some bug somewhere.

Not really.  This is actually a good example of why "fix" as in bugfix
is a weird corner of English, despite it being common in coding circles.


"Fix" means to attach one thing to another, along with a strong
implication that the other thing doesn't move.  This comes from Latin,
and the collective term for nails/screws/bolts/etc is "fixings".

Fix as in bugfix derives from "to repair" or "to mend", which in turn
comes from the fact that even today (and moreso several hundred years
ago), many repairs/mends involve affixing one thing back to something
else that doesn't move.


In this case, it is res->d which is fixed (as in unmoving) with
respect to the subleaf index.  It is weird even for CPUID; it's the only
example I'm aware of where the content is logically the same piece of
information in any subleaf.

~Andrew



Re: [PATCH v3] x86/io-apic: fix directed EOI when using AMD-Vi interrupt remapping

2024-10-30 Thread Roger Pau Monné
On Wed, Oct 30, 2024 at 11:57:39AM +0100, Jan Beulich wrote:
> On 30.10.2024 11:09, Roger Pau Monné wrote:
> > On Wed, Oct 30, 2024 at 10:41:40AM +0100, Jan Beulich wrote:
> >> On 29.10.2024 18:48, Roger Pau Monné wrote:
> >>> On Tue, Oct 29, 2024 at 05:43:24PM +0100, Jan Beulich wrote:
>  On 29.10.2024 12:03, Roger Pau Monne wrote:
> > @@ -273,6 +293,13 @@ void __ioapic_write_entry(
> >  {
> >  __io_apic_write(apic, 0x11 + 2 * pin, eu.w2);
> >  __io_apic_write(apic, 0x10 + 2 * pin, eu.w1);
> > +/*
> > + * Called in clear_IO_APIC_pin() before io_apic_pin_eoi is 
> > allocated.
> > + * Entry will be updated once the array is allocated and 
> > there's a
> > + * write against the pin.
> > + */
> > +if ( io_apic_pin_eoi )
> > +io_apic_pin_eoi[apic][pin] = e.vector;
> 
>  The comment here looks a little misleading to me. clear_IO_APIC_pin() 
>  calls
>  here to, in particular, set the mask bit. With the mask bit the vector 
>  isn't
>  meaningful anyway (and indeed clear_IO_APIC_pin() sets it to zero, at 
>  which
>  point recording IRQ_VECTOR_UNASSIGNED might be better than the bogus 
>  vector
>  0x00).
> >>>
> >>> Note that clear_IO_APIC_pin() performs the call to
> >>> __ioapic_write_entry() with raw == false, at which point
> >>> __ioapic_write_entry() will call iommu_update_ire_from_apic() if IOMMU
> >>> IR is enabled.  The cached 'vector' value will be the IOMMU entry
> >>> offset for the AMD-Vi case, as the IOMMU code will perform the call to
> >>> __ioapic_write_entry() with raw == true.
> >>>
> >>> What matters is that the cached value matches what's written in the
> >>> IO-APIC RTE, and the current logic ensures this.
> >>>
> >>> What's the benefit of using IRQ_VECTOR_UNASSIGNED if the result is
> >>> reading the RTE and finding that vector == 0?
> >>
> >> It's not specifically the vector == 0 case alone. Shouldn't we leave
> >> the latched vector alone when writing an RTE with the mask bit set?
> > 
> > I'm not sure what's the benefit of the extra logic to detect such
> > cases, just to avoid a write to the io_apic_pin_eoi matrix.
> 
> Perhaps the largely theoretical concern towards having stale data
> somewhere. Yet ...
> 
> >> Any still pending EOI (there should be none aiui) can't possibly
> >> target the meaningless vector / index in such an RTE. Perhaps it was
> >> wrong to suggest to overwrite (with IRQ_VECTOR_UNASSIGNED) what we
> >> have on record.
> >>
> >> Yet at the same time there ought to be a case where the recorded
> >> indeed moves back to IRQ_VECTOR_UNASSIGNED.
> > 
> > The only purpose of the io_apic_pin_eoi matrix is to cache what's
> > currently in the RTE entry 'vector' field.  I don't think we should
> > attempt to add extra logic as to whether the entry is valid, or
> > masked.  Higher level layers should already take care of that.  The
> > only purpose of the logic added in this patch is to ensure the EOI is
> > performed using what's in the RTE vector field for the requested pin.
> > Anything else is out of scope IMO.
> > 
> > Another option, which would allow to make the matrix store uint8_t
> > elements would be to initialize it at allocation with the RTE vector
> > fields currently present, IOW: do a raw read of every RTE and set the
> > fetched vector field in io_apic_pin_eoi.  Would that be better to you,
> > as also removing the need to ever store IRQ_VECTOR_UNASSIGNED?
> 
> ... yes, that may make sense (and eliminate my concern there).
> 
> I wonder whether the allocation of the array then wouldn't better be
> moved earlier, to enable_IO_APIC(), such that clear_IO_APIC_pin()
> already can suitably update it. In fact, since that function writes
> zero[1], no extra reads would then be needed at all, and the array could
> simply start out all zeroed.

I agree with the suggestion to allocate and setup the io_apic_pin_eoi
matrix in enable_IO_APIC().  However, I'm not sure I follow your
suggestion about the matrix starting as all zeroes being a sane state.

I think we need to do the raw RTE reads in enable_IO_APIC() before
calling clear_IO_APIC(), otherwise clear_IO_APIC_pin() can call
__io_apic_eoi() before any __ioapic_write_entry() has been performed,
and hence the state of the RTE.vector field could possibly be out of
sync with the initial value in io_apic_pin_eoi, and the EOI not take
effect.

Thanks, Roger.



Re: [PATCH] x86/setup: Make setup.h header self contained

2024-10-30 Thread Andrew Cooper
On 30/10/2024 11:17 am, Jan Beulich wrote:
> On 30.10.2024 12:15, Frediano Ziglio wrote:
>> On Wed, Oct 30, 2024 at 10:59 AM Jan Beulich  wrote:
>>> On 30.10.2024 11:44, Frediano Ziglio wrote:
 The header uses the rangeset structure typedef, whose definition
 is not included.
>>> And it doesn't need to be. For
>>>
>>> int remove_xen_ranges(struct rangeset *r);
>>>
>>> we don't need ...
>>>
 --- a/xen/arch/x86/include/asm/setup.h
 +++ b/xen/arch/x86/include/asm/setup.h
 @@ -2,6 +2,7 @@
  #define __X86_SETUP_H_

  #include 
 +#include 
  #include 

  extern const char __2M_text_start[], __2M_text_end[];
>>> ... this, a mere
>>>
>>> struct rangeset;
>>>
>>> forward decl will suffice.
>>>
>>> Jan
>>>
>> It's true, but for the same reason, we could avoid including
>> "xen/multiboot.h" and use "struct module" instead of "module_t".
> Indeed. I'd even question the need for that typedef.

Please don't got playing with includes of multiboot.h.  All you'll do is
interfere with Daniel's in-progress series.

Most includes are getting removed.

~Andrew



Re: [PATCH] x86/mm: Use standard C types for sized integers

2024-10-30 Thread Jan Beulich
On 30.10.2024 11:44, Frediano Ziglio wrote:
> The header is already using these types.
> 
> Signed-off-by: Frediano Ziglio 

Acked-by: Jan Beulich 

Nevertheless I wonder whether ...

> --- a/xen/arch/x86/include/asm/mm.h
> +++ b/xen/arch/x86/include/asm/mm.h
> @@ -230,7 +230,7 @@ struct page_info
>   * Only valid for: a) free pages, and b) pages with zero type count
>   * (except page table pages when the guest is in shadow mode).
>   */
> -u32 tlbflush_timestamp;
> +uint32_t tlbflush_timestamp;
>  
>  /*
>   * When PGT_partial is true then the first two fields are valid and
> @@ -284,8 +284,8 @@ struct page_info
>   *   in use.
>   */
>  struct {
> -u16 nr_validated_ptes:PAGETABLE_ORDER + 1;
> -u16 :16 - PAGETABLE_ORDER - 1 - 1;
> +uint16_t nr_validated_ptes:PAGETABLE_ORDER + 1;
> +uint16_t :16 - PAGETABLE_ORDER - 1 - 1;
>  uint16_t partial_flags:1;

... fixed width types are really needed here; afaict unsigned int ought to do.

>  int16_t linear_pt_count;

It's only here where the fixed width type is largely needed (or alternatively

signed int linear_pt_count:16;

).

Jan



Re: [PATCH] x86/setup: Make setup.h header self contained

2024-10-30 Thread Jan Beulich
On 30.10.2024 12:15, Frediano Ziglio wrote:
> On Wed, Oct 30, 2024 at 10:59 AM Jan Beulich  wrote:
>>
>> On 30.10.2024 11:44, Frediano Ziglio wrote:
>>> The header uses the rangeset structure typedef, whose definition
>>> is not included.
>>
>> And it doesn't need to be. For
>>
>> int remove_xen_ranges(struct rangeset *r);
>>
>> we don't need ...
>>
>>> --- a/xen/arch/x86/include/asm/setup.h
>>> +++ b/xen/arch/x86/include/asm/setup.h
>>> @@ -2,6 +2,7 @@
>>>  #define __X86_SETUP_H_
>>>
>>>  #include 
>>> +#include 
>>>  #include 
>>>
>>>  extern const char __2M_text_start[], __2M_text_end[];
>>
>> ... this, a mere
>>
>> struct rangeset;
>>
>> forward decl will suffice.
>>
>> Jan
>>
> 
> It's true, but for the same reason, we could avoid including
> "xen/multiboot.h" and use "struct module" instead of "module_t".

Indeed. I'd even question the need for that typedef.

Jan



[PATCH] x86/mm: ensure L2 is always freed if empty

2024-10-30 Thread Roger Pau Monne
The current logic in modify_xen_mappings() allows for fully empty L2 tables to
not be freed and unhooked from the parent L3 if the last L2 slot is not
populated.

Ensure that even when an L2 slot is empty the logic to check whether the whole
L2 can be removed is not skipped.

Signed-off-by: Roger Pau Monné 
---
I've attempted to find a Fixes tag for this one, but I'm afraid there have been
many changes in the function, and it's possibly the code that introduced the L2
freeing (4376c05c31132) that originally failed to adjust this case.
---
 xen/arch/x86/mm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index d537a799bced..0f53dcebad95 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5717,7 +5717,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, 
unsigned int nf)
 
 v += 1UL << L2_PAGETABLE_SHIFT;
 v &= ~((1UL << L2_PAGETABLE_SHIFT) - 1);
-continue;
+goto check_l3;
 }
 
 if ( l2e_get_flags(*pl2e) & _PAGE_PSE )
-- 
2.46.0




Re: [PATCH v3 0/5] x86/xen: Drop absolute references from startup code

2024-10-30 Thread Ard Biesheuvel
On Tue, 29 Oct 2024 at 13:54, Jürgen Groß  wrote:
>
> On 29.10.24 13:50, Ard Biesheuvel wrote:
> > On Wed, 9 Oct 2024 at 18:09, Ard Biesheuvel  wrote:
> >>
..
> >
> > Ping?
>
> I have queued this series for 6.13.
>

Thanks!



Re: [PATCH] x86/cpu-policy: Extend the guest max policy max leaf/subleaves

2024-10-30 Thread Andrew Cooper
On 30/10/2024 11:03 am, Roger Pau Monné wrote:
> On Wed, Oct 30, 2024 at 10:39:12AM +, Andrew Cooper wrote:
>> On 30/10/2024 8:59 am, Roger Pau Monné wrote:
>>> On Tue, Oct 29, 2024 at 05:55:05PM +, Andrew Cooper wrote:
 diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
 index b6d9fad56773..78bc9872b09a 100644
 --- a/xen/arch/x86/cpu-policy.c
 +++ b/xen/arch/x86/cpu-policy.c
 @@ -391,6 +391,27 @@ static void __init calculate_host_policy(void)
  p->platform_info.cpuid_faulting = cpu_has_cpuid_faulting;
  }
  
 +/*
 + * Guest max policies can have any max leaf/subleaf within bounds.
 + *
 + * - Some incoming VMs have a larger-than-necessary feat max_subleaf.
 + * - Some VMs we'd like to synthesise leaves not present on the host.
 + */
 +static void __init guest_common_max_leaves(struct cpu_policy *p)
 +{
 +p->basic.max_leaf   = ARRAY_SIZE(p->basic.raw) - 1;
 +p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
 +p->extd.max_leaf= 0x8000U + ARRAY_SIZE(p->extd.raw) - 1;
 +}
 +
 +/* Guest default policies inherit the host max leaf/subleaf settings. */
 +static void __init guest_common_default_leaves(struct cpu_policy *p)
 +{
 +p->basic.max_leaf   = host_cpu_policy.basic.max_leaf;
 +p->feat.max_subleaf = host_cpu_policy.feat.max_subleaf;
 +p->extd.max_leaf= host_cpu_policy.extd.max_leaf;
 +}
>>> I think this what I'm going to ask is future work.  After the
>>> modifications done to the host policy by max functions
>>> (calculate_{hvm,pv}_max_policy()) won't the max {sub,}leaf adjustments
>>> better be done taking into account the contents of the policy, rather
>>> than capping to the host values?
>>>
>>> (note this comment is strictly for guest_common_default_leaves(), the
>>> max version is fine using ARRAY_SIZE).
>> I'm afraid I don't follow.
>>
>> calculate_{pv,hvm}_max_policy() don't modify the host policy.
> Hm, I don't think I've expressed myself clearly, sorry.  Let me try
> again.
>
> calculate_{hvm,pv}_max_policy() extends the host policy by possibly
> setting new features, and such extended policy is then used as the
> base for the PV/HVM default policies.
>
> Won't the resulting policy in calculate_{hvm,pv}_def_policy() risks
> having bits set past the max {sub,}leaf in the host policy, as it's
> based in {hvm,pv}_def_cpu_policy that might have such bits set?

Oh, right.

This patch doesn't change anything WRT that.

But I think you're right that we do risk getting into that case (in
principle at least) because of how guest_common_*_feature_adjustment() work.

Furthermore, the bug will typically get hidden because we serialise
based on the max_leaf/subleaf, and will discard feature words outside of
the max_leaf/subleaf bounds.

I suppose we probably want a variation of x86_cpu_featureset_to_policy()
which extends the max_leaf/subleaf based on non-zero values in leaves. 
(This already feels like it's going to be an ugly algorithm.)

~Andrew



Re: [PATCH v7 05/10] tools/libacpi: Use LUT of APIC IDs rather than function pointer

2024-10-30 Thread Andrew Cooper
On 21/10/2024 4:45 pm, Alejandro Vallejo wrote:
> diff --git a/tools/include/xenguest.h b/tools/include/xenguest.h
> index e01f494b772a..aa50b78dfb89 100644
> --- a/tools/include/xenguest.h
> +++ b/tools/include/xenguest.h
> @@ -22,6 +22,8 @@
>  #ifndef XENGUEST_H
>  #define XENGUEST_H
>  
> +#include "xen/hvm/hvm_info_table.h"
> +
>  #define XC_NUMA_NO_NODE   (~0U)
>  
>  #define XCFLAGS_LIVE  (1 << 0)
> @@ -236,6 +238,9 @@ struct xc_dom_image {
>  #if defined(__i386__) || defined(__x86_64__)
>  struct e820entry *e820;
>  unsigned int e820_entries;
> +
> +/* LUT mapping cpu id to (x2)APIC ID */
> +uint32_t cpu_to_apicid[HVM_MAX_VCPUS];

Same note as the previous patch.

This needs to be a plain dynamically allocated array, because it mustn't
use HVM_MAX_VCPUS.

~Andrew



Re: [PATCH v2] x86/setup: Make setup.h header self contained

2024-10-30 Thread Andrew Cooper
On 30/10/2024 2:13 pm, Frediano Ziglio wrote:
> The header uses rangeset structure pointer.
> Forward declare the structure.
>
> Signed-off-by: Frediano Ziglio 

Acked-by: Andrew Cooper 



Re: [PATCH] scripts: Fix git-checkout.sh to work with branches other than master

2024-10-30 Thread Stefano Stabellini
On Wed, 30 Oct 2024, Andrew Cooper wrote:
> Xen uses master for QEMU_UPSTREAM_REVISION, and has done for other trees too
> in the past.  Apparently we've never specified a different branch, because the
> git-clone rune only pulls in the master branch; it does not pull in diverging
> branches.  Fix this by stating which branch/tag is wanted.
> 
> $TAG is really a committish, and git-clone's -b/--branch takes a committish
> too.
> 
> Signed-off-by: Andrew Cooper 

Well spotted!

Reviewed-by: Stefano Stabellini 


> ---
> CC: Anthony PERARD 
> CC: Juergen Gross 
> CC: Jan Beulich 
> CC: Stefano Stabellini 
> CC: Julien Grall 
> 
> Found while trying to test the MiniOS fixes in Gitlab.  This is a example run
> with most stubdom builds passing:
> 
>   https://gitlab.com/xen-project/people/andyhhp/xen/-/pipelines/1520611847
> 
> and all of them successfully cloned.
> ---
>  scripts/git-checkout.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/scripts/git-checkout.sh b/scripts/git-checkout.sh
> index fd4425ac4ee8..3796cbfe39a7 100755
> --- a/scripts/git-checkout.sh
> +++ b/scripts/git-checkout.sh
> @@ -14,7 +14,7 @@ set -e
>  if test \! -d $DIR-remote; then
>   rm -rf $DIR-remote $DIR-remote.tmp
>   mkdir -p $DIR-remote.tmp; rmdir $DIR-remote.tmp
> - $GIT clone $TREE $DIR-remote.tmp
> + $GIT clone -b $TAG $TREE $DIR-remote.tmp
>   if test "$TAG" ; then
>   cd $DIR-remote.tmp
>   $GIT branch -D dummy >/dev/null 2>&1 ||:
> 
> base-commit: bb7296d77f171c7bfbafab30ed51e9be29660e39
> -- 
> 2.39.5
> 
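For the record, the fixed rune is easy to demonstrate against a throwaway
local repository (a self-contained sketch, not our actual tree layout):

```shell
#!/bin/sh
set -e

tmp=$(mktemp -d)
cd "$tmp"

# Build a repository whose interesting history is NOT on the default branch.
git init -q src
git -C src -c user.email=a@b -c user.name=a commit -q --allow-empty -m base
git -C src checkout -q -b stable-4.19
git -C src -c user.email=a@b -c user.name=a commit -q --allow-empty -m fix

# -b/--branch takes a branch or tag and checks it out directly on clone.
git clone -q -b stable-4.19 src dst
git -C dst rev-parse --abbrev-ref HEAD
```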



Re: [PATCH -next v4 06/19] arm64: entry: Move arm64_preempt_schedule_irq() into exit_to_kernel_mode()

2024-10-30 Thread Jinjie Ruan



On 2024/10/29 22:52, Mark Rutland wrote:
> On Fri, Oct 25, 2024 at 06:06:47PM +0800, Jinjie Ruan wrote:
>> Move arm64_preempt_schedule_irq() into exit_to_kernel_mode(), so that not
>> only __el1_irq() but every kernel-mode IRQ return gets a chance to
>> reschedule.
> 
> We use exit_to_kernel_mode() for every non-NMI exception return to the
> kernel, not just IRQ returns.

Yes, it is not only IRQ but also other non-NMI exceptions; will update it.

> 
>> As Mark pointed out, this change will have the following key impact:
>>
>> "We'll preempt even without taking a "real" interrupt. That
>> shouldn't result in preemption that wasn't possible before,
>> but it does change the probability of preempting at certain points,
>> and might have a performance impact, so probably warrants a
>> benchmark."
> 
> For anyone following along at home, I said that at:
> 
>   https://lore.kernel.org/linux-arm-kernel/ZxejvAmccYMTa4P1@J2N7QTR9R3/
> 
> ... and there I specifically said:

Thank you!

This one and the next patch will be merged as you suggested.

I would have thought it would be clearer to put it in
__exit_to_kernel_mode() first and then move it into the interrupts-enabled
block as a second step.

> 
>> I's suggest you first write a patch to align arm64's entry code with the
>> generic code, by removing the call to arm64_preempt_schedule_irq() from
>> __el1_irq(), and adding a call to arm64_preempt_schedule_irq() in
>> __exit_to_kernel_mode(), e.g.
>>
>> | static __always_inline void __exit_to_kernel_mode(struct pt_regs *regs)
>> | {
>> |...
>> |if (interrupts_enabled(regs)) {
>> |...
>> |if (regs->exit_rcu) {
>> |...
>> |}
>> |...
>> |arm64_preempt_schedule_irq();
>> |...
>> |} else {
>> |...
>> |}
>> | }
> 
> [...]
> 
>> +#ifdef CONFIG_PREEMPT_DYNAMIC
>> +DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
>> +#define need_irq_preemption() \
>> +(static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
>> +#else
>> +#define need_irq_preemption()   (IS_ENABLED(CONFIG_PREEMPTION))
>> +#endif
>> +
>> +static void __sched arm64_preempt_schedule_irq(void)
>> +{
>> +if (!need_irq_preemption())
>> +return;
>> +
>> +/*
>> + * Note: thread_info::preempt_count includes both thread_info::count
>> + * and thread_info::need_resched, and is not equivalent to
>> + * preempt_count().
>> + */
>> +if (READ_ONCE(current_thread_info()->preempt_count) != 0)
>> +return;
>> +
>> +/*
>> + * DAIF.DA are cleared at the start of IRQ/FIQ handling, and when GIC
>> + * priority masking is used the GIC irqchip driver will clear DAIF.IF
>> + * using gic_arch_enable_irqs() for normal IRQs. If anything is set in
>> + * DAIF we must have handled an NMI, so skip preemption.
>> + */
>> +if (system_uses_irq_prio_masking() && read_sysreg(daif))
>> +return;
>> +
>> +/*
>> + * Preempting a task from an IRQ means we leave copies of PSTATE
>> + * on the stack. cpufeature's enable calls may modify PSTATE, but
>> + * resuming one of these preempted tasks would undo those changes.
>> + *
>> + * Only allow a task to be preempted once cpufeatures have been
>> + * enabled.
>> + */
>> +if (system_capabilities_finalized())
>> +preempt_schedule_irq();
>> +}
>> +
>>  /*
>>   * Handle IRQ/context state management when exiting to kernel mode.
>>   * After this function returns it is not safe to call regular kernel code,
>> @@ -72,6 +114,8 @@ static noinstr irqentry_state_t 
>> enter_from_kernel_mode(struct pt_regs *regs)
>>  static void noinstr exit_to_kernel_mode(struct pt_regs *regs,
>>  irqentry_state_t state)
>>  {
>> +arm64_preempt_schedule_irq();
> 
> This is broken; exit_to_kernel_mode() is called for any non-NMI return
> excpetion return to the kernel, and this doesn't check that interrupts
> were enabled in the context the exception was taken from.
> 
> This will preempt in cases where we should not, e.g. if we WARN() in a
> section with IRQs disabled.
> 
> Mark.
> 



Re: [PATCH 00/13] Remove implicit devres from pci_intx()

2024-10-30 Thread Bjorn Helgaas
On Tue, Oct 15, 2024 at 08:51:10PM +0200, Philipp Stanner wrote:
> @Driver-Maintainers: Your driver might be touched by patch "Remove
> devres from pci_intx()". You might want to take a look.
> 
> Changes since the RFC [1]:
>   - Add a patch deprecating pci{m}_intx(). (Heiner, Andy, Me)
>   - Add Acked-by's already given.
>   - Export pcim_intx() as a GPL function. (Alex)
>   - Drop patch for rts5280, since this driver will be removed quite
> soon. (Philipp Hortmann, Greg)
>   - Use early-return in pci_intx_unmanaged() and pci_intx(). (Andy)
> 
> Hi all,
> 
> this series removes a problematic feature from pci_intx(). That function
> sometimes implicitly uses devres for automatic cleanup. We should get
> rid of this implicit behavior.
> 
> To do so, a pci_intx() version that is always-managed, and one that is
> never-managed are provided. Then, all pci_intx() users are ported to the
> version they need. Afterwards, pci_intx() can be cleaned up and the
> users of the never-managed version be ported back to pci_intx().
> 
> This way we'd get this PCI API consistent again.
> 
> Patch "Remove devres from pci_intx()" obviously reverts the previous
> patches that made drivers use pci_intx_unmanaged(). But this way it's
> easier to review and approve. It also makes sure that each checked out
> commit should provide correct behavior, not just the entire series as a
> whole.
> 
> Merge plan for this is to enter through the PCI tree.
> 
> [1] https://lore.kernel.org/all/20241009083519.10088-1-pstan...@redhat.com/

I *think* this series depends on resolution of Takashi's "Restore the
original INTX_DISABLE bit by pcim_intx()" patch [2], right?

For now I'm postponing this series, but let me know if that's not the
right thing.

[2] https://lore.kernel.org/r/20241024155539.19416-1-ti...@suse.de

> Philipp Stanner (13):
>   PCI: Prepare removing devres from pci_intx()
>   ALSA: hda_intel: Use always-managed version of pcim_intx()
>   drivers/xen: Use never-managed version of pci_intx()
>   net/ethernet: Use never-managed version of pci_intx()
>   net/ntb: Use never-managed version of pci_intx()
>   misc: Use never-managed version of pci_intx()
>   vfio/pci: Use never-managed version of pci_intx()
>   PCI: MSI: Use never-managed version of pci_intx()
>   ata: Use always-managed version of pci_intx()
>   wifi: qtnfmac: use always-managed version of pcim_intx()
>   HID: amd_sfh: Use always-managed version of pcim_intx()
>   Remove devres from pci_intx()
>   PCI: Deprecate pci_intx(), pcim_intx()
> 
>  drivers/ata/ahci.c|  2 +-
>  drivers/ata/ata_piix.c|  2 +-
>  drivers/ata/pata_rdc.c|  2 +-
>  drivers/ata/sata_sil24.c  |  2 +-
>  drivers/ata/sata_sis.c|  2 +-
>  drivers/ata/sata_uli.c|  2 +-
>  drivers/ata/sata_vsc.c|  2 +-
>  drivers/hid/amd-sfh-hid/amd_sfh_pcie.c|  4 +--
>  drivers/hid/amd-sfh-hid/sfh1_1/amd_sfh_init.c |  2 +-
>  .../wireless/quantenna/qtnfmac/pcie/pcie.c|  2 +-
>  drivers/pci/devres.c  | 29 +--
>  drivers/pci/pci.c | 19 
>  include/linux/pci.h   |  1 +
>  sound/pci/hda/hda_intel.c |  2 +-
>  14 files changed, 26 insertions(+), 47 deletions(-)
> 
> -- 
> 2.47.0
> 



Re: [PATCH -next v4 01/19] arm64: ptrace: Replace interrupts_enabled() with regs_irqs_disabled()

2024-10-30 Thread Jinjie Ruan



On 2024/10/29 22:19, Mark Rutland wrote:
> On Fri, Oct 25, 2024 at 06:06:42PM +0800, Jinjie Ruan wrote:
>> Implement regs_irqs_disabled(), and replace interrupts_enabled() macro
>> with regs_irqs_disabled() all over the place.
>>
>> No functional changes.
>>
> 
> Please say why, e.g.
> 
> | The generic entry code expects architecture code to provide
> | regs_irqs_disabled(regs), but arm64 does not have this and provides
> | interrupts_enabled(regs), which has the opposite polarity.
> | 
> | In preparation for moving arm64 over to the generic entry code,
> | replace arm64's interrupts_enabled() with regs_irqs_disabled() and
> | update its callers under arch/arm64.
> |
> | For the moment, a definition of interrupts_enabled() is provided for
> | the GICv3 driver. Once arch/arm implement regs_irqs_disabled(), this
> | can be removed.
> 

Thank you! Will expand the commit message and describe the cause of the
patch also for other patches.

>> Suggested-by: Mark Rutland 
>> Signed-off-by: Jinjie Ruan 
>> ---
> 
> [...]
> 
>>  arch/arm/include/asm/ptrace.h   | 4 ++--
>>  arch/arm/kernel/hw_breakpoint.c | 2 +-
>>  arch/arm/kernel/process.c   | 2 +-
>>  arch/arm/mm/alignment.c | 2 +-
>>  arch/arm/mm/fault.c | 2 +-
> 
>>  drivers/irqchip/irq-gic-v3.c| 2 +-
> 
> I hadn't realised that the GICv3 driver was using this and hence we'd
> need to update a few places in arch/arm at the same time. Please update
> just the arch/arm64 bits, and add:
> 
> | /* 
> |  * Used by the GICv3 driver, can be removed once arch/arm implements
> |  * regs_irqs_disabled() directly.
> |  */
> | #define interrupts_enabled(regs)(!regs_irqs_disabled(regs))
> 
> ... and then once 32-bit arm implements this we can update the GIC
> driver and remove the architecture definitions.
> 
> That way we avoid the risk of conflicts with 32-bit arm.
> 
> Mark.
> 



Re: [PATCH -next v4 02/19] arm64: entry: Refactor the entry and exit for exceptions from EL1

2024-10-30 Thread Jinjie Ruan



On 2024/10/29 22:33, Mark Rutland wrote:
> On Fri, Oct 25, 2024 at 06:06:43PM +0800, Jinjie Ruan wrote:
>> These changes refactor the entry and exit routines for the exceptions
>> from EL1. They store the RCU and lockdep state in a struct
>> irqentry_state variable on the stack, rather than recording them
>> in the fields of pt_regs, since it is safe enough for these contexts.
> 
> In general, please descirbe *why* we want to make the change first, e.g.
> 
> | The generic entry code uses irqentry_state_t to track lockdep and RCU
> | state across exception entry and return. For historical reasons, arm64
> | embeds similar fields within its pt_regs structure.
> |
> | In preparation for moving arm64 over to the generic entry code, pull
> | these fields out of arm64's pt_regs, and use a seperate structure,
> | matching the style of the generic entry code.
> 
>> Before:
>>  struct pt_regs {
>>  ...
>>  u64 lockdep_hardirqs;
>>  u64 exit_rcu;
>>  }
>>
>>  enter_from_kernel_mode(regs);
>>  ...
>>  exit_to_kernel_mode(regs);
>>
>> After:
>>  typedef struct irqentry_state {
>>  union {
>>  boolexit_rcu;
>>  boollockdep;
>>  };
>>  } irqentry_state_t;
>>
>>  irqentry_state_t state = enter_from_kernel_mode(regs);
>>  ...
>>  exit_to_kernel_mode(regs, state);
> 
> I don't think this part is necessary.

Thank you, will remove it and explain why.

> 
>>
>> No functional changes.
>>
>> Suggested-by: Mark Rutland 
>> Signed-off-by: Jinjie Ruan 
>> ---
>>  arch/arm64/include/asm/ptrace.h  |  11 ++-
>>  arch/arm64/kernel/entry-common.c | 129 +++
>>  2 files changed, 85 insertions(+), 55 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/ptrace.h 
>> b/arch/arm64/include/asm/ptrace.h
>> index 3e5372a98da4..5156c0d5fa20 100644
>> --- a/arch/arm64/include/asm/ptrace.h
>> +++ b/arch/arm64/include/asm/ptrace.h
>> @@ -149,6 +149,13 @@ static inline unsigned long pstate_to_compat_psr(const 
>> unsigned long pstate)
>>  return psr;
>>  }
>>  
>> +typedef struct irqentry_state {
>> +union {
>> +boolexit_rcu;
>> +boollockdep;
>> +};
>> +} irqentry_state_t;
> 
> AFAICT this can be moved directly into arch/arm64/kernel/entry-common.c.
> 
>> +
>>  /*
>>   * This struct defines the way the registers are stored on the stack during 
>> an
>>   * exception. struct user_pt_regs must form a prefix of struct pt_regs.
>> @@ -169,10 +176,6 @@ struct pt_regs {
>>  
>>  u64 sdei_ttbr1;
>>  struct frame_record_meta stackframe;
>> -
>> -/* Only valid for some EL1 exceptions. */
>> -u64 lockdep_hardirqs;
>> -u64 exit_rcu;
>>  };
>>  
>>  /* For correct stack alignment, pt_regs has to be a multiple of 16 bytes. */
>> diff --git a/arch/arm64/kernel/entry-common.c 
>> b/arch/arm64/kernel/entry-common.c
>> index c547e70428d3..68a9aecacdb9 100644
>> --- a/arch/arm64/kernel/entry-common.c
>> +++ b/arch/arm64/kernel/entry-common.c
>> @@ -36,29 +36,36 @@
>>   * This is intended to match the logic in irqentry_enter(), handling the 
>> kernel
>>   * mode transitions only.
>>   */
>> -static __always_inline void __enter_from_kernel_mode(struct pt_regs *regs)
>> +static __always_inline irqentry_state_t __enter_from_kernel_mode(struct 
>> pt_regs *regs)
>>  {
>> -regs->exit_rcu = false;
>> +irqentry_state_t ret = {
>> +.exit_rcu = false,
>> +};
> 
> I realise that the generic entry code calls this 'ret' in
> irqentry_enter() and similar, but could we please use 'state'
> consistently in the arm64 code?
> 
> [...]
> 
>>  /*
>> @@ -190,9 +199,11 @@ asmlinkage void noinstr asm_exit_to_user_mode(struct 
>> pt_regs *regs)
>>   * mode. Before this function is called it is not safe to call regular 
>> kernel
>>   * code, instrumentable code, or any code which may trigger an exception.
>>   */
>> -static void noinstr arm64_enter_nmi(struct pt_regs *regs)
>> +static noinstr irqentry_state_t arm64_enter_nmi(struct pt_regs *regs)
>>  {
>> -regs->lockdep_hardirqs = lockdep_hardirqs_enabled();
>> +irqentry_state_t irq_state;
> 
> Likewise, please use 'state' rather than 'irq_state'.
> 
> In future we should probably have a separate structure for the NMI
> paths, and get rid of the union, which would avoid the possiblity of
> using mismatched helpers.
> 
> Mark.
> 
> 



Re: [PATCH -next v4 03/19] arm64: entry: Remove __enter_from_user_mode()

2024-10-30 Thread Jinjie Ruan



On 2024/10/29 22:42, Mark Rutland wrote:
> On Fri, Oct 25, 2024 at 06:06:44PM +0800, Jinjie Ruan wrote:
>> The __enter_from_user_mode() is only called by enter_from_user_mode(),
>> so fold it into enter_from_user_mode().
> 
> As with the next two patches, all the __enter_from_*() and __exit_to_*()
> are supposed to handle the raw entry, closely matching the generic code,
> and the non-underscored enter_from_*() and exit_to_*() functions are
> supposed to be wrappers that handle (possibly instrumentable)

Sure, the __enter_from_*() and __exit_to_*() helpers are all about the
generic code, and the enter_from_*() and exit_to_*() wrappers include the
arm64-specific MTE checks.

> arm64-specific post-entry and pre-exit logic.
> 
> I would prefer to keep that split, even though enter_from_user_mode() is
> a trivial wrapper.
> 
> Am I missing some reason we must remove the wrappers?

It is not necessary to remove these functions; I just found them by chance
and cleaned them up along the way. Originally I thought that removing the
underscored functions might make the relative order of the MTE functions
look clearer.

> 
> Mark.
> 
>>
>> No functional changes.
>>
>> Signed-off-by: Jinjie Ruan 
>> ---
>>  arch/arm64/kernel/entry-common.c | 7 +--
>>  1 file changed, 1 insertion(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/entry-common.c 
>> b/arch/arm64/kernel/entry-common.c
>> index 68a9aecacdb9..ccf59b44464d 100644
>> --- a/arch/arm64/kernel/entry-common.c
>> +++ b/arch/arm64/kernel/entry-common.c
>> @@ -109,7 +109,7 @@ static void noinstr exit_to_kernel_mode(struct pt_regs 
>> *regs,
>>   * Before this function is called it is not safe to call regular kernel 
>> code,
>>   * instrumentable code, or any code which may trigger an exception.
>>   */
>> -static __always_inline void __enter_from_user_mode(void)
>> +static __always_inline void enter_from_user_mode(struct pt_regs *regs)
>>  {
>>  lockdep_hardirqs_off(CALLER_ADDR0);
>>  CT_WARN_ON(ct_state() != CT_STATE_USER);
>> @@ -118,11 +118,6 @@ static __always_inline void __enter_from_user_mode(void)
>>  mte_disable_tco_entry(current);
>>  }
>>  
>> -static __always_inline void enter_from_user_mode(struct pt_regs *regs)
>> -{
>> -__enter_from_user_mode();
>> -}
>> -
>>  /*
>>   * Handle IRQ/context state management when exiting to user mode.
>>   * After this function returns it is not safe to call regular kernel code,
>> -- 
>> 2.34.1
>>
> 



Re: [PATCH -next v4 04/19] arm64: entry: Remove __enter_from_kernel_mode()

2024-10-30 Thread Jinjie Ruan



On 2024/10/29 22:37, Mark Rutland wrote:
> On Fri, Oct 25, 2024 at 06:06:45PM +0800, Jinjie Ruan wrote:
>> The __enter_from_kernel_mode() is only called by enter_from_kernel_mode(),
>> remove it.
> 
> The point of this split is to cleanly separate the raw entry logic (in
> __enter_from_kernel_mode() from pieces that run later and can safely be
> instrumented (later in enter_from_kernel_mode()).

Hi, Mark,

I reviewed your commit bc29b71f53b1 ("arm64: entry: clarify entry/exit
helpers"); keeping these functions makes the instrumentation boundaries
clearer, so I will not change them.

> 
> I had expected that a later patch would replace
> __enter_from_kernel_mode() with the generic equivalent, leaving
> enter_from_kernel_mode() unchanged. It looks like patch 16 could do that
> without this patch being necessary -- am I missing something?

Yes, you are right! These unnecessary cleanup patches will be removed.

And when switching to the generic syscall code, I found that proper
refactoring would also make the transition clearer.

Thank you.

> 
> Mark.
> 
>>
>> No functional changes.
>>
>> Signed-off-by: Jinjie Ruan 
>> ---
>>  arch/arm64/kernel/entry-common.c | 9 +
>>  1 file changed, 1 insertion(+), 8 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/entry-common.c 
>> b/arch/arm64/kernel/entry-common.c
>> index ccf59b44464d..a7fd4d6c7650 100644
>> --- a/arch/arm64/kernel/entry-common.c
>> +++ b/arch/arm64/kernel/entry-common.c
>> @@ -36,7 +36,7 @@
>>   * This is intended to match the logic in irqentry_enter(), handling the 
>> kernel
>>   * mode transitions only.
>>   */
>> -static __always_inline irqentry_state_t __enter_from_kernel_mode(struct 
>> pt_regs *regs)
>> +static noinstr irqentry_state_t enter_from_kernel_mode(struct pt_regs *regs)
>>  {
>>  irqentry_state_t ret = {
>>  .exit_rcu = false,
>> @@ -55,13 +55,6 @@ static __always_inline irqentry_state_t 
>> __enter_from_kernel_mode(struct pt_regs
>>  rcu_irq_enter_check_tick();
>>  trace_hardirqs_off_finish();
>>  
>> -return ret;
>> -}
>> -
>> -static noinstr irqentry_state_t enter_from_kernel_mode(struct pt_regs *regs)
>> -{
>> -irqentry_state_t ret = __enter_from_kernel_mode(regs);
>> -
>>  mte_check_tfsr_entry();
>>  mte_disable_tco_entry(current);
>>  
>> -- 
>> 2.34.1
>>
> 



Re: [PATCH] x86/cpu-policy: Extend the guest max policy max leaf/subleaves

2024-10-30 Thread Jan Beulich
On 29.10.2024 18:55, Andrew Cooper wrote:
> We already have one migration case opencoded (feat.max_subleaf).  A more
> recent discovery is that we advertise x2APIC to guests without ensuring that
> we provide max_leaf >= 0xb.
> 
> In general, any leaf known to Xen can be safely configured by the toolstack if
> it doesn't violate other constraints.
> 
> Therefore, introduce guest_common_{max,default}_leaves() to generalise the
> special case we currently have for feat.max_subleaf, in preparation to be able
> to provide x2APIC topology in leaf 0xb even on older hardware.
> 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 

I'll have to update the AMX logic accordingly (maybe also the AVX10 one).

I'd like to point out that this highlights a naming anomaly in
x86_cpu_policies_are_compatible(): update_domain_cpu_policy() passes in
the respective max policy as first argument. Imo the first parameter of
the function would better be named "max" there.

> --- a/xen/arch/x86/cpu-policy.c
> +++ b/xen/arch/x86/cpu-policy.c
> @@ -391,6 +391,27 @@ static void __init calculate_host_policy(void)
>  p->platform_info.cpuid_faulting = cpu_has_cpuid_faulting;
>  }
>  
> +/*
> + * Guest max policies can have any max leaf/subleaf within bounds.
> + *
> + * - Some incoming VMs have a larger-than-necessary feat max_subleaf.
> + * - Some VMs we'd like to synthesise leaves not present on the host.
> + */
> +static void __init guest_common_max_leaves(struct cpu_policy *p)
> +{
> +p->basic.max_leaf   = ARRAY_SIZE(p->basic.raw) - 1;
> +p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
> +p->extd.max_leaf= 0x8000U + ARRAY_SIZE(p->extd.raw) - 1;
> +}
> +
> +/* Guest default policies inherit the host max leaf/subleaf settings. */
> +static void __init guest_common_default_leaves(struct cpu_policy *p)
> +{
> +p->basic.max_leaf   = host_cpu_policy.basic.max_leaf;
> +p->feat.max_subleaf = host_cpu_policy.feat.max_subleaf;
> +p->extd.max_leaf= host_cpu_policy.extd.max_leaf;
> +}

Which sadly still leaves open how to suitably shrink the max values,
when they're larger than necessary (for the guest).

Jan



Re: [PATCH] x86/cpu-policy: Extend the guest max policy max leaf/subleaves

2024-10-30 Thread Roger Pau Monné
On Tue, Oct 29, 2024 at 05:55:05PM +, Andrew Cooper wrote:
> We already have one migration case opencoded (feat.max_subleaf).  A more
> recent discovery is that we advertise x2APIC to guests without ensuring that
> we provide max_leaf >= 0xb.
> 
> In general, any leaf known to Xen can be safely configured by the toolstack if
> it doesn't violate other constraints.
> 
> Therefore, introduce guest_common_{max,default}_leaves() to generalise the
> special case we currently have for feat.max_subleaf, in preparation to be able
> to provide x2APIC topology in leaf 0xb even on older hardware.
> 
> Signed-off-by: Andrew Cooper 
> ---
> CC: Jan Beulich 
> CC: Roger Pau Monné 
> CC: Alejandro Vallejo 
> 
> On a KabyLake I have to hand, here's the delta in what xen-cpuid -p reports:
> 
>   git diff --no-index xen-cpuid-p-{before,after}.log
>   diff --git a/xen-cpuid-p-before.log b/xen-cpuid-p-after.log
>   index 5a76d05..24e22be 100644
>   --- a/xen-cpuid-p-before.log
>   +++ b/xen-cpuid-p-after.log
>   @@ -61,7 +61,7 @@ Host policy: 33 leaves, 2 MSRs
>  index-> value
>  00ce -> 8000
>  010a -> 0e000c04
>   -PV Max policy: 33 leaves, 2 MSRs
>   +PV Max policy: 58 leaves, 2 MSRs
> CPUID:
>  leaf subleaf  -> eax  ebx  ecx  edx
>  : -> 000d:756e6547:6c65746e:49656e69
>   @@ -75,7 +75,7 @@ PV Max policy: 33 leaves, 2 MSRs
>  000d: -> 0007::0340:
>  000d:0001 -> 0007:::
>  000d:0002 -> 0100:0240::
>   -  8000: -> 8008:::
>   +  8000: -> 8021:::
>  8001: -> ::0123:28100800
>  8002: -> 65746e49:2952286c:6f655820:2952286e
>  8003: -> 55504320:2d334520:30333231:20367620
>   @@ -87,7 +87,7 @@ PV Max policy: 33 leaves, 2 MSRs
>  index-> value
>  00ce -> 8000
>  010a -> 1c020004
>   -HVM Max policy: 35 leaves, 2 MSRs
>   +HVM Max policy: 60 leaves, 2 MSRs
> CPUID:
>  leaf subleaf  -> eax  ebx  ecx  edx
>  : -> 000d:756e6547:6c65746e:49656e69
>   @@ -103,7 +103,7 @@ HVM Max policy: 35 leaves, 2 MSRs
>  000d:0002 -> 0100:0240::
>  000d:0003 -> 0040:03c0::
>  000d:0004 -> 0040:0400::
>   -  8000: -> 8008:::
>   +  8000: -> 8021:::
>  8001: -> ::0123:2c100800
>  8002: -> 65746e49:2952286c:6f655820:2952286e
>  8003: -> 55504320:2d334520:30333231:20367620
> ---
>  xen/arch/x86/cpu-policy.c | 39 +--
>  1 file changed, 25 insertions(+), 14 deletions(-)
> 
> diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
> index b6d9fad56773..78bc9872b09a 100644
> --- a/xen/arch/x86/cpu-policy.c
> +++ b/xen/arch/x86/cpu-policy.c
> @@ -391,6 +391,27 @@ static void __init calculate_host_policy(void)
>  p->platform_info.cpuid_faulting = cpu_has_cpuid_faulting;
>  }
>  
> +/*
> + * Guest max policies can have any max leaf/subleaf within bounds.
> + *
> + * - Some incoming VMs have a larger-than-necessary feat max_subleaf.
> + * - Some VMs we'd like to synthesise leaves not present on the host.
> + */
> +static void __init guest_common_max_leaves(struct cpu_policy *p)
> +{
> +p->basic.max_leaf   = ARRAY_SIZE(p->basic.raw) - 1;
> +p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
> +p->extd.max_leaf= 0x8000U + ARRAY_SIZE(p->extd.raw) - 1;
> +}
> +
> +/* Guest default policies inherit the host max leaf/subleaf settings. */
> +static void __init guest_common_default_leaves(struct cpu_policy *p)
> +{
> +p->basic.max_leaf   = host_cpu_policy.basic.max_leaf;
> +p->feat.max_subleaf = host_cpu_policy.feat.max_subleaf;
> +p->extd.max_leaf= host_cpu_policy.extd.max_leaf;
> +}

I think this what I'm going to ask is future work.  After the
modifications done to the host policy by max functions
(calculate_{hvm,pv}_max_policy()) won't the max {sub,}leaf adjustments
better be done taking into account the contents of the policy, rather
than capping to the host values?

(note this comment is strictly for guest_common_default_leaves(), the
max version is fine using ARRAY_SIZE).

Thanks, Roger.