Re: [PATCH v6 08/11] xen/lib: Add topology generator for x86

2024-10-15 Thread Alejandro Vallejo
On Thu Oct 10, 2024 at 8:54 AM BST, Jan Beulich wrote:
> On 09.10.2024 19:57, Alejandro Vallejo wrote:
> > On Wed Oct 9, 2024 at 3:45 PM BST, Jan Beulich wrote:
> >> On 01.10.2024 14:38, Alejandro Vallejo wrote:
> >>> --- a/xen/lib/x86/policy.c
> >>> +++ b/xen/lib/x86/policy.c
> >>> @@ -2,6 +2,94 @@
> >>>  
> >>>  #include 
> >>>  
> >>> +static unsigned int order(unsigned int n)
> >>> +{
> >>> +ASSERT(n); /* clz(0) is UB */
> >>> +
> >>> +return 8 * sizeof(n) - __builtin_clz(n);
> >>> +}
> >>> +
> >>> +int x86_topo_from_parts(struct cpu_policy *p,
> >>> +unsigned int threads_per_core,
> >>> +unsigned int cores_per_pkg)
> >>> +{
> >>> +unsigned int threads_per_pkg = threads_per_core * cores_per_pkg;
> >>
> >> What about the (admittedly absurd) case of this overflowing?
> > 
> > Each of them individually could overflow the fields in which they are used.
> > 
> > Does returning EINVAL if either threads_per_core or cores_per_pkg overflow the
> > INTEL structure j
>
> The sentence looks unfinished, so I can only vaguely say that my answer to
> the question would likely be "yes".

It was indeed. Regardless, the number of bits available in Intel's cache
subleaves is rather limited, so I'll be clipping those to the maximum on
overflow and...

>
> >>> +switch ( p->x86_vendor )
> >>> +{
> >>> +case X86_VENDOR_INTEL: {
> >>> +struct cpuid_cache_leaf *sl = p->cache.subleaf;
> >>> +
> >>> +for ( size_t i = 0; sl->type &&
> >>> +i < ARRAY_SIZE(p->cache.raw); i++, sl++ )
> >>> +{
> >>> +sl->cores_per_package = cores_per_pkg - 1;
> >>> +sl->threads_per_cache = threads_per_core - 1;
> >>> +if ( sl->type == 3 /* unified cache */ )
> >>> +sl->threads_per_cache = threads_per_pkg - 1;
> >>
> >> I wasn't able to find documentation for this, well, anomaly. Can you please
> >> point me at where this is spelled out?
> > 
> > That's showing all unified caches as caches covering the whole package. We
> > could do it the other way around (but I don't want to reverse engineer what
> > the host policy says because that's irrelevant). There's nothing in the SDM
> > (AFAIK) forcing L2 or L3 to behave one way or another, so we get to choose.
> > I thought it more helpful to make all unified caches unified across the
> > package, to give more information in the leaf.
> > 
> > My own system exposes 2 unified caches (data trimmed for space):
> > 
> > ``` cpuid
> > 
> >deterministic cache parameters (4):
> >   --- cache 0 ---
> >   cache type = data cache (1)
> >   cache level= 0x1 (1)
> >   maximum IDs for CPUs sharing cache = 0x1 (1)
> >   maximum IDs for cores in pkg   = 0xf (15)
> >   --- cache 1 ---
> >   cache type = instruction cache (2)
> >   cache level= 0x1 (1)
> >   maximum IDs for CPUs sharing cache = 0x1 (1)
> >   maximum IDs for cores in pkg   = 0xf (15)
> >   --- cache 2 ---
> >   cache type = unified cache (3)
> >   cache level= 0x2 (2)
> >   maximum IDs for CPUs sharing cache = 0x1 (1)
>
> Note how this is different ...
>
> >   maximum IDs for cores in pkg   = 0xf (15)
> >   --- cache 3 ---
> >   cache type = unified cache (3)
> >   cache level= 0x3 (3)
> >   maximum IDs for CPUs sharing cache = 0x1f (31)
>
> ... from this, whereas your code would make it the same.
>
> Especially if this is something you do beyond / outside the spec, it imo
> needs reasoning about in fair detail in the description.

... given the risk of clipping, I'll get rid of that conditional too to make it
easier for a non-clipped number to be reported.
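
To make that concrete, the sketch below is roughly the clipping I have in mind
(sketch only, on top of the loop quoted above; the 6-bit and 12-bit widths are
my reading of leaf 4's EAX layout and want double-checking against the SDM):

```c
/* Sketch: clip rather than truncate on overflow (not the final patch). */
for ( size_t i = 0; sl->type && i < ARRAY_SIZE(p->cache.raw); i++, sl++ )
{
    /* Leaf 4 EAX[31:26]: max cores per package, minus 1 (6 bits). */
    sl->cores_per_package = MIN(cores_per_pkg - 1, 0x3fU);
    /* Leaf 4 EAX[25:14]: max threads sharing this cache, minus 1 (12 bits). */
    sl->threads_per_cache = MIN(threads_per_core - 1, 0xfffU);
}
```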

I'll write in the commit message the behaviour on overflow for these leaves.

>
> Jan

Cheers,
Alejandro



Re: [PATCH v6 09/11] xen/x86: Derive topologically correct x2APIC IDs from the policy

2024-10-09 Thread Alejandro Vallejo
On Wed Oct 9, 2024 at 3:53 PM BST, Jan Beulich wrote:
> On 01.10.2024 14:38, Alejandro Vallejo wrote:
> > Implements the helper for mapping vcpu_id to x2apic_id given a valid
> > topology in a policy. The algo is written with the intention of
> > extending it to leaves 0x1f and extended 0x26 in the future.
> > 
> > Toolstack doesn't set leaf 0xb and the HVM default policy has it
> > cleared, so the leaf is not implemented. In that case, the new helper
> > just returns the legacy mapping.
>
> Is the first sentence of this latter paragraph missing an "If" or "When"
> at the beginning? As written I'm afraid I can't really make sense of it.
>
> Jan

It's a statement of current affairs. Could be rewritten as...

   The helper returns the legacy mapping when leaf 0xb is not implemented (as
   is the case at the moment).

Does that look better?

Cheers,
Alejandro



Re: [PATCH v6 01/11] lib/x86: Relax checks about policy compatibility

2024-10-09 Thread Alejandro Vallejo
Hi,

On Wed Oct 9, 2024 at 10:40 AM BST, Jan Beulich wrote:
> On 01.10.2024 14:37, Alejandro Vallejo wrote:
> > --- a/xen/lib/x86/policy.c
> > +++ b/xen/lib/x86/policy.c
> > @@ -15,7 +15,16 @@ int x86_cpu_policies_are_compatible(const struct 
> > cpu_policy *host,
> >  #define FAIL_MSR(m) \
> >  do { e.msr = (m); goto out; } while ( 0 )
> >  
> > -if ( guest->basic.max_leaf > host->basic.max_leaf )
> > +/*
> > + * Old AMD hardware doesn't expose topology information in leaf 0xb. We
> > + * want to emulate that leaf with credible information because it must be
> > + * present on systems in which we emulate the x2APIC.
> > + *
> > + * For that reason, allow the max basic guest leaf to be larger than the
> > + * hosts' up until 0xb.
> > + */
> > +if ( guest->basic.max_leaf > 0xb &&
> > + guest->basic.max_leaf > host->basic.max_leaf )
> >  FAIL_CPUID(0, NA);
> >  
> >  if ( guest->feat.max_subleaf > host->feat.max_subleaf )
>
> I'm concerned by this in multiple ways:
>
> 1) It's pretty ad hoc, and hence doesn't make clear how to deal with similar
> situations in the future.

I agree. I don't have a principled suggestion for how to deal with other cases
where we might have to bump the max leaf. It may be safe (as it is here because
everything below it is either used or unimplemented), but AFAIU some leaves
might be problematic to expose, even as zeroes. I suspect that's the problem
you hint at later on about AMX and AVX10?

>
> 2) Why would we permit going up to leaf 0xb when x2APIC is off in the
> respective leaf?

I assume you mean when the x2APIC is not emulated? One reason is to avoid a
migration barrier, as otherwise we couldn't migrate VMs created on "leaf
0xb"-capable hardware to non-"leaf 0xb"-capable hosts even though the
migration is perfectly safe.

Also, it's benign and simplifies everything. Otherwise we have to find out
during early creation not only whether the host has leaf 0xb, but also whether
we're emulating an x2APIC or not.

Furthermore, not doing this would actively prevent emulating an x2APIC on AMD
Lisbon-like silicon even though it's fine to do so. Note that we have a broken
invariant in existing code where the x2APIC is emulated and leaf 0xb is not
exposed at all; not even to show the x2APIC IDs.

>
> 3) We similarly force a higher extended leaf in order to accommodate the LFENCE-
> is-dispatch-serializing bit. Yet there's no similar extra logic there in the
> function here.

That's done on the host policy though, so there's no clash.

In calculate_host_policy()...

```
  /*
   * For AMD/Hygon hardware before Zen3, we unilaterally modify LFENCE to be
   * dispatch serialising for Spectre mitigations.  Extend max_extd_leaf
   * beyond what hardware supports, to include the feature leaf containing
   * this information.
   */
  if ( cpu_has_lfence_dispatch )
  max_extd_leaf = max(max_extd_leaf, 0x80000021U);
```

One could imagine doing the same for leaf 0xb and dropping this patch, but then
we'd have to synthesise something on that leaf for hardware that doesn't have
it, which is a lot more annoying.
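
For illustration, the sketch below (untested, using the policy field names from
this series) is roughly the shape that calculate_host_policy() alternative
would take, and why I find it more annoying than the guest-side relaxation:

```c
/* Hand-wavy sketch of the alternative, mirroring the LFENCE special case. */
p->basic.max_leaf = max(p->basic.max_leaf, 0xbU);

if ( !p->topo.subleaf[0].type )
{
    /* Pre-0xb hardware: we'd have to make up a flat SMT/core hierarchy. */
    p->topo.subleaf[0].nr_logical = 1;             /* threads per core */
    p->topo.subleaf[0].level = 0;
    p->topo.subleaf[0].type = 1;
    p->topo.subleaf[1].nr_logical = p->basic.lppp; /* threads per package */
    p->topo.subleaf[1].level = 1;
    p->topo.subleaf[1].type = 2;
}
```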

>
> 4) While there the guest vs host check won't matter, the situation with AMX and
> AVX10 leaves imo still wants considering here right away. IOW (taken together
> with at least 3) above) I think we need to first settle on a model for
> collectively all max (sub)leaf handling. That in particular needs to properly
> spell out who's responsible for what (tool stack vs Xen).

I'm not sure I follow. What's the situation with AMX and AVX10 that you refer
to? I'd assume that making ad-hoc decisions on this is pretty much unavoidable,
but maybe the solution to the problem you mention would highlight a more
general approach.

>
> Jan

Cheers,
Alejandro



Re: [PATCH v6 03/11] xen/x86: Add initial x2APIC ID to the per-vLAPIC save area

2024-10-09 Thread Alejandro Vallejo
Hi,

On Wed Oct 9, 2024 at 2:12 PM BST, Jan Beulich wrote:
> On 01.10.2024 14:37, Alejandro Vallejo wrote:
> > @@ -311,18 +310,15 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
> >  
> >  case 0xb:
> >  /*
> > - * In principle, this leaf is Intel-only.  In practice, it is tightly
> > - * coupled with x2apic, and we offer an x2apic-capable APIC emulation
> > - * to guests on AMD hardware as well.
> > - *
> > - * TODO: Rework topology logic.
> > + * Don't expose topology information to PV guests. Exposed on HVM
> > + * along with x2APIC because they are tightly coupled.
> >   */
> > -if ( p->basic.x2apic )
> > +if ( is_hvm_domain(d) && p->basic.x2apic )
>
> This change isn't mentioned at all in the description, despite it having the
> potential of introducing a (perceived) regression. See the comments near the
> top of calculate_pv_max_policy() and near the top of
> domain_cpu_policy_changed(). What's wrong with ...
>
> >  {
> >  *(uint8_t *)&res->c = subleaf;
> >  
> >  /* Fix the x2APIC identifier. */
> > -res->d = v->vcpu_id * 2;
> > +res->d = vlapic_x2apic_id(vcpu_vlapic(v));
>
> ...
>
> res->d = is_hvm_domain(d) ? vlapic_x2apic_id(vcpu_vlapic(v))
>   : v->vcpu_id * 2;
>
> ?

Hmmm. I haven't seen problems with PV guests, but that's a good point. While I
suspect no PV guest would use this value for anything relevant (seeing how
there's no actual APIC), handing out zeroes might still have bad consequences.

Sure, I'll amend it.

>
> > --- a/xen/arch/x86/hvm/vlapic.c
> > +++ b/xen/arch/x86/hvm/vlapic.c
> > @@ -1090,7 +1090,7 @@ static uint32_t x2apic_ldr_from_id(uint32_t id)
> >  static void set_x2apic_id(struct vlapic *vlapic)
> >  {
> >  const struct vcpu *v = vlapic_vcpu(vlapic);
> > -uint32_t apic_id = v->vcpu_id * 2;
> > +uint32_t apic_id = vlapic->hw.x2apic_id;
>
> Any reason you're open-coding vlapic_x2apic_id() here and ...
>
> > @@ -1470,7 +1470,7 @@ void vlapic_reset(struct vlapic *vlapic)
> >  if ( v->vcpu_id == 0 )
> >  vlapic->hw.apic_base_msr |= APIC_BASE_BSP;
> >  
> > -vlapic_set_reg(vlapic, APIC_ID, (v->vcpu_id * 2) << 24);
> > +vlapic_set_reg(vlapic, APIC_ID, SET_xAPIC_ID(vlapic->hw.x2apic_id));
>
> ... here?

Not a good one. vlapic_x2apic_id() exists mostly to allow self-contained
accesses from outside this translation unit. There's no harm in using the
accessor even inside, sure.

>
> Jan

Cheers,
Alejandro



Re: [PATCH v6 04/11] xen/x86: Add supporting code for uploading LAPIC contexts during domain create

2024-10-09 Thread Alejandro Vallejo
On Wed Oct 9, 2024 at 2:28 PM BST, Jan Beulich wrote:
> On 01.10.2024 14:38, Alejandro Vallejo wrote:
> > If toolstack were to upload LAPIC contexts as part of domain creation it
>
> If it were to - yes. But it doesn't, and peeking ahead in the series I also
> couldn't spot this changing. Hence I don't think I see why this change
> would be needed, and why ...

Patch 10 does. It's the means by which (in a rather roundabout way)
toolstack overrides vlapic->hw.x2apic_id.

>
> > would encounter a problem where the architectural state does not reflect
> > the APIC ID in the hidden state. This patch ensures updates to the
> > hidden state trigger an update in the architectural registers so the
> > APIC ID in both is consistent.
> > 
> > Signed-off-by: Alejandro Vallejo 
> > ---
> >  xen/arch/x86/hvm/vlapic.c | 20 
> >  1 file changed, 20 insertions(+)
> > 
> > diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
> > index 02570f9dd63a..a8183c3023da 100644
> > --- a/xen/arch/x86/hvm/vlapic.c
> > +++ b/xen/arch/x86/hvm/vlapic.c
> > @@ -1640,7 +1640,27 @@ static int cf_check lapic_load_hidden(struct domain 
> > *d, hvm_domain_context_t *h)
> >  
> >  s->loaded.hw = 1;
> >  if ( s->loaded.regs )
> > +{
> > +/*
> > + * We already processed architectural regs in lapic_load_regs(), so
> > + * this must be a migration. Fix up inconsistencies from any older 
> > Xen.
> > + */
> >  lapic_load_fixup(s);
> > +}
> > +else
> > +{
> > +/*
> > + * We haven't seen architectural regs so this could be a migration or a
> > + * plain domain create. In the domain create case it's fine to modify
> > + * the architectural state to align it to the APIC ID that was just
> > + * uploaded and in the migrate case it doesn't matter because the
> > + * architectural state will be replaced by the LAPIC_REGS ctx later on.
> > + */
>
> ... a comment would need to mention a case that never really happens, thus
> only risking to cause confusion.
>
> Jan

I assume the "never really happens" is about the same as the previous
paragraph? If so, the same answer applies.

As for the lack of ordering in the migrate stream, the code already makes no
assumptions as to which HVM context blob might appear first in the vLAPIC area.

I'm not sure why, but I assumed it might be different on older Xen.

>
> > +if ( vlapic_x2apic_mode(s) )
> > +set_x2apic_id(s);
> > +else
> > +vlapic_set_reg(s, APIC_ID, SET_xAPIC_ID(s->hw.x2apic_id));
> > +}
> >  
> >  hvm_update_vlapic_mode(v);
> >  

Cheers,
Alejandro



Re: [PATCH v6 02/11] x86/vlapic: Move lapic migration checks to the check hooks

2024-10-09 Thread Alejandro Vallejo
On Tue Oct 8, 2024 at 4:41 PM BST, Jan Beulich wrote:
> On 01.10.2024 14:37, Alejandro Vallejo wrote:
> > While doing this, factor out checks common to architectural and hidden
> > state.
> > 
> > Signed-off-by: Alejandro Vallejo 
> > Reviewed-by: Roger Pau Monné 
> > --
> > Last reviewed in the topology series v3. Fell through the cracks.
> > 
> >   https://lore.kernel.org/xen-devel/ZlhP11Vvk6c1Ix36@macbook/
>
> It's not the 1st patch in the series, and I can't spot anywhere that it is
> made clear that this one can go in ahead of patch 1. I may have overlooked
> something in the long-ish cover letter.
>
> Jan

Patch 1 is independent of almost everything. It merely needs to go in before
the final patch to avoid turning it into a bugfix. I put it up front under the
expectation that it wouldn't be very contentious. In retrospect, this one ought
to have taken its place, indeed.

Cheers,
Alejandro



Re: [PATCH v6 08/11] xen/lib: Add topology generator for x86

2024-10-09 Thread Alejandro Vallejo
On Wed Oct 9, 2024 at 3:45 PM BST, Jan Beulich wrote:
> On 01.10.2024 14:38, Alejandro Vallejo wrote:
> > --- a/xen/include/xen/lib/x86/cpu-policy.h
> > +++ b/xen/include/xen/lib/x86/cpu-policy.h
> > @@ -542,6 +542,22 @@ int x86_cpu_policies_are_compatible(const struct 
> > cpu_policy *host,
> >  const struct cpu_policy *guest,
> >  struct cpu_policy_errors *err);
> >  
> > +/**
> > + * Synthesise topology information in `p` given high-level constraints
> > + *
> > + * Topology is given in various fields across several leaves, some of
> > + * which are vendor-specific. This function uses the policy itself to
> > + * derive such leaves from threads/core and cores/package.
>
> Isn't it more like s/uses/fills/ (and the rest of the sentence then
> possibly adjust some to match)? The policy looks to be purely an output
> here (except for the vendor field).

Sure.

>
> > --- a/xen/lib/x86/policy.c
> > +++ b/xen/lib/x86/policy.c
> > @@ -2,6 +2,94 @@
> >  
> >  #include 
> >  
> > +static unsigned int order(unsigned int n)
> > +{
> > +ASSERT(n); /* clz(0) is UB */
> > +
> > +return 8 * sizeof(n) - __builtin_clz(n);
> > +}
> > +
> > +int x86_topo_from_parts(struct cpu_policy *p,
> > +unsigned int threads_per_core,
> > +unsigned int cores_per_pkg)
> > +{
> > +unsigned int threads_per_pkg = threads_per_core * cores_per_pkg;
>
> What about the (admittedly absurd) case of this overflowing?

Each of them individually could overflow the fields in which they are used.

Does returning EINVAL if either threads_per_core or cores_per_pkg overflow the
INTEL structure j

>
> > +unsigned int apic_id_size;
> > +
> > +if ( !p || !threads_per_core || !cores_per_pkg )
> > +return -EINVAL;
> > +
> > +p->basic.max_leaf = MAX(0xb, p->basic.max_leaf);
>
> Better use the type-safe max() (and min() further down)?

Sure

>
> > +memset(p->topo.raw, 0, sizeof(p->topo.raw));
> > +
> > +/* thread level */
> > +p->topo.subleaf[0].nr_logical = threads_per_core;
> > +p->topo.subleaf[0].id_shift = 0;
> > +p->topo.subleaf[0].level = 0;
> > +p->topo.subleaf[0].type = 1;
> > +if ( threads_per_core > 1 )
> > +p->topo.subleaf[0].id_shift = order(threads_per_core - 1);
> > +
> > +/* core level */
> > +p->topo.subleaf[1].nr_logical = cores_per_pkg;
> > +if ( p->x86_vendor == X86_VENDOR_INTEL )
> > +p->topo.subleaf[1].nr_logical = threads_per_pkg;
> > +p->topo.subleaf[1].id_shift = p->topo.subleaf[0].id_shift;
> > +p->topo.subleaf[1].level = 1;
> > +p->topo.subleaf[1].type = 2;
> > +if ( cores_per_pkg > 1 )
> > +p->topo.subleaf[1].id_shift += order(cores_per_pkg - 1);
> > +
> > +apic_id_size = p->topo.subleaf[1].id_shift;
> > +
> > +/*
> > + * Contrary to what the name might seem to imply, HTT is an enabler for
> > + * SMP and there's no harm in setting it even with a single vCPU.
> > + */
> > +p->basic.htt = true;
> > +p->basic.lppp = MIN(0xff, threads_per_pkg);
> > +
> > +switch ( p->x86_vendor )
> > +{
> > +case X86_VENDOR_INTEL: {
> > +struct cpuid_cache_leaf *sl = p->cache.subleaf;
> > +
> > +for ( size_t i = 0; sl->type &&
> > +i < ARRAY_SIZE(p->cache.raw); i++, sl++ )
> > +{
> > +sl->cores_per_package = cores_per_pkg - 1;
> > +sl->threads_per_cache = threads_per_core - 1;
> > +if ( sl->type == 3 /* unified cache */ )
> > +sl->threads_per_cache = threads_per_pkg - 1;
>
> I wasn't able to find documentation for this, well, anomaly. Can you please
> point me at where this is spelled out?

That's showing all unified caches as caches covering the whole package. We
could do it the other way around (but I don't want to reverse engineer what the
host policy says because that's irrelevant). There's nothing in the SDM (AFAIK)
forcing L2 or L3 to behave one way or another, so we get to choose. I thought
it more helpful to make all unified caches unified across the package, to give
more information in the leaf.

My own system exposes 2 unified caches (data trimmed for space):

``` cpuid

   deterministic cache parameters (4):
  --- cache 0 ---
  

Re: [PATCH v6 05/11] tools/hvmloader: Retrieve (x2)APIC IDs from the APs themselves

2024-10-09 Thread Alejandro Vallejo
Hi,

On Wed Oct 9, 2024 at 3:03 PM BST, Jan Beulich wrote:
> On 01.10.2024 14:38, Alejandro Vallejo wrote:
> > Make it so the APs expose their own APIC IDs in a LUT. We can use that
> > LUT to populate the MADT, decoupling the algorithm that relates CPU IDs
> > and APIC IDs from hvmloader.
> > 
> > While at it, also remove ap_callin, as writing the APIC ID may serve
> > the same purpose.
>
> ... on the assumption that no AP will have an APIC ID of zero.
>
> > @@ -341,11 +341,11 @@ int main(void)
> >  
> >  printf("CPU speed is %u MHz\n", get_cpu_mhz());
> >  
> > +smp_initialise();
> > +
> >  apic_setup();
> >  pci_setup();
> >  
> > -smp_initialise();
>
> I can see that you may want cpu_setup(0) to run ahead of apic_setup().

Not only that. This hunk ensures CPU_TO_X2APICID is populated ASAP for every
CPU. Reading zeroes where a non-zero APIC ID should be is fatal and tricky to
debug later. I tripped on enough "used the LUT before being set up" bugs to
really prefer initialising it before anyone has a chance to misuse it.

> Yet is it really appropriate to run boot_cpu() ahead of apic_setup() as well?

I would've agreed before the patches that went in to replace INIT-SIPI-SIPI
with hypercalls, but now that hvmloader is enlightened it has no real need for
the APIC to be configured. It feels weird because you wouldn't use this order
on bare metal. But it's fine under virt.

> At the very least it feels logically wrong, even if at the moment there
> may not be any direct dependency (which might change, however, down the
> road).

I suspect it feels wrong because you can't boot CPUs ahead of configuring your
APIC in real hardware. But hvmloader is always virtualized so that point is
moot. If anything, I'd be scared of adding code ahead of smp_initialise() that
relies on CPU_TO_X2APICID being set when it hasn't yet.

If you have a strong view on the matter I can remove this hunk and call
read_apic_id() from apic_setup(). But it wouldn't be my preference to do so.

>
> > --- a/tools/firmware/hvmloader/mp_tables.c
> > +++ b/tools/firmware/hvmloader/mp_tables.c
> > @@ -198,8 +198,10 @@ static void fill_mp_config_table(struct 
> > mp_config_table *mpct, int length)
> >  /* fills in an MP processor entry for VCPU 'vcpu_id' */
> >  static void fill_mp_proc_entry(struct mp_proc_entry *mppe, int vcpu_id)
> >  {
> > +ASSERT(CPU_TO_X2APICID[vcpu_id] < 0xFF );
>
> Nit: Excess blank before closing paren.

Oops, right.

>
> And of course this will need doing differently anyway once we get to
> support for more than 128 vCPU-s.

This is just a paranoia-driven assert to give quick feedback on the overflow of
the APIC ID later on. The entry in the MP-Table is a single octet long, so in
those cases we'd want to skip the table to begin with.

>
> > --- a/tools/firmware/hvmloader/smp.c
> > +++ b/tools/firmware/hvmloader/smp.c
> > @@ -29,7 +29,34 @@
> >  
> >  #include 
> >  
> > -static int ap_callin;
> > +/**
> > + * Lookup table of x2APIC IDs.
> > + *
> > + * Each entry is populated by its respective CPU as they come online. This
> > + * is required for generating the MADT with minimal assumptions about ID
> > + * relationships.
> > + */
> > +uint32_t CPU_TO_X2APICID[HVM_MAX_VCPUS];
>
> I can kind of accept keeping it simple in the name (albeit - why all caps?),
> but at least the comment should mention that it may also be an xAPIC ID
> that's stored here.

I'll add that in the comment. I do want it to be x2apic in name though, so as
to make it obvious that it's a LUT of 32-bit items.

As for the caps, bad reasons. It used to be a macro and over time I kept
interpreting it as an indexed constant. Should be lowercase.

>
> > @@ -104,6 +132,12 @@ static void boot_cpu(unsigned int cpu)
> >  void smp_initialise(void)
> >  {
> >  unsigned int i, nr_cpus = hvm_info->nr_vcpus;
> > +uint32_t ecx;
> > +
> > +cpuid(1, NULL, NULL, &ecx, NULL);
> > +has_x2apic = (ecx >> 21) & 1;
>
> Would be really nice to avoid the literal 21 here by including
> xen/arch-x86/cpufeatureset.h. Can this be arranged for?

I'll give that a go. hvmloader has given no shortage of headaches with its
quirky environment, so we'll see...
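
For the record, the direction I'd try is below (sketch only; it assumes
X86_FEATURE_X2APIC can be made visible to hvmloader as the usual
word*32 + bit constant, which is exactly the part that may turn out to be a
headache):

```c
/* Sketch, not tested in hvmloader's environment. */
cpuid(1, NULL, NULL, &ecx, NULL);
has_x2apic = (ecx >> (X86_FEATURE_X2APIC % 32)) & 1;
```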

>
> Jan

Cheers,
Alejandro



Re: [PATCH v6 06/11] tools/libacpi: Use LUT of APIC IDs rather than function pointer

2024-10-09 Thread Alejandro Vallejo
On Wed Oct 9, 2024 at 3:25 PM BST, Jan Beulich wrote:
> On 01.10.2024 14:38, Alejandro Vallejo wrote:
> > @@ -148,7 +148,7 @@ static struct acpi_20_madt *construct_madt(struct 
> > acpi_ctxt *ctxt,
> >  lapic->length  = sizeof(*lapic);
> >  /* Processor ID must match processor-object IDs in the DSDT. */
> >  lapic->acpi_processor_id = i;
> > -lapic->apic_id = config->lapic_id(i);
> > +lapic->apic_id = config->cpu_to_apicid[i];
>
> Perhaps assert (like you do in an earlier patch) that the ID is small
> enough?
>
> > --- a/tools/libacpi/libacpi.h
> > +++ b/tools/libacpi/libacpi.h
> > @@ -84,7 +84,7 @@ struct acpi_config {
> >  unsigned long rsdp;
> >  
> >  /* x86-specific parameters */
> > -uint32_t (*lapic_id)(unsigned cpu);
> > +uint32_t *cpu_to_apicid; /* LUT mapping cpu id to (x2)APIC ID */
>
> const uint32_t *?
>
> > --- a/tools/libs/light/libxl_dom.c
> > +++ b/tools/libs/light/libxl_dom.c
> > @@ -1082,6 +1082,11 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
> >  
> >  dom->container_type = XC_DOM_HVM_CONTAINER;
> >  
> > +#if defined(__i386__) || defined(__x86_64__)
> > +for ( uint32_t i = 0; i < info->max_vcpus; i++ )
>
> Plain unsigned int?
>
> Jan

Sure to all three.

Cheers,
Alejandro



Re: [PATCH v4 1/2] x86/fpu: Combine fpu_ctxt and xsave_area in arch_vcpu

2024-10-08 Thread Alejandro Vallejo
On Tue Oct 8, 2024 at 8:47 AM BST, Frediano Ziglio wrote:
> On Mon, Oct 7, 2024 at 4:52 PM Alejandro Vallejo
>  wrote:
> >
> > fpu_ctxt is either a pointer to the legacy x87/SSE save area (used by
> > FXSAVE) or a pointer aliased with xsave_area that points to its fpu_sse
> > subfield. Such subfield is at the base and is identical in size and layout
> > to the legacy buffer.
> >
> > This patch merges the 2 pointers in the arch_vcpu into a single XSAVE area.
> > In the very rare case in which the host doesn't support XSAVE all we're
> > doing is wasting a tiny amount of memory and trading those for a lot more
> > simplicity in the code.
> >
> > While at it, dedup the setup logic in vcpu_init_fpu() and integrate it
> > into xstate_alloc_save_area().
> >
> > Signed-off-by: Alejandro Vallejo 
> > --
> > v4:
> >   * Amend commit message with extra note about deduping vcpu_init_fpu()
> >   * Remove comment on top of cpu_user_regs (though I really think there
> > ought to be a credible one, in one form or another).
> >   * Remove cast from blk.c so FXSAVE_AREA is "void *"
> >   * Simplify comment in xstate_alloc_save_area() for the "host has no
> > XSAVE" case.
> > ---
> >  xen/arch/x86/domctl.c |  6 -
> >  xen/arch/x86/hvm/emulate.c|  4 +--
> >  xen/arch/x86/hvm/hvm.c|  6 -
> >  xen/arch/x86/i387.c   | 45 +--
> >  xen/arch/x86/include/asm/domain.h |  6 -
> >  xen/arch/x86/x86_emulate/blk.c|  2 +-
> >  xen/arch/x86/xstate.c | 12 ++---
> >  7 files changed, 28 insertions(+), 53 deletions(-)
> >
> > diff --git a/xen/arch/x86/x86_emulate/blk.c b/xen/arch/x86/x86_emulate/blk.c
> > index e790f4f90056..08a05f8453f7 100644
> > --- a/xen/arch/x86/x86_emulate/blk.c
> > +++ b/xen/arch/x86/x86_emulate/blk.c
> > @@ -11,7 +11,7 @@
> >  !defined(X86EMUL_NO_SIMD)
> >  # ifdef __XEN__
> >  #  include 
> > -#  define FXSAVE_AREA current->arch.fpu_ctxt
> > +#  define FXSAVE_AREA ((void *)&current->arch.xsave_area->fpu_sse)
>
> Could you use "struct x86_fxsr *" instead of "void*" ?
> Maybe adding another "struct x86_fxsr fxsr" inside the anonymous
> fpu_sse union would help here.
>

I did in v3, and Andrew suggested to keep the (void *). See:

  
https://lore.kernel.org/xen-devel/2b42323a-961a-4dd8-8cde-f4b19eac0...@citrix.com/

Cheers,
Alejandro



Re: [PATCH v4 2/2] x86/fpu: Rework fpu_setup_fpu() uses to split it in two

2024-10-08 Thread Alejandro Vallejo
On Tue Oct 8, 2024 at 7:37 AM BST, Jan Beulich wrote:
> On 07.10.2024 17:52, Alejandro Vallejo wrote:
> > It was trying to do too many things at once and there was no clear way of
> > defining what it was meant to do. This commit splits the function in two.
> > 
> >   1. A function to return the FPU to power-on reset values.
> >   2. A x87/SSE state loader (equivalent to the old function when it took
> >  a data pointer).
> > 
> > The old function also had a concept of "default" values that the FPU
> > would be configured for in some cases but not others. This patch removes
> > that 3rd vague initial state and replaces it with power-on reset.
> > 
> > While doing this make sure the abridged control tag is consistent with the
> > manuals and starts as 0xFF
> > 
> > Signed-off-by: Alejandro Vallejo 
> > Reviewed-by: Jan Beulich 
> > --
> > @Jan: The patch changed substantially. Are you still ok with this R-by?
>
> I am. However in such a situation imo you'd better drop the tag, for it to
> be re-offered (if desired). It can very well happen that the person simply
> doesn't notice the question pointed at them.
>
> Jan

Noted for next time. Thanks for the promptness!

Cheers,
Alejandro



Re: [PATCH v3 5/5] x86/boot: Clarify comment

2024-10-11 Thread Alejandro Vallejo
On Fri, Oct 11, 2024 at 09:52:44AM +0100, Frediano Ziglio wrote:
> Signed-off-by: Frediano Ziglio 
> ---
>  xen/arch/x86/boot/reloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/boot/reloc.c b/xen/arch/x86/boot/reloc.c
> index e50e161b27..e725cfb6eb 100644
> --- a/xen/arch/x86/boot/reloc.c
> +++ b/xen/arch/x86/boot/reloc.c
> @@ -65,7 +65,7 @@ typedef struct memctx {
>  /*
>   * Simple bump allocator.
>   *
> - * It starts from the base of the trampoline and allocates downwards.
> + * It starts on top of space reserved for the trampoline and allocates downwards.

nit: Not sure this is much clearer. The trampoline is not a stack (and even if
it was, I personally find "top" and "bottom" quite ambiguous when it grows
backwards), so calling its lowest address the "top" seems more confusing than not.

If anything, the clarification ought to go into which direction it takes: leave
"base" instead of "top" and replace "downwards" with "backwards" to make it
crystal clear that it's a pointer that starts where the trampoline starts, but
moves in the opposite direction.
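
To illustrate what I mean, a toy model (not reloc.c's actual code;
TRAMPOLINE_BASE is a made-up name for wherever the reserved space starts):

```c
/* Toy bump allocator: the pointer starts at the trampoline's base address
 * and every allocation moves it backwards (to lower addresses). */
static uint32_t alloc_ptr = TRAMPOLINE_BASE;

static void *bump_alloc(uint32_t size, uint32_t align)
{
    alloc_ptr = (alloc_ptr - size) & ~(align - 1);
    return (void *)alloc_ptr;
}
```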

My .02 anyway.

>   */
>  uint32_t ptr;
>  } memctx;
> -- 
> 2.34.1
> 
> 

Cheers,
Alejandro



Re: [PATCH v6 06/11] tools/libacpi: Use LUT of APIC IDs rather than function pointer

2024-10-11 Thread Alejandro Vallejo
On Wed Oct 9, 2024 at 3:25 PM BST, Jan Beulich wrote:
> On 01.10.2024 14:38, Alejandro Vallejo wrote:
> > @@ -148,7 +148,7 @@ static struct acpi_20_madt *construct_madt(struct 
> > acpi_ctxt *ctxt,
> >  lapic->length  = sizeof(*lapic);
> >  /* Processor ID must match processor-object IDs in the DSDT. */
> >  lapic->acpi_processor_id = i;
> > -lapic->apic_id = config->lapic_id(i);
> > +lapic->apic_id = config->cpu_to_apicid[i];
>
> Perhaps assert (like you do in an earlier patch) that the ID is small
> enough?

Actually, I just remembered why I didn't. libacpi is pulled into libxl, which
is integrated into libvirt. A failed assert here would kill the application,
which is not very nice.

HVM is already protected by the mp tables assert, so I'm not terribly worried
about it and, while PVH is not, it would crash pretty quickly due to the
corruption.

I'd rather have the domain crashing rather than virt-manager.

>
> > --- a/tools/libacpi/libacpi.h
> > +++ b/tools/libacpi/libacpi.h
> > @@ -84,7 +84,7 @@ struct acpi_config {
> >  unsigned long rsdp;
> >  
> >  /* x86-specific parameters */
> > -uint32_t (*lapic_id)(unsigned cpu);
> > +uint32_t *cpu_to_apicid; /* LUT mapping cpu id to (x2)APIC ID */
>
> const uint32_t *?
>
> > --- a/tools/libs/light/libxl_dom.c
> > +++ b/tools/libs/light/libxl_dom.c
> > @@ -1082,6 +1082,11 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
> >  
> >  dom->container_type = XC_DOM_HVM_CONTAINER;
> >  
> > +#if defined(__i386__) || defined(__x86_64__)
> > +for ( uint32_t i = 0; i < info->max_vcpus; i++ )
>
> Plain unsigned int?
>
> Jan

Sigh... and this didn't have the libxl style either.

I really hate this style mix we have :/

Cheers,
Alejandro



Re: [PATCH v3 1/2] x86/fpu: Combine fpu_ctxt and xsave_area in arch_vcpu

2024-10-04 Thread Alejandro Vallejo
Hi,

On Thu Oct 3, 2024 at 8:38 PM BST, Andrew Cooper wrote:
> On 13/08/2024 3:21 pm, Alejandro Vallejo wrote:
> > @@ -299,44 +299,14 @@ void save_fpu_enable(void)
> >  /* Initialize FPU's context save area */
> >  int vcpu_init_fpu(struct vcpu *v)
> >  {
> > -int rc;
> > -
> >  v->arch.fully_eager_fpu = opt_eager_fpu;
> > -
> > -if ( (rc = xstate_alloc_save_area(v)) != 0 )
> > -return rc;
> > -
> > -if ( v->arch.xsave_area )
> > -v->arch.fpu_ctxt = &v->arch.xsave_area->fpu_sse;
> > -else
> > -{
> > -BUILD_BUG_ON(__alignof(v->arch.xsave_area->fpu_sse) < 16);
> > -v->arch.fpu_ctxt = _xzalloc(sizeof(v->arch.xsave_area->fpu_sse),
> > -
> > __alignof(v->arch.xsave_area->fpu_sse));
> > -if ( v->arch.fpu_ctxt )
> > -{
> > -fpusse_t *fpu_sse = v->arch.fpu_ctxt;
> > -
> > -fpu_sse->fcw = FCW_DEFAULT;
> > -fpu_sse->mxcsr = MXCSR_DEFAULT;
> > -}
> > -else
> > -rc = -ENOMEM;
>
> This looks wonky.  It's not, because xstate_alloc_save_area() contains
> the same logic for setting up FCW/MXCSR.
>
> It would be helpful to note this in the commit message.  Something about
> deduplicating the setup alongside deduplicating the pointer.
>

Sure

> > diff --git a/xen/arch/x86/include/asm/domain.h 
> > b/xen/arch/x86/include/asm/domain.h
> > index bca3258d69ac..3da60af2a44a 100644
> > --- a/xen/arch/x86/include/asm/domain.h
> > +++ b/xen/arch/x86/include/asm/domain.h
> > @@ -592,11 +592,11 @@ struct pv_vcpu
> >  struct arch_vcpu
> >  {
> >  /*
> > - * guest context (mirroring struct vcpu_guest_context) common
> > - * between pv and hvm guests
> > + * Guest context common between PV and HVM guests. Includes general purpose
> > + * registers, segment registers and other parts of the exception frame.
> > + *
> > + * It doesn't contain FPU state, as that lives in xsave_area instead.
> >   */
>
> This new comment isn't really correct either.  arch_vcpu contains the
> PV/HVM union, so it holds not only things which are common between the two.

It's about cpu_user_regs though, not arch_vcpu?

>
> I'd either leave it alone, or delete it entirely.  It doesn't serve much
> purpose IMO, and it is going to bitrot very quickly (FRED alone will
> change two of the state groups you mention).
>

I'm happy getting rid of it because it's actively confusing in its current
form. That said, I can't possibly believe there's not a single simple
description of cpu_user_regs that everyone can agree on.

> > -
> > -void  *fpu_ctxt;
> >  struct cpu_user_regs user_regs;
> >  
> >  /* Debug registers. */
> > diff --git a/xen/arch/x86/x86_emulate/blk.c b/xen/arch/x86/x86_emulate/blk.c
> > index e790f4f90056..28b54f26fe29 100644
> > --- a/xen/arch/x86/x86_emulate/blk.c
> > +++ b/xen/arch/x86/x86_emulate/blk.c
> > @@ -11,7 +11,8 @@
> >  !defined(X86EMUL_NO_SIMD)
> >  # ifdef __XEN__
> >  #  include 
> > -#  define FXSAVE_AREA current->arch.fpu_ctxt
> > +#  define FXSAVE_AREA ((struct x86_fxsr *) \
> > +   (void *)&current->arch.xsave_area->fpu_sse)
>
> This isn't a like-for-like replacement.
>
> Previously FXSAVE_AREA's type was void *.  I'd leave the expression as just
>
>     (void *)&current->arch.xsave_area->fpu_sse
>
> because struct x86_fxsr is not the only type needing to be used here in
> due course.   (There are 8 variations of data layout for older
> instructions.)
>

Sure

> >  # else
> >  #  define FXSAVE_AREA get_fpu_save_area()
> >  # endif
> > diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
> > index 5c4144d55e89..850ee31bd18c 100644
> > --- a/xen/arch/x86/xstate.c
> > +++ b/xen/arch/x86/xstate.c
> > @@ -507,9 +507,16 @@ int xstate_alloc_save_area(struct vcpu *v)
> >  unsigned int size;
> >  
> >  if ( !cpu_has_xsave )
> > -return 0;
> > -
> > -if ( !is_idle_vcpu(v) || !cpu_has_xsavec )
> > +{
> > +/*
> > + * This is bigger than FXSAVE_SIZE by 64 bytes, but it helps treating
> > + * the FPU state uniformly as an XSAVE buffer even if XSAVE is not
> > + * available in the host. Note the alignment restrictions of the XSAVE
> > + * area are stricter than those of the FXSAVE area.
> > + */
>
> Can I suggest the following?
>
> "On non-XSAVE systems, we allocate an XSTATE buffer for simplicity. 
> XSTATE is backwards compatible to FXSAVE, and only one cacheline larger."
>
> It's rather more concise.
>
> ~Andrew

Sure.

Cheers,
Alejandro



[PATCH v4 1/2] x86/fpu: Combine fpu_ctxt and xsave_area in arch_vcpu

2024-10-07 Thread Alejandro Vallejo
fpu_ctxt is either a pointer to the legacy x87/SSE save area (used by FXSAVE) or
a pointer aliased with xsave_area that points to its fpu_sse subfield. Such
subfield is at the base and is identical in size and layout to the legacy
buffer.

This patch merges the 2 pointers in the arch_vcpu into a single XSAVE area. In
the very rare case in which the host doesn't support XSAVE all we're doing is
wasting a tiny amount of memory and trading those for a lot more simplicity in
the code.

While at it, dedup the setup logic in vcpu_init_fpu() and integrate it
into xstate_alloc_save_area().

Signed-off-by: Alejandro Vallejo 
--
v4:
  * Amend commit message with extra note about deduping vcpu_init_fpu()
  * Remove comment on top of cpu_user_regs (though I really think there
ought to be a credible one, in one form or another).
  * Remove cast from blk.c so FXSAVE_AREA is "void *"
  * Simplify comment in xstate_alloc_save_area() for the "host has no
XSAVE" case.
---
 xen/arch/x86/domctl.c |  6 -
 xen/arch/x86/hvm/emulate.c|  4 +--
 xen/arch/x86/hvm/hvm.c|  6 -
 xen/arch/x86/i387.c   | 45 +--
 xen/arch/x86/include/asm/domain.h |  6 -
 xen/arch/x86/x86_emulate/blk.c|  2 +-
 xen/arch/x86/xstate.c | 12 ++---
 7 files changed, 28 insertions(+), 53 deletions(-)

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 96d816cf1a7d..2d115395da90 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1379,7 +1379,11 @@ void arch_get_info_guest(struct vcpu *v, 
vcpu_guest_context_u c)
 #define c(fld) (c.nat->fld)
 #endif
 
-memcpy(&c.nat->fpu_ctxt, v->arch.fpu_ctxt, sizeof(c.nat->fpu_ctxt));
+BUILD_BUG_ON(sizeof(c.nat->fpu_ctxt) !=
+ sizeof(v->arch.xsave_area->fpu_sse));
+memcpy(&c.nat->fpu_ctxt, &v->arch.xsave_area->fpu_sse,
+   sizeof(c.nat->fpu_ctxt));
+
 if ( is_pv_domain(d) )
 c(flags = v->arch.pv.vgc_flags & ~(VGCF_i387_valid|VGCF_in_kernel));
 else
diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index aa97ca1cbffd..f2bc6967dfcb 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -2371,7 +2371,7 @@ static int cf_check hvmemul_get_fpu(
 alternative_vcall(hvm_funcs.fpu_dirty_intercept);
 else if ( type == X86EMUL_FPU_fpu )
 {
-const fpusse_t *fpu_ctxt = curr->arch.fpu_ctxt;
+const fpusse_t *fpu_ctxt = &curr->arch.xsave_area->fpu_sse;
 
 /*
  * Latch current register state so that we can back out changes
@@ -2411,7 +2411,7 @@ static void cf_check hvmemul_put_fpu(
 
 if ( aux )
 {
-fpusse_t *fpu_ctxt = curr->arch.fpu_ctxt;
+fpusse_t *fpu_ctxt = &curr->arch.xsave_area->fpu_sse;
 bool dval = aux->dval;
 int mode = hvm_guest_x86_mode(curr);
 
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 7b2e1c9813d6..77fe282118f7 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -914,7 +914,11 @@ static int cf_check hvm_save_cpu_ctxt(struct vcpu *v, 
hvm_domain_context_t *h)
 
 if ( v->fpu_initialised )
 {
-memcpy(ctxt.fpu_regs, v->arch.fpu_ctxt, sizeof(ctxt.fpu_regs));
+BUILD_BUG_ON(sizeof(ctxt.fpu_regs) !=
+ sizeof(v->arch.xsave_area->fpu_sse));
+memcpy(ctxt.fpu_regs, &v->arch.xsave_area->fpu_sse,
+   sizeof(ctxt.fpu_regs));
+
 ctxt.flags = XEN_X86_FPU_INITIALISED;
 }
 
diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index 134e0bece519..fbb9d3584a3d 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -39,7 +39,7 @@ static inline void fpu_xrstor(struct vcpu *v, uint64_t mask)
 /* Restore x87 FPU, MMX, SSE and SSE2 state */
 static inline void fpu_fxrstor(struct vcpu *v)
 {
-const fpusse_t *fpu_ctxt = v->arch.fpu_ctxt;
+const fpusse_t *fpu_ctxt = &v->arch.xsave_area->fpu_sse;
 
 /*
  * Some CPUs don't save/restore FDP/FIP/FOP unless an exception
@@ -151,7 +151,7 @@ static inline void fpu_xsave(struct vcpu *v)
 /* Save x87 FPU, MMX, SSE and SSE2 state */
 static inline void fpu_fxsave(struct vcpu *v)
 {
-fpusse_t *fpu_ctxt = v->arch.fpu_ctxt;
+fpusse_t *fpu_ctxt = &v->arch.xsave_area->fpu_sse;
 unsigned int fip_width = v->domain->arch.x87_fip_width;
 
 if ( fip_width != 4 )
@@ -212,7 +212,7 @@ void vcpu_restore_fpu_nonlazy(struct vcpu *v, bool 
need_stts)
  * above) we also need to restore full state, to prevent subsequently
  * saving state belonging to another vCPU.
  */
-if ( v->arch.fully_eager_fpu || (v->arch.xsave_area && xstate_all(v)) )
+if ( v->arch.fully_eager_fpu || xstate_all(v) )
 {
 if ( cpu_has_xsave )
 fpu_xrstor(v

[PATCH v4 2/2] x86/fpu: Rework fpu_setup_fpu() uses to split it in two

2024-10-07 Thread Alejandro Vallejo
It was trying to do too many things at once and there was no clear way of
defining what it was meant to do. This commit splits the function in two.

  1. A function to return the FPU to power-on reset values.
  2. A x87/SSE state loader (equivalent to the old function when it took
 a data pointer).

The old function also had a concept of "default" values that the FPU
would be configured for in some cases but not others. This patch removes
that 3rd vague initial state and replaces it with power-on reset.

While doing this make sure the abridged control tag is consistent with the
manuals and starts as 0xFF

Signed-off-by: Alejandro Vallejo 
Reviewed-by: Jan Beulich 
--
@Jan: The patch changed substantially. Are you still ok with this R-by?

v4:
  * Reworded commit message and title
  * Remove vcpu_default_fpu() and replaced its uses with vcpu_reset_fpu()
  * s/FTW_RESET/FXSAVE_FTW_RESET/ (plus comment)
  * Remove FCW_DEFAULT, as it's the leftover reset value from the 80287
(which we largely don't care about anymore).
---
 xen/arch/x86/domain.c |  7 +++--
 xen/arch/x86/hvm/hvm.c| 12 +++-
 xen/arch/x86/i387.c   | 51 +++
 xen/arch/x86/include/asm/i387.h   | 21 ++---
 xen/arch/x86/include/asm/xstate.h |  1 +
 5 files changed, 45 insertions(+), 47 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 89aad7e8978f..78a13e6812c9 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1186,9 +1186,10 @@ int arch_set_info_guest(
  is_pv_64bit_domain(d) )
 v->arch.flags &= ~TF_kernel_mode;
 
-vcpu_setup_fpu(v, v->arch.xsave_area,
-   flags & VGCF_I387_VALID ? &c.nat->fpu_ctxt : NULL,
-   FCW_DEFAULT);
+if ( flags & VGCF_I387_VALID )
+vcpu_setup_fpu(v, &c.nat->fpu_ctxt);
+else
+vcpu_reset_fpu(v);
 
 if ( !compat )
 {
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 77fe282118f7..44f4964aa036 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1163,10 +1163,10 @@ static int cf_check hvm_load_cpu_ctxt(struct domain *d, 
hvm_domain_context_t *h)
 seg.attr = ctxt.ldtr_arbytes;
 hvm_set_segment_register(v, x86_seg_ldtr, &seg);
 
-/* Cover xsave-absent save file restoration on xsave-capable host. */
-vcpu_setup_fpu(v, xsave_enabled(v) ? NULL : v->arch.xsave_area,
-   ctxt.flags & XEN_X86_FPU_INITIALISED ? ctxt.fpu_regs : NULL,
-   FCW_RESET);
+if ( ctxt.flags & XEN_X86_FPU_INITIALISED )
+vcpu_setup_fpu(v, &ctxt.fpu_regs);
+else
+vcpu_reset_fpu(v);
 
 v->arch.user_regs.rax = ctxt.rax;
 v->arch.user_regs.rbx = ctxt.rbx;
@@ -4006,9 +4006,7 @@ void hvm_vcpu_reset_state(struct vcpu *v, uint16_t cs, 
uint16_t ip)
 v->arch.guest_table = pagetable_null();
 }
 
-if ( v->arch.xsave_area )
-v->arch.xsave_area->xsave_hdr.xstate_bv = 0;
-vcpu_setup_fpu(v, v->arch.xsave_area, NULL, FCW_RESET);
+vcpu_reset_fpu(v);
 
 arch_vcpu_regs_init(v);
 v->arch.user_regs.rip = ip;
diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index fbb9d3584a3d..916d9b572598 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -303,41 +303,26 @@ int vcpu_init_fpu(struct vcpu *v)
 return xstate_alloc_save_area(v);
 }
 
-void vcpu_setup_fpu(struct vcpu *v, struct xsave_struct *xsave_area,
-const void *data, unsigned int fcw_default)
+void vcpu_reset_fpu(struct vcpu *v)
 {
-fpusse_t *fpu_sse = &v->arch.xsave_area->fpu_sse;
-
-ASSERT(!xsave_area || xsave_area == v->arch.xsave_area);
-
-v->fpu_initialised = !!data;
-
-if ( data )
-{
-memcpy(fpu_sse, data, sizeof(*fpu_sse));
-if ( xsave_area )
-xsave_area->xsave_hdr.xstate_bv = XSTATE_FP_SSE;
-}
-else if ( xsave_area && fcw_default == FCW_DEFAULT )
-{
-xsave_area->xsave_hdr.xstate_bv = 0;
-fpu_sse->mxcsr = MXCSR_DEFAULT;
-}
-else
-{
-memset(fpu_sse, 0, sizeof(*fpu_sse));
-fpu_sse->fcw = fcw_default;
-fpu_sse->mxcsr = MXCSR_DEFAULT;
-if ( v->arch.xsave_area )
-{
-v->arch.xsave_area->xsave_hdr.xstate_bv &= ~XSTATE_FP_SSE;
-if ( fcw_default != FCW_DEFAULT )
-v->arch.xsave_area->xsave_hdr.xstate_bv |= X86_XCR0_X87;
-}
-}
+v->fpu_initialised = false;
+*v->arch.xsave_area = (struct xsave_struct) {
+.fpu_sse = {
+.mxcsr = MXCSR_DEFAULT,
+.fcw = FCW_RESET,
+.ftw = FXSAVE_FTW_RESET,
+},
+.xsave_hdr.xstate_bv = X86_XCR0_X87,
+};
+}
 
-if ( xsave_area )
-xsave_area->xsave_hdr.xcomp_bv = 0;
+void vcpu_setup_fpu(

[PATCH v4 0/2] x86: FPU handling cleanup

2024-10-07 Thread Alejandro Vallejo
v3: 
https://lore.kernel.org/xen-devel/20240813142119.29012-1-alejandro.vall...@cloud.com/
v3 -> v4: Removal vcpu_default_fpu() + style changes

v2: 
https://lore.kernel.org/xen-devel/20240808134150.29927-1-alejandro.vall...@cloud.com/
v2 -> v3: Cosmetic changes and wiped big comment about missing data in the
  migration stream. Details in each patch.

v1: 
https://lore.kernel.org/xen-devel/cover.1720538832.git.alejandro.vall...@cloud.com/
v1 -> v2: v1/patch1 and v1/patch2 are already in staging.

=== Original cover letter =
I want to eventually reach a position in which the FPU state can be allocated
from the domheap and hidden via the same core mechanism proposed in Elias'
directmap removal series. Doing so is complicated by the presence of 2 aliased
pointers (v->arch.fpu_ctxt and v->arch.xsave_area) and the rather complicated
semantics of vcpu_setup_fpu(). This series tries to simplify the code so moving
to a "map/modify/unmap" model is more tractable.

Patches 1 and 2 are trivial refactors.

Patch 3 unifies FPU state so an XSAVE area is allocated per vCPU regardless of
the host supporting it or not. The rationale is that the memory savings are
negligible and not worth the extra complexity.

Patch 4 is a non-trivial split of the vcpu_setup_fpu() into 2 separate
functions. One to override x87/SSE state, and another to set a reset state.
===========

Alejandro Vallejo (3):
  x86/fpu: Combine fpu_ctxt and xsave_area in arch_vcpu
  x86/fpu: Rework fpu_setup_fpu() uses to split it in two
  x86/fpu: Remove remaining uses of FCW_DEFAULT

 xen/arch/x86/domain.c |  7 ++-
 xen/arch/x86/domctl.c |  6 +-
 xen/arch/x86/hvm/emulate.c|  4 +-
 xen/arch/x86/hvm/hvm.c| 18 +++---
 xen/arch/x86/i387.c   | 94 ---
 xen/arch/x86/include/asm/domain.h |  6 --
 xen/arch/x86/include/asm/i387.h   | 21 +--
 xen/arch/x86/include/asm/xstate.h |  2 +-
 xen/arch/x86/x86_emulate/blk.c|  2 +-
 xen/arch/x86/xstate.c | 14 +++--
 xen/common/efi/runtime.c  |  2 +-
 11 files changed, 74 insertions(+), 102 deletions(-)

-- 
2.46.0




Re: [PATCH v3 2/2] x86/fpu: Split fpu_setup_fpu() in three

2024-10-07 Thread Alejandro Vallejo
Hi,

On Fri Oct 4, 2024 at 7:08 AM BST, Jan Beulich wrote:
> On 03.10.2024 15:54, Alejandro Vallejo wrote:
> > On Tue Aug 13, 2024 at 5:33 PM BST, Alejandro Vallejo wrote:
> >> On Tue Aug 13, 2024 at 3:32 PM BST, Jan Beulich wrote:
> >>> On 13.08.2024 16:21, Alejandro Vallejo wrote:
> >>>> It was trying to do too many things at once and there was no clear way of
> >>>> defining what it was meant to do. This commit splits the function in three.
> >>>>
> >>>>   1. A function to return the FPU to power-on reset values.
> >>>>   2. A function to return the FPU to default values.
> >>>>   3. A x87/SSE state loader (equivalent to the old function when it
> >>>>  took a data pointer).
> >>>>
> >>>> While at it, make sure the abridged tag is consistent with the manuals
> >>>> and start as 0xFF.
> >>>>
> >>>> Signed-off-by: Alejandro Vallejo 
> >>>
> >>> Reviewed-by: Jan Beulich 
> >>>
> >>>> ---
> >>>> v3:
> >>>>   * Adjust commit message, as the split is now in 3.
> >>>>   * Remove bulky comment, as the rationale for it turned out to be
> >>>> unsubstantiated. I can't find proof in xen-devel of the stream
> >>>> operating the way I claimed, and at that point having the comment
> >>>> at all is pointless
> >>>
> >>> So you deliberately removed the comment altogether, not just point 3 of it?
> >>>
> >>> Jan
> >>
> >> Yes. The other two cases can be deduced pretty trivially from the
> >> conditional, I reckon. I commented them more heavily in order to properly
> >> introduce (3), but seeing how it was all a midsummer dream might as well
> >> reduce clutter.
> >>
> >> I got as far as the original implementation of XSAVE in Xen and it seems to
> >> have been tested against many combinations of src and dst, none of which
> >> was that fictitious "xsave enabled + xsave context missing". I suspect the
> >> xsave_enabled(v) was merely avoiding writing to the XSAVE buffer just for
> >> efficiency (however minor an effect it might have had). I just reverse
> >> engineered it wrong.
> >>
> >> Which reminds me. Thanks for mentioning that, because it was really just
> >> guesswork on my part.
> >>
> >> Cheers,
> >> Alejandro
> > 
> > Playing around with the FPU I noticed this patch wasn't committed; did it
> > fall through the cracks or is there a specific reason?
>
> Well, it's patch 2 in a series with no statement that it's independent of patch

I meant the series as a whole, rather than this specific patch. They are indeed
not independent.

> 1, and patch 1 continues to lack an ack (based on earlier comments of mine you
> probably have inferred that I'm not intending to ack it in this shape, while
> at the same time - considering the arguments you gave - I also don't mean to
> stand in the way of it going in with someone else's ack).

I didn't infer that at all, I'm afraid. I merely thought you had been busy and
forgot about it. Is the "in this shape" about the overallocation that you
mentioned in v1?

>
> Jan

Cheers,
Alejandro



Re: [PATCH] x86emul/test: drop Xeon Phi S/G prefetch special case

2024-10-16 Thread Alejandro Vallejo
On Wed Oct 16, 2024 at 8:46 AM BST, Jan Beulich wrote:
> Another leftover from the dropping of Xeon Phi support.
>
> Signed-off-by: Jan Beulich 
> ---
> Note: I'm deliberately not switching to use of the conditional operator,
> as the form as is resulting now is what we'll want for APX (which is
> where I noticed this small piece of dead logic).
>
> --- a/tools/tests/x86_emulator/evex-disp8.c
> +++ b/tools/tests/x86_emulator/evex-disp8.c
> @@ -911,10 +911,8 @@ static void test_one(const struct test *
>  n = test->scale == SC_vl ? vsz : esz;
>  if ( !sg )
>  n += vsz;
> -else if ( !strstr(test->mnemonic, "pf") )
> -n += esz;
>  else
> -++n;
> +n += esz;

Just making sure. This is leftover from 85191cf32180("x86: drop Xeon Phi
support"), right? Dead code after the removal of the avx512pf group.

If so, that sounds good. But (not having looked at the general logic), how
come we go from ++n to "n += esz"? It's all quite cryptic.

>  
>  for ( ; i < n; ++i )
>   if ( accessed[i] != (sg ? (vsz / esz) >> (test->opc & 1 & !evex.w)

Cheers,
Alejandro



Re: [PATCH v1 2/5] xen/riscv: implement maddr_to_virt()

2024-10-16 Thread Alejandro Vallejo
On Wed Oct 16, 2024 at 10:15 AM BST, Oleksii Kurochko wrote:
> Implement the `maddr_to_virt()` function to convert a machine address
> to a virtual address. This function is specifically designed to be used
> only for the DIRECTMAP region, so a check has been added to ensure that
> the address does not exceed `DIRECTMAP_SIZE`.
>

nit: Worth mentioning this comes from the x86 side of things.

> Signed-off-by: Oleksii Kurochko 
> ---
>  xen/arch/riscv/include/asm/mm.h | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
> index ebb142502e..0396e66f47 100644
> --- a/xen/arch/riscv/include/asm/mm.h
> +++ b/xen/arch/riscv/include/asm/mm.h
> @@ -25,8 +25,12 @@
>  
>  static inline void *maddr_to_virt(paddr_t ma)
>  {
> -BUG_ON("unimplemented");
> -return NULL;
> +/* Offset in the direct map, accounting for pdx compression */
> +unsigned long va_offset = maddr_to_directmapoff(ma);
> +
> +ASSERT(va_offset < DIRECTMAP_SIZE);
> +
> +return (void *)(DIRECTMAP_VIRT_START + va_offset);
>  }
>  
>  /*




Re: [PATCH] x86emul/test: correct loop body indentation in evex-disp8.c:test_one()

2024-10-16 Thread Alejandro Vallejo
On Wed Oct 16, 2024 at 8:45 AM BST, Jan Beulich wrote:
> For some reason I entirely consistently screwed these up.
>
> Signed-off-by: Jan Beulich 

  Reviewed-by: Alejandro Vallejo 

We should really give another push to the clang-format effort. This whole class
of mistakes would be a thing of the past.

Cheers,
Alejandro



Re: [PATCH] x86emul/test: drop Xeon Phi S/G prefetch special case

2024-10-16 Thread Alejandro Vallejo
On Wed Oct 16, 2024 at 11:54 AM BST, Jan Beulich wrote:
> On 16.10.2024 12:34, Alejandro Vallejo wrote:
> > On Wed Oct 16, 2024 at 8:46 AM BST, Jan Beulich wrote:
> >> --- a/tools/tests/x86_emulator/evex-disp8.c
> >> +++ b/tools/tests/x86_emulator/evex-disp8.c
> >> @@ -911,10 +911,8 @@ static void test_one(const struct test *
> >>  n = test->scale == SC_vl ? vsz : esz;
> >>  if ( !sg )
> >>  n += vsz;
> >> -else if ( !strstr(test->mnemonic, "pf") )
> >> -n += esz;
> >>  else
> >> -++n;
> >> +n += esz;
> > 
> > Just making sure. This is leftover from 85191cf32180("x86: drop Xeon Phi
> > support"), right? Dead code after the removal of the avx512pf group.
>
> Yes.
>
> > If so, that sounds good. But (not having looked at the general logic), how
> > come we go from ++n to "n += esz"? It's all quite cryptic.
>
> It's the (prior) if() portion we're keeping, and the "else" we're dropping.
> The if() checks for _no_ "pf" in the mnemonic. "Going from ++n to n+= esz"
> is merely an effect of how the change is being expressed as diff.
>
> Jan

Bah, misremembered strstr() being used like strcmp() on match, but of course
that makes no sense with the substring being returned. Thanks for spelling it
out :)

Cheers,
Alejandro



Re: [PATCH] x86emul/test: correct loop body indentation in evex-disp8.c:test_one()

2024-10-16 Thread Alejandro Vallejo
On Wed Oct 16, 2024 at 11:15 AM BST, Jan Beulich wrote:
> On 16.10.2024 12:06, Alejandro Vallejo wrote:
> > On Wed Oct 16, 2024 at 8:45 AM BST, Jan Beulich wrote:
> >> For some reason I entirely consistently screwed these up.
> >>
> >> Signed-off-by: Jan Beulich 
> > 
> >   Reviewed-by: Alejandro Vallejo 
>
> Thanks.
>
> > We should really give another push to the clang-format effort. This whole
> > class of mistakes would be a thing of the past.
>
> For issues like the one here it would depend on whether that would also be
> applied to (parts of) tool stack code. The plans, iirc, were mainly to cover
> the xen/ subtree.
>
> Jan

True, but AIUI that was merely an act of scope reduction for the sake of
getting something merged in a finite time frame. In an ideal world the whole
codebase would be covered, and I think this was a shared sentiment among those
in favour.

Cheers,
Alejandro



Re: [PATCH] x86/io-apic: fix directed EOI when using AMd-Vi interrupt remapping

2024-10-21 Thread Alejandro Vallejo
On Mon Oct 21, 2024 at 3:51 PM BST, Andrew Cooper wrote:
> On 21/10/2024 3:06 pm, Roger Pau Monné wrote:
> > On Mon, Oct 21, 2024 at 12:34:37PM +0100, David Woodhouse wrote:
> >> On Fri, 2024-10-18 at 10:08 +0200, Roger Pau Monne wrote:
> >>> When using AMD-VI interrupt remapping the vector field in the IO-APIC RTE is
> >>> repurposed to contain part of the offset into the remapping table.  Previous
> >>> to 2ca9fbd739b8 Xen had logic so that the offset into the interrupt remapping
> >>> table would match the vector.  Such logic was mandatory for end of interrupt
> >>> to work, since the vector field (even when not containing a vector) is used
> >>> by the IO-APIC to find for which pin the EOI must be performed.
> >>>
> >>> Introduce a table to store the EOI handlers when using interrupt remapping,
> >>> so that the IO-APIC driver can translate pins into EOI handlers without
> >>> having to read the IO-APIC RTE entry.  Note that to simplify the logic such
> >>> table is used unconditionally when interrupt remapping is enabled, even if
> >>> strictly it would only be required for AMD-Vi.
> >>>
> >>> Reported-by: Willi Junga 
> >>> Suggested-by: David Woodhouse 
> >>> Fixes: 2ca9fbd739b8 ('AMD IOMMU: allocate IRTE entries instead of using a static mapping')
> >>> Signed-off-by: Roger Pau Monné 
> >> Hm, couldn't we just have used the pin#?
> > Yes, but that would require a much bigger change than what's currently
> > presented here, and for backport purposes I think it's better done
> > this way for fixing this specific bug.
> >
> > Changing to use pin# as the IR offset is worthwhile, but IMO needs to
> > be done separately from the bugfix here.
> >
> >> The AMD IOMMU has per-device IRTE, so you *know* you can just use IRTE
> >> indices 0-23 for the I/O APIC pins.
> > Aren't there IO-APICs with more than 24 pins?
>
> Recent Intel SoCs have a single IO-APIC with 120 pins.
>
> ~Andrew

I can't say I understand why though.

In practice you have the legacy ISA IRQs and the 4 legacy PCI INTx. If you have
a weird enough system you might have more than one PCIe bus, but even that fits
more than nicely in 24 "pins". Does ACPI give more than 4 IRQs these days after
an adequate blood sacrifice to the gods of AML?

Cheers,
Alejandro



Re: [PATCH v1 2/5] xen/riscv: implement maddr_to_virt()

2024-10-21 Thread Alejandro Vallejo
On Fri Oct 18, 2024 at 2:17 PM BST, oleksii.kurochko wrote:
> On Thu, 2024-10-17 at 16:55 +0200, Jan Beulich wrote:
> > On 16.10.2024 11:15, Oleksii Kurochko wrote:
> > > --- a/xen/arch/riscv/include/asm/mm.h
> > > +++ b/xen/arch/riscv/include/asm/mm.h
> > > @@ -25,8 +25,12 @@
> > >  
> > >  static inline void *maddr_to_virt(paddr_t ma)
> > >  {
> > > -    BUG_ON("unimplemented");
> > > -    return NULL;
> > > +    /* Offset in the direct map, accounting for pdx compression */
> > > +    unsigned long va_offset = maddr_to_directmapoff(ma);
> > 
> > Why the mentioning of PDX compression?
> It was mentioned because if PDX will be enabled maddr_to_directmapoff()
> will take into account PDX stuff.
>
> >  At least right now it's unavailable
> > for RISC-V afaics. Are there plans to change that any time soon?
> At the moment, I don't have such plans, looking at available platform
> there are no a lot of benefits of having PDX compression now.
>
> Perhaps it would be good to add
> BUILD_BUG_ON(IS_ENABLED(PDX_COMPRESSION)) for the places which should
> be updated when CONFIG_PDX will be enabled.
>
> ~ Oleksii

I'd just forget about it unless you ever notice you're wasting a lot of entries
in the frame table due to empty space in the memory map. Julien measured the
effect on Amazon's Live Migration as a 10% improvement in downtime with PDX
off.

PDX compression shines when you have separate RAM banks at very, very
disparately far addresses (specifics in pdx.h). Unfortunately the flip side of
this compression is that you get several memory accesses for each single
pdx-(to/from)-mfn conversion. And we do a lot of those. One possible solution
would be to alt-patch the values in the code-stream and avoid the perf-hit, but
that's not merged. Jan had some patches but that didn't make it to staging,
IIRC.
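
For reference, the core trick is just squeezing a run of always-zero address
bits out of the frame number. A minimal sketch (illustrative names only, not
the actual symbols in xen/common/pdx.c):

```c
/*
 * Sketch of PDX compression. A contiguous run of address bits that is
 * zero in every RAM bank ("the hole") is squeezed out of the page frame
 * number, so the frame table only needs entries for the compressed
 * index space. The globals below stand in for Xen's precomputed state.
 */
static unsigned int  hole_shift;   /* width of the squeezed-out bit range */
static unsigned long bottom_mask;  /* bits below the hole, kept as-is */

static unsigned long pfn_to_pdx(unsigned long pfn)
{
    /* Keep the low bits; shift everything above the hole down over it. */
    return (pfn & bottom_mask) | ((pfn & ~bottom_mask) >> hole_shift);
}

static unsigned long pdx_to_pfn(unsigned long pdx)
{
    /* Inverse: re-expand the high bits back across the hole. */
    return (pdx & bottom_mask) | ((pdx & ~bottom_mask) << hole_shift);
}
```

Every conversion has to fetch that mask/shift state, which is where the extra
memory accesses (and the appeal of alt-patching the constants in) come from.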

Cheers,
Alejandro



Re: [PATCH] x86/io-apic: fix directed EOI when using AMD-Vi interrupt remapping

2024-10-21 Thread Alejandro Vallejo
On Fri Oct 18, 2024 at 9:08 AM BST, Roger Pau Monne wrote:
> When using AMD-VI interrupt remapping the vector field in the IO-APIC RTE is
> repurposed to contain part of the offset into the remapping table.  Previous 
> to

For my own education. Is that really a repurpose? Isn't the RTE vector field
itself simply remapped, just like any MSI?

> 2ca9fbd739b8 Xen had logic so that the offset into the interrupt remapping
> table would match the vector.  Such logic was mandatory for end of interrupt 
> to
> work, since the vector field (even when not containing a vector) is used by 
> the
> IO-APIC to find for which pin the EOI must be performed.
>
> Introduce a table to store the EOI handlers when using interrupt remapping, so

The table seems to store the pre-IR vectors. Is this a matter of nomenclature
or leftover from a previous implementation?

> that the IO-APIC driver can translate pins into EOI handlers without having to
> read the IO-APIC RTE entry.  Note that to simplify the logic such table is 
> used
> unconditionally when interrupt remapping is enabled, even if strictly it would
> only be required for AMD-Vi.

Given that last statement it might be worth mentioning that the table is
bypassed when IR is off as well.

>
> Reported-by: Willi Junga 
> Suggested-by: David Woodhouse 
> Fixes: 2ca9fbd739b8 ('AMD IOMMU: allocate IRTE entries instead of using a 
> static mapping')
> Signed-off-by: Roger Pau Monné 
> ---
>  xen/arch/x86/io_apic.c | 47 ++
>  1 file changed, 47 insertions(+)
>
> diff --git a/xen/arch/x86/io_apic.c b/xen/arch/x86/io_apic.c
> index e40d2f7dbd75..8856eb29d275 100644
> --- a/xen/arch/x86/io_apic.c
> +++ b/xen/arch/x86/io_apic.c
> @@ -71,6 +71,22 @@ static int apic_pin_2_gsi_irq(int apic, int pin);
>  
>  static vmask_t *__read_mostly vector_map[MAX_IO_APICS];
>  
> +/*
> + * Store the EOI handle when using interrupt remapping.

That explains the when, but not the what. This is "a LUT from IOAPIC pin to its
vector field", as far as I can see. 

The order in which it's meant to be indexed would be a good addition here as
well. I had to scroll down to where it's used to really see what this was.

> + *
> + * If using AMD-Vi interrupt remapping the IO-APIC redirection entry remapped
> + * format repurposes the vector field to store the offset into the Interrupt
> + * Remap table.  This causes directed EOI to longer work, as the CPU vector 
> no
> + * longer matches the contents of the RTE vector field.  Add a translation
> + * table so that directed EOI uses the value in the RTE vector field when

nit: Might be worth mentioning that it's merely a cache and is populated
on-demand from authoritative state in the IOAPIC.

> + * interrupt remapping is enabled.
> + *
> + * Note Intel VT-d Xen code still stores the CPU vector in the RTE vector 
> field
> + * when using the remapped format, but use the translation table uniformly in
> + * order to avoid extra logic to differentiate between VT-d and AMD-Vi.
> + */
> +static unsigned int **apic_pin_eoi;

This should be signed to allow IRQ_VECTOR_UNASSIGNED, I think. Possibly
int16_t, matching arch_irq_desc->vector. This raises doubts about the existing
vectors here typed as unsigned too.

On naming, I'd rather see ioapic rather than apic, but that's an existing sin
in the whole file. Otherwise, while it's used for EOI ATM, isn't it really just
an ioapic_pin_vector?

> +
>  static void share_vector_maps(unsigned int src, unsigned int dst)
>  {
>  unsigned int pin;
> @@ -273,6 +289,13 @@ void __ioapic_write_entry(
>  {
>  __io_apic_write(apic, 0x11 + 2 * pin, eu.w2);
>  __io_apic_write(apic, 0x10 + 2 * pin, eu.w1);
> +/*
> + * Might be called before apic_pin_eoi is allocated.  Entry will be
> + * updated once the array is allocated and there's an EOI or write
> + * against the pin.
> + */
> +if ( apic_pin_eoi )
> +apic_pin_eoi[apic][pin] = e.vector;
>  }
>  else
>  iommu_update_ire_from_apic(apic, pin, e.raw);
> @@ -298,9 +321,17 @@ static void __io_apic_eoi(unsigned int apic, unsigned 
> int vector, unsigned int p

Out of curiosity, how could this vector come to be unassigned as a parameter?
The existing code seems to assume that may happen.

>  /* Prefer the use of the EOI register if available */
>  if ( ioapic_has_eoi_reg(apic) )
>  {
> +if ( apic_pin_eoi )
> +vector = apic_pin_eoi[apic][pin];
> +
>  /* If vector is unknown, read it from the IO-APIC */
>  if ( vector == IRQ_VECTOR_UNASSIGNED )
> +{
>  vector = __ioapic_read_entry(apic, pin, true).vector;
> +if ( apic_pin_eoi )
> +/* Update cached value so further EOI don't need to fetch 
> it. */
> +apic_pin_eoi[apic][pin] = vector;
> +}
>  
>  *(IO_APIC_BASE(apic)+16) = vector;
>  }
> @@ -1022,7 +1053,23 @@ static voi


Re: [PATCH] x86/io-apic: fix directed EOI when using AMD-Vi interrupt remapping

2024-10-21 Thread Alejandro Vallejo
On Mon Oct 21, 2024 at 12:32 PM BST, David Woodhouse wrote:
> On Mon, 2024-10-21 at 10:55 +0100, Alejandro Vallejo wrote:
> > On Fri Oct 18, 2024 at 9:08 AM BST, Roger Pau Monne wrote:
> > > When using AMD-VI interrupt remapping the vector field in the IO-APIC RTE 
> > > is
> > > repurposed to contain part of the offset into the remapping table.  
> > > Previous to
> > 
> > For my own education
>
> Careful what you wish for.
>
> http://david.woodhou.se/more-than-you-ever-wanted-to-know-about-x86-msis.txt

I had seen it before, but then neglected to give it the attentive read it very
much deserves. Let me correct that, thanks.

Cheers,
Alejandro



Re: [PATCH v1 2/5] xen/riscv: implement maddr_to_virt()

2024-10-21 Thread Alejandro Vallejo
On Mon Oct 21, 2024 at 10:17 AM BST, oleksii.kurochko wrote:
> On Mon, 2024-10-21 at 08:56 +0100, Alejandro Vallejo wrote:
> > On Fri Oct 18, 2024 at 2:17 PM BST, oleksii.kurochko wrote:
> > > On Thu, 2024-10-17 at 16:55 +0200, Jan Beulich wrote:
> > > > On 16.10.2024 11:15, Oleksii Kurochko wrote:
> > > > > --- a/xen/arch/riscv/include/asm/mm.h
> > > > > +++ b/xen/arch/riscv/include/asm/mm.h
> > > > > @@ -25,8 +25,12 @@
> > > > >  
> > > > >  static inline void *maddr_to_virt(paddr_t ma)
> > > > >  {
> > > > > -    BUG_ON("unimplemented");
> > > > > -    return NULL;
> > > > > +    /* Offset in the direct map, accounting for pdx
> > > > > compression */
> > > > > +    unsigned long va_offset = maddr_to_directmapoff(ma);
> > > > 
> > > > Why the mentioning of PDX compression?
> > > It was mentioned because if PDX will be enabled
> > > maddr_to_directmapoff()
> > > will take into account PDX stuff.
> > > 
> > > >  At least right now it's unavailable
> > > > for RISC-V afaics. Are there plans to change that any time soon?
> > > At the moment, I don't have such plans, looking at available
> > > platform
> > > there are no a lot of benefits of having PDX compression now.
> > > 
> > > Perhaps it would be good to add
> > > BUILD_BUG_ON(IS_ENABLED(PDX_COMPRESSION)) for the places which
> > > should
> > > be updated when CONFIG_PDX will be enabled.
> > > 
> > > ~ Oleksii
> > 
> > I'd just forget about it unless you ever notice you're wasting a lot
> > of entries
> > in the frame table due to empty space in the memory map. Julien
> > measured the
> > effect on Amazon's Live Migration as a 10% improvement in downtime
> > with PDX
> > off.
> > 
> > PDX compression shines when you have separate RAM banks at very, very
> > disparately far addresses (specifics in pdx.h). Unfortunately the
> > flip side of
> > this compression is that you get several memory accesses for each
> > single
> > pdx-(to/from)-mfn conversion. And we do a lot of those. One possible
> > solution
> > would be to alt-patch the values in the code-stream and avoid the
> > perf-hit, but
> > that's not merged. Jan had some patches but that didn't make it to
> > staging,
> > IIRC.
> Could you please give me some links in the mailing list with mentioned
> patches?
>
> ~ Oleksii

Sure.

Much of this was discussed in the "Make PDX compression optional" series. This
link is v1, but there were 3 in total and a pre-patch documenting pdx.h
explaining what the technique actually does to make sure we were all on the
same page (pun intended) and the pdx-off case wouldn't break the world.

  
https://lore.kernel.org/xen-devel/20230717160318.2113-1-alejandro.vall...@cloud.com/

This was Jan's 2018 take to turn PDX into alternatives. He mentioned it
somewhere in those threads, but I can't find that message anymore.

  
https://lore.kernel.org/xen-devel/5b7674080278001df...@prv1-mh.provo.novell.com/

Cheers,
Alejandro



[PATCH v7 02/10] xen/x86: Add initial x2APIC ID to the per-vLAPIC save area

2024-10-21 Thread Alejandro Vallejo
This allows the initial x2APIC ID to be sent in the migration stream,
which in turn allows further changes to topology and APIC ID assignment
without breaking existing hosts. Given the vlapic data is zero-extended on
restore, fix up migrations from hosts without the field by setting it to
the old convention if zero.

The hardcoded mapping x2apic_id=2*vcpu_id is kept for the time being,
but it's meant to be overridden by the toolstack in a later patch with
appropriate values.

Signed-off-by: Alejandro Vallejo 
---
v7:
 * Preserve output for CPUID[0xb].edx on PV rather than nullify it.
 * s/vlapic->hw.x2apic_id/vlapic_x2apic_id(vlapic)/ in vlapic.c
---
 xen/arch/x86/cpuid.c   | 18 +++---
 xen/arch/x86/hvm/vlapic.c  | 22 --
 xen/arch/x86/include/asm/hvm/vlapic.h  |  1 +
 xen/include/public/arch-x86/hvm/save.h |  2 ++
 4 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
index 2a777436ee27..e2489ff8e346 100644
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -138,10 +138,9 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
 const struct cpu_user_regs *regs;
 
 case 0x1:
-/* TODO: Rework topology logic. */
 res->b &= 0x00ffu;
 if ( is_hvm_domain(d) )
-res->b |= (v->vcpu_id * 2) << 24;
+res->b |= vlapic_x2apic_id(vcpu_vlapic(v)) << 24;
 
 /* TODO: Rework vPMU control in terms of toolstack choices. */
 if ( vpmu_available(v) &&
@@ -310,19 +309,16 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
 break;
 
 case 0xb:
-/*
- * In principle, this leaf is Intel-only.  In practice, it is tightly
- * coupled with x2apic, and we offer an x2apic-capable APIC emulation
- * to guests on AMD hardware as well.
- *
- * TODO: Rework topology logic.
- */
 if ( p->basic.x2apic )
 {
 *(uint8_t *)&res->c = subleaf;
 
-/* Fix the x2APIC identifier. */
-res->d = v->vcpu_id * 2;
+/*
+ * Fix the x2APIC identifier. The PV side is nonsensical, but
+ * we've always shown it like this so it's kept for compat.
+ */
+res->d = is_hvm_domain(d) ? vlapic_x2apic_id(vcpu_vlapic(v))
+  : 2 * v->vcpu_id;
 }
 break;
 
diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 3363926b487b..33b463925f4e 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -1090,7 +1090,7 @@ static uint32_t x2apic_ldr_from_id(uint32_t id)
 static void set_x2apic_id(struct vlapic *vlapic)
 {
 const struct vcpu *v = vlapic_vcpu(vlapic);
-uint32_t apic_id = v->vcpu_id * 2;
+uint32_t apic_id = vlapic_x2apic_id(vlapic);
 uint32_t apic_ldr = x2apic_ldr_from_id(apic_id);
 
 /*
@@ -1470,7 +1470,7 @@ void vlapic_reset(struct vlapic *vlapic)
 if ( v->vcpu_id == 0 )
 vlapic->hw.apic_base_msr |= APIC_BASE_BSP;
 
-vlapic_set_reg(vlapic, APIC_ID, (v->vcpu_id * 2) << 24);
+vlapic_set_reg(vlapic, APIC_ID, SET_xAPIC_ID(vlapic_x2apic_id(vlapic)));
 vlapic_do_init(vlapic);
 }
 
@@ -1538,6 +1538,16 @@ static void lapic_load_fixup(struct vlapic *vlapic)
 const struct vcpu *v = vlapic_vcpu(vlapic);
 uint32_t good_ldr = x2apic_ldr_from_id(vlapic->loaded.id);
 
+/*
+ * Loading record without hw.x2apic_id in the save stream, calculate using
+ * the traditional "vcpu_id * 2" relation. There's an implicit assumption
+ * that vCPU0 always has x2APIC0, which is true for the old relation, and
+ * still holds under the new x2APIC generation algorithm. While that case
+ * goes through the conditional it's benign because it still maps to zero.
+ */
+if ( !vlapic->hw.x2apic_id )
+vlapic->hw.x2apic_id = v->vcpu_id * 2;
+
 /* Skip fixups on xAPIC mode, or if the x2APIC LDR is already correct */
 if ( !vlapic_x2apic_mode(vlapic) ||
  (vlapic->loaded.ldr == good_ldr) )
@@ -1606,6 +1616,13 @@ static int cf_check lapic_check_hidden(const struct 
domain *d,
  APIC_BASE_EXTD )
 return -EINVAL;
 
+/*
+ * Fail migrations from newer versions of Xen where
+ * rsvd_zero is interpreted as something else.
+ */
+if ( s.rsvd_zero )
+return -EINVAL;
+
 return 0;
 }
 
@@ -1687,6 +1704,7 @@ int vlapic_init(struct vcpu *v)
 }
 
 vlapic->pt.source = PTSRC_lapic;
+vlapic->hw.x2apic_id = 2 * v->vcpu_id;
 
 vlapic->regs_page = alloc_domheap_page(v->domain, MEMF_no_owner);
 if ( !vlapic->regs_page )
diff --git a/xen/arch/x86/include/asm/hvm/vlapic.h 
b/xen/arch/x86/include/asm/hvm/vlapic.h
index 2c4ff94ae7a8..85c4a236b9f6

[PATCH v7 04/10] tools/hvmloader: Retrieve (x2)APIC IDs from the APs themselves

2024-10-21 Thread Alejandro Vallejo
Make it so the APs expose their own APIC IDs in a LUT. We can use that
LUT to populate the MADT, decoupling the algorithm that relates CPU IDs
and APIC IDs from hvmloader.

Moved smp_initialise() ahead of apic_setup() in order to initialise
cpu_to_x2apicid ASAP and avoid using it uninitialised. Note that
bringing up the APs doesn't need the APIC in hvmloader because it always
runs virtualized and uses the PV interface.

While at it, exploit the assumption that CPU0 always has APICID0 to
remove ap_callin, as writing the APIC ID may serve the same purpose.

Signed-off-by: Alejandro Vallejo 
---
v7:
  * CPU_TO_X2APICID to lowercase
  * Spell out the CPU0<-->APICID0 relationship in the commit message as
the rationale to remove ap_callin.
  * Explain the motion of smp_initialise() ahead of apic_setup() in the
commit message.
---
 tools/firmware/hvmloader/config.h   |  5 ++-
 tools/firmware/hvmloader/hvmloader.c|  6 +--
 tools/firmware/hvmloader/mp_tables.c|  4 +-
 tools/firmware/hvmloader/smp.c  | 57 -
 tools/firmware/hvmloader/util.c |  2 +-
 tools/include/xen-tools/common-macros.h |  5 +++
 6 files changed, 63 insertions(+), 16 deletions(-)

diff --git a/tools/firmware/hvmloader/config.h 
b/tools/firmware/hvmloader/config.h
index cd716bf39245..04cab1e59f08 100644
--- a/tools/firmware/hvmloader/config.h
+++ b/tools/firmware/hvmloader/config.h
@@ -4,6 +4,8 @@
 #include 
 #include 
 
+#include 
+
 enum virtual_vga { VGA_none, VGA_std, VGA_cirrus, VGA_pt };
 extern enum virtual_vga virtual_vga;
 
@@ -48,8 +50,9 @@ extern uint8_t ioapic_version;
 
 #define IOAPIC_ID   0x01
 
+extern uint32_t cpu_to_x2apicid[HVM_MAX_VCPUS];
+
 #define LAPIC_BASE_ADDRESS  0xfee0
-#define LAPIC_ID(vcpu_id)   ((vcpu_id) * 2)
 
 #define PCI_ISA_DEVFN   0x08/* dev 1, fn 0 */
 #define PCI_ISA_IRQ_MASK0x0c20U /* ISA IRQs 5,10,11 are PCI connected */
diff --git a/tools/firmware/hvmloader/hvmloader.c 
b/tools/firmware/hvmloader/hvmloader.c
index f8af88fabf24..bebdfa923880 100644
--- a/tools/firmware/hvmloader/hvmloader.c
+++ b/tools/firmware/hvmloader/hvmloader.c
@@ -224,7 +224,7 @@ static void apic_setup(void)
 
 /* 8259A ExtInts are delivered through IOAPIC pin 0 (Virtual Wire Mode). */
 ioapic_write(0x10, APIC_DM_EXTINT);
-ioapic_write(0x11, SET_APIC_ID(LAPIC_ID(0)));
+ioapic_write(0x11, SET_APIC_ID(cpu_to_x2apicid[0]));
 }
 
 struct bios_info {
@@ -341,11 +341,11 @@ int main(void)
 
 printf("CPU speed is %u MHz\n", get_cpu_mhz());
 
+smp_initialise();
+
 apic_setup();
 pci_setup();
 
-smp_initialise();
-
 perform_tests();
 
 if ( bios->bios_info_setup )
diff --git a/tools/firmware/hvmloader/mp_tables.c 
b/tools/firmware/hvmloader/mp_tables.c
index 77d3010406d0..539260365e1e 100644
--- a/tools/firmware/hvmloader/mp_tables.c
+++ b/tools/firmware/hvmloader/mp_tables.c
@@ -198,8 +198,10 @@ static void fill_mp_config_table(struct mp_config_table 
*mpct, int length)
 /* fills in an MP processor entry for VCPU 'vcpu_id' */
 static void fill_mp_proc_entry(struct mp_proc_entry *mppe, int vcpu_id)
 {
+ASSERT(cpu_to_x2apicid[vcpu_id] < 0xFF );
+
 mppe->type = ENTRY_TYPE_PROCESSOR;
-mppe->lapic_id = LAPIC_ID(vcpu_id);
+mppe->lapic_id = cpu_to_x2apicid[vcpu_id];
 mppe->lapic_version = 0x11;
 mppe->cpu_flags = CPU_FLAG_ENABLED;
 if ( vcpu_id == 0 )
diff --git a/tools/firmware/hvmloader/smp.c b/tools/firmware/hvmloader/smp.c
index 1b940cefd071..d63536f14f00 100644
--- a/tools/firmware/hvmloader/smp.c
+++ b/tools/firmware/hvmloader/smp.c
@@ -29,7 +29,37 @@
 
 #include 
 
-static int ap_callin;
+/**
+ * Lookup table of (x2)APIC IDs.
+ *
+ * Each entry is populated by its respective CPU as it comes online. This is required
+ * for generating the MADT with minimal assumptions about ID relationships.
+ *
+ * While the name makes "x2" explicit, these may actually be xAPIC IDs if no
+ * x2APIC is present. "x2" merely highlights that each entry is 32 bits wide.
+ */
+uint32_t cpu_to_x2apicid[HVM_MAX_VCPUS];
+
+/** Tristate about x2apic being supported. -1=unknown */
+static int has_x2apic = -1;
+
+static uint32_t read_apic_id(void)
+{
+uint32_t apic_id;
+
+if ( has_x2apic )
+cpuid(0xb, NULL, NULL, NULL, &apic_id);
+else
+{
+cpuid(1, NULL, &apic_id, NULL, NULL);
+apic_id >>= 24;
+}
+
+/* Never called by cpu0, so should never return 0 */
+ASSERT(apic_id);
+
+return apic_id;
+}
 
 static void cpu_setup(unsigned int cpu)
 {
@@ -37,13 +67,17 @@ static void cpu_setup(unsigned int cpu)
 cacheattr_init();
 printf("done.\n");
 
-if ( !cpu ) /* Used on the BSP too */
+/* The BSP exits early because its APIC ID is known to be zero */
+if ( !cpu )
 return;
 
 wmb();
-ap_callin = 1;
+ACCESS_ONCE(cpu_to_x2apicid[cpu]) = read

[PATCH v7 08/10] xen/x86: Derive topologically correct x2APIC IDs from the policy

2024-10-21 Thread Alejandro Vallejo
Implements the helper for mapping vcpu_id to x2apic_id given a valid
topology in a policy. The algo is written with the intention of
extending it to leaves 0x1f and extended 0x26 in the future.

The helper returns the legacy mapping when leaf 0xb is not implemented
(as is the case at the moment).
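
To illustrate the intended mapping (this mirrors what the test cases below
expect; the real helper derives the shifts from the policy's leaf 0xb data
rather than taking them as parameters, so treat this as a sketch only):

```c
#include <stdint.h>

/*
 * Illustrative-only sketch of the vcpu_id -> x2APIC ID derivation.
 * core_shift is the number of bits reserved for the thread index and
 * pkg_shift the number of bits reserved for thread + core indices
 * (i.e. the id_shift values of the two leaf 0xb subleaves).
 */
static uint32_t example_x2apic_id(uint32_t vcpu_id,
                                  unsigned int threads_per_core,
                                  unsigned int cores_per_pkg,
                                  unsigned int core_shift,
                                  unsigned int pkg_shift)
{
    uint32_t thread = vcpu_id % threads_per_core;
    uint32_t core   = (vcpu_id / threads_per_core) % cores_per_pkg;
    uint32_t pkg    = vcpu_id / (threads_per_core * cores_per_pkg);

    return thread | (core << core_shift) | (pkg << pkg_shift);
}
```

e.g. vcpu 35 with 3 threads/core and 8 cores/pkg gives
(35 % 3) | (((35 / 3) % 8) << 2) | ((35 / 24) << 5), matching the test below.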

Signed-off-by: Alejandro Vallejo 
---
v7:
  * Changes to commit message
---
 tools/tests/cpu-policy/test-cpu-policy.c | 68 +
 xen/include/xen/lib/x86/cpu-policy.h | 11 
 xen/lib/x86/policy.c | 76 
 3 files changed, 155 insertions(+)

diff --git a/tools/tests/cpu-policy/test-cpu-policy.c 
b/tools/tests/cpu-policy/test-cpu-policy.c
index 849d7cebaa7c..e5f9b8f7ee39 100644
--- a/tools/tests/cpu-policy/test-cpu-policy.c
+++ b/tools/tests/cpu-policy/test-cpu-policy.c
@@ -781,6 +781,73 @@ static void test_topo_from_parts(void)
 }
 }
 
+static void test_x2apic_id_from_vcpu_id_success(void)
+{
+static const struct test {
+unsigned int vcpu_id;
+unsigned int threads_per_core;
+unsigned int cores_per_pkg;
+uint32_t x2apic_id;
+uint8_t x86_vendor;
+} tests[] = {
+{
+.vcpu_id = 3, .threads_per_core = 3, .cores_per_pkg = 8,
+.x2apic_id = 1 << 2,
+},
+{
+.vcpu_id = 6, .threads_per_core = 3, .cores_per_pkg = 8,
+.x2apic_id = 2 << 2,
+},
+{
+.vcpu_id = 24, .threads_per_core = 3, .cores_per_pkg = 8,
+.x2apic_id = 1 << 5,
+},
+{
+.vcpu_id = 35, .threads_per_core = 3, .cores_per_pkg = 8,
+.x2apic_id = (35 % 3) | (((35 / 3) % 8) << 2) | ((35 / 24) << 5),
+},
+{
+.vcpu_id = 96, .threads_per_core = 7, .cores_per_pkg = 3,
+.x2apic_id = (96 % 7) | (((96 / 7) % 3) << 3) | ((96 / 21) << 5),
+},
+};
+
+const uint8_t vendors[] = {
+X86_VENDOR_INTEL,
+X86_VENDOR_AMD,
+X86_VENDOR_CENTAUR,
+X86_VENDOR_SHANGHAI,
+X86_VENDOR_HYGON,
+};
+
+printf("Testing x2apic id from vcpu id success:\n");
+
+/* Perform the test run on every vendor we know about */
+for ( size_t i = 0; i < ARRAY_SIZE(vendors); ++i )
+{
+for ( size_t j = 0; j < ARRAY_SIZE(tests); ++j )
+{
+struct cpu_policy policy = { .x86_vendor = vendors[i] };
+const struct test *t = &tests[j];
+uint32_t x2apic_id;
+int rc = x86_topo_from_parts(&policy, t->threads_per_core,
+ t->cores_per_pkg);
+
+if ( rc ) {
+fail("FAIL[%d] - 'x86_topo_from_parts() failed", rc);
+continue;
+}
+
+x2apic_id = x86_x2apic_id_from_vcpu_id(&policy, t->vcpu_id);
+if ( x2apic_id != t->x2apic_id )
+fail("FAIL - '%s cpu%u %u t/c %u c/p'. bad x2apic_id: 
expected=%u actual=%u\n",
+ x86_cpuid_vendor_to_str(policy.x86_vendor),
+ t->vcpu_id, t->threads_per_core, t->cores_per_pkg,
+ t->x2apic_id, x2apic_id);
+}
+}
+}
+
 int main(int argc, char **argv)
 {
 printf("CPU Policy unit tests\n");
@@ -799,6 +866,7 @@ int main(int argc, char **argv)
 test_is_compatible_failure();
 
 test_topo_from_parts();
+test_x2apic_id_from_vcpu_id_success();
 
 if ( nr_failures )
 printf("Done: %u failures\n", nr_failures);
diff --git a/xen/include/xen/lib/x86/cpu-policy.h 
b/xen/include/xen/lib/x86/cpu-policy.h
index 67d16fda933d..61d5cf3c7f12 100644
--- a/xen/include/xen/lib/x86/cpu-policy.h
+++ b/xen/include/xen/lib/x86/cpu-policy.h
@@ -542,6 +542,17 @@ int x86_cpu_policies_are_compatible(const struct 
cpu_policy *host,
 const struct cpu_policy *guest,
 struct cpu_policy_errors *err);
 
+/**
+ * Calculates the x2APIC ID of a vCPU given a CPU policy
+ *
+ * If the policy lacks leaf 0xb falls back to legacy mapping of apic_id=cpu*2
+ *
+ * @param p  CPU policy of the domain.
+ * @param id vCPU ID of the vCPU.
+ * @returns x2APIC ID of the vCPU.
+ */
+uint32_t x86_x2apic_id_from_vcpu_id(const struct cpu_policy *p, uint32_t id);
+
 /**
  * Synthesise topology information in `p` given high-level constraints
  *
diff --git a/xen/lib/x86/policy.c b/xen/lib/x86/policy.c
index 5ff89022e901..427a90f907a2 100644
--- a/xen/lib/x86/policy.c
+++ b/xen/lib/x86/policy.c
@@ -2,6 +2,82 @@
 
 #include 
 
+static uint32_t parts_per_higher_scoped_level(const struct cpu_policy *p,
+  size_t lvl)
+{
+/*
+ * `nr_logical` reported by Intel is the number of THREADS contained in
+ * the next topological

[PATCH v7 03/10] xen/x86: Add supporting code for uploading LAPIC contexts during domain create

2024-10-21 Thread Alejandro Vallejo
A later patch will upload LAPIC contexts as part of domain creation. In
order for it not to encounter a problem where the architectural state
does not reflect the APIC ID in the hidden state, this patch ensures
updates to the hidden state trigger an update in the architectural
registers so the APIC ID in both is consistent.

Signed-off-by: Alejandro Vallejo 
---
v7:
 * Rework the commit message so it explains a follow-up patch rather
   than hypothetical behaviour.
---
 xen/arch/x86/hvm/vlapic.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 33b463925f4e..03581eb33812 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -1640,7 +1640,27 @@ static int cf_check lapic_load_hidden(struct domain *d, 
hvm_domain_context_t *h)
 
 s->loaded.hw = 1;
 if ( s->loaded.regs )
+{
+/*
+ * We already processed architectural regs in lapic_load_regs(), so
+ * this must be a migration. Fix up inconsistencies from any older Xen.
+ */
 lapic_load_fixup(s);
+}
+else
+{
+/*
+ * We haven't seen architectural regs so this could be a migration or a
+ * plain domain create. In the domain create case it's fine to modify
+ * the architectural state to align it to the APIC ID that was just
+ * uploaded and in the migrate case it doesn't matter because the
+ * architectural state will be replaced by the LAPIC_REGS ctx later on.
+ */
+if ( vlapic_x2apic_mode(s) )
+set_x2apic_id(s);
+else
+vlapic_set_reg(s, APIC_ID, SET_xAPIC_ID(s->hw.x2apic_id));
+}
 
 hvm_update_vlapic_mode(v);
 
-- 
2.47.0




[PATCH v7 09/10] tools/libguest: Set distinct x2APIC IDs for each vCPU

2024-10-21 Thread Alejandro Vallejo
Have toolstack populate the new x2APIC ID in the LAPIC save record with
the proper IDs intended for each vCPU.

Signed-off-by: Alejandro Vallejo 
---
v7:
  * Unchanged
---
 tools/libs/guest/xg_dom_x86.c | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/tools/libs/guest/xg_dom_x86.c b/tools/libs/guest/xg_dom_x86.c
index c98229317db7..38486140ed15 100644
--- a/tools/libs/guest/xg_dom_x86.c
+++ b/tools/libs/guest/xg_dom_x86.c
@@ -1004,11 +1004,14 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 HVM_SAVE_TYPE(HEADER) header;
 struct hvm_save_descriptor mtrr_d;
 HVM_SAVE_TYPE(MTRR) mtrr;
+struct hvm_save_descriptor lapic_d;
+HVM_SAVE_TYPE(LAPIC) lapic;
 struct hvm_save_descriptor end_d;
 HVM_SAVE_TYPE(END) end;
 } vcpu_ctx;
-/* Context from full_ctx */
+/* Contexts from full_ctx */
 const HVM_SAVE_TYPE(MTRR) *mtrr_record;
+const HVM_SAVE_TYPE(LAPIC) *lapic_record;
 /* Raw context as taken from Xen */
 uint8_t *full_ctx = NULL;
 int rc;
@@ -,6 +1114,8 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 vcpu_ctx.mtrr_d.typecode = HVM_SAVE_CODE(MTRR);
 vcpu_ctx.mtrr_d.length = HVM_SAVE_LENGTH(MTRR);
 vcpu_ctx.mtrr = *mtrr_record;
+vcpu_ctx.lapic_d.typecode = HVM_SAVE_CODE(LAPIC);
+vcpu_ctx.lapic_d.length = HVM_SAVE_LENGTH(LAPIC);
 vcpu_ctx.end_d = bsp_ctx.end_d;
 vcpu_ctx.end = bsp_ctx.end;
 
@@ -1125,6 +1130,18 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 {
 vcpu_ctx.mtrr_d.instance = i;
 
+lapic_record = hvm_get_save_record(full_ctx, HVM_SAVE_CODE(LAPIC), i);
+if ( !lapic_record )
+{
+xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+ "%s: unable to get LAPIC[%d] save record", __func__, 
i);
+goto out;
+}
+
+vcpu_ctx.lapic = *lapic_record;
+vcpu_ctx.lapic.x2apic_id = dom->cpu_to_apicid[i];
+vcpu_ctx.lapic_d.instance = i;
+
 rc = xc_domain_hvm_setcontext(dom->xch, dom->guest_domid,
   (uint8_t *)&vcpu_ctx, sizeof(vcpu_ctx));
 if ( rc != 0 )
-- 
2.47.0




[PATCH v7 05/10] tools/libacpi: Use LUT of APIC IDs rather than function pointer

2024-10-21 Thread Alejandro Vallejo
Refactors libacpi so that a single LUT is the authoritative source of
truth for the CPU to APIC ID mappings. This has a knock-on effect in
reducing complexity in future patches, as the same LUT can be used for
configuring the APICs and configuring the ACPI tables for PVH.

No functional change intended, because the same mappings are preserved.

Signed-off-by: Alejandro Vallejo 
---
v7:
  * NOTE: didn't add assert to libacpi as initially accepted in order to
protect libvirt from an assert failure.
  * s/uint32_t/unsigned int/ in for loop of libxl.
  * turned Xen-style loop in libxl to libxl-style.
---
 tools/firmware/hvmloader/util.c   | 7 +--
 tools/include/xenguest.h  | 5 +
 tools/libacpi/build.c | 6 +++---
 tools/libacpi/libacpi.h   | 2 +-
 tools/libs/light/libxl_dom.c  | 5 +
 tools/libs/light/libxl_x86_acpi.c | 7 +--
 6 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
index 821b3086a87d..afa3eb9d5775 100644
--- a/tools/firmware/hvmloader/util.c
+++ b/tools/firmware/hvmloader/util.c
@@ -825,11 +825,6 @@ static void acpi_mem_free(struct acpi_ctxt *ctxt,
 /* ACPI builder currently doesn't free memory so this is just a stub */
 }
 
-static uint32_t acpi_lapic_id(unsigned cpu)
-{
-return cpu_to_x2apic_id[cpu];
-}
-
 void hvmloader_acpi_build_tables(struct acpi_config *config,
  unsigned int physical)
 {
@@ -859,7 +854,7 @@ void hvmloader_acpi_build_tables(struct acpi_config *config,
 }
 
 config->lapic_base_address = LAPIC_BASE_ADDRESS;
-config->lapic_id = acpi_lapic_id;
+config->cpu_to_apicid = cpu_to_x2apicid;
 config->ioapic_base_address = IOAPIC_BASE_ADDRESS;
 config->ioapic_id = IOAPIC_ID;
 config->pci_isa_irq_mask = PCI_ISA_IRQ_MASK; 
diff --git a/tools/include/xenguest.h b/tools/include/xenguest.h
index e01f494b772a..aa50b78dfb89 100644
--- a/tools/include/xenguest.h
+++ b/tools/include/xenguest.h
@@ -22,6 +22,8 @@
 #ifndef XENGUEST_H
 #define XENGUEST_H
 
+#include "xen/hvm/hvm_info_table.h"
+
 #define XC_NUMA_NO_NODE   (~0U)
 
 #define XCFLAGS_LIVE  (1 << 0)
@@ -236,6 +238,9 @@ struct xc_dom_image {
 #if defined(__i386__) || defined(__x86_64__)
 struct e820entry *e820;
 unsigned int e820_entries;
+
+/* LUT mapping cpu id to (x2)APIC ID */
+uint32_t cpu_to_apicid[HVM_MAX_VCPUS];
 #endif
 
 xen_pfn_t vuart_gfn;
diff --git a/tools/libacpi/build.c b/tools/libacpi/build.c
index 2f29863db154..2ad1d461a2ec 100644
--- a/tools/libacpi/build.c
+++ b/tools/libacpi/build.c
@@ -74,7 +74,7 @@ static struct acpi_20_madt *construct_madt(struct acpi_ctxt 
*ctxt,
 const struct hvm_info_table   *hvminfo = config->hvminfo;
 int i, sz;
 
-if ( config->lapic_id == NULL )
+if ( config->cpu_to_apicid == NULL )
 return NULL;
 
 sz  = sizeof(struct acpi_20_madt);
@@ -148,7 +148,7 @@ static struct acpi_20_madt *construct_madt(struct acpi_ctxt 
*ctxt,
 lapic->length  = sizeof(*lapic);
 /* Processor ID must match processor-object IDs in the DSDT. */
 lapic->acpi_processor_id = i;
-lapic->apic_id = config->lapic_id(i);
+lapic->apic_id = config->cpu_to_apicid[i];
 lapic->flags = (test_bit(i, hvminfo->vcpu_online)
 ? ACPI_LOCAL_APIC_ENABLED : 0);
 lapic++;
@@ -236,7 +236,7 @@ static struct acpi_20_srat *construct_srat(struct acpi_ctxt 
*ctxt,
 processor->type = ACPI_PROCESSOR_AFFINITY;
 processor->length   = sizeof(*processor);
 processor->domain   = config->numa.vcpu_to_vnode[i];
-processor->apic_id  = config->lapic_id(i);
+processor->apic_id  = config->cpu_to_apicid[i];
 processor->flags= ACPI_LOCAL_APIC_AFFIN_ENABLED;
 processor++;
 }
diff --git a/tools/libacpi/libacpi.h b/tools/libacpi/libacpi.h
index deda39e5dbc4..e8f603ee18ee 100644
--- a/tools/libacpi/libacpi.h
+++ b/tools/libacpi/libacpi.h
@@ -84,7 +84,7 @@ struct acpi_config {
 unsigned long rsdp;
 
 /* x86-specific parameters */
-uint32_t (*lapic_id)(unsigned cpu);
+const uint32_t *cpu_to_apicid; /* LUT mapping cpu id to (x2)APIC ID */
 uint32_t lapic_base_address;
 uint32_t ioapic_base_address;
 uint16_t pci_isa_irq_mask;
diff --git a/tools/libs/light/libxl_dom.c b/tools/libs/light/libxl_dom.c
index 94fef374014e..5f4f6830e850 100644
--- a/tools/libs/light/libxl_dom.c
+++ b/tools/libs/light/libxl_dom.c
@@ -1082,6 +1082,11 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
 
 dom->container_type = XC_DOM_HVM_CONTAINER;
 
+#if defined(__i386__) || defined(__x86_64__)
+for (unsigned int i = 0; i < info->max_vcpus; i++)
+dom->cpu_to_apicid[i] = 2 * i; /* TODO: Replace by topo calculation */
+#endif
+
 /* The par

[PATCH v7 10/10] tools/x86: Synthesise domain topologies

2024-10-21 Thread Alejandro Vallejo
Expose sensible topologies in leaf 0xb. At the moment it synthesises
non-HT systems, in line with the previous code intent.

Leaf 0xb in the host policy is no longer zapped and the guest {max,def}
policies have their topology leaves zapped instead. The intent is for
toolstack to populate them. There's no current use for the topology
information in the host policy, but it does no harm.

Signed-off-by: Alejandro Vallejo 
---
v7:
  * No changes
---
 tools/include/xenguest.h|  3 +++
 tools/libs/guest/xg_cpuid_x86.c | 29 -
 tools/libs/light/libxl_dom.c| 22 +-
 xen/arch/x86/cpu-policy.c   |  9 ++---
 4 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/tools/include/xenguest.h b/tools/include/xenguest.h
index aa50b78dfb89..dcabf219b9cb 100644
--- a/tools/include/xenguest.h
+++ b/tools/include/xenguest.h
@@ -831,6 +831,9 @@ int xc_set_domain_cpu_policy(xc_interface *xch, uint32_t 
domid,
 
 uint32_t xc_get_cpu_featureset_size(void);
 
+/* Returns the APIC ID of the `cpu`-th CPU according to `policy` */
+uint32_t xc_cpu_to_apicid(const xc_cpu_policy_t *policy, unsigned int cpu);
+
 enum xc_static_cpu_featuremask {
 XC_FEATUREMASK_KNOWN,
 XC_FEATUREMASK_SPECIAL,
diff --git a/tools/libs/guest/xg_cpuid_x86.c b/tools/libs/guest/xg_cpuid_x86.c
index 4453178100ad..c591f8732a1a 100644
--- a/tools/libs/guest/xg_cpuid_x86.c
+++ b/tools/libs/guest/xg_cpuid_x86.c
@@ -725,8 +725,16 @@ int xc_cpuid_apply_policy(xc_interface *xch, uint32_t 
domid, bool restore,
 p->policy.basic.htt   = test_bit(X86_FEATURE_HTT, host_featureset);
 p->policy.extd.cmp_legacy = test_bit(X86_FEATURE_CMP_LEGACY, 
host_featureset);
 }
-else
+else if ( restore )
 {
+/*
+ * Reconstruct the topology exposed on Xen <= 4.13. It makes very 
little
+ * sense, but it's what those guests saw so it's set in stone now.
+ *
+ * Guests from Xen 4.14 onwards carry their own CPUID leaves in the
+ * migration stream so they don't need special treatment.
+ */
+
 /*
  * Topology for HVM guests is entirely controlled by Xen.  For now, we
  * hardcode APIC_ID = vcpu_id * 2 to give the illusion of no SMT.
@@ -782,6 +790,20 @@ int xc_cpuid_apply_policy(xc_interface *xch, uint32_t 
domid, bool restore,
 break;
 }
 }
+else
+{
+/* TODO: Expose the ability to choose a custom topology for HVM/PVH */
+unsigned int threads_per_core = 1;
+unsigned int cores_per_pkg = di.max_vcpu_id + 1;
+
+rc = x86_topo_from_parts(&p->policy, threads_per_core, cores_per_pkg);
+if ( rc )
+{
+ERROR("Failed to generate topology: rc=%d t/c=%u c/p=%u",
+  rc, threads_per_core, cores_per_pkg);
+goto out;
+}
+}
 
 nr_leaves = ARRAY_SIZE(p->leaves);
 rc = x86_cpuid_copy_to_buffer(&p->policy, p->leaves, &nr_leaves);
@@ -1028,3 +1050,8 @@ bool xc_cpu_policy_is_compatible(xc_interface *xch, 
xc_cpu_policy_t *host,
 
 return false;
 }
+
+uint32_t xc_cpu_to_apicid(const xc_cpu_policy_t *policy, unsigned int cpu)
+{
+return x86_x2apic_id_from_vcpu_id(&policy->policy, cpu);
+}
diff --git a/tools/libs/light/libxl_dom.c b/tools/libs/light/libxl_dom.c
index 5f4f6830e850..1d7c34820d8f 100644
--- a/tools/libs/light/libxl_dom.c
+++ b/tools/libs/light/libxl_dom.c
@@ -1063,6 +1063,9 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
 libxl_domain_build_info *const info = &d_config->b_info;
 struct xc_dom_image *dom = NULL;
 bool device_model = info->type == LIBXL_DOMAIN_TYPE_HVM ? true : false;
+#if defined(__i386__) || defined(__x86_64__)
+struct xc_cpu_policy *policy = NULL;
+#endif
 
 xc_dom_loginit(ctx->xch);
 
@@ -1083,8 +1086,22 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
 dom->container_type = XC_DOM_HVM_CONTAINER;
 
 #if defined(__i386__) || defined(__x86_64__)
+policy = xc_cpu_policy_init();
+if (!policy) {
+LOGE(ERROR, "xc_cpu_policy_get_domain failed d%u", domid);
+rc = ERROR_NOMEM;
+goto out;
+}
+
+rc = xc_cpu_policy_get_domain(ctx->xch, domid, policy);
+if (rc != 0) {
+LOGE(ERROR, "xc_cpu_policy_get_domain failed d%u", domid);
+rc = ERROR_FAIL;
+goto out;
+}
+
 for (unsigned int i = 0; i < info->max_vcpus; i++)
-dom->cpu_to_apicid[i] = 2 * i; /* TODO: Replace by topo calculation */
+dom->cpu_to_apicid[i] = xc_cpu_to_apicid(policy, i);
 #endif
 
 /* The params from the configuration file are in Mb, which are then
@@ -1214,6 +1231,9 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
 out:
 assert(rc != 0);
 if (dom != NULL) xc_dom_release(dom);
+#if defined(__i386__) || defined(__x86_64__)
+xc_cp

[PATCH v7 07/10] xen/lib: Add topology generator for x86

2024-10-21 Thread Alejandro Vallejo
Add a helper to populate topology leaves in the cpu policy from
threads/core and cores/package counts. It's unit-tested in
test-cpu-policy.c, but it's not connected to the rest of the code yet.

Intel's cache leaves (CPUID[4]) have limited width for core counts, so
(in the absence of real world data for how it might behave) this
implementation takes the view that those counts should clip to their
maximum values on overflow. Just like lppp and NC.
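
As a sketch of the clipping being described (field widths per the SDM:
CPUID[4].EAX[31:26] holds cores-per-package minus 1 and EAX[25:14] holds
threads-sharing-the-cache minus 1; the helper name here is made up purely
for illustration, it's not what the patch uses):

```c
#include <stdint.h>

/* Clamp a (count - 1) value so it fits in an n-bit CPUID bitfield. */
static uint32_t clip_count(uint32_t count, unsigned int bits)
{
    uint32_t max = (1U << bits) - 1;

    return (count - 1) > max ? max : (count - 1);
}

/*
 * Per CPUID[4] subleaf, roughly:
 *   cores_per_package = clip_count(cores_per_pkg, 6);      (EAX[31:26])
 *   threads_per_cache = clip_count(threads_per_core, 12);  (EAX[25:14])
 */
```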

Adds the ASSERT() macro to xen/lib/x86/private.h, as it was missing.

Signed-off-by: Alejandro Vallejo 
---
v7:
  * MAX/MIN -> max/min; adding U suffixes to literals for type-matching
and uppercases for MISRA compliance.
  * Clip core counts in cache leaves to their maximum values
  * Remove unified cache conditional. Less code, and less likely for the
threads_per_cache field to clip.
  * Add extra check to ensure threads_per_pkg fits in 16 bits (which is
the space it has in leaf 0xb).
  * Add extra check to detect overflow in threads_per_pkg calculation.
  * Reworked the comment for the topo generator, expressing more clearly
what are inputs and what are outputs.
---
 tools/tests/cpu-policy/test-cpu-policy.c | 133 +++
 xen/include/xen/lib/x86/cpu-policy.h |  16 +++
 xen/lib/x86/policy.c |  93 
 xen/lib/x86/private.h|   4 +
 4 files changed, 246 insertions(+)

diff --git a/tools/tests/cpu-policy/test-cpu-policy.c 
b/tools/tests/cpu-policy/test-cpu-policy.c
index 301df2c00285..849d7cebaa7c 100644
--- a/tools/tests/cpu-policy/test-cpu-policy.c
+++ b/tools/tests/cpu-policy/test-cpu-policy.c
@@ -650,6 +650,137 @@ static void test_is_compatible_failure(void)
 }
 }
 
+static void test_topo_from_parts(void)
+{
+static const struct test {
+unsigned int threads_per_core;
+unsigned int cores_per_pkg;
+struct cpu_policy policy;
+} tests[] = {
+{
+.threads_per_core = 3, .cores_per_pkg = 1,
+.policy = {
+.x86_vendor = X86_VENDOR_AMD,
+.topo.subleaf = {
+{ .nr_logical = 3, .level = 0, .type = 1, .id_shift = 2, },
+{ .nr_logical = 1, .level = 1, .type = 2, .id_shift = 2, },
+},
+},
+},
+{
+.threads_per_core = 1, .cores_per_pkg = 3,
+.policy = {
+.x86_vendor = X86_VENDOR_AMD,
+.topo.subleaf = {
+{ .nr_logical = 1, .level = 0, .type = 1, .id_shift = 0, },
+{ .nr_logical = 3, .level = 1, .type = 2, .id_shift = 2, },
+},
+},
+},
+{
+.threads_per_core = 7, .cores_per_pkg = 5,
+.policy = {
+.x86_vendor = X86_VENDOR_AMD,
+.topo.subleaf = {
+{ .nr_logical = 7, .level = 0, .type = 1, .id_shift = 3, },
+{ .nr_logical = 5, .level = 1, .type = 2, .id_shift = 6, },
+},
+},
+},
+{
+.threads_per_core = 2, .cores_per_pkg = 128,
+.policy = {
+.x86_vendor = X86_VENDOR_AMD,
+.topo.subleaf = {
+{ .nr_logical = 2, .level = 0, .type = 1, .id_shift = 1, },
+{ .nr_logical = 128, .level = 1, .type = 2,
+  .id_shift = 8, },
+},
+},
+},
+{
+.threads_per_core = 3, .cores_per_pkg = 1,
+.policy = {
+.x86_vendor = X86_VENDOR_INTEL,
+.topo.subleaf = {
+{ .nr_logical = 3, .level = 0, .type = 1, .id_shift = 2, },
+{ .nr_logical = 3, .level = 1, .type = 2, .id_shift = 2, },
+},
+},
+},
+{
+.threads_per_core = 1, .cores_per_pkg = 3,
+.policy = {
+.x86_vendor = X86_VENDOR_INTEL,
+.topo.subleaf = {
+{ .nr_logical = 1, .level = 0, .type = 1, .id_shift = 0, },
+{ .nr_logical = 3, .level = 1, .type = 2, .id_shift = 2, },
+},
+},
+},
+{
+.threads_per_core = 7, .cores_per_pkg = 5,
+.policy = {
+.x86_vendor = X86_VENDOR_INTEL,
+.topo.subleaf = {
+{ .nr_logical = 7, .level = 0, .type = 1, .id_shift = 3, },
+{ .nr_logical = 35, .level = 1, .type = 2, .id_shift = 6, 
},
+},
+},
+},
+{
+.threads_per_core = 2, .cores_per_pkg = 128,
+.policy = {
+.x86_vendor = X86_VENDOR_INTEL,
+.topo.subleaf = {
+{ .nr_logical = 2, .level = 0, .type = 1, .id_shift = 1, },
+{ .nr_logical = 256, .le

[PATCH v7 00/10] x86: Expose consistent topology to guests

2024-10-21 Thread Alejandro Vallejo
Current topology handling is close to non-existent. As things stand, APIC IDs
are allocated through the apic_id=vcpu_id*2 relation, without giving the OS any
hints on how to parse the x2APIC ID of a given CPU, and assuming the guest will
infer 2 threads per core.

This series involves bringing x2APIC IDs into the migration stream, so older
guests keep operating as they used to, and enhancing Xen+toolstack so new guests
get topology information consistent with their x2APIC IDs. As a side effect of
this, x2APIC IDs are now packed and don't have (unless under a pathological
case) gaps.

Further work ought to allow combining these topology configurations with
gang-scheduling of guest hyperthreads into affine physical hyperthreads. For
the time being it purposefully keeps the configuration of "1 socket" + "1
thread per core" + "1 core per vCPU".

===

Other minor changes highlighted in each individual patch.

Hypervisor prerequisites:

  patch  1: lib/x86: Bump max basic leaf in {pv,hvm}_max_policy
* Conceptually similar to v6/patch1 ("Relax checks about policy
  compatibility"), but operates on the max policies instead.
  patch  2: xen/x86: Add initial x2APIC ID to the per-vLAPIC save area
  patch  3: xen/x86: Add supporting code for uploading LAPIC contexts during
   domain create

hvmloader prerequisites

  patch  4: tools/hvmloader: Retrieve (x2)APIC IDs from the APs themselves

Toolstack prerequisites:

  patch  5: tools/libacpi: Use LUT of APIC IDs rather than function pointer
  patch  6: tools/libguest: Always set vCPU context in vcpu_hvm()

No functional changes:

  patch  7: xen/lib: Add topology generator for x86
* Tweaked the behaviour of the cache leaves on overflow and added stronger
  checks.
  patch  8: xen/x86: Derive topologically correct x2APIC IDs from the policy

Final toolstack/xen stitching:

  patch  9: tools/libguest: Set distinct x2APIC IDs for each vCPU
  patch 10: xen/x86: Synthesise domain topologies

v6: 
https://lore.kernel.org/xen-devel/20241001123807.605-1-alejandro.vall...@cloud.com
v5: 
https://lore.kernel.org/xen-devel/20240808134251.29995-1-alejandro.vall...@cloud.com/
v4: 
https://lore.kernel.org/xen-devel/cover.1719416329.git.alejandro.vall...@cloud.com/
v3: 
https://lore.kernel.org/xen-devel/cover.1716976271.git.alejandro.vall...@cloud.com/
v2: 
https://lore.kernel.org/xen-devel/cover.1715102098.git.alejandro.vall...@cloud.com/
v1: 
https://lore.kernel.org/xen-devel/20240109153834.4192-1-alejandro.vall...@cloud.com/


Alejandro Vallejo (10):
  lib/x86: Bump max basic leaf in {pv,hvm}_max_policy
  xen/x86: Add initial x2APIC ID to the per-vLAPIC save area
  xen/x86: Add supporting code for uploading LAPIC contexts during
domain create
  tools/hvmloader: Retrieve (x2)APIC IDs from the APs themselves
  tools/libacpi: Use LUT of APIC IDs rather than function pointer
  tools/libguest: Always set vCPU context in vcpu_hvm()
  xen/lib: Add topology generator for x86
  xen/x86: Derive topologically correct x2APIC IDs from the policy
  tools/libguest: Set distinct x2APIC IDs for each vCPU
  tools/x86: Synthesise domain topologies

 tools/firmware/hvmloader/config.h|   5 +-
 tools/firmware/hvmloader/hvmloader.c |   6 +-
 tools/firmware/hvmloader/mp_tables.c |   4 +-
 tools/firmware/hvmloader/smp.c   |  57 +--
 tools/firmware/hvmloader/util.c  |   7 +-
 tools/include/xen-tools/common-macros.h  |   5 +
 tools/include/xenguest.h |   8 +
 tools/libacpi/build.c|   6 +-
 tools/libacpi/libacpi.h  |   2 +-
 tools/libs/guest/xg_cpuid_x86.c  |  29 +++-
 tools/libs/guest/xg_dom_x86.c|  93 +++
 tools/libs/light/libxl_dom.c |  25 +++
 tools/libs/light/libxl_x86_acpi.c|   7 +-
 tools/tests/cpu-policy/test-cpu-policy.c | 201 +++
 xen/arch/x86/cpu-policy.c|  15 +-
 xen/arch/x86/cpuid.c |  18 +-
 xen/arch/x86/hvm/vlapic.c|  42 -
 xen/arch/x86/include/asm/hvm/vlapic.h|   1 +
 xen/include/public/arch-x86/hvm/save.h   |   2 +
 xen/include/xen/lib/x86/cpu-policy.h |  27 +++
 xen/lib/x86/policy.c | 169 +++
 xen/lib/x86/private.h|   4 +
 22 files changed, 649 insertions(+), 84 deletions(-)


base-commit: 081683ea578da56dd20b9dc22a64d03c893b47ba
-- 
2.47.0




[PATCH v7 06/10] tools/libguest: Always set vCPU context in vcpu_hvm()

2024-10-21 Thread Alejandro Vallejo
Currently used by PVH to set MTRR, will be used by a later patch to set
APIC state. Unconditionally send the hypercall, and gate overriding the
MTRR so it remains functionally equivalent.

While at it, add a missing "goto out" to what was the error condition
in the loop.

In principle this patch shouldn't affect functionality. An extra record
(the MTRR) is sent to the hypervisor per vCPU on HVM, but these records
are identical to those retrieved in the first place so there's no
expected functional change.

Signed-off-by: Alejandro Vallejo 
---
v7:
  * Unchanged
---
 tools/libs/guest/xg_dom_x86.c | 84 ++-
 1 file changed, 44 insertions(+), 40 deletions(-)

diff --git a/tools/libs/guest/xg_dom_x86.c b/tools/libs/guest/xg_dom_x86.c
index cba01384ae75..c98229317db7 100644
--- a/tools/libs/guest/xg_dom_x86.c
+++ b/tools/libs/guest/xg_dom_x86.c
@@ -989,6 +989,7 @@ const static void *hvm_get_save_record(const void *ctx, 
unsigned int type,
 
 static int vcpu_hvm(struct xc_dom_image *dom)
 {
+/* Initialises the BSP */
 struct {
 struct hvm_save_descriptor header_d;
 HVM_SAVE_TYPE(HEADER) header;
@@ -997,6 +998,18 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 struct hvm_save_descriptor end_d;
 HVM_SAVE_TYPE(END) end;
 } bsp_ctx;
+/* Initialises APICs and MTRRs of every vCPU */
+struct {
+struct hvm_save_descriptor header_d;
+HVM_SAVE_TYPE(HEADER) header;
+struct hvm_save_descriptor mtrr_d;
+HVM_SAVE_TYPE(MTRR) mtrr;
+struct hvm_save_descriptor end_d;
+HVM_SAVE_TYPE(END) end;
+} vcpu_ctx;
+/* Context from full_ctx */
+const HVM_SAVE_TYPE(MTRR) *mtrr_record;
+/* Raw context as taken from Xen */
 uint8_t *full_ctx = NULL;
 int rc;
 
@@ -1083,51 +1096,42 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 bsp_ctx.end_d.instance = 0;
 bsp_ctx.end_d.length = HVM_SAVE_LENGTH(END);
 
-/* TODO: maybe this should be a firmware option instead? */
-if ( !dom->device_model )
+/* TODO: maybe setting MTRRs should be a firmware option instead? */
+mtrr_record = hvm_get_save_record(full_ctx, HVM_SAVE_CODE(MTRR), 0);
+
+if ( !mtrr_record)
 {
-struct {
-struct hvm_save_descriptor header_d;
-HVM_SAVE_TYPE(HEADER) header;
-struct hvm_save_descriptor mtrr_d;
-HVM_SAVE_TYPE(MTRR) mtrr;
-struct hvm_save_descriptor end_d;
-HVM_SAVE_TYPE(END) end;
-} mtrr = {
-.header_d = bsp_ctx.header_d,
-.header = bsp_ctx.header,
-.mtrr_d.typecode = HVM_SAVE_CODE(MTRR),
-.mtrr_d.length = HVM_SAVE_LENGTH(MTRR),
-.end_d = bsp_ctx.end_d,
-.end = bsp_ctx.end,
-};
-const HVM_SAVE_TYPE(MTRR) *mtrr_record =
-hvm_get_save_record(full_ctx, HVM_SAVE_CODE(MTRR), 0);
-unsigned int i;
-
-if ( !mtrr_record )
-{
-xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
- "%s: unable to get MTRR save record", __func__);
-goto out;
-}
+xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+ "%s: unable to get MTRR save record", __func__);
+goto out;
+}
 
-memcpy(&mtrr.mtrr, mtrr_record, sizeof(mtrr.mtrr));
+vcpu_ctx.header_d = bsp_ctx.header_d;
+vcpu_ctx.header = bsp_ctx.header;
+vcpu_ctx.mtrr_d.typecode = HVM_SAVE_CODE(MTRR);
+vcpu_ctx.mtrr_d.length = HVM_SAVE_LENGTH(MTRR);
+vcpu_ctx.mtrr = *mtrr_record;
+vcpu_ctx.end_d = bsp_ctx.end_d;
+vcpu_ctx.end = bsp_ctx.end;
 
-/*
- * Enable MTRR, set default type to WB.
- * TODO: add MMIO areas as UC when passthrough is supported.
- */
-mtrr.mtrr.msr_mtrr_def_type = MTRR_TYPE_WRBACK | MTRR_DEF_TYPE_ENABLE;
+/*
+ * Enable MTRR, set default type to WB.
+ * TODO: add MMIO areas as UC when passthrough is supported in PVH
+ */
+if ( !dom->device_model )
+vcpu_ctx.mtrr.msr_mtrr_def_type = MTRR_TYPE_WRBACK | 
MTRR_DEF_TYPE_ENABLE;
+
+for ( unsigned int i = 0; i < dom->max_vcpus; i++ )
+{
+vcpu_ctx.mtrr_d.instance = i;
 
-for ( i = 0; i < dom->max_vcpus; i++ )
+rc = xc_domain_hvm_setcontext(dom->xch, dom->guest_domid,
+  (uint8_t *)&vcpu_ctx, sizeof(vcpu_ctx));
+if ( rc != 0 )
 {
-mtrr.mtrr_d.instance = i;
-rc = xc_domain_hvm_setcontext(dom->xch, dom->guest_domid,
-  (uint8_t *)&mtrr, sizeof(mtrr));
-if ( rc != 0 )
-xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
- "%s: SETHVMCONTEXT failed (rc=%d)", __func__, rc);
+xc_dom_panic(dom->xch, XC_INTERNAL_E

[PATCH v7 01/10] lib/x86: Bump max basic leaf in {pv,hvm}_max_policy

2024-10-21 Thread Alejandro Vallejo
Bump it to ARRAY_SIZE() so toolstack is able to extend a policy past
host limits (i.e., to emulate a feature not present in the host).

Signed-off-by: Alejandro Vallejo 
---
v7:
  * Replaces v6/patch1("Relax checks about policy compatibility")
  * Bumps basic.max_leaf to ARRAY_SIZE(basic.raw) to pass the
compatibility checks rather than tweaking the checker.
---
 xen/arch/x86/cpu-policy.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index b6d9fad56773..715a66d2a978 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -585,6 +585,9 @@ static void __init calculate_pv_max_policy(void)
  */
 p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
 
+/* Toolstack may populate leaves not present in the basic host leaves */
+p->basic.max_leaf = ARRAY_SIZE(p->basic.raw) - 1;
+
 x86_cpu_policy_to_featureset(p, fs);
 
 for ( i = 0; i < ARRAY_SIZE(fs); ++i )
@@ -672,6 +675,9 @@ static void __init calculate_hvm_max_policy(void)
  */
 p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
 
+/* Toolstack may populate leaves not present in the basic host leaves */
+p->basic.max_leaf = ARRAY_SIZE(p->basic.raw) - 1;
+
 x86_cpu_policy_to_featureset(p, fs);
 
 mask = hvm_hap_supported() ?
-- 
2.47.0




Re: [PATCH 1/1] NUMA: Introduce NODE_DATA->node_present_pages(RAM pages)

2024-10-22 Thread Alejandro Vallejo
Hi,

The subject was probably meant to have a v3?

On Tue Oct 22, 2024 at 11:10 AM BST, Bernhard Kaindl wrote:
> From: Bernhard Kaindl 
>
> Some admin tools like 'xl info -n' like to display the total memory
> for each NUMA node. The Xen backend[1] of hwloc comes to mind too.
>
> The total amount of RAM on a NUMA node is not needed by Xen internally:
> Xen only uses NODE_DATA->node_spanned_pages, but that can be confusing
> for users as it includes memory holes (can be as large as 2GB on x86).
>
> Calculate the RAM per NUMA node by iterating over arch_get_ram_range()
> which returns the e820 RAM entries on x86 and update it on memory_add().
>
> Use NODE_DATA->node_present_pages (like in the Linux kernel) to hold
> this info and in a later commit, find a way for tools to read it.

Part of this information would be more helpful in a comment in the definition
of node_data, I think.

>
> [1] hwloc with Xen backend: https://github.com/xenserver-next/hwloc/
>
> Signed-off-by: Bernhard Kaindl 
> ---
> Changes in v2:
> - Remove update of numainfo call, only calculate RAM for each node.
> - Calculate RAM based on page boundaries, coding style fixes
> Changes in v3:
> - Use PFN_UP/DOWN, refactored further to simplify the code, while leaving
>   compiler-level optimisations to the compiler's optimisation passes.
> ---
>  xen/arch/x86/x86_64/mm.c |  3 +++
>  xen/common/numa.c| 31 ---
>  xen/include/xen/numa.h   |  3 +++
>  3 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c
> index b2a280fba3..66b9bed057 100644
> --- a/xen/arch/x86/x86_64/mm.c
> +++ b/xen/arch/x86/x86_64/mm.c
> @@ -1334,6 +1334,9 @@ int memory_add(unsigned long spfn, unsigned long epfn, 
> unsigned int pxm)
>  share_hotadd_m2p_table(&info);
>  transfer_pages_to_heap(&info);
>  
> +/* Update the node's present pages (like the total_pages of the system) 
> */
> +NODE_DATA(node)->node_present_pages += epfn - spfn;
> +
>  return 0;
>  
>  destroy_m2p:
> diff --git a/xen/common/numa.c b/xen/common/numa.c
> index 28a09766fa..374132df08 100644
> --- a/xen/common/numa.c
> +++ b/xen/common/numa.c
> @@ -4,6 +4,7 @@
>   * Adapted for Xen: Ryan Harper 
>   */
>  
> +#include "xen/pfn.h"
>  #include 
>  #include 
>  #include 
> @@ -499,15 +500,39 @@ int __init compute_hash_shift(const struct node *nodes,
>  return shift;
>  }
>  
> -/* Initialize NODE_DATA given nodeid and start/end */
> +/**
> + * @brief Initialize a NUMA node's NODE_DATA given nodeid and start/end 
> addrs.
> + *
> + * This function sets up the boot memory for a given NUMA node by calculating
> + * the node's start and end page frame numbers (PFNs) and determining
> + * the number of present RAM pages within the node's memory range.
> + *
> + * @param nodeid The identifier of the node to initialize.
> + * @param start The starting physical address of the node's memory range.
> + * @param end The ending physical address of the node's memory range.

I'd add that end is "exclusive", to make it unambiguous.

> + */
>  void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end)
>  {
>  unsigned long start_pfn = paddr_to_pfn(start);
>  unsigned long end_pfn = paddr_to_pfn(end);
> +struct node_data *numa_node = NODE_DATA(nodeid);
> +paddr_t start_ram, end_ram;
> +unsigned long pages = 0;
> +unsigned int idx = 0;
> +int err;
>  
> -NODE_DATA(nodeid)->node_start_pfn = start_pfn;
> -NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
> +numa_node->node_start_pfn = start_pfn;
> +numa_node->node_spanned_pages = end_pfn - start_pfn;
>  
> +/* Calculate the number of present RAM pages within the node: */
> +while ( (err = arch_get_ram_range(idx++, &start_ram, &end_ram)) != 
> -ENOENT )

nit: This line seems quite overloaded. Might be easier for the eye as a
do-while, with "int err" being defined inside the loop itself.
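
Something along these lines, perhaps (untested sketch, just to illustrate the
shape I mean):

```c
    do {
        int err = arch_get_ram_range(idx++, &start_ram, &end_ram);

        if ( err == -ENOENT )
            break;

        if ( err || start_ram >= end || end_ram <= start )
            continue;  /* Not RAM (err != 0) or range is outside the node */

        pages += PFN_DOWN(min(end_ram, end)) - PFN_UP(max(start_ram, start));
    } while ( true );
```
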

> +{
> +if ( err || start_ram >= end || end_ram <= start )
> +continue;  /* Not RAM (err != 0) or range is outside the node */
> +
> +pages += PFN_DOWN(min(end_ram, end)) - PFN_UP(max(start_ram, start));
> +}
> +numa_node->node_present_pages = pages;
>  node_set_online(nodeid);
>  }
>  
> diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
> index fd1511a6fb..c860f3ad1c 100644
> --- a/xen/include/xen/numa.h
> +++ b/xen/include/xen/numa.h
> @@ -71,6 +71,7 @@ extern nodeid_t *memnodemap;
>  struct node_data {
>  unsigned long node_start_pfn;
>  unsigned long node_spanned_pages;
> +unsigned long node_present_pages;
>  };
>  
>  extern struct node_data node_data[];
> @@ -91,6 +92,7 @@ static inline nodeid_t mfn_to_nid(mfn_t mfn)
>  
>  #define node_start_pfn(nid) (NODE_DATA(nid)->node_start_pfn)
>  #define node_spanned_pages(nid) (NODE_DATA(nid)->node_spanned_pages)
> +#define node_present_pages(nid) (NODE_DATA(nid)->node_present

Re: [PATCH v3 5/5] x86/boot: Clarify comment

2024-10-11 Thread Alejandro Vallejo
On Fri, Oct 11, 2024 at 02:08:37PM +0100, Frediano Ziglio wrote:
> On Fri, Oct 11, 2024 at 1:56 PM Alejandro Vallejo
>  wrote:
> >
> > On Fri, Oct 11, 2024 at 09:52:44AM +0100, Frediano Ziglio wrote:
> > > Signed-off-by: Frediano Ziglio 
> > > ---
> > >  xen/arch/x86/boot/reloc.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/xen/arch/x86/boot/reloc.c b/xen/arch/x86/boot/reloc.c
> > > index e50e161b27..e725cfb6eb 100644
> > > --- a/xen/arch/x86/boot/reloc.c
> > > +++ b/xen/arch/x86/boot/reloc.c
> > > @@ -65,7 +65,7 @@ typedef struct memctx {
> > >  /*
> > >   * Simple bump allocator.
> > >   *
> > > - * It starts from the base of the trampoline and allocates downwards.
> > > + * It starts on top of space reserved for the trampoline and 
> > > allocates downwards.
> >
> > nit: Not sure this is much clearer. The trampoline is not a stack (and even 
> > if
> > it was, I personally find "top" and "bottom" quite ambiguous when it grows
> > backwards), so calling top to its lowest address seems more confusing than 
> > not.
> >
> > If anything, the clarification ought to be about which direction it takes. 
> > Leaving
> > "base" instead of "top" and replacing "downwards" by "backwards" to make it
> > crystal clear that it's a pointer that starts where the trampoline starts, 
> > but
> > moves in the opposite direction.
> >
> 
> Base looks confusing to me, but surely that comment could be confusing.
> For the trampoline 64 KB are reserved. Last 4 KB are used as a normal
> stack (push/pop/call/whatever), first part gets a copy of the
> trampoline code/data (about 6 Kb) the rest (so 64 - 4 - ~6 = ~54 kb)
> is used for the copy of MBI information. That "rest" is what we are
> talking about here.

Last? From what I looked at it seems to be the first 12K.

   #define TRAMPOLINE_STACK_SPACE  PAGE_SIZE
   #define TRAMPOLINE_SPACE(KB(64) - TRAMPOLINE_STACK_SPACE)

To put it another way, with left=lo-addr and right=hi-addr. The code seems to
do this...

 |<------------------64K------------------>|
 |<----12K---->|                           |
 +-------------+---------+-----------------+
 | stack-space |   mbi   |   trampoline    |
 +-------------+---------+-----------------+
    ^               ^
    |               |
    |               +-- copied Multiboot info + modules
    +-- initial memctx.ptr

... with the stack growing backwards to avoid overflowing onto mbi.

Or am I missing something?

Cheers,
Alejandro



Re: [PATCH v3 5/5] x86/boot: Clarify comment

2024-10-11 Thread Alejandro Vallejo
On Fri Oct 11, 2024 at 2:58 PM BST, Frediano Ziglio wrote:
> On Fri, Oct 11, 2024 at 2:38 PM Andrew Cooper  
> wrote:
> >
> > On 11/10/2024 2:28 pm, Alejandro Vallejo wrote:
> > > On Fri, Oct 11, 2024 at 02:08:37PM +0100, Frediano Ziglio wrote:
> > >> On Fri, Oct 11, 2024 at 1:56 PM Alejandro Vallejo
> > >>  wrote:
> > >>> On Fri, Oct 11, 2024 at 09:52:44AM +0100, Frediano Ziglio wrote:
> > >>>> Signed-off-by: Frediano Ziglio 
> > >>>> ---
> > >>>>  xen/arch/x86/boot/reloc.c | 2 +-
> > >>>>  1 file changed, 1 insertion(+), 1 deletion(-)
> > >>>>
> > >>>> diff --git a/xen/arch/x86/boot/reloc.c b/xen/arch/x86/boot/reloc.c
> > >>>> index e50e161b27..e725cfb6eb 100644
> > >>>> --- a/xen/arch/x86/boot/reloc.c
> > >>>> +++ b/xen/arch/x86/boot/reloc.c
> > >>>> @@ -65,7 +65,7 @@ typedef struct memctx {
> > >>>>  /*
> > >>>>   * Simple bump allocator.
> > >>>>   *
> > >>>> - * It starts from the base of the trampoline and allocates 
> > >>>> downwards.
> > >>>> + * It starts on top of space reserved for the trampoline and 
> > >>>> allocates downwards.
> > >>> nit: Not sure this is much clearer. The trampoline is not a stack (and 
> > >>> even if
> > >>> it was, I personally find "top" and "bottom" quite ambiguous when it 
> > >>> grows
> > >>> backwards), so calling top to its lowest address seems more confusing 
> > >>> than not.
> > >>>
> > >>> If anything, the clarification ought to be about which direction it takes. 
> > >>> Leaving
> > >>> "base" instead of "top" and replacing "downwards" by "backwards" to 
> > >>> make it
> > >>> crystal clear that it's a pointer that starts where the trampoline 
> > >>> starts, but
> > >>> moves in the opposite direction.
> > >>>
> > >> Base looks confusing to me, but surely that comment could be confusing.
> > >> For the trampoline 64 KB are reserved. Last 4 KB are used as a normal
> > >> stack (push/pop/call/whatever), first part gets a copy of the
> > >> trampoline code/data (about 6 Kb) the rest (so 64 - 4 - ~6 = ~54 kb)
> > >> is used for the copy of MBI information. That "rest" is what we are
> > >> talking about here.
> > > Last? From what I looked at it seems to be the first 12K.
> > >
> > >#define TRAMPOLINE_STACK_SPACE  PAGE_SIZE
> > >#define TRAMPOLINE_SPACE(KB(64) - TRAMPOLINE_STACK_SPACE)
> > >
> > > To put it another way, with left=lo-addr and right=hi-addr. The code 
> > > seems to
> > > do this...
> > >
> > >  |<------------------64K------------------>|
> > >  |<----12K---->|                           |

s/12K/4K/

My brain merged the 12 bits in the wrong place. Too much bit twiddling.

> > >  +-------------+---------+-----------------+
> > >  | stack-space |   mbi   |   trampoline    |
> > >  +-------------+---------+-----------------+
> > >     ^               ^
> > >     |               |
> > >     |               +-- copied Multiboot info + modules
> > >     +-- initial memctx.ptr
> > >
> > > ... with the stack growing backwards to avoid overflowing onto mbi.
> > >
> > > Or am I missing something?
> >
> > So I was hoping for some kind of diagram like this, to live in
> > arch/x86/include/asm/trampoline.h with the other notes about the trampoline.
> >
> > But, is that diagram accurate?  Looking at
>
>/* Switch to low-memory stack which lives at the end of
> trampoline region. */
>mov sym_esi(trampoline_phys), %edi
>lea TRAMPOLINE_SPACE+TRAMPOLINE_STACK_SPACE(%edi),%esp
>lea trampoline_boot_cpu_entry-trampoline_start(%edi),%eax
>pushl   $BOOT_CS32
>push%eax
>
>/* Copy bootstrap trampoline to low memory, below 1MB. */
>lea sym_esi(trampoline_start), %esi
>mov $((trampoline_end - trampoline_start) / 4),%ecx
>rep movsl
>
> So, from low to high
> - trampoline code/data (%edi at beginning of copy is trampoline_phys,
> %esi is trampoline_start)
> - space (used for MBI copy)
> - stack (%esp is set to trampoline_phys + TRAMPOLINE_SPACE +
> TRAMPOLINE_STACK_SPACE)
>
> Frediano

So it's reversed from what I thought

 |<-------------------64K------------------>|
 |                            |<-----4K---->|
 +--------------+-------------+-------------+
 |  text-(ish)  |     mbi     | stack-space |
 +--------------+-------------+-------------+
                 ^            ^
                 |            |
                 |            +-- initial memctx.ptr
                 +--- copied Multiboot info + modules


Your version of the comment is a definite improvement over the nonsense that
was there before. Sorry for the noise :)

Cheers,
Alejandro



Re: [PATCH 12/14] x86/fpu: Pass explicit xsave areas to fpu_(f)xsave()

2024-10-29 Thread Alejandro Vallejo
On Tue Oct 29, 2024 at 8:37 AM GMT, Jan Beulich wrote:
> On 28.10.2024 16:49, Alejandro Vallejo wrote:
> > --- a/xen/arch/x86/xstate.c
> > +++ b/xen/arch/x86/xstate.c
> > @@ -300,9 +300,8 @@ void compress_xsave_states(struct vcpu *v, const void 
> > *src, unsigned int size)
> >  vcpu_unmap_xsave_area(v, xstate);
> >  }
> >  
> > -void xsave(struct vcpu *v, uint64_t mask)
> > +void xsave(struct vcpu *v, struct xsave_struct *ptr, uint64_t mask)
> >  {
> > -struct xsave_struct *ptr = v->arch.xsave_area;
> >  uint32_t hmask = mask >> 32;
> >  uint32_t lmask = mask;
> >  unsigned int fip_width = v->domain->arch.x87_fip_width;
>
> Imo this change wants to constify v at the same time, to demonstrate that
> nothing is changed through v anymore. The comment may extend to other 
> functions
> as well that are being altered here; I only closely looked at this one.
>
> Jan

I didn't think of that angle... I'll have a look and take it into account for
v2.

Cheers,
Alejandro



Re: [PATCH 01/14] x86/xstate: Update stale assertions in fpu_x{rstor,save}()

2024-10-29 Thread Alejandro Vallejo
On Tue Oct 29, 2024 at 8:13 AM GMT, Jan Beulich wrote:
> On 28.10.2024 18:16, Andrew Cooper wrote:
> > On 28/10/2024 3:49 pm, Alejandro Vallejo wrote:
> >> The asserts' intent was to establish whether the xsave instruction was
> >> usable or not, which at the time was strictly given by the presence of
> >> the xsave area. After edb48e76458b("x86/fpu: Combine fpu_ctxt and
> >> xsave_area in arch_vcpu"), that area is always present a more relevant
> >> assert is that the host supports XSAVE.
> >>
> >> Fixes: edb48e76458b("x86/fpu: Combine fpu_ctxt and xsave_area in 
> >> arch_vcpu")
> >> Signed-off-by: Alejandro Vallejo 
> >> ---
> >> I'd also be ok with removing the assertions altogether. They serve very
> >> little purpose there after the merge of xsave and fpu_ctxt.
> > 
> > I'd be fine with dropping them.
>
> +1
>
> Jan
>
> >  If they're violated, the use of
> > XSAVE/XRSTOR immediately afterwards will be fatal too.
> > 
> > ~Andrew

Ok then, I'll re-send this one as a removal.

Cheers,
Alejandro



Re: [PATCH 05/14] x86/xstate: Map/unmap xsave area in xstate_set_init() and handle_setbv()

2024-10-29 Thread Alejandro Vallejo
On Tue Oct 29, 2024 at 8:26 AM GMT, Jan Beulich wrote:
> On 28.10.2024 16:49, Alejandro Vallejo wrote:
> > --- a/xen/arch/x86/xstate.c
> > +++ b/xen/arch/x86/xstate.c
> > @@ -993,7 +993,12 @@ int handle_xsetbv(u32 index, u64 new_bv)
> >  
> >  clts();
> >  if ( curr->fpu_dirtied )
> > -asm ( "stmxcsr %0" : "=m" 
> > (curr->arch.xsave_area->fpu_sse.mxcsr) );
> > +{
> > +struct xsave_struct *xsave_area = vcpu_map_xsave_area(curr);
> > +
> > +asm ( "stmxcsr %0" : "=m" (xsave_area->fpu_sse.mxcsr) );
> > +vcpu_unmap_xsave_area(curr, xsave_area);
> > +}
>
> Since it's curr that we're dealing with, is this largely a cosmetic change? 
> I.e.
> there's not going to be any actual map/unmap operation in that case? Otherwise
> I'd be inclined to say that an actual map/unmap is pretty high overhead for a
> mere store of a 32-bit value.
>
> Jan

Somewhat.

See the follow-up reply to patch2 with something resembling what I expect the
wrappers to have. In short, yes, I expect "current" to not require
mapping/unmapping; but I still would rather see those sites using the same
wrappers for auditability. After we settle on a particular interface, we can
let the implementation details creep out if that happens to be clearer, but
it's IMO easier to work this way for the time being until those details
crystalise.

Cheers,
Alejandro



Re: [PATCH 02/14] x86/xstate: Create map/unmap primitives for xsave areas

2024-10-29 Thread Alejandro Vallejo
On Tue Oct 29, 2024 at 8:19 AM GMT, Jan Beulich wrote:
> On 28.10.2024 16:49, Alejandro Vallejo wrote:
> > --- a/xen/arch/x86/include/asm/xstate.h
> > +++ b/xen/arch/x86/include/asm/xstate.h
> > @@ -143,4 +143,24 @@ static inline bool xstate_all(const struct vcpu *v)
> > (v->arch.xcr0_accum & XSTATE_LAZY & ~XSTATE_FP_SSE);
> >  }
> >  
> > +/*
> > + * Fetch a pointer to the XSAVE area of a vCPU
> > + *
> > + * If ASI is enabled for the domain, this mapping is pCPU-local.
>
> Taking the unmap commentary into account, I think this needs to expand
> some, to also symmetrically cover what the unmap comment says regarding
> "v is [not] the currently scheduled vCPU".

Yes, that's fair.

> This may then also help
> better see the further outlook, as Andrew was asking for.

Sure, I'll answer his comment in a jiffy with a rough approximation of what I
expect them to contain.

>
> > + * @param v Owner of the XSAVE area
> > + */
> > +#define vcpu_map_xsave_area(v) ((v)->arch.xsave_area)
> > +
> > +/*
> > + * Drops the XSAVE area of a vCPU and nullifies its pointer on exit.
>
> Nit: I expect it drops the mapping, not the area.

Yes, although even the mapping might not be dropped if we can credibly avoid
it. Regardless, yes this needs rewriting.

The particulars are murky and should become easier to see with the pseudo-code
I'm about to answer Andrew with

>
> > + * If ASI is enabled and v is not the currently scheduled vCPU then the
> > + * per-pCPU mapping is removed from the address space.
> > + *
> > + * @param v   vCPU logically owning xsave_area
> > + * @param xsave_area  XSAVE blob of v
> > + */
> > +#define vcpu_unmap_xsave_area(v, x) ({ (x) = NULL; })
> > +
> >  #endif /* __ASM_XSTATE_H */

Cheers,
Alejandro



Re: [PATCH 02/14] x86/xstate: Create map/unmap primitives for xsave areas

2024-10-29 Thread Alejandro Vallejo
Hi,

On Mon Oct 28, 2024 at 5:20 PM GMT, Andrew Cooper wrote:
> On 28/10/2024 3:49 pm, Alejandro Vallejo wrote:
> > diff --git a/xen/arch/x86/include/asm/xstate.h 
> > b/xen/arch/x86/include/asm/xstate.h
> > index 07017cc4edfd..36260459667c 100644
> > --- a/xen/arch/x86/include/asm/xstate.h
> > +++ b/xen/arch/x86/include/asm/xstate.h
> > @@ -143,4 +143,24 @@ static inline bool xstate_all(const struct vcpu *v)
> > (v->arch.xcr0_accum & XSTATE_LAZY & ~XSTATE_FP_SSE);
> >  }
> >  
> > +/*
> > + * Fetch a pointer to the XSAVE area of a vCPU
> > + *
> > + * If ASI is enabled for the domain, this mapping is pCPU-local.
> > + *
> > + * @param v Owner of the XSAVE area
> > + */
> > +#define vcpu_map_xsave_area(v) ((v)->arch.xsave_area)
> > +
> > +/*
> > + * Drops the XSAVE area of a vCPU and nullifies its pointer on exit.
> > + *
> > + * If ASI is enabled and v is not the currently scheduled vCPU then the
> > + * per-pCPU mapping is removed from the address space.
> > + *
> > + * @param v   vCPU logically owning xsave_area
> > + * @param xsave_area  XSAVE blob of v
> > + */
> > +#define vcpu_unmap_xsave_area(v, x) ({ (x) = NULL; })
> > +
>
> Is there a preview of how these will end up looking with the real ASI
> bits in place?

I expect the contents to be something along these lines (in function form for
clarity):

  struct xsave_struct *vcpu_map_xsave_area(struct vcpu *v)
  {
  if ( !v->domain->asi )
  return v->arch.xsave_area;

  if ( likely(v == current) )
  return percpu_fixmap(v, PCPU_FIX_XSAVE_AREA);

  /* Likely some new vmap-like abstraction after AMX */
  return map_domain_page(v->arch.xsave_area_pg);
  }

Where:
  1. v->arch.xsave_area is a pointer to the XSAVE area on non-ASI domains.
  2. v->arch.xsave_area_pg is an mfn (or a pointer to a page_info, converted)
  3. percpu_fixmap(v, PCPU_FIX_XSAVE_AREA) is a slot in a per-vCPU fixmap, that
 changes as we context switch from vCPU to vCPU.

  /*
   * NOTE: Being a function this doesn't nullify the xsave_area pointer, but
   * it would in a macro. It's unimportant for the overall logic though.
   */
  void vcpu_unmap_xsave_area(struct vcpu *v, struct xsave_struct *xsave_area)
  {
  /* Catch mismatched areas when ASI is disabled */
  ASSERT(v->domain->asi || xsave_area == v->arch.xsave_area);

  /* Likely some new vunmap-like abstraction after AMX */
  if ( v->domain->asi && v != current )
  unmap_domain_page(xsave_area);
  }

Of course, many of these details hang in the balance of what happens to the ASI
series from Roger. In any case, the takeaway is that map/unmap must have
fastpaths for "current" that don't involve mapping. The assumption is that
non-current vCPUs are cold paths. In particular, context switches will undergo
some refactoring in order to make save/restore not require additional
map/unmaps besides the page table switch and yet another change to further
align "current" with the currently running page tables. Paths like the
instruction emulator go through these wrappers later on for ease of
auditability, but are early-returns that cause no major overhead.

My expectation is that these macros are general enough to be tweakable in
whatever way is most suitable, thus allowing the refactor of the codebase at
large to make it ASI-friendly before the details of the ASI infra are merged,
or even finalised.

>
> Having a macro-that-reads-like-a-function mutating x by name, rather
> than by pointer, is somewhat rude.  This is why we capitalise
> XFREE()/etc which have a similar pattern; to make it clear it's a macro
> and potentially doing weird things with scopes.
>
> ~Andrew

That magic trick on unmap warrants uppercase, agreed. Initially it was all
function calls and after macrofying them I was too lazy to change their users.

Cheers,
Alejandro



Re: [PATCH 02/14] x86/xstate: Create map/unmap primitives for xsave areas

2024-10-29 Thread Alejandro Vallejo
On Tue Oct 29, 2024 at 1:28 PM GMT, Jan Beulich wrote:
> On 29.10.2024 12:57, Alejandro Vallejo wrote:
> > On Mon Oct 28, 2024 at 5:20 PM GMT, Andrew Cooper wrote:
> >> On 28/10/2024 3:49 pm, Alejandro Vallejo wrote:
> >>> diff --git a/xen/arch/x86/include/asm/xstate.h 
> >>> b/xen/arch/x86/include/asm/xstate.h
> >>> index 07017cc4edfd..36260459667c 100644
> >>> --- a/xen/arch/x86/include/asm/xstate.h
> >>> +++ b/xen/arch/x86/include/asm/xstate.h
> >>> @@ -143,4 +143,24 @@ static inline bool xstate_all(const struct vcpu *v)
> >>> (v->arch.xcr0_accum & XSTATE_LAZY & ~XSTATE_FP_SSE);
> >>>  }
> >>>  
> >>> +/*
> >>> + * Fetch a pointer to the XSAVE area of a vCPU
> >>> + *
> >>> + * If ASI is enabled for the domain, this mapping is pCPU-local.
> >>> + *
> >>> + * @param v Owner of the XSAVE area
> >>> + */
> >>> +#define vcpu_map_xsave_area(v) ((v)->arch.xsave_area)
> >>> +
> >>> +/*
> >>> + * Drops the XSAVE area of a vCPU and nullifies its pointer on exit.
> >>> + *
> >>> + * If ASI is enabled and v is not the currently scheduled vCPU then the
> >>> + * per-pCPU mapping is removed from the address space.
> >>> + *
> >>> + * @param v   vCPU logically owning xsave_area
> >>> + * @param xsave_area  XSAVE blob of v
> >>> + */
> >>> +#define vcpu_unmap_xsave_area(v, x) ({ (x) = NULL; })
> >>> +
> >>
> >> Is there a preview of how these will end up looking with the real ASI
> >> bits in place?
> > 
> > I expect the contents to be something along these lines (in function form 
> > for
> > clarity):
> > 
> >   struct xsave_struct *vcpu_map_xsave_area(struct vcpu *v)
> >   {
> >   if ( !v->domain->asi )
> >   return v->arch.xsave_area;
> > 
> >   if ( likely(v == current) )
> >   return percpu_fixmap(v, PCPU_FIX_XSAVE_AREA);
> > 
> >   /* Likely some new vmap-like abstraction after AMX */
> >   return map_domain_page(v->arch.xsave_area_pg);
> >   }
>
> I'd like to ask that map_domain_page() be avoided here from the beginning, to
> take AMX into account right away. I've been sitting on the AMX series for
> years, and I'd consider it pretty unfair if it was me to take care of such an
> aspect, when instead the series should (imo) long have landed.
>
> Jan

Of course. This is just pseudo-code for explanation purposes, but I didn't want
to introduce imaginary functions. In the final thing we'll want to map an array
of MFNs if the XSAVE area is large enough.

I am already accounting for the XSAVE area to possibly exceed a single page (3
after AMX, I think?). Part of this abstraction stems from that want, in fact,
as otherwise I could simply stash it all away under map_domain_page() and let
that take care of everything. We'll want map_domain_pages_contig() or something
along those lines that takes an array of mfns we've previously stored in
arch_vcpu. But that's a tomorrow problem for when we do have a secret area to
create those mappings on.
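Roughly this shape, just to make it concrete (the name, the signature and the
arch_vcpu fields below are all made up at this point):

  /*
   * Hypothetical: map a set of (possibly discontiguous) MFNs into a
   * virtually contiguous range. Not an existing Xen interface.
   */
  void *map_domain_pages_contig(const mfn_t *mfn, unsigned int nr);

  /* e.g. for an XSAVE area spanning several pages: */
  xsave_area = map_domain_pages_contig(v->arch.xsave_area_mfns,
                                       v->arch.nr_xsave_pages);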

For today, I'd be happy with most code to stop assuming there will be a pointer
in the vcpu.

Cheers,




Re: [PATCH 02/14] x86/xstate: Create map/unmap primitives for xsave areas

2024-10-29 Thread Alejandro Vallejo
On Tue Oct 29, 2024 at 1:24 PM GMT, Frediano Ziglio wrote:
> On Tue, Oct 29, 2024 at 11:58 AM Alejandro Vallejo
>  wrote:
> >
> > Hi,
> >
> > On Mon Oct 28, 2024 at 5:20 PM GMT, Andrew Cooper wrote:
> > > On 28/10/2024 3:49 pm, Alejandro Vallejo wrote:
> > > > diff --git a/xen/arch/x86/include/asm/xstate.h 
> > > > b/xen/arch/x86/include/asm/xstate.h
> > > > index 07017cc4edfd..36260459667c 100644
> > > > --- a/xen/arch/x86/include/asm/xstate.h
> > > > +++ b/xen/arch/x86/include/asm/xstate.h
> > > > @@ -143,4 +143,24 @@ static inline bool xstate_all(const struct vcpu *v)
> > > > (v->arch.xcr0_accum & XSTATE_LAZY & ~XSTATE_FP_SSE);
> > > >  }
> > > >
> > > > +/*
> > > > + * Fetch a pointer to the XSAVE area of a vCPU
> > > > + *
> > > > + * If ASI is enabled for the domain, this mapping is pCPU-local.
> > > > + *
> > > > + * @param v Owner of the XSAVE area
> > > > + */
> > > > +#define vcpu_map_xsave_area(v) ((v)->arch.xsave_area)
> > > > +
> > > > +/*
> > > > + * Drops the XSAVE area of a vCPU and nullifies its pointer on exit.
> > > > + *
> > > > + * If ASI is enabled and v is not the currently scheduled vCPU then the
> > > > + * per-pCPU mapping is removed from the address space.
> > > > + *
> > > > + * @param v   vCPU logically owning xsave_area
> > > > + * @param xsave_area  XSAVE blob of v
> > > > + */
> > > > +#define vcpu_unmap_xsave_area(v, x) ({ (x) = NULL; })
> > > > +
> > >
> > > Is there a preview of how these will end up looking with the real ASI
> > > bits in place?
> >
> > I expect the contents to be something along these lines (in function form 
> > for
> > clarity):
> >
> >   struct xsave_struct *vcpu_map_xsave_area(struct vcpu *v)
> >   {
> >   if ( !v->domain->asi )
> >   return v->arch.xsave_area;
> >
> >   if ( likely(v == current) )
> >   return percpu_fixmap(v, PCPU_FIX_XSAVE_AREA);
> >
> >   /* Likely some new vmap-like abstraction after AMX */
> >   return map_domain_page(v->arch.xsave_area_pg);
> >   }
> >
> > Where:
> >   1. v->arch.xsave_area is a pointer to the XSAVE area on non-ASI domains.
> >   2. v->arch.xsave_area_pg is an mfn (or a pointer to a page_info, converted)
> >   3. percpu_fixmap(v, PCPU_FIX_XSAVE_AREA) is a slot in a per-vCPU fixmap, 
> > that
> >  changes as we context switch from vCPU to vCPU.
> >
> >   /*
> >* NOTE: Being a function this doesn't nullify the xsave_area pointer, but
> >* it would in a macro. It's unimportant for the overall logic though.
> >*/
> >   void vcpu_unmap_xsave_area(struct vcpu *v, struct xsave_struct 
> > *xsave_area)
> >   {
> >   /* Catch mismatched areas when ASI is disabled */
> >   ASSERT(v->domain->asi || xsave_area == v->arch.xsave_area);
> >
> >   /* Likely some new vunmap-like abstraction after AMX */
> >   if ( v->domain->asi && v != current )
> >   unmap_domain_page(xsave_area);
> >   }
> >
> > Of course, many of these details hang in the balance of what happens to the 
> > ASI
> > series from Roger. In any case, the takeaway is that map/unmap must have
> > fastpaths for "current" that don't involve mapping. The assumption is that
> > non-current vCPUs are cold paths. In particular, context switches will 
> > undergo
> > some refactoring in order to make save/restore not require additional
> > map/unmaps besides the page table switch and yet another change to further
> > align "current" with the currently running page tables. Paths like the
> > instruction emulator go through these wrappers later on for ease of
> > auditability, but are early-returns that cause no major overhead.
> >
> > My expectation is that these macros are general enough to be tweakable in
> > whatever way is most suitable, thus allowing the refactor of the codebase at
> > large to make it ASI-friendly before the details of the ASI infra are 
> > merged,
> > or even finalised.
> >
> > >
> > > Having a macro-that-reads-like-a-function mutating x by name, rather
> > > than by pointer, is somewhat rude.  This is why we capitalise
> > > XFREE()/etc which have a similar pattern; to make it clear it's

[RFC PATCH 6/6] xen/common: Rename grant_opts to grant_version

2024-10-29 Thread Alejandro Vallejo
... and remove uses of the macros, which no longer exist.

No functional change.

Signed-off-by: Alejandro Vallejo 
---
 xen/common/domain.c   | 6 +++---
 xen/common/grant_table.c  | 3 +--
 xen/include/xen/grant_table.h | 4 ++--
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 92263a4fbdc5..86f0e99e0d4a 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -579,9 +579,9 @@ static int sanitise_domain_config(struct 
xen_domctl_createdomain *config)
 return -EINVAL;
 }
 
-if ( config->grant_opts & ~XEN_DOMCTL_GRANT_version_mask )
+if ( config->rsvd0[0] | config->rsvd0[1] | config->rsvd0[2] )
 {
-dprintk(XENLOG_INFO, "Unknown grant options %#x\n", 
config->grant_opts);
+dprintk(XENLOG_INFO, "Rubble in rsvd0 padding\n");
 return -EINVAL;
 }
 
@@ -788,7 +788,7 @@ struct domain *domain_create(domid_t domid,
 
 if ( (err = grant_table_init(d, config->max_grant_frames,
  config->max_maptrack_frames,
- config->grant_opts)) != 0 )
+ config->max_grant_version)) != 0 )
 goto fail;
 init_status |= INIT_gnttab;
 
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 6c77867f8cdd..51a3f72a9601 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1963,10 +1963,9 @@ active_alloc_failed:
 }
 
 int grant_table_init(struct domain *d, int max_grant_frames,
- int max_maptrack_frames, unsigned int options)
+ int max_maptrack_frames, uint8_t max_grant_version)
 {
 struct grant_table *gt;
-unsigned int max_grant_version = options & XEN_DOMCTL_GRANT_version_mask;
 int ret = -ENOMEM;
 
 if ( !max_grant_version )
diff --git a/xen/include/xen/grant_table.h b/xen/include/xen/grant_table.h
index 50edfecfb62f..f3edbae3c974 100644
--- a/xen/include/xen/grant_table.h
+++ b/xen/include/xen/grant_table.h
@@ -73,9 +73,9 @@ int gnttab_acquire_resource(
 static inline int grant_table_init(struct domain *d,
int max_grant_frames,
int max_maptrack_frames,
-   unsigned int options)
+   uint8_t max_grant_version)
 {
-if ( options )
+if ( max_grant_version )
 return -EINVAL;
 
 return 0;
-- 
2.47.0




Re: [PATCH 05/14] x86/xstate: Map/unmap xsave area in xstate_set_init() and handle_setbv()

2024-10-29 Thread Alejandro Vallejo
On Tue Oct 29, 2024 at 1:31 PM GMT, Jan Beulich wrote:
> On 29.10.2024 14:00, Alejandro Vallejo wrote:
> > On Tue Oct 29, 2024 at 8:26 AM GMT, Jan Beulich wrote:
> >> On 28.10.2024 16:49, Alejandro Vallejo wrote:
> >>> --- a/xen/arch/x86/xstate.c
> >>> +++ b/xen/arch/x86/xstate.c
> >>> @@ -993,7 +993,12 @@ int handle_xsetbv(u32 index, u64 new_bv)
> >>>  
> >>>  clts();
> >>>  if ( curr->fpu_dirtied )
> >>> -asm ( "stmxcsr %0" : "=m" 
> >>> (curr->arch.xsave_area->fpu_sse.mxcsr) );
> >>> +{
> >>> +struct xsave_struct *xsave_area = vcpu_map_xsave_area(curr);
> >>> +
> >>> +asm ( "stmxcsr %0" : "=m" (xsave_area->fpu_sse.mxcsr) );
> >>> +vcpu_unmap_xsave_area(curr, xsave_area);
> >>> +}
> >>
> >> Since it's curr that we're dealing with, is this largely a cosmetic 
> >> change? I.e.
> >> there's not going to be any actual map/unmap operation in that case? 
> >> Otherwise
> >> I'd be inclined to say that an actual map/unmap is pretty high overhead 
> >> for a
> >> mere store of a 32-bit value.
> > 
> > Somewhat.
> > 
> > See the follow-up reply to patch2 with something resembling what I expect 
> > the
> > wrappers to have. In short, yes, I expect "current" to not require
> > mapping/unmapping; but I still would rather see those sites using the same
> > wrappers for auditability. After we settle on a particular interface, we can
> > let the implementation details creep out if that happens to be clearer, but
> > it's IMO easier to work this way for the time being until those details
> > crystalise.
>
> Sure. As expressed in a later reply on the same topic, what I'm after are 
> brief
> comments indicating that despite the function names involved, no actual 
> mapping
> operations will be carried out in these cases, thus addressing concerns 
> towards
> the overhead involved.
>
> Jan

Right, I can add those to the sites using exclusively "current". That's no
problem.
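E.g. something minimal along these lines (exact wording TBD):

  /*
   * Fast path: curr == current, so even with ASI this resolves to an
   * already-present mapping; no page table mutation happens here.
   */
  struct xsave_struct *xsave_area = vcpu_map_xsave_area(curr);

  asm ( "stmxcsr %0" : "=m" (xsave_area->fpu_sse.mxcsr) );
  vcpu_unmap_xsave_area(curr, xsave_area);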

Cheers,
Alejandro



Re: [PATCH v7 5/5] x86/boot: Clarify comment

2024-10-29 Thread Alejandro Vallejo
On Tue Oct 29, 2024 at 4:40 PM GMT, Frediano Ziglio wrote:
> On Tue, Oct 29, 2024 at 3:07 PM Andrew Cooper  
> wrote:
> >
> > On 29/10/2024 2:53 pm, Roger Pau Monné wrote:
> > > On Tue, Oct 29, 2024 at 10:29:42AM +, Frediano Ziglio wrote:
> > >> Signed-off-by: Frediano Ziglio 
> > >> ---
> > >>  xen/arch/x86/boot/reloc.c | 2 +-
> > >>  1 file changed, 1 insertion(+), 1 deletion(-)
> > >>
> > >> diff --git a/xen/arch/x86/boot/reloc.c b/xen/arch/x86/boot/reloc.c
> > >> index e50e161b27..e725cfb6eb 100644
> > >> --- a/xen/arch/x86/boot/reloc.c
> > >> +++ b/xen/arch/x86/boot/reloc.c
> > >> @@ -65,7 +65,7 @@ typedef struct memctx {
> > >>  /*
> > >>   * Simple bump allocator.
> > >>   *
> > >> - * It starts from the base of the trampoline and allocates 
> > >> downwards.
> > >> + * It starts on top of space reserved for the trampoline and 
> > >> allocates downwards.
> > > I'm afraid this line is over 80 characters long, will need to be
> > > adjusted.  Maybe:
> > >
> > > * Starts at top of the relocated trampoline space and allocates 
> > > downwards.
> >
> > This patch misses 2 of the 3 incorrect statements about how the
> > trampoline works, and Alejandro had some better suggestions in the
> > thread on the matter.
> >
> > ~Andrew
>
> Hi,
>   changed to "Starts at the end of the relocated trampoline space and
> allocates backwards".
>
> See 
> https://gitlab.com/xen-project/people/fziglio/xen/-/commit/21be0b9d2813db9c578e8a6ace76eee2445908f5.
>
> Frediano

with that:

  Reviewed-by: Alejandro Vallejo 

Cheers,
Alejandro



[RFC PATCH 3/6] tools/ocaml: Rename grant_opts to grant_version

2024-10-29 Thread Alejandro Vallejo
... and remove uses of the macros, which no longer exist.

No functional change.

Signed-off-by: Alejandro Vallejo 
---
 tools/ocaml/libs/xc/xenctrl_stubs.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/ocaml/libs/xc/xenctrl_stubs.c 
b/tools/ocaml/libs/xc/xenctrl_stubs.c
index c78191f95abc..c4d34ca48753 100644
--- a/tools/ocaml/libs/xc/xenctrl_stubs.c
+++ b/tools/ocaml/libs/xc/xenctrl_stubs.c
@@ -223,8 +223,7 @@ CAMLprim value stub_xc_domain_create(value xch_val, value 
wanted_domid, value co
.max_evtchn_port = Int_val(VAL_MAX_EVTCHN_PORT),
.max_grant_frames = Int_val(VAL_MAX_GRANT_FRAMES),
.max_maptrack_frames = Int_val(VAL_MAX_MAPTRACK_FRAMES),
-   .grant_opts =
-   XEN_DOMCTL_GRANT_version(Int_val(VAL_MAX_GRANT_VERSION)),
+   .grant_version = Int_val(VAL_MAX_GRANT_VERSION),
.altp2m_opts = Int32_val(VAL_ALTP2M_OPTS),
.vmtrace_size = vmtrace_size,
.cpupool_id = Int32_val(VAL_CPUPOOL_ID),
-- 
2.47.0




[RFC PATCH 5/6] xen/x86: Rename grant_opts to grant_version

2024-10-29 Thread Alejandro Vallejo
... and remove uses of the macros, which no longer exist.

No functional change.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/setup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 177f4024abca..a9130161969b 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -956,7 +956,7 @@ static struct domain *__init create_dom0(const module_t 
*image,
 .max_evtchn_port = -1,
 .max_grant_frames = -1,
 .max_maptrack_frames = -1,
-.grant_opts = XEN_DOMCTL_GRANT_version(opt_gnttab_max_version),
+.max_grant_version = opt_gnttab_max_version,
 .max_vcpus = dom0_max_vcpus(),
 .arch = {
 .misc_flags = opt_dom0_msr_relaxed ? XEN_X86_MSR_RELAXED : 0,
-- 
2.47.0




[RFC PATCH 4/6] xen/arm: Rename grant_opts to grant_version

2024-10-29 Thread Alejandro Vallejo
... and remove uses of the macros, which no longer exist.

No functional change.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/arm/dom0less-build.c | 4 ++--
 xen/arch/arm/domain_build.c   | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/xen/arch/arm/dom0less-build.c b/xen/arch/arm/dom0less-build.c
index f328a044e9d3..1c6219c7cc82 100644
--- a/xen/arch/arm/dom0less-build.c
+++ b/xen/arch/arm/dom0less-build.c
@@ -877,7 +877,7 @@ void __init create_domUs(void)
 .max_evtchn_port = 1023,
 .max_grant_frames = -1,
 .max_maptrack_frames = -1,
-.grant_opts = XEN_DOMCTL_GRANT_version(opt_gnttab_max_version),
+.max_grant_version = opt_gnttab_max_version,
 };
 unsigned int flags = 0U;
 uint32_t val;
@@ -959,7 +959,7 @@ void __init create_domUs(void)
 }
 
 if ( dt_property_read_u32(node, "max_grant_version", &val) )
-d_cfg.grant_opts = XEN_DOMCTL_GRANT_version(val);
+d_cfg.max_grant_version = val;
 
 if ( dt_property_read_u32(node, "max_grant_frames", &val) )
 {
diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
index 2c30792de88b..773412ba2acb 100644
--- a/xen/arch/arm/domain_build.c
+++ b/xen/arch/arm/domain_build.c
@@ -2194,7 +2194,7 @@ void __init create_dom0(void)
 .max_evtchn_port = -1,
 .max_grant_frames = gnttab_dom0_frames(),
 .max_maptrack_frames = -1,
-.grant_opts = XEN_DOMCTL_GRANT_version(opt_gnttab_max_version),
+.max_grant_version = opt_gnttab_max_version,
 };
 int rc;
 
-- 
2.47.0




[RFC PATCH 2/6] tools: Rename grant_opts to grant_version

2024-10-29 Thread Alejandro Vallejo
... and remove uses of the macros, which no longer exist.

No functional change

Signed-off-by: Alejandro Vallejo 
---
 tools/helpers/init-xenstore-domain.c | 2 +-
 tools/libs/light/libxl_create.c  | 2 +-
 tools/python/xen/lowlevel/xc/xc.c| 2 +-
 tools/tests/paging-mempool/test-paging-mempool.c | 2 +-
 tools/tests/resource/test-resource.c | 6 +++---
 tools/tests/tsx/test-tsx.c   | 4 ++--
 6 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/tools/helpers/init-xenstore-domain.c 
b/tools/helpers/init-xenstore-domain.c
index 01ca667d25d1..25e41cf5175f 100644
--- a/tools/helpers/init-xenstore-domain.c
+++ b/tools/helpers/init-xenstore-domain.c
@@ -96,7 +96,7 @@ static int build(xc_interface *xch)
  */
 .max_grant_frames = 4,
 .max_maptrack_frames = 128,
-.grant_opts = XEN_DOMCTL_GRANT_version(1),
+.grant_version = 1,
 };
 
 xs_fd = open("/dev/xen/xenbus_backend", O_RDWR);
diff --git a/tools/libs/light/libxl_create.c b/tools/libs/light/libxl_create.c
index edeadd57ef5a..f952614b1f8d 100644
--- a/tools/libs/light/libxl_create.c
+++ b/tools/libs/light/libxl_create.c
@@ -646,7 +646,7 @@ int libxl__domain_make(libxl__gc *gc, libxl_domain_config 
*d_config,
 .max_evtchn_port = b_info->event_channels,
 .max_grant_frames = b_info->max_grant_frames,
 .max_maptrack_frames = b_info->max_maptrack_frames,
-.grant_opts = XEN_DOMCTL_GRANT_version(b_info->max_grant_version),
+.grant_version = b_info->max_grant_version,
 .vmtrace_size = ROUNDUP(b_info->vmtrace_buf_kb << 10, 
XC_PAGE_SHIFT),
 .cpupool_id = info->poolid,
 };
diff --git a/tools/python/xen/lowlevel/xc/xc.c 
b/tools/python/xen/lowlevel/xc/xc.c
index 9feb12ae2b16..b3bbda6d955d 100644
--- a/tools/python/xen/lowlevel/xc/xc.c
+++ b/tools/python/xen/lowlevel/xc/xc.c
@@ -167,7 +167,7 @@ static PyObject *pyxc_domain_create(XcObject *self,
 #else
 #error Architecture not supported
 #endif
-config.grant_opts = XEN_DOMCTL_GRANT_version(max_grant_version);
+config.grant_version = max_grant_version;
 
 if ( (ret = xc_domain_create(self->xc_handle, &dom, &config)) < 0 )
 return pyxc_error_to_exception(self->xc_handle);
diff --git a/tools/tests/paging-mempool/test-paging-mempool.c 
b/tools/tests/paging-mempool/test-paging-mempool.c
index 1ebc13455ac2..dc90b3b41793 100644
--- a/tools/tests/paging-mempool/test-paging-mempool.c
+++ b/tools/tests/paging-mempool/test-paging-mempool.c
@@ -24,7 +24,7 @@ static struct xen_domctl_createdomain create = {
 .flags = XEN_DOMCTL_CDF_hvm | XEN_DOMCTL_CDF_hap,
 .max_vcpus = 1,
 .max_grant_frames = 1,
-.grant_opts = XEN_DOMCTL_GRANT_version(1),
+.grant_version = 1,
 
 .arch = {
 #if defined(__x86_64__) || defined(__i386__)
diff --git a/tools/tests/resource/test-resource.c 
b/tools/tests/resource/test-resource.c
index 1b10be16a6b4..33bdb3113d85 100644
--- a/tools/tests/resource/test-resource.c
+++ b/tools/tests/resource/test-resource.c
@@ -137,7 +137,7 @@ static void test_domain_configurations(void)
 .create = {
 .max_vcpus = 2,
 .max_grant_frames = 40,
-.grant_opts = XEN_DOMCTL_GRANT_version(1),
+.grant_version = 1,
 },
 },
 {
@@ -146,7 +146,7 @@ static void test_domain_configurations(void)
 .flags = XEN_DOMCTL_CDF_hvm,
 .max_vcpus = 2,
 .max_grant_frames = 40,
-.grant_opts = XEN_DOMCTL_GRANT_version(1),
+.grant_version = 1,
 .arch = {
 .emulation_flags = XEN_X86_EMU_LAPIC,
 },
@@ -159,7 +159,7 @@ static void test_domain_configurations(void)
 .flags = XEN_DOMCTL_CDF_hvm | XEN_DOMCTL_CDF_hap,
 .max_vcpus = 2,
 .max_grant_frames = 40,
-.grant_opts = XEN_DOMCTL_GRANT_version(1),
+.grant_version = 1,
 },
 },
 #endif
diff --git a/tools/tests/tsx/test-tsx.c b/tools/tests/tsx/test-tsx.c
index 5af04953f340..86608c95d627 100644
--- a/tools/tests/tsx/test-tsx.c
+++ b/tools/tests/tsx/test-tsx.c
@@ -457,7 +457,7 @@ static void test_guests(void)
 struct xen_domctl_createdomain c = {
 .max_vcpus = 1,
 .max_grant_frames = 1,
-.grant_opts = XEN_DOMCTL_GRANT_version(1),
+.grant_version = 1,
 };
 
 printf("Testing PV guest\n");
@@ -470,7 +470,7 @@ static void test_guests(void)
 .flags = XEN_DOMCTL_CDF_hvm,
 .max_vcpus = 1,
 .max_grant_frames = 1,
-.grant_opts = XEN_DOMCTL_GRANT_version(1),
+.grant_version = 1,
 .arch = {
 .emulation_flags = XEN_X86_EMU_LAPIC,
 },
-- 
2.47.0




[RFC PATCH 0/6] xen/abi: On wide bitfields inside primitive types

2024-10-29 Thread Alejandro Vallejo
Non-boolean bitfields in the hypercall ABI make it fairly inconvenient to
create bindings for any language because (a) they are always ad-hoc and are
subject to restrictions regular fields are not, (b) they require boilerplate
that regular fields do not, and (c) they might not even be part of the core
language, forcing avoidable external libraries into any sort of generic
library.

This patch (it's a series merely to split roughly by maintainer) is one such
case that I happened to spot while playing around. It's the grant_version
field, buried under an otherwise empty grant_opts.

The invariant I'd like to (slowly) introduce and discuss is that fields may
have bitflags (e.g: a packed array of booleans indexed by some enumerated
type), but not be mixed with wider fields in the same primitive type. This
ensures any field containing an integer of any kind can be referred by pointer
and treated the same way as any other with regards to sizeof() and the like.
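To make that concrete with the case at hand (the before/after shapes from this
very series):

  /* Before: a wide sub-field packed inside a primitive type. */
  #define XEN_DOMCTL_GRANT_version_mask  0xf
  #define XEN_DOMCTL_GRANT_version(v)    ((v) & XEN_DOMCTL_GRANT_version_mask)
  uint32_t grant_opts;

  /* After: a plain field, addressable and sizeof()-able on its own. */
  uint8_t max_grant_version;
  uint8_t rsvd0[3];   /* explicit padding, checked to be zero */

  /* Unaffected: genuine bitflags (packed booleans) can stay packed. */
  uint32_t flags;     /* XEN_DOMCTL_CDF_* */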

I'd like to have a certain consensus about this general point before
establishing this restriction in the IDL system I'm working on.

My preference would be to fold everything into a single patch if we decide to
follow through with this particular case. As I said before, the split is
artificial for review.

Alejandro Vallejo (6):
  xen/domctl: Refine grant_opts into grant_version
  tools: Rename grant_opts to grant_version
  tools/ocaml: Rename grant_opts to grant_version
  xen/arm: Rename grant_opts to grant_version
  xen/x86: Rename grant_opts to grant_version
  xen/common: Rename grant_opts to grant_version

 tools/helpers/init-xenstore-domain.c |  2 +-
 tools/libs/light/libxl_create.c  |  2 +-
 tools/ocaml/libs/xc/xenctrl_stubs.c  |  3 +--
 tools/python/xen/lowlevel/xc/xc.c|  2 +-
 tools/tests/paging-mempool/test-paging-mempool.c |  2 +-
 tools/tests/resource/test-resource.c |  6 +++---
 tools/tests/tsx/test-tsx.c   |  4 ++--
 xen/arch/arm/dom0less-build.c|  4 ++--
 xen/arch/arm/domain_build.c  |  2 +-
 xen/arch/x86/setup.c |  2 +-
 xen/common/domain.c  |  6 +++---
 xen/common/grant_table.c |  3 +--
 xen/include/public/domctl.h  | 15 +++
 xen/include/xen/grant_table.h|  4 ++--
 14 files changed, 31 insertions(+), 26 deletions(-)

-- 
2.47.0




[RFC PATCH 1/6] xen/domctl: Refine grant_opts into grant_version

2024-10-29 Thread Alejandro Vallejo
grant_opts is overoptimizing for space packing in a hypercall that
doesn't warrant the effort. Tweak the ABI without breaking it in order
to remove the bitfield by extending it to 8 bits.

Xen only supports little-endian systems, so the transformation from
uint32_t to uint8_t followed by 3 octets worth of padding is not an ABI
breakage.

No functional change

Signed-off-by: Alejandro Vallejo 
---
 xen/include/public/domctl.h | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 353f831e402e..b3c8271e66ba 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -90,11 +90,18 @@ struct xen_domctl_createdomain {
 int32_t max_grant_frames;
 int32_t max_maptrack_frames;
 
-/* Grant version, use low 4 bits. */
-#define XEN_DOMCTL_GRANT_version_mask0xf
-#define XEN_DOMCTL_GRANT_version(v)  ((v) & XEN_DOMCTL_GRANT_version_mask)
+/*
+ * Maximum grant table version the domain can be configured with.
+ *
+ * Domains always start with v1 (if CONFIG_GRANT_TABLE) and can be bumped
+ * to use up to `max_grant_version` via GNTTABOP_set_version.
+ *
+ * Must be zero iff !CONFIG_GRANT_TABLE.
+ */
+uint8_t max_grant_version;
 
-uint32_t grant_opts;
+/* Unused */
+uint8_t rsvd0[3];
 
 /*
  * Enable altp2m mixed mode.
-- 
2.47.0




Re: [PATCH] x86/cpu-policy: Extend the guest max policy max leaf/subleaves

2024-10-29 Thread Alejandro Vallejo
On Tue Oct 29, 2024 at 5:55 PM GMT, Andrew Cooper wrote:
> We already have one migration case opencoded (feat.max_subleaf).  A more
> recent discovery is that we advertise x2APIC to guests without ensuring that
> we provide max_leaf >= 0xb.
>
> In general, any leaf known to Xen can be safely configured by the toolstack if
> it doesn't violate other constraints.
>
> Therefore, introduce guest_common_{max,default}_leaves() to generalise the
> special case we currently have for feat.max_subleaf, in preparation to be able
> to provide x2APIC topology in leaf 0xb even on older hardware.
>
> Signed-off-by: Andrew Cooper 

Reviewed-by: Alejandro Vallejo 

Cheers,
Alejandro



Re: [RFC PATCH 0/6] xen/abi: On wide bitfields inside primitive types

2024-10-29 Thread Alejandro Vallejo
On Tue Oct 29, 2024 at 6:16 PM GMT, Alejandro Vallejo wrote:
> Non-boolean bitfields in the hypercall ABI make it fairly inconvenient to
> create bindings for any language because (a) they are always ad-hoc and are
> subject to restrictions regular fields are not (b) require boilerplate that
> regular fields do not and (c) might not even be part of the core language,
> forcing avoidable external libraries into any sort of generic library.
>
> This patch (it's a series merely to split roughly by maintainer) is one such
> case that I happened to spot while playing around. It's the grant_version
> field, buried under an otherwise empty grant_opts.
>
> The invariant I'd like to (slowly) introduce and discuss is that fields may
> have bitflags (e.g: a packed array of booleans indexed by some enumerated
> type), but not be mixed with wider fields in the same primitive type. This
> ensures any field containing an integer of any kind can be referred by pointer
> and treated the same way as any other with regards to sizeof() and the like.
>
> I'd like to have a certain consensus about this general point before going
> establishing this restriction in the IDL system I'm working on.
>
> My preference would be to fold everything into a single patch if we decide to
> follow through with this particular case. As I said before, the split is
> artificial for review.
>
> Alejandro Vallejo (6):
>   xen/domctl: Refine grant_opts into grant_version
>   tools: Rename grant_opts to grant_version
>   tools/ocaml: Rename grant_opts to grant_version
>   xen/arm: Rename grant_opts to grant_version
>   xen/x86: Rename grant_opts to grant_version
>   xen/common: Rename grant_opts to grant_version
>
>  tools/helpers/init-xenstore-domain.c |  2 +-
>  tools/libs/light/libxl_create.c  |  2 +-
>  tools/ocaml/libs/xc/xenctrl_stubs.c  |  3 +--
>  tools/python/xen/lowlevel/xc/xc.c|  2 +-
>  tools/tests/paging-mempool/test-paging-mempool.c |  2 +-
>  tools/tests/resource/test-resource.c |  6 +++---
>  tools/tests/tsx/test-tsx.c   |  4 ++--
>  xen/arch/arm/dom0less-build.c|  4 ++--
>  xen/arch/arm/domain_build.c  |  2 +-
>  xen/arch/x86/setup.c |  2 +-
>  xen/common/domain.c  |  6 +++---
>  xen/common/grant_table.c |  3 +--
>  xen/include/public/domctl.h  | 15 +++
>  xen/include/xen/grant_table.h|  4 ++--
>  14 files changed, 31 insertions(+), 26 deletions(-)

Bah. I sent it too early. The new field in patches 2-6 ought to be
max_grant_version. Regardless, the general point still holds, I hope.

Cheers,
Alejandro



[PATCH v2 02/13] x86/xstate: Create map/unmap primitives for xsave areas

2024-11-05 Thread Alejandro Vallejo
Add infrastructure to simplify ASI handling. With ASI in the picture
we'll have several different means of accessing the XSAVE area of a
given vCPU, depending on whether a domain is covered by ASI or not and
whether the vCPU is question is scheduled on the current pCPU or not.

Having these complexities exposed at the call sites becomes unwieldy
very fast. These wrappers are intended to be used in a similar way to
map_domain_page() and unmap_domain_page(); the map operation will
dispatch the appropriate pointer for each case in a future patch, while
unmap will remain a no-op where no unmap is required (e.g: when there's
no ASI) and remove the transient mapping if one was required.

Follow-up patches replace all uses of raw v->arch.xsave_area by this
mechanism, in preparation for the aforementioned dispatch logic to be
added at a later time.

Signed-off-by: Alejandro Vallejo 
---
v2:
  * Comment macros more heavily to show their performance characteristics.
  * Addressed various nits in the macro comments.
  * Macro names to uppercase.
---
 xen/arch/x86/include/asm/xstate.h | 42 +++
 1 file changed, 42 insertions(+)

diff --git a/xen/arch/x86/include/asm/xstate.h 
b/xen/arch/x86/include/asm/xstate.h
index 07017cc4edfd..6b0daff0aeec 100644
--- a/xen/arch/x86/include/asm/xstate.h
+++ b/xen/arch/x86/include/asm/xstate.h
@@ -143,4 +143,46 @@ static inline bool xstate_all(const struct vcpu *v)
(v->arch.xcr0_accum & XSTATE_LAZY & ~XSTATE_FP_SSE);
 }
 
+/*
+ * Fetch a pointer to a vCPU's XSAVE area
+ *
+ * TL;DR: If v == current, the mapping is guaranteed to already exist.
+ *
+ * Despite the name, this macro might not actually map anything. The only case
+ * in which a mutation of page tables is strictly required is when ASI==on &&
+ * v!=current. For everything else the mapping already exists and needs not
+ * be created nor destroyed.
+ *
+ *                  +-----------------+--------------+
+ *                  |   v == current  | v != current |
+ *  +--------------+-----------------+--------------+
+ *  | ASI  enabled | per-vCPU fixmap |  actual map  |
+ *  +--------------+-----------------+--------------+
+ *  | ASI disabled |            directmap           |
+ *  +--------------+--------------------------------+
+ *
+ * There MUST NOT be outstanding maps of XSAVE areas of the non-current vCPU
+ * at the point of context switch. Otherwise, the unmap operation will
+ * misbehave.
+ *
+ * TODO: Expand the macro to the ASI cases after infra to do so is in place.
+ *
+ * @param v Owner of the XSAVE area
+ */
+#define VCPU_MAP_XSAVE_AREA(v) ((v)->arch.xsave_area)
+
+/*
+ * Drops the mapping of a vCPU's XSAVE area and nullifies its pointer on exit
+ *
+ * See VCPU_MAP_XSAVE_AREA() for additional information on the persistence of
+ * these mappings. This macro only tears down the mappings in the ASI=on &&
+ * v!=current case.
+ *
+ * TODO: Expand the macro to the ASI cases after infra to do so is in place.
+ *
+ * @param v Owner of the XSAVE area
+ * @param x XSAVE blob of v
+ */
+#define VCPU_UNMAP_XSAVE_AREA(v, x) ({ (x) = NULL; })
+
 #endif /* __ASM_XSTATE_H */
-- 
2.47.0




[PATCH v2 01/13] x86/xstate: Remove stale assertions in fpu_x{rstor,save}()

2024-11-05 Thread Alejandro Vallejo
After edb48e76458b("x86/fpu: Combine fpu_ctxt and xsave_area in arch_vcpu"),
v->arch.xsave_area is always present and we can just remove these asserts.

Fixes: edb48e76458b("x86/fpu: Combine fpu_ctxt and xsave_area in arch_vcpu")
Signed-off-by: Alejandro Vallejo 
---
v2:
  * Remove asserts rather than refactor them.
  * Trimmed and adjusted commit message
---
 xen/arch/x86/i387.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index 83f9b2502bff..3add0025e495 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -24,7 +24,6 @@ static inline void fpu_xrstor(struct vcpu *v, uint64_t mask)
 {
 bool ok;
 
-ASSERT(v->arch.xsave_area);
 /*
  * XCR0 normally represents what guest OS set. In case of Xen itself,
  * we set the accumulated feature mask before doing save/restore.
@@ -136,7 +135,6 @@ static inline void fpu_xsave(struct vcpu *v)
 uint64_t mask = vcpu_xsave_mask(v);
 
 ASSERT(mask);
-ASSERT(v->arch.xsave_area);
 /*
  * XCR0 normally represents what guest OS set. In case of Xen itself,
  * we set the accumulated feature mask before doing save/restore.
-- 
2.47.0




[PATCH v2 04/13] x86/fpu: Map/umap xsave area in vcpu_{reset,setup}_fpu()

2024-11-05 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
v2:
  * No change
---
 xen/arch/x86/i387.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index 3add0025e495..a6ae323fa95f 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -304,8 +304,10 @@ int vcpu_init_fpu(struct vcpu *v)
 
 void vcpu_reset_fpu(struct vcpu *v)
 {
+struct xsave_struct *xsave_area = VCPU_MAP_XSAVE_AREA(v);
+
 v->fpu_initialised = false;
-*v->arch.xsave_area = (struct xsave_struct) {
+*xsave_area = (struct xsave_struct) {
 .fpu_sse = {
 .mxcsr = MXCSR_DEFAULT,
 .fcw = FCW_RESET,
@@ -313,15 +315,21 @@ void vcpu_reset_fpu(struct vcpu *v)
 },
 .xsave_hdr.xstate_bv = X86_XCR0_X87,
 };
+
+VCPU_UNMAP_XSAVE_AREA(v, xsave_area);
 }
 
 void vcpu_setup_fpu(struct vcpu *v, const void *data)
 {
+struct xsave_struct *xsave_area = VCPU_MAP_XSAVE_AREA(v);
+
 v->fpu_initialised = true;
-*v->arch.xsave_area = (struct xsave_struct) {
+*xsave_area = (struct xsave_struct) {
 .fpu_sse = *(const fpusse_t*)data,
 .xsave_hdr.xstate_bv = XSTATE_FP_SSE,
 };
+
+VCPU_UNMAP_XSAVE_AREA(v, xsave_area);
 }
 
 /* Free FPU's context save area */
-- 
2.47.0




[PATCH v2 07/13] x86/domctl: Map/unmap xsave area in arch_get_info_guest()

2024-11-05 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
v2:
  * No change
---
 xen/arch/x86/domctl.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 5f0619da..3044f706de1c 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1377,16 +1377,17 @@ void arch_get_info_guest(struct vcpu *v, 
vcpu_guest_context_u c)
 unsigned int i;
 const struct domain *d = v->domain;
 bool compat = is_pv_32bit_domain(d);
+const struct xsave_struct *xsave_area;
 #ifdef CONFIG_COMPAT
 #define c(fld) (!compat ? (c.nat->fld) : (c.cmp->fld))
 #else
 #define c(fld) (c.nat->fld)
 #endif
 
-BUILD_BUG_ON(sizeof(c.nat->fpu_ctxt) !=
- sizeof(v->arch.xsave_area->fpu_sse));
-memcpy(&c.nat->fpu_ctxt, &v->arch.xsave_area->fpu_sse,
-   sizeof(c.nat->fpu_ctxt));
+xsave_area = VCPU_MAP_XSAVE_AREA(v);
+BUILD_BUG_ON(sizeof(c.nat->fpu_ctxt) != sizeof(xsave_area->fpu_sse));
+memcpy(&c.nat->fpu_ctxt, &xsave_area->fpu_sse, sizeof(c.nat->fpu_ctxt));
+VCPU_UNMAP_XSAVE_AREA(v, xsave_area);
 
 if ( is_pv_domain(d) )
 c(flags = v->arch.pv.vgc_flags & ~(VGCF_i387_valid|VGCF_in_kernel));
-- 
2.47.0




[PATCH v2 11/13] x86/fpu: Pass explicit xsave areas to fpu_(f)xsave()

2024-11-05 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
v2:
  * const-ified v
---
 xen/arch/x86/i387.c   | 16 ++--
 xen/arch/x86/include/asm/xstate.h |  2 +-
 xen/arch/x86/xstate.c |  3 +--
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index a6ae323fa95f..73c52ce2f577 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -129,7 +129,7 @@ static inline uint64_t vcpu_xsave_mask(const struct vcpu *v)
 }
 
 /* Save x87 extended state */
-static inline void fpu_xsave(struct vcpu *v)
+static inline void fpu_xsave(const struct vcpu *v, struct xsave_struct 
*xsave_area)
 {
 bool ok;
 uint64_t mask = vcpu_xsave_mask(v);
@@ -141,15 +141,14 @@ static inline void fpu_xsave(struct vcpu *v)
  */
 ok = set_xcr0(v->arch.xcr0_accum | XSTATE_FP_SSE);
 ASSERT(ok);
-xsave(v, mask);
+xsave(v, xsave_area, mask);
 ok = set_xcr0(v->arch.xcr0 ?: XSTATE_FP_SSE);
 ASSERT(ok);
 }
 
 /* Save x87 FPU, MMX, SSE and SSE2 state */
-static inline void fpu_fxsave(struct vcpu *v)
+static inline void fpu_fxsave(struct vcpu *v, fpusse_t *fpu_ctxt)
 {
-fpusse_t *fpu_ctxt = &v->arch.xsave_area->fpu_sse;
 unsigned int fip_width = v->domain->arch.x87_fip_width;
 
 if ( fip_width != 4 )
@@ -264,6 +263,8 @@ void vcpu_restore_fpu_lazy(struct vcpu *v)
  */
 static bool _vcpu_save_fpu(struct vcpu *v)
 {
+struct xsave_struct *xsave_area;
+
 if ( !v->fpu_dirtied && !v->arch.nonlazy_xstate_used )
 return false;
 
@@ -272,11 +273,14 @@ static bool _vcpu_save_fpu(struct vcpu *v)
 /* This can happen, if a paravirtualised guest OS has set its CR0.TS. */
 clts();
 
+xsave_area = VCPU_MAP_XSAVE_AREA(v);
+
 if ( cpu_has_xsave )
-fpu_xsave(v);
+fpu_xsave(v, xsave_area);
 else
-fpu_fxsave(v);
+fpu_fxsave(v, &xsave_area->fpu_sse);
 
+VCPU_UNMAP_XSAVE_AREA(v, xsave_area);
 v->fpu_dirtied = 0;
 
 return true;
diff --git a/xen/arch/x86/include/asm/xstate.h 
b/xen/arch/x86/include/asm/xstate.h
index 6b0daff0aeec..bd286123c735 100644
--- a/xen/arch/x86/include/asm/xstate.h
+++ b/xen/arch/x86/include/asm/xstate.h
@@ -97,7 +97,7 @@ uint64_t get_xcr0(void);
 void set_msr_xss(u64 xss);
 uint64_t get_msr_xss(void);
 uint64_t read_bndcfgu(void);
-void xsave(struct vcpu *v, uint64_t mask);
+void xsave(const struct vcpu *v, struct xsave_struct *ptr, uint64_t mask);
 void xrstor(struct vcpu *v, uint64_t mask);
 void xstate_set_init(uint64_t mask);
 bool xsave_enabled(const struct vcpu *v);
diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index 9ecbef760277..f3e41f742c3c 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -300,9 +300,8 @@ void compress_xsave_states(struct vcpu *v, const void *src, 
unsigned int size)
 VCPU_UNMAP_XSAVE_AREA(v, xstate);
 }
 
-void xsave(struct vcpu *v, uint64_t mask)
+void xsave(const struct vcpu *v, struct xsave_struct *ptr, uint64_t mask)
 {
-struct xsave_struct *ptr = v->arch.xsave_area;
 uint32_t hmask = mask >> 32;
 uint32_t lmask = mask;
 unsigned int fip_width = v->domain->arch.x87_fip_width;
-- 
2.47.0




[PATCH v2 03/13] x86/hvm: Map/unmap xsave area in hvm_save_cpu_ctxt()

2024-11-05 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
v2:
  * No change
---
 xen/arch/x86/hvm/hvm.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 018d44a08b6b..c90654697cb1 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -914,11 +914,11 @@ static int cf_check hvm_save_cpu_ctxt(struct vcpu *v, 
hvm_domain_context_t *h)
 
 if ( v->fpu_initialised )
 {
-BUILD_BUG_ON(sizeof(ctxt.fpu_regs) !=
- sizeof(v->arch.xsave_area->fpu_sse));
-memcpy(ctxt.fpu_regs, &v->arch.xsave_area->fpu_sse,
-   sizeof(ctxt.fpu_regs));
+const struct xsave_struct *xsave_area = VCPU_MAP_XSAVE_AREA(v);
 
+BUILD_BUG_ON(sizeof(ctxt.fpu_regs) != sizeof(xsave_area->fpu_sse));
+memcpy(ctxt.fpu_regs, &xsave_area->fpu_sse, sizeof(ctxt.fpu_regs));
+VCPU_UNMAP_XSAVE_AREA(v, xsave_area);
 ctxt.flags = XEN_X86_FPU_INITIALISED;
 }
 
-- 
2.47.0




[PATCH v2 12/13] x86/fpu: Pass explicit xsave areas to fpu_(f)xrstor()

2024-11-05 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
v2:
  * const-ified v in fpu_xrstor()
---
 xen/arch/x86/i387.c   | 26 --
 xen/arch/x86/include/asm/xstate.h |  2 +-
 xen/arch/x86/xstate.c | 10 ++
 3 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index 73c52ce2f577..c794367a3cc7 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -20,7 +20,8 @@
 /* FPU Restore Functions   */
 /***/
 /* Restore x87 extended state */
-static inline void fpu_xrstor(struct vcpu *v, uint64_t mask)
+static inline void fpu_xrstor(struct vcpu *v, struct xsave_struct *xsave_area,
+  uint64_t mask)
 {
 bool ok;
 
@@ -30,16 +31,14 @@ static inline void fpu_xrstor(struct vcpu *v, uint64_t mask)
  */
 ok = set_xcr0(v->arch.xcr0_accum | XSTATE_FP_SSE);
 ASSERT(ok);
-xrstor(v, mask);
+xrstor(v, xsave_area, mask);
 ok = set_xcr0(v->arch.xcr0 ?: XSTATE_FP_SSE);
 ASSERT(ok);
 }
 
 /* Restore x87 FPU, MMX, SSE and SSE2 state */
-static inline void fpu_fxrstor(struct vcpu *v)
+static inline void fpu_fxrstor(struct vcpu *v, const fpusse_t *fpu_ctxt)
 {
-const fpusse_t *fpu_ctxt = &v->arch.xsave_area->fpu_sse;
-
 /*
  * Some CPUs don't save/restore FDP/FIP/FOP unless an exception
  * is pending. Clear the x87 state here by setting it to fixed
@@ -195,6 +194,8 @@ static inline void fpu_fxsave(struct vcpu *v, fpusse_t 
*fpu_ctxt)
 /* Restore FPU state whenever VCPU is schduled in. */
 void vcpu_restore_fpu_nonlazy(struct vcpu *v, bool need_stts)
 {
+struct xsave_struct *xsave_area;
+
 /* Restore nonlazy extended state (i.e. parts not tracked by CR0.TS). */
 if ( !v->arch.fully_eager_fpu && !v->arch.nonlazy_xstate_used )
 goto maybe_stts;
@@ -209,12 +210,13 @@ void vcpu_restore_fpu_nonlazy(struct vcpu *v, bool 
need_stts)
  * above) we also need to restore full state, to prevent subsequently
  * saving state belonging to another vCPU.
  */
+xsave_area = VCPU_MAP_XSAVE_AREA(v);
 if ( v->arch.fully_eager_fpu || xstate_all(v) )
 {
 if ( cpu_has_xsave )
-fpu_xrstor(v, XSTATE_ALL);
+fpu_xrstor(v, xsave_area, XSTATE_ALL);
 else
-fpu_fxrstor(v);
+fpu_fxrstor(v, &xsave_area->fpu_sse);
 
 v->fpu_initialised = 1;
 v->fpu_dirtied = 1;
@@ -224,9 +226,10 @@ void vcpu_restore_fpu_nonlazy(struct vcpu *v, bool 
need_stts)
 }
 else
 {
-fpu_xrstor(v, XSTATE_NONLAZY);
+fpu_xrstor(v, xsave_area, XSTATE_NONLAZY);
 need_stts = true;
 }
+VCPU_UNMAP_XSAVE_AREA(v, xsave_area);
 
  maybe_stts:
 if ( need_stts )
@@ -238,6 +241,7 @@ void vcpu_restore_fpu_nonlazy(struct vcpu *v, bool 
need_stts)
  */
 void vcpu_restore_fpu_lazy(struct vcpu *v)
 {
+struct xsave_struct *xsave_area;
 ASSERT(!is_idle_vcpu(v));
 
 /* Avoid recursion. */
@@ -248,10 +252,12 @@ void vcpu_restore_fpu_lazy(struct vcpu *v)
 
 ASSERT(!v->arch.fully_eager_fpu);
 
+xsave_area = VCPU_MAP_XSAVE_AREA(v);
 if ( cpu_has_xsave )
-fpu_xrstor(v, XSTATE_LAZY);
+fpu_xrstor(v, xsave_area, XSTATE_LAZY);
 else
-fpu_fxrstor(v);
+fpu_fxrstor(v, &xsave_area->fpu_sse);
+VCPU_UNMAP_XSAVE_AREA(v, xsave_area);
 
 v->fpu_initialised = 1;
 v->fpu_dirtied = 1;
diff --git a/xen/arch/x86/include/asm/xstate.h 
b/xen/arch/x86/include/asm/xstate.h
index bd286123c735..d2ef4c0b25f0 100644
--- a/xen/arch/x86/include/asm/xstate.h
+++ b/xen/arch/x86/include/asm/xstate.h
@@ -98,7 +98,7 @@ void set_msr_xss(u64 xss);
 uint64_t get_msr_xss(void);
 uint64_t read_bndcfgu(void);
 void xsave(const struct vcpu *v, struct xsave_struct *ptr, uint64_t mask);
-void xrstor(struct vcpu *v, uint64_t mask);
+void xrstor(const struct vcpu *v, struct xsave_struct *ptr, uint64_t mask);
 void xstate_set_init(uint64_t mask);
 bool xsave_enabled(const struct vcpu *v);
 int __must_check validate_xstate(const struct domain *d,
diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index f3e41f742c3c..b5e8d90ef600 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -374,11 +374,10 @@ void xsave(const struct vcpu *v, struct xsave_struct 
*ptr, uint64_t mask)
 ptr->fpu_sse.x[FPU_WORD_SIZE_OFFSET] = fip_width;
 }
 
-void xrstor(struct vcpu *v, uint64_t mask)
+void xrstor(const struct vcpu *v, struct xsave_struct *ptr, uint64_t mask)
 {
 uint32_t hmask = mask >> 32;
 uint32_t lmask = mask;
-struct xsave_struct *ptr = v->arch.xsave_area;
 unsigned int faults, prev_faults;
 
 /*
@@ -992,6 +991,7 @@ int handle_xsetbv(u32 index, u64 new_bv)
 mask &= curr->fpu_dirtied ? ~XSTATE_FP_SSE : XSTATE_NONLAZY;
 if ( mask )
 {
+struct xsave

[PATCH v2 13/13] x86/xstate: Make xstate_all() and vcpu_xsave_mask() take explicit xstate

2024-11-05 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/i387.c   | 9 +
 xen/arch/x86/include/asm/xstate.h | 5 +++--
 xen/arch/x86/xstate.c | 2 +-
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index c794367a3cc7..36a6c8918162 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -107,7 +107,8 @@ static inline void fpu_fxrstor(struct vcpu *v, const 
fpusse_t *fpu_ctxt)
 /*  FPU Save Functions */
 /***/
 
-static inline uint64_t vcpu_xsave_mask(const struct vcpu *v)
+static inline uint64_t vcpu_xsave_mask(const struct vcpu *v,
+   const struct xsave_struct *xsave_area)
 {
 if ( v->fpu_dirtied )
 return v->arch.nonlazy_xstate_used ? XSTATE_ALL : XSTATE_LAZY;
@@ -124,14 +125,14 @@ static inline uint64_t vcpu_xsave_mask(const struct vcpu 
*v)
  * XSTATE_FP_SSE), vcpu_xsave_mask will return XSTATE_ALL. Otherwise
  * return XSTATE_NONLAZY.
  */
-return xstate_all(v) ? XSTATE_ALL : XSTATE_NONLAZY;
+return xstate_all(v, xsave_area) ? XSTATE_ALL : XSTATE_NONLAZY;
 }
 
 /* Save x87 extended state */
 static inline void fpu_xsave(const struct vcpu *v, struct xsave_struct 
*xsave_area)
 {
 bool ok;
-uint64_t mask = vcpu_xsave_mask(v);
+uint64_t mask = vcpu_xsave_mask(v, xsave_area);
 
 ASSERT(mask);
 /*
@@ -211,7 +212,7 @@ void vcpu_restore_fpu_nonlazy(struct vcpu *v, bool 
need_stts)
  * saving state belonging to another vCPU.
  */
 xsave_area = VCPU_MAP_XSAVE_AREA(v);
-if ( v->arch.fully_eager_fpu || xstate_all(v) )
+if ( v->arch.fully_eager_fpu || xstate_all(v, xsave_area) )
 {
 if ( cpu_has_xsave )
 fpu_xrstor(v, xsave_area, XSTATE_ALL);
diff --git a/xen/arch/x86/include/asm/xstate.h 
b/xen/arch/x86/include/asm/xstate.h
index d2ef4c0b25f0..e3e9c18239ed 100644
--- a/xen/arch/x86/include/asm/xstate.h
+++ b/xen/arch/x86/include/asm/xstate.h
@@ -132,14 +132,15 @@ xsave_area_compressed(const struct xsave_struct 
*xsave_area)
 return xsave_area->xsave_hdr.xcomp_bv & XSTATE_COMPACTION_ENABLED;
 }
 
-static inline bool xstate_all(const struct vcpu *v)
+static inline bool xstate_all(const struct vcpu *v,
+  const struct xsave_struct *xsave_area)
 {
 /*
  * XSTATE_FP_SSE may be excluded, because the offsets of XSTATE_FP_SSE
  * (in the legacy region of xsave area) are fixed, so saving
  * XSTATE_FP_SSE will not cause overwriting problem with XSAVES/XSAVEC.
  */
-return xsave_area_compressed(v->arch.xsave_area) &&
+return xsave_area_compressed(xsave_area) &&
(v->arch.xcr0_accum & XSTATE_LAZY & ~XSTATE_FP_SSE);
 }
 
diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index b5e8d90ef600..26e460adfd79 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -1003,7 +1003,7 @@ int handle_xsetbv(u32 index, u64 new_bv)
 asm ( "stmxcsr %0" : "=m" (xsave_area->fpu_sse.mxcsr) );
 VCPU_UNMAP_XSAVE_AREA(curr, xsave_area);
 }
-else if ( xstate_all(curr) )
+else if ( xstate_all(curr, xsave_area) )
 {
 /* See the comment in i387.c:vcpu_restore_fpu_eager(). */
 mask |= XSTATE_LAZY;
-- 
2.47.0




[PATCH v2 06/13] x86/hvm: Map/unmap xsave area in hvmemul_{get,put}_fpu()

2024-11-05 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
v2:
  * Added comments highlighting fastpath for current
---
 xen/arch/x86/hvm/emulate.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index f2bc6967dfcb..04a3df420a59 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -2371,7 +2371,9 @@ static int cf_check hvmemul_get_fpu(
 alternative_vcall(hvm_funcs.fpu_dirty_intercept);
 else if ( type == X86EMUL_FPU_fpu )
 {
-const fpusse_t *fpu_ctxt = &curr->arch.xsave_area->fpu_sse;
+/* has a fastpath for `current`, so there's no actual map */
+const struct xsave_struct *xsave_area = VCPU_MAP_XSAVE_AREA(curr);
+const fpusse_t *fpu_ctxt = &xsave_area->fpu_sse;
 
 /*
  * Latch current register state so that we can back out changes
@@ -2397,6 +2399,8 @@ static int cf_check hvmemul_get_fpu(
 else
 ASSERT(fcw == fpu_ctxt->fcw);
 }
+
+VCPU_UNMAP_XSAVE_AREA(curr, xsave_area);
 }
 
 return X86EMUL_OKAY;
@@ -2411,7 +2415,9 @@ static void cf_check hvmemul_put_fpu(
 
 if ( aux )
 {
-fpusse_t *fpu_ctxt = &curr->arch.xsave_area->fpu_sse;
+/* has a fastpath for `current`, so there's no actual map */
+struct xsave_struct *xsave_area = VCPU_MAP_XSAVE_AREA(curr);
+fpusse_t *fpu_ctxt = &xsave_area->fpu_sse;
 bool dval = aux->dval;
 int mode = hvm_guest_x86_mode(curr);
 
@@ -2465,6 +2471,8 @@ static void cf_check hvmemul_put_fpu(
 
 fpu_ctxt->fop = aux->op;
 
+VCPU_UNMAP_XSAVE_AREA(curr, xsave_area);
+
 /* Re-use backout code below. */
 backout = X86EMUL_FPU_fpu;
 }
-- 
2.47.0




[PATCH v2 08/13] x86/xstate: Map/unmap xsave area in {compress,expand}_xsave_states()

2024-11-05 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
v2:
  * No change
---
 xen/arch/x86/xstate.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index 401bdad2eb0d..6db7ec2ea6a9 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -177,7 +177,7 @@ static void setup_xstate_comp(uint16_t *comp_offsets,
  */
 void expand_xsave_states(const struct vcpu *v, void *dest, unsigned int size)
 {
-const struct xsave_struct *xstate = v->arch.xsave_area;
+const struct xsave_struct *xstate = VCPU_MAP_XSAVE_AREA(v);
 const void *src;
 uint16_t comp_offsets[sizeof(xfeature_mask)*8];
 u64 xstate_bv = xstate->xsave_hdr.xstate_bv;
@@ -228,6 +228,8 @@ void expand_xsave_states(const struct vcpu *v, void *dest, 
unsigned int size)
 
 valid &= ~feature;
 }
+
+VCPU_UNMAP_XSAVE_AREA(v, xstate);
 }
 
 /*
@@ -242,7 +244,7 @@ void expand_xsave_states(const struct vcpu *v, void *dest, 
unsigned int size)
  */
 void compress_xsave_states(struct vcpu *v, const void *src, unsigned int size)
 {
-struct xsave_struct *xstate = v->arch.xsave_area;
+struct xsave_struct *xstate = VCPU_MAP_XSAVE_AREA(v);
 void *dest;
 uint16_t comp_offsets[sizeof(xfeature_mask)*8];
 u64 xstate_bv, valid;
@@ -294,6 +296,8 @@ void compress_xsave_states(struct vcpu *v, const void *src, 
unsigned int size)
 
 valid &= ~feature;
 }
+
+VCPU_UNMAP_XSAVE_AREA(v, xstate);
 }
 
 void xsave(struct vcpu *v, uint64_t mask)
-- 
2.47.0




[PATCH v2 00/13] x86: Address Space Isolation FPU preparations

2024-11-05 Thread Alejandro Vallejo
See original cover letter in v1

v1: 
https://lore.kernel.org/xen-devel/20241028154932.6797-1-alejandro.vall...@cloud.com/
v1->v2:
  * Turned v1/patch1 into an assert removal
  * Dropped v1/patch11: "x86/mpx: Adjust read_bndcfgu() to clean after itself"
  * Other minor changes out of feedback. Explained in each patch.

Alejandro Vallejo (13):
  x86/xstate: Remove stale assertions in fpu_x{rstor,save}()
  x86/xstate: Create map/unmap primitives for xsave areas
  x86/hvm: Map/unmap xsave area in hvm_save_cpu_ctxt()
  x86/fpu: Map/unmap xsave area in vcpu_{reset,setup}_fpu()
  x86/xstate: Map/unmap xsave area in xstate_set_init() and
handle_setbv()
  x86/hvm: Map/unmap xsave area in hvmemul_{get,put}_fpu()
  x86/domctl: Map/unmap xsave area in arch_get_info_guest()
  x86/xstate: Map/unmap xsave area in {compress,expand}_xsave_states()
  x86/emulator: Refactor FXSAVE_AREA to use wrappers
  x86/mpx: Map/unmap xsave area in read_bndcfgu()
  x86/fpu: Pass explicit xsave areas to fpu_(f)xsave()
  x86/fpu: Pass explicit xsave areas to fpu_(f)xrstor()
  x86/xstate: Make xstate_all() and vcpu_xsave_mask() take explicit
xstate

 xen/arch/x86/domctl.c |  9 +++--
 xen/arch/x86/hvm/emulate.c| 12 +-
 xen/arch/x86/hvm/hvm.c|  8 ++--
 xen/arch/x86/i387.c   | 65 +++
 xen/arch/x86/include/asm/xstate.h | 51 ++--
 xen/arch/x86/x86_emulate/blk.c| 11 +-
 xen/arch/x86/xstate.c | 47 +++---
 7 files changed, 150 insertions(+), 53 deletions(-)

-- 
2.47.0




[PATCH v2 09/13] x86/emulator: Refactor FXSAVE_AREA to use wrappers

2024-11-05 Thread Alejandro Vallejo
Adds an UNMAP primitive to make use of vcpu_unmap_xsave_area() when
linked into Xen. The unmap is a no-op during tests.

Signed-off-by: Alejandro Vallejo 
---
v2:
  * Added comments highlighting fastpath on `current`
---
 xen/arch/x86/x86_emulate/blk.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/x86_emulate/blk.c b/xen/arch/x86/x86_emulate/blk.c
index 08a05f8453f7..76fd497ed8a3 100644
--- a/xen/arch/x86/x86_emulate/blk.c
+++ b/xen/arch/x86/x86_emulate/blk.c
@@ -11,9 +11,12 @@
 !defined(X86EMUL_NO_SIMD)
 # ifdef __XEN__
 #  include 
-#  define FXSAVE_AREA ((void *)¤t->arch.xsave_area->fpu_sse)
+/* has a fastpath for `current`, so there's no actual map */
+#  define FXSAVE_AREA ((void *)VCPU_MAP_XSAVE_AREA(current))
+#  define UNMAP_FXSAVE_AREA(x) VCPU_UNMAP_XSAVE_AREA(current, x)
 # else
 #  define FXSAVE_AREA get_fpu_save_area()
+#  define UNMAP_FXSAVE_AREA(x) ((void)x)
 # endif
 #endif
 
@@ -292,6 +295,9 @@ int x86_emul_blk(
 }
 else
 asm volatile ( "fxrstor %0" :: "m" (*fxsr) );
+
+UNMAP_FXSAVE_AREA(fxsr);
+
 break;
 }
 
@@ -320,6 +326,9 @@ int x86_emul_blk(
 
 if ( fxsr != ptr ) /* i.e. s->op_bytes < sizeof(*fxsr) */
 memcpy(ptr, fxsr, s->op_bytes);
+
+UNMAP_FXSAVE_AREA(fxsr);
+
 break;
 }
 
-- 
2.47.0




[PATCH v2 10/13] x86/mpx: Map/unmap xsave area in read_bndcfgu()

2024-11-05 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
v2:
  * s/ret/bndcfgu
---
 xen/arch/x86/xstate.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index 6db7ec2ea6a9..9ecbef760277 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -1022,9 +1022,10 @@ int handle_xsetbv(u32 index, u64 new_bv)
 
 uint64_t read_bndcfgu(void)
 {
+uint64_t bndcfgu = 0;
 unsigned long cr0 = read_cr0();
-struct xsave_struct *xstate
-= idle_vcpu[smp_processor_id()]->arch.xsave_area;
+struct vcpu *v = idle_vcpu[smp_processor_id()];
+struct xsave_struct *xstate = VCPU_MAP_XSAVE_AREA(v);
 const struct xstate_bndcsr *bndcsr;
 
 ASSERT(cpu_has_mpx);
@@ -1050,7 +1051,12 @@ uint64_t read_bndcfgu(void)
 if ( cr0 & X86_CR0_TS )
 write_cr0(cr0);
 
-return xstate->xsave_hdr.xstate_bv & X86_XCR0_BNDCSR ? bndcsr->bndcfgu : 0;
+if ( xstate->xsave_hdr.xstate_bv & X86_XCR0_BNDCSR )
+bndcfgu = bndcsr->bndcfgu;
+
+VCPU_UNMAP_XSAVE_AREA(v, xstate);
+
+return bndcfgu;
 }
 
 void xstate_set_init(uint64_t mask)
-- 
2.47.0




[PATCH v2 05/13] x86/xstate: Map/unmap xsave area in xstate_set_init() and handle_setbv()

2024-11-05 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
v2:
  * Added comment highlighting fastpath for current
---
 xen/arch/x86/xstate.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index af9e345a7ace..401bdad2eb0d 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -993,7 +993,13 @@ int handle_xsetbv(u32 index, u64 new_bv)
 
 clts();
 if ( curr->fpu_dirtied )
-asm ( "stmxcsr %0" : "=m" (curr->arch.xsave_area->fpu_sse.mxcsr) );
+{
+/* has a fastpath for `current`, so there's no actual map */
+struct xsave_struct *xsave_area = VCPU_MAP_XSAVE_AREA(curr);
+
+asm ( "stmxcsr %0" : "=m" (xsave_area->fpu_sse.mxcsr) );
+VCPU_UNMAP_XSAVE_AREA(curr, xsave_area);
+}
 else if ( xstate_all(curr) )
 {
 /* See the comment in i387.c:vcpu_restore_fpu_eager(). */
@@ -1048,7 +1054,7 @@ void xstate_set_init(uint64_t mask)
 unsigned long cr0 = read_cr0();
 unsigned long xcr0 = this_cpu(xcr0);
 struct vcpu *v = idle_vcpu[smp_processor_id()];
-struct xsave_struct *xstate = v->arch.xsave_area;
+struct xsave_struct *xstate;
 
 if ( ~xfeature_mask & mask )
 {
@@ -1061,8 +1067,10 @@ void xstate_set_init(uint64_t mask)
 
 clts();
 
+xstate = VCPU_MAP_XSAVE_AREA(v);
 memset(&xstate->xsave_hdr, 0, sizeof(xstate->xsave_hdr));
 xrstor(v, mask);
+VCPU_UNMAP_XSAVE_AREA(v, xstate);
 
 if ( cr0 & X86_CR0_TS )
 write_cr0(cr0);
-- 
2.47.0




Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links

2024-11-04 Thread Alejandro Vallejo
On Sat Nov 2, 2024 at 3:18 PM GMT, Daniel P. Smith wrote:
> On 11/1/24 16:16, Stewart Hildebrand wrote:
> > +Daniel (XSM mention)
> > 
> > On 10/28/24 13:02, Jan Beulich wrote:
> >> On 18.10.2024 22:39, Stewart Hildebrand wrote:
> >>> Add links between a VF's struct pci_dev and its associated PF struct
> >>> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
> >>> dropping and re-acquiring the pcidevs_lock().
> >>>
> >>> During PF removal, unlink VF from PF and mark the VF broken. As before,
> >>> VFs may exist without a corresponding PF, although now only with
> >>> pdev->broken = true.
> >>>
> >>> The hardware domain is expected to remove the associated VFs before
> >>> removing the PF. Print a warning in case a PF is removed with associated
> >>> VFs still present.
> >>>
> >>> Signed-off-by: Stewart Hildebrand 
> >>> ---
> >>> Candidate for backport to 4.19 (the next patch depends on this one)
> >>>
> >>> v5->v6:
> >>> * move printk() before ASSERT_UNREACHABLE()
> >>> * warn about PF removal with VFs still present
> >>
> >> Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
> >> just after an adjustment to the commit message. I'm instead actively
> >> concerned of the resulting behavior. Question is whether we can reasonably
> >> do something about that.
> >>
> >> Jan
> > 
> > Right. My suggestion then is to go back to roughly how it was done in
> > v4 [0]:
> > 
> > * Remove the VFs right away during PF removal, so that we don't end up
> > with stale VFs. Regarding XSM, assume that a domain with permission to
> > remove the PF is also allowed to remove the VFs. We should probably also
> > return an error from pci_remove_device in the case of removing the PF
> > with VFs still present (and still perform the removals despite returning
> > an error). Subsequent attempts by a domain to remove the VFs would
> > return an error (as they have already been removed), but that's expected
> > since we've taken a stance that PF-then-VF removal order is invalid
> > anyway.
>
> I am not confident this is a safe assumption. It will likely be safe for 
> probably 99% of the implementations. Apologies for not following 
> closely, and correct me if I am wrong here, but from a resource 
> perspective each VF can appear to the system as its own unique BDF and 
> so I am fairly certain it would be possible to uniquely label each VF. 
> For instance in the SVP architecture, the VF may be labeled to restrict 
> control to a hardware domain within a Guest Virtual Platform while the 
> PF may be restricted to the Supervisor Virtual Platform. In this 
> scenario, the Guest would be torn down before the Supervisor so the VF 
> should get released before the PF. But it's all theoretical, so I have 
> no real implementation to point at that this could be checked/confirmed.
>
> I am only raising this for awareness and not as an objection. If people 
> want to punt that theoretical use case down the road until someone 
> actually attempts it, I would not be opposed.

Wouldn't it stand to reason then to act conditionally on the authority of the
caller?

i.e: If the caller has the (XSM-checked) authority to remove _BOTH_ PF and
VFs, remove all. If it doesn't have authority to remove the VFs then early exit
with an error, leaving the PF behind as well.

That would do the clean thing in the common case and be consistent with the
security policy even with a conflicting policy. The semantics are somewhat more
complex, but trying to remove a PF before removing the VFs is silly and the
only sensible thing (imo) is to help out during cleanup _or_ be strict about
checking.
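
Roughly the shape I have in mind, as a non-authoritative sketch; the
xsm_may_remove(), for_each_vf() and remove_one() names below are hypothetical
stand-ins for whatever hook/iterator/removal helpers we'd actually use:

  /* Sketch only: check authority over the whole PF+VF group before acting. */
  static int remove_pf_with_vfs(struct pci_dev *pf)
  {
      struct pci_dev *vf;
      int rc;

      /* First pass: bail out early unless every VF may be removed too. */
      for_each_vf ( pf, vf )
          if ( (rc = xsm_may_remove(vf)) )
              return rc;                 /* PF and VFs all left in place */

      /* Second pass: the caller has full authority, so clean up everything. */
      for_each_vf ( pf, vf )
          remove_one(vf);

      return remove_one(pf);
  }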

>
> v/r,
> dps

Cheers,
Alejandro



Re: [RFC PATCH 0/6] xen/abi: On wide bitfields inside primitive types

2024-10-30 Thread Alejandro Vallejo


In the course of preparing this answer I noticed that altp2m_opts suffers
from the exact same annoyance, with the exact same fix; I spotted it while
rebasing my Rust branch.

On Wed Oct 30, 2024 at 9:14 AM GMT, Jan Beulich wrote:
> On 29.10.2024 19:16, Alejandro Vallejo wrote:
> > Non-boolean bitfields in the hypercall ABI make it fairly inconvenient to
> > create bindings for any language because (a) they are always ad-hoc and are
> > subject to restrictions regular fields are not (b) require boilerplate that
> > regular fields do not and (c) might not even be part of the core language,
> > forcing avoidable external libraries into any sort of generic library.
> > 
> > This patch (it's a series merely to split roughly by maintainer) is one such
> > case that I happened to spot while playing around. It's the grant_version
> > field, buried under an otherwise empty grant_opts.
> > 
> > The invariant I'd like to (slowly) introduce and discuss is that fields may
> > have bitflags (e.g: a packed array of booleans indexed by some enumerated
> > type), but not be mixed with wider fields in the same primitive type. This
> > ensures any field containing an integer of any kind can be referred by 
> > pointer
> > and treated the same way as any other with regards to sizeof() and the like.
>
> While I don't strictly mind, I'm also not really seeing why taking addresses
> or applying sizeof() would be commonly necessary. Can you perhaps provide a
> concrete example of where the present way of dealing with grant max version
> is getting in the way? After all your use of the term "bitfield" doesn't
> really mean C's understanding of it, so especially (c) above escapes me to a
> fair degree.

Wall of text ahead, but I'll try to stay on point. The rationale should become
a lot clearer after I send an RFC series with initial code to autogenerate some
hypercall payloads from markup. The biggest question is: Can I create a
definition language such that (a) it precisely represents the Xen ABI and (b)
is fully type-safe under modern strongly-typed languages?

I already have a backbone I can define the ABI in, so my options when I hit
some impedance mismatch are:

  1. Change the ABI so it matches better my means of defining it.
  2. Change the means to define so it captures the existing ABI better.

Most of the work I've done has moved in the (2) direction so far, but I found a
number of pain points when mapping the existing ABI to Rust that, while not
impossible to work around, are quite annoying for no clear benefit. If
possible, I'd like to simplify the cognitive load involved in defining, using
and updating hypercalls rather than bending over backwards to support a
construct that provides no real benefit. IOW: If I can define an ABI that is
_simpler_, it follows that it's also easier to not make mistakes and it's
easier to generate code for it.

The use of packed fields is one such case. Even in C, we create extra macros
for creating a field, modifying it, fetching it, etc. Patches 2-6 are strict
code removals. And even in the most extreme cases the space savings are largely
irrelevant because the hypercall has a fixed size. We do want to pack _flags_
as otherwise the payload size would explode pretty quickly on hypercalls with
tons of boolean options, but I'm not aware of that being problematic for wider
subfields (like the grant max version).

Now, being more concrete...

##
# IDL is simpler if the size is a property of the type
##

Consider the definition of the (new) max_grant_version type under the IDL I'm
working on (it's TOML, but I don't particularly care about which markup we end
up using).

  [[enums]]
  name = "xen_domaincreate_max_grant_version"
  description = "Content of the `max_grant_version` field of the domain 
creation hypercall."
  typ = { tag = "u8" }

  [[enums.variants]]
  name = "off"
  description = "Must be used with gnttab support compiled out"
  value = 0

  [[enums.variants]]
  name = "v1"
  description = "Allow the domain to use up to gnttab_v1"
  value = 1

  [[enums.variants]]
  name = "v2"
  description = "Allow the domain to use up to gnttab_v2"
  value = 2

Note that I can define a type being enumerated, can choose its specific
variants and its width is a property of the type itself. With bitfields you're
always in a weird position of the width not being part of the type that goes
into it.

Should I need it as a field somewhere, then...

  [[structs.fields]]
  name = "max_grant_version"
  description = "Maximum grant table version the doma

Re: [RFC PATCH 1/6] xen/domctl: Refine grant_opts into grant_version

2024-10-30 Thread Alejandro Vallejo
Hi,

On Wed Oct 30, 2024 at 9:08 AM GMT, Jan Beulich wrote:
> On 29.10.2024 19:16, Alejandro Vallejo wrote:
> > grant_opts is overoptimizing for space packing in a hypercall that
> > doesn't warrant the effort. Tweak the ABI without breaking it in order
> > to remove the bitfield by extending it to 8 bits.
> > 
> > Xen only supports little-endian systems, so the transformation from
> > uint32_t to uint8_t followed by 3 octets worth of padding is not an ABI
> > breakage.
> > 
> > No functional change
> > 
> > Signed-off-by: Alejandro Vallejo 
> > ---
> >  xen/include/public/domctl.h | 15 +++
> >  1 file changed, 11 insertions(+), 4 deletions(-)
>
> This isn't a complete patch, is it? I expect it'll break the build without
> users of the field also adjusted.

Indeed. The non-RFC version would have everything folded in one. I just wanted
to avoid Cc-ing everyone in MAINTAINERS for the same single RFC patch. It's
split by (rough) maintained area.

>
> > --- a/xen/include/public/domctl.h
> > +++ b/xen/include/public/domctl.h
> > @@ -90,11 +90,18 @@ struct xen_domctl_createdomain {
> >  int32_t max_grant_frames;
> >  int32_t max_maptrack_frames;
> >  
> > -/* Grant version, use low 4 bits. */
> > -#define XEN_DOMCTL_GRANT_version_mask0xf
> > -#define XEN_DOMCTL_GRANT_version(v)  ((v) & 
> > XEN_DOMCTL_GRANT_version_mask)
> > +/*
> > + * Maximum grant table version the domain can be configured with.
> > + *
> > + * Domains always start with v1 (if CONFIG_GRANT_TABLE) and can be 
> > bumped
> > + * to use up to `max_grant_version` via GNTTABOP_set_version.
> > + *
> > + * Must be zero iff !CONFIG_GRANT_TABLE.
> > + */
> > +uint8_t max_grant_version;
> >  
> > -uint32_t grant_opts;
> > +/* Unused */
> > +uint8_t rsvd0[3];
> >  
> >  /*
> >   * Enable altp2m mixed mode.
>
> Just to mention it: I think while binary compatible, this is still on the edge
> of needing an interface version bump. We may get away without as users of the
> removed identifiers will still notice by way of observing build failures.
>
> Jan

If users are forced to rebuild either way, might as well prevent existing
binaries from breaking. There ought to be a strict distinction between ABI and
API compatibility because, while they typically move in lockstep, they don't
always (and this is one such an example).

Regardless, this is a discussion for the final patch if we get there.

Cheers,
Alejandro



Re: [RFC PATCH 0/6] xen/abi: On wide bitfields inside primitive types

2024-10-30 Thread Alejandro Vallejo
On Wed Oct 30, 2024 at 8:45 AM GMT, Christian Lindig wrote:
>
>
> > On 29 Oct 2024, at 18:16, Alejandro Vallejo  
> > wrote:
> > 
> > 
> > The invariant I'd like to (slowly) introduce and discuss is that fields may
> > have bitflags (e.g: a packed array of booleans indexed by some enumerated
> > type), but not be mixed with wider fields in the same primitive type. This
> > ensures any field containing an integer of any kind can be referred by 
> > pointer
> > and treated the same way as any other with regards to sizeof() and the like.
>
> Acked-by: Christian Lindig 

Thanks.

>
>
> Fine with me but the OCaml part is not very exposed to this.

Yeah, OCaml is pretty far from interacting with these details at all.
>
> — C

Cheers,
Alejandro



Re: [PATCH] x86/cpu-policy: Extend the guest max policy max leaf/subleaves

2024-10-30 Thread Alejandro Vallejo
On Wed Oct 30, 2024 at 3:13 PM GMT, Roger Pau Monné wrote:
> On Wed, Oct 30, 2024 at 02:45:19PM +, Andrew Cooper wrote:
> > On 30/10/2024 11:03 am, Roger Pau Monné wrote:
> > > On Wed, Oct 30, 2024 at 10:39:12AM +, Andrew Cooper wrote:
> > >> On 30/10/2024 8:59 am, Roger Pau Monné wrote:
> > >>> On Tue, Oct 29, 2024 at 05:55:05PM +, Andrew Cooper wrote:
> >  diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
> >  index b6d9fad56773..78bc9872b09a 100644
> >  --- a/xen/arch/x86/cpu-policy.c
> >  +++ b/xen/arch/x86/cpu-policy.c
> >  @@ -391,6 +391,27 @@ static void __init calculate_host_policy(void)
> >   p->platform_info.cpuid_faulting = cpu_has_cpuid_faulting;
> >   }
> >   
> >  +/*
> >  + * Guest max policies can have any max leaf/subleaf within bounds.
> >  + *
> >  + * - Some incoming VMs have a larger-than-necessary feat max_subleaf.
> >  + * - Some VMs we'd like to synthesise leaves not present on the host.
> >  + */
> >  +static void __init guest_common_max_leaves(struct cpu_policy *p)
> >  +{
> >  +p->basic.max_leaf   = ARRAY_SIZE(p->basic.raw) - 1;
> >  +p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
> >  +p->extd.max_leaf    = 0x80000000U + ARRAY_SIZE(p->extd.raw) - 1;
> >  +}
> >  +
> >  +/* Guest default policies inherit the host max leaf/subleaf settings. 
> >  */
> >  +static void __init guest_common_default_leaves(struct cpu_policy *p)
> >  +{
> >  +p->basic.max_leaf   = host_cpu_policy.basic.max_leaf;
> >  +p->feat.max_subleaf = host_cpu_policy.feat.max_subleaf;
> >  +p->extd.max_leaf= host_cpu_policy.extd.max_leaf;
> >  +}
> > >>> I think this what I'm going to ask is future work.  After the
> > >>> modifications done to the host policy by max functions
> > >>> (calculate_{hvm,pv}_max_policy()) won't the max {sub,}leaf adjustments
> > >>> better be done taking into account the contents of the policy, rather
> > >>> than capping to the host values?
> > >>>
> > >>> (note this comment is strictly for guest_common_default_leaves(), the
> > >>> max version is fine using ARRAY_SIZE).
> > >> I'm afraid I don't follow.
> > >>
> > >> calculate_{pv,hvm}_max_policy() don't modify the host policy.
> > > Hm, I don't think I've expressed myself clearly, sorry.  Let me try
> > > again.
> > >
> > > calculate_{hvm,pv}_max_policy() extends the host policy by possibly
> > > setting new features, and such extended policy is then used as the
> > > base for the PV/HVM default policies.
> > >
> > > Won't the resulting policy in calculate_{hvm,pv}_def_policy() risks
> > > having bits set past the max {sub,}leaf in the host policy, as it's
> > > based in {hvm,pv}_def_cpu_policy that might have such bits set?
> > 
> > Oh, right.
> > 
> > This patch doesn't change anything WRT that.
>
> Indeed, didn't intend my comment to block it, just that I think at
> some point the logic in guest_common_default_leaves() will need to be
> expanded.
>
> > But I think you're right that we do risk getting into that case (in
> > principle at least) because of how guest_common_*_feature_adjustment() work.
> > 
> > Furthermore, the bug will typically get hidden because we serialise
> > based on the max_leaf/subleaf, and will discard feature words outside of
> > the max_leaf/subleaf bounds.
>
> Yes, once we serialize it for toolstack consumption the leafs will be
> implicitly zeroed.
>
> > I suppose we probably want a variation of x86_cpu_featureset_to_policy()
> > which extends the max_leaf/subleaf based on non-zero values in leaves. 
> > (This already feels like it's going to be an ugly algorithm.)
>
> Hm, I was thinking that we would need to adjust
> guest_common_default_leaves() to properly shrink the max {sub,}leaf
> fields from the max policies.

That would be tricky in case we end up with subleaves that are strictly
populated at runtime. Xen would have no way of knowing whether those are meant
to be implemented or not. It seems safer to raise the max if we find non-zero
leaves higher than the current max.

The algorithm is probably quite simple for static data, as it's merely
traversing the raw arrays and keeping track of the last non-zero leaf.
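
Something along these lines, assuming the struct cpuid_leaf layout from
xen/lib/x86 (the helper itself is made up for illustration):

  /* Sketch only: highest index whose leaf has any non-zero register. */
  static unsigned int highest_set_leaf(const struct cpuid_leaf *leaves,
                                       unsigned int nr)
  {
      unsigned int max = 0;

      for ( unsigned int i = 0; i < nr; i++ )
          if ( leaves[i].a | leaves[i].b | leaves[i].c | leaves[i].d )
              max = i;

      return max;
  }

with e.g. p->basic.max_leaf derived from highest_set_leaf(p->basic.raw,
ARRAY_SIZE(p->basic.raw)), and the same for the feat subleaves and the extd
leaves (adding the 0x80000000 offset back for the latter).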

>
> Thanks, Roger.

Cheers,
Alejandro



Re: [PATCH v7 02/10] xen/x86: Add initial x2APIC ID to the per-vLAPIC save area

2024-10-30 Thread Alejandro Vallejo
Hi,

On Wed Oct 30, 2024 at 6:37 AM GMT, Jan Beulich wrote:
> On 29.10.2024 21:30, Andrew Cooper wrote:
> > On 21/10/2024 4:45 pm, Alejandro Vallejo wrote:
> >> @@ -310,19 +309,16 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
> >>  break;
> >>  
> >>  case 0xb:
> >> -/*
> >> - * In principle, this leaf is Intel-only.  In practice, it is 
> >> tightly
> >> - * coupled with x2apic, and we offer an x2apic-capable APIC 
> >> emulation
> >> - * to guests on AMD hardware as well.
> >> - *
> >> - * TODO: Rework topology logic.
> >> - */
> >>  if ( p->basic.x2apic )
> >>  {
> >>  *(uint8_t *)&res->c = subleaf;
> >>  
> >> -/* Fix the x2APIC identifier. */
> >> -res->d = v->vcpu_id * 2;
> >> +/*
> >> + * Fix the x2APIC identifier. The PV side is nonsensical, but
> >> + * we've always shown it like this so it's kept for compat.
> >> + */
> > 
> > In hindsight I should changed "Fix the x2APIC identifier." when I
> > reworked this logic, but oh well - better late than never.
> > 
> > /* The x2APIC_ID is per-vCPU, and fixed irrespective of the requested
> > subleaf. */
>
> Can we perhaps avoid "fix" in this comment? "Adjusted", "overwritten", or
> some such ought to do, without carrying a hint towards some bug somewhere.

I understood "fix" there as "pin" rather than "unbreak". Regardless I can also
rewrite it as "The x2APIC ID is per-vCPU and shown on all subleafs"

>
> >> --- a/xen/include/public/arch-x86/hvm/save.h
> >> +++ b/xen/include/public/arch-x86/hvm/save.h
> >> @@ -394,6 +394,8 @@ struct hvm_hw_lapic {
> >>  uint32_t disabled; /* VLAPIC_xx_DISABLED */
> >>  uint32_t timer_divisor;
> >>  uint64_t tdt_msr;
> >> +uint32_t x2apic_id;
> >> +uint32_t rsvd_zero;
> > 
> > ... we do normally spell it _rsvd; to make it extra extra clear that
> > people shouldn't be doing anything with it.
>
> Alternatively, to carry the "zero" in the name, how about _mbz?
>
> Jan

I'd prefer that to _rsvd, if anything to make it patently clear that leaving
rubble is not ok.

Cheers,
Alejandro



Re: [PATCH v7 02/10] xen/x86: Add initial x2APIC ID to the per-vLAPIC save area

2024-10-30 Thread Alejandro Vallejo
I'm fine with all suggestions, with one exception that needs a bit more
explanation...

On Tue Oct 29, 2024 at 8:30 PM GMT, Andrew Cooper wrote:
> On 21/10/2024 4:45 pm, Alejandro Vallejo wrote:
> > This allows the initial x2APIC ID to be sent on the migration stream.
> > This allows further changes to topology and APIC ID assignment without
> > breaking existing hosts. Given the vlapic data is zero-extended on
> > restore, fix up migrations from hosts without the field by setting it to
> > the old convention if zero.
> >
> > The hardcoded mapping x2apic_id=2*vcpu_id is kept for the time being,
> > but it's meant to be overriden by toolstack on a later patch with
> > appropriate values.
> >
> > Signed-off-by: Alejandro Vallejo 
>
> I'm going to request some changes, but I think they're only comment
> changes. [edit, no sadly, one non-comment change.]
>
> It's unfortunate that Xen uses an instance of hvm_hw_lapic for it's
> internal state, but one swamp at a time.
>
>
> In the subject, there's no such thing as the "initial" x2APIC ID. 
> There's just "the x2APIC ID" and it's not mutable state as far as the
> guest is concerned  (This is different to the xAPIC id, where there is
> an architectural concept of the initial xAPIC ID, from the days when
> OSes were permitted to edit it).  Also, it's x86/hvm, seeing as this is
> an HVM specific change you're making.
>
> Next, while it's true that this allows the value to move in the
> migration stream, the more important point is that this allows the
> toolstack to configure the x2APIC ID for each vCPU.
>
> So, for the commit message, I recommend:
>
> ---%<---
> Today, Xen hard-codes x2APIC_ID = vcpu_id * 2, but this is unwise and
> interferes with providing accurate topology information to the guest.
>
> Introduce a new x2apic_id field into hvm_hw_lapic.  This is immutable
> state from the guest's point of view, but it allows the toolstack to
> configure the value, and for the value to move on migrate.
>
> For backwards compatibility, we treat incoming zeroes as if they were
> the old hardcoded scheme.
> ---%<---
>
> > diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
> > index 2a777436ee27..e2489ff8e346 100644
> > --- a/xen/arch/x86/cpuid.c
> > +++ b/xen/arch/x86/cpuid.c
> > @@ -138,10 +138,9 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
> >  const struct cpu_user_regs *regs;
> >  
> >  case 0x1:
> > -/* TODO: Rework topology logic. */
> >  res->b &= 0x00ffu;
> >  if ( is_hvm_domain(d) )
> > -res->b |= (v->vcpu_id * 2) << 24;
> > +res->b |= vlapic_x2apic_id(vcpu_vlapic(v)) << 24;
>
> There wants to be some kind of note here, especially as you're feeding
> vlapic_x2apic_id() into a field called xAPIC ID.  Perhaps
>
> /* Large systems do wrap around 255 in the xAPIC_ID field. */
>
> ?
>
>
> >  
> >  /* TODO: Rework vPMU control in terms of toolstack choices. */
> >  if ( vpmu_available(v) &&
> > @@ -310,19 +309,16 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
> >  break;
> >  
> >  case 0xb:
> > -/*
> > - * In principle, this leaf is Intel-only.  In practice, it is 
> > tightly
> > - * coupled with x2apic, and we offer an x2apic-capable APIC 
> > emulation
> > - * to guests on AMD hardware as well.
> > - *
> > - * TODO: Rework topology logic.
> > - */
> >  if ( p->basic.x2apic )
> >  {
> >  *(uint8_t *)&res->c = subleaf;
> >  
> > -/* Fix the x2APIC identifier. */
> > -res->d = v->vcpu_id * 2;
> > +/*
> > + * Fix the x2APIC identifier. The PV side is nonsensical, but
> > + * we've always shown it like this so it's kept for compat.
> > + */
>
> In hindsight I should changed "Fix the x2APIC identifier." when I
> reworked this logic, but oh well - better late than never.
>
> /* The x2APIC_ID is per-vCPU, and fixed irrespective of the requested
> subleaf. */
>
> I'd also put a little more context in the PV side:
>
> /* Xen 4.18 and earlier leaked x2APIC into PV guests.  The value shown
> is nonsensical but kept as-was for compatibility. */
>
> > diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
> > index 3363926b487b..33b463925f4e

Re: [PATCH 1/6] xen: add a domain unique id to each domain

2024-11-01 Thread Alejandro Vallejo
On Fri Nov 1, 2024 at 7:06 AM GMT, Jürgen Groß wrote:
> On 31.10.24 12:58, Alejandro Vallejo wrote:
> > On Wed Oct 23, 2024 at 3:27 PM BST, Juergen Gross wrote:
> >> On 23.10.24 16:08, Alejandro Vallejo wrote:
> >>> On Wed Oct 23, 2024 at 2:10 PM BST, Juergen Gross wrote:
> >>>> Xenstore is referencing domains by their domid, but reuse of a domid
> >>>> can lead to the situation that Xenstore can't tell whether a domain
> >>>> with that domid has been deleted and created again without Xenstore
> >>>> noticing the domain is a new one now.
> >>>>
> >>>> Add a global domain creation unique id which is updated when creating
> >>>> a new domain, and store that value in struct domain of the new domain.
> >>>> The global unique id is initialized with the system time and updates
> >>>> are done via the xorshift algorithm which is used for pseudo random
> >>>> number generation, too (see https://en.wikipedia.org/wiki/Xorshift).
> >>>>
> >>>> Signed-off-by: Juergen Gross 
> >>>> Reviewed-by: Jan Beulich 
> >>>> ---
> >>>> V1:
> >>>> - make unique_id local to function (Jan Beulich)
> >>>> - add lock (Julien Grall)
> >>>> - add comment (Julien Grall)
> >>>> ---
> >>>>xen/common/domain.c | 20 
> >>>>xen/include/xen/sched.h |  3 +++
> >>>>2 files changed, 23 insertions(+)
> >>>>
> >>>> diff --git a/xen/common/domain.c b/xen/common/domain.c
> >>>> index 92263a4fbd..3948640fb0 100644
> >>>> --- a/xen/common/domain.c
> >>>> +++ b/xen/common/domain.c
> >>>> @@ -562,6 +562,25 @@ static void _domain_destroy(struct domain *d)
> >>>>free_domain_struct(d);
> >>>>}
> >>>>
> >>>> +static uint64_t get_unique_id(void)
> >>>> +{
> >>>> +static uint64_t unique_id;
> >>>> +static DEFINE_SPINLOCK(lock);
> >>>> +uint64_t x = unique_id ? : NOW();
> >>>> +
> >>>> +spin_lock(&lock);
> >>>> +
> >>>> +/* Pseudo-randomize id in order to avoid consumers relying on 
> >>>> sequence. */
> >>>> +x ^= x << 13;
> >>>> +x ^= x >> 7;
> >>>> +x ^= x << 17;
> >>>> +unique_id = x;
> > 
> > How "unique" are they? With those shifts it's far less obvious to know how 
> > many
> > times we can call get_unique_id() and get an ID that hasn't been seen since
> > reset. With sequential numbers it's pretty obvious that it'd be a
> > non-overflowable monotonic counter. Here's it's far less clear, particularly
> > when it's randomly seeded.
>
> If you'd look into the Wikipedia article mentioned in the commit message
> you'd know that the period is 2^64 - 1.
>

Bah. I did, but skimmed too fast looking for keywords. Thanks for bearing with
me :). Ok, with that I'm perfectly happy.

  Reviewed-by: Alejandro Vallejo 

> > I don't quite see why sequential IDs are problematic. What is this
> > (pseudo)randomization specifically trying to prevent? If it's just breaking 
> > the
> > assumption that numbers go in strict sequence you could just flip the high 
> > and
> > low nibbles (or any other deterministic swapping of counter nibbles)
>
> That was a request from the RFC series of this patch.
>
> > Plus, with the counter going in sequence we could get rid of the lock 
> > because
> > an atomic fetch_add() would do.
>
> Its not as if this would be a hot path. So the lock is no real issue IMO.
>
> > 
> >>>> +
> >>>> +spin_unlock(&lock);
> >>>> +
> >>>> +return x;
> >>>> +}
> >>>> +
> >>>>static int sanitise_domain_config(struct xen_domctl_createdomain 
> >>>> *config)
> >>>>{
> >>>>bool hvm = config->flags & XEN_DOMCTL_CDF_hvm;
> >>>> @@ -654,6 +673,7 @@ struct domain *domain_create(domid_t domid,
> >>>>
> >>>>/* Sort out our idea of is_system_domain(). */
> >>>>d->domain_id = domid;
> >>>> +d->unique_id = get_unique_id();
> >>>>
> >>>>/* Holdi

Re: [PATCH] x86/ucode: Explain what microcode_set_module() does

2024-10-23 Thread Alejandro Vallejo
On Wed Oct 23, 2024 at 1:28 PM BST, Andrew Cooper wrote:
> Signed-off-by: Andrew Cooper 

  Reviewed-by: Alejandro Vallejo 

With a single nit that I don't care much about, but...

> ---
> CC: Jan Beulich 
> CC: Roger Pau Monné 
>
> I found this hiding in other microcode changes, and decided it was high time
> it got included.
> ---
>  xen/arch/x86/cpu/microcode/core.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/xen/arch/x86/cpu/microcode/core.c 
> b/xen/arch/x86/cpu/microcode/core.c
> index 8564e4d2c94c..dc2c064cf176 100644
> --- a/xen/arch/x86/cpu/microcode/core.c
> +++ b/xen/arch/x86/cpu/microcode/core.c
> @@ -108,6 +108,10 @@ static bool ucode_in_nmi = true;
>  /* Protected by microcode_mutex */
>  static const struct microcode_patch *microcode_cache;
>  
> +/*
> + * Used by the EFI path only, when xen.cfg identifies an explicit microcode
> + * file.  Overrides ucode=|scan on the regular command line.
> + */

... this would be better placed at the interface in microcode.h, imo.

>  void __init microcode_set_module(unsigned int idx)
>  {
>  ucode_mod_idx = idx;
>
> base-commit: be84e7fe58b51f6b6dd907a038f0ef998a1e281e
> prerequisite-patch-id: ef20898eb25a7ca1ea2d7b1d676f00b91b46d5f6
> prerequisite-patch-id: e0d0c0acbe4864a00451187ef7232dcaf10b2477
> prerequisite-patch-id: f6010b4a6e0b43ac837aea470b3b5e5f390ee3b2

Cheers,
Alejandro



Re: [PATCH 1/6] xen: add a domain unique id to each domain

2024-10-23 Thread Alejandro Vallejo
On Wed Oct 23, 2024 at 2:10 PM BST, Juergen Gross wrote:
> Xenstore is referencing domains by their domid, but reuse of a domid
> can lead to the situation that Xenstore can't tell whether a domain
> with that domid has been deleted and created again without Xenstore
> noticing the domain is a new one now.
>
> Add a global domain creation unique id which is updated when creating
> a new domain, and store that value in struct domain of the new domain.
> The global unique id is initialized with the system time and updates
> are done via the xorshift algorithm which is used for pseudo random
> number generation, too (see https://en.wikipedia.org/wiki/Xorshift).
>
> Signed-off-by: Juergen Gross 
> Reviewed-by: Jan Beulich 
> ---
> V1:
> - make unique_id local to function (Jan Beulich)
> - add lock (Julien Grall)
> - add comment (Julien Grall)
> ---
>  xen/common/domain.c | 20 
>  xen/include/xen/sched.h |  3 +++
>  2 files changed, 23 insertions(+)
>
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 92263a4fbd..3948640fb0 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -562,6 +562,25 @@ static void _domain_destroy(struct domain *d)
>  free_domain_struct(d);
>  }
>  
> +static uint64_t get_unique_id(void)
> +{
> +static uint64_t unique_id;
> +static DEFINE_SPINLOCK(lock);
> +uint64_t x = unique_id ? : NOW();
> +
> +spin_lock(&lock);
> +
> +/* Pseudo-randomize id in order to avoid consumers relying on sequence. 
> */
> +x ^= x << 13;
> +x ^= x >> 7;
> +x ^= x << 17;
> +unique_id = x;
> +
> +spin_unlock(&lock);
> +
> +return x;
> +}
> +
>  static int sanitise_domain_config(struct xen_domctl_createdomain *config)
>  {
>  bool hvm = config->flags & XEN_DOMCTL_CDF_hvm;
> @@ -654,6 +673,7 @@ struct domain *domain_create(domid_t domid,
>  
>  /* Sort out our idea of is_system_domain(). */
>  d->domain_id = domid;
> +d->unique_id = get_unique_id();
>  
>  /* Holding CDF_* internal flags. */
>  d->cdf = flags;
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index 90666576c2..1dd8a425f9 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -370,6 +370,9 @@ struct domain
>  domid_t  domain_id;
>  
>  unsigned int max_vcpus;
> +
> +uint64_t unique_id;   /* Unique domain identifier */
> +

Why not xen_domain_handle_t handle, defined later on? That's meant to be a
UUID, so this feels like a duplicate field.

>  struct vcpu**vcpu;
>  
>  shared_info_t   *shared_info; /* shared data area */

Cheers,
Alejandro



[PATCH 10/14] x86/mpx: Map/unmap xsave area in read_bndcfgu()

2024-10-28 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/xstate.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index 4019ca4aae83..2a54da2823cf 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -1021,9 +1021,10 @@ int handle_xsetbv(u32 index, u64 new_bv)
 
 uint64_t read_bndcfgu(void)
 {
+uint64_t ret = 0;
 unsigned long cr0 = read_cr0();
-struct xsave_struct *xstate
-= idle_vcpu[smp_processor_id()]->arch.xsave_area;
+struct vcpu *v = idle_vcpu[smp_processor_id()];
+struct xsave_struct *xstate = vcpu_map_xsave_area(v);
 const struct xstate_bndcsr *bndcsr;
 
 ASSERT(cpu_has_mpx);
@@ -1049,7 +1050,12 @@ uint64_t read_bndcfgu(void)
 if ( cr0 & X86_CR0_TS )
 write_cr0(cr0);
 
-return xstate->xsave_hdr.xstate_bv & X86_XCR0_BNDCSR ? bndcsr->bndcfgu : 0;
+if ( xstate->xsave_hdr.xstate_bv & X86_XCR0_BNDCSR )
+ret = bndcsr->bndcfgu;
+
+vcpu_unmap_xsave_area(v, xstate);
+
+return ret;
 }
 
 void xstate_set_init(uint64_t mask)
-- 
2.47.0




[PATCH 01/14] x86/xstate: Update stale assertions in fpu_x{rstor,save}()

2024-10-28 Thread Alejandro Vallejo
The asserts' intent was to establish whether the xsave instruction was
usable or not, which at the time was strictly given by the presence of
the xsave area. After edb48e76458b ("x86/fpu: Combine fpu_ctxt and
xsave_area in arch_vcpu"), that area is always present, so a more relevant
assert is that the host supports XSAVE.

Fixes: edb48e76458b ("x86/fpu: Combine fpu_ctxt and xsave_area in arch_vcpu")
Signed-off-by: Alejandro Vallejo 
---
I'd also be ok with removing the assertions altogether. They serve very
little purpose there after the merge of xsave and fpu_ctxt.
---
 xen/arch/x86/i387.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index 83f9b2502bff..375a8274f632 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -24,7 +24,7 @@ static inline void fpu_xrstor(struct vcpu *v, uint64_t mask)
 {
 bool ok;
 
-ASSERT(v->arch.xsave_area);
+ASSERT(cpu_has_xsave);
 /*
  * XCR0 normally represents what guest OS set. In case of Xen itself,
  * we set the accumulated feature mask before doing save/restore.
@@ -136,7 +136,7 @@ static inline void fpu_xsave(struct vcpu *v)
 uint64_t mask = vcpu_xsave_mask(v);
 
 ASSERT(mask);
-ASSERT(v->arch.xsave_area);
+ASSERT(cpu_has_xsave);
 /*
  * XCR0 normally represents what guest OS set. In case of Xen itself,
  * we set the accumulated feature mask before doing save/restore.
-- 
2.47.0




[PATCH 08/14] x86/xstate: Map/unmap xsave area in {compress,expand}_xsave_states()

2024-10-28 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/xstate.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index 60e752a245ca..4019ca4aae83 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -177,7 +177,7 @@ static void setup_xstate_comp(uint16_t *comp_offsets,
  */
 void expand_xsave_states(const struct vcpu *v, void *dest, unsigned int size)
 {
-const struct xsave_struct *xstate = v->arch.xsave_area;
+const struct xsave_struct *xstate = vcpu_map_xsave_area(v);
 const void *src;
 uint16_t comp_offsets[sizeof(xfeature_mask)*8];
 u64 xstate_bv = xstate->xsave_hdr.xstate_bv;
@@ -228,6 +228,8 @@ void expand_xsave_states(const struct vcpu *v, void *dest, 
unsigned int size)
 
 valid &= ~feature;
 }
+
+vcpu_unmap_xsave_area(v, xstate);
 }
 
 /*
@@ -242,7 +244,7 @@ void expand_xsave_states(const struct vcpu *v, void *dest, 
unsigned int size)
  */
 void compress_xsave_states(struct vcpu *v, const void *src, unsigned int size)
 {
-struct xsave_struct *xstate = v->arch.xsave_area;
+struct xsave_struct *xstate = vcpu_map_xsave_area(v);
 void *dest;
 uint16_t comp_offsets[sizeof(xfeature_mask)*8];
 u64 xstate_bv, valid;
@@ -294,6 +296,8 @@ void compress_xsave_states(struct vcpu *v, const void *src, 
unsigned int size)
 
 valid &= ~feature;
 }
+
+vcpu_unmap_xsave_area(v, xstate);
 }
 
 void xsave(struct vcpu *v, uint64_t mask)
-- 
2.47.0




[PATCH 05/14] x86/xstate: Map/unmap xsave area in xstate_set_init() and handle_setbv()

2024-10-28 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/xstate.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index af9e345a7ace..60e752a245ca 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -993,7 +993,12 @@ int handle_xsetbv(u32 index, u64 new_bv)
 
 clts();
 if ( curr->fpu_dirtied )
-asm ( "stmxcsr %0" : "=m" (curr->arch.xsave_area->fpu_sse.mxcsr) );
+{
+struct xsave_struct *xsave_area = vcpu_map_xsave_area(curr);
+
+asm ( "stmxcsr %0" : "=m" (xsave_area->fpu_sse.mxcsr) );
+vcpu_unmap_xsave_area(curr, xsave_area);
+}
 else if ( xstate_all(curr) )
 {
 /* See the comment in i387.c:vcpu_restore_fpu_eager(). */
@@ -1048,7 +1053,7 @@ void xstate_set_init(uint64_t mask)
 unsigned long cr0 = read_cr0();
 unsigned long xcr0 = this_cpu(xcr0);
 struct vcpu *v = idle_vcpu[smp_processor_id()];
-struct xsave_struct *xstate = v->arch.xsave_area;
+struct xsave_struct *xstate;
 
 if ( ~xfeature_mask & mask )
 {
@@ -1061,8 +1066,10 @@ void xstate_set_init(uint64_t mask)
 
 clts();
 
+xstate = vcpu_map_xsave_area(v);
 memset(&xstate->xsave_hdr, 0, sizeof(xstate->xsave_hdr));
 xrstor(v, mask);
+vcpu_unmap_xsave_area(v, xstate);
 
 if ( cr0 & X86_CR0_TS )
 write_cr0(cr0);
-- 
2.47.0




[PATCH 06/14] x86/hvm: Map/unmap xsave area in hvmemul_{get,put}_fpu()

2024-10-28 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/hvm/emulate.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index f2bc6967dfcb..a6ddc9928f16 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -2371,7 +2371,8 @@ static int cf_check hvmemul_get_fpu(
 alternative_vcall(hvm_funcs.fpu_dirty_intercept);
 else if ( type == X86EMUL_FPU_fpu )
 {
-const fpusse_t *fpu_ctxt = &curr->arch.xsave_area->fpu_sse;
+const struct xsave_struct *xsave_area = vcpu_map_xsave_area(curr);
+const fpusse_t *fpu_ctxt = &xsave_area->fpu_sse;
 
 /*
  * Latch current register state so that we can back out changes
@@ -2397,6 +2398,8 @@ static int cf_check hvmemul_get_fpu(
 else
 ASSERT(fcw == fpu_ctxt->fcw);
 }
+
+vcpu_unmap_xsave_area(curr, xsave_area);
 }
 
 return X86EMUL_OKAY;
@@ -2411,7 +2414,8 @@ static void cf_check hvmemul_put_fpu(
 
 if ( aux )
 {
-fpusse_t *fpu_ctxt = &curr->arch.xsave_area->fpu_sse;
+struct xsave_struct *xsave_area = vcpu_map_xsave_area(curr);
+fpusse_t *fpu_ctxt = &xsave_area->fpu_sse;
 bool dval = aux->dval;
 int mode = hvm_guest_x86_mode(curr);
 
@@ -2465,6 +2469,8 @@ static void cf_check hvmemul_put_fpu(
 
 fpu_ctxt->fop = aux->op;
 
+vcpu_unmap_xsave_area(curr, xsave_area);
+
 /* Re-use backout code below. */
 backout = X86EMUL_FPU_fpu;
 }
-- 
2.47.0




[PATCH 00/14] x86: Address Space Isolation FPU preparations

2024-10-28 Thread Alejandro Vallejo
In a Xen build with Address Space Isolation the FPU state cannot come from the
xenheap, as that means the FPU state of vCPU_A may be speculatively accessible
from any pCPU running a hypercall on behalf of vCPU_B. This series prepares
code that manipulates the FPU state to use wrappers that fetch said state from
"elsewhere"[1]. Those wrappers will crystalise into something more than dummy
accesors after existing ASI efforts are merged. So far, they are:

  a) Remove the directmap (Elias El Yadouzi):
  https://lore.kernel.org/xen-devel/20240513134046.82605-1-elias...@amazon.com/

Removes all confidential data pages from the directmap and sets up the
infrastructure to access them. Its trust boundary is the domain and builds
the foundations of the secret hiding API around {un,}map_domain_page().

  b) x86: adventures in Address Space Isolation (Roger Pau Monne):
  https://lore.kernel.org/xen-devel/20240726152206.28411-1-roger@citrix.com/

Extends (a) to put the trust boundary at the vCPU instead so the threat
model covers mutually distrustful vCPUs of the same domain. Extends the
API for secret hiding to provide private pCPU-local resources. And an
efficient means of accessing resources of the "current" vCPU.

In essence, the idea is to stop directly accessing a pointer in the vCPU
structure and instead collect it indirectly via a macro invocation. The
proposed API is a map/unmap pair in order to tame the complexity involved in
the various cases uniformly (Does the domain run with ASI enabled? Is the vCPU
"current"? Are we lazy-switching?).

The series is somewhat long, but each patch is fairly trivial. If need be, I
can fold back a lot of these onto single commits to make it shorter.

  * Patch 1 refreshes a couple of asserts back into something helpful. Can
be folded onto patches 12 and 13 if deemed too silly for a Fixes tag.

  * Patch 2 is the introduction of the wrappers in isolation.

  * Patches 3 - 10 are split for ease of review, but are conceptually the same
thing over and over (to stop using direct v->arch.xsave_area and to use
wrappers instead).

  * Patch 11 cleans the idle vcpu state after using it as a dumping ground. It's
not strictly required for this series, but I'm bound to forget to do it
later once we _do_ care, and it does no harm to do it now. It's otherwise
independent of the other patches (it clashes with 10, but only due to both
modifying the same code; it's conceptually independent).

  * Patches 12 and 13 bite the bullet and enlightens the (f)xsave and (f)xrstor
abstractions to use the wrappers rather than direct access.

  * Patch 14 covers the last remaining direct use of the xsave area. It's too
tricky to introduce ahead of patches 12 and 13 because it needs state passed
in that isn't available until those have gone in.

[1] That "elsewhere" will be with high likelihood either the directmap (on
non-ASI), some perma-mapped vCPU-local area (see series (b) at the top) or
implemented as a transient mapping in the style of {un,}map_domain_page() for
glacially cold accesses to non-current vCPUs. Importantly, writing the final
macros requires the other series going in first.

Alejandro Vallejo (14):
  x86/xstate: Update stale assertions in fpu_x{rstor,save}()
  x86/xstate: Create map/unmap primitives for xsave areas
  x86/hvm: Map/unmap xsave area in hvm_save_cpu_ctxt()
  x86/fpu: Map/unmap xsave area in vcpu_{reset,setup}_fpu()
  x86/xstate: Map/unmap xsave area in xstate_set_init() and
handle_setbv()
  x86/hvm: Map/unmap xsave area in hvmemul_{get,put}_fpu()
  x86/domctl: Map/unmap xsave area in arch_get_info_guest()
  x86/xstate: Map/unmap xsave area in {compress,expand}_xsave_states()
  x86/emulator: Refactor FXSAVE_AREA to use wrappers
  x86/mpx: Map/unmap xsave area in read_bndcfgu()
  x86/mpx: Adjust read_bndcfgu() to clean after itself
  x86/fpu: Pass explicit xsave areas to fpu_(f)xsave()
  x86/fpu: Pass explicit xsave areas to fpu_(f)xrstor()
  x86/xstate: Make xstate_all() and vcpu_xsave_mask() take explicit
xstate

 xen/arch/x86/domctl.c |  9 +++--
 xen/arch/x86/hvm/emulate.c| 10 -
 xen/arch/x86/hvm/hvm.c|  8 ++--
 xen/arch/x86/i387.c   | 67 ---
 xen/arch/x86/include/asm/xstate.h | 29 +++--
 xen/arch/x86/x86_emulate/blk.c| 10 -
 xen/arch/x86/xstate.c | 51 ---
 7 files changed, 130 insertions(+), 54 deletions(-)

-- 
2.47.0




[PATCH 14/14] x86/xstate: Make xstate_all() and vcpu_xsave_mask() take explicit xstate

2024-10-28 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/i387.c   | 9 +
 xen/arch/x86/include/asm/xstate.h | 5 +++--
 xen/arch/x86/xstate.c | 2 +-
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index 7e1fb8ad8779..87b44dc11b55 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -108,7 +108,8 @@ static inline void fpu_fxrstor(struct vcpu *v, const 
fpusse_t *fpu_ctxt)
 /*  FPU Save Functions */
 /***/
 
-static inline uint64_t vcpu_xsave_mask(const struct vcpu *v)
+static inline uint64_t vcpu_xsave_mask(const struct vcpu *v,
+   const struct xsave_struct *xsave_area)
 {
 if ( v->fpu_dirtied )
 return v->arch.nonlazy_xstate_used ? XSTATE_ALL : XSTATE_LAZY;
@@ -125,14 +126,14 @@ static inline uint64_t vcpu_xsave_mask(const struct vcpu 
*v)
  * XSTATE_FP_SSE), vcpu_xsave_mask will return XSTATE_ALL. Otherwise
  * return XSTATE_NONLAZY.
  */
-return xstate_all(v) ? XSTATE_ALL : XSTATE_NONLAZY;
+return xstate_all(v, xsave_area) ? XSTATE_ALL : XSTATE_NONLAZY;
 }
 
 /* Save x87 extended state */
 static inline void fpu_xsave(struct vcpu *v, struct xsave_struct *xsave_area)
 {
 bool ok;
-uint64_t mask = vcpu_xsave_mask(v);
+uint64_t mask = vcpu_xsave_mask(v, xsave_area);
 
 ASSERT(mask);
 ASSERT(cpu_has_xsave);
@@ -213,7 +214,7 @@ void vcpu_restore_fpu_nonlazy(struct vcpu *v, bool need_stts)
  * saving state belonging to another vCPU.
  */
 xsave_area = vcpu_map_xsave_area(v);
-if ( v->arch.fully_eager_fpu || xstate_all(v) )
+if ( v->arch.fully_eager_fpu || xstate_all(v, xsave_area) )
 {
 if ( cpu_has_xsave )
 fpu_xrstor(v, xsave_area, XSTATE_ALL);
diff --git a/xen/arch/x86/include/asm/xstate.h b/xen/arch/x86/include/asm/xstate.h
index 43f7731c2b17..81350d0105bb 100644
--- a/xen/arch/x86/include/asm/xstate.h
+++ b/xen/arch/x86/include/asm/xstate.h
@@ -132,14 +132,15 @@ xsave_area_compressed(const struct xsave_struct *xsave_area)
 return xsave_area->xsave_hdr.xcomp_bv & XSTATE_COMPACTION_ENABLED;
 }
 
-static inline bool xstate_all(const struct vcpu *v)
+static inline bool xstate_all(const struct vcpu *v,
+  const struct xsave_struct *xsave_area)
 {
 /*
  * XSTATE_FP_SSE may be excluded, because the offsets of XSTATE_FP_SSE
  * (in the legacy region of xsave area) are fixed, so saving
  * XSTATE_FP_SSE will not cause overwriting problem with XSAVES/XSAVEC.
  */
-return xsave_area_compressed(v->arch.xsave_area) &&
+return xsave_area_compressed(xsave_area) &&
(v->arch.xcr0_accum & XSTATE_LAZY & ~XSTATE_FP_SSE);
 }
 
diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index aa5c062f7e51..cbe56eba89eb 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -1002,7 +1002,7 @@ int handle_xsetbv(u32 index, u64 new_bv)
 asm ( "stmxcsr %0" : "=m" (xsave_area->fpu_sse.mxcsr) );
 vcpu_unmap_xsave_area(curr, xsave_area);
 }
-else if ( xstate_all(curr) )
+else if ( xstate_all(curr, xsave_area) )
 {
 /* See the comment in i387.c:vcpu_restore_fpu_eager(). */
 mask |= XSTATE_LAZY;
-- 
2.47.0




[PATCH 04/14] x86/fpu: Map/unmap xsave area in vcpu_{reset,setup}_fpu()

2024-10-28 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/i387.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index 375a8274f632..a571bcb23c91 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -306,8 +306,10 @@ int vcpu_init_fpu(struct vcpu *v)
 
 void vcpu_reset_fpu(struct vcpu *v)
 {
+struct xsave_struct *xsave_area = vcpu_map_xsave_area(v);
+
 v->fpu_initialised = false;
-*v->arch.xsave_area = (struct xsave_struct) {
+*xsave_area = (struct xsave_struct) {
 .fpu_sse = {
 .mxcsr = MXCSR_DEFAULT,
 .fcw = FCW_RESET,
@@ -315,15 +317,21 @@ void vcpu_reset_fpu(struct vcpu *v)
 },
 .xsave_hdr.xstate_bv = X86_XCR0_X87,
 };
+
+vcpu_unmap_xsave_area(v, xsave_area);
 }
 
 void vcpu_setup_fpu(struct vcpu *v, const void *data)
 {
+struct xsave_struct *xsave_area = vcpu_map_xsave_area(v);
+
 v->fpu_initialised = true;
-*v->arch.xsave_area = (struct xsave_struct) {
+*xsave_area = (struct xsave_struct) {
 .fpu_sse = *(const fpusse_t*)data,
 .xsave_hdr.xstate_bv = XSTATE_FP_SSE,
 };
+
+vcpu_unmap_xsave_area(v, xsave_area);
 }
 
 /* Free FPU's context save area */
-- 
2.47.0




[PATCH 13/14] x86/fpu: Pass explicit xsave areas to fpu_(f)xrstor()

2024-10-28 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/i387.c   | 26 --
 xen/arch/x86/include/asm/xstate.h |  2 +-
 xen/arch/x86/xstate.c | 10 ++
 3 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index 5950fbcf272e..7e1fb8ad8779 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -20,7 +20,8 @@
 /* FPU Restore Functions   */
 /***/
 /* Restore x87 extended state */
-static inline void fpu_xrstor(struct vcpu *v, uint64_t mask)
+static inline void fpu_xrstor(struct vcpu *v, struct xsave_struct *xsave_area,
+  uint64_t mask)
 {
 bool ok;
 
@@ -31,16 +32,14 @@ static inline void fpu_xrstor(struct vcpu *v, uint64_t mask)
  */
 ok = set_xcr0(v->arch.xcr0_accum | XSTATE_FP_SSE);
 ASSERT(ok);
-xrstor(v, mask);
+xrstor(v, xsave_area, mask);
 ok = set_xcr0(v->arch.xcr0 ?: XSTATE_FP_SSE);
 ASSERT(ok);
 }
 
 /* Restore x87 FPU, MMX, SSE and SSE2 state */
-static inline void fpu_fxrstor(struct vcpu *v)
+static inline void fpu_fxrstor(struct vcpu *v, const fpusse_t *fpu_ctxt)
 {
-const fpusse_t *fpu_ctxt = &v->arch.xsave_area->fpu_sse;
-
 /*
  * Some CPUs don't save/restore FDP/FIP/FOP unless an exception
  * is pending. Clear the x87 state here by setting it to fixed
@@ -197,6 +196,8 @@ static inline void fpu_fxsave(struct vcpu *v, fpusse_t *fpu_ctxt)
 /* Restore FPU state whenever VCPU is schduled in. */
 void vcpu_restore_fpu_nonlazy(struct vcpu *v, bool need_stts)
 {
+struct xsave_struct *xsave_area;
+
 /* Restore nonlazy extended state (i.e. parts not tracked by CR0.TS). */
 if ( !v->arch.fully_eager_fpu && !v->arch.nonlazy_xstate_used )
 goto maybe_stts;
@@ -211,12 +212,13 @@ void vcpu_restore_fpu_nonlazy(struct vcpu *v, bool need_stts)
  * above) we also need to restore full state, to prevent subsequently
  * saving state belonging to another vCPU.
  */
+xsave_area = vcpu_map_xsave_area(v);
 if ( v->arch.fully_eager_fpu || xstate_all(v) )
 {
 if ( cpu_has_xsave )
-fpu_xrstor(v, XSTATE_ALL);
+fpu_xrstor(v, xsave_area, XSTATE_ALL);
 else
-fpu_fxrstor(v);
+fpu_fxrstor(v, &xsave_area->fpu_sse);
 
 v->fpu_initialised = 1;
 v->fpu_dirtied = 1;
@@ -226,9 +228,10 @@ void vcpu_restore_fpu_nonlazy(struct vcpu *v, bool need_stts)
 }
 else
 {
-fpu_xrstor(v, XSTATE_NONLAZY);
+fpu_xrstor(v, xsave_area, XSTATE_NONLAZY);
 need_stts = true;
 }
+vcpu_unmap_xsave_area(v, xsave_area);
 
  maybe_stts:
 if ( need_stts )
@@ -240,6 +243,7 @@ void vcpu_restore_fpu_nonlazy(struct vcpu *v, bool need_stts)
  */
 void vcpu_restore_fpu_lazy(struct vcpu *v)
 {
+struct xsave_struct *xsave_area;
 ASSERT(!is_idle_vcpu(v));
 
 /* Avoid recursion. */
@@ -250,10 +254,12 @@ void vcpu_restore_fpu_lazy(struct vcpu *v)
 
 ASSERT(!v->arch.fully_eager_fpu);
 
+xsave_area = vcpu_map_xsave_area(v);
 if ( cpu_has_xsave )
-fpu_xrstor(v, XSTATE_LAZY);
+fpu_xrstor(v, xsave_area, XSTATE_LAZY);
 else
-fpu_fxrstor(v);
+fpu_fxrstor(v, &xsave_area->fpu_sse);
+vcpu_unmap_xsave_area(v, xsave_area);
 
 v->fpu_initialised = 1;
 v->fpu_dirtied = 1;
diff --git a/xen/arch/x86/include/asm/xstate.h b/xen/arch/x86/include/asm/xstate.h
index 104fe0d44173..43f7731c2b17 100644
--- a/xen/arch/x86/include/asm/xstate.h
+++ b/xen/arch/x86/include/asm/xstate.h
@@ -98,7 +98,7 @@ void set_msr_xss(u64 xss);
 uint64_t get_msr_xss(void);
 uint64_t read_bndcfgu(void);
 void xsave(struct vcpu *v, struct xsave_struct *ptr, uint64_t mask);
-void xrstor(struct vcpu *v, uint64_t mask);
+void xrstor(struct vcpu *v, struct xsave_struct *ptr, uint64_t mask);
 void xstate_set_init(uint64_t mask);
 bool xsave_enabled(const struct vcpu *v);
 int __must_check validate_xstate(const struct domain *d,
diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index 518388e6e272..aa5c062f7e51 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -374,11 +374,10 @@ void xsave(struct vcpu *v, struct xsave_struct *ptr, uint64_t mask)
 ptr->fpu_sse.x[FPU_WORD_SIZE_OFFSET] = fip_width;
 }
 
-void xrstor(struct vcpu *v, uint64_t mask)
+void xrstor(struct vcpu *v, struct xsave_struct *ptr, uint64_t mask)
 {
 uint32_t hmask = mask >> 32;
 uint32_t lmask = mask;
-struct xsave_struct *ptr = v->arch.xsave_area;
 unsigned int faults, prev_faults;
 
 /*
@@ -992,6 +991,7 @@ int handle_xsetbv(u32 index, u64 new_bv)
 mask &= curr->fpu_dirtied ? ~XSTATE_FP_SSE : XSTATE_NONLAZY;
 if ( mask )
 {
+struct xsave_struct *xsave_area = vcpu_map_xsave_area(curr);
 unsigned long cr0 =

[PATCH 07/14] x86/domctl: Map/unmap xsave area in arch_get_info_guest()

2024-10-28 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/domctl.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 5f0619da..8f6075bc84b8 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1377,16 +1377,17 @@ void arch_get_info_guest(struct vcpu *v, vcpu_guest_context_u c)
 unsigned int i;
 const struct domain *d = v->domain;
 bool compat = is_pv_32bit_domain(d);
+const struct xsave_struct *xsave_area;
 #ifdef CONFIG_COMPAT
 #define c(fld) (!compat ? (c.nat->fld) : (c.cmp->fld))
 #else
 #define c(fld) (c.nat->fld)
 #endif
 
-BUILD_BUG_ON(sizeof(c.nat->fpu_ctxt) !=
- sizeof(v->arch.xsave_area->fpu_sse));
-memcpy(&c.nat->fpu_ctxt, &v->arch.xsave_area->fpu_sse,
-   sizeof(c.nat->fpu_ctxt));
+xsave_area = vcpu_map_xsave_area(v);
+BUILD_BUG_ON(sizeof(c.nat->fpu_ctxt) != sizeof(xsave_area->fpu_sse));
+memcpy(&c.nat->fpu_ctxt, &xsave_area->fpu_sse, sizeof(c.nat->fpu_ctxt));
+vcpu_unmap_xsave_area(v, xsave_area);
 
 if ( is_pv_domain(d) )
 c(flags = v->arch.pv.vgc_flags & ~(VGCF_i387_valid|VGCF_in_kernel));
-- 
2.47.0




[PATCH 09/14] x86/emulator: Refactor FXSAVE_AREA to use wrappers

2024-10-28 Thread Alejandro Vallejo
Adds an UNMAP primitive to make use of vcpu_unmap_xsave_area() when
linked into Xen. The unmap is a no-op during tests.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/x86_emulate/blk.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/x86_emulate/blk.c b/xen/arch/x86/x86_emulate/blk.c
index 08a05f8453f7..d5b59333823f 100644
--- a/xen/arch/x86/x86_emulate/blk.c
+++ b/xen/arch/x86/x86_emulate/blk.c
@@ -11,9 +11,11 @@
 !defined(X86EMUL_NO_SIMD)
 # ifdef __XEN__
 #  include 
-#  define FXSAVE_AREA ((void *)&current->arch.xsave_area->fpu_sse)
+#  define FXSAVE_AREA ((void *)vcpu_map_xsave_area(current))
+#  define UNMAP_FXSAVE_AREA(x) vcpu_unmap_xsave_area(current, x)
 # else
 #  define FXSAVE_AREA get_fpu_save_area()
+#  define UNMAP_FXSAVE_AREA(x) ((void)x)
 # endif
 #endif
 
@@ -292,6 +294,9 @@ int x86_emul_blk(
 }
 else
 asm volatile ( "fxrstor %0" :: "m" (*fxsr) );
+
+UNMAP_FXSAVE_AREA(fxsr);
+
 break;
 }
 
@@ -320,6 +325,9 @@ int x86_emul_blk(
 
 if ( fxsr != ptr ) /* i.e. s->op_bytes < sizeof(*fxsr) */
 memcpy(ptr, fxsr, s->op_bytes);
+
+UNMAP_FXSAVE_AREA(fxsr);
+
 break;
 }
 
-- 
2.47.0




[PATCH 02/14] x86/xstate: Create map/unmap primitives for xsave areas

2024-10-28 Thread Alejandro Vallejo
Add infrastructure to simplify ASI handling. With ASI in the picture
we'll have several different means of accessing the XSAVE area of a
given vCPU, depending on whether a domain is covered by ASI or not and
whether the vCPU in question is scheduled on the current pCPU or not.

Having these complexities exposed at the call sites becomes unwieldy
very fast. These wrappers are intended to be used in a similar way to
map_domain_page() and unmap_domain_page(): the map operation will
dispatch the appropriate pointer for each case in a future patch, while
unmap will remain a no-op where no unmap is required (e.g. when there's
no ASI) and will remove the transient mapping where one was required.

Follow-up patches replace all raw uses of v->arch.xsave_area with this
mechanism, in preparation for adding the aforementioned dispatch logic
at a later time.
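
As a rough illustration (not part of the patch itself), the call pattern the
follow-up patches converge on looks like this; example_user() is a made-up
name, and at this point the map macro is still a plain field access:

```c
/* Minimal sketch of the intended call pattern; example_user() is made up. */
static void example_user(struct vcpu *v)
{
    struct xsave_struct *xsave_area = vcpu_map_xsave_area(v);

    /* ... operate on xsave_area instead of v->arch.xsave_area ... */

    vcpu_unmap_xsave_area(v, xsave_area);   /* nullifies the local pointer */
}
```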

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/include/asm/xstate.h | 20 
 1 file changed, 20 insertions(+)

diff --git a/xen/arch/x86/include/asm/xstate.h b/xen/arch/x86/include/asm/xstate.h
index 07017cc4edfd..36260459667c 100644
--- a/xen/arch/x86/include/asm/xstate.h
+++ b/xen/arch/x86/include/asm/xstate.h
@@ -143,4 +143,24 @@ static inline bool xstate_all(const struct vcpu *v)
(v->arch.xcr0_accum & XSTATE_LAZY & ~XSTATE_FP_SSE);
 }
 
+/*
+ * Fetch a pointer to the XSAVE area of a vCPU
+ *
+ * If ASI is enabled for the domain, this mapping is pCPU-local.
+ *
+ * @param v Owner of the XSAVE area
+ */
+#define vcpu_map_xsave_area(v) ((v)->arch.xsave_area)
+
+/*
+ * Drops the XSAVE area of a vCPU and nullifies its pointer on exit.
+ *
+ * If ASI is enabled and v is not the currently scheduled vCPU then the
+ * per-pCPU mapping is removed from the address space.
+ *
+ * @param v   vCPU logically owning xsave_area
+ * @param xsave_area  XSAVE blob of v
+ */
+#define vcpu_unmap_xsave_area(v, x) ({ (x) = NULL; })
+
 #endif /* __ASM_XSTATE_H */
-- 
2.47.0




[PATCH 12/14] x86/fpu: Pass explicit xsave areas to fpu_(f)xsave()

2024-10-28 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/i387.c   | 16 ++--
 xen/arch/x86/include/asm/xstate.h |  2 +-
 xen/arch/x86/xstate.c |  3 +--
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index a571bcb23c91..5950fbcf272e 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -130,7 +130,7 @@ static inline uint64_t vcpu_xsave_mask(const struct vcpu *v)
 }
 
 /* Save x87 extended state */
-static inline void fpu_xsave(struct vcpu *v)
+static inline void fpu_xsave(struct vcpu *v, struct xsave_struct *xsave_area)
 {
 bool ok;
 uint64_t mask = vcpu_xsave_mask(v);
@@ -143,15 +143,14 @@ static inline void fpu_xsave(struct vcpu *v)
  */
 ok = set_xcr0(v->arch.xcr0_accum | XSTATE_FP_SSE);
 ASSERT(ok);
-xsave(v, mask);
+xsave(v, xsave_area, mask);
 ok = set_xcr0(v->arch.xcr0 ?: XSTATE_FP_SSE);
 ASSERT(ok);
 }
 
 /* Save x87 FPU, MMX, SSE and SSE2 state */
-static inline void fpu_fxsave(struct vcpu *v)
+static inline void fpu_fxsave(struct vcpu *v, fpusse_t *fpu_ctxt)
 {
-fpusse_t *fpu_ctxt = &v->arch.xsave_area->fpu_sse;
 unsigned int fip_width = v->domain->arch.x87_fip_width;
 
 if ( fip_width != 4 )
@@ -266,6 +265,8 @@ void vcpu_restore_fpu_lazy(struct vcpu *v)
  */
 static bool _vcpu_save_fpu(struct vcpu *v)
 {
+struct xsave_struct *xsave_area;
+
 if ( !v->fpu_dirtied && !v->arch.nonlazy_xstate_used )
 return false;
 
@@ -274,11 +275,14 @@ static bool _vcpu_save_fpu(struct vcpu *v)
 /* This can happen, if a paravirtualised guest OS has set its CR0.TS. */
 clts();
 
+xsave_area = vcpu_map_xsave_area(v);
+
 if ( cpu_has_xsave )
-fpu_xsave(v);
+fpu_xsave(v, xsave_area);
 else
-fpu_fxsave(v);
+fpu_fxsave(v, &xsave_area->fpu_sse);
 
+vcpu_unmap_xsave_area(v, xsave_area);
 v->fpu_dirtied = 0;
 
 return true;
diff --git a/xen/arch/x86/include/asm/xstate.h b/xen/arch/x86/include/asm/xstate.h
index 36260459667c..104fe0d44173 100644
--- a/xen/arch/x86/include/asm/xstate.h
+++ b/xen/arch/x86/include/asm/xstate.h
@@ -97,7 +97,7 @@ uint64_t get_xcr0(void);
 void set_msr_xss(u64 xss);
 uint64_t get_msr_xss(void);
 uint64_t read_bndcfgu(void);
-void xsave(struct vcpu *v, uint64_t mask);
+void xsave(struct vcpu *v, struct xsave_struct *ptr, uint64_t mask);
 void xrstor(struct vcpu *v, uint64_t mask);
 void xstate_set_init(uint64_t mask);
 bool xsave_enabled(const struct vcpu *v);
diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index a9a7ee2cd1e6..518388e6e272 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -300,9 +300,8 @@ void compress_xsave_states(struct vcpu *v, const void *src, unsigned int size)
 vcpu_unmap_xsave_area(v, xstate);
 }
 
-void xsave(struct vcpu *v, uint64_t mask)
+void xsave(struct vcpu *v, struct xsave_struct *ptr, uint64_t mask)
 {
-struct xsave_struct *ptr = v->arch.xsave_area;
 uint32_t hmask = mask >> 32;
 uint32_t lmask = mask;
 unsigned int fip_width = v->domain->arch.x87_fip_width;
-- 
2.47.0




[PATCH 03/14] x86/hvm: Map/unmap xsave area in hvm_save_cpu_ctxt()

2024-10-28 Thread Alejandro Vallejo
No functional change.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/hvm/hvm.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 018d44a08b6b..77b975f07f32 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -914,11 +914,11 @@ static int cf_check hvm_save_cpu_ctxt(struct vcpu *v, hvm_domain_context_t *h)
 
 if ( v->fpu_initialised )
 {
-BUILD_BUG_ON(sizeof(ctxt.fpu_regs) !=
- sizeof(v->arch.xsave_area->fpu_sse));
-memcpy(ctxt.fpu_regs, &v->arch.xsave_area->fpu_sse,
-   sizeof(ctxt.fpu_regs));
+const struct xsave_struct *xsave_area = vcpu_map_xsave_area(v);
 
+BUILD_BUG_ON(sizeof(ctxt.fpu_regs) != sizeof(xsave_area->fpu_sse));
+memcpy(ctxt.fpu_regs, &xsave_area->fpu_sse, sizeof(ctxt.fpu_regs));
+vcpu_unmap_xsave_area(v, xsave_area);
 ctxt.flags = XEN_X86_FPU_INITIALISED;
 }
 
-- 
2.47.0




Re: [PATCH v4] NUMA: Introduce NODE_DATA->node_present_pages(RAM pages)

2024-10-28 Thread Alejandro Vallejo
Hi,

On Sun Oct 27, 2024 at 2:43 PM GMT, Bernhard Kaindl wrote:
> From: Bernhard Kaindl 
>
> At the moment, Xen keeps track of the spans of PFNs of the NUMA nodes.
> But the PFN span sometimes includes large MMIO holes, so these values
> might not be an exact representation of the total usable RAM of nodes.
>
> Xen does not need it, but the size of the NUMA node's memory can be
> helpful for management tools and HW information tools like hwloc/lstopo
> with its Xen backend for Dom0: https://github.com/xenserver-next/hwloc/
>
> First, introduce NODE_DATA(nodeid)->node_present_pages to node_data[],
> determine the sum of usable PFNs at boot and update them on memory_add().
>
> (The Linux kernel handles NODE_DATA->node_present_pages likewise)
>
> Signed-off-by: Bernhard Kaindl 
> ---
> Changes in v3:
> - Use PFN_UP/DOWN, refactored further to simplify the code while leaving
>   compiler-level optimisations to the compiler's optimisation passes.
> Changes in v4:
> - Refactored code and doxygen documentation according to the review.
> ---
>  xen/arch/x86/numa.c  | 13 +
>  xen/arch/x86/x86_64/mm.c |  3 +++
>  xen/common/numa.c| 36 +---
>  xen/include/xen/numa.h   | 21 +
>  4 files changed, 70 insertions(+), 3 deletions(-)
>
> diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> index 4b0b297c7e..3c0574f773 100644
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -100,6 +100,19 @@ unsigned int __init arch_get_dma_bitsize(void)
>   + PAGE_SHIFT, 32);
>  }
>  
> +/**
> + * @brief Retrieves the RAM range for a given index from the e820 memory map.
> + *
> + * This function fetches the start and end address (exclusive) of a RAM range
> + * specified by the given index idx from the e820 memory map.
> + *
> + * @param idx The index of the RAM range in the e820 memory map to retrieve.
> + * @param start Pointer to store the start address of the RAM range.
> + * @param end Pointer to store the end address of the RAM range.

As with setup_node_bootmem(), we probably want this to explicitly state
"exclusive" to indicate it's not the last address, but the address after it.

> + *
> + * @return 0 on success, -ENOENT if the index is out of bounds,
> + * or -ENODATA if the memory map at index idx is not of type E820_RAM.
> + */
>  int __init arch_get_ram_range(unsigned int idx, paddr_t *start, paddr_t *end)
>  {
>  if ( idx >= e820.nr_map )
> diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c
> index b2a280fba3..66b9bed057 100644
> --- a/xen/arch/x86/x86_64/mm.c
> +++ b/xen/arch/x86/x86_64/mm.c
> @@ -1334,6 +1334,9 @@ int memory_add(unsigned long spfn, unsigned long epfn, unsigned int pxm)
>  share_hotadd_m2p_table(&info);
>  transfer_pages_to_heap(&info);
>  
> +/* Update the node's present pages (like the total_pages of the system) */
> +NODE_DATA(node)->node_present_pages += epfn - spfn;
> +
>  return 0;
>  
>  destroy_m2p:
> diff --git a/xen/common/numa.c b/xen/common/numa.c
> index 209c546a3b..9a8b805dd7 100644
> --- a/xen/common/numa.c
> +++ b/xen/common/numa.c
> @@ -4,6 +4,7 @@
>   * Adapted for Xen: Ryan Harper 
>   */
>  
> +#include "xen/pfn.h"
>  #include 
>  #include 
>  #include 
> @@ -499,15 +500,44 @@ int __init compute_hash_shift(const struct node *nodes,
>  return shift;
>  }
>  
> -/* Initialize NODE_DATA given nodeid and start/end */
> +/**
> + * @brief Initialize a NUMA node's node_data structure at boot.
> + *
> + * It is given the NUMA node's index in the node_data array as well
> + * as the start and exclusive end address of the node's memory span
> + * as arguments and initializes the node_data entry with this information.
> + *
> + * It then initializes the total number of usable memory pages within
> + * the NUMA node's memory span using the arch_get_ram_range() function.
> + *
> + * @param nodeid The index into the node_data array for the node.
> + * @param start The starting physical address of the node's memory range.
> + * @param end The exclusive ending physical address of the node's memory range.
> + */
>  void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end)
>  {
>  unsigned long start_pfn = paddr_to_pfn(start);
>  unsigned long end_pfn = paddr_to_pfn(end);
> +struct node_data *numa_node = NODE_DATA(nodeid);
> +paddr_t start_ram, end_ram;

With the loop in place and arch_get_ram_range() being called inside it, the
scope of these two can be reduced further by moving them inside as well.
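
For illustration, a minimal sketch of the reduced-scope variant (hypothetical,
assuming the loop iterates arch_get_ram_range() as in the patch and uses the
pages/start/end variables visible above):

```c
/* Sketch only; not the actual patch. */
for ( unsigned int idx = 0; ; idx++ )
{
    paddr_t start_ram, end_ram;   /* declared where they are used */
    int rc = arch_get_ram_range(idx, &start_ram, &end_ram);

    if ( rc == -ENOENT )          /* ran past the end of the memory map */
        break;
    if ( rc )                     /* e.g. -ENODATA: not an E820_RAM entry */
        continue;

    /* Clip the RAM range to the node's span before counting whole pages. */
    start_ram = max(start_ram, start);
    end_ram = min(end_ram, end);
    if ( start_ram < end_ram )
        *pages += PFN_DOWN(end_ram) - PFN_UP(start_ram);
}
```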

> +unsigned int idx = 0;
> +unsigned long *pages = &numa_node->node_present_pages;
>  
> -NODE_DATA(nodeid)->node_start_pfn = start_pfn;
> -NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
> +numa_node->node_start_pfn = start_pfn;
> +numa_node->node_spanned_pages = end_pfn - start_pfn;
> +
> +/* Calculate the number of present RAM pages within the node: */

nit: that last ":" feels a bi
