date:20250707

Re: [PATCH] xen/efi: Fix crash with initial empty EFI options

2025-07-07 Thread Frediano Ziglio

On Mon, Jul 7, 2025 at 5:04 PM Jan Beulich  wrote:
>
> On 07.07.2025 17:51, Frediano Ziglio wrote:
> > On Mon, Jul 7, 2025 at 4:42 PM Jan Beulich  wrote:
> >>
> >> On 07.07.2025 17:11, Frediano Ziglio wrote:
> >>> EFI code path split options from EFI LoadOptions fields in 2
> >>> pieces, first EFI options, second Xen options.
> >>> "get_argv" function is called first to get the number of arguments
> >>> in the LoadOptions, second, after allocating enough space, to
> >>> fill some "argc"/"argv" variable. However the first parsing could
> >>> be different from second as second is able to detect "--" argument
> >>> separator. So it was possible that "argc" was bigger that the "argv"
> >>> array leading to potential buffer overflows, in particular
> >>> a string like "-- a b c" would lead to buffer overflow in "argv"
> >>> resulting in crashes.
> >>> Using EFI shell is possible to pass any kind of string in
> >>> LoadOptions.
> >>>
> >>> Fixes: 201f261e859e ("EFI: move x86 boot/runtime code to common/efi")
> >>
> >> This only moves the function, but doesn't really introduce any issue 
> >> afaics.
> >>
> >
> > Okay, I'll follow the rename
> >
> >>> --- a/xen/common/efi/boot.c
> >>> +++ b/xen/common/efi/boot.c
> >>> @@ -345,6 +345,7 @@ static unsigned int __init get_argv(unsigned int 
> >>> argc, CHAR16 **argv,
> >>>  VOID *data, UINTN size, UINTN 
> >>> *offset,
> >>>  CHAR16 **options)
> >>>  {
> >>> +CHAR16 **const orig_argv = argv;
> >>>  CHAR16 *ptr = (CHAR16 *)(argv + argc + 1), *prev = NULL, *cmdline = 
> >>> NULL;
> >>>  bool prev_sep = true;
> >>>
> >>> @@ -384,7 +385,7 @@ static unsigned int __init get_argv(unsigned int 
> >>> argc, CHAR16 **argv,
> >>>  {
> >>>  cmdline = data + *offset;
> >>>  /* Cater for the image name as first component. */
> >>> -++argc;
> >>> +++argv;
> >>
> >> We're on the argc == 0 and argv == NULL path here. Incrementing NULL is UB,
> >> if I'm not mistaken.
> >
> > Not as far as I know. Why?
>
> Increment and decrement operators are like additions. For additions the 
> standard
> says: "For addition, either both operands shall have arithmetic type, or one
> operand shall be a pointer to an object type and the other shall have integer
> type." Neither of the alternatives is true for NULL.
>

Yes and no. The expression here is not NULL + 1, but (CHAR16**)NULL +
1, hence the pointer has a type and so the expression is valid.

> > Some systems even can use NULL pointers as valid, like mmap.
>
> Right, but that doesn't make the use of NULL C-compliant.
>
> >>> @@ -402,7 +403,7 @@ static unsigned int __init get_argv(unsigned int 
> >>> argc, CHAR16 **argv,
> >>>  {
> >>>  if ( cur_sep )
> >>>  ++ptr;
> >>> -else if ( argv )
> >>> +else if ( orig_argv )
> >>>  {
> >>>  *ptr = *cmdline;
> >>>  *++ptr = 0;
> >>> @@ -410,8 +411,8 @@ static unsigned int __init get_argv(unsigned int 
> >>> argc, CHAR16 **argv,
> >>>  }
> >>>  else if ( !cur_sep )
> >>>  {
> >>> -if ( !argv )
> >>> -++argc;
> >>> +if ( !orig_argv )
> >>> +++argv;
> >>>  else if ( prev && wstrcmp(prev, L"--") == 0 )
> >>>  {
> >>>  --argv;
> >>
> >> As per this, it looks like that on the 1st pass we may indeed overcount
> >> arguments. But ...
> >>
> >
> > I can use again argc if you prefer, not strong about it.
> >
> >>> @@ -428,9 +429,9 @@ static unsigned int __init get_argv(unsigned int 
> >>> argc, CHAR16 **argv,
> >>>  }
> >>>  prev_sep = cur_sep;
> >>>  }
> >>> -if ( argv )
> >>> +if ( orig_argv )
> >>>  *argv = NULL;
> >>> -return argc;
> >>> +return argv - orig_argv;
> >>>  }
> >>>
> >>>  static EFI_FILE_HANDLE __init get_parent_handle(const EFI_LOADED_IMAGE 
> >>> *loaded_image,
> >>> @@ -1348,8 +1349,8 @@ void EFIAPI __init noreturn efi_start(EFI_HANDLE 
> >>> ImageHandle,
> >>>(argc + 1) * sizeof(*argv) +
> >>>loaded_image->LoadOptionsSize,
> >>>(void **)&argv) == EFI_SUCCESS )
> >>> -get_argv(argc, argv, loaded_image->LoadOptions,
> >>> - loaded_image->LoadOptionsSize, &offset, &options);
> >>> +argc = get_argv(argc, argv, loaded_image->LoadOptions,
> >>> +loaded_image->LoadOptionsSize, &offset, 
> >>> &options);
> >>
> >> ... wouldn't this change alone cure that problem? And even that I don't
> >> follow. Below here we have
> >>
> >> for ( i = 1; i < argc; ++i )
> >> {
> >> CHAR16 *ptr = argv[i];
> >>
> >> if ( !ptr )
> >> break;
> >

[MINI-OS PATCH 0/2] x86: don't use a memory page for mapping the shared info page

2025-07-07 Thread Juergen Gross

This is a small add-on series after the live-update series sent a week
ago.

I realized that having the shared info page in the normal RAM area is
a bad idea when considering kexec, as the new kernel might want it at
a different location.

So this series is moving the shared info page away from the RAM areas,
resulting in a net win of one memory page.

Juergen Gross (2):
  mm: provide a way to do very early page table allocations
  x86: don't use a memory page for mapping the shared info page

 arch/x86/mm.c | 23 +++
 arch/x86/setup.c  | 15 ---
 arch/x86/x86_32.S |  7 +--
 arch/x86/x86_64.S |  7 +--
 hypervisor.c  | 15 +++
 5 files changed, 36 insertions(+), 31 deletions(-)

-- 
2.43.0

Re: [PATCH v3 02/22] include/xen/slr-table.h: Secure Launch Resource Table definitions

2025-07-07 Thread Jan Beulich

On 07.07.2025 19:31, Sergii Dmytruk wrote:
> On Mon, Jul 07, 2025 at 10:29:46AM +0200, Jan Beulich wrote:
 ... then isn't used right here, instead requiring a cast somewhere 
 (presumably,
 as code using this isn't visible in this patch).
>>>
>>> As was mentioned earlier: because size of a pointer between Xen and a
>>> bootloader doesn't necessarily match.  What you're proposing can break
>>> under certain conditions.
>>
>> But the header isn't shared with a bootloader's code base. It's private to
>> Xen.
> 
> Yes, but sources of Xen be compiled with different size of a pointer
> which messes up the interpretation of the data.  I tried using
> BUILD_BUG_ON() to enforce the pointer is 64-bit and early code stopped
> compiling.  The structures must not vary like that.

Hmm. Does early code actually need to have this struct exposed to it?

Jan

RE: [PATCH v5 18/18] xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc driver

2025-07-07 Thread Penny, Zheng

[Public]

> -Original Message-
> From: Jan Beulich 
> Sent: Tuesday, June 17, 2025 6:38 PM
> To: Penny, Zheng 
> Cc: Huang, Ray ; Anthony PERARD
> ; Andrew Cooper ;
> Orzel, Michal ; Julien Grall ; Roger Pau
> Monné ; Stefano Stabellini ; 
> xen-
> de...@lists.xenproject.org
> Subject: Re: [PATCH v5 18/18] xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC
> xen_sysctl_pm_op for amd-cppc driver
>
> On 27.05.2025 10:48, Penny Zheng wrote:
> > --- a/tools/misc/xenpm.c
> > +++ b/tools/misc/xenpm.c
> > +
> > +case XEN_SYSCTL_CPPC_SET_PRESET_BALANCE:
> > +if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED )
> > +return -EINVAL;
> > +policy->policy = CPUFREQ_POLICY_BALANCE;
> > +min_perf = data->caps.lowest_perf;
> > +max_perf = data->caps.highest_perf;
> > +epp = CPPC_ENERGY_PERF_BALANCE;
> > +break;
>
> Isn't this more line "ondemand"? To me, "balance" would mean tying perf to at 
> least
> close around the middle of lowest and highest.
>

The "balance" word comes from the epp value, it is 127, which is the middle 
value
In actual hardware algorithm, the value of Energy Performance Preference 
register(EPP) will be translated to frequency setting,
And it sets the minimum active frequency.
 An EPP of zero sets the min active frequency to Fmax, while an EPP of 255 sets 
the min active frequency to Fmin (~IdleFreq).  It is linear scaling, so epp of 
127 is calculated to the middle of Fmax and Fmin.
And Fmax corresponds to the highest perf, and Fmin corresponds to the 
non-linear lowest perf

> > +case XEN_SYSCTL_CPPC_SET_PRESET_NONE:
> > +/*
> > + * In paasive mode, Xen governor is responsible for perfomance 
> > tuning.
>
> Nit: passive
>
> > + * we shall set lowest_perf with "lowest_nonlinear_perf" to ensure
> > + * governoring performance in P-states range.
> > + */
> > +min_perf = data->caps.lowest_nonlinear_perf;
> > +max_perf = data->caps.highest_perf;
>
> As in the earlier patch - I fear I don't understand the comment, and hence 
> why to
> use lowest-nonlinear here remains unclear to me.
>

After previous discussion, I'll use non-linear lowest in all places

> Jan

Re: [PATCH 2/2] xen/x86: introduce AMD_MCE_NONFATAL

2025-07-07 Thread Demi Marie Obenour

On 7/7/25 20:07, Stefano Stabellini wrote:
> Today, checking for non-fatal MCE errors on ARM is very invasive: it

s/ARM/AMD/
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)

OpenPGP_0xB288B55FFF9C22C1.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature

Re: [PATCH v3 02/22] include/xen/slr-table.h: Secure Launch Resource Table definitions

2025-07-07 Thread Jan Beulich

On 06.07.2025 18:55, Sergii Dmytruk wrote:
> On Wed, Jul 02, 2025 at 04:36:27PM +0200, Jan Beulich wrote:
>> On 30.05.2025 15:17, Sergii Dmytruk wrote:
>>> The file provides constants, structures and several helper functions for
>>> parsing SLRT.
>>>
>>> The data described by the structures is passed to Xen by a bootloader
>>> which initiated DRTM.
>>>
>>> Signed-off-by: Daniel P. Smith 
>>> Signed-off-by: Ross Philipson 
>>> Signed-off-by: Sergii Dmytruk 
>>> ---
>>>  xen/include/xen/slr-table.h | 276 
>>>  1 file changed, 276 insertions(+)
>>>  create mode 100644 xen/include/xen/slr-table.h
>>
>> Btw, please don't forget to Cc maintainers of code you're changing / adding.
> 
> What do you mean?  I'm running add_maintainers.pl on the patches.

The Cc: list had none of the REST maintainers. (Whether there's a bug in the
script I can't tell.)

>>> +/*
>>> + * Prototype of a function pointed to by slr_entry_dl_info::dl_handler.
>>> + */
>>> +typedef void (*dl_handler_func)(struct slr_bl_context *bl_context);
>>
>> I keep wondering why this ...
>>
>>> +/*
>>> + * DRTM Dynamic Launch Configuration
>>> + */
>>> +struct slr_entry_dl_info
>>> +{
>>> +struct slr_entry_hdr hdr;
>>> +uint64_t dce_size;
>>> +uint64_t dce_base;
>>> +uint64_t dlme_size;
>>> +uint64_t dlme_base;
>>> +uint64_t dlme_entry;
>>> +struct slr_bl_context bl_context;
>>> +uint64_t dl_handler;
>>
>> ... then isn't used right here, instead requiring a cast somewhere 
>> (presumably,
>> as code using this isn't visible in this patch).
> 
> As was mentioned earlier: because size of a pointer between Xen and a
> bootloader doesn't necessarily match.  What you're proposing can break
> under certain conditions.

But the header isn't shared with a bootloader's code base. It's private to
Xen.

>>> +} __packed;
>>
>> I similarly keep wondering why there are all these packed attributes here, 
>> when
>> (afaics) all of the structures are defined in a way that any padding is 
>> explicit
>> anyway.
> 
> As before: it won't hurt, it's future-proof, just in case and to let
> reader of the code know the structure must have no padding.  If you want
> them gone, I can do that, but I think it will make the code harder to
> maintain.

Well, if there's a maintenance concern, then I would of course like to learn
about that. I could see such being the case if the file was imported from
somewhere. But with neither description nor any code comment not saying so,
that doesn't look to be the case.

Jan

Re: [PATCH v3 07/22] x86/mtrr: expose functions for pausing caching

2025-07-07 Thread Jan Beulich

On 06.07.2025 19:34, Sergii Dmytruk wrote:
> On Wed, Jul 02, 2025 at 04:57:12PM +0200, Jan Beulich wrote:
>>> @@ -440,9 +436,10 @@ static DEFINE_SPINLOCK(set_atomicity_lock);
>>>   * has been called.
>>>   */
>>>
>>> -static bool prepare_set(void)
>>> +struct mtrr_pausing_state mtrr_pause_caching(void)
>>
>> These becoming non-static without being called from anywhere else isn't 
>> going to
>> be liked by Misra. Hence the part of static -> extern may need to be deferred
>> until the new user(s) appear(s).
> 
> Sounds like small part needs to be moved into the next patch.
> 
>> Furthermore this returning of a struct by value isn't very nice, and looks 
>> to be
>> easy to avoid here.
> 
> Are you suggesting to use an output parameter instead?

Yes.

>  Out of curiosity, what's bad in returning trivial structs by value?

It's generally deemed bad practice from all I know, to a fair degree because of
this then (generally) translating to a hidden function argument. While not
relevant here, this also is a corner case of the respective psABI, which is more
likely to cause interoperability issues.

Jan

Re: [PATCH v2 3/6] arm/mpu: Populate a new region in Xen MPU mapping table

2025-07-07 Thread Orzel, Michal




On 02/07/2025 16:13, Hari Limaye wrote:
> From: Penny Zheng 
> 
> Introduce map_pages_to_xen() that is implemented using a new helper,
> xen_mpumap_update(), which is responsible for updating Xen MPU memory
> mapping table(xen_mpumap), including creating a new entry, updating
> or destroying an existing one, it is equivalent to xen_pt_update in MMU.
> 
> This commit only implements populating a new entry in Xen MPU memory mapping
> table(xen_mpumap).
> 
> Signed-off-by: Penny Zheng 
> Signed-off-by: Wei Chen 
> Signed-off-by: Luca Fancellu 
> Signed-off-by: Hari Limaye 
> ---
> Changes from v1:
> - Simplify if condition
> - Use normal printk
> - Use %# over 0x%
> - Add same asserts as in Patch 4
> ---
>  xen/arch/arm/include/asm/mpu/mm.h |  12 
>  xen/arch/arm/mpu/mm.c | 100 ++
>  2 files changed, 112 insertions(+)
> 
> diff --git a/xen/arch/arm/include/asm/mpu/mm.h 
> b/xen/arch/arm/include/asm/mpu/mm.h
> index 81e47b9d0b..101002f7d4 100644
> --- a/xen/arch/arm/include/asm/mpu/mm.h
> +++ b/xen/arch/arm/include/asm/mpu/mm.h
> @@ -64,6 +64,7 @@ static inline void context_sync_mpu(void)
>   * The following API requires context_sync_mpu() after being used to modify 
> MPU
>   * regions:
>   *  - write_protection_region
> + *  - xen_mpumap_update
>   */
>  
>  /* Reads the MPU region (into @pr_read) with index @sel from the HW */
> @@ -72,6 +73,17 @@ void read_protection_region(pr_t *pr_read, uint8_t sel);
>  /* Writes the MPU region (from @pr_write) with index @sel to the HW */
>  void write_protection_region(const pr_t *pr_write, uint8_t sel);
>  
> +/*
> + * Maps an address range into the MPU data structure and updates the HW.
> + * Equivalent to xen_pt_update in an MMU system.
> + *
> + * @param base  Base address of the range to map (inclusive).
> + * @param limit Limit address of the range to map (exclusive).
> + * @param flags Flags for the memory range to map.
> + * @return  0 on success, negative on error.
> + */
> +int xen_mpumap_update(paddr_t base, paddr_t limit, unsigned int flags);
> +
>  /*
>   * Creates a pr_t structure describing a protection region.
>   *
> diff --git a/xen/arch/arm/mpu/mm.c b/xen/arch/arm/mpu/mm.c
> index 25e7f36c1e..dd54b66901 100644
> --- a/xen/arch/arm/mpu/mm.c
> +++ b/xen/arch/arm/mpu/mm.c
> @@ -6,6 +6,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -29,6 +30,8 @@ DECLARE_BITMAP(xen_mpumap_mask, MAX_MPU_REGION_NR) \
>  /* EL2 Xen MPU memory region mapping table. */
>  pr_t __cacheline_aligned __section(".data") xen_mpumap[MAX_MPU_REGION_NR];
>  
> +static DEFINE_SPINLOCK(xen_mpumap_lock);
> +
>  static void __init __maybe_unused build_assertions(void)
>  {
>  /*
> @@ -162,6 +165,103 @@ int mpumap_contains_region(pr_t *table, uint8_t 
> nr_regions, paddr_t base,
>  return MPUMAP_REGION_NOTFOUND;
>  }
>  
> +/*
> + * Allocate a new free EL2 MPU memory region, based on bitmap 
> xen_mpumap_mask.
I would clarify that you allocate entry in bitmap, not a region.

> + * @param idx   Set to the index of the allocated EL2 MPU region on success.
> + * @return  0 on success, otherwise -ENOENT on failure.
> + */
> +static int xen_mpumap_alloc_entry(uint8_t *idx)
> +{
> +ASSERT(spin_is_locked(&xen_mpumap_lock));
> +
> +*idx = find_first_zero_bit(xen_mpumap_mask, max_mpu_regions);
> +if ( *idx == max_mpu_regions )
> +{
> +printk(XENLOG_ERR "mpu: EL2 MPU memory region mapping pool 
> exhausted\n");
> +return -ENOENT;
> +}
> +
> +set_bit(*idx, xen_mpumap_mask);
Empty line here please.

> +return 0;
> +}
> +
> +/*
> + * Update the entry in the MPU memory region mapping table (xen_mpumap) for 
> the
> + * given memory range and flags, creating one if none exists.
> + *
> + * @param base  Base address (inclusive).
> + * @param limit Limit address (exclusive).
> + * @param flags Region attributes (a combination of PAGE_HYPERVISOR_XXX)
> + * @return  0 on success, otherwise negative on error.
> + */
> +static int xen_mpumap_update_entry(paddr_t base, paddr_t limit,
> +   unsigned int flags)
> +{
> +uint8_t idx;
> +int rc;
> +
> +ASSERT(spin_is_locked(&xen_mpumap_lock));
> +
> +rc = mpumap_contains_region(xen_mpumap, max_mpu_regions, base, limit, 
> &idx);
> +if ( !(rc == MPUMAP_REGION_NOTFOUND) )
Why not ( rc != MPUMAP_REGION_NOTFOUND )?

> +return -EINVAL;
> +
> +/* We are inserting a mapping => Create new region. */
> +if ( flags & _PAGE_PRESENT )
Wouldn't it make more sense to have this check before calling
mpumap_contains_region()?

> +{
> +rc = xen_mpumap_alloc_entry(&idx);
> +if ( rc )
> +return -ENOENT;
> +
> +xen_mpumap[idx] = pr_of_addr(base, limit, flags);
> +
> +write_protection_region(&xen_mpumap[idx], idx);
> +}
> +
> +return 0;
> +}
> +
> +int xen_mpumap_update(paddr_t base, padd

Re: [XEN PATCH 2/5] iommu: address violation of MISRA C Rule 5.5

2025-07-07 Thread Jan Beulich

On 04.07.2025 22:39, Dmytro Prokopchuk1 wrote:
> Address a violation of MISRA C:2012 Rule 5.5:
> "Identifiers shall be distinct from macro names".
> 
> Reports for service MC3A2.R5.5:
> xen/include/xen/iommu.h: non-compliant struct 'page_list_head'
> xen/include/xen/mm.h: non-compliant macro 'page_list_head'

What is this about? There's no code change that I could see which would
alter the situation here.

> xen/drivers/passthrough/iommu.c: non-compliant macro 'iommu_quarantine'
> xen/include/xen/iommu.h: non-compliant variable 'iommu_quarantine'
> 
> These external variables ('iommu_pt_cleanup_lock'
> and 'iommu_pt_cleanup_list') are no longer used
> in the codebase. Removing them eliminates dead
> code and ensures compliance with coding standards.
> Fixes: b5622eb627 (iommu: remove unused iommu_ops method and tasklet, 
> 2020-09-22)

That's a different Misra rule, so may better be put in a separate patch?
Otherwise please at least mention the rule number as well.

> The variable 'iommu_quarantine' makes sence to use
> only if 'CONFIG_HAS_PCI=y', so place it inside '#ifdef'.

Hmm, yes - not nice, but what do you do. I question "makes sense" though:
Quarantining is possible also without PCI, aiui. Just we don't that right
now.

Jan

[PATCH v6 6/7] xen/riscv: implement setup_irq()

2025-07-07 Thread Oleksii Kurochko

Introduce support for IRQ setup on RISC-V by implementing setup_irq() and
__setup_irq(), adapted and extended from an initial implementation by [1].

__setup_irq() does the following:
  - Sets up an IRQ action.
  - Validates that shared IRQs have non-NULL `dev_id` and are only used when
existing handlers allow sharing.
  - Uses smp_wmb() to enforce memory ordering after assigning desc->action
to ensure visibility before enabling the IRQ.
  - Supports multi-action setups via CONFIG_IRQ_HAS_MULTIPLE_ACTION.

setup_irq() does the following:
  - Converts IRQ number to descriptor and acquires its lock.
  - Rejects registration if the IRQ is already assigned to a guest domain,
printing an error.
  - Delegates the core setup to __setup_irq().
  - On first-time setup, disables the IRQ, routes it to Xen using
intc_route_irq_to_xen(), sets default CPU affinity (current CPU),
calls the handler’s startup routine, and finally enables the IRQ.

irq_set_affinity() invokes set_affinity() callback from the IRQ handler
if present.

Defined IRQ_NO_PRIORITY as default priority used when routing IRQs to Xen.

[1] 
https://gitlab.com/xen-project/people/olkur/xen/-/commit/7390e2365828b83e27ead56b03114a56e3699dd5

Co-developed-by: Romain Caritey 
Signed-off-by: Oleksii Kurochko 
Acked-by: Jan Beulich 
---
Changes in V6:
 - Nothing changed. Only rebase.
---
Changes in V5:
 - Add Acked-by: Jan Beulich .
---
Changes in V3-4:
 - Nothing changed. Only rebase.
---
Changes in V2:
 - Added implenmtation of aplic_set_irq_type() as it is going to be used in
   this commit. And also, update the implementation of it. Make default case
   of switch to do panic().
 - Move all forward declaration up  in asm/irq.h.
 - s/__setup_irq/_setup_irq.
 - Code style fixes.
 - Update commit message.
 - use smp_wmb() instead of smp_mb() in _setup_irq().
 - Drop irq_set_affinity().
 - Use plain C operator instead if {clear,set}_bit() for desc->status as it
   is always used under spinlock().
 - Drop set_bit(_IRQ_DISABLED, &desc->status) in setup_irq() as in the case
   when IRQ is setuped for a first time, desc->status should be already set
   to IRQ_DISABLED in init_one_irq_desc().

 xen/arch/riscv/include/asm/irq.h |  2 +
 xen/arch/riscv/irq.c | 84 
 2 files changed, 86 insertions(+)

diff --git a/xen/arch/riscv/include/asm/irq.h b/xen/arch/riscv/include/asm/irq.h
index 94151eb083..f633636dc3 100644
--- a/xen/arch/riscv/include/asm/irq.h
+++ b/xen/arch/riscv/include/asm/irq.h
@@ -17,6 +17,8 @@
  */
 #define NR_IRQS 1024
 
+#define IRQ_NO_PRIORITY 0
+
 /* TODO */
 #define nr_irqs 0U
 #define nr_static_irqs 0
diff --git a/xen/arch/riscv/irq.c b/xen/arch/riscv/irq.c
index 466f1b4ba9..25d3295002 100644
--- a/xen/arch/riscv/irq.c
+++ b/xen/arch/riscv/irq.c
@@ -7,6 +7,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -63,6 +64,89 @@ int platform_get_irq(const struct dt_device_node *device, 
int index)
 return dt_irq.irq;
 }
 
+static int _setup_irq(struct irq_desc *desc, unsigned int irqflags,
+  struct irqaction *new)
+{
+bool shared = irqflags & IRQF_SHARED;
+
+ASSERT(new != NULL);
+
+/*
+ * Sanity checks:
+ *  - if the IRQ is marked as shared
+ *  - dev_id is not NULL when IRQF_SHARED is set
+ */
+if ( desc->action != NULL && (!(desc->status & IRQF_SHARED) || !shared) )
+return -EINVAL;
+if ( shared && new->dev_id == NULL )
+return -EINVAL;
+
+if ( shared )
+desc->status |= IRQF_SHARED;
+
+#ifdef CONFIG_IRQ_HAS_MULTIPLE_ACTION
+new->next = desc->action;
+#endif
+
+desc->action = new;
+smp_wmb();
+
+return 0;
+}
+
+int setup_irq(unsigned int irq, unsigned int irqflags, struct irqaction *new)
+{
+int rc;
+unsigned long flags;
+struct irq_desc *desc = irq_to_desc(irq);
+bool disabled;
+
+spin_lock_irqsave(&desc->lock, flags);
+
+disabled = (desc->action == NULL);
+
+if ( desc->status & IRQ_GUEST )
+{
+spin_unlock_irqrestore(&desc->lock, flags);
+/*
+ * TODO: would be nice to have functionality to print which domain owns
+ *   an IRQ.
+ */
+printk(XENLOG_ERR "ERROR: IRQ %u is already in use by a domain\n", 
irq);
+return -EBUSY;
+}
+
+rc = _setup_irq(desc, irqflags, new);
+if ( rc )
+goto err;
+
+/* First time the IRQ is setup */
+if ( disabled )
+{
+/* Route interrupt to xen */
+intc_route_irq_to_xen(desc, IRQ_NO_PRIORITY);
+
+/*
+ * We don't care for now which CPU will receive the
+ * interrupt.
+ *
+ * TODO: Handle case where IRQ is setup on different CPU than
+ *   the targeted CPU and the priority.
+ */
+desc->handler->set_affinity(desc, cpumask_of(smp_processor_id()));
+
+desc->handler->startup(desc);
+
+/* Enable irq */
+desc->status &= ~I

[PATCH v6 0/7] riscv: introduce basic UART support and interrupts for hypervisor mode

2025-07-07 Thread Oleksii Kurochko

The patch series introduces basic UART support (in interrupt mode) and support 
of
interrupts for hypervisor mode.

To implement this the following has been added:
- APLIC and IMISC initialization.
- Introduce of intc_hw_operations abstraction.
- Introduce some APLIC and IMSIC operations.
- Introduce init_IRQ(), platform_get_irq() and setup_irq() functions.
- Update do_trap() handler to handle IRQ_S_EXT.
- Introduce some other functions such as: get_s_time(), smp_clear_cpu_maps(),
  ioremap().
- Enable UART. 

CI tests: https://gitlab.com/xen-project/people/olkur/xen/-/pipelines/1910703345
---
Changes in V6:
 - Have been merged to staging:
   [PATCH v5 2/9] xen/riscv: introduce register_intc_ops() and intc_hw_ops.
   [PATCH v4 1/9] xen/riscv: dt_processor_hartid()
 - All other changes are patch-specific. Please check each patch separately.
---
Changes in V5:
 - Update CI tests link.
 - Mostly changes are patch-specific. Please check each patch separately.
---
Changes in V4:
 - Merged to staging:
- xen/riscv: initialize bitmap to zero in riscv_fill_hwcap_from_isa_string()
- xen/asm-generic: introduce asm-generic/irq-dt.h
- xen/riscv: introduce smp_prepare_boot_cpu()
- xen/riscv: introduce support of Svpbmt extension
- add ioremap_*() variants using ioremap_attr()
- xen/riscv: introduce init_IRQ()
- xen/riscv: introduce platform_get_irq()
 - All other changes are patch-specific. Please check each patch separately.
---
Changes in V2:
 - Merged to staging:
xen/riscv: initialize bitmap to zero in riscv_fill_hwcap_from_isa_string()
xen/asm-generic: introduce asm-generic/irq-dt.h
 - All other changes are patch-specific. Please check each patch separately.
---

Oleksii Kurochko (7):
  xen/riscv: imsic_init() implementation
  xen/riscv: aplic_init() implementation
  xen/riscv: introduce intc_init() and helpers
  xen/riscv: implementation of aplic and imsic operations
  xen/riscv: add external interrupt handling for hypervisor mode
  xen/riscv: implement setup_irq()
  xen/riscv: add basic UART support

 xen/arch/riscv/Kconfig |   1 +
 xen/arch/riscv/Makefile|   1 +
 xen/arch/riscv/aplic-priv.h|  38 +++
 xen/arch/riscv/aplic.c | 297 ++
 xen/arch/riscv/imsic.c | 489 +
 xen/arch/riscv/include/asm/aplic.h |  73 +
 xen/arch/riscv/include/asm/imsic.h |  74 +
 xen/arch/riscv/include/asm/intc.h  |  13 +
 xen/arch/riscv/include/asm/irq.h   |   8 +-
 xen/arch/riscv/include/asm/smp.h   |  13 +
 xen/arch/riscv/intc.c  |  46 +++
 xen/arch/riscv/irq.c   | 131 
 xen/arch/riscv/setup.c |  14 +
 xen/arch/riscv/traps.c |  19 ++
 xen/drivers/char/Kconfig   |   3 +-
 15 files changed, 1217 insertions(+), 3 deletions(-)
 create mode 100644 xen/arch/riscv/aplic-priv.h
 create mode 100644 xen/arch/riscv/imsic.c
 create mode 100644 xen/arch/riscv/include/asm/aplic.h
 create mode 100644 xen/arch/riscv/include/asm/imsic.h

-- 
2.50.0

[PATCH v6 4/7] xen/riscv: implementation of aplic and imsic operations

2025-07-07 Thread Oleksii Kurochko

Introduce interrupt controller descriptor for host APLIC to describe
the low-lovel hardare. It includes implementation of the following functions:
 - aplic_irq_startup()
 - aplic_irq_enable()
 - aplic_irq_disable()
 - aplic_set_irq_affinity()

As APLIC is used in MSI mode it requires to enable/disable interrupts not
only for APLIC but also for IMSIC. Thereby for the purpose of
aplic_irq_{enable,disable}() it is introduced imsic_irq_{enable,disable)().

For the purpose of aplic_set_irq_affinity() aplic_get_cpu_from_mask() is
introduced to get hart id.

Also, introduce additional interrupt controller h/w operations and
host_irq_type for APLIC:
 - aplic_host_irq_type

Patch is based on the code from [1].

[1] 
https://gitlab.com/xen-project/people/olkur/xen/-/commit/7390e2365828b83e27ead56b03114a56e3699dd5

Co-developed-by: Romain Caritey 
Signed-off-by: Oleksii Kurochko 
Acked-by: Jan Beulich 
---
Changes in V6:
 - Fix code style issue: start multi-word comments with a capital letter.
 - Align properly comment in imsic_irq_enable().
---
Changes in V5:
 - Add Acked-by: Jan Beulich .
---
Changes in V4:
 - Code style fixes.
 - Add cf_check for hook functions.
 - Use min() macros instead of open-coding it in imsic_local_eix_update().
 - Update the comment above ASSERT() in imsic_irq_disable():
   s/aplic_irq_enable/aplic_irq_disable.
 - Add an explanatory comment above initializing of hhxs in
   aplic_set_irq_affinity().
---
Changes in V3:
 - Update the lock above lock member of struct aplic_priv and imsic_config 
struct.
 - Use spin_{un}lock() in aplic_irq_{enable,disable}() as it is IRQ-safe.
   Also drop local variable 'flags'.
 - Add ASSERT(spin_is_locked(&desc->lock)) to aplic_set_irq_affinity() and
   aplic_set_irq_type().
 - Use an initializer instead of spin_lock_init() for aplic.lock.
 - Drop "(l)" in the comment in imsic_irq_enable() as it doesn't point to
   anything.
 - Use ASSERT(!local_irq_is_enabled()) + spin_lock() in 
imsic_irq_{enable,disable}().
 - Use an initializer instead of spin_lock_init() for imsic_config.lock.
---
Changes in V2:
 - Move imsic_ids_local_delivery() and connected to it parts to the current
   patch to fix compilation issue. Also, add __init for
   imsic_ids_local_delivery().
 - Move introduction of aplic_set_irq_type() and aplic_set_irq_priority()
   to patch [PATCH v1 12/14] xen/riscv: implement setup_irq() where they
   really started to be used.
 - Update the commit message.
 - Drop is_used variable for imsic_cfg and use (aplic.regs->domaincfg & 
APLIC_DOMAINCFG_DM) instead.
 - Use writel() to write to APLIC regs.
 - Drop aplic_irq_shutdown() and use aplic_irq_disable explicitly.
 - Drop local variable cpu in aplic_get_cpu_from_mask():
   Use cpu_online_map instead of cpu_possible_map.
   Remame possible_mask to mask.
 - Code style fixes.
 - Move spin_lock(&aplic.lock) down before write to the register in 
aplic_set_irq_affinity.
 - Make aplic_host_irq_type const.
 - imsic_local_eix_update() updates:
   - move unsigned long isel, ireg; to inner loop.
   - Drop unnecessary parentheses.
   - Optimize inner loop of ireg's setting.
 - Drop aplic_irq_ack() and aplic_host_irq_end() as they do nothing.
 - Rename s/hwirq/irq.
 - Add explanatory comment to imsic_irq_enable() about why there is not -1 for 
IRQ in
   comparison with APLIC's sourcecfg.
 - Use IMSIC_MMIO_PAGE_SHIFT instead of constant 12 in aplic_set_irq_affinity().
 - s/aplic_host_irq_type/aplic_xen_irq_type
 - Drop set/clear of IRQ_DISABLED bit in aplic_{enable,disable}() as guest will 
always
   first request an interrupt and then only an interrupt will be enabled.
   (for example, in Arm, the physical interrupts would be enabled when the
   interrupt is initially routed. This could lead to problem because the guest
   will usually boot with interrupt disabled.)
---
 xen/arch/riscv/aplic-priv.h|   4 +
 xen/arch/riscv/aplic.c | 123 -
 xen/arch/riscv/imsic.c | 122 +++-
 xen/arch/riscv/include/asm/aplic.h |   2 +
 xen/arch/riscv/include/asm/imsic.h |  18 +
 5 files changed, 267 insertions(+), 2 deletions(-)

diff --git a/xen/arch/riscv/aplic-priv.h b/xen/arch/riscv/aplic-priv.h
index 71792bbf76..b70e353ae4 100644
--- a/xen/arch/riscv/aplic-priv.h
+++ b/xen/arch/riscv/aplic-priv.h
@@ -14,6 +14,7 @@
 #ifndef ASM_RISCV_PRIV_APLIC_H
 #define ASM_RISCV_PRIV_APLIC_H
 
+#include 
 #include 
 
 #include 
@@ -27,6 +28,9 @@ struct aplic_priv {
 /* Registers */
 volatile struct aplic_regs __iomem *regs;
 
+/* Lock to protect access to APLIC's registers */
+spinlock_t lock;
+
 /* IMSIC configuration */
 const struct imsic_config *imsic_cfg;
 };
diff --git a/xen/arch/riscv/aplic.c b/xen/arch/riscv/aplic.c
index 0b0101ebc1..edf4db3113 100644
--- a/xen/arch/riscv/aplic.c
+++ b/xen/arch/riscv/aplic.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -23,11 +24,14 @@
 #include 
 #incl

[PATCH v6 3/7] xen/riscv: introduce intc_init() and helpers

2025-07-07 Thread Oleksii Kurochko

Introduce intc_init() to initialize the interrupt controller using the
registered hardware ops.
Also add intc_route_irq_to_xen() to route IRQs to Xen, with support for
setting IRQ type and priority via new internal helpers intc_set_irq_type()
and intc_set_irq_priority().

Call intc_init() to do basic initialization steps for APLIC and IMSIC.

Co-developed-by: Romain Caritey 
Signed-off-by: Oleksii Kurochko 
Acked-by: Jan Beulich 
---
Changes in V5-V6:
 - Nothing changed. Only rebase.
---
Changes in V4:
 - Add Acked-by: Jan Beulich .
---
Changes in V3:
 - Drop ASSERIT(intc_hw_ops) in intc_init().
 - Drop ASSERT(intc_hw_ops) in intc_set_irq_type() and
   intc_set_irq_priority() as intc_init() will be called first and so if it
   won't crash, then intc_hw_ops is registered.
---
Changes in V2:
 - This patch was part of "xen/riscv: Introduce intc_hw_operations abstraction"
   and splitted to have ability to merge patch "xen/riscv: initialize interrupt 
controller"
   to the current patch (where intc_init() call is actually introduced).
 - Add checks of that callbacks aren't set to NULL in intc_set_irq_priority()
   and intc_set_irq_type().
 - add num_irqs member to struct intc_info as it is used now in
   intc_route_irq_to_xen().
 - Add ASSERT(spin_is_locked(&desc->lock)) to intc_set_irq_priority() in
   the case this function will be called outside intc_route_irq_to_xen().
---
 xen/arch/riscv/include/asm/intc.h |  4 +++
 xen/arch/riscv/intc.c | 41 +++
 xen/arch/riscv/setup.c|  2 ++
 3 files changed, 47 insertions(+)

diff --git a/xen/arch/riscv/include/asm/intc.h 
b/xen/arch/riscv/include/asm/intc.h
index 3c4b211f58..a11b7aa55e 100644
--- a/xen/arch/riscv/include/asm/intc.h
+++ b/xen/arch/riscv/include/asm/intc.h
@@ -43,4 +43,8 @@ void intc_preinit(void);
 
 void register_intc_ops(const struct intc_hw_operations *ops);
 
+void intc_init(void);
+
+void intc_route_irq_to_xen(struct irq_desc *desc, unsigned int priority);
+
 #endif /* ASM__RISCV__INTERRUPT_CONTOLLER_H */
diff --git a/xen/arch/riscv/intc.c b/xen/arch/riscv/intc.c
index 1ecd651bf3..f2823267a9 100644
--- a/xen/arch/riscv/intc.c
+++ b/xen/arch/riscv/intc.c
@@ -1,9 +1,12 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 
 #include 
+#include 
 #include 
 #include 
+#include 
 #include 
+#include 
 
 #include 
 
@@ -21,3 +24,41 @@ void __init intc_preinit(void)
 else
 panic("ACPI interrupt controller preinit() isn't implemented\n");
 }
+
+void __init intc_init(void)
+{
+if ( intc_hw_ops->init() )
+panic("Failed to initialize the interrupt controller drivers\n");
+}
+
+/* desc->irq needs to be disabled before calling this function */
+static void intc_set_irq_type(struct irq_desc *desc, unsigned int type)
+{
+ASSERT(desc->status & IRQ_DISABLED);
+ASSERT(spin_is_locked(&desc->lock));
+ASSERT(type != IRQ_TYPE_INVALID);
+
+if ( intc_hw_ops->set_irq_type )
+intc_hw_ops->set_irq_type(desc, type);
+}
+
+static void intc_set_irq_priority(struct irq_desc *desc, unsigned int priority)
+{
+ASSERT(spin_is_locked(&desc->lock));
+
+if ( intc_hw_ops->set_irq_priority )
+intc_hw_ops->set_irq_priority(desc, priority);
+}
+
+void intc_route_irq_to_xen(struct irq_desc *desc, unsigned int priority)
+{
+ASSERT(desc->status & IRQ_DISABLED);
+ASSERT(spin_is_locked(&desc->lock));
+/* Can't route interrupts that don't exist */
+ASSERT(intc_hw_ops && desc->irq < intc_hw_ops->info->num_irqs);
+
+desc->handler = intc_hw_ops->host_irq_type;
+
+intc_set_irq_type(desc, desc->arch.type);
+intc_set_irq_priority(desc, priority);
+}
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index 8bcd19218d..0e7398159c 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -134,6 +134,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
 
 intc_preinit();
 
+intc_init();
+
 printk("All set up\n");
 
 machine_halt();
-- 
2.50.0

[PATCH v6 1/7] xen/riscv: imsic_init() implementation

2025-07-07 Thread Oleksii Kurochko

imsic_init() is introduced to parse device tree node, which has the following
bindings [2], and based on the parsed information update IMSIC configuration
which is stored in imsic_cfg.

The following helpers are introduces for imsic_init() usage:
  - imsic_parse_node() parses IMSIC node from DTS
  - imsic_get_parent_hartid() returns the hart ( CPU ) ID of the given device
tree node.

This patch is based on the code from [1].

Since Microchip originally developed imsic.{c,h}, an internal discussion with
them led to the decision to use the MIT license.

[1] 
https://gitlab.com/xen-project/people/olkur/xen/-/commit/0b1a94f2bc3bb1a81cd26bb75f0bf578f84cb4d4
[2] 
https://elixir.bootlin.com/linux/v6.12/source/Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml

Co-developed-by: Romain Caritey 
Signed-off-by: Oleksii Kurochko 
---
Changes in V6:
 - Code style fixes.
 - s/xfree/xvfree.
 - (*nr_mmios)++ -> ++*nr_mmios.
 - Use %u for unsigned int arguments.
 - Change name of local variable cpuid to cpu.
---
Changes in V5:
 - Drop trailing underscore for an argument of IMSIC_HART_SIZE macros.
 - Avoid wrapping format strings across lines.
 - Use 'unsigned int' for cpu variable inside imsic_init().
 - Use IMSIC_HART_SIZE() instead of open-code it.
 - s/msi[cpu].base_addr /mmios[].base_addr for the check which checks
   that MMIO addres is properly aligned.
 - s/XFREE/xvfree.
 - Drop zero-ing of  msi[cpu].{offset,base_addr} as msi[]  is zero-ed
   when is allocated and cpu id can't be found twice.
 - Add check to vefiry that CPU won't be found twice in interrupt-extended
   property of IMSIC node.
---
Changes in V4:
 - s/expected/intended in the comment above imsic_get_config().
 - [imsic_parse_node()] s/irq_num can be 0/irq_num can't be 0 in panic()
   message.
 - [imsic_parse_node()] Move "if ( irq_range[1] == IRQ_M_EXT )" after 'for loop'
   which checks interrupts-extended property.
 - [imsic_parse_node()] s/%d/%u for logging unsigned values.
 - [imsic_parse_node()] drop redundant check "(imsic_cfg.nr_ids & IMSIC_MIN_ID) 
!= IMSIC_MIN_ID)".
 - [imsic_parse_node()] free irq_range much earlier and simlify error paths.
 - [imsic_parse_node()] s/-EINVAL/-ENOENT in the case of incorrect value for
   riscv,group-index-shift and id number.
 - [[imsic_parse_node()] s/xzalloc_array/xvzalloc_array.
 - s/xen_cpuid/cpu.
 - Identation fix.
 - use IMSIC_MMIO_PAGE_SZ instead of PAGE_SZ to check if interrupt file base
   addr is properly aligned.
 - s/ASM__RISCV__IMSIC_H/ASM_RISCV_IMSIC_H.
 - Drop *mmios from struct imsic_cfg as it is used only by imsic_init().
 - Drop cpus[] form struct imsic_mmios as it isn't really used.
 - Update declaration of hartid_to_cpuid() to return unsigned int instead of
   NR_CPUs as processor_id is in range [0, NR_CPUS) and NR_CPUs is less then
   unsigned int.
 - Calculate hart_index_bits as fls(*nr_parent_irqs - 1) to cover the case if
   nr_parent_irqs is a power of two.
 - Check an MMIO's size for IMSIC node.
---
Changes in V3:
 - Drop year in imsic.h in copyrights.
 - Correct identation in imsic_parse_node() and imsic_init()
   where for imsic_cfg.base_addr a mask is applied.
 - Use unsigned int istead of uint32_t for local variable nr_parent_irqs,
   index, nr_handlers in imsic_init().
 - Fix a leakage of ealiers successfull allocations in case if imsic_init()
   returns with an error.
 - Excess blank in printk() message: "%s: unable to parse MMIO regset %d\n".
 - Introduce hartid_to_cpuid() and use it in the check:
 if ( hardid_to_cpuid(cpuid) >= num_possible_cpus() )
   in imsic_init().
 - Use "%u" for unsigned int in printk(...).
 - Fix for loop condition: nr_mmios -> "j < nr_mmios".
 - [imsic_init()] Drop usage of j in nested loop. It is enough to have only
   index.
 - Change 0x%lx to %#lx for formatting of an address in printk().
 - [imsic_init()] Rename local variable cpuid to hartid.
 - s/imsic_get_parent_cpuid/imsic_get_parent_hartid, s/cpuid/hartid for an
   imsic_get_parent_hartid()'s argument.
 - Declare cpus member of struct imsic_mmios as cpumask_t.
 - [imsic_init()] Allocate imsic_mmios.cpus by using of alloc_cpumask_var().
 - [imsic_init()] Use cpumask_set_cpu() instead of bitmap_set().
 - [imsic_parse_node()] add check for correctness of "interrupt-extended" 
property.
 - [imsic_parse_node()] Use dt_node_name() or dt_full_node_name() instead of
   derefence of struct dt_node.
 - [imsic_parse_node()] Add cleanup label and update 'rc = XXX; goto cleanup'
   instead of 'return rc' as now we have to cleanup dynamically allocated 
irq_range
   array.
 - Add comments above imsic_init() and imsic_parse_node().
 - s/xen/arch/riscv/imsic.h/xen/arch/riscv/include/asm/imsic.h in the comment of
   imsic.h.
---
Changes in V2:
 - Drop years in copyrights.
 - s/riscv_of_processor_hartid/dt_processor_cpuid.
 - s/imsic_get_parent_hartid/imsic_get_parent_hartid.
   Rename argument hartid to cpuid.
   Make node argument const.
   Return res instead of -EINVAL for the f

[PATCH v6 5/7] xen/riscv: add external interrupt handling for hypervisor mode

2025-07-07 Thread Oleksii Kurochko

Implement functions necessarry to have working external interrupts in
hypervisor mode. The following changes are done:
  - Add a common function intc_handle_external_irq() to call APLIC specific
function to handle an interrupt.
  - Update do_trap() function to handle IRQ_S_EXT case; add the check to catch
case when cause of trap is an interrupt.
  - Add handle_interrrupt() member to intc_hw_operations structure.
  - Enable local interrupt delivery for IMSIC by calling of
imsic_ids_local_delivery() in imsic_init(); additionally introduce helper
imsic_csr_write() to update IMSIC_EITHRESHOLD and IMSIC_EITHRESHOLD.
  - Enable hypervisor external interrupts.
  - Implement aplic_handler_interrupt() and use it to init ->handle_interrupt
member of intc_hw_operations for APLIC.
  - Add implementation of do_IRQ() to dispatch the interrupt.

The current patch is based on the code from [1].

[1] 
https://gitlab.com/xen-project/people/olkur/xen/-/commit/7390e2365828b83e27ead56b03114a56e3699dd5

Co-developed-by: Romain Caritey 
Signed-off-by: Oleksii Kurochko 
Acked-by: Jan Beulich 
---
Changes in V6:
 - Fix code style issue: start multi-word comments with a capital letter.
 - Align properly comment in aplic_set_irq_type().
---
Changes in V4 - V5:
 - Nothing changed. Only rebase.
---
Changes in V3:
 - Add ASSERT(spin_is_locked(&desc->lock)) to aplic_set_irq_type().
 - Fix code style for switch () in aplic_set_irq_type().
 - Drop fallthrough between "case IRQ_TYPE_NONE: case IRQ_TYPE_INVALID:" as 
there
   is no other statements in "case IRQ_TYPE_NONE".
 - Add Acked-by: Jan Beulich .
---
Changes in V2:
 - use BIT() macros instead of 1UL << bit_num in aplic.c.
 - Drop passing of a cause to aplic_handle_interrupt() function. And update
   prototype of handle_interrupt() callback.
 - Drop ASSERT() in intc_handle_external_irqs() as it is useless.
 - Code style fixes.
 - Drop cause argument for intc_handle_external_irqs().
 - Update the commit message: drop words that imsic_ids_local_delivery() is
   implemented in this patch, it is only called.
 - Add cf_check for aplic_handle_interrupt(), aplic_set_irq_type().
 - Move forward declarations in asm/intc.h up.
 - Use plain C operator instead if {clear,set}_bit() for desc->status as it
   is always used under spinlock().
 - use writel() for writing to APLIC's sourcecfg in aplic_set_irq_type().
---
 xen/arch/riscv/aplic.c | 73 ++
 xen/arch/riscv/include/asm/aplic.h |  7 +++
 xen/arch/riscv/include/asm/imsic.h |  1 +
 xen/arch/riscv/include/asm/intc.h  |  6 +++
 xen/arch/riscv/include/asm/irq.h   |  6 ++-
 xen/arch/riscv/intc.c  |  5 ++
 xen/arch/riscv/irq.c   | 47 +++
 xen/arch/riscv/traps.c | 19 
 8 files changed, 163 insertions(+), 1 deletion(-)

diff --git a/xen/arch/riscv/aplic.c b/xen/arch/riscv/aplic.c
index edf4db3113..739e8dab34 100644
--- a/xen/arch/riscv/aplic.c
+++ b/xen/arch/riscv/aplic.c
@@ -9,6 +9,7 @@
  * Copyright (c) 2024-2025 Vates
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -227,6 +228,73 @@ static void cf_check aplic_set_irq_affinity(struct 
irq_desc *desc, const cpumask
 spin_unlock(&aplic.lock);
 }
 
+static void cf_check aplic_handle_interrupt(struct cpu_user_regs *regs)
+{
+/* Disable to avoid more external interrupts */
+csr_clear(CSR_SIE, BIT(IRQ_S_EXT, UL));
+
+/* Clear the pending bit */
+csr_clear(CSR_SIP, BIT(IRQ_S_EXT, UL));
+
+/* Dispatch the interrupt */
+do_IRQ(regs, csr_swap(CSR_STOPEI, 0) >> TOPI_IID_SHIFT);
+
+/* Enable external interrupts */
+csr_set(CSR_SIE, BIT(IRQ_S_EXT, UL));
+}
+
+static void cf_check aplic_set_irq_type(struct irq_desc *desc,
+unsigned int type)
+{
+/*
+ * Interrupt 0 isn't possible based on the spec:
+ *   Each of an APLIC???s interrupt sources has a fixed unique identity
+ *   number in the range 1 to N, where N is the total number of sources at
+ *   the APLIC. The number zero is not a valid interrupt identity number at
+ *   an APLIC. The maximum number of interrupt sources an APLIC may support
+ *   is 1023.
+ *
+ * Thereby interrupt 1 will correspond to bit 0 in sourcecfg[] register,
+ * interrupt 2 ->sourcecfg[1] and so on.
+ *
+ * And that is the reason why we need -1.
+ */
+unsigned int irq_bit = desc->irq - 1;
+
+ASSERT(spin_is_locked(&desc->lock));
+
+spin_lock(&aplic.lock);
+
+switch ( type )
+{
+case IRQ_TYPE_EDGE_RISING:
+writel(APLIC_SOURCECFG_SM_EDGE_RISE, &aplic.regs->sourcecfg[irq_bit]);
+break;
+
+case IRQ_TYPE_EDGE_FALLING:
+writel(APLIC_SOURCECFG_SM_EDGE_FALL, &aplic.regs->sourcecfg[irq_bit]);
+break;
+
+case IRQ_TYPE_LEVEL_HIGH:
+writel(APLIC_SOURCECFG_SM_LEVEL_HIGH, &aplic.regs->sourcecfg[irq_bit]);
+break;
+
+case IRQ_TYPE_LEVEL_LOW:
+writel(APLIC_SOURCE

[PATCH v6 2/7] xen/riscv: aplic_init() implementation

2025-07-07 Thread Oleksii Kurochko

aplic_init() function does the following few things:
 - checks that IMSIC in device tree node  ( by checking msi-parent property
   in APLIC node ) is present as current one implmenetaion of AIA is
   supported only MSI method.
 - initialize IMSIC based on IMSIC device tree node
 - Read value of APLIC's paddr start/end and size.
 - Map aplic.regs
 - Setup APLIC initial state interrupts (disable all interrupts, set
   interrupt type and default priority, confgifure APLIC domaincfg) by
   calling aplic_init_hw_interrutps().

aplic_init() is based on the code from [1] and [2].

Since Microchip originally developed aplic.c, an internal discussion with
them led to the decision to use the MIT license.

[1] 
https://gitlab.com/xen-project/people/olkur/xen/-/commit/7cfb4bd4748ca268142497ac5c327d2766fb342d
[2] 
https://gitlab.com/xen-project/people/olkur/xen/-/commit/392a531bfad39bf4656ce8128e004b241b8b3f3e

Co-developed-by: Romain Caritey 
Signed-off-by: Oleksii Kurochko 
---
Changes in V6:
 - Code style fixes.
 - Shorten message string in aplic_init().
 - Add hex offset for APLIC regs in struct aplic_regs.
---
Changes in V5:
 - Fix a typo in panic massege in aplic_init(): s/Failded/Failed.
 - Correct size check: (!IS_ALIGNED(size, KB(4)) && (size < KB(16))) ->
   (!IS_ALIGNED(size, KB(4)) || (size < KB(16))).
 - Avoid wrapping format strings across lines.
---
Changes in V4:
 - s/ASM__RISCV_PRIV_APLIC_H/ASM_RISCV_PRIV_APLIC_H
 - Drop APLIC_MAX_NUM_WIRED_IRQ_SOURCES and use 
ARRAY_SIZE(aplic.regs->sourcecfg).
 - Don't use 'else if' for the case when the earlier if() ends in an
   unconditional control flow change.
 - Use const for aplic_ops and drop __ro_after_init.
 - Add checks of APLIC's memory-mapped control region physical address and
   size.
 - s/ASM__RISCV__APLIC_H/ASM_RISCV_APLIC_H.
 - Use unsigned int in defintion of APLIC_DOMAINCFG_IE and APLIC_DOMAINCFG_DM.
 - set aplic_info.num_irqs to ARRAY_SIZE(aplic.regs->sourcecfg) if DT node
   provides too much.
 - Add some constraints for aplic.paddr_start and aplic.size.
---
Changes in V3:
 - Correct the comment on top of aplic-priv.h:
 xen/arch/riscv/aplic.h -> .../aplic-priv.h
 - Add __iomem for regs member of aplic_priv structure.
 - [aplic_init_hw_interrupts] Use ~0U instead of -1U in 
aplic_init_hw_interrupts()
   to disable all interrupts.
 - [aplic_init_hw_interrupts] Start 'i' (for-cycle variable) from 0, not from 1.
 - [aplic_init()] Declare imsic_node as pointer-to-const.
 - Use decimal for arrays in struct aplic_regs.
 - [aplic_init()] Check that aplic_info.num_irqs are less then 1023.
 - [aplic_init()] Drop out check of IMSIC's node interrupt-extended property
   from aplic_init().
---
Changes in V2:
 - use __ro_after_init for aplic_ops.
 - s/nr_irqs/num_irqs.
 - s/dt_processor_hartid/dt_processor_cpuid.
 - Drop confusing comment in aplic_init_hw_interrupts().
 - Code style fixes.
 - Drop years for Copyright.
 - Revert changes which drop nr_irq define from asm/irq.h,
   it shouldn't be, at least, part of this patch.
 - Drop paddr_enf from struct aplic_regs. It is enough to have pair
   (paddr_start, size).
 - Make  struct aplic_priv of asm/aplic.h private by moving it to
   riscv/aplic-priv.h.
 - Add the comment above the initialization of APLIC's target register.
 - use writel() to access APLIC's registers.
 - use 'unsinged int' for local variable i in aplic_init_hw_interrupts.
 - Add the check that all modes in interrupts-extended property of
   imsic node are equal. And drop rc != EOVERFLOW when interrupts-extended
   property is read.
 - Add cf_check to aplic_init().
 - Fix a cycle of clrie register initialization in aplic_init_hw_interrupts().
   Previous implementation leads to out-of-boundary.
 - Declare member num_irqs in struct intc_info as it is used by APLIC code.
---
 xen/arch/riscv/aplic-priv.h|  34 ++
 xen/arch/riscv/aplic.c | 103 +
 xen/arch/riscv/include/asm/aplic.h |  64 ++
 xen/arch/riscv/include/asm/intc.h  |   3 +
 4 files changed, 204 insertions(+)
 create mode 100644 xen/arch/riscv/aplic-priv.h
 create mode 100644 xen/arch/riscv/include/asm/aplic.h

diff --git a/xen/arch/riscv/aplic-priv.h b/xen/arch/riscv/aplic-priv.h
new file mode 100644
index 00..71792bbf76
--- /dev/null
+++ b/xen/arch/riscv/aplic-priv.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: MIT */
+
+/*
+ * xen/arch/riscv/aplic-priv.h
+ *
+ * Private part of aplic.h header.
+ *
+ * RISC-V Advanced Platform-Level Interrupt Controller support
+ *
+ * Copyright (c) Microchip.
+ * Copyright (c) Vates.
+ */
+
+#ifndef ASM_RISCV_PRIV_APLIC_H
+#define ASM_RISCV_PRIV_APLIC_H
+
+#include 
+
+#include 
+#include 
+
+struct aplic_priv {
+/* Base physical address and size */
+paddr_t paddr_start;
+size_t  size;
+
+/* Registers */
+volatile struct aplic_regs __iomem *regs;
+
+/* IMSIC configuration */
+const struct imsic_config *imsic_cfg;
+};
+
+#endif /* ASM_RISCV_PR

[PATCH v6 7/7] xen/riscv: add basic UART support

2025-07-07 Thread Oleksii Kurochko

Update Kconfig to select GENERIC_UART_INIT for basic UART init ( find a dt node
and call device specific device_init() ).

Drop `default n if RISCV` statement for config HAS_NS16550 as now ns16550 is
ready to be compiled and used by RISC-V. Also, make the config user selectable
for everyone except X86.

Initialize a minimal amount of stuff to have UART and Xen console:
 - Initialize uart by calling uart_init().
 - Initialize Xen console by calling console_init_{pre,post}irq().
 - Initialize timer and its internal lists which are used by
   init_timer() which is called by ns16550_init_postirq(); otherwise
   "Unhandled exception: Store/AMO Page Fault" occurs.
 - Enable local interrupt to recieve an input from UART

Signed-off-by: Oleksii Kurochko 
Acked-by: Jan Beulich 
---
Changes in V4-V6:
 - Nothing changed. Only rebase.
---
Changes in v3:
 - Drop inclusion of  as nothing in setup.c requires it.
 - Add Acked-by: Jan Beulich .
---
Changes in v2:
 - Drop #include  in setup.c, isn't needed anymore.
 - Drop call of percpu_init_areas() as it was needed when I used polling
   mode for UART,  for this case percpu is used to receive serial port info:
 struct serial_port *port = this_cpu(poll_port);
   So percpu isn't really needed at the current development state.
 - Make HAS_NS16550 user selectable for everyone, except X86.
 - Update the commit message.
---
 xen/arch/riscv/Kconfig   |  1 +
 xen/arch/riscv/setup.c   | 12 
 xen/drivers/char/Kconfig |  3 +--
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/xen/arch/riscv/Kconfig b/xen/arch/riscv/Kconfig
index 62c5b7ba34..96bef90751 100644
--- a/xen/arch/riscv/Kconfig
+++ b/xen/arch/riscv/Kconfig
@@ -2,6 +2,7 @@ config RISCV
def_bool y
select FUNCTION_ALIGNMENT_16B
select GENERIC_BUG_FRAME
+   select GENERIC_UART_INIT
select HAS_DEVICE_TREE
select HAS_PMAP
select HAS_UBSAN
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index 0e7398159c..a17096bf02 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -4,12 +4,15 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -134,8 +137,17 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
 
 intc_preinit();
 
+uart_init();
+console_init_preirq();
+
 intc_init();
 
+timer_init();
+
+local_irq_enable();
+
+console_init_postirq();
+
 printk("All set up\n");
 
 machine_halt();
diff --git a/xen/drivers/char/Kconfig b/xen/drivers/char/Kconfig
index e6e12bb413..8e49a52c73 100644
--- a/xen/drivers/char/Kconfig
+++ b/xen/drivers/char/Kconfig
@@ -2,8 +2,7 @@ config GENERIC_UART_INIT
bool
 
 config HAS_NS16550
-   bool "NS16550 UART driver" if ARM
-   default n if RISCV
+   bool "NS16550 UART driver" if !X86
default y
help
  This selects the 16550-series UART support. For most systems, say Y.
-- 
2.50.0

Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()

2025-07-07 Thread Jan Beulich

On 07.07.2025 13:46, Oleksii Kurochko wrote:
> On 7/7/25 9:20 AM, Jan Beulich wrote:
>> On 04.07.2025 17:01, Oleksii Kurochko wrote:
>>> On 7/1/25 3:49 PM, Jan Beulich wrote:
 On 10.06.2025 15:05, Oleksii Kurochko wrote:
> This patch introduces p2m_set_entry() and its core helper 
> __p2m_set_entry() for
> RISC-V, based loosely on the Arm implementation, with several 
> RISC-V-specific
> modifications.
>
> Key differences include:
> - TLB Flushing: RISC-V allows caching of invalid PTEs and does not require
> break-before-make (BBM). As a result, the flushing logic is 
> simplified.
> TLB invalidation can be deferred until p2m_write_unlock() is called.
> Consequently, the p2m->need_flush flag is always considered true and 
> is
> removed.
> - Page Table Traversal: The order of walking the page tables differs from 
> Arm,
> and this implementation reflects that reversed traversal.
> - Macro Adjustments: The macros P2M_ROOT_LEVEL, P2M_ROOT_ORDER, and
> P2M_ROOT_PAGES are updated to align with the new RISC-V 
> implementation.
>
> The main functionality is in __p2m_set_entry(), which handles mappings 
> aligned
> to page table block entries (e.g., 1GB, 2MB, or 4KB with 4KB granularity).
>
> p2m_set_entry() breaks a region down into block-aligned mappings and calls
> __p2m_set_entry() accordingly.
>
> Stub implementations (to be completed later) include:
> - p2m_free_entry()
 What would a function of this name do?
>>> Recursively visiting all leaf PTE's for sub-tree behind an entry, then calls
>>> put_page() (which will free if there is no any reference to this page),
>>> freeing intermediate page table (after all entries were freed) by removing
>>> it from d->arch.paging.freelist, and removes correspondent page of 
>>> intermediate page
>>> table from p2m->pages list.
>>>
 You can clear entries, but you can't
 free them, can you?
>>> Is is a question regarding terminology?
>> Yes. If one sees a call to a function, it should be possible to at least
>> roughly know what it does without needing to go look at the implementation.
>>
>>> I can't free entry itself, but a page table or
>>> a page (if it is a leaf entry) on which it points could free.
>> Then e.g. pte_free_subtree() or some such?
> 
> It sounds fine to me. I'll use suggested name.
> 
> Just want to notice that other arches also have the same function
> for the same purpose with the same name.

As to x86, it's not general P2M code which uses this odd (for the purpose)
name, but only p2m-pt.c.

> Does it make sense then to change a name for all arches?

I think so.

> +{
> +panic("%s: isn't implemented for now\n", __func__);
> +
> +return false;
> +}
 For this function in particular, though: Besides the "p2me" in the name
 being somewhat odd (supposedly page table entries here are simply pte_t),
 how is this going to be different from pte_is_valid()?
>>> pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is 
>>> checking
>>> what is a type stored in the radix tree (p2m->p2m_types):
>>> /*
>>>  * In the case of the P2M, the valid bit is used for other purpose. Use
>>>  * the type to check whether an entry is valid.
>>>  */
>>> static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>> {
>>> return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>> }
>>>
>>> It is done to track which page was modified by a guest.
>> But then (again) the name doesn't convey what the function does.
> 
> Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would 
> better.

For P2M type checks please don't invent new naming, but use what both x86
and Arm are already using. Note how we already have p2m_is_valid() in that
set. Just that it's not doing what you want here.

>>   Plus
>> can't a guest also arrange for an entry's type to move to p2m_invalid?
>> That's then still an entry that was modified by the guest.
> 
> I am not really sure that I fully understand the question.
> Do you ask if a guest can do something which will lead to a call of 
> p2m_set_entry()
> with p2m_invalid argument?

That I'm not asking, but rather stating. I.e. I expect such is possible.

> If yes, then it seems like it will be done only in case of 
> p2m_remove_mapping() what
> will mean that alongside with p2m_invalid INVALID_MFN will be also passed, 
> what means
> this entry isn't expected to be used anymore.

Right. But such an entry would still have been "modified" by the guest.

>> Overall I think I'm lacking clarity what you mean to use this predicate
>> for.
> 
> By using of "p2me_" predicate I wanted to express that not PTE's valid bit 
> will be
> checked, but the type saved in radix tree will be used.
> As suggested above probably it will be better drop "e" too and just use 
> p2m_type_is_valid().

See above regardi

Re: [PATCH 3/3] hvmloader: add new SMBIOS tables (7,8,9,26,27,28)

2025-07-07 Thread Jan Beulich

On 05.07.2025 01:48, Petr Beneš wrote:
> On Wed, Jul 2, 2025 at 9:15 AM Jan Beulich  wrote:
>>
>> On 02.07.2025 01:45, Petr Beneš wrote:
>>> From: Petr Beneš 
>>
>> This isn't in line with the first S-o-b, nor with the fact that in the cover
>> letter you say this was previously submitted (and hence authored?) by Anton.
> 
> Can you please point me to the right direction? I have no idea what
> tags should I specify here.

Well, in the common case the original author would never change, and it would
be their S-o-b that remains first forever. Anything else would need explaining.

Jan

Re: [PATCH v2 4/5] xen/arm: Implement standard PV time interface as per ARM DEN 0057A

2025-07-07 Thread Jan Beulich

On 05.07.2025 16:27, Koichiro Den wrote:
> --- a/xen/include/xen/macros.h
> +++ b/xen/include/xen/macros.h
> @@ -5,6 +5,7 @@
>  #define ROUNDDOWN(x, a) ((x) & ~((a) - 1))
>  
>  #define IS_ALIGNED(val, align) (!((val) & ((align) - 1)))
> +#define IS_POWER_OF_TWO(val)   ((val) && !((val) & ((val) - 1)))
>  
>  #define DIV_ROUND(n, d) (((n) + (d) / 2) / (d))
>  #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

I understand surrounding code has similar issues, but I wonder if we wouldn't
be better off requiring additions to this file to be clean towards arbitrary
use of the macros. That is in particular to make sure macro arguments are
evaluated exactly once. REST maintainers in particular - thoughts?

Jan

Re: [PATCH v3 08/22] x86/slaunch: restore boot MTRRs after Intel TXT DRTM

2025-07-07 Thread Jan Beulich

On 06.07.2025 23:55, Sergii Dmytruk wrote:
> On Wed, Jul 02, 2025 at 05:11:26PM +0200, Jan Beulich wrote:
>> On 30.05.2025 15:17, Sergii Dmytruk wrote:
>>> @@ -442,6 +444,9 @@ static uint64_t __init mtrr_top_of_ram(void)
>>>  ASSERT(paddr_bits);
>>>  addr_mask = ((1ULL << paddr_bits) - 1) & PAGE_MASK;
>>>
>>> +if ( slaunch_active )
>>> +txt_restore_mtrrs(e820_verbose);
>>
>> How did you pick this call site? Besides it being unrelated to the purpose of
>> the function, we may not even make it here (e.g. when "e820-mtrr-clip=no" on
>> the command line). Imo this wants to go in the caller of this function,
>> machine_specific_memory_setup(). Or you want to reason about this placement 
>> in
>> the description.
> 
> I think the motivation behind using this location was related to
> printing debug information in a fitting place.  The possible early exit
> definitely ruins everything here, thanks.  Will move at least to the
> caller.

Of course debug info logging will want to remain somewhat sensible. So, just
to mention, besides moving the call site there is also the option of simply
explaining why it needs to be here.

>>> --- a/xen/arch/x86/include/asm/intel-txt.h
>>> +++ b/xen/arch/x86/include/asm/intel-txt.h
>>> @@ -426,6 +426,9 @@ void txt_map_mem_regions(void);
>>>  /* Marks TXT-specific memory as used to avoid its corruption. */
>>>  void txt_reserve_mem_regions(void);
>>>
>>> +/* Restores original MTRR values saved by a bootloader before starting 
>>> DRTM. */
>>> +void txt_restore_mtrrs(bool e820_verbose);
>>
>> This parameter name is, when the header is used from e820.c, going to shadow 
>> the
>> static variable of the same name there. Misra objects to such. But the 
>> parameter
>> doesn't really have a need for having the e820_ prefix, does it?
> 
> I don't think it's possible for a parameter in a declaration to shadow
> anything, but the prefix is indeed unnecessary.

Considering that without a prior forward decl

void test(struct s);

gains a private declaration of struct s (scope being just the function decl),
I expect the same applies to the shadowing caused by parameter names.

Jan

Re: [PATCH v6] automation/eclair: update configuration of D4.10

2025-07-07 Thread Anthony PERARD

On Mon, Jun 23, 2025 at 06:19:27PM -0700, Stefano Stabellini wrote:
> diff --git a/xen/include/xen/compile.h.in b/xen/include/xen/compile.h.in
> index 3151d1e7d1..9206341ba6 100644
> --- a/xen/include/xen/compile.h.in
> +++ b/xen/include/xen/compile.h.in
> @@ -1,3 +1,6 @@
> +#ifndef XEN_COMPILE_H
> +#define XEN_COMPILE_H
> +
>  #define XEN_COMPILE_DATE "@@date@@"
>  #define XEN_COMPILE_TIME "@@time@@"
>  #define XEN_COMPILE_BY   "@@whoami@@"
> diff --git a/xen/tools/process-banner.sed b/xen/tools/process-banner.sed
> index 56c76558bc..4cf3f9a116 100755
> --- a/xen/tools/process-banner.sed
> +++ b/xen/tools/process-banner.sed
> @@ -12,3 +12,8 @@ s_(.*)_"\1\\n"_
>  
>  # Trailing \ on all but the final line.
>  $!s_$_ \\_
> +
> +# Append closing header guard
> +$a\
> +\
> +#endif /* XEN_COMPILE_H */

Is it wise to put the closing header guard in a file call
"process-banner" ? It's not call compile.h-footer.sed.

There's a few way to make this better:
- simple add the header guard from the Makefile, both opening and
  closing.
- Do some more sed with something like:
  sed -rf process-banner.sed < .banner >> .banner.processed.tmp
  sed -e 's/@@date@@/$(XEN_BUILD_DATE)/g' \
  ... \
  -e '/XEN_BANNER/r .banner.processed.tmp'
  # and having the closing header guard in "compile.h.in"
  This will add the outpot of process-banner.sed in the lines after
  "#define XEN_BANNER", and so before the closing header guard.
- rename the sed command file
(- a forth option would be to use filechk make macro, but the check for
 if [ ! -r $@ -o -O $@ ] would be annoying to reproduce.)

Another thing, this could be done in a patch that isn't called
"automation/eclair: update configuration of D4.10".

Cheers,

-- 

Anthony Perard | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech

Re: [PATCH 2/2] xen/arm: Skip loops in init_pdx() when no PDX compression is used

2025-07-07 Thread Orzel, Michal




On 07/07/2025 13:56, Hari Limaye wrote:
> Hi Michal,
> 
>> On Fri, Jul 04, 2025 at 09:54:28AM +, Michal Orzel wrote:
>> When CONFIG_PDX_COMPRESSION=n, pdx_init_mask(), pdx_region_mask() and
>> pfn_pdx_hole_setup() are just stubs doing nothing. It does not make
>> sense to keep the two loops iterating over all the memory banks.
>>
>> Signed-off-by: Michal Orzel 
>> ---
>>  xen/arch/arm/setup.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
>> index 93b730ffb5fb..12b76a0a9837 100644
>> --- a/xen/arch/arm/setup.c
>> +++ b/xen/arch/arm/setup.c
>> @@ -255,7 +255,9 @@ void __init init_pdx(void)
>>  {
>>  const struct membanks *mem = bootinfo_get_mem();
>>  paddr_t bank_start, bank_size, bank_end, ram_end = 0;
>> +int bank;
> 
> Minor/question: Would we potentially prefer to move the declaration of
> the loop counter `bank` variables to each for loop? As the variable is
> not used outside of the loops and is always initialized to 0, this seems
> perhaps better from a variable scope perspective?
Maybe it does but it was too minor for me to be willing to incorporate it in
this patch whose goal was not to reduce the variable scope (I think adding such
change would not fit the purpose of the patch).

> 
>>  
>> +#ifdef CONFIG_PDX_COMPRESSION
>>  /*
>>   * Arm does not have any restrictions on the bits to compress. Pass 0 to
>>   * let the common code further restrict the mask.
>> @@ -264,7 +266,6 @@ void __init init_pdx(void)
>>   * update this function too.
>>   */
>>  uint64_t mask = pdx_init_mask(0x0);
>> -int bank;
>>  
>>  for ( bank = 0 ; bank < mem->nr_banks; bank++ )
>>  {
>> @@ -284,6 +285,7 @@ void __init init_pdx(void)
>>  }
>>  
>>  pfn_pdx_hole_setup(mask >> PAGE_SHIFT);
>> +#endif
>>  
>>  for ( bank = 0 ; bank < mem->nr_banks; bank++ )
>>  {
> 
> Otherwise, LGTM! Tested (compilation) via both Arm AArch32 and AArch64 builds.
> 
> Many thanks,
> Hari

Thanks.

~Michal

Re: [PATCH v7 2/7] tools/xl: Add altp2m_count parameter

2025-07-07 Thread Jan Beulich

On 01.07.2025 21:54, Petr Beneš wrote:
> --- a/tools/libs/light/libxl_create.c
> +++ b/tools/libs/light/libxl_create.c
> @@ -421,6 +421,15 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc,
>  return -ERROR_INVAL;
>  }
>  
> +if (b_info->altp2m_count == LIBXL_ALTP2M_COUNT_DEFAULT) {
> +if ((libxl_defbool_val(b_info->u.hvm.altp2m) ||

Is it possible that this is what causes

Assertion failed: !libxl_defbool_is_default(db) (libxl.c: libxl_defbool_val: 
337)

observable in CI? Unless we can get a fix pretty quickly, I expect at least two
of the three patches that were committed from this series will need reverting.

Jan

> +b_info->altp2m != LIBXL_ALTP2M_MODE_DISABLED))
> +/* Reflect the default legacy count */
> +b_info->altp2m_count = 10;
> +else
> +b_info->altp2m_count = 0;
> +}
> +
>  /* Assume that providing a bootloader user implies enabling restrict. */
>  libxl_defbool_setdefault(&b_info->bootloader_restrict,
>   !!b_info->bootloader_user);

[PATCH] x86/PVH: extend checking in hwdom_fixup_p2m()

2025-07-07 Thread Jan Beulich

We're generally striving to minimize behavioral differences between PV
and PVH Dom0. Using (just?) is_memory_hole() in the PVH case looks quite
a bit weaker to me, compared to the page ownership check done in the PV
case. Extend checking accordingly.

Signed-off-by: Jan Beulich 
---
The addition may actually be suitable to replace the use of
is_memory_hole() here. While dropping that would in particular extend
coverage to E820_RESERVED regions, those are identity-mapped anyway
(albeit oddly enough still by IOMMU code).

--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -184,6 +184,22 @@ static int hwdom_fixup_p2m(paddr_t addr)
  !is_memory_hole(_mfn(gfn), _mfn(gfn)) )
 return -EPERM;
 
+/*
+ * Much like get_page_from_l1e() for PV Dom0 does, check that the page
+ * accessed is actually an MMIO one: Either its MFN is out of range, or
+ * it's owned by DOM_IO.
+ */
+if ( mfn_valid(_mfn(gfn)) )
+{
+struct page_info *pg = mfn_to_page(_mfn(gfn));
+const struct domain *owner = page_get_owner_and_reference(pg);
+
+if ( owner )
+put_page(pg);
+if ( owner != dom_io )
+return -EPERM;
+}
+
 mfn = get_gfn(currd, gfn, &type);
 if ( !mfn_eq(mfn, INVALID_MFN) || !p2m_is_hole(type) )
 rc = mfn_eq(mfn, _mfn(gfn)) ? -EEXIST : -ENOTEMPTY;

Re: [PATCH v5 03/10] x86: Replace arch-specific boot_domain with the common one

2025-07-07 Thread Alejandro Vallejo

On Thu Jul 3, 2025 at 8:04 AM CEST, Jan Beulich wrote:
> On 02.07.2025 17:34, Alejandro Vallejo wrote:
>> On Wed Jul 2, 2025 at 5:15 PM CEST, Jan Beulich wrote:
>>> On 02.07.2025 17:09, Alejandro Vallejo wrote:
 On Wed Jul 2, 2025 at 3:15 PM CEST, Jan Beulich wrote:
> On 01.07.2025 12:56, Alejandro Vallejo wrote:
>> --- a/xen/arch/x86/include/asm/bootfdt.h
>> +++ b/xen/arch/x86/include/asm/bootfdt.h
>> @@ -3,6 +3,12 @@
>>  #define X86_BOOTFDT_H
>>  
>>  #include 
>> +#include 
>> +
>> +struct arch_boot_domain
>> +{
>> +domid_t domid;
>> +};
>>  
>>  struct arch_boot_module
>>  {
>> [...]
>> @@ -1048,11 +1050,11 @@ static struct domain *__init create_dom0(struct 
>> boot_info *bi)
>>  dom0_cfg.flags |= XEN_DOMCTL_CDF_iommu;
>>  
>>  /* Create initial domain.  Not d0 for pvshim. */
>> -bd->domid = get_initial_domain_id();
>> -d = domain_create(bd->domid, &dom0_cfg,
>> +bd->arch.domid = get_initial_domain_id();
>> +d = domain_create(bd->arch.domid, &dom0_cfg,
>>pv_shim ? 0 : CDF_privileged | CDF_hardware);
>>  if ( IS_ERR(d) )
>> -panic("Error creating d%u: %ld\n", bd->domid, PTR_ERR(d));
>> +panic("Error creating d%u: %ld\n", bd->arch.domid, PTR_ERR(d));
>
> This being the only place where the (now) arch-specific field is used, why
> does it exist? A local variable would do? And if it's needed for
> (supposedly arch-agnostic) hyperlaunch, then it probably shouldn't be
> arch-specific? Daniel, Jason?

 As for the arch-agnostic side of things, arm needs some extra work to be
 able to do it safely. dom0less currently constructs domains immediately 
 after
 parsing them, which is problematic for cases where some domains have the 
 prop
 and others don't. The domid allocation strategy may preclude further 
 otherwise
 good domains from being created just because their domid was stolen by a 
 domain
 that didn't actually care about which domid it got.

 It'll eventually want to leave the arch-specific area, but I don't want to 
 do
 that work now.
>>>
>>> But if the domU field is fine to live in a common struct despite being 
>>> unused
>>> on x86, why can't the domid field live in a common struct too, despite being
>>> unused on non-x86? Otherwise it'll be extra churn for no gain to later move 
>>> it
>>> there.
>> 
>> Mostly out of tidiness. Otherwise it's hard to know which fields serve a 
>> purpose
>> where.
>> 
>> I genuinely forgot about the domU field. I'm more than happy to drop that 
>> arch
>> subfield and have domid in the main body of the struct, but I suspect MISRA
>> would have something to say about dead data?
>
> In principle yes (and then also about the domU field), but we rejected the
> respective rule altogether (for now? plus for a reason that I must have forgot
> and that escapes me).
>
> Jan

Actually, moving it to an arch-specific field is rather annoying. everyone but
x86 needs the field. I'll just compile it out for x86 specifically with ifdef
guards, even if it is common code.

For the record, I hope to get rid of it on arm/riscv/ppc entirely later on by
deducing domU vs dom0 from the capabilities property.

Cheers,
Alejandro

Re: [PATCH v7 2/7] tools/xl: Add altp2m_count parameter

2025-07-07 Thread Petr Beneš

On Mon, Jul 7, 2025 at 3:35 PM Anthony PERARD  wrote:
>
> It seems that altp2m_count is going to be used for the creation of all
> guest, right? That is in addition to HVM, it will be also used for PV
> guest and on Arm, and any other architectures that could be added.
>
> Anthony PERARD

I'm suggesting to use what is used elsewhere in the libxl_create.c:

(https://github.com/xen-project/xen/blob/9b0f0f6e235618c2764e925b58c4bfe412730ced/tools/libs/light/libxl_create.c#L1233
and 
https://github.com/xen-project/xen/blob/9b0f0f6e235618c2764e925b58c4bfe412730ced/tools/libs/light/libxl_create.c#L1241)

((d_config->c_info.type == LIBXL_DOMAIN_TYPE_HVM &&
  libxl_defbool_val(d_config->b_info.u.hvm.altp2m))

In other words, add check for the LIBXL_DOMAIN_TYPE_HVM type before
calling that offending libxl_defbool_val(). Would that be okay?

>  What should be the value of altp2m_count in all this case, if altp2m is only
> set on x86 HVM guest?

0. Similarly to other fields that are implemented only for x86, but
are available (and technically implementable) elsewhere - like altp2m
field, vmtrace_buf_kb, etc.

P.

Re: [PATCH v7 2/7] tools/xl: Add altp2m_count parameter

2025-07-07 Thread Anthony PERARD

On Tue, Jul 01, 2025 at 07:54:24PM +, Petr Beneš wrote:
> diff --git a/tools/libs/light/libxl_create.c b/tools/libs/light/libxl_create.c
> index 8a85fba1cf..acf7fd9837 100644
> --- a/tools/libs/light/libxl_create.c
> +++ b/tools/libs/light/libxl_create.c
> @@ -421,6 +421,15 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc,
>  return -ERROR_INVAL;
>  }
>  
> +if (b_info->altp2m_count == LIBXL_ALTP2M_COUNT_DEFAULT) {
> +if ((libxl_defbool_val(b_info->u.hvm.altp2m) ||

This access turned out to be an issue. "hvm.altp2m" is only set to a
default value for hvm guests on x86, in
libxl__arch_domain_build_info_setdefault() in ibxl_x86.c. So trying to
create a PV guest will fail here.

It seems that altp2m_count is going to be used for the creation of all
guest, right? That is in addition to HVM, it will be also used for PV
guest and on Arm, and any other architectures that could be added. What
should be the value of altp2m_count in all this case, if altp2m is only
set on x86 HVM guest?

> +b_info->altp2m != LIBXL_ALTP2M_MODE_DISABLED))
> +/* Reflect the default legacy count */
> +b_info->altp2m_count = 10;
> +else
> +b_info->altp2m_count = 0;

Cheers,

-- 
Anthony PERARD

Re: [PATCH] systemd: Add hooks to stop/start xen-watchdog on suspend/resume

2025-07-07 Thread Anthony PERARD

On Tue, Jul 01, 2025 at 03:11:39PM +0300, Mykola Kvach wrote:
> On Fri, Jun 27, 2025 at 3:37 PM Anthony PERARD  wrote:
> > On Thu, Jun 26, 2025 at 11:12:46AM +0300, Mykola Kvach wrote:
> > > Both scenarios are rare and typically require very small watchdog
> > > timeouts combined with significant delays in Xen or the Linux kernel
> > > during suspend/resume flows.
> > >
> > > Conceptually, however, if activating and pinging the Xen watchdog is the
> > > responsibility of the domain and its services, then the domain should
> > > also manage the watchdog service/daemon lifecycle. This is similar to
> > > what is already done by the Xen watchdog driver inside the Linux kernel.
> >
> > So there's already watchdog driver in Linux, why not activate it with
> > systemd, since it knows how to do it? I almost want to to remove the
> 
> Actually, I don't know the exact reason why we have two different
> implementations. It could be historical — for example, initially the

It's definitely historical. xenwatchdogd was introduced before systemd
existed. Then someone added systemd service files to Xen, but I don't
know if systemd's watchdog support existed at the time, or the service
file was created just to replace the existing init script.

> Xen daemon was used, and later the Linux kernel driver was introduced,
> or vice versa. It's also possible that some setups still use very old
> kernels that lack the driver, and backporting it would require
> additional effort.
> 
> More likely, though, the daemon approach was chosen for simplicity.
> Using the Linux kernel driver requires building the module or even
> rebuilding the entire kernel if the driver needs to be built-in.
> In contrast, with the daemon, you just need to build the binary and
> copy it into the domain's filesystem. It's easier and requires much
> less effort.

That sound like a good explanation, make it easier to use Xen's
watchdog, even if the currently built kernel doesn't have support for
it, so having a systemd.service file for this case is useful.

> 
> > service file and redirect users to use systemd's watchdog instead, in
> > the documentation.
> 
> I think that would be a better solution too. At the very least, we
> would avoid having to handle power-related scenarios for every
> existing init system.
> 
> For example, I see that we currently have three separate watchdog
> services: one for NetBSD (rc.d), and two for Linux (init.d and
> systemd). I also looked into how to set up power-down hooks with
> init.d, and it’s neither easy nor safe -- especially if pm-utils is
> not installed on the system.
> 
> > > diff --git a/tools/hotplug/Linux/systemd/xen-watchdog-sleep.sh 
> > > b/tools/hotplug/Linux/systemd/xen-watchdog-sleep.sh
> > > new file mode 100644
> > > index 00..2b2f0e16d8
> > > --- /dev/null
> > > +++ b/tools/hotplug/Linux/systemd/xen-watchdog-sleep.sh

> > > +XEN_WATCHDOG_SLEEP_LOG="${XEN_LOG_DIR}/xen-watchdog-sleep.log"
> >
> > Is this necessary? Use only `logger`, if `echo log` doesn't log anything.
> 
> In my case, I do see logs in journalctl after using echo in this script.
> I thought it was a common approach for the Xen toolstack to use its own log
> files for tools and services.

Well, most or all those log files were usually "created" before systemd
existed, and I don't think a single one of those were created for only
a systemd service.

> However, if it’s preferable to use only the standard systemd logging, I can
> remove all changes related to logging into separate files.

For something systemd specific, I think we should use systemd's logging
facility. There's already too many separated log files that are hard to
discover when there's an error. Adding one more isn't going to help,
especially for something that barely generate any logs.

> > > +log_watchdog() {
> > > +echo "$1"
> > > +echo "$(date): $1" >> "${XEN_WATCHDOG_SLEEP_LOG}"
> > > +}
> > > +
> > > +# Exit silently if Xen watchdog service is not present
> > > +if ! systemctl show "${SERVICE_NAME}" > /dev/null 2>&1; then
> >
> > Is this necessary? It seems `systemctl is-active` works fine when the
> > unit doesn't exist.
> 
> is-active doesn't work correctly on resume.
> In the case of resume, the script exits early because of
> the is-active check. Maybe I'm missing something.

I think I meant to remove this check and let the script just do its
thing. If the unit isn't present, `systemctl is-active` is going to be
"false" on suspend, and the $STATE_FILE isn't going to exist on resume.

> > > +exit 0
> > > +fi
> > > +
> > > +case "$1" in
> > > +pre)
> > > +if systemctl is-active --quiet "${SERVICE_NAME}"; then
> > > +touch "${STATE_FILE}"
> > > +log_watchdog "Stopping ${SERVICE_NAME} before $2."
> > > +systemctl stop "${SERVICE_NAME}"
> > > +fi
> > > +;;
> > > +post)
> > > +if [ -f "${STATE_FILE}" ]; then
> >
> > Would using `systemctl is-enabled` instead work? It seems to work for a
> > service on my machine

Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()

2025-07-07 Thread Jan Beulich

On 04.07.2025 17:01, Oleksii Kurochko wrote:
> On 7/1/25 3:49 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> This patch introduces p2m_set_entry() and its core helper __p2m_set_entry() 
>>> for
>>> RISC-V, based loosely on the Arm implementation, with several 
>>> RISC-V-specific
>>> modifications.
>>>
>>> Key differences include:
>>> - TLB Flushing: RISC-V allows caching of invalid PTEs and does not require
>>>break-before-make (BBM). As a result, the flushing logic is simplified.
>>>TLB invalidation can be deferred until p2m_write_unlock() is called.
>>>Consequently, the p2m->need_flush flag is always considered true and is
>>>removed.
>>> - Page Table Traversal: The order of walking the page tables differs from 
>>> Arm,
>>>and this implementation reflects that reversed traversal.
>>> - Macro Adjustments: The macros P2M_ROOT_LEVEL, P2M_ROOT_ORDER, and
>>>P2M_ROOT_PAGES are updated to align with the new RISC-V implementation.
>>>
>>> The main functionality is in __p2m_set_entry(), which handles mappings 
>>> aligned
>>> to page table block entries (e.g., 1GB, 2MB, or 4KB with 4KB granularity).
>>>
>>> p2m_set_entry() breaks a region down into block-aligned mappings and calls
>>> __p2m_set_entry() accordingly.
>>>
>>> Stub implementations (to be completed later) include:
>>> - p2m_free_entry()
>> What would a function of this name do?
> 
> Recursively visiting all leaf PTE's for sub-tree behind an entry, then calls
> put_page() (which will free if there is no any reference to this page),
> freeing intermediate page table (after all entries were freed) by removing
> it from d->arch.paging.freelist, and removes correspondent page of 
> intermediate page
> table from p2m->pages list.
> 
>> You can clear entries, but you can't
>> free them, can you?
> 
> Is is a question regarding terminology?

Yes. If one sees a call to a function, it should be possible to at least
roughly know what it does without needing to go look at the implementation.

> I can't free entry itself, but a page table or
> a page (if it is a leaf entry) on which it points could free.

Then e.g. pte_free_subtree() or some such?

>>> +static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>> The rule of thumb is to have inline functions only in header files, leaving
>> decisions to the compiler elsewhere.
> 
> I am not sure what you mean in the second part (after coma) of your sentence.

The compiler does its own inlining decisions quite fine when it can see all
call sites (as is the case for static functions). Hence in general you want
to omit "inline" there. Except of course in header files, where non-inline
static-s are a problem.

>>> +{
>>> +panic("%s: isn't implemented for now\n", __func__);
>>> +
>>> +return false;
>>> +}
>> For this function in particular, though: Besides the "p2me" in the name
>> being somewhat odd (supposedly page table entries here are simply pte_t),
>> how is this going to be different from pte_is_valid()?
> 
> pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
> what is a type stored in the radix tree (p2m->p2m_types):
>/*
> * In the case of the P2M, the valid bit is used for other purpose. Use
> * the type to check whether an entry is valid.
> */
>static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>{
>return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>}
> 
> It is done to track which page was modified by a guest.

But then (again) the name doesn't convey what the function does. Plus
can't a guest also arrange for an entry's type to move to p2m_invalid?
That's then still an entry that was modified by the guest.

Overall I think I'm lacking clarity what you mean to use this predicate
for.

>>> +static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
>>> +{
>>> +write_pte(p, pte);
>>> +if ( clean_pte )
>>> +clean_dcache_va_range(p, sizeof(*p));
>>> +}
>>> +
>>> +static inline void p2m_remove_pte(pte_t *p, bool clean_pte)
>>> +{
>>> +pte_t pte;
>>> +
>>> +memset(&pte, 0x00, sizeof(pte));
>>> +p2m_write_pte(p, pte, clean_pte);
>>> +}
>> May I suggest "clear" instead of "remove" and plain 0 instead of 0x00
>> (or simply give the variable a trivial initializer)?
> 
> Sure, I will rename and use plain 0.
> 
>>
>> As to the earlier function that I commented on: Seeing the names here,
>> wouldn't p2m_pte_is_valid() be a more consistent name there?
> 
> Then all p2me_*() should be updated to p2m_pte_*().
> 
> But initial logic was that p2me = p2m entry = p2m page table entry.
> 
> Probably we can just return back to the prefix p2m_ as based on arguments
> it is clear that it is a function for working with P2M's PTE.

In the end it's up to you. Having thought about it some more, perhaps
p2me_*() is still quite helpful to separate from functions dealing with
P2Ms as a while, and to also avoid the verbosity of p2m_pte_*().

>>> +{
>

Re: [XEN PATCH 3/5] x86/irq: address violation of MISRA C Rule 5.5

2025-07-07 Thread Jan Beulich

On 04.07.2025 22:39, Dmytro Prokopchuk1 wrote:
> Address a violation of MISRA C:2012 Rule 5.5:
> "Identifiers shall be distinct from macro names".
> 
> Reports for service MC3A2.R5.5:
> xen/include/xen/irq.h: non-compliant function `pirq_cleanup_check(struct 
> pirq*, struct domain*)'
> xen/include/xen/irq.h: non-compliant macro `pirq_cleanup_check'
> 
> The primary issue stems from the macro and function
> having identical names, which is confusing and
> non-compliant with common coding standards.
> 
> Change the function name by adding two underscores at the end.
> 
> Signed-off-by: Dmytro Prokopchuk 

I'm not going to NAK this, but I dislike the transformation done. The aliasing
in this case was intentional, to avoid any caller appearing that would bypass
the macro. Yes, the double underscores will also stand out (as much as the
parenthesization that would have been needed to override the protection), but
still ...

Jan

[PATCH v1 3/3] iommu/ipmmu-vmsa: Implement basic PCIE-IPMMU OSID support

2025-07-07 Thread Mykyta Poturai

From: Oleksandr Tyshchenko 

Program PCIE BDF-OSID assignment according to the S4_PCIe_IPMMU-OSID
when adding PCI device to the IOMMU in ipmmu_add_device callback.
This is needed for being able to assign PCI devices to different
domains at the same time. Programmed OSID is emmited as sideband data on
the AXI bus during PCI DMA transactions and then used by IPMMU to match
the transaction to the corresponding uTLB.

Signed-off-by: Oleksandr Tyshchenko 
Signed-off-by: Mykyta Poturai 
---
 xen/drivers/passthrough/arm/ipmmu-vmsa.c | 141 +--
 1 file changed, 133 insertions(+), 8 deletions(-)

diff --git a/xen/drivers/passthrough/arm/ipmmu-vmsa.c 
b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
index d828d9cf6a..14ddabb30e 100644
--- a/xen/drivers/passthrough/arm/ipmmu-vmsa.c
+++ b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
@@ -51,12 +51,25 @@
 #include 
 #include 
 #include 
+#include "../arch/arm/pci/pci-host-rcar4.h"
 
 #define dev_name(dev) dt_node_full_name(dev_to_dt(dev))
 
 /* Device logger functions */
+#ifndef CONFIG_HAS_PCI
 #define dev_print(dev, lvl, fmt, ...)\
 printk(lvl "ipmmu: %s: " fmt, dev_name(dev), ## __VA_ARGS__)
+#else
+#define dev_print(dev, lvl, fmt, ...) ({\
+if ( !dev_is_pci((dev)) )   \
+printk(lvl "ipmmu: %s: " fmt, dev_name((dev)), ## __VA_ARGS__);  \
+else\
+{   \
+struct pci_dev *pdev = dev_to_pci((dev));   \
+printk(lvl "ipmmu: %pp: " fmt, &pdev->sbdf, ## __VA_ARGS__); \
+}   \
+})
+#endif
 
 #define dev_info(dev, fmt, ...)\
 dev_print(dev, XENLOG_INFO, fmt, ## __VA_ARGS__)
@@ -1121,6 +1134,8 @@ static void ipmmu_free_root_domain(struct 
ipmmu_vmsa_domain *domain)
 xfree(domain);
 }
 
+static int ipmmu_deassign_device(struct domain *d, struct device *dev);
+
 static int ipmmu_assign_device(struct domain *d, u8 devfn, struct device *dev,
uint32_t flag)
 {
@@ -1134,8 +1149,43 @@ static int ipmmu_assign_device(struct domain *d, u8 
devfn, struct device *dev,
 if ( !to_ipmmu(dev) )
 return -ENODEV;
 
-spin_lock(&xen_domain->lock);
+#ifdef CONFIG_HAS_PCI
+if ( dev_is_pci(dev) )
+{
+struct pci_dev *pdev = dev_to_pci(dev);
+struct domain *old_d = pdev->domain;
+
+printk(XENLOG_INFO "Assigning device %04x:%02x:%02x.%u to dom%d\n",
+   pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+   d->domain_id);
+
+/*
+ * XXX What would be the proper behavior? This could happen if
+ * pdev->phantom_stride > 0
+ */
+if ( devfn != pdev->devfn )
+ASSERT_UNREACHABLE();
+
+list_move(&pdev->domain_list, &d->pdev_list);
+pdev->domain = d;
+
+/* dom_io is used as a sentinel for quarantined devices */
+if ( d == dom_io )
+{
+int ret;
+
+/*
+ * Try to de-assign: do not return error if it was already
+ * de-assigned.
+ */
+ret = ipmmu_deassign_device(old_d, dev);
+
+return ret == -ESRCH ? 0 : ret;
+}
+}
+#endif
 
+spin_lock(&xen_domain->lock);
 /*
  * The IPMMU context for the Xen domain is not allocated beforehand
  * (at the Xen domain creation time), but on demand only, when the first
@@ -1240,7 +1290,7 @@ static int ipmmu_reassign_device(struct domain *s, struct 
domain *t,
 int ret = 0;
 
 /* Don't allow remapping on other domain than hwdom */
-if ( t && !is_hardware_domain(t) )
+if ( t && !is_hardware_domain(t) && (t != dom_io) )
 return -EPERM;
 
 if ( t == s )
@@ -1288,20 +1338,95 @@ static int ipmmu_dt_xlate(struct device *dev,
 
 static int ipmmu_add_device(u8 devfn, struct device *dev)
 {
-struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+struct iommu_fwspec *fwspec;
+
+#ifdef CONFIG_HAS_PCI
+if ( dev_is_pci(dev) )
+{
+struct pci_dev *pdev = dev_to_pci(dev);
+int ret;
+
+if ( devfn != pdev->devfn )
+return 0;
+
+ret = iommu_add_pci_sideband_ids(pdev);
+if ( ret < 0 )
+iommu_fwspec_free(dev);
+}
+#endif
+
+fwspec = dev_iommu_fwspec_get(dev);
 
 /* Only let through devices that have been verified in xlate(). */
 if ( !to_ipmmu(dev) )
 return -ENODEV;
 
-if ( dt_device_is_protected(dev_to_dt(dev)) )
+if ( !dev_is_pci(dev) )
 {
-dev_err(dev, "Already added to IPMMU\n");
-return -EEXIST;
+if ( dt_device_is_protected(dev_to_dt(dev)) )
+{
+dev_err(dev, "Already added to IPMMU\n");
+return -EEXIST;
+}
+
+/* Let Xen know that

[PATCH v1 0/3] IPMMU handling for PCIe Passthrough on ARM

2025-07-07 Thread Mykyta Poturai

This series introduces IPMMU handling for PCIe passthrough on ARM. It includes
changes to pci-designware, pci-host-rcar and ipmmu-vmsa drivers to enable
configuring BDF->OSID->uTLB translation chain needed to pass different PCIe
devices to different domains.

Tested on RCar S4 Spider board.

Mykyta Poturai (2):
  arm/pci: allow designware-based hosts to have private data
  pci/rcar: implement OSID configuration for Renesas RCar Gen4 PCIe host

Oleksandr Tyshchenko (1):
  iommu/ipmmu-vmsa: Implement basic PCIE-IPMMU OSID support

 xen/arch/arm/pci/pci-designware.c|  12 ++
 xen/arch/arm/pci/pci-designware.h|   4 +
 xen/arch/arm/pci/pci-host-rcar4.c| 148 +++
 xen/arch/arm/pci/pci-host-rcar4.h|  18 +++
 xen/drivers/passthrough/arm/ipmmu-vmsa.c | 141 +++--
 5 files changed, 315 insertions(+), 8 deletions(-)
 create mode 100644 xen/arch/arm/pci/pci-host-rcar4.h

-- 
2.34.1

[PATCH v1 2/3] pci/rcar: implement OSID configuration for Renesas RCar Gen4 PCIe host

2025-07-07 Thread Mykyta Poturai

For IPMMU to be able to associate a specific PCI device with it's TLB
the BDF to OSID mapping needs to be set up in the host bridge. The
configured OSID is then emmited as a sideband data on the AXI bus during
PCI DMA transactions. OSID configuration registers are located in the
"app" region of the host bridge.

Map the "app" region on init and implement methods for setting up
BDF->OSID mappings

Signed-off-by: Mykyta Poturai 
---
 xen/arch/arm/pci/pci-host-rcar4.c | 148 ++
 xen/arch/arm/pci/pci-host-rcar4.h |  18 
 2 files changed, 166 insertions(+)
 create mode 100644 xen/arch/arm/pci/pci-host-rcar4.h

diff --git a/xen/arch/arm/pci/pci-host-rcar4.c 
b/xen/arch/arm/pci/pci-host-rcar4.c
index 62d2130a63..9290c6cac5 100644
--- a/xen/arch/arm/pci/pci-host-rcar4.c
+++ b/xen/arch/arm/pci/pci-host-rcar4.c
@@ -16,6 +16,32 @@
 
 #define RCAR4_DWC_VERSION   0x520A
 
+/* PCIE BDF-OSID assignment */
+#define CNVID(n) (0x700 + ((n) * 4))
+#define CNVID_CNV_EN (1U << 31)
+#define CNVID_OSID_MASK  (0x0F << 16)
+#define CNVID_OSID_SHIFT 16
+#define CNVID_BDF_MASK   (0x << 0)
+#define CNVID_BDF_SHIFT  0
+
+#define CNVIDMSK(n)(0x780 + ((n) * 4))
+#define CNVIDMSK_BDF_MSK_MASK  (0x << 0)
+#define CNVIDMSK_BDF_MSK_SHIFT 0
+
+#define CNVOSIDCTRL0x800
+#define CNVOSIDCTRL_OSID_MASK  (0x0F << 16)
+#define CNVOSIDCTRL_OSID_SHIFT 16
+
+#define DEFAULT_OSID0
+
+#define NUM_OSID_REGS16
+
+struct rcar4_pcie_priv {
+bool init_done;
+void __iomem *app_base;
+DECLARE_BITMAP(osid_regs, NUM_OSID_REGS);
+};
+
 /*
  * PCI host bridges often have different ways to access the root and child
  * bus config spaces:
@@ -65,17 +91,139 @@ static const struct dt_device_match __initconstrel 
rcar4_pcie_dt_match[] = {
 {},
 };
 
+static void rcar4_pcie_writel_app(struct rcar4_pcie_priv *pci, uint32_t reg,
+  uint32_t val)
+{
+writel(val, pci->app_base + reg);
+}
+
+static uint32_t rcar4_pcie_readl_app(struct rcar4_pcie_priv *pci, uint32_t reg)
+{
+return readl(pci->app_base + reg);
+}
+
+int rcar4_pcie_osid_regs_init(struct pci_host_bridge *bridge)
+{
+struct rcar4_pcie_priv *priv = dw_pcie_get_priv(bridge);
+uint32_t val = rcar4_pcie_readl_app(priv, CNVOSIDCTRL);
+
+if ( priv->init_done )
+return 0;
+priv->init_done = true;
+
+val = (val & ~CNVOSIDCTRL_OSID_MASK) |
+  (DEFAULT_OSID << CNVOSIDCTRL_OSID_SHIFT);
+rcar4_pcie_writel_app(priv, CNVOSIDCTRL, val);
+bitmap_zero(priv->osid_regs, NUM_OSID_REGS);
+
+printk("%s: Initialized OSID regs (default OSID %u)\n",
+   bridge->dt_node->full_name, DEFAULT_OSID);
+
+return 0;
+}
+
+int rcar4_pcie_osid_reg_alloc(struct pci_host_bridge *bridge)
+{
+struct rcar4_pcie_priv *priv = dw_pcie_get_priv(bridge);
+int ret;
+
+ret = find_first_zero_bit(priv->osid_regs, NUM_OSID_REGS);
+if ( ret != NUM_OSID_REGS )
+set_bit(ret, priv->osid_regs);
+else
+ret = -EBUSY;
+
+return ret;
+}
+
+void rcar4_pcie_osid_reg_free(struct pci_host_bridge *bridge,
+  unsigned int reg_id)
+{
+struct rcar4_pcie_priv *priv = dw_pcie_get_priv(bridge);
+
+clear_bit(reg_id, priv->osid_regs);
+}
+
+void rcar4_pcie_osid_bdf_set(struct pci_host_bridge *bridge,
+ unsigned int reg_id, uint32_t osid, uint32_t bdf)
+{
+struct rcar4_pcie_priv *priv = dw_pcie_get_priv(bridge);
+uint32_t data = rcar4_pcie_readl_app(priv, CNVID(reg_id));
+
+data &= ~(CNVID_OSID_MASK | CNVID_BDF_MASK);
+data |= CNVID_CNV_EN | (osid << CNVID_OSID_SHIFT) |
+(bdf << CNVID_BDF_SHIFT);
+rcar4_pcie_writel_app(priv, CNVID(reg_id), data);
+}
+
+void rcar4_pcie_osid_bdf_clear(struct pci_host_bridge *bridge,
+   unsigned int reg_id)
+{
+struct rcar4_pcie_priv *priv = dw_pcie_get_priv(bridge);
+uint32_t data = rcar4_pcie_readl_app(priv, CNVID(reg_id));
+
+data &= ~CNVID_CNV_EN;
+rcar4_pcie_writel_app(priv, CNVID(reg_id), data);
+}
+
+void rcar4_pcie_bdf_msk_set(struct pci_host_bridge *bridge, unsigned int 
reg_id,
+uint32_t data)
+{
+struct rcar4_pcie_priv *priv = dw_pcie_get_priv(bridge);
+
+uint32_t val = rcar4_pcie_readl_app(priv, CNVIDMSK(reg_id));
+
+val = (val & ~CNVIDMSK_BDF_MSK_MASK) | (data << CNVIDMSK_BDF_MSK_SHIFT);
+
+rcar4_pcie_writel_app(priv, CNVIDMSK(reg_id), val);
+}
+
 static int __init pci_host_rcar4_probe(struct dt_device_node *dev,
const void *data)
 {
 struct pci_host_bridge *bridge;
+paddr_t app_phys_addr;
+paddr_t app_size;
+int app_idx, ret;
+
+struct rcar4_pcie_priv *priv = xzalloc(struct rcar4_pcie_priv);
+if ( !priv )
+return -ENOMEM;
 
 bridge = dw_pcie_host_probe(dev, data, &rcar4_pcie_ops,

[PATCH v1 1/3] arm/pci: allow designware-based hosts to have private data

2025-07-07 Thread Mykyta Poturai

Introduce an additional private data field in `dw_pcie_priv` to allow
vendors to store custom data without interfering with bridge->priv.
Also add get/set pair to make accesing that private data less
cumbersome.

Signed-off-by: Mykyta Poturai 
---
 xen/arch/arm/pci/pci-designware.c | 12 
 xen/arch/arm/pci/pci-designware.h |  4 
 2 files changed, 16 insertions(+)

diff --git a/xen/arch/arm/pci/pci-designware.c 
b/xen/arch/arm/pci/pci-designware.c
index 47dd2dd4c0..0bd67524ac 100644
--- a/xen/arch/arm/pci/pci-designware.c
+++ b/xen/arch/arm/pci/pci-designware.c
@@ -403,3 +403,15 @@ dw_pcie_host_probe(struct dt_device_node *dev, const void 
*data,
 
 return bridge;
 }
+
+void *dw_pcie_get_priv(struct pci_host_bridge *bridge)
+{
+struct dw_pcie_priv *priv = bridge->priv;
+return priv->priv;
+}
+
+void dw_pcie_set_priv(struct pci_host_bridge *bridge, void *other)
+{
+struct dw_pcie_priv *priv = bridge->priv;
+priv->priv = other;
+}
diff --git a/xen/arch/arm/pci/pci-designware.h 
b/xen/arch/arm/pci/pci-designware.h
index 7efb1dc9a2..b9deb3b138 100644
--- a/xen/arch/arm/pci/pci-designware.h
+++ b/xen/arch/arm/pci/pci-designware.h
@@ -66,8 +66,12 @@ struct dw_pcie_priv {
 bool iatu_unroll_enabled;
 void __iomem *atu_base;
 unsigned int version;
+void *priv;
 };
 
+void *dw_pcie_get_priv(struct pci_host_bridge *bridge);
+void dw_pcie_set_priv(struct pci_host_bridge *bridge, void *other);
+
 void dw_pcie_set_version(struct pci_host_bridge *bridge, unsigned int version);
 
 void __iomem *dw_pcie_child_map_bus(struct pci_host_bridge *bridge,
-- 
2.34.1

Re: [PATCH 2/2] xen/arm: Skip loops in init_pdx() when no PDX compression is used

2025-07-07 Thread Hari Limaye

Hi Michal,

> On Fri, Jul 04, 2025 at 09:54:28AM +, Michal Orzel wrote:
> When CONFIG_PDX_COMPRESSION=n, pdx_init_mask(), pdx_region_mask() and
> pfn_pdx_hole_setup() are just stubs doing nothing. It does not make
> sense to keep the two loops iterating over all the memory banks.
> 
> Signed-off-by: Michal Orzel 
> ---
>  xen/arch/arm/setup.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
> index 93b730ffb5fb..12b76a0a9837 100644
> --- a/xen/arch/arm/setup.c
> +++ b/xen/arch/arm/setup.c
> @@ -255,7 +255,9 @@ void __init init_pdx(void)
>  {
>  const struct membanks *mem = bootinfo_get_mem();
>  paddr_t bank_start, bank_size, bank_end, ram_end = 0;
> +int bank;

Minor/question: Would we potentially prefer to move the declaration of
the loop counter `bank` variables to each for loop? As the variable is
not used outside of the loops and is always initialized to 0, this seems
perhaps better from a variable scope perspective?

>  
> +#ifdef CONFIG_PDX_COMPRESSION
>  /*
>   * Arm does not have any restrictions on the bits to compress. Pass 0 to
>   * let the common code further restrict the mask.
> @@ -264,7 +266,6 @@ void __init init_pdx(void)
>   * update this function too.
>   */
>  uint64_t mask = pdx_init_mask(0x0);
> -int bank;
>  
>  for ( bank = 0 ; bank < mem->nr_banks; bank++ )
>  {
> @@ -284,6 +285,7 @@ void __init init_pdx(void)
>  }
>  
>  pfn_pdx_hole_setup(mask >> PAGE_SHIFT);
> +#endif
>  
>  for ( bank = 0 ; bank < mem->nr_banks; bank++ )
>  {

Otherwise, LGTM! Tested (compilation) via both Arm AArch32 and AArch64 builds.

Many thanks,
Hari

Re: [PATCH 2/2] xen/arm: Skip loops in init_pdx() when no PDX compression is used

2025-07-07 Thread Hari Limaye

Apologies for noise, my previous message was missing the intended tags:

On Fri, Jul 04, 2025 at 09:54:28AM +, Michal Orzel wrote:
> When CONFIG_PDX_COMPRESSION=n, pdx_init_mask(), pdx_region_mask() and
> pfn_pdx_hole_setup() are just stubs doing nothing. It does not make
> sense to keep the two loops iterating over all the memory banks.
> 
> Signed-off-by: Michal Orzel 
> ---
Reviewed-by: Hari Limaye 
Tested-by: Hari Limaye 

Cheers,
Hari

Re: [PATCH v2 5/5] xen/arm: Support ARM standard PV time for domains created via toolstack

2025-07-07 Thread Jan Beulich

On 05.07.2025 16:27, Koichiro Den wrote:
> --- a/xen/arch/arm/mm.c
> +++ b/xen/arch/arm/mm.c
> @@ -180,7 +180,21 @@ int xenmem_add_to_physmap_one(
>  case XENMAPSPACE_dev_mmio:
>  rc = map_dev_mmio_page(d, gfn, _mfn(idx));
>  return rc;
> +case XENMAPSPACE_pv_time:
> +#ifdef CONFIG_ARM_64

These two lines want to change places, I think.

> +ASSERT(IS_POWER_OF_TWO(sizeof(struct pv_time_region)));

BUILD_BUG_ON() please, so that an issue here can be spotted at build time
rather than only at runtime.

> +if ( idx >= DIV_ROUND_UP(d->max_vcpus * sizeof(struct 
> pv_time_region),
> + PAGE_SIZE) )
> +return -EINVAL;
> +
> +if ( idx == 0 )
> +d->arch.pv_time_regions_gfn = gfn;

This looks fragile, as it'll break once d->max_vcpus can grow large enough to
permit a non-zero idx by way of the earlier if() falling through. Imo you'll
need at least one further BUILD_BUG_ON() here.

>  
> +mfn = virt_to_mfn(d->arch.pv_time_regions[idx]);
> +t = p2m_ram_ro;

Is this the correct type to use here? That is, do you really mean guest write
attempts to be silently dropped, rather than being reported to the guest as a
fault? Then again I can't see such behavior being implemented on Arm, despite
the comment on the enumerator saying so (likely inherited from x86).

> +break;
> +#endif
>  default:
>  return -ENOSYS;
>  }

As to style: Please, rather than absorbing the blank line that was there, make
sure non-fall-through case blocks are separated from adjacent ones by a blank
line.

> --- a/xen/include/public/memory.h
> +++ b/xen/include/public/memory.h
> @@ -217,6 +217,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_machphys_mapping_t);
>Stage-2 using the Normal Memory
>Inner/Outer Write-Back Cacheable
>memory attribute. */
> +#define XENMAPSPACE_pv_time  6 /* PV time shared region (ARM64 only) */

The comment isn't specific enough. As per the struct declaration in patch 4,
this interface is solely about stolen time. There's a wider PV interface,
which at least x86 Linux also uses, and which has been adopted by KVM as
well iirc. Hence this new type wants to clarify what exactly it's used for
right now, while leaving open other uses on other architectures.

Jan

Re: [PATCH v2 1/6] arm/mpu: Find MPU region by range

2025-07-07 Thread Orzel, Michal




On 02/07/2025 16:13, Hari Limaye wrote:
> From: Luca Fancellu 
> 
> Implement a function to find the index of a MPU region in the xen_mpumap
> MPU region array. This function will be used in future commits to
> implement creating and destroying MPU regions.
> 
> Signed-off-by: Luca Fancellu 
> Signed-off-by: Hari Limaye 
> ---
> Changes from v1:
> - Update commit message
> - Remove internal _index variable
> - Simplify logic by disallowing NULL index parameter
> - Use normal printk
> - Reorder conditional checks
> - Update some comments
> ---
>  xen/arch/arm/include/asm/mpu/mm.h | 29 +
>  xen/arch/arm/mpu/mm.c | 52 +++
>  2 files changed, 81 insertions(+)
> 
> diff --git a/xen/arch/arm/include/asm/mpu/mm.h 
> b/xen/arch/arm/include/asm/mpu/mm.h
> index a7f970b465..81e47b9d0b 100644
> --- a/xen/arch/arm/include/asm/mpu/mm.h
> +++ b/xen/arch/arm/include/asm/mpu/mm.h
> @@ -10,6 +10,13 @@
>  #include 
>  #include 
>  
> +#define MPUMAP_REGION_OVERLAP  -1
> +#define MPUMAP_REGION_NOTFOUND  0
> +#define MPUMAP_REGION_FOUND 1
> +#define MPUMAP_REGION_INCLUSIVE 2
> +
> +#define INVALID_REGION_IDX 0xFFU
> +
>  extern struct page_info *frame_table;
>  
>  extern uint8_t max_mpu_regions;
> @@ -75,6 +82,28 @@ void write_protection_region(const pr_t *pr_write, uint8_t 
> sel);
>   */
>  pr_t pr_of_addr(paddr_t base, paddr_t limit, unsigned int flags);
>  
> +/*
> + * Checks whether a given memory range is present in the provided table of
> + * MPU protection regions.
> + *
> + * @param table Array of pr_t protection regions.
> + * @param r_regions Number of elements in `table`.
> + * @param base  Start of the memory region to be checked (inclusive).
> + * @param limit End of the memory region to be checked (exclusive).
> + * @param index Set to the index of the region if an exact or 
> inclusive
> + *  match is found, and INVALID_REGION otherwise.
> + * @return: Return code indicating the result of the search:
> + *  MPUMAP_REGION_NOTFOUND: no part of the range is present in 
> `table`
> + *  MPUMAP_REGION_FOUND: found an exact match in `table`
> + *  MPUMAP_REGION_INCLUSIVE: found an inclusive match in `table`
> + *  MPUMAP_REGION_OVERLAP: found an overlap with a mapping in `table`
> + *
> + * Note: make sure that the range [`base`, `limit`) refers to the memory 
> region
> + * inclusive of `base` and exclusive of `limit`.
> + */
> +int mpumap_contains_region(pr_t *table, uint8_t nr_regions, paddr_t base,
> +  paddr_t limit, uint8_t *index);
> +
>  #endif /* __ARM_MPU_MM_H__ */
>  
>  /*
> diff --git a/xen/arch/arm/mpu/mm.c b/xen/arch/arm/mpu/mm.c
> index ccfb37a67b..25e7f36c1e 100644
> --- a/xen/arch/arm/mpu/mm.c
> +++ b/xen/arch/arm/mpu/mm.c
> @@ -110,6 +110,58 @@ pr_t pr_of_addr(paddr_t base, paddr_t limit, unsigned 
> int flags)
>  return region;
>  }
>  
> +int mpumap_contains_region(pr_t *table, uint8_t nr_regions, paddr_t base,
> +  paddr_t limit, uint8_t *index)
> +{
> +ASSERT(index);
> +*index = INVALID_REGION_IDX;
> +
> +/* Convert [base, limit) to [base, limit - 1] for inclusive comparison */
> +limit = limit - 1;
> +
> +if ( limit < base )
> +{
> +printk("Base address %#"PRIpaddr" must be smaller than limit address 
> %#"PRIpaddr"\n",
Given this message, what about region being empty i.e. limit == base? Is it
worth continuing the function?

> +   base, limit);
Here you print limit as inclusive but below as exclusive. Why the difference?

> +return -EINVAL;
> +}
> +
> +for (uint8_t i = 0; i < nr_regions; i++ )
Space before ( and uint8_t.

~Michal

Re: [PATCH v3 01/22] x86/include/asm/intel-txt.h: constants and accessors for TXT registers and heap

2025-07-07 Thread Jan Beulich

On 06.07.2025 17:57, Sergii Dmytruk wrote:
> On Wed, Jul 02, 2025 at 04:29:18PM +0200, Jan Beulich wrote:
>> Btw, a brief rev log would be nice here. I saw you have something in the
>> cover letter, but having to look in two places isn't very helpful.
> 
> I don't really know how to effectively maintain 23 logs at the same time
> given that changing one patch has cascading effects on the rest.  I'd
> suggest using `git diff-range` instead, commands for which I can include
> in cover letters for convenience.

Well, no, doing this per patch is possible and relevant. For cascading
effects their mentioning in a revlog can be pretty brief.

>>> +#define _txt(x) __va(x)
>>> +#endif
>>
>> Without any uses the correctness of the above is hard to judge.
> 
> The _txt() macro is used right below:

Well, the comment was meant a little indirectly: The correctness in
particular wrt the __EARLY_SLAUNCH__ distinction.

>>> +/*
>>> + * Always use private space as some of registers are either read-only or 
>>> not
>>> + * present in public space.
>>> + */
>>> +static inline uint64_t txt_read(unsigned int reg_no)
>>> +{
>>> +volatile uint64_t *reg = _txt(TXT_PRIV_CONFIG_REGS_BASE + reg_no);
>>> +return *reg;
>>> +}
>>> +
>>> +static inline void txt_write(unsigned int reg_no, uint64_t val)
>>> +{
>>> +volatile uint64_t *reg = _txt(TXT_PRIV_CONFIG_REGS_BASE + reg_no);
>>> +*reg = val;
>>> +}
> 
>>> + * This serves as TXT register barrier after writing to
>>> + * TXTCR_CMD_UNLOCK_MEM_CONFIG. Must be done to ensure that any future
>>> + * chipset operations see the write.
>>> + */
>>> +(void)txt_read(TXTCR_ESTS);
>>
>> I don't think the cast is needed.
> 
> It's not needed, but I think that explicitly discarding unused return
> value is a generally good practice even when there is a comment.

In the context of Misra there has been discussion about doing so. But in our
present code base you will find such as the exception, not the rule.

>>> +txt_write(TXTCR_CMD_RESET, 1);
>>> +unreachable();
>>
>> What guarantees the write to take immediate effect? That is, shouldn't there
>> be e.g. an infinite loop here, just in case?
> 
> I'll return infinite loop from v2.  Tried adding `halt()` as Ross
> suggests, but including  doesn't work in the early code
> (something about compat headers and missing expansion of things like
> __DeFiNe__).

Yeah, untangling that may be a little involved. Open-coding halt() is an
option, as long as you clearly indicate it as such (for e.g. grep to still
find that instance).

Jan

Re: [PATCH 1/2] xen/arm64: Panic if direct map is too small

2025-07-07 Thread Hari Limaye

On Fri, Jul 04, 2025 at 09:54:27AM +, Michal Orzel wrote:
> Harden the code by panicing if direct map is too small for current memory

NIT: s/panicing/panicking

> layout taking into account possible PDX compression. Otherwise the assert
> is observed:
> Assertion '(mfn_to_pdx(maddr_to_mfn(ma)) - directmap_base_pdx) < 
> (DIRECTMAP_SIZE >> PAGE_SHIFT)' failed at ./arch/arm/include/asm/mmu/mm.h:72
> 
> At the moment, we don't set max_pdx denoting maximum usable PDX which
> should be based on max_page. Consolidate setting of max_page and max_pdx
> in init_pdx() for both arm32 and arm64. max_pdx will be used in the
> future to set up frametable mappings respecting the PDX grouping.
> 
> Signed-off-by: Michal Orzel 
Reviewed-by: Hari Limaye 
Tested-by: Hari Limaye 

LGTM! Tested (compilation) via both Arm AArch32 and AArch64 builds.

Many thanks,
Hari

Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()

2025-07-07 Thread Oleksii Kurochko



On 7/7/25 9:20 AM, Jan Beulich wrote:

On 04.07.2025 17:01, Oleksii Kurochko wrote:

On 7/1/25 3:49 PM, Jan Beulich wrote:

On 10.06.2025 15:05, Oleksii Kurochko wrote:

This patch introduces p2m_set_entry() and its core helper __p2m_set_entry() for
RISC-V, based loosely on the Arm implementation, with several RISC-V-specific
modifications.

Key differences include:
- TLB Flushing: RISC-V allows caching of invalid PTEs and does not require
break-before-make (BBM). As a result, the flushing logic is simplified.
TLB invalidation can be deferred until p2m_write_unlock() is called.
Consequently, the p2m->need_flush flag is always considered true and is
removed.
- Page Table Traversal: The order of walking the page tables differs from Arm,
and this implementation reflects that reversed traversal.
- Macro Adjustments: The macros P2M_ROOT_LEVEL, P2M_ROOT_ORDER, and
P2M_ROOT_PAGES are updated to align with the new RISC-V implementation.

The main functionality is in __p2m_set_entry(), which handles mappings aligned
to page table block entries (e.g., 1GB, 2MB, or 4KB with 4KB granularity).

p2m_set_entry() breaks a region down into block-aligned mappings and calls
__p2m_set_entry() accordingly.

Stub implementations (to be completed later) include:
- p2m_free_entry()

What would a function of this name do?

Recursively visiting all leaf PTE's for sub-tree behind an entry, then calls
put_page() (which will free if there is no any reference to this page),
freeing intermediate page table (after all entries were freed) by removing
it from d->arch.paging.freelist, and removes correspondent page of intermediate 
page
table from p2m->pages list.


You can clear entries, but you can't
free them, can you?

Is is a question regarding terminology?

Yes. If one sees a call to a function, it should be possible to at least
roughly know what it does without needing to go look at the implementation.


I can't free entry itself, but a page table or
a page (if it is a leaf entry) on which it points could free.

Then e.g. pte_free_subtree() or some such?


It sounds fine to me. I'll use suggested name.

Just want to notice that other arches also have the same function
for the same purpose with the same name.
Does it make sense then to change a name for all arches?




+static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)

The rule of thumb is to have inline functions only in header files, leaving
decisions to the compiler elsewhere.

I am not sure what you mean in the second part (after coma) of your sentence.

The compiler does its own inlining decisions quite fine when it can see all
call sites (as is the case for static functions). Hence in general you want
to omit "inline" there. Except of course in header files, where non-inline
static-s are a problem.


Thanks, now it is clear what you meant.




+{
+panic("%s: isn't implemented for now\n", __func__);
+
+return false;
+}

For this function in particular, though: Besides the "p2me" in the name
being somewhat odd (supposedly page table entries here are simply pte_t),
how is this going to be different from pte_is_valid()?

pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
what is a type stored in the radix tree (p2m->p2m_types):
/*
 * In the case of the P2M, the valid bit is used for other purpose. Use
 * the type to check whether an entry is valid.
 */
static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
{
return p2m_type_radix_get(p2m, pte) != p2m_invalid;
}

It is done to track which page was modified by a guest.

But then (again) the name doesn't convey what the function does.


Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would 
better.


  Plus
can't a guest also arrange for an entry's type to move to p2m_invalid?
That's then still an entry that was modified by the guest.


I am not really sure that I fully understand the question.
Do you ask if a guest can do something which will lead to a call of 
p2m_set_entry()
with p2m_invalid argument?
If yes, then it seems like it will be done only in case of p2m_remove_mapping() 
what
will mean that alongside with p2m_invalid INVALID_MFN will be also passed, what 
means
this entry isn't expected to be used anymore.


Overall I think I'm lacking clarity what you mean to use this predicate
for.


By using of "p2me_" predicate I wanted to express that not PTE's valid bit will 
be
checked, but the type saved in radix tree will be used.
As suggested above probably it will be better drop "e" too and just use 
p2m_type_is_valid().




+static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
+{
+write_pte(p, pte);
+if ( clean_pte )
+clean_dcache_va_range(p, sizeof(*p));
+}
+
+static inline void p2m_remove_pte(pte_t *p, bool clean_pte)
+{
+pte_t pte;
+
+memset(&pte, 0x00, sizeof(pte));
+p2m_write_pte(p, pte, clean_pte);
+}

May I suggest "clear" instead

Re: [PATCH 3/3] hvmloader: add new SMBIOS tables (7,8,9,26,27,28)

2025-07-07 Thread Petr Beneš

On Mon, Jul 7, 2025 at 8:56 AM Jan Beulich  wrote:
>
> Well, in the common case the original author would never change, and it would
> be their S-o-b that remains first forever. Anything else would need 
> explaining.

So, I should swap the two S-o-b lines, gotcha.

P.

Re: [PATCH] x86/PVH: extend checking in hwdom_fixup_p2m()

2025-07-07 Thread Jan Beulich

On 07.07.2025 16:44, Jan Beulich wrote:
> We're generally striving to minimize behavioral differences between PV
> and PVH Dom0. Using (just?) is_memory_hole() in the PVH case looks quite
> a bit weaker to me, compared to the page ownership check done in the PV
> case. Extend checking accordingly.
> 
> Signed-off-by: Jan Beulich 
> ---
> The addition may actually be suitable to replace the use of
> is_memory_hole() here. While dropping that would in particular extend
> coverage to E820_RESERVED regions, those are identity-mapped anyway
> (albeit oddly enough still by IOMMU code).

This really is more of a by-product of me looking into the failing (by
default) memory accesses that Dom0 is doing on systems I can put my hands
on. Those accesses are indeed occurring in the context of ACPI evaluating
e.g. _INI or _STA methods, and at least some of the underlying memory
regions also appear in /proc/iomem. Which means that in principle there
could be the option of Dom0 informing us of (at least some of) such
regions.

However, the accesses occur ahead of the kernel actually collecting
resource data. In particular, the intention would have been for the
to-be-added sub-op to be invoked from drivers/pnp/system.c:reserve_range().
That runs from an fs_initcall though, while the accesses occur from
underneath acpi_bus_init() and acpi_scan_init(), both called from
acpi_init(), which is a subsys_initcall. It also doesn't look as if
re-arranging things in Linux would be reasonably possible. Hence the only
(vague) option might be to duplicate some of what is already being done.
Yet besides this appearing undesirable to me, even if we hooked such
duplicate code up from acpi_arch_init(), that would only cover accesses
from underneath acpi_scan_init(); acpi_bus_init() runs yet earlier.

Nevertheless, for the sake of completeness I'll reproduce my draft, patch
(not having the intended effect) below. For the moment, however, I think
I need to give up my hope that we could deal with the problem by other
than the "dom0=pf-fixup" approach.

Jan
---
Along the lines of the respective remark in "x86/PVH: extend checking
in hwdom_fixup_p2m()" submitted earlier, is_memory_hole() isn't used in
PHYSDEVOP_system_memory handling.

The new physdev-op is intended to be invoked from e.g. Linux'es
drivers/pnp/system.c:reserve_range().

Whether to make this a physdev-op or a platform-op wasn't quite clear
to me; the two have a pretty fuzzy boundary between them.

By observation, the size field doesn't need to be 64 bits wide. If we
still wanted to make it so, we'd definitely need to add preemption
checking, too. (We may need to do so anyway, I simply wasn't sure.)

There are sub-page regions being reported on the system I'm looking at.
Not sure what to do about them.

--- unstable.orig/xen/arch/x86/hvm/hypercall.c
+++ unstable/xen/arch/x86/hvm/hypercall.c
@@ -88,6 +88,7 @@ long hvm_physdev_op(int cmd, XEN_GUEST_H
 case PHYSDEVOP_pci_device_remove:
 case PHYSDEVOP_pci_device_reset:
 case PHYSDEVOP_dbgp_op:
+case PHYSDEVOP_system_memory:
 if ( !is_hardware_domain(currd) )
 return -ENOSYS;
 break;
--- unstable.orig/xen/arch/x86/physdev.c
+++ unstable/xen/arch/x86/physdev.c
@@ -27,6 +27,8 @@ int physdev_unmap_pirq(struct domain *d,
 #ifndef COMPAT
 typedef long ret_t;

+#include 
+
 static int physdev_hvm_map_pirq(
 struct domain *d, int type, int *index, int *pirq)
 {
@@ -619,6 +621,71 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
 break;
 }

+#ifndef COMPAT
+case PHYSDEVOP_system_memory: {
+struct xen_physdev_sysmem_op op;
+
+if ( !is_hardware_domain(currd) )
+ret = -EPERM;
+else if ( copy_from_guest(&op, arg, 1) )
+ret = -EFAULT;
+else if ( !op.size )
+ret = 0;
+else if ( (op.addr | op.size) & (PAGE_SIZE - 1) ||
+  (op.addr >> paddr_bits) ||
+  ((op.addr + op.size - 1) >> paddr_bits) ||
+  op.flags )
+ret = -EINVAL;
+else
+{
+mfn_t mfn = maddr_to_mfn(op.addr);
+
+if ( mfn_valid(mfn) )
+{
+struct page_info *pg = mfn_to_page(mfn);
+const struct domain *owner = page_get_owner_and_reference(pg);
+
+if ( owner )
+put_page(pg);
+if ( owner != dom_io )
+return -EINVAL;
+}
+
+/*
+ * Avoid the need to fix up behind a PVH Dom0, by inserting the
+ * desired mapping right away.  See also hwdom_fixup_p2m().
+ */
+if ( !is_hvm_domain(currd) )
+ret = 0;
+else if ( !iomem_access_permitted(currd, mfn_x(mfn),
+  (mfn_x(mfn) +
+   PFN_DOWN(op.size) - 1)) ||
+  altp2m_active(currd) )
+ret = -EACC

Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()

2025-07-07 Thread Oleksii Kurochko



On 7/7/25 2:53 PM, Jan Beulich wrote:

On 07.07.2025 13:46, Oleksii Kurochko wrote:

On 7/7/25 9:20 AM, Jan Beulich wrote:

On 04.07.2025 17:01, Oleksii Kurochko wrote:

On 7/1/25 3:49 PM, Jan Beulich wrote:

On 10.06.2025 15:05, Oleksii Kurochko wrote:

This patch introduces p2m_set_entry() and its core helper __p2m_set_entry() for
RISC-V, based loosely on the Arm implementation, with several RISC-V-specific
modifications.

Key differences include:
- TLB Flushing: RISC-V allows caching of invalid PTEs and does not require
 break-before-make (BBM). As a result, the flushing logic is simplified.
 TLB invalidation can be deferred until p2m_write_unlock() is called.
 Consequently, the p2m->need_flush flag is always considered true and is
 removed.
- Page Table Traversal: The order of walking the page tables differs from Arm,
 and this implementation reflects that reversed traversal.
- Macro Adjustments: The macros P2M_ROOT_LEVEL, P2M_ROOT_ORDER, and
 P2M_ROOT_PAGES are updated to align with the new RISC-V implementation.

The main functionality is in __p2m_set_entry(), which handles mappings aligned
to page table block entries (e.g., 1GB, 2MB, or 4KB with 4KB granularity).

p2m_set_entry() breaks a region down into block-aligned mappings and calls
__p2m_set_entry() accordingly.

Stub implementations (to be completed later) include:
- p2m_free_entry()

What would a function of this name do?

Recursively visiting all leaf PTE's for sub-tree behind an entry, then calls
put_page() (which will free if there is no any reference to this page),
freeing intermediate page table (after all entries were freed) by removing
it from d->arch.paging.freelist, and removes correspondent page of intermediate 
page
table from p2m->pages list.


You can clear entries, but you can't
free them, can you?

Is is a question regarding terminology?

Yes. If one sees a call to a function, it should be possible to at least
roughly know what it does without needing to go look at the implementation.


I can't free entry itself, but a page table or
a page (if it is a leaf entry) on which it points could free.

Then e.g. pte_free_subtree() or some such?

It sounds fine to me. I'll use suggested name.

Just want to notice that other arches also have the same function
for the same purpose with the same name.

As to x86, it's not general P2M code which uses this odd (for the purpose)
name, but only p2m-pt.c.


Does it make sense then to change a name for all arches?

I think so.


+{
+panic("%s: isn't implemented for now\n", __func__);
+
+return false;
+}

For this function in particular, though: Besides the "p2me" in the name
being somewhat odd (supposedly page table entries here are simply pte_t),
how is this going to be different from pte_is_valid()?

pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
what is a type stored in the radix tree (p2m->p2m_types):
 /*
  * In the case of the P2M, the valid bit is used for other purpose. Use
  * the type to check whether an entry is valid.
  */
 static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
 {
 return p2m_type_radix_get(p2m, pte) != p2m_invalid;
 }

It is done to track which page was modified by a guest.

But then (again) the name doesn't convey what the function does.

Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would 
better.

For P2M type checks please don't invent new naming, but use what both x86
and Arm are already using. Note how we already have p2m_is_valid() in that
set. Just that it's not doing what you want here.


Hm, why not doing what I want? p2m_is_valid() verifies if P2M entry is valid.
And in here it is checked if P2M pte is valid from P2M point of view by checking
the type in radix tree and/or in reserved PTEs bits (just to remind we have 
only 2
free bits for type).




   Plus
can't a guest also arrange for an entry's type to move to p2m_invalid?
That's then still an entry that was modified by the guest.

I am not really sure that I fully understand the question.
Do you ask if a guest can do something which will lead to a call of 
p2m_set_entry()
with p2m_invalid argument?

That I'm not asking, but rather stating. I.e. I expect such is possible.


If yes, then it seems like it will be done only in case of p2m_remove_mapping() 
what
will mean that alongside with p2m_invalid INVALID_MFN will be also passed, what 
means
this entry isn't expected to be used anymore.

Right. But such an entry would still have been "modified" by the guest.


Yes, but nothing then is needed to do with it. For example, if it is already 
invalid there
is not any sense to flush page to RAM (as in this case PTE's bit will be 
checked),
something like Arm does:
  https://elixir.bootlin.com/xen/v4.20.0/source/xen/arch/arm/p2m.c#L375


+/*
+ * Don't take into account the MFN when removing mapping (i.e
+ * MFN_INVALID) to calculate the correct targe

Re: [PATCH v5 14/18] xen/cpufreq: introduce GET_CPUFREQ_CPPC sub-cmd

2025-07-07 Thread Jason Andryuk





On 2025-07-06 23:31, Penny, Zheng wrote:

[Public]


-Original Message-
From: Jan Beulich 
Sent: Friday, July 4, 2025 5:46 PM
To: Penny, Zheng 
Cc: Huang, Ray ; Anthony PERARD
; Juergen Gross ; Andrew
Cooper ; Orzel, Michal ;
Julien Grall ; Roger Pau Monné ; Stefano
Stabellini ; xen-devel@lists.xenproject.org; Andryuk, 
Jason

Subject: Re: [PATCH v5 14/18] xen/cpufreq: introduce GET_CPUFREQ_CPPC sub-
cmd

On 04.07.2025 10:56, Penny, Zheng wrote:

[Public]


-Original Message-
From: Jan Beulich 
Sent: Friday, July 4, 2025 4:33 PM
To: Penny, Zheng ; Andryuk, Jason

Cc: Huang, Ray ; Anthony PERARD
; Juergen Gross ; Andrew
Cooper ; Orzel, Michal
; Julien Grall ; Roger Pau
Monné ; Stefano Stabellini
; xen-devel@lists.xenproject.org
Subject: Re: [PATCH v5 14/18] xen/cpufreq: introduce GET_CPUFREQ_CPPC
sub- cmd

On 04.07.2025 10:13, Penny, Zheng wrote:

[Public]


-Original Message-
From: Jan Beulich 
Sent: Tuesday, June 17, 2025 6:08 PM
To: Penny, Zheng 
Cc: Huang, Ray ; Anthony PERARD
; Juergen Gross ;
Andrew Cooper ; Orzel, Michal
; Julien Grall ; Roger Pau
Monné ; Stefano Stabellini
; xen-devel@lists.xenproject.org
Subject: Re: [PATCH v5 14/18] xen/cpufreq: introduce
GET_CPUFREQ_CPPC
sub- cmd

On 27.05.2025 10:48, Penny Zheng wrote:

--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -69,6 +69,7 @@ void show_help(void)
  " set-max-cstate|'unlimited' [|'unlimited']\n"
  " set the C-State limitation 
( >= 0)

and\n"

  " optionally the C-sub-state 
limitation

( >= 0)\n"

+" get-cpufreq-cppc  [cpuid]   list cpu cppc parameter of 
CPU

 or all\n"

  " set-cpufreq-cppc  [cpuid] [balance|performance|powersave]

*\n"

  " set Hardware P-State (HWP)

parameters\n"

  " on CPU  or all if 
omitted.\n"
@@ -812,33 +813,7 @@ static void print_cpufreq_para(int cpuid,
struct xc_get_cpufreq_para *p_cpufreq)

  printf("scaling_driver   : %s\n", p_cpufreq->scaling_driver);

-if ( hwp )
-{
-const xc_cppc_para_t *cppc = &p_cpufreq->u.cppc_para;
-
-printf("cppc variables   :\n");
-printf("  hardware limits: lowest [%"PRIu32"] lowest nonlinear

[%"PRIu32"]\n",

-   cppc->lowest, cppc->lowest_nonlinear);
-printf(" : nominal [%"PRIu32"] highest 
[%"PRIu32"]\n",
-   cppc->nominal, cppc->highest);
-printf("  configured limits  : min [%"PRIu32"] max [%"PRIu32"] energy

perf

[%"PRIu32"]\n",

-   cppc->minimum, cppc->maximum, cppc->energy_perf);
-
-if ( cppc->features & XEN_SYSCTL_CPPC_FEAT_ACT_WINDOW )
-{
-unsigned int activity_window;
-const char *units;
-
-activity_window = calculate_activity_window(cppc, &units);
-printf(" : activity_window [%"PRIu32" %s]\n",
-   activity_window, units);
-}
-
-printf(" : desired [%"PRIu32"%s]\n",
-   cppc->desired,
-   cppc->desired ? "" : " hw autonomous");
-}
-else
+if ( !hwp )
  {
  if ( p_cpufreq->gov_num )
  printf("scaling_avail_gov: %s\n",


I'm not sure it is a good idea to alter what is being output for 
get-cpufreq-para.
People may simply miss that output then, without having any
indication where it went.


Hwp is more like amd-cppc in active mode. It also does not rely on
Xen governor to do performance tuning, so in previous design, we
could borrow

governor filed to output cppc info However after introducing amd-cppc
passive mode, we have request to output both governor and CPPC info.
And if continuing expanding get-cpufreq-para to include CPPC info, it
will make the parent stuct xen_sysctl.u exceed exceed 128 bytes.

How is the xenpm command "get-cpufreq-para" related to the sysctl
interface struct size? If you need to invoke two sysctl sub-ops to
produce all relevant "get-cpufreq- para" output, so be it I would say.



Because we are limiting each sysctl-subcmd-struct, such as " struct

xen_sysctl_pm_op ", 128 bytes in "struct xen_sysctl"，They are all combined as a
union.

Also, in "struct xen_sysctl_pm_op", its descending sub-op structs,
including "struct xen_get_cpufreq_para", are all combined as union too
``` struct xen_sysctl_pm_op {
 ... ...
 union {
 struct xen_get_cpufreq_para get_para;
 struct xen_set_cpufreq_gov  set_gov;
 struct xen_set_cpufreq_para set_para;
 struct xen_set_cppc_paraset_cppc;
 uint64_aligned_t get_avgfreq;
 uint32_tset_sched_opt_smt;
#define XEN_SYSCTL_CX_UNLIMITED 0xU
 uint32_tget_max_cstate;
 uint32_tset_max_cstate;
 } u;
}

Re: [PATCH] xen/efi: Fix crash with initial empty EFI options

2025-07-07 Thread Frediano Ziglio

On Mon, Jul 7, 2025 at 4:42 PM Jan Beulich  wrote:
>
> On 07.07.2025 17:11, Frediano Ziglio wrote:
> > EFI code path split options from EFI LoadOptions fields in 2
> > pieces, first EFI options, second Xen options.
> > "get_argv" function is called first to get the number of arguments
> > in the LoadOptions, second, after allocating enough space, to
> > fill some "argc"/"argv" variable. However the first parsing could
> > be different from second as second is able to detect "--" argument
> > separator. So it was possible that "argc" was bigger that the "argv"
> > array leading to potential buffer overflows, in particular
> > a string like "-- a b c" would lead to buffer overflow in "argv"
> > resulting in crashes.
> > Using EFI shell is possible to pass any kind of string in
> > LoadOptions.
> >
> > Fixes: 201f261e859e ("EFI: move x86 boot/runtime code to common/efi")
>
> This only moves the function, but doesn't really introduce any issue afaics.
>

Okay, I'll follow the rename

> > --- a/xen/common/efi/boot.c
> > +++ b/xen/common/efi/boot.c
> > @@ -345,6 +345,7 @@ static unsigned int __init get_argv(unsigned int argc, 
> > CHAR16 **argv,
> >  VOID *data, UINTN size, UINTN *offset,
> >  CHAR16 **options)
> >  {
> > +CHAR16 **const orig_argv = argv;
> >  CHAR16 *ptr = (CHAR16 *)(argv + argc + 1), *prev = NULL, *cmdline = 
> > NULL;
> >  bool prev_sep = true;
> >
> > @@ -384,7 +385,7 @@ static unsigned int __init get_argv(unsigned int argc, 
> > CHAR16 **argv,
> >  {
> >  cmdline = data + *offset;
> >  /* Cater for the image name as first component. */
> > -++argc;
> > +++argv;
>
> We're on the argc == 0 and argv == NULL path here. Incrementing NULL is UB,
> if I'm not mistaken.
>

Not as far as I know. Why? Some systems even can use NULL pointers as
valid, like mmap.

> > @@ -402,7 +403,7 @@ static unsigned int __init get_argv(unsigned int argc, 
> > CHAR16 **argv,
> >  {
> >  if ( cur_sep )
> >  ++ptr;
> > -else if ( argv )
> > +else if ( orig_argv )
> >  {
> >  *ptr = *cmdline;
> >  *++ptr = 0;
> > @@ -410,8 +411,8 @@ static unsigned int __init get_argv(unsigned int argc, 
> > CHAR16 **argv,
> >  }
> >  else if ( !cur_sep )
> >  {
> > -if ( !argv )
> > -++argc;
> > +if ( !orig_argv )
> > +++argv;
> >  else if ( prev && wstrcmp(prev, L"--") == 0 )
> >  {
> >  --argv;
>
> As per this, it looks like that on the 1st pass we may indeed overcount
> arguments. But ...
>

I can use again argc if you prefer, not strong about it.

> > @@ -428,9 +429,9 @@ static unsigned int __init get_argv(unsigned int argc, 
> > CHAR16 **argv,
> >  }
> >  prev_sep = cur_sep;
> >  }
> > -if ( argv )
> > +if ( orig_argv )
> >  *argv = NULL;
> > -return argc;
> > +return argv - orig_argv;
> >  }
> >
> >  static EFI_FILE_HANDLE __init get_parent_handle(const EFI_LOADED_IMAGE 
> > *loaded_image,
> > @@ -1348,8 +1349,8 @@ void EFIAPI __init noreturn efi_start(EFI_HANDLE 
> > ImageHandle,
> >(argc + 1) * sizeof(*argv) +
> >loaded_image->LoadOptionsSize,
> >(void **)&argv) == EFI_SUCCESS )
> > -get_argv(argc, argv, loaded_image->LoadOptions,
> > - loaded_image->LoadOptionsSize, &offset, &options);
> > +argc = get_argv(argc, argv, loaded_image->LoadOptions,
> > +loaded_image->LoadOptionsSize, &offset, 
> > &options);
>
> ... wouldn't this change alone cure that problem? And even that I don't
> follow. Below here we have
>
> for ( i = 1; i < argc; ++i )
> {
> CHAR16 *ptr = argv[i];
>
> if ( !ptr )
> break;
>
> and the 2nd pass of get_argv() properly terminates the (possibly too large)
> array with a NULL sentinel. So I wonder what it is that I'm overlooking and
> that is broken.
>
> Jan

I realized that because I got a crash, not just by looking at the code.

The string was something like "-- a b c d":
- the first get_argv call produces a 5 argc;
- you allocate space for 6 pointers and length of the entire string to copy;
- the parser writes a single pointer in argv and returns still 5 as argc;
- returned argc is ignored;
- code "for (i = 1; i < argc; ++i)" starts accessing argv[1] which is
not initialized, in case of garbage you dereference garbage.

Frediano

Re: [PATCH v3 02/22] include/xen/slr-table.h: Secure Launch Resource Table definitions

2025-07-07 Thread Sergii Dmytruk

On Mon, Jul 07, 2025 at 10:29:46AM +0200, Jan Beulich wrote:
> >> Btw, please don't forget to Cc maintainers of code you're changing / 
> >> adding.
> >
> > What do you mean?  I'm running add_maintainers.pl on the patches.
>
> The Cc: list had none of the REST maintainers. (Whether there's a bug in the
> script I can't tell.)

Oh, looks like adding new entries to MAINTAINERS prevented application
of THE REST.  Will try running the script from master next time to
address this.

> >> ... then isn't used right here, instead requiring a cast somewhere 
> >> (presumably,
> >> as code using this isn't visible in this patch).
> >
> > As was mentioned earlier: because size of a pointer between Xen and a
> > bootloader doesn't necessarily match.  What you're proposing can break
> > under certain conditions.
>
> But the header isn't shared with a bootloader's code base. It's private to
> Xen.

Yes, but sources of Xen be compiled with different size of a pointer
which messes up the interpretation of the data.  I tried using
BUILD_BUG_ON() to enforce the pointer is 64-bit and early code stopped
compiling.  The structures must not vary like that.

> >>> +} __packed;
> >>
> >> I similarly keep wondering why there are all these packed attributes here, 
> >> when
> >> (afaics) all of the structures are defined in a way that any padding is 
> >> explicit
> >> anyway.
> >
> > As before: it won't hurt, it's future-proof, just in case and to let
> > reader of the code know the structure must have no padding.  If you want
> > them gone, I can do that, but I think it will make the code harder to
> > maintain.
>
> Well, if there's a maintenance concern, then I would of course like to learn
> about that. I could see such being the case if the file was imported from
> somewhere. But with neither description nor any code comment not saying so,
> that doesn't look to be the case.
>
> Jan

While there is some synchronization to do, that's not what I meant, I
don't think it warrants writing the header in a special way.  __packed
just relieves anyone from checking for padding, while just to remove the
attribute one needs to go through the structures and make sure nothing
will change.

Regard

[PATCH] x86/idle: Implement support for Meteor Lake

2025-07-07 Thread Alex XZ Cypher Zero

Adds support for Meteor Lake C-states. As the spec is identical to Alder Lake 
as per the Intel specs, I've reused the Alder Lake codepath.

Signed-off-by: Alex XZ Cypher Zero 
---
 xen/arch/x86/cpu/mwait-idle.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index 5e98011bfd..c8bf58b150 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -1148,6 +1148,8 @@ static const struct x86_cpu_id intel_idle_ids[] 
__initconstrel = {
ICPU(ICELAKE_D, icx),
ICPU(ALDERLAKE, adl),
ICPU(ALDERLAKE_L,   adl_l),
+   ICPU(METEORLAKE,adl),
+   ICPU(METEORLAKE_L,  adl_l)
ICPU(SAPPHIRERAPIDS_X,  spr),
ICPU(ATOM_GOLDMONT, bxt),
ICPU(ATOM_GOLDMONT_PLUS,bxt),
@@ -1386,6 +1388,8 @@ static void __init mwait_idle_state_table_update(void)
break;
case INTEL_FAM6_ALDERLAKE:
case INTEL_FAM6_ALDERLAKE_L:
+   case INTEL_FAM6_METEORLAKE:
+   case INTEL_FAM6_METEORLAKE_L:
adl_idle_state_table_update();
break;
}
-- 
2.50.0

[PATCH 2/2] xen/x86: introduce AMD_MCE_NONFATAL

2025-07-07 Thread Stefano Stabellini

Today, checking for non-fatal MCE errors on ARM is very invasive: it
involves a periodic timer interrupting the physical CPU execution at
regular intervals. Moreover, when the timer fires, the handler sends an
IPI to all physical CPUs.

Both these actions are disruptive in terms of latency and deterministic
execution times for real-time workloads. They might miss a deadline due
to one of these IPIs. Make it possible to disable non-fatal MCE errors
checking with a new Kconfig option (AMD_MCE_NONFATAL).

Signed-off-by: Stefano Stabellini 
---
RFC. I couldn't find a better way to do this.
---
 xen/arch/x86/Kconfig.cpu   | 15 +++
 xen/arch/x86/cpu/mcheck/amd_nonfatal.c |  3 ++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/Kconfig.cpu b/xen/arch/x86/Kconfig.cpu
index 5fb18db1aa..14e20ad19d 100644
--- a/xen/arch/x86/Kconfig.cpu
+++ b/xen/arch/x86/Kconfig.cpu
@@ -10,6 +10,21 @@ config AMD
  May be turned off in builds targetting other vendors.  Otherwise,
  must be enabled for Xen to work suitably on AMD platforms.
 
+config AMD_MCE_NONFATAL
+   bool "Check for non-fatal MCEs on AMD CPUs"
+   default y
+   depends on AMD
+   help
+ Check for non-fatal MCE errors.
+
+ When this option is on (default), Xen regularly checks for
+ non-fatal MCEs potentially occurring on all physical CPUs. The
+ checking is done via timers and IPI interrupts, which is
+ acceptable in most configurations, but not for real-time.
+
+ Turn this option off if you plan on deploying real-time workloads
+ on Xen.
+
 config INTEL
bool "Support Intel CPUs"
default y
diff --git a/xen/arch/x86/cpu/mcheck/amd_nonfatal.c 
b/xen/arch/x86/cpu/mcheck/amd_nonfatal.c
index 7d48c9ab5f..812e18f612 100644
--- a/xen/arch/x86/cpu/mcheck/amd_nonfatal.c
+++ b/xen/arch/x86/cpu/mcheck/amd_nonfatal.c
@@ -191,7 +191,8 @@ static void cf_check mce_amd_work_fn(void *data)
 
 void __init amd_nonfatal_mcheck_init(struct cpuinfo_x86 *c)
 {
-   if (!(c->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)))
+   if ( !IS_ENABLED(CONFIG_AMD_MCE_NONFATAL) ||
+(!(c->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON))) )
return;
 
/* Assume we are on K8 or newer AMD or Hygon CPU here */
-- 
2.25.1

[PATCH 1/2] xen/x86: don't send IPI to sync TSC when it is reliable

2025-07-07 Thread Stefano Stabellini

On real time configuration with the null scheduler, we shouldn't
interrupt the guest execution unless strictly necessary: the guest could
be a real time guest (e.g. FreeRTOS) and interrupting its execution
could lead to a missed deadline.

The principal source of interruptions is IPIs. Remove the unnecessary
IPI on all physical CPUs to sync the TSC when the TSC is known to be
reliable.

Signed-off-by: Stefano Stabellini 
---
 xen/arch/x86/time.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index 59129f419d..bfd022174a 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -2303,6 +2303,10 @@ static void cf_check time_calibration(void *unused)
 local_irq_enable();
 }
 
+if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+ boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
+return;
+
 cpumask_copy(&r.cpu_calibration_map, &cpu_online_map);
 
 /* @wait=1 because we must wait for all cpus before freeing @r. */
-- 
2.25.1

Re: [XEN PATCH 4/5] device-tree: address violation of MISRA C Rule 5.5

2025-07-07 Thread Stefano Stabellini

On Fri, 4 Jul 2025, Dmytro Prokopchuk1 wrote:
> Address a violation of MISRA C:2012 Rule 5.5:
> "Identifiers shall be distinct from macro names".
> 
> Reports for service MC3A2.R5.5:
> xen/include/xen/fdt-domain-build.h: non-compliant parameter 'copy_to_guest'
> xen/include/xen/guest_access.h: non-compliant macro 'copy_to_guest'
> 
> Rename 'copy_to_guest' function parameter to 'cb' for compliance.
> No functional changes.
> 
> Signed-off-by: Dmytro Prokopchuk 

Nice!


> ---
>  xen/common/device-tree/domain-build.c | 9 -
>  xen/include/xen/fdt-domain-build.h| 4 ++--
>  2 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/xen/common/device-tree/domain-build.c 
> b/xen/common/device-tree/domain-build.c
> index cd01a8b4bc..2b009547d0 100644
> --- a/xen/common/device-tree/domain-build.c
> +++ b/xen/common/device-tree/domain-build.c
> @@ -331,7 +331,7 @@ void __init allocate_memory(struct domain *d, struct 
> kernel_info *kinfo)
>  }
>  
>  void __init dtb_load(struct kernel_info *kinfo,
> - copy_to_guest_phys_cb copy_to_guest)
> + copy_to_guest_phys_cb cb)
>  {
>  unsigned long left;
>  
> @@ -339,7 +339,7 @@ void __init dtb_load(struct kernel_info *kinfo,
> kinfo->d, kinfo->dtb_paddr,
> kinfo->dtb_paddr + fdt_totalsize(kinfo->fdt));
>  
> -left = copy_to_guest(kinfo->d, kinfo->dtb_paddr,
> +left = cb(kinfo->d, kinfo->dtb_paddr,
>   kinfo->fdt,
>   fdt_totalsize(kinfo->fdt));

NIT: code style, alignment

Reviewed-by: Stefano Stabellini 


>  
> @@ -350,7 +350,7 @@ void __init dtb_load(struct kernel_info *kinfo,
>  }
>  
>  void __init initrd_load(struct kernel_info *kinfo,
> -copy_to_guest_phys_cb copy_to_guest)
> +copy_to_guest_phys_cb cb)
>  {
>  const struct boot_module *mod = kinfo->initrd;
>  paddr_t load_addr = kinfo->initrd_paddr;
> @@ -393,8 +393,7 @@ void __init initrd_load(struct kernel_info *kinfo,
>  if ( !initrd )
>  panic("Unable to map the %pd initrd\n", kinfo->d);
>  
> -res = copy_to_guest(kinfo->d, load_addr,
> -initrd, len);
> +res = cb(kinfo->d, load_addr, initrd, len);
>  if ( res != 0 )
>  panic("Unable to copy the initrd in the %pd memory\n", kinfo->d);
>  
> diff --git a/xen/include/xen/fdt-domain-build.h 
> b/xen/include/xen/fdt-domain-build.h
> index 45981dbec0..3a20623cf5 100644
> --- a/xen/include/xen/fdt-domain-build.h
> +++ b/xen/include/xen/fdt-domain-build.h
> @@ -50,10 +50,10 @@ typedef unsigned long (*copy_to_guest_phys_cb)(struct 
> domain *d,
> unsigned int len);
>  
>  void initrd_load(struct kernel_info *kinfo,
> - copy_to_guest_phys_cb copy_to_guest);
> + copy_to_guest_phys_cb cb);
>  
>  void dtb_load(struct kernel_info *kinfo,
> -  copy_to_guest_phys_cb copy_to_guest);
> +  copy_to_guest_phys_cb cb);
>  
>  int find_unallocated_memory(const struct kernel_info *kinfo,
>  const struct membanks *mem_banks[],
> -- 
> 2.43.0
>

Re: [PATCH v3 03/22] x86/boot: add MLE header and Secure Launch entry point

2025-07-07 Thread Sergii Dmytruk

On Thu, Jul 03, 2025 at 12:25:27PM +0200, Jan Beulich wrote:
> On 30.05.2025 15:17, Sergii Dmytruk wrote:
> > From: Kacper Stojek 
> >
> > Signed-off-by: Kacper Stojek 
> > Signed-off-by: Krystian Hebel 
> > Signed-off-by: Sergii Dmytruk 
>
> Such a change can hardly come without any description. As just one aspect,
> neither here nor ...
>
> > --- a/docs/hypervisor-guide/x86/how-xen-boots.rst
> > +++ b/docs/hypervisor-guide/x86/how-xen-boots.rst
> > @@ -55,6 +55,11 @@ If ``CONFIG_PVH_GUEST`` was selected at build time, an 
> > Elf note is included
> >  which indicates the ability to use the PVH boot protocol, and registers
> >  ``__pvh_start`` as the entrypoint, entered in 32bit mode.
> >
> > +A combination of Multiboot 2 and MLE headers is used to implement DRTM for
> > +legacy (BIOS) boot. The separate entry point is used mainly to 
> > differentiate
>
> ... here the MLE acronym is being deciphered. Same for DRTM here. There's
> also no reference anywhere as to some kind of spec (except in the cover
> letter, but that won't land in the tree).

Will add more details.

> > +from other kinds of boots. It moves a magic number to EAX before jumping 
> > into
> > +common startup code.
> > +
> >
> >  xen.gz
> >  ~~
>
> Any reason the single blank line is converted to a double one? Generally, in
> particular for patch context to be more meaningful, we'd prefer to not have
> double blank lines. In documentation they _sometimes_ may be warranted.

Take a closer look, the patch just preserves double blank lines which
are used consistently to separate sections within this file.

> > --- a/xen/arch/x86/boot/head.S
> > +++ b/xen/arch/x86/boot/head.S
> > @@ -4,6 +4,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -126,6 +127,25 @@ multiboot2_header:
> >  .size multiboot2_header, . - multiboot2_header
> >  .type multiboot2_header, @object
> >
> > +.balign 16
> > +mle_header:
> > +.long   0x9082ac5a  /* UUID0 */
> > +.long   0x74a7476f  /* UUID1 */
> > +.long   0xa2555c0f  /* UUID2 */
> > +.long   0x42b651cb  /* UUID3 */
> > +.long   0x0034  /* MLE header size */
>
> Better use an expression (difference of two labels)?

Won't hurt.

> > +.long   0x00020002  /* MLE version 2.2 */
> > +.long   (slaunch_stub_entry - start)  /* Linear entry point of MLE 
> > (SINIT virt. address) */
> > +.long   0x  /* First valid page of MLE */
> > +.long   0x  /* Offset within binary of first byte of MLE */
> > +.long   (_end - start)  /* Offset within binary of last byte + 1 
> > of MLE */
>
> Is the data here describing xen.gz or (rather) xen.efi? In the latter case,
> does data past _end (in particular the .reloc section) not matter here?

Eventually, both.  EFI case deals with loaded image which, I believe,
should have all relocations applied at the time of measurement.

> > +.long   0x0723  /* Bit vector of MLE-supported capabilities */
> > +.long   0x  /* Starting linear address of command line 
> > (unused) */
> > +.long   0x  /* Ending linear address of command line 
> > (unused) */
> > +
> > +.size mle_header, .-mle_header
> > +.type mle_header, @object
>
> Please use what xen/linkage.h provides now.

OK.

> However, the entire additions here and below likely want to go inside some
> #ifdef CONFIG_xyz, just like additions in subsequent patches. Which obviously
> would require a suitable Kconfig option to be introduced up front.

Will add CONFIG_SLAUNCH.

> > @@ -332,6 +352,38 @@ cs32_switch:
> >  /* Jump to earlier loaded address. */
> >  jmp *%edi
> >
> > +/*
> > + * Entry point for TrenchBoot Secure Launch on Intel TXT platforms.
> > + *
> > + * CPU is in 32b protected mode with paging disabled. On entry:
> > + * - %ebx = %eip = MLE entry point,
> > + * - stack pointer is undefined,
> > + * - CS is flat 4GB code segment,
> > + * - DS, ES, SS, FS and GS are undefined according to TXT SDG, but 
> > this
> > + *   would make it impossible to initialize GDTR, because GDT base 
> > must
> > + *   be relocated in the descriptor, which requires write access 
> > that
> > + *   CS doesn't provide. Instead we have to assume that DS is set 
> > by
> > + *   SINIT ACM as flat 4GB data segment.
>
> Do you really _have to_? At least as plausibly SS might be properly set up,
> while DS might not be.

"have to" is referring to the fact that making this assumption is forced
on the implementation.  LGDT instruction uses DS in the code below,
hence it's DS.

> > + * Additional restrictions:
> > + * - some MSRs are partially cleared, among them IA32_MISC_ENABLE, 
> > so
> > + *   some capabilities might be reported as disabled even if they 
> > are
> > +

Re: [PATCH] xen/arm: Fix booting hwdom/1:1 domU with CONFIG_GRANT_TABLE=n

2025-07-07 Thread Stefano Stabellini

On Mon, 30 Jun 2025, Luca Fancellu wrote:
> Hi Michal,
> 
> > On 25 Jun 2025, at 11:12, Michal Orzel  wrote:
> > 
> > At the moment, we unconditionally allocate space for grant table region
> > membank and add it in the membanks array to find_unallocated_memory() to
> > find unused memory. In case of CONFIG_GRANT_TABLE=n, the size of the
> > region is empty and assertion in rangeset_remove_range() fails when
> > booting hwdom or 1:1 domU without IOMMU. Example:
> > 
> > (XEN) Assertion 's <= e' failed at common/rangeset.c:189
> > ...
> > (XEN) Xen call trace:
> > (XEN)[<0a218b5c>] rangeset_remove_range+0xbc/0x2d4 (PC)
> > (XEN)[<0a2b8370>] find_unallocated_memory+0x140/0x208 (LR)
> > (XEN)[<0a2cc28c>] make_hypervisor_node+0x310/0x7e0
> > ...
> > 
> > Same issue would occur when booting hwdom with LLC coloring enabled.
> > Fix it by performing conditional allocation and configuration.
> > 
> > Signed-off-by: Michal Orzel 
> 
> The patch looks good to me, I’ve reproduced locally the issue and tested that 
> this patch
> solves it, using FVP.
> 
> Reviewed-by: Luca Fancellu 
> Tested-by: Luca Fancellu 

Reviewed-by: Stefano Stabellini

Re: [PATCH 4/4] xsm/dummy: Allow hwdom SYSCTL_readconsole/physinfo

2025-07-07 Thread Stefano Stabellini

On Fri, 20 Jun 2025, Jan Beulich wrote:
> On 19.06.2025 02:36, Stefano Stabellini wrote:
> > On Tue, 17 Jun 2025, Jan Beulich wrote:
> >> On 17.06.2025 02:10, Stefano Stabellini wrote:
> >>> On Mon, 16 Jun 2025, Jan Beulich wrote:
>  On 14.06.2025 00:51, Stefano Stabellini wrote:
> > On Wed, 11 Jun 2025, Jason Andryuk wrote:
> >> On 2025-06-11 09:27, Jan Beulich wrote:
> >>> On 11.06.2025 00:57, Jason Andryuk wrote:
>  Allow the hwdom to access the console, and to access physical
>  information about the system.
> 
>  xenconsoled can read Xen's dmesg.  If it's in hwdom, then that
>  permission would be required.
> >>>
> >>> Why would xenconsoled run in the hardware domain? It's purely a 
> >>> software
> >>> construct, isn't it? As a daemon, putting it in the control domain may
> >>> make sense. Otherwise it probably ought to go in a service domain.
> >>
> >> My approach has been to transform dom0 into the hardware domain and 
> >> add a new
> >> control domain.  xenconsoled was left running in the hardware domain.
> >
> > I think we should keep xenconsoled in the hardware domain because the
> > control domain should be just optional. (However, one could say that 
> > with
> > Denis' recent changes xenconsoled is also optional because one can use
> > console hypercalls or emulators (PL011, NS16550) for all DomUs.)
> >
> >
> >
> >> I suppose it could move.  Maybe that would be fine?  I haven't tried. 
> >> The
> >> Hyperlaunch code populates the console grants to point at the hardware 
> >> domain,
> >> and I just followed that.
> >>
> >> One aspect of why I left most things running in the Hardware domain 
> >> was to not
> >> run things in the Control domain.  If Control is the highest privileged
> >> entity, we'd rather run software in lower privileged places. Especially
> >> something like xenconsoled which is receiving data from the domUs.
> >
> > Yes, I agree with Jason. It is a bad idea to run xenconsoled in the
> > Control Domain because the Control Domain is meant to be safe from
> > interference. We want to keep the number of potential vehicles for
> > interference down to a minimum and shared memory between Control Domain
> > and DomUs is certainly a vehicle for interference.
> 
>  As much as it is when xenconsoled runs in the hardware domain? Especially
>  if the hardware domain is also running e.g. PV backends or qemu 
>  instances?
> >>>
> >>> It looks like you are thinking of the possible
> >>> interference from the Hardware Domain to the Control Domain via
> >>> xenconsoled, correct?
> >>
> >> More like interference with the system as a whole, which simply includes
> >> Control.
> >>
> >>> If that is the case, good thinking. I can see that you have really
> >>> understood the essence of the problem we are trying to solve.
> >>>
> >>> That is not an issue because the Control Domain shouldn't use PV
> >>> console. Instead, it should use the console hypercall, or the
> >>> PL011/NS16550 emulators in Xen.
> >>
> >> Well. I think the underlying concept of Control Domain being highly
> >> privileged needs more general discussion. As indicated elsewhere, I
> >> didn't think disaggregation (whichever way done) would leave any
> >> domain with effectively full privilege. I wonder what others think.
> > 
> > Keep in mind that the threat model here is different from the
> > datacenter. 
> > 
> > But the Control Domain is optional. If the user doesn't want it, the
> > user can avoid it.
> > 
> > Even on a fully static system (no VM creation), it is convenient to have
> > a domain that can monitor the others and trigger domain reset (we are
> > reimplementing domain reboot to be more like a soft reset so that the VM
> > doesn't need to be destroyed and recreated).
> 
> Suggesting that in such an environment Control should have no permission
> to create domains. This would avoid various threats, including e.g.
> massive amounts of dynamic memory allocation.

+1


> > As an example, the Control
> > Domain could be used to monitor a non-safe domain such as Android,
> > detect an Android crash, and trigger an Android reboot.
> 
> Yet at the same time it would have to be prevented from interfering with
> any of the critical domains.
> 
> Altogether this doesn't sound like "highest privilege" to me then.

You are right. We should be more precise with our wording.

Re: [PATCH 2/2] xen/x86: address violations of Rule 11.3

2025-07-07 Thread Stefano Stabellini

On Tue, 24 Jun 2025, Jan Beulich wrote:
> On 24.06.2025 02:20, victorm.l...@amd.com wrote:
> > From: Nicola Vetrini 
> > 
> > Use {get,put}_unaligned_t to ensure that reads and writes are
> > safe to perform even on potentially misaligned pointers.
> 
> Also applicable to the Arm patch: Please can such patches mention the
> main subject of the rule, not just the number?

+1


> Overall I'm unconvinced we really want or need this on x86; I'm curious
> what Andrew and Roger think.

To be honest, I had a similar reaction to you, which is why I suggested
on Matrix to:

- deviate the rule in its entirety on x86
- deviate the rule for all mappings except for devmem mappings on ARM

Leaving aside ARM for a second (this is exactly the kind of very
arch-specific behavior that is OK to device differently per
architecture), would you be OK with deviating the rule in its entirety on
x86?

Or do you prefer to continue with this patch?


> Further, even beyond the respective remark
> below, I'd be pretty surprised if these were all of the places that
> would need fiddling with. Mind me asking how the places to touch were
> identified? (This may actually be a good thing to mention in the
> description.)

The patch is simply based on the violations reported by the ECLAIR scan.


> > @@ -388,7 +392,7 @@ static int init_or_livepatch apply_alt_calls(
> >  return -EINVAL;
> >  }
> > 
> > -disp = *(int32_t *)(orig + 2);
> > +disp = get_unaligned_t(int32_t, orig + 2);
> >  dest = *(const void **)(orig + 6 + disp);
> 
> Why is this latter line not also adjusted? The field is expected to be
> aligned, yes, but for the code here there's no guarantee. Imo if this
> was left alone along with applying the suggested change, a code comment
> would need adding.
> 
> > --- a/xen/arch/x86/include/asm/hvm/vlapic.h
> > +++ b/xen/arch/x86/include/asm/hvm/vlapic.h
> > @@ -10,6 +10,7 @@
> >  #define __ASM_X86_HVM_VLAPIC_H__
> > 
> >  #include 
> > +#include 
> >  #include 
> > 
> >  #define vcpu_vlapic(x)   (&(x)->arch.hvm.vlapic)
> > @@ -85,13 +86,13 @@ struct vlapic {
> >  static inline uint32_t vlapic_get_reg(const struct vlapic *vlapic,
> >uint32_t reg)
> >  {
> > -return *((uint32_t *)(&vlapic->regs->data[reg]));
> > +return get_unaligned_t(uint32_t, &vlapic->regs->data[reg]);
> 
> This, aiui (or should I say "I hope"), also addresses another violation
> (casting away of const). Such will want mentioning in the description,
> imo.
> 
> > --- a/xen/arch/x86/setup.c
> > +++ b/xen/arch/x86/setup.c
> > @@ -1249,7 +1249,7 @@ void asmlinkage __init noreturn __start_xen(void)
> > (caps & 2) ? " V2" : "",
> > !(caps & 3) ? " none" : "");
> >  printk("EDID transfer time: %d seconds\n", caps >> 8);
> > -if ( *(u32 *)bootsym(boot_edid_info) == 0x13131313 )
> > +if ( get_unaligned_t(u32, bootsym(boot_edid_info)) == 0x13131313 )
> 
> When touching such, please can you also convert to uint_t?
> 
> Jan
>

Re: [PATCH v2] SUPPORT.md: Document guest PSCI support

2025-07-07 Thread Stefano Stabellini

On Tue, 1 Jul 2025, Mykola Kvach wrote:
> From: Mykola Kvach 
> 
> Add a new entry under "Virtual Hardware, Hypervisor" for guest PSCI
> support on ARM. This documents support for all mandatory functions of
> PSCI 1.1, and separately lists the supported optional functions.
> 
> Signed-off-by: Mykola Kvach 

Reviewed-by: Stefano Stabellini 


> ---
> Changes in v2:
> Addressed review comments from:
> https://patchew.org/Xen/cover.1751020456.git.mykola._5fkv...@epam.com/401d3745a295812fef14a22b0c2a3c6017d588c1.1751020456.git.mykola._5fkv...@epam.com/#f891958a-35cf-4c6d-b1b9-51d34559f...@xen.org
> ---
>  SUPPORT.md | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/SUPPORT.md b/SUPPORT.md
> index f0b5718e84..6a82a92189 100644
> --- a/SUPPORT.md
> +++ b/SUPPORT.md
> @@ -956,6 +956,15 @@ by hwdom. Some platforms use SCMI for access to 
> system-level resources.
>  
>  Status: Supported
>  
> +### ARM: Guest PSCI support
> +
> +Emulated PSCI interface exposed to guests. We support all mandatory
> +functions of PSCI 1.1. See below for the list of optional PSCI call
> +implemented and their status.
> +
> +   Status, Mandatory: Supported
> +   Status, MIGRATE_INFO_TYPE: Supported
> +
>  ## Virtual Hardware, QEMU
>  
>  This section describes supported devices available in HVM mode using a
> -- 
> 2.48.1
>

Re: [PATCH v1 1/5] vpci: const-ify some pdev instances

2025-07-07 Thread Stewart Hildebrand

On 6/11/25 15:28, Stewart Hildebrand wrote:
> On 6/5/25 05:47, Roger Pau Monné wrote:
>> On Sat, May 31, 2025 at 08:53:59AM -0400, Stewart Hildebrand wrote:
>>> Since 622bdd962822 ("vpci/header: handle p2m range sets per BAR"), a
>>> non-const pdev is no longer needed for error handling in
>>> vpci_process_pending(). Const-ify pdev in vpci_process_pending(),
>>> defer_map(), and struct vpci_vcpu.
>>>
>>> Get rid of const-removal workaround in modify_bars().
>>>
>>> Take the opportunity to remove an unused parameter in defer_map().
>>>
>>> Signed-off-by: Stewart Hildebrand 
>>
>> Reviewed-by: Roger Pau Monné 
> 
> Thanks!
> 
>> One further simplification below.
>>
>>> ---
>>> This is prerequisite for ("vpci: use separate rangeset for BAR
>>> unmapping") in order to call defer_map() with a const pdev.

I'm trying a somewhat different approach for the series for v2, and this
patch will no longer strictly be prerequisite. However, this patch seems
to be a desirable cleanup by itself, so I'll send it independently.

>>> ---
>>>  xen/drivers/vpci/header.c | 16 
>>>  xen/include/xen/vpci.h|  2 +-
>>>  2 files changed, 5 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>>> index 1f48f2aac64e..e42c8efa2302 100644
>>> --- a/xen/drivers/vpci/header.c
>>> +++ b/xen/drivers/vpci/header.c
>>> @@ -175,7 +175,7 @@ static void modify_decoding(const struct pci_dev *pdev, 
>>> uint16_t cmd,
>>>  
>>>  bool vpci_process_pending(struct vcpu *v)
>>>  {
>>> -struct pci_dev *pdev = v->vpci.pdev;
>>> +const struct pci_dev *pdev = v->vpci.pdev;
>>>  struct vpci_header *header = NULL;
>>>  unsigned int i;
>>>  
>>> @@ -283,8 +283,7 @@ static int __init apply_map(struct domain *d, const 
>>> struct pci_dev *pdev,
>>>  return rc;
>>>  }
>>>  
>>> -static void defer_map(struct domain *d, struct pci_dev *pdev,
>>> -  uint16_t cmd, bool rom_only)
>>> +static void defer_map(const struct pci_dev *pdev, uint16_t cmd, bool 
>>> rom_only)
>>>  {
>>>  struct vcpu *curr = current;
>>>  
>>> @@ -308,7 +307,7 @@ static void defer_map(struct domain *d, struct pci_dev 
>>> *pdev,
>>>  static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool 
>>> rom_only)
>>>  {
>>>  struct vpci_header *header = &pdev->vpci->header;
>>> -struct pci_dev *tmp, *dev = NULL;
>>> +struct pci_dev *tmp;
>>>  const struct domain *d;
>>>  const struct vpci_msix *msix = pdev->vpci->msix;
>>>  unsigned int i, j;
>>> @@ -450,11 +449,6 @@ static int modify_bars(const struct pci_dev *pdev, 
>>> uint16_t cmd, bool rom_only)
>>>  
>>>  if ( tmp == pdev )
>>>  {
>>> -/*
>>> - * Need to store the device so it's not constified and 
>>> defer_map
>>> - * can modify it in case of error.
>>> - */
>>> -dev = tmp;
>>>  if ( !rom_only )
>>
>> You can now join this with the previous if, and reduce one level of
>> indentation:
>>
>> if ( tmp == pdev && !rom_only )
>> /* comment text */
>> continue;
> 
> Will do. I'll plan to keep your R-b tag for v2 since this is a trivial
> change.
>

Re: [XEN PATCH 2/5] iommu: address violation of MISRA C Rule 5.5

2025-07-07 Thread Stefano Stabellini

On Mon, 7 Jul 2025, Jan Beulich wrote:
> On 04.07.2025 22:39, Dmytro Prokopchuk1 wrote:
> > Address a violation of MISRA C:2012 Rule 5.5:
> > "Identifiers shall be distinct from macro names".
> > 
> > Reports for service MC3A2.R5.5:
> > xen/include/xen/iommu.h: non-compliant struct 'page_list_head'
> > xen/include/xen/mm.h: non-compliant macro 'page_list_head'
> 
> What is this about? There's no code change that I could see which would
> alter the situation here.
> 
> > xen/drivers/passthrough/iommu.c: non-compliant macro 'iommu_quarantine'
> > xen/include/xen/iommu.h: non-compliant variable 'iommu_quarantine'
> > 
> > These external variables ('iommu_pt_cleanup_lock'
> > and 'iommu_pt_cleanup_list') are no longer used
> > in the codebase. Removing them eliminates dead
> > code and ensures compliance with coding standards.
> > Fixes: b5622eb627 (iommu: remove unused iommu_ops method and tasklet, 
> > 2020-09-22)
> 
> That's a different Misra rule, so may better be put in a separate patch?
> Otherwise please at least mention the rule number as well.
> 
> > The variable 'iommu_quarantine' makes sence to use
> > only if 'CONFIG_HAS_PCI=y', so place it inside '#ifdef'.
> 
> Hmm, yes - not nice, but what do you do. I question "makes sense" though:
> Quarantining is possible also without PCI, aiui. Just we don't that right
> now.

Hi Jan,

As far as I can tell the change to #ifdef iommu_quarantine is necessary
to resolve a R5.5 violation here:

#ifdef CONFIG_HAS_PCI
uint8_t __read_mostly iommu_quarantine =
# if defined(CONFIG_IOMMU_QUARANTINE_NONE)
IOMMU_quarantine_none;
# elif defined(CONFIG_IOMMU_QUARANTINE_BASIC)
IOMMU_quarantine_basic;
# elif defined(CONFIG_IOMMU_QUARANTINE_SCRATCH_PAGE)
IOMMU_quarantine_scratch_page;
# endif
#else
# define iommu_quarantine IOMMU_quarantine_none   <<< violation
#endif /* CONFIG_HAS_PCI */

As you can see from the patch series, often it is not nice to find a
resoltution for these R5.5 violations. This is the reason why I
originally suggested to deviate R5.5 entirely.

https://lore.kernel.org/xen-devel/139aa595-8b41-44e7-b205-415443c8c...@suse.com/](https://lore.kernel.org/xen-devel/139aa595-8b41-44e7-b205-415443c8c...@suse.com/

That said, this patch is one of the nicer changes in this series, I
think it is OK.

Re: [XEN PATCH 3/5] x86/irq: address violation of MISRA C Rule 5.5

2025-07-07 Thread Stefano Stabellini

On Mon, 7 Jul 2025, Jan Beulich wrote:
> On 04.07.2025 22:39, Dmytro Prokopchuk1 wrote:
> > Address a violation of MISRA C:2012 Rule 5.5:
> > "Identifiers shall be distinct from macro names".
> > 
> > Reports for service MC3A2.R5.5:
> > xen/include/xen/irq.h: non-compliant function `pirq_cleanup_check(struct 
> > pirq*, struct domain*)'
> > xen/include/xen/irq.h: non-compliant macro `pirq_cleanup_check'
> > 
> > The primary issue stems from the macro and function
> > having identical names, which is confusing and
> > non-compliant with common coding standards.
> > 
> > Change the function name by adding two underscores at the end.
> > 
> > Signed-off-by: Dmytro Prokopchuk 
> 
> I'm not going to NAK this, but I dislike the transformation done. The aliasing
> in this case was intentional, to avoid any caller appearing that would bypass
> the macro. Yes, the double underscores will also stand out (as much as the
> parenthesization that would have been needed to override the protection), but
> still ...

Maybe you can suggest a different name? Looking at the diff, this patch
also seems OKish.

It is possible but difficult to deviate specific instances like this: if
a SAF in-code comment works, then great, otherwise we have to resort to
a regex which makes thing harder to maintain.

Unless a SAF in-code comment works, I think this patch is the best way
to go.

Re: [XEN PATCH 1/5] gnttab: address violation of MISRA C Rule 5.5

2025-07-07 Thread Stefano Stabellini

On Fri, 4 Jul 2025, Dmytro Prokopchuk1 wrote:
> Address a violation of MISRA C:2012 Rule 5.5:
> "Identifiers shall be distinct from macro names".
> 
> Reports for service MC3A2.R5.5:
> xen/common/grant_table.c: non-compliant macro 'update_gnttab_par'
> xen/common/grant_table.c: non-compliant macro 'parse_gnttab_limit'
> 
> The macros above are intended to discard function arguments (unused1, unused2)
> when compiling with different configurations of CONFIG_HYPFS.
> This can lead to confusion and unexpected behavior
> because the macro name and the function name are identical.
> Split the code and create two distinct function signatures.
> This ensures that the code behaves predictably and remains compliant.
> 
> Signed-off-by: Dmytro Prokopchuk 


I realize you tried to address Jan's comment about the global deviation.
In my opinion patch #2 and #3 are still OK, but I think this patch makes
things more confusing and error prone.

Can we find a way to deviate update_gnttab_par and parse_gnttab_limit
either with a SAF in-code comment (docs/misra/safe.json) or with a new
regex deviation (docs/misra/deviations.rst,
automation/eclair_analysis/ECLAIR/deviations.ecl)?


> ---
>  xen/common/grant_table.c | 22 +-
>  1 file changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
> index cf131c43a1..f3282a1d7b 100644
> --- a/xen/common/grant_table.c
> +++ b/xen/common/grant_table.c
> @@ -126,18 +126,12 @@ static void __init cf_check 
> max_maptrack_frames_init(struct param_hypfs *par)
>  update_gnttab_par(opt_max_maptrack_frames, par,
>opt_max_maptrack_frames_val);
>  }
> -#else
> -#define update_gnttab_par(v, unused1, unused2) update_gnttab_par(v)
> -#define parse_gnttab_limit(a, v, unused1, unused2) parse_gnttab_limit(a, v)
> -
> -static void update_gnttab_par(unsigned int val, struct param_hypfs *par,
> -  char *parval)
> -{
> -}
> -#endif
>  
>  static int parse_gnttab_limit(const char *arg, unsigned int *valp,
>struct param_hypfs *par, char *parval)
> +#else
> +static int parse_gnttab_limit(const char *arg, unsigned int *valp)
> +#endif
>  {
>  const char *e;
>  unsigned long val;
> @@ -150,7 +144,9 @@ static int parse_gnttab_limit(const char *arg, unsigned 
> int *valp,
>  return -ERANGE;
>  
>  *valp = val;
> +#ifdef CONFIG_HYPFS
>  update_gnttab_par(val, par, parval);
> +#endif
>  
>  return 0;
>  }
> @@ -161,9 +157,13 @@ custom_runtime_param("gnttab_max_frames", 
> parse_gnttab_max_frames,
>  
>  static int cf_check parse_gnttab_max_frames(const char *arg)
>  {
> +#ifdef CONFIG_HYPFS
>  return parse_gnttab_limit(arg, &opt_max_grant_frames,
>param_2_parfs(parse_gnttab_max_frames),
>opt_max_grant_frames_val);
> +#else
> +return parse_gnttab_limit(arg, &opt_max_grant_frames);
> +#endif
>  }
>  
>  static int cf_check parse_gnttab_max_maptrack_frames(const char *arg);
> @@ -173,9 +173,13 @@ custom_runtime_param("gnttab_max_maptrack_frames",
>  
>  static int cf_check parse_gnttab_max_maptrack_frames(const char *arg)
>  {
> +#ifdef CONFIG_HYPFS
>  return parse_gnttab_limit(arg, &opt_max_maptrack_frames,
>
> param_2_parfs(parse_gnttab_max_maptrack_frames),
>opt_max_maptrack_frames_val);
> +#else
> +return parse_gnttab_limit(arg, &opt_max_maptrack_frames);
> +#endif
>  }
>  
>  #ifndef GNTTAB_MAX_VERSION
> -- 
> 2.43.0
>

Re: [XEN PATCH 5/5] xen/bitops: address violation of MISRA C Rule 5.5

2025-07-07 Thread Stefano Stabellini

On Fri, 4 Jul 2025, Dmytro Prokopchuk1 wrote:
> Address a violation of MISRA C:2012 Rule 5.5:
> "Identifiers shall be distinct from macro names".
> 
> Reports for service MC3A2.R5.5:
> xen/include/xen/bitops.h: non-compliant function '__test_and_set_bit(int, 
> volatile void*)'
> xen/include/xen/bitops.h: non-compliant macro '__test_and_set_bit'
> xen/include/xen/bitops.h: non-compliant function '__test_and_clear_bit(int, 
> volatile void*)'
> xen/include/xen/bitops.h: non-compliant macro '__test_and_clear_bit'
> xen/include/xen/bitops.h: non-compliant function '__test_and_change_bit(int, 
> volatile void*)'
> xen/include/xen/bitops.h: non-compliant macro '__test_and_change_bit'
> xen/include/xen/bitops.h: non-compliant function 'test_bit(int, const 
> volatile void*)'
> xen/include/xen/bitops.h: non-compliant macro 'test_bit'
> 
> The primary issue stems from the macro and function
> having identical names, which is confusing and
> non-compliant with common coding standards.
> Change the functions names by adding two underscores at the end.
> No functional changes.
> 
> Signed-off-by: Dmytro Prokopchuk 

I think these should also be deviated either using a SAF in-code comment
if possible or a regex in
automation/eclair_analysis/ECLAIR/deviations.ecl and
docs/misra/deviations.rst



> ---
>  xen/include/xen/bitops.h | 24 
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/xen/include/xen/bitops.h b/xen/include/xen/bitops.h
> index a4d31ec02a..b292470ad7 100644
> --- a/xen/include/xen/bitops.h
> +++ b/xen/include/xen/bitops.h
> @@ -120,7 +120,7 @@ static always_inline bool generic_test_bit(int nr, const 
> volatile void *addr)
>  }
>  
>  /**
> - * __test_and_set_bit - Set a bit and return its old value
> + * __test_and_set_bit__ - Set a bit and return its old value
>   * @nr: Bit to set
>   * @addr: Address to count from
>   *
> @@ -129,7 +129,7 @@ static always_inline bool generic_test_bit(int nr, const 
> volatile void *addr)
>   * but actually fail.  You must protect multiple accesses with a lock.
>   */
>  static always_inline bool
> -__test_and_set_bit(int nr, volatile void *addr)
> +__test_and_set_bit__(int nr, volatile void *addr)
>  {
>  #ifndef arch__test_and_set_bit
>  #define arch__test_and_set_bit generic__test_and_set_bit
> @@ -139,11 +139,11 @@ __test_and_set_bit(int nr, volatile void *addr)
>  }
>  #define __test_and_set_bit(nr, addr) ({ \
>  if ( bitop_bad_size(addr) ) __bitop_bad_size(); \
> -__test_and_set_bit(nr, addr);   \
> +__test_and_set_bit__(nr, addr); \
>  })
>  
>  /**
> - * __test_and_clear_bit - Clear a bit and return its old value
> + * __test_and_clear_bit__ - Clear a bit and return its old value
>   * @nr: Bit to clear
>   * @addr: Address to count from
>   *
> @@ -152,7 +152,7 @@ __test_and_set_bit(int nr, volatile void *addr)
>   * but actually fail.  You must protect multiple accesses with a lock.
>   */
>  static always_inline bool
> -__test_and_clear_bit(int nr, volatile void *addr)
> +__test_and_clear_bit__(int nr, volatile void *addr)
>  {
>  #ifndef arch__test_and_clear_bit
>  #define arch__test_and_clear_bit generic__test_and_clear_bit
> @@ -162,11 +162,11 @@ __test_and_clear_bit(int nr, volatile void *addr)
>  }
>  #define __test_and_clear_bit(nr, addr) ({   \
>  if ( bitop_bad_size(addr) ) __bitop_bad_size(); \
> -__test_and_clear_bit(nr, addr); \
> +__test_and_clear_bit__(nr, addr);   \
>  })
>  
>  /**
> - * __test_and_change_bit - Change a bit and return its old value
> + * __test_and_change_bit__ - Change a bit and return its old value
>   * @nr: Bit to change
>   * @addr: Address to count from
>   *
> @@ -175,7 +175,7 @@ __test_and_clear_bit(int nr, volatile void *addr)
>   * but actually fail.  You must protect multiple accesses with a lock.
>   */
>  static always_inline bool
> -__test_and_change_bit(int nr, volatile void *addr)
> +__test_and_change_bit__(int nr, volatile void *addr)
>  {
>  #ifndef arch__test_and_change_bit
>  #define arch__test_and_change_bit generic__test_and_change_bit
> @@ -185,11 +185,11 @@ __test_and_change_bit(int nr, volatile void *addr)
>  }
>  #define __test_and_change_bit(nr, addr) ({  \
>  if ( bitop_bad_size(addr) ) __bitop_bad_size(); \
> -__test_and_change_bit(nr, addr);\
> +__test_and_change_bit__(nr, addr);  \
>  })
>  
>  /**
> - * test_bit - Determine whether a bit is set
> + * test_bit__ - Determine whether a bit is set
>   * @nr: bit number to test
>   * @addr: Address to start counting from
>   *
> @@ -197,7 +197,7 @@ __test_and_change_bit(int nr, volatile void *addr)
>   * If two examples of this operation race, one can appear to succeed
>   * but actually fail.  You must protect multiple accesses with a lock.
>   */
> -static always_inline bool test_bit(int nr, const volatile void *addr)
> +static always_inl

Re: [RFC PATCH] xen/flask: estimate max sidtable size

2025-07-07 Thread Stefano Stabellini

On Fri, 4 Jul 2025, Jan Beulich wrote:
> On 04.07.2025 12:10, Sergiy Kibrik wrote:
> > 01.07.25 13:42, Jan Beulich:
> >> On 30.06.2025 10:55, Sergiy Kibrik wrote:
> >>> @@ -54,4 +54,7 @@ $(obj)/policy.bin: FORCE
> >>>   FLASK_BUILD_DIR=$(FLASK_BUILD_DIR) 
> >>> POLICY_FILENAME=$(POLICY_SRC)
> >>>   cmp -s $(POLICY_SRC) $@ || cp $(POLICY_SRC) $@
> >>>   
> >>> +$(obj)/%/se_limits.h: $(obj)/policy.bin
> >>> + $(srcdir)/policy/mkselim.sh $^ $@
> >>
> >> Hmm, that's using the built-in policy, isn't it? What if later another
> >> policy is loaded? Wouldn't it be possible to have ...
> >>
> >>> --- a/xen/xsm/flask/ss/sidtab.c
> >>> +++ b/xen/xsm/flask/ss/sidtab.c
> >>> @@ -13,6 +13,7 @@
> >>>   #include "flask.h"
> >>>   #include "security.h"
> >>>   #include "sidtab.h"
> >>> +#include "se_limits.h"
> >>>   
> >>>   #define SIDTAB_HASH(sid) ((sid) & SIDTAB_HASH_MASK)
> >>>   
> >>> @@ -228,7 +229,7 @@ int sidtab_context_to_sid(struct sidtab *s, struct 
> >>> context *context,
> >>>   if ( sid )
> >>>   goto unlock_out;
> >>>   /* No SID exists for the context.  Allocate a new one. */
> >>> -if ( s->next_sid == UINT_MAX || s->shutdown )
> >>> +if ( s->next_sid == SEPOL_SID_LIMIT || s->shutdown )
> >>
> >> ... more than this many SIDs? What if CONFIG_XSM_FLASK_POLICY isn't even 
> >> set?
> >>
> > 
> > It's using a policy from tools/flask/policy, yes. But not a built-in 
> > policy, just reusing a bit of code from that code. The idea is that we 
> > can have CONFIG_XSM_FLASK_POLICY option disabled yet still be able to 
> > calculate SEPOL_SID_LIMIT.
> > 
> > As for loading another policy at runtime -- the calculated 
> > SEPOL_SID_LIMIT=384 for current master flask policy is still pretty big 
> > limit. From what I can see -- much less No. contexts are being used on a 
> > running system, because most of calculated combinations of 
> > user/role/type are not really usable (e.g. contexts with xen_t or 
> > xenboot_t types and user_1 user are not expected etc). So there should 
> > be enough room even for more complex custom policies.
> 
> But still there could be odd ones. Imo such a static limit can then only be
> introduced via Kconfig option.

I was going to suggest the same approach as Jan. While I appreciate
Sergiy's effort to calculate the limit automatically using mkselim.sh,
I think that for our purposes, a simple Kconfig option specifying the
maximum allocation limit is sufficient.

This type of limit is typically chosen before moving into production,
after extensive experimentation, measurements, and certifications.
Therefore, it is not necessary to make it easier for users to configure
it optimally based on policy.  However, we do need a way to enforce a
limit, and a straightforward Kconfig option would be adequate for that.

Re: [PATCH] docs/misra/rules.rst: allow string literals with memcmp

2025-07-07 Thread Stefano Stabellini

On Thu, 26 Jun 2025, Nicola Vetrini wrote:
> On 2025-06-24 08:11, Jan Beulich wrote:
> > On 24.06.2025 01:45, Stefano Stabellini wrote:
> > > Rule 21.16 is about the types of arguments allowed for memcpy.
> > 
> > Seeing the subject - is it memcmp(), memcpy(), or both? (Writing from
> > home, where I don't have the Misra spec to hand, and hence can't check
> > what coverage the rule has.)
> > 
> > Jan
> 
> In this case the rule covers just memcmp(): "The pointer arguments to the
> Standard Library function memcmp shall point to either a pointer type, an
> essentially signed type, an essentially unsigned type, an essentially Boolean
> type or an essentially enum type"

Jan, given the above, any chance you can ack it?

[XEN PATCH v2] automation/eclair: Make report browsing URL configurable.

2025-07-07 Thread Nicola Vetrini

Currently, the URL where the ECLAIR MISRA C scan reports are saved
is hardcoded; making it configurable allows multiple runners and storage
servers to be used without resorting to publishing all artifacts
to the same report server.

Additionally, reports will be accessed publicly by using a proxy,
therefore the address that needs to be printed in GitLab analysis logs
is that of the public url, rather than the location where they are stored.

Signed-off-by: Alessandro Zucchelli 
Signed-off-by: Nicola Vetrini 
---
Changes in V2:
- Use single variable for eclair_report host and port
- Introduced variable in analyze.yaml for the public url used in logs

This needs to be committed in coordination with setting up the proxy
indicated in analyze.yaml
---
 .../eclair_analysis/ECLAIR/action.helpers | 10 +++
 .../eclair_analysis/ECLAIR/action.settings| 30 ++-
 automation/gitlab-ci/analyze.yaml |  3 ++
 3 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/automation/eclair_analysis/ECLAIR/action.helpers 
b/automation/eclair_analysis/ECLAIR/action.helpers
index 9d4ae1f979f6..3a4c9d2b2855 100644
--- a/automation/eclair_analysis/ECLAIR/action.helpers
+++ b/automation/eclair_analysis/ECLAIR/action.helpers
@@ -58,7 +58,7 @@ summary() {
 ;;
 esac
 
-
currentDbReportsUrl="${eclairReportUrlPrefix}/fs${jobDir}/PROJECT.ecd;/by_service.html#service&kind"
+
currentDbReportsUrl="${eclairResultsUrl}/fs${jobDir}/PROJECT.ecd;/by_service.html#service&kind"
 if [ -z "${newReports}" ]; then
 fixedMsg="No fixed reports as there is no baseline"
 unfixedMsg="Unfixed reports: ${unfixedReports}"
@@ -69,11 +69,11 @@ summary() {
 unfixedMsg="Unfixed reports: ${unfixedReports} [new: ${newReports}]"
 case "${event}" in
 pull_request | auto_pull_request)
-
referenceDbReportsUrl="${eclairReportUrlPrefix}/fs${jobDir}/base/PROJECT.ecd;/by_service.html#service&kind"
+
referenceDbReportsUrl="${eclairResultsUrl}/fs${jobDir}/base/PROJECT.ecd;/by_service.html#service&kind"
 reference_kind=base
 ;;
 push)
-
referenceDbReportsUrl="${eclairReportUrlPrefix}/fs${jobDir}/prev/PROJECT.ecd;/by_service.html#service&kind"
+
referenceDbReportsUrl="${eclairResultsUrl}/fs${jobDir}/prev/PROJECT.ecd;/by_service.html#service&kind"
 reference_kind=previous
 ;;
 *)
@@ -92,7 +92,7 @@ summary() {
 ${fixedMsg}${eol}
 ${unfixedMsg}  
${eol}
 https://www.bugseng.com/eclair";>
-  
+  
 
 ${jobHeadline}
 Browse analysis summary
@@ -106,7 +106,7 @@ EOF
 fi
 cat <"${summaryTxt}"
 https://www.bugseng.com/eclair";>
-  
+  
 
 Analysis Summary
 
diff --git a/automation/eclair_analysis/ECLAIR/action.settings 
b/automation/eclair_analysis/ECLAIR/action.settings
index 1577368b613b..a9904377252a 100644
--- a/automation/eclair_analysis/ECLAIR/action.settings
+++ b/automation/eclair_analysis/ECLAIR/action.settings
@@ -14,9 +14,6 @@ autoPRRepository="${AUTO_PR_REPOSITORY:-}"
 # Customized
 autoPRBranch="${AUTO_PR_BRANCH:-}"
 
-# Customized
-artifactsRoot=/var/local/eclair
-
 case "${ci}" in
 github)
 # To be customized
@@ -166,16 +163,35 @@ esac
 
 ECLAIR_BIN_DIR=/opt/bugseng/eclair/bin/
 
-artifactsDir="${artifactsRoot}/xen-project.ecdf/${repository}/ECLAIR_${ANALYSIS_KIND}"
+# Artifacts URL served by the eclair_report server
+if [ -z "${ECLAIR_ECDF_DIR}" ]
+then
+  echo "WARNING: No ecdf dir supplied, using default"
+fi
+artifactsEcdfDir="${ECLAIR_ECDF_DIR:-/var/local/eclair/xen-project.ecdf}"
+artifactsDir="${artifactsEcdfDir}/${repository}/ECLAIR_${ANALYSIS_KIND}"
 subDir="${subDir}${variantSubDir}"
 jobHeadline="${jobHeadline}${variantHeadline}"
 
-# Customized
-eclairReportUrlPrefix=https://saas.eclairit.com:3787
+# Remote eclair_report hosting server
+if [ -z "${ECLAIR_REPORT_HOST}" ]
+then
+  echo "WARNING: No eclair_report host supplied, using default"
+fi
+
+# URL to browse eclair reports
+if [ -z "${ECLAIR_ANALYSIS_RESULTS}" ]
+then
+  echo "WARNING: No URL to browse analysis results is set, using default"
+fi
+
+eclairReportHost="${ECLAIR_REPORT_HOST:-saas.eclairit.com:3787}"
+eclairReportUrlPrefix="https://${eclairReportHost}";
+eclairResultsUrl="${ECLAIR_ANALYSIS_RESULTS:-${eclairReportUrlPrefix}}"
 
 jobDir="${artifactsDir}/${subDir}/${jobId}"
 updateLog="${analysisOutputDir}/update.log"
 cleanRegressionsLog="${analysisOutputDir}/clean_regressions.log"
 commentLog="${analysisOutputDir}/comment.json"
-indexHtmlUrl="${eclairReportUrlPrefix}/fs${jobDir}/index.html"
+indexHtmlUrl="${eclairResultsUrl}/fs${jobDir}/index.html"
 summaryTxt="${analysisOutputDir}/summary.txt"
diff --git a/automation/gitlab-ci/analyze.yaml 
b/automation/gitlab-ci/analyze.yaml
index 5b00b9f25ca6..9cec0f65a439 100644
--- a/automation/gitlab-ci/analyze.yaml
+++ b/automation/gitlab-ci/anal

Re: [PATCH] xen/efi: Fix crash with initial empty EFI options

2025-07-07 Thread Jan Beulich

On 07.07.2025 17:11, Frediano Ziglio wrote:
> EFI code path split options from EFI LoadOptions fields in 2
> pieces, first EFI options, second Xen options.
> "get_argv" function is called first to get the number of arguments
> in the LoadOptions, second, after allocating enough space, to
> fill some "argc"/"argv" variable. However the first parsing could
> be different from second as second is able to detect "--" argument
> separator. So it was possible that "argc" was bigger that the "argv"
> array leading to potential buffer overflows, in particular
> a string like "-- a b c" would lead to buffer overflow in "argv"
> resulting in crashes.
> Using EFI shell is possible to pass any kind of string in
> LoadOptions.
> 
> Fixes: 201f261e859e ("EFI: move x86 boot/runtime code to common/efi")

This only moves the function, but doesn't really introduce any issue afaics.

> --- a/xen/common/efi/boot.c
> +++ b/xen/common/efi/boot.c
> @@ -345,6 +345,7 @@ static unsigned int __init get_argv(unsigned int argc, 
> CHAR16 **argv,
>  VOID *data, UINTN size, UINTN *offset,
>  CHAR16 **options)
>  {
> +CHAR16 **const orig_argv = argv;
>  CHAR16 *ptr = (CHAR16 *)(argv + argc + 1), *prev = NULL, *cmdline = NULL;
>  bool prev_sep = true;
>  
> @@ -384,7 +385,7 @@ static unsigned int __init get_argv(unsigned int argc, 
> CHAR16 **argv,
>  {
>  cmdline = data + *offset;
>  /* Cater for the image name as first component. */
> -++argc;
> +++argv;

We're on the argc == 0 and argv == NULL path here. Incrementing NULL is UB,
if I'm not mistaken.

> @@ -402,7 +403,7 @@ static unsigned int __init get_argv(unsigned int argc, 
> CHAR16 **argv,
>  {
>  if ( cur_sep )
>  ++ptr;
> -else if ( argv )
> +else if ( orig_argv )
>  {
>  *ptr = *cmdline;
>  *++ptr = 0;
> @@ -410,8 +411,8 @@ static unsigned int __init get_argv(unsigned int argc, 
> CHAR16 **argv,
>  }
>  else if ( !cur_sep )
>  {
> -if ( !argv )
> -++argc;
> +if ( !orig_argv )
> +++argv;
>  else if ( prev && wstrcmp(prev, L"--") == 0 )
>  {
>  --argv;

As per this, it looks like that on the 1st pass we may indeed overcount
arguments. But ...

> @@ -428,9 +429,9 @@ static unsigned int __init get_argv(unsigned int argc, 
> CHAR16 **argv,
>  }
>  prev_sep = cur_sep;
>  }
> -if ( argv )
> +if ( orig_argv )
>  *argv = NULL;
> -return argc;
> +return argv - orig_argv;
>  }
>  
>  static EFI_FILE_HANDLE __init get_parent_handle(const EFI_LOADED_IMAGE 
> *loaded_image,
> @@ -1348,8 +1349,8 @@ void EFIAPI __init noreturn efi_start(EFI_HANDLE 
> ImageHandle,
>(argc + 1) * sizeof(*argv) +
>loaded_image->LoadOptionsSize,
>(void **)&argv) == EFI_SUCCESS )
> -get_argv(argc, argv, loaded_image->LoadOptions,
> - loaded_image->LoadOptionsSize, &offset, &options);
> +argc = get_argv(argc, argv, loaded_image->LoadOptions,
> +loaded_image->LoadOptionsSize, &offset, 
> &options);

... wouldn't this change alone cure that problem? And even that I don't
follow. Below here we have

for ( i = 1; i < argc; ++i )
{
CHAR16 *ptr = argv[i];

if ( !ptr )
break;

and the 2nd pass of get_argv() properly terminates the (possibly too large)
array with a NULL sentinel. So I wonder what it is that I'm overlooking and
that is broken.

Jan

[PATCH v2] vpci: const-ify some pdev instances

2025-07-07 Thread Stewart Hildebrand

Since 622bdd962822 ("vpci/header: handle p2m range sets per BAR"), a
non-const pdev is no longer needed for error handling in
vpci_process_pending(). Const-ify pdev in vpci_process_pending(),
defer_map(), and struct vpci_vcpu.

Get rid of const-removal workaround in modify_bars().

Take the opportunity to remove an unused parameter in defer_map().

Signed-off-by: Stewart Hildebrand 
Reviewed-by: Roger Pau Monné 
---
v1->v2:
* join two conditions into one
* add Roger's R-b
* split from series

v1: 
https://lore.kernel.org/xen-devel/20250531125405.268984-2-stewart.hildebr...@amd.com/
---
 xen/drivers/vpci/header.c | 29 +
 xen/include/xen/vpci.h|  2 +-
 2 files changed, 10 insertions(+), 21 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index d26cbba08ee1..bb76e707992c 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -175,7 +175,7 @@ static void modify_decoding(const struct pci_dev *pdev, 
uint16_t cmd,
 
 bool vpci_process_pending(struct vcpu *v)
 {
-struct pci_dev *pdev = v->vpci.pdev;
+const struct pci_dev *pdev = v->vpci.pdev;
 struct vpci_header *header = NULL;
 unsigned int i;
 
@@ -283,8 +283,7 @@ static int __init apply_map(struct domain *d, const struct 
pci_dev *pdev,
 return rc;
 }
 
-static void defer_map(struct domain *d, struct pci_dev *pdev,
-  uint16_t cmd, bool rom_only)
+static void defer_map(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 {
 struct vcpu *curr = current;
 
@@ -308,7 +307,7 @@ static void defer_map(struct domain *d, struct pci_dev 
*pdev,
 static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 {
 struct vpci_header *header = &pdev->vpci->header;
-struct pci_dev *tmp, *dev = NULL;
+struct pci_dev *tmp;
 const struct domain *d;
 const struct vpci_msix *msix = pdev->vpci->msix;
 unsigned int i, j;
@@ -448,21 +447,13 @@ static int modify_bars(const struct pci_dev *pdev, 
uint16_t cmd, bool rom_only)
  */
 continue;
 
-if ( tmp == pdev )
-{
+if ( tmp == pdev && !rom_only )
 /*
- * Need to store the device so it's not constified and 
defer_map
- * can modify it in case of error.
+ * If memory decoding is toggled avoid checking against the
+ * same device, or else all regions will be removed from the
+ * memory map in the unmap case.
  */
-dev = tmp;
-if ( !rom_only )
-/*
- * If memory decoding is toggled avoid checking against the
- * same device, or else all regions will be removed from 
the
- * memory map in the unmap case.
- */
-continue;
-}
+continue;
 
 for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
 {
@@ -507,8 +498,6 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t 
cmd, bool rom_only)
 d = dom_xen;
 }
 
-ASSERT(dev);
-
 if ( system_state < SYS_STATE_active )
 {
 /*
@@ -523,7 +512,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t 
cmd, bool rom_only)
 return apply_map(pdev->domain, pdev, cmd);
 }
 
-defer_map(dev->domain, dev, cmd, rom_only);
+defer_map(pdev, cmd, rom_only);
 
 return 0;
 }
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index fc8d5b470b0b..6a481a4e89d3 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -196,7 +196,7 @@ struct vpci {
 
 struct vpci_vcpu {
 /* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
-struct pci_dev *pdev;
+const struct pci_dev *pdev;
 uint16_t cmd;
 bool rom_only : 1;
 };

base-commit: 05c574feeb00990b31bb472ef93b43a30a07fd33
-- 
2.49.0

Re: [PATCH v6] automation/eclair: update configuration of D4.10

2025-07-07 Thread Anthony PERARD

On Mon, Jul 07, 2025 at 12:48:06PM +, Anthony PERARD wrote:
> On Mon, Jun 23, 2025 at 06:19:27PM -0700, Stefano Stabellini wrote:
> > diff --git a/xen/include/xen/compile.h.in b/xen/include/xen/compile.h.in
> > index 3151d1e7d1..9206341ba6 100644
> > --- a/xen/include/xen/compile.h.in
> > +++ b/xen/include/xen/compile.h.in
> > @@ -1,3 +1,6 @@
> > +#ifndef XEN_COMPILE_H
> > +#define XEN_COMPILE_H
> > +
> >  #define XEN_COMPILE_DATE   "@@date@@"
> >  #define XEN_COMPILE_TIME   "@@time@@"
> >  #define XEN_COMPILE_BY "@@whoami@@"
> > diff --git a/xen/tools/process-banner.sed b/xen/tools/process-banner.sed
> > index 56c76558bc..4cf3f9a116 100755
> > --- a/xen/tools/process-banner.sed
> > +++ b/xen/tools/process-banner.sed
> > @@ -12,3 +12,8 @@ s_(.*)_"\1\\n"_
> >  
> >  # Trailing \ on all but the final line.
> >  $!s_$_ \\_
> > +
> > +# Append closing header guard
> > +$a\
> > +\
> > +#endif /* XEN_COMPILE_H */
> 
> Is it wise to put the closing header guard in a file call
> "process-banner" ? It's not call compile.h-footer.sed.
> 
> There's a few way to make this better:
> - simple add the header guard from the Makefile, both opening and
>   closing.
> - Do some more sed with something like:
>   sed -rf process-banner.sed < .banner >> .banner.processed.tmp
>   sed -e 's/@@date@@/$(XEN_BUILD_DATE)/g' \
>   ... \
>   -e '/XEN_BANNER/r .banner.processed.tmp'
>   # and having the closing header guard in "compile.h.in"
>   This will add the outpot of process-banner.sed in the lines after
>   "#define XEN_BANNER", and so before the closing header guard.
> - rename the sed command file
> (- a forth option would be to use filechk make macro, but the check for
>  if [ ! -r $@ -o -O $@ ] would be annoying to reproduce.)
> 
> Another thing, this could be done in a patch that isn't called
> "automation/eclair: update configuration of D4.10".

Sorry, I failed to notice the patch was already commited. I guess it's
good enough like that.

-- 

Anthony Perard | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech

Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()

2025-07-07 Thread Oleksii Kurochko



On 7/7/25 5:15 PM, Jan Beulich wrote:

On 07.07.2025 17:00, Oleksii Kurochko wrote:

On 7/7/25 2:53 PM, Jan Beulich wrote:

On 07.07.2025 13:46, Oleksii Kurochko wrote:

On 7/7/25 9:20 AM, Jan Beulich wrote:

On 04.07.2025 17:01, Oleksii Kurochko wrote:

On 7/1/25 3:49 PM, Jan Beulich wrote:

On 10.06.2025 15:05, Oleksii Kurochko wrote:

+{
+panic("%s: isn't implemented for now\n", __func__);
+
+return false;
+}

For this function in particular, though: Besides the "p2me" in the name
being somewhat odd (supposedly page table entries here are simply pte_t),
how is this going to be different from pte_is_valid()?

pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
what is a type stored in the radix tree (p2m->p2m_types):
  /*
   * In the case of the P2M, the valid bit is used for other purpose. Use
   * the type to check whether an entry is valid.
   */
  static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
  {
  return p2m_type_radix_get(p2m, pte) != p2m_invalid;
  }

It is done to track which page was modified by a guest.

But then (again) the name doesn't convey what the function does.

Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would 
better.

For P2M type checks please don't invent new naming, but use what both x86
and Arm are already using. Note how we already have p2m_is_valid() in that
set. Just that it's not doing what you want here.

Hm, why not doing what I want? p2m_is_valid() verifies if P2M entry is valid.
And in here it is checked if P2M pte is valid from P2M point of view by checking
the type in radix tree and/or in reserved PTEs bits (just to remind we have 
only 2
free bits for type).

Because this is how it's defined on x86:

#define p2m_is_valid(_t)(p2m_to_mask(_t) & \
  (P2M_RAM_TYPES | p2m_to_mask(p2m_mmio_direct)))

I.e. more strict that simply "!= p2m_invalid". And I think such predicates
would better be uniform across architectures, such that in principle they
might also be usable in common code (as we already do with p2m_is_foreign()).


Yeah, Arm isn't so strict in definition of p2m_is_valid() and it seems like
x86 and Arm have different understanding what is valid.

Except what mentioned in the comment that grant types aren't considered valid
for x86 (and shouldn't be the same then for Arm?), it isn't clear why x86's
p2m_is_valid() is stricter then Arm's one and if other arches should be also
so strict.
It seems like from the point of view of mapping/unmapping it is enough just
to verify a "copy" of PTE's valid bit (in terms of P2M it is p2m_invalid type).




Plus
can't a guest also arrange for an entry's type to move to p2m_invalid?
That's then still an entry that was modified by the guest.

I am not really sure that I fully understand the question.
Do you ask if a guest can do something which will lead to a call of 
p2m_set_entry()
with p2m_invalid argument?

That I'm not asking, but rather stating. I.e. I expect such is possible.


If yes, then it seems like it will be done only in case of p2m_remove_mapping() 
what
will mean that alongside with p2m_invalid INVALID_MFN will be also passed, what 
means
this entry isn't expected to be used anymore.

Right. But such an entry would still have been "modified" by the guest.

Yes, but nothing then is needed to do with it.

I understand that. Maybe I'm overly picky, but all of the above was in response
to you saying "It is done to track which page was modified by a guest." And I'm
simply trying to get you to use precise wording, both in code comments and in
discussions. In a case like the one here I simply can't judge whether you simply
expressed yourself not clear enough, or whether you indeed meant what you said.


+/*
+ * Don't take into account the MFN when removing mapping (i.e
+ * MFN_INVALID) to calculate the correct target order.
+ *
+ * XXX: Support superpage mappings if nr is not aligned to a
+ * superpage size.
+ */

Does this really need leaving as a to-do?

I think so, yes. It won’t break the current workflow if|nr| isn’t aligned,
a smaller order will simply be chosen.

Well, my question was more like "Isn't it simple enough to cover the case
right away?"


+mask = !mfn_eq(smfn, INVALID_MFN) ? mfn_x(smfn) : 0;
+mask |= gfn_x(sgfn) | nr;
+
+for ( ; i != 0; i-- )
+{
+if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) )
+{
+order = XEN_PT_LEVEL_ORDER(i);
+break;

Nit: Style.


+}
+}
+
+rc = __p2m_set_entry(p2m, sgfn, order, smfn, t, a);
+if ( rc )
+break;
+
+sgfn = gfn_add(sgfn, (1 << order));
+if ( !mfn_eq(smfn, INVALID_MFN) )
+   smfn = mfn_add(smfn, (1 << order));
+
+nr -= (1 << order);

Throughout maybe better be safe right away and use 1UL?


+

[PATCH] xen/efi: Fix crash with initial empty EFI options

2025-07-07 Thread Frediano Ziglio

EFI code path split options from EFI LoadOptions fields in 2
pieces, first EFI options, second Xen options.
"get_argv" function is called first to get the number of arguments
in the LoadOptions, second, after allocating enough space, to
fill some "argc"/"argv" variable. However the first parsing could
be different from second as second is able to detect "--" argument
separator. So it was possible that "argc" was bigger that the "argv"
array leading to potential buffer overflows, in particular
a string like "-- a b c" would lead to buffer overflow in "argv"
resulting in crashes.
Using EFI shell is possible to pass any kind of string in
LoadOptions.

Fixes: 201f261e859e ("EFI: move x86 boot/runtime code to common/efi")
Signed-off-by: Frediano Ziglio 
---
 xen/common/efi/boot.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/xen/common/efi/boot.c b/xen/common/efi/boot.c
index 9306dc8953..597252cfc4 100644
--- a/xen/common/efi/boot.c
+++ b/xen/common/efi/boot.c
@@ -345,6 +345,7 @@ static unsigned int __init get_argv(unsigned int argc, 
CHAR16 **argv,
 VOID *data, UINTN size, UINTN *offset,
 CHAR16 **options)
 {
+CHAR16 **const orig_argv = argv;
 CHAR16 *ptr = (CHAR16 *)(argv + argc + 1), *prev = NULL, *cmdline = NULL;
 bool prev_sep = true;
 
@@ -384,7 +385,7 @@ static unsigned int __init get_argv(unsigned int argc, 
CHAR16 **argv,
 {
 cmdline = data + *offset;
 /* Cater for the image name as first component. */
-++argc;
+++argv;
 }
 }
 }
@@ -402,7 +403,7 @@ static unsigned int __init get_argv(unsigned int argc, 
CHAR16 **argv,
 {
 if ( cur_sep )
 ++ptr;
-else if ( argv )
+else if ( orig_argv )
 {
 *ptr = *cmdline;
 *++ptr = 0;
@@ -410,8 +411,8 @@ static unsigned int __init get_argv(unsigned int argc, 
CHAR16 **argv,
 }
 else if ( !cur_sep )
 {
-if ( !argv )
-++argc;
+if ( !orig_argv )
+++argv;
 else if ( prev && wstrcmp(prev, L"--") == 0 )
 {
 --argv;
@@ -428,9 +429,9 @@ static unsigned int __init get_argv(unsigned int argc, 
CHAR16 **argv,
 }
 prev_sep = cur_sep;
 }
-if ( argv )
+if ( orig_argv )
 *argv = NULL;
-return argc;
+return argv - orig_argv;
 }
 
 static EFI_FILE_HANDLE __init get_parent_handle(const EFI_LOADED_IMAGE 
*loaded_image,
@@ -1348,8 +1349,8 @@ void EFIAPI __init noreturn efi_start(EFI_HANDLE 
ImageHandle,
   (argc + 1) * sizeof(*argv) +
   loaded_image->LoadOptionsSize,
   (void **)&argv) == EFI_SUCCESS )
-get_argv(argc, argv, loaded_image->LoadOptions,
- loaded_image->LoadOptionsSize, &offset, &options);
+argc = get_argv(argc, argv, loaded_image->LoadOptions,
+loaded_image->LoadOptionsSize, &offset, &options);
 else
 argc = 0;
 for ( i = 1; i < argc; ++i )
-- 
2.43.0

Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()

2025-07-07 Thread Jan Beulich

On 07.07.2025 17:00, Oleksii Kurochko wrote:
> On 7/7/25 2:53 PM, Jan Beulich wrote:
>> On 07.07.2025 13:46, Oleksii Kurochko wrote:
>>> On 7/7/25 9:20 AM, Jan Beulich wrote:
 On 04.07.2025 17:01, Oleksii Kurochko wrote:
> On 7/1/25 3:49 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> +{
>>> +panic("%s: isn't implemented for now\n", __func__);
>>> +
>>> +return false;
>>> +}
>> For this function in particular, though: Besides the "p2me" in the name
>> being somewhat odd (supposedly page table entries here are simply pte_t),
>> how is this going to be different from pte_is_valid()?
> pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is 
> checking
> what is a type stored in the radix tree (p2m->p2m_types):
>  /*
>   * In the case of the P2M, the valid bit is used for other purpose. 
> Use
>   * the type to check whether an entry is valid.
>   */
>  static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>  {
>  return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>  }
>
> It is done to track which page was modified by a guest.
 But then (again) the name doesn't convey what the function does.
>>> Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would 
>>> better.
>> For P2M type checks please don't invent new naming, but use what both x86
>> and Arm are already using. Note how we already have p2m_is_valid() in that
>> set. Just that it's not doing what you want here.
> 
> Hm, why not doing what I want? p2m_is_valid() verifies if P2M entry is valid.
> And in here it is checked if P2M pte is valid from P2M point of view by 
> checking
> the type in radix tree and/or in reserved PTEs bits (just to remind we have 
> only 2
> free bits for type).

Because this is how it's defined on x86:

#define p2m_is_valid(_t)(p2m_to_mask(_t) & \
 (P2M_RAM_TYPES | p2m_to_mask(p2m_mmio_direct)))

I.e. more strict that simply "!= p2m_invalid". And I think such predicates
would better be uniform across architectures, such that in principle they
might also be usable in common code (as we already do with p2m_is_foreign()).

Plus
 can't a guest also arrange for an entry's type to move to p2m_invalid?
 That's then still an entry that was modified by the guest.
>>> I am not really sure that I fully understand the question.
>>> Do you ask if a guest can do something which will lead to a call of 
>>> p2m_set_entry()
>>> with p2m_invalid argument?
>> That I'm not asking, but rather stating. I.e. I expect such is possible.
>>
>>> If yes, then it seems like it will be done only in case of 
>>> p2m_remove_mapping() what
>>> will mean that alongside with p2m_invalid INVALID_MFN will be also passed, 
>>> what means
>>> this entry isn't expected to be used anymore.
>> Right. But such an entry would still have been "modified" by the guest.
> 
> Yes, but nothing then is needed to do with it.

I understand that. Maybe I'm overly picky, but all of the above was in response
to you saying "It is done to track which page was modified by a guest." And I'm
simply trying to get you to use precise wording, both in code comments and in
discussions. In a case like the one here I simply can't judge whether you simply
expressed yourself not clear enough, or whether you indeed meant what you said.

>>> +/*
>>> + * Don't take into account the MFN when removing mapping (i.e
>>> + * MFN_INVALID) to calculate the correct target order.
>>> + *
>>> + * XXX: Support superpage mappings if nr is not aligned to a
>>> + * superpage size.
>>> + */
>> Does this really need leaving as a to-do?
> I think so, yes. It won’t break the current workflow if|nr| isn’t aligned,
> a smaller order will simply be chosen.
 Well, my question was more like "Isn't it simple enough to cover the case
 right away?"

>>> +mask = !mfn_eq(smfn, INVALID_MFN) ? mfn_x(smfn) : 0;
>>> +mask |= gfn_x(sgfn) | nr;
>>> +
>>> +for ( ; i != 0; i-- )
>>> +{
>>> +if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) )
>>> +{
>>> +order = XEN_PT_LEVEL_ORDER(i);
>>> +break;
>> Nit: Style.
>>
>>> +}
>>> +}
>>> +
>>> +rc = __p2m_set_entry(p2m, sgfn, order, smfn, t, a);
>>> +if ( rc )
>>> +break;
>>> +
>>> +sgfn = gfn_add(sgfn, (1 << order));
>>> +if ( !mfn_eq(smfn, INVALID_MFN) )
>>> +   smfn = mfn_add(smfn, (1 << order));
>>> +
>>> +nr -= (1 << order);
>> Throughout maybe better be safe right away and use 1UL?
>>
>>> +}
>>> +
>>> +return rc;
>>>

Re: [PATCH] xen/efi: Fix crash with initial empty EFI options

2025-07-07 Thread Jan Beulich

On 07.07.2025 17:51, Frediano Ziglio wrote:
> On Mon, Jul 7, 2025 at 4:42 PM Jan Beulich  wrote:
>>
>> On 07.07.2025 17:11, Frediano Ziglio wrote:
>>> EFI code path split options from EFI LoadOptions fields in 2
>>> pieces, first EFI options, second Xen options.
>>> "get_argv" function is called first to get the number of arguments
>>> in the LoadOptions, second, after allocating enough space, to
>>> fill some "argc"/"argv" variable. However the first parsing could
>>> be different from second as second is able to detect "--" argument
>>> separator. So it was possible that "argc" was bigger that the "argv"
>>> array leading to potential buffer overflows, in particular
>>> a string like "-- a b c" would lead to buffer overflow in "argv"
>>> resulting in crashes.
>>> Using EFI shell is possible to pass any kind of string in
>>> LoadOptions.
>>>
>>> Fixes: 201f261e859e ("EFI: move x86 boot/runtime code to common/efi")
>>
>> This only moves the function, but doesn't really introduce any issue afaics.
>>
> 
> Okay, I'll follow the rename
> 
>>> --- a/xen/common/efi/boot.c
>>> +++ b/xen/common/efi/boot.c
>>> @@ -345,6 +345,7 @@ static unsigned int __init get_argv(unsigned int argc, 
>>> CHAR16 **argv,
>>>  VOID *data, UINTN size, UINTN *offset,
>>>  CHAR16 **options)
>>>  {
>>> +CHAR16 **const orig_argv = argv;
>>>  CHAR16 *ptr = (CHAR16 *)(argv + argc + 1), *prev = NULL, *cmdline = 
>>> NULL;
>>>  bool prev_sep = true;
>>>
>>> @@ -384,7 +385,7 @@ static unsigned int __init get_argv(unsigned int argc, 
>>> CHAR16 **argv,
>>>  {
>>>  cmdline = data + *offset;
>>>  /* Cater for the image name as first component. */
>>> -++argc;
>>> +++argv;
>>
>> We're on the argc == 0 and argv == NULL path here. Incrementing NULL is UB,
>> if I'm not mistaken.
> 
> Not as far as I know. Why?

Increment and decrement operators are like additions. For additions the standard
says: "For addition, either both operands shall have arithmetic type, or one
operand shall be a pointer to an object type and the other shall have integer
type." Neither of the alternatives is true for NULL.

> Some systems even can use NULL pointers as valid, like mmap.

Right, but that doesn't make the use of NULL C-compliant.

>>> @@ -402,7 +403,7 @@ static unsigned int __init get_argv(unsigned int argc, 
>>> CHAR16 **argv,
>>>  {
>>>  if ( cur_sep )
>>>  ++ptr;
>>> -else if ( argv )
>>> +else if ( orig_argv )
>>>  {
>>>  *ptr = *cmdline;
>>>  *++ptr = 0;
>>> @@ -410,8 +411,8 @@ static unsigned int __init get_argv(unsigned int argc, 
>>> CHAR16 **argv,
>>>  }
>>>  else if ( !cur_sep )
>>>  {
>>> -if ( !argv )
>>> -++argc;
>>> +if ( !orig_argv )
>>> +++argv;
>>>  else if ( prev && wstrcmp(prev, L"--") == 0 )
>>>  {
>>>  --argv;
>>
>> As per this, it looks like that on the 1st pass we may indeed overcount
>> arguments. But ...
>>
> 
> I can use again argc if you prefer, not strong about it.
> 
>>> @@ -428,9 +429,9 @@ static unsigned int __init get_argv(unsigned int argc, 
>>> CHAR16 **argv,
>>>  }
>>>  prev_sep = cur_sep;
>>>  }
>>> -if ( argv )
>>> +if ( orig_argv )
>>>  *argv = NULL;
>>> -return argc;
>>> +return argv - orig_argv;
>>>  }
>>>
>>>  static EFI_FILE_HANDLE __init get_parent_handle(const EFI_LOADED_IMAGE 
>>> *loaded_image,
>>> @@ -1348,8 +1349,8 @@ void EFIAPI __init noreturn efi_start(EFI_HANDLE 
>>> ImageHandle,
>>>(argc + 1) * sizeof(*argv) +
>>>loaded_image->LoadOptionsSize,
>>>(void **)&argv) == EFI_SUCCESS )
>>> -get_argv(argc, argv, loaded_image->LoadOptions,
>>> - loaded_image->LoadOptionsSize, &offset, &options);
>>> +argc = get_argv(argc, argv, loaded_image->LoadOptions,
>>> +loaded_image->LoadOptionsSize, &offset, 
>>> &options);
>>
>> ... wouldn't this change alone cure that problem? And even that I don't
>> follow. Below here we have
>>
>> for ( i = 1; i < argc; ++i )
>> {
>> CHAR16 *ptr = argv[i];
>>
>> if ( !ptr )
>> break;
>>
>> and the 2nd pass of get_argv() properly terminates the (possibly too large)
>> array with a NULL sentinel. So I wonder what it is that I'm overlooking and
>> that is broken.
> 
> I realized that because I got a crash, not just by looking at the code.
> 
> The string was something like "-- a b c d":

That's in the "plain command line" case or the LOAD_OPTIONS one? In the
former case the image name should come first, aiui. And in the

Re: [PATCH] xen/arm32: Tidy up setup_mm()

2025-07-07 Thread Stefano Stabellini

On Fri, 4 Jul 2025, Michal Orzel wrote:
> The current look and feel of setup_mm() leaves a lot to be desired. The
> scope of variables is not the best, many variables are not really needed
> while some others are set but not used. The first iteration of membanks
> is split from the loop for no reason. Tidy up this function for better
> readability.
> 
> No functional change.
> 
> Signed-off-by: Michal Orzel 

Reviewed-by: Stefano Stabellini 

> ---
>  xen/arch/arm/arm32/mmu/mm.c | 28 +---
>  1 file changed, 9 insertions(+), 19 deletions(-)
> 
> diff --git a/xen/arch/arm/arm32/mmu/mm.c b/xen/arch/arm/arm32/mmu/mm.c
> index e6d9b49acd3c..5e4766ddcf65 100644
> --- a/xen/arch/arm/arm32/mmu/mm.c
> +++ b/xen/arch/arm/arm32/mmu/mm.c
> @@ -74,8 +74,7 @@ static paddr_t __init fit_xenheap_in_static_heap(uint32_t 
> size, paddr_t align)
>  void __init setup_mm(void)
>  {
>  const struct membanks *mem = bootinfo_get_mem();
> -paddr_t ram_start, ram_end, ram_size, e, bank_start, bank_end, bank_size;
> -paddr_t static_heap_end = 0, static_heap_size = 0;
> +paddr_t ram_start = INVALID_PADDR, ram_end = 0, ram_size = 0, e;
>  unsigned long heap_pages, xenheap_pages, domheap_pages;
>  unsigned int i;
>  const uint32_t ctr = READ_CP32(CTR);
> @@ -89,19 +88,14 @@ void __init setup_mm(void)
>  
>  init_pdx();
>  
> -ram_start = mem->bank[0].start;
> -ram_size  = mem->bank[0].size;
> -ram_end   = ram_start + ram_size;
> -
> -for ( i = 1; i < mem->nr_banks; i++ )
> +for ( i = 0; i < mem->nr_banks; i++ )
>  {
> -bank_start = mem->bank[i].start;
> -bank_size = mem->bank[i].size;
> -bank_end = bank_start + bank_size;
> +const struct membank *bank = &mem->bank[i];
> +paddr_t bank_end = bank->start + bank->size;
>  
> -ram_size  = ram_size + bank_size;
> -ram_start = min(ram_start,bank_start);
> -ram_end   = max(ram_end,bank_end);
> +ram_size = ram_size + bank->size;
> +ram_start = min(ram_start, bank->start);
> +ram_end = max(ram_end, bank_end);
>  }
>  
>  total_pages = ram_size >> PAGE_SHIFT;
> @@ -109,18 +103,14 @@ void __init setup_mm(void)
>  if ( using_static_heap )
>  {
>  const struct membanks *reserved_mem = bootinfo_get_reserved_mem();
> +paddr_t static_heap_size = 0;
>  
>  for ( i = 0 ; i < reserved_mem->nr_banks; i++ )
>  {
>  if ( reserved_mem->bank[i].type != MEMBANK_STATIC_HEAP )
>  continue;
>  
> -bank_start = reserved_mem->bank[i].start;
> -bank_size = reserved_mem->bank[i].size;
> -bank_end = bank_start + bank_size;
> -
> -static_heap_size += bank_size;
> -static_heap_end = max(static_heap_end, bank_end);
> +static_heap_size += reserved_mem->bank[i].size;
>  }
>  
>  heap_pages = static_heap_size >> PAGE_SHIFT;
> -- 
> 2.25.1
>

Re: [PATCH 1/2] xen/arm64: Panic if direct map is too small

2025-07-07 Thread Stefano Stabellini

On Fri, 4 Jul 2025, Michal Orzel wrote:
> Harden the code by panicing if direct map is too small for current memory
> layout taking into account possible PDX compression. Otherwise the assert
> is observed:
> Assertion '(mfn_to_pdx(maddr_to_mfn(ma)) - directmap_base_pdx) < 
> (DIRECTMAP_SIZE >> PAGE_SHIFT)' failed at ./arch/arm/include/asm/mmu/mm.h:72
> 
> At the moment, we don't set max_pdx denoting maximum usable PDX which
> should be based on max_page. Consolidate setting of max_page and max_pdx
> in init_pdx() for both arm32 and arm64. max_pdx will be used in the
> future to set up frametable mappings respecting the PDX grouping.
> 
> Signed-off-by: Michal Orzel 

Reviewed-by: Stefano Stabellini

Re: [PATCH 2/2] xen/arm: Skip loops in init_pdx() when no PDX compression is used

2025-07-07 Thread Stefano Stabellini

On Fri, 4 Jul 2025, Michal Orzel wrote:
> When CONFIG_PDX_COMPRESSION=n, pdx_init_mask(), pdx_region_mask() and
> pfn_pdx_hole_setup() are just stubs doing nothing. It does not make
> sense to keep the two loops iterating over all the memory banks.
> 
> Signed-off-by: Michal Orzel 

Reviewed-by: Stefano Stabellini 


> ---
>  xen/arch/arm/setup.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
> index 93b730ffb5fb..12b76a0a9837 100644
> --- a/xen/arch/arm/setup.c
> +++ b/xen/arch/arm/setup.c
> @@ -255,7 +255,9 @@ void __init init_pdx(void)
>  {
>  const struct membanks *mem = bootinfo_get_mem();
>  paddr_t bank_start, bank_size, bank_end, ram_end = 0;
> +int bank;
>  
> +#ifdef CONFIG_PDX_COMPRESSION
>  /*
>   * Arm does not have any restrictions on the bits to compress. Pass 0 to
>   * let the common code further restrict the mask.
> @@ -264,7 +266,6 @@ void __init init_pdx(void)
>   * update this function too.
>   */
>  uint64_t mask = pdx_init_mask(0x0);
> -int bank;
>  
>  for ( bank = 0 ; bank < mem->nr_banks; bank++ )
>  {
> @@ -284,6 +285,7 @@ void __init init_pdx(void)
>  }
>  
>  pfn_pdx_hole_setup(mask >> PAGE_SHIFT);
> +#endif
>  
>  for ( bank = 0 ; bank < mem->nr_banks; bank++ )
>  {
> -- 
> 2.25.1
>

Re: [PATCH] xen/arm32: Tidy up setup_mm()

2025-07-07 Thread Hari Limaye

Hi Michal,

> On Fri, Jul 04, 2025 at 11:08:31AM +, Michal Orzel wrote:
> The current look and feel of setup_mm() leaves a lot to be desired. The
> scope of variables is not the best, many variables are not really needed
> while some others are set but not used. The first iteration of membanks
> is split from the loop for no reason. Tidy up this function for better
> readability.
> 
> No functional change.
> 
> Signed-off-by: Michal Orzel 
Reviewed-by: Hari Limaye 
Tested-by: Hari Limaye 

LGTM! Tested (compilation) via Arm AArch32 build.

Many thanks,
Hari

RE: [PATCH] xen/arm: Enhance IPMMU-VMSA driver robustness and debug output

2025-07-07 Thread Jahan Murudi

Hi Julien,

> On 30/06/2025 13:44, Julien Grall wrote:

>> On 25/06/2025 16:53, Julien Grall wrote:
>
>> Hi Jahan,
> 
>> +dsb(sy);
> Any clue why Linux (mainline) does not do that?

> I understand for the PCI passthrough, Xen will be using stage-2, so in theory 
> the stage-1 could be used by the guest OS. But ultimately, this is the same 
> PCI device behind. So if it is not coherent, it should be for both stages. Do 
> you have any pointer to the documentation that would state otherwise?

You're right - coherency characteristics are identical for both stages. My 
earlier understanding was incorrect.

> Note, I just noticed that IOMMU_FEAT_COHERENT_WALK is not set for the IPMMU. 
> So the "dsb sy" is coherent. However, I find doubful an IOMMU would have a 
> difference of coherency between two stages. So maybe we should set the flag 
> either unconditionally or based on a register.

Excellent observation. Current R-car IPMMU doesn't supports coherent walks - we 
should indeed set this flag unconditionally.

 >> and we must also prevent(minimise) any DMA operations during TLB 
 >> invalidation( observed some IPMMU hardware limitations in the
documentation) .

> I don't understand what you wrote in parentheses. But isn't it what you wrote 
> all true for stage-1?

Correct – the hardware reference doc guidelines about minimizing DMA during 
flushes applies globally. This is true for both stage-1 and stage-2. 
Given that the patch has already been Acked by Michal, can we proceed with 
applying it?

Regards,
Jahan Murudi

Re: [PATCH] xen/arm: Enhance IPMMU-VMSA driver robustness and debug output

2025-07-07 Thread Julien Grall


Hi Jahan,

On 07/07/2025 12:24, Jahan Murudi wrote:

On 30/06/2025 13:44, Julien Grall wrote:



On 25/06/2025 16:53, Julien Grall wrote:



Hi Jahan,



+dsb(sy);

Any clue why Linux (mainline) does not do that?



I understand for the PCI passthrough, Xen will be using stage-2, so in theory 
the stage-1 could be used by the guest OS. But ultimately, this is the same PCI 
device behind. So if it is not coherent, it should be for both stages. Do you 
have any pointer to the documentation that would state otherwise?


You're right - coherency characteristics are identical for both stages. My 
earlier understanding was incorrect.


Note, I just noticed that IOMMU_FEAT_COHERENT_WALK is not set for the IPMMU. So the 
"dsb sy" is coherent. However, I find doubful an IOMMU would have a difference 
of coherency between two stages. So maybe we should set the flag either unconditionally 
or based on a register.


Excellent observation. Current R-car IPMMU doesn't supports coherent walks - we 
should indeed set this flag unconditionally.


To clarify, the R-car IPMMU doesn't support coherent walk, then the flag 
IOMMU_FEAT_COHERENT_WALK *must not* be set.


Cheers,

--
Julien Grall

[MINI-OS PATCH 2/2] x86: don't use a memory page for mapping the shared info page

2025-07-07 Thread Juergen Gross

There is no need to use a populated memory page for mapping the shared
info page at that location. Just use an allocated virtual address for
the shared info page. For PVH allocate an unused pfn.

Signed-off-by: Juergen Gross 
---
 arch/x86/mm.c |  7 ---
 arch/x86/setup.c  | 15 ---
 arch/x86/x86_32.S |  7 +--
 arch/x86/x86_64.S |  7 +--
 hypervisor.c  | 15 +++
 5 files changed, 21 insertions(+), 30 deletions(-)

diff --git a/arch/x86/mm.c b/arch/x86/mm.c
index 3f5c7ea7..78d614a6 100644
--- a/arch/x86/mm.c
+++ b/arch/x86/mm.c
@@ -497,7 +497,6 @@ static void build_pagetable(unsigned long *start_pfn, 
unsigned long *max_pfn)
 /*
  * Mark portion of the address space read only.
  */
-extern struct shared_info shared_info;
 
 struct change_readonly_par {
 unsigned long etext;
@@ -519,12 +518,6 @@ static int change_readonly_func(unsigned long va, unsigned 
int lvl,
 if ( va + (1UL << ptdata[lvl].shift) > ro->etext )
 return 1;
 
-if ( va == (unsigned long)&shared_info )
-{
-printk("skipped %lx\n", va);
-return 0;
-}
-
 newval = ro->readonly ? (*pte & ~_PAGE_RW) : (*pte | _PAGE_RW);
 
 #ifdef CONFIG_PARAVIRT
diff --git a/arch/x86/setup.c b/arch/x86/setup.c
index 299ff8c7..8fd55c51 100644
--- a/arch/x86/setup.c
+++ b/arch/x86/setup.c
@@ -47,8 +47,6 @@ shared_info_t *HYPERVISOR_shared_info;
  */
 char stack[2*STACK_SIZE];
 
-extern char shared_info[PAGE_SIZE];
-
 static inline void fpu_init(void) {
asm volatile("fninit");
 }
@@ -76,18 +74,21 @@ static void set_info_ptr(start_info_t *ptr)
 
 #define hpc_init()
 
+static unsigned long shared_info_va;
+
 shared_info_t *map_shared_info(void)
 {
 int rc;
-unsigned long pa = start_info_ptr->shared_info;
 
-if ( (rc = HYPERVISOR_update_va_mapping((unsigned long)shared_info,
-__pte(pa | 7), UVMF_INVLPG)) )
+if ( !shared_info_va )
+shared_info_va = alloc_virt_kernel(1);
+rc = map_frame_rw(shared_info_va, PHYS_PFN(start_info_ptr->shared_info));
+if ( rc )
 {
 printk("Failed to map shared_info!! rc=%d\n", rc);
 do_exit();
 }
-return (shared_info_t *)shared_info;
+return (shared_info_t *)shared_info_va;
 }
 
 void unmap_shared_info(void)
@@ -95,7 +96,7 @@ void unmap_shared_info(void)
 int rc;
 pte_t nullpte = { };
 
-if ( (rc = HYPERVISOR_update_va_mapping((unsigned long)shared_info,
+if ( (rc = HYPERVISOR_update_va_mapping(shared_info_va,
 nullpte, UVMF_INVLPG)) )
 {
 printk("Failed to unmap shared_info page!! rc=%d\n", rc);
diff --git a/arch/x86/x86_32.S b/arch/x86/x86_32.S
index 3de00277..5d891164 100644
--- a/arch/x86/x86_32.S
+++ b/arch/x86/x86_32.S
@@ -36,13 +36,8 @@ _start:
 stack_start:
.long stack+(2*__STACK_SIZE), __KERNEL_SS
 
-.globl shared_info, hypercall_page
-/* Unpleasant -- the PTE that maps this page is actually overwritten */
-/* to map the real shared-info page! :-) */
 .align __PAGE_SIZE
-shared_info:
-.fill __PAGE_SIZE,1,0
-
+.globl hypercall_page
 hypercall_page:
 .fill __PAGE_SIZE,1,0
 
diff --git a/arch/x86/x86_64.S b/arch/x86/x86_64.S
index 7529c02e..09b93e39 100644
--- a/arch/x86/x86_64.S
+++ b/arch/x86/x86_64.S
@@ -33,13 +33,8 @@ _start:
 stack_start:
 .quad stack+(2*__STACK_SIZE)
 
-.globl shared_info, hypercall_page
-/* Unpleasant -- the PTE that maps this page is actually overwritten */
-/* to map the real shared-info page! :-) */
 .align __PAGE_SIZE
-shared_info:
-.fill __PAGE_SIZE,1,0
-
+.globl hypercall_page
 hypercall_page:
 .fill __PAGE_SIZE,1,0
 
diff --git a/hypervisor.c b/hypervisor.c
index 6476d658..519a4680 100644
--- a/hypervisor.c
+++ b/hypervisor.c
@@ -27,8 +27,10 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 #include 
 
 EXPORT_SYMBOL(hypercall_page);
@@ -37,7 +39,8 @@ EXPORT_SYMBOL(hypercall_page);
 ((sh)->evtchn_pending[idx] & ~(sh)->evtchn_mask[idx])
 
 #ifndef CONFIG_PARAVIRT
-extern shared_info_t shared_info;
+static unsigned long shinfo_pfn;
+static unsigned long shinfo_va;
 
 int hvm_get_parameter(int idx, uint64_t *value)
 {
@@ -69,14 +72,16 @@ shared_info_t *map_shared_info(void)
 {
 struct xen_add_to_physmap xatp;
 
+shinfo_pfn = e820_get_reserved_pfns(1);
 xatp.domid = DOMID_SELF;
 xatp.idx = 0;
 xatp.space = XENMAPSPACE_shared_info;
-xatp.gpfn = virt_to_pfn(&shared_info);
+xatp.gpfn = shinfo_pfn;
 if ( HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp) != 0 )
 BUG();
+shinfo_va = map_frame_virt(shinfo_pfn);
 
-return &shared_info;
+return (shared_info_t *)shinfo_va;
 }
 
 void unmap_shared_info(void)
@@ -84,9 +89,11 @@ void unmap_shared_info(void)
 struct xen_remove_from_physmap xrtp;
 
 xrtp.domid = DOMID_SEL

[MINI-OS PATCH 1/2] mm: provide a way to do very early page table allocations

2025-07-07 Thread Juergen Gross

Add a small pool of statically allocated memory pages to be handed out
for very early page table allocations.

This will make it possible to do virtual allocations e.g. for mapping
the shared info page.

Signed-off-by: Juergen Gross 
---
 arch/x86/mm.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm.c b/arch/x86/mm.c
index bdff38fd..3f5c7ea7 100644
--- a/arch/x86/mm.c
+++ b/arch/x86/mm.c
@@ -640,12 +640,17 @@ void change_readonly(bool readonly)
  * return a valid PTE for a given virtual address. If PTE does not exist,
  * allocate page-table pages.
  */
+#define N_PTS 4
+static char early_pt[PAGE_SIZE * N_PTS] __attribute__((aligned(PAGE_SIZE)));
+static unsigned long n_early_pt = N_PTS;
+
 static int need_pgt_func(unsigned long va, unsigned int lvl, bool is_leaf,
  pgentry_t *pte, void *par)
 {
 pgentry_t **result = par;
 unsigned long pt_mfn;
 unsigned long pt_pfn;
+unsigned long pt_addr;
 unsigned int idx;
 
 if ( !is_leaf )
@@ -664,7 +669,16 @@ static int need_pgt_func(unsigned long va, unsigned int 
lvl, bool is_leaf,
 }
 
 pt_mfn = virt_to_mfn(pte);
-pt_pfn = virt_to_pfn(alloc_page());
+if ( n_early_pt )
+{
+n_early_pt--;
+pt_addr = (unsigned long)&early_pt[n_early_pt * PAGE_SIZE];
+}
+else
+{
+pt_addr = alloc_page();
+}
+pt_pfn = virt_to_pfn(pt_addr);
 if ( !pt_pfn )
 return -1;
 idx = idx_from_va_lvl(va, lvl);
-- 
2.43.0

[PATCH 0/2] Xen real-time x86

2025-07-07 Thread Stefano Stabellini

Hi all,

This short patch series improves Xen real-time execution on AMD x86
processors.

The key to real-time performance is deterministic guest execution times
and deterministic guest interrupt latency. In such configurations, the
null scheduler is typically used, and there should be no IPIs or other
sources of vCPU execution interruptions beyond the guest timer interrupt
as configured by the guest, and any passthrough interrupts for
passthrough devices.

This is because, upon receiving a critical interrupt, the guest (such as
FreeRTOS or Zephyr) typically has a very short window of time to
complete the required action. Being interrupted in the middle of this
critical section could prevent the guest from completing the action
within the allotted time, leading to malfunctions.

To address this, the patch series disables IPIs that could potentially
affect the real-time domain.

Cheers,
Stefano


Stefano Stabellini (2):
  xen/x86: don't send IPI to sync TSC when it is reliable
  xen/x86: introduce AMD_MCE_NONFATAL

 xen/arch/x86/Kconfig.cpu   | 15 +++
 xen/arch/x86/cpu/mcheck/amd_nonfatal.c |  3 ++-
 xen/arch/x86/time.c|  4 
 3 files changed, 21 insertions(+), 1 deletion(-)

Re: [RFC PATCH] xen/flask: estimate max sidtable size

2025-07-07 Thread Daniel P. Smith


On 7/4/25 06:48, Jan Beulich wrote:

On 04.07.2025 12:10, Sergiy Kibrik wrote:

01.07.25 13:42, Jan Beulich:

On 30.06.2025 10:55, Sergiy Kibrik wrote:

@@ -54,4 +54,7 @@ $(obj)/policy.bin: FORCE
FLASK_BUILD_DIR=$(FLASK_BUILD_DIR) POLICY_FILENAME=$(POLICY_SRC)
cmp -s $(POLICY_SRC) $@ || cp $(POLICY_SRC) $@
   
+$(obj)/%/se_limits.h: $(obj)/policy.bin

+   $(srcdir)/policy/mkselim.sh $^ $@


Hmm, that's using the built-in policy, isn't it? What if later another
policy is loaded? Wouldn't it be possible to have ...


--- a/xen/xsm/flask/ss/sidtab.c
+++ b/xen/xsm/flask/ss/sidtab.c
@@ -13,6 +13,7 @@
   #include "flask.h"
   #include "security.h"
   #include "sidtab.h"
+#include "se_limits.h"
   
   #define SIDTAB_HASH(sid) ((sid) & SIDTAB_HASH_MASK)
   
@@ -228,7 +229,7 @@ int sidtab_context_to_sid(struct sidtab *s, struct context *context,

   if ( sid )
   goto unlock_out;
   /* No SID exists for the context.  Allocate a new one. */
-if ( s->next_sid == UINT_MAX || s->shutdown )
+if ( s->next_sid == SEPOL_SID_LIMIT || s->shutdown )


... more than this many SIDs? What if CONFIG_XSM_FLASK_POLICY isn't even set?



It's using a policy from tools/flask/policy, yes. But not a built-in
policy, just reusing a bit of code from that code. The idea is that we
can have CONFIG_XSM_FLASK_POLICY option disabled yet still be able to
calculate SEPOL_SID_LIMIT.

As for loading another policy at runtime -- the calculated
SEPOL_SID_LIMIT=384 for current master flask policy is still pretty big
limit. From what I can see -- much less No. contexts are being used on a
running system, because most of calculated combinations of
user/role/type are not really usable (e.g. contexts with xen_t or
xenboot_t types and user_1 user are not expected etc). So there should
be enough room even for more complex custom policies.


But still there could be odd ones. Imo such a static limit can then only be
introduced via Kconfig option.


Jan, thank you for adding me on as the CC.

Not having seen the original patch, but based on the discussion, I would 
say this should be a Kconfig option that by default maintains the 
existing bounds/limits allowing for the distro maintainer to impose 
tighter restrictions. Additionally, any there was dynamic allocation, 
this should remain (being the default) and static allocation should only 
happen via Kconfig system.


v/r,
dps

82 matches

Mail list logo