Re: [XEN RFC PATCH v5 3/5] xen/public: Introduce PV-IOMMU hypercall interface
On 30.01.2025 21:17, Jason Andryuk wrote: > On 2025-01-21 11:13, Teddy Astie wrote: >> +/** >> + * IOMMU_alloc_nested >> + * Create a nested IOMMU context (needs IOMMUCAP_nested). >> + * >> + * This context uses a platform-specific page table from domain address >> space >> + * specified in pgtable_gfn and use it for nested translations. >> + * >> + * Explicit flushes needs to be submited with IOMMU_flush_nested on >> + * modification of the nested pagetable to ensure coherency between IOTLB >> and >> + * nested page table. >> + * >> + * This context can be destroyed using IOMMU_free_context. >> + * This context cannot be modified using map_pages, unmap_pages. >> + */ >> +struct pv_iommu_alloc_nested { >> + /* OUT: allocated IOMMU context number */ >> + uint16_t ctx_no; >> + >> + /* IN: guest frame number of the nested page table */ >> + uint64_aligned_t pgtable_gfn; >> + >> + /* IN: nested mode flags */ >> + uint64_aligned_t nested_flags; >> +}; >> +typedef struct pv_iommu_alloc_nested pv_iommu_alloc_nested_t; >> +DEFINE_XEN_GUEST_HANDLE(pv_iommu_alloc_nested_t); > > Is this command intended to be used for GVA -> GPA translation? Would you > need some way to associate with another iommu context for GPA -> HPA > translation? > > Maybe more broadly, what are your goals for enabling PV-IOMMU? The examples > on your blog post cover a domain restrict device access to specific portions > of the the GPA space. Are you also interested in GVA -> GPA? Does VFIO > required GVA -> GPA? > > And, sorry to bike shed, but ctx_no reads like "Context No" to me. I think > ctxid/ctx_id might be clearer. Others probably have their own opinions :) Or, if it absolutely is to derive from "number", ctx_nr. That said, I agree "id" is a better term to use here. Jan
Re: Config space access to Mediatek MT7922 doesn't work after device reset in Xen PV dom0 (regression, Linux 6.12)
On 30.01.2025 22:31, Bjorn Helgaas wrote: > On Thu, Jan 30, 2025 at 10:30:33AM +0100, Jan Beulich wrote: >> On 30.01.2025 05:55, Marek Marczykowski-Górecki wrote: >>> (XEN) d0v1 conf read cf8 0x80010088 bytes 2 offset 2 data 0x9 > > PCIe Cap at 0x80, PCI_EXP_DEVCTL is 0x08, PCI_EXP_DEVSTA is 0x0a. > > 0x80010088 would be PCI_EXP_DEVCTL (a 2-byte register), maybe offset 2 > gets us to PCI_EXP_DEVSTA? Not sure. > > 0x0001 PCI_EXP_DEVSTA_CED /* Correctable Error Detected */ > 0x0008 PCI_EXP_DEVSTA_URD /* Unsupported Request Detected */ > > Not impossible that these would be set. Lots of URs happen during > enumeration and we're not very good about cleaning these up. > Correctable errors are common for some devices. lspci -vv would > decode the PCIe cap registers, including this. > >>> (XEN) d0v1 conf read cf8 0x80010088 bytes 2 offset 0 data 0x2910 > > PCI_EXP_DEVCTL: > 0x2000 PCI_EXP_DEVCTL_READRQ_512B > 0x0800 PCI_EXP_DEVCTL_NOSNOOP_EN > 0x0100 PCI_EXP_DEVCTL_EXT_TAG > 0x0010 PCI_EXP_DEVCTL_RELAX_EN > >>> (XEN) d0v1 conf write cf8 0x80010088 bytes 2 offset 0 data 0xa910 > > PCI_EXP_DEVCTL: > set 0x8000 PCI_EXP_DEVCTL_BCR_FLR > > This looks like the actual FLR being initiated. > >> This is the express capability's Link Control 2 Register afaict. > > Unless I'm missing something this is actually Device Control. So far > I think this all looks OK. The next part: What you say is very plausible as far as the observed behavior goes, but: According to the lspci output provided earlier the express capability is at 58 (hex). Hence here we're 30 (hex) into the capability, which according to the spec I'm looking at is Link Control 2. Yet as said - with what you say being plausible, likely I'm simply getting something very wrong. Jan
Re: v5.4.289 failed to boot with error megasas_build_io_fusion 3219 sge_count (-12) is out of range
On Thu, 30 Jan 2025, Jürgen Groß wrote: > Can you try the attached patch, please? I don't have a system at hand > showing the problem. > > From cff43e997f79a95dc44e02debaeafe5f127f40bb Mon Sep 17 00:00:00 2001 > From: Juergen Gross > Date: Thu, 30 Jan 2025 09:56:57 +0100 > Subject: [PATCH] x86/xen: allow larger contiguous memory regions in PV guests > > Today a PV guest (including dom0) can create 2MB contiguous memory > regions for DMA buffers at max. This has led to problems at least > with the megaraid_sas driver, which wants to allocate a 2.3MB DMA > buffer. > > The limiting factor is the frame array used to do the hypercall for > making the memory contiguous, which has 512 entries and is just a > static array in mmu_pv.c. > > In case a contiguous memory area larger than the initially supported > 2MB is requested, allocate a larger buffer for the frame list. Note > that such an allocation is tried only after memory management has been > initialized properly, which is tested via the early_boot_irqs_disabled > flag. > > Fixes: 9f40ec84a797 ("xen/swiotlb: add alignment check for dma buffers") > Signed-off-by: Juergen Gross > --- > Note that the "Fixes:" tag is not really correct, as that patch didn't > introduce the problem, but rather made it visible. OTOH it is the best > indicator we have to identify kernel versions this patch should be > backported to. > --- > arch/x86/xen/mmu_pv.c | 44 --- > 1 file changed, 37 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c > index 55a4996d0c04..62aec29b8174 100644 > --- a/arch/x86/xen/mmu_pv.c > +++ b/arch/x86/xen/mmu_pv.c > @@ -2200,8 +2200,10 @@ void __init xen_init_mmu_ops(void) > } > > /* Protected by xen_reservation_lock. */ > -#define MAX_CONTIG_ORDER 9 /* 2MB */ > -static unsigned long discontig_frames[1< +#define MIN_CONTIG_ORDER 9 /* 2MB */ > +static unsigned int discontig_frames_order = MIN_CONTIG_ORDER; > +static unsigned long discontig_frames_early[1UL << MIN_CONTIG_ORDER]; > +static unsigned long *discontig_frames = discontig_frames_early; > > #define VOID_PTE (mfn_pte(0, __pgprot(0))) > static void xen_zap_pfn_range(unsigned long vaddr, unsigned int order, > @@ -2319,18 +2321,44 @@ int xen_create_contiguous_region(phys_addr_t pstart, > unsigned int order, >unsigned int address_bits, >dma_addr_t *dma_handle) > { > - unsigned long *in_frames = discontig_frames, out_frame; > + unsigned long *in_frames, out_frame; > + unsigned long *new_array, *old_array; > unsigned long flags; > intsuccess; > unsigned long vstart = (unsigned long)phys_to_virt(pstart); > > - if (unlikely(order > MAX_CONTIG_ORDER)) > - return -ENOMEM; > + if (unlikely(order > discontig_frames_order)) { > + if (early_boot_irqs_disabled) > + return -ENOMEM; > + > + new_array = vmalloc(sizeof(unsigned long) * (1UL << order)); > + > + if (!new_array) > + return -ENOMEM; > + > + spin_lock_irqsave(&xen_reservation_lock, flags); > + > + if (order > discontig_frames_order) { This second if check should not be needed because it is the same as the outer if check. > + if (discontig_frames == discontig_frames_early) > + old_array = NULL; > + else > + old_array = discontig_frames; > + discontig_frames = new_array; > + discontig_frames_order = order; > + } else > + old_array = new_array; > + > + spin_unlock_irqrestore(&xen_reservation_lock, flags); > + > + vfree(old_array); > + } > > memset((void *) vstart, 0, PAGE_SIZE << order); > > spin_lock_irqsave(&xen_reservation_lock, flags); > > + in_frames = discontig_frames; > + > /* 1. Zap current PTEs, remembering MFNs. */ > xen_zap_pfn_range(vstart, order, in_frames, NULL); > > @@ -2354,12 +2382,12 @@ int xen_create_contiguous_region(phys_addr_t pstart, > unsigned int order, > > void xen_destroy_contiguous_region(phys_addr_t pstart, unsigned int order) > { > - unsigned long *out_frames = discontig_frames, in_frame; > + unsigned long *out_frames, in_frame; > unsigned long flags; > int success; > unsigned long vstart; > > - if (unlikely(order > MAX_CONTIG_ORDER)) > + if (unlikely(order > discontig_frames_order)) > return; > > vstart = (unsigned long)phys_to_virt(pstart); > @@ -2367,6 +2395,8 @@ void xen_destroy_contiguous_region(phys_addr_t pstart, > unsigned int order) > > spin_lock_irqsave(&xen_reservation_lock, flags); > > + out_frames = discontig_frames; > + > /* 1. Find start MFN of contiguous extent. */ >
Re: [PATCH v2 0/2] tests/qtest: Make qtest_has_accel() generic
On 2025/01/30 19:37, Philippe Mathieu-Daudé wrote: (Series fully reviewed) Since v1: - Use g_strconcat (Akihiko) In preparation of running QTests using HVF on Darwin, make qtest_has_accel() generic. Note, this also allow running other accelerators such Xen, WHPX, ... Philippe Mathieu-Daudé (2): tests/qtest: Extract qtest_qom_has_concrete_type() helper tests/qtest: Make qtest_has_accel() generic tests/qtest/libqtest.c | 110 +++-- 1 file changed, 61 insertions(+), 49 deletions(-) Reviewed-by: Akihiko Odaki
Re: [RFC PATCH 3/3] drm/virtio: implement blob userptr resource object
On Wed, Jan 29, 2025 at 03:54:59PM -0500, Demi Marie Obenour wrote: > On 1/8/25 12:05 PM, Simona Vetter wrote: > > On Fri, Dec 27, 2024 at 10:24:29AM +0800, Huang, Honglei1 wrote: > >> > >> On 2024/12/22 9:59, Demi Marie Obenour wrote: > >>> On 12/20/24 10:35 AM, Simona Vetter wrote: > On Fri, Dec 20, 2024 at 06:04:09PM +0800, Honglei Huang wrote: > > From: Honglei Huang > > > > A virtio-gpu userptr is based on HMM notifier. > > Used for let host access guest userspace memory and > > notice the change of userspace memory. > > This series patches are in very beginning state, > > User space are pinned currently to ensure the host > > device memory operations are correct. > > The free and unmap operations for userspace can be > > handled by MMU notifier this is a simple and basice > > SVM feature for this series patches. > > The physical PFNS update operations is splited into > > two OPs in here. The evicted memories won't be used > > anymore but remap into host again to achieve same > > effect with hmm_rang_fault. > > So in my opinion there are two ways to implement userptr that make sense: > > - pinned userptr with pin_user_pages(FOLL_LONGTERM). there is not mmu > notifier > > - unpinnned userptr where you entirely rely on userptr and do not hold > any > page references or page pins at all, for full SVM integration. This > should use hmm_range_fault ideally, since that's the version that > doesn't ever grab any page reference pins. > > All the in-between variants are imo really bad hacks, whether they hold a > page reference or a temporary page pin (which seems to be what you're > doing here). In much older kernels there was some justification for them, > because strange stuff happened over fork(), but with FOLL_LONGTERM this > is > now all sorted out. So there's really only fully pinned, or true svm left > as clean design choices imo. > > With that background, why does pin_user_pages(FOLL_LONGTERM) not work for > you? > >>> > >>> +1 on using FOLL_LONGTERM. Fully dynamic memory management has a huge > >>> cost > >>> in complexity that pinning everything avoids. Furthermore, this avoids > >>> the > >>> host having to take action in response to guest memory reclaim requests. > >>> This avoids additional complexity (and thus attack surface) on the host > >>> side. > >>> Furthermore, since this is for ROCm and not for graphics, I am less > >>> concerned > >>> about supporting systems that require swappable GPU VRAM. > >> > >> Hi Sima and Demi, > >> > >> I totally agree the flag FOLL_LONGTERM is needed, I will add it in next > >> version. > >> > >> And for the first pin variants implementation, the MMU notifier is also > >> needed I think.Cause the userptr feature in UMD generally used like this: > >> the registering of userptr always is explicitly invoked by user code like > >> "registerMemoryToGPU(userptrAddr, ...)", but for the userptr release/free, > >> there is no explicit API for it, at least in hsakmt/KFD stack. User just > >> need call system call "free(userptrAddr)", then kernel driver will release > >> the userptr by MMU notifier callback.Virtio-GPU has no other way to know if > >> user has been free the userptr except for MMU notifior.And in UMD theres is > >> no way to get the free() operation is invoked by user.The only way is use > >> MMU notifier in virtio-GPU driver and free the corresponding data in host > >> by > >> some virtio CMDs as far as I can see. > >> > >> And for the second way that is use hmm_range_fault, there is a predictable > >> issues as far as I can see, at least in hsakmt/KFD stack. That is the > >> memory > >> may migrate when GPU/device is working. In bare metal, when memory is > >> migrating KFD driver will pause the compute work of the device in > >> mmap_wirte_lock then use hmm_range_fault to remap the migrated/evicted > >> memories to GPU then restore the compute work of device to ensure the > >> correction of the data. But in virtio-GPU driver the migration happen in > >> guest kernel, the evict mmu notifier callback happens in guest, a virtio > >> CMD > >> can be used for notify host but as lack of mmap_write_lock protection in > >> host kernel, host will hold invalid data for a short period of time, this > >> may lead to some issues. And it is hard to fix as far as I can see. > >> > >> I will extract some APIs into helper according to your request, and I will > >> refactor the whole userptr implementation, use some callbacks in page > >> getting path, let the pin method and hmm_range_fault can be choiced > >> in this series patches. > > > > Ok, so if this is for svm, then you need full blast hmm, or the semantics > > are buggy. You cannot fake svm with pin(FOLL_LONGTERM) userptr, this does > > not work. > > > > The other option is that hsakmt/kfd api is completely busted,
Re: [PATCH v2 1/4] xen: kconfig: rename MEM_ACCESS -> VM_EVENT
On Tue, Jan 21, 2025 at 5:19 AM Sergiy Kibrik wrote: > > Use more generic CONFIG_VM_EVENT name throughout Xen code instead of > CONFIG_MEM_ACCESS. This reflects the fact that vm_event is a higher level > feature, with mem_access & monitor depending on it. > > Suggested-by: Jan Beulich > Suggested-by: Tamas K Lengyel > Signed-off-by: Sergiy Kibrik Acked-by: Tamas K Lengyel
Re: [PATCH v2 1/4] xen: kconfig: rename MEM_ACCESS -> VM_EVENT
On Thu, Jan 30, 2025 at 8:24 AM Jan Beulich wrote: > > On 21.01.2025 11:19, Sergiy Kibrik wrote: > > Use more generic CONFIG_VM_EVENT name throughout Xen code instead of > > CONFIG_MEM_ACCESS. This reflects the fact that vm_event is a higher level > > feature, with mem_access & monitor depending on it. > > > > Suggested-by: Jan Beulich > > I don't think this is applicable; my suggestion went in a different direction. > > > Suggested-by: Tamas K Lengyel > > Signed-off-by: Sergiy Kibrik > > Before considering to ack this, I'd like you, Tamas, to confirm this is really > what you had thought of. In particular ... > > > --- a/xen/arch/arm/Makefile > > +++ b/xen/arch/arm/Makefile > > @@ -37,7 +37,7 @@ obj-y += irq.o > > obj-y += kernel.init.o > > obj-$(CONFIG_LIVEPATCH) += livepatch.o > > obj-$(CONFIG_LLC_COLORING) += llc-coloring.o > > -obj-$(CONFIG_MEM_ACCESS) += mem_access.o > > +obj-$(CONFIG_VM_EVENT) += mem_access.o > > ... changes like this one look somewhat odd to me. > > > --- a/xen/common/Kconfig > > +++ b/xen/common/Kconfig > > @@ -92,7 +92,7 @@ config HAS_VMAP > > config MEM_ACCESS_ALWAYS_ON > > bool > > > > -config MEM_ACCESS > > +config VM_EVENT > > def_bool MEM_ACCESS_ALWAYS_ON > > prompt "Memory Access and VM events" if !MEM_ACCESS_ALWAYS_ON > > depends on HVM > > What about MEM_ACCESS_ALWAYS_ON (visible in patch context)? Shouldn't that > become VM_EVENT_ALWAYS_ON then, too? > > Further, what about MEM_PAGING and MEM_SHARING? Shouldn't those, at least > documentation purposes, then also gain a dependency on VM_EVENT? MEM_PAGING, yes. MEM_SHARING, definitely not. MEM_SHARING is perfectly functional without vm_event. Tamas
Re: [PATCH v2 1/4] xen: kconfig: rename MEM_ACCESS -> VM_EVENT
On 31.01.2025 01:26, Tamas K Lengyel wrote: > On Thu, Jan 30, 2025 at 8:24 AM Jan Beulich wrote: >> >> On 21.01.2025 11:19, Sergiy Kibrik wrote: >>> Use more generic CONFIG_VM_EVENT name throughout Xen code instead of >>> CONFIG_MEM_ACCESS. This reflects the fact that vm_event is a higher level >>> feature, with mem_access & monitor depending on it. >>> >>> Suggested-by: Jan Beulich >> >> I don't think this is applicable; my suggestion went in a different >> direction. >> >>> Suggested-by: Tamas K Lengyel >>> Signed-off-by: Sergiy Kibrik >> >> Before considering to ack this, I'd like you, Tamas, to confirm this is >> really >> what you had thought of. In particular ... >> >>> --- a/xen/arch/arm/Makefile >>> +++ b/xen/arch/arm/Makefile >>> @@ -37,7 +37,7 @@ obj-y += irq.o >>> obj-y += kernel.init.o >>> obj-$(CONFIG_LIVEPATCH) += livepatch.o >>> obj-$(CONFIG_LLC_COLORING) += llc-coloring.o >>> -obj-$(CONFIG_MEM_ACCESS) += mem_access.o >>> +obj-$(CONFIG_VM_EVENT) += mem_access.o >> >> ... changes like this one look somewhat odd to me. >> >>> --- a/xen/common/Kconfig >>> +++ b/xen/common/Kconfig >>> @@ -92,7 +92,7 @@ config HAS_VMAP >>> config MEM_ACCESS_ALWAYS_ON >>> bool >>> >>> -config MEM_ACCESS >>> +config VM_EVENT >>> def_bool MEM_ACCESS_ALWAYS_ON >>> prompt "Memory Access and VM events" if !MEM_ACCESS_ALWAYS_ON >>> depends on HVM >> >> What about MEM_ACCESS_ALWAYS_ON (visible in patch context)? Shouldn't that >> become VM_EVENT_ALWAYS_ON then, too? >> >> Further, what about MEM_PAGING and MEM_SHARING? Shouldn't those, at least >> documentation purposes, then also gain a dependency on VM_EVENT? > > MEM_PAGING, yes. MEM_SHARING, definitely not. MEM_SHARING is perfectly > functional without vm_event. Is it? I see e.g. if ( sharing_enomem ) { #ifdef CONFIG_MEM_SHARING if ( !vm_event_check_ring(currd->vm_event_share) ) { gprintk(XENLOG_ERR, "Domain %pd attempt to unshare " "gfn %lx, ENOMEM and no helper\n", currd, gfn); /* Crash the domain */ rc = 0; } #endif } in hvm_hap_nested_page_fault(). Also - you responded only to a secondary remark here. What about the more basic points further up? Jan
Re: [PATCH v2 08/15] x86/hyperlaunch: locate dom0 kernel with hyperlaunch
On 30.01.2025 22:14, Stefano Stabellini wrote: > On Thu, 30 Jan 2025, Jan Beulich wrote: >> On 26.12.2024 17:57, Daniel P. Smith wrote: >>> Look for a subnode of type `multiboot,kernel` within a domain node. If >>> found, >>> process the reg property for the MB1 module index. If the bootargs property >>> is >>> present and there was not an MB1 string, then use the command line from the >>> device tree definition. >> >> While multiboot is apparently the first x86-specific part (as far as Xen >> goes) >> to be put under domain-builder/, I wonder: >> - Wouldn't looking for "multiboot,kernel" simply yield nothing on non-x86, >> so having the code under common/ would still be okay? > > One small clarification: multiboot,kernel is actually common between > both ARM and x86. It is "module-index" which is x86-specific and would > "simply yield nothing on non-x86", as you wrote. > > I'll let Dan address your point that "having the code under common/ > would still be okay". > > >> - What's "multiboot" describing here? The origin of the module? (What other >> origins would then be possible? How would MB1 and MB2 be distinguished? >> What about a native xen.efi boot?) A property of the kernel (when Linux >> doesn't use MB)? > > Each device tree node has a compatible string to qualify what kind of > information the node is describing. The compatible string for device > tree nodes describing a kernel binary or a ramdisk previously loaded > into memory by a bootloader have a "multiboot," prefix. See > docs/misc/arm/device-tree/booting.txt. This is unrelated to the binary > multiboot protocol Grub uses on x86 to boot Xen. > > A distinction between MB1 and MB2 is not needed in device tree, that > information is retrieved via the Grub multiboot protocol as usual. The > only thing needed here in device tree is the location of the kernel, > either by RAM address, or by Grub multiboot module index. This last > option (Grub multiboot module index) is the "module-index" property I > mentioned above. Hmm, then I'm afraid I can't make sense of the mentioning of MB1 in the description. Yet that's a point more towards Daniel than you. Jan
Re: v5.4.289 failed to boot with error megasas_build_io_fusion 3219 sge_count (-12) is out of range
On 30.01.25 21:28, Stefano Stabellini wrote: On Thu, 30 Jan 2025, Jürgen Groß wrote: Can you try the attached patch, please? I don't have a system at hand showing the problem. From cff43e997f79a95dc44e02debaeafe5f127f40bb Mon Sep 17 00:00:00 2001 From: Juergen Gross Date: Thu, 30 Jan 2025 09:56:57 +0100 Subject: [PATCH] x86/xen: allow larger contiguous memory regions in PV guests Today a PV guest (including dom0) can create 2MB contiguous memory regions for DMA buffers at max. This has led to problems at least with the megaraid_sas driver, which wants to allocate a 2.3MB DMA buffer. The limiting factor is the frame array used to do the hypercall for making the memory contiguous, which has 512 entries and is just a static array in mmu_pv.c. In case a contiguous memory area larger than the initially supported 2MB is requested, allocate a larger buffer for the frame list. Note that such an allocation is tried only after memory management has been initialized properly, which is tested via the early_boot_irqs_disabled flag. Fixes: 9f40ec84a797 ("xen/swiotlb: add alignment check for dma buffers") Signed-off-by: Juergen Gross --- Note that the "Fixes:" tag is not really correct, as that patch didn't introduce the problem, but rather made it visible. OTOH it is the best indicator we have to identify kernel versions this patch should be backported to. --- arch/x86/xen/mmu_pv.c | 44 --- 1 file changed, 37 insertions(+), 7 deletions(-) diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c index 55a4996d0c04..62aec29b8174 100644 --- a/arch/x86/xen/mmu_pv.c +++ b/arch/x86/xen/mmu_pv.c @@ -2200,8 +2200,10 @@ void __init xen_init_mmu_ops(void) } /* Protected by xen_reservation_lock. */ -#define MAX_CONTIG_ORDER 9 /* 2MB */ -static unsigned long discontig_frames[1< #define VOID_PTE (mfn_pte(0, __pgprot(0))) static void xen_zap_pfn_range(unsigned long vaddr, unsigned int order, @@ -2319,18 +2321,44 @@ int xen_create_contiguous_region(phys_addr_t pstart, unsigned int order, unsigned int address_bits, dma_addr_t *dma_handle) { - unsigned long *in_frames = discontig_frames, out_frame; + unsigned long *in_frames, out_frame; + unsigned long *new_array, *old_array; unsigned long flags; intsuccess; unsigned long vstart = (unsigned long)phys_to_virt(pstart); - if (unlikely(order > MAX_CONTIG_ORDER)) - return -ENOMEM; + if (unlikely(order > discontig_frames_order)) { + if (early_boot_irqs_disabled) + return -ENOMEM; + + new_array = vmalloc(sizeof(unsigned long) * (1UL << order)); + + if (!new_array) + return -ENOMEM; + + spin_lock_irqsave(&xen_reservation_lock, flags); + + if (order > discontig_frames_order) { This second if check should not be needed because it is the same as the outer if check. It is needed, as inside the locked region I need to verify that no concurrent call did already update the buffer, maybe with an even larger size. Juergen OpenPGP_0xB0DE9DD628BF132F.asc Description: OpenPGP public key OpenPGP_signature.asc Description: OpenPGP digital signature
Re: [PATCH v2 4/4] xen: mem_access: conditionally compile vm_event.c & monitor.c
On 31.01.2025 01:33, Tamas K Lengyel wrote: > On Tue, Jan 21, 2025 at 5:25 AM Sergiy Kibrik wrote: >> --- a/xen/arch/arm/vsmc.c >> +++ b/xen/arch/arm/vsmc.c >> @@ -330,7 +330,8 @@ void do_trap_smc(struct cpu_user_regs *regs, const union >> hsr hsr) >> } >> >> /* If monitor is enabled, let it handle the call. */ >> -if ( current->domain->arch.monitor.privileged_call_enabled ) >> +if ( IS_ENABLED(CONFIG_VM_EVENT) && >> + current->domain->arch.monitor.privileged_call_enabled ) >> rc = monitor_smc(); > > Why not wrap this entire if block above in an #ifdef CONFIG_VM_EVENT? > I think it would be more explicit what code is being compiled that way > instead of just relying on the compiler optimization to take care of > removing it. Well - we generally prefer things being written this way, where possible. This is to keep as much code as possible exposed to the compiler no matter what configuration. This way the risk of bit-rotting is a little lower (e.g. when making changes affecting such a piece of code, but not noticing the need for a change because things compile fine in whatever configuration(s) the person tests). Jan > The rest of the patch looks fine to me. > > Tamas
Re: [PATCH v2 07/15] x86/hyperlaunch: initial support for hyperlaunch device tree
On 26.12.2024 17:57, Daniel P. Smith wrote: > --- a/xen/arch/x86/domain-builder/fdt.c > +++ b/xen/arch/x86/domain-builder/fdt.c > @@ -13,14 +13,77 @@ > > #include "fdt.h" > > +static int __init find_hyperlaunch_node(const void *fdt) > +{ > +int hv_node = fdt_path_offset(fdt, "/chosen/hypervisor"); > + > +if ( hv_node >= 0 ) > +{ > +/* Anything other than zero indicates no match */ > +if ( fdt_node_check_compatible(fdt, hv_node, "hypervisor,xen") ) > +return -ENODATA; > +else > +return hv_node; > +} > +else > +{ > +/* Lood for dom0less config */ > +int node, chosen_node = fdt_path_offset(fdt, "/chosen"); > +if ( chosen_node < 0 ) Nit: Blank line between declaration(s) and statement(s) please. I'd also question "node" being plain int, if libfdt didn't have it like this. > +return -ENOENT; > + > +fdt_for_each_subnode(node, fdt, chosen_node) > +{ > +if ( fdt_node_check_compatible(fdt, node, "xen,domain") == 0 ) > +return chosen_node; > +} > +} > + > +return -ENODATA; > +} > + > int __init has_hyperlaunch_fdt(struct boot_info *bi) > { > int ret = 0; > const void *fdt = bootstrap_map_bm(&bi->mods[HYPERLAUNCH_MODULE_IDX]); > > -if ( fdt_check_header(fdt) < 0 ) > +if ( !fdt || fdt_check_header(fdt) < 0 ) > ret = -EINVAL; > +else > +ret = find_hyperlaunch_node(fdt); > + > +bootstrap_unmap(); > + > +return ret < 0 ? ret : 0; > +} > + > +int __init walk_hyperlaunch_fdt(struct boot_info *bi) const? > +{ > +int ret = 0, hv_node, node; > +void *fdt = bootstrap_map_bm(&bi->mods[HYPERLAUNCH_MODULE_IDX]); const? Jan
Re: [PATCH v2 08/15] x86/hyperlaunch: locate dom0 kernel with hyperlaunch
On 26.12.2024 17:57, Daniel P. Smith wrote: > Look for a subnode of type `multiboot,kernel` within a domain node. If found, > process the reg property for the MB1 module index. If the bootargs property is > present and there was not an MB1 string, then use the command line from the > device tree definition. While multiboot is apparently the first x86-specific part (as far as Xen goes) to be put under domain-builder/, I wonder: - Wouldn't looking for "multiboot,kernel" simply yield nothing on non-x86, so having the code under common/ would still be okay? - What's "multiboot" describing here? The origin of the module? (What other origins would then be possible? How would MB1 and MB2 be distinguished? What about a native xen.efi boot?) A property of the kernel (when Linux doesn't use MB)? > --- a/xen/arch/x86/domain-builder/core.c > +++ b/xen/arch/x86/domain-builder/core.c > @@ -59,6 +59,17 @@ void __init builder_init(struct boot_info *bi) > > printk(XENLOG_INFO " Number of domains: %d\n", bi->nr_domains); > } > +else > +{ > +unsigned int i; > + > +/* Find first unknown boot module to use as Dom0 kernel */ > +printk("Falling back to using first boot module as dom0\n"); Nit (personal taste?): Why Dom0 in the comment and dom0 in the log message. I think the former is to be preferred, but at the very least I see no reason to spell it differently on two adjacent lines. > +i = first_boot_module_index(bi, BOOTMOD_UNKNOWN); > +bi->mods[i].type = BOOTMOD_KERNEL; > +bi->domains[0].kernel = &bi->mods[i]; > +bi->nr_domains = 1; > +} Relating to a question on an earlier patch: The assumption here is that nothing could have marked another module as BOOTMOD_KERNEL? > --- a/xen/arch/x86/domain-builder/fdt.c > +++ b/xen/arch/x86/domain-builder/fdt.c > @@ -13,6 +13,114 @@ > > #include "fdt.h" > > +static int __init hl_module_index(void *fdt, int node, uint32_t *idx) const void *? > +{ > +int ret = 0; > +const struct fdt_property *prop = > +fdt_get_property(fdt, node, "module-index", &ret); > + > +/* FDT error or bad idx pointer, translate to -EINVAL */ > +if ( ret < 0 || idx == NULL ) This is a static helper - why check the parameter for being NULL? > +return -EINVAL; > + > +fdt_cell_as_u32((fdt32_t *)prop->data, idx); While I'm aware libfdt has quite a few of such casts, they're problematic. First and foremost this is a Misra violation, for casting away const-ness. And then how do you know there are 4 bytes of data to legitimately access? Hence why such casts would better be avoided altogether (or at least be suitably abstracted away). (There's at least one other instance further down.) > +if ( *idx > MAX_NR_BOOTMODS ) >= ? > +return -ERANGE; > + > +return 0; > +} > + > +static int __init dom0less_module_index( > +void *fdt, int node, int size_size, int address_size, uint32_t *idx) > +{ > +uint64_t size = ~0UL, addr = ~0UL; > +int ret = > +fdt_get_reg_prop(fdt, node, address_size, size_size, &addr, &size, > 1); int ret = fdt_get_reg_prop( fdt, node, address_size, size_size, &addr, &size, 1); > +/* FDT error or bad idx pointer, translate to -EINVAL */ > +if ( ret < 0 || idx == NULL ) See above as to the NULL check. > +return -EINVAL; > + > +/* Convention is that zero size indicates address is an index */ > +if ( size != 0 ) > +return -EOPNOTSUPP; > + > +if ( addr > MAX_NR_BOOTMODS ) >= again? > +return -ERANGE; > + > +/* > + * MAX_NR_BOOTMODS cannot exceed the max for MB1, represented by a u32, > + * thus the cast down to a u32 will be safe due to the prior check. > + */ Instead of (or in addition to) the comment, put in a BUILD_BUG_ON()? Also please can you avoid using u32 even in comments? It'll only yield needless grep matches once we go about fully purging it. > +*idx = (uint32_t)addr; > + > +return 0; > +} > + > +static int __init process_domain_node( > +struct boot_info *bi, void *fdt, int dom_node) const twice? (I guess I won't mention such any further. I think I previously asked that you make things as const-correct as possible.) > +{ > +int node; > +struct boot_domain *bd = &bi->domains[bi->nr_domains]; > +const char *name = fdt_get_name(fdt, dom_node, NULL) ?: "unknown"; > + > +fdt_for_each_subnode(node, fdt, dom_node) > +{ > +if ( fdt_node_check_compatible(fdt, node, "multiboot,kernel") == 0 ) > +{ > +unsigned int idx; > +int ret = 0; > + > +if ( bd->kernel ) > +{ > +printk(XENLOG_ERR "Duplicate kernel module for domain %s)\n", > + name); It's XENLOG_ERR here (but a seemingly stray closing parenthesis at the end), yet ... > +continue; > +} > + > +/* Try hyperlaunch pro
Re: [PATCH v2 09/15] x86/hyperlaunch: obtain cmdline from device tree
On 26.12.2024 17:57, Daniel P. Smith wrote: > If a command line is not provided through the bootloader's mechanism, e.g. > muiltboot module string field, then use one from the device tree if present. > The device tree command line is located in the bootargs property of the > `multiboot,kernel` node. This reads as if it can be mix-and-match (and the code looks to confirm this), which doesn't sound quite right. If you deem it right, please add justification here. > --- a/xen/arch/x86/domain-builder/core.c > +++ b/xen/arch/x86/domain-builder/core.c > @@ -8,9 +8,37 @@ > #include > > #include > +#include > > #include "fdt.h" > > +size_t __init builder_get_cmdline_size(struct boot_info *bi, int offset) > +{ > +#ifdef CONFIG_DOMAIN_BUILDER > +const void *fdt = bootstrap_map_bm(&bi->mods[HYPERLAUNCH_MODULE_IDX]); > +int size = fdt_cmdline_prop_size(fdt, offset); > + > +bootstrap_unmap(); > +return size < 0 ? 0 : (size_t) size; Nit: While I wouldn't insist, we don't normally put blanks after casts. (The cast also isn't really needed here anyway.) > +#else > +return 0; > +#endif > +} > + > +int __init builder_get_cmdline( > +struct boot_info *bi, int offset, char *cmdline, size_t size) > +{ > +#ifdef CONFIG_DOMAIN_BUILDER > +const void *fdt = bootstrap_map_bm(&bi->mods[HYPERLAUNCH_MODULE_IDX]); > +int ret = fdt_cmdline_prop_copy(fdt, offset, cmdline, size); > + > +bootstrap_unmap(); > +return ret; > +#else > +return 0; > +#endif > +} Such #ifdef-ary is better to be avoided by providing stubs in the header. Which then also works towards not needing to build in domain-builder/ when COMFIG_DOMAIN_BUILDER=n. > --- a/xen/arch/x86/domain-builder/fdt.c > +++ b/xen/arch/x86/domain-builder/fdt.c > @@ -109,6 +109,16 @@ static int __init process_domain_node( > printk(" kernel: boot module %d\n", idx); > bi->mods[idx].type = BOOTMOD_KERNEL; > bd->kernel = &bi->mods[idx]; > + > +/* If bootloader didn't set cmdline, see if FDT provides one. */ > +if ( bd->kernel->cmdline_pa && > + !((char *)__va(bd->kernel->cmdline_pa))[0] ) > +{ > +int ret = fdt_get_prop_by_offset( > +fdt, node, "bootargs", &bd->kernel->cmdline_pa); > +if ( ret > 0 ) > +bd->kernel->fdt_cmdline = true; Is there a guarantee that fdt_get_prop_by_offset() won't alter its output (passed in by indirection) in case of failure? Otherwise ... > --- a/xen/arch/x86/setup.c > +++ b/xen/arch/x86/setup.c > @@ -981,7 +981,10 @@ static size_t __init domain_cmdline_size( > { > size_t s = bi->kextra ? strlen(bi->kextra) : 0; > > -s += bd->kernel->cmdline_pa ? strlen(__va(bd->kernel->cmdline_pa)) : 0; > +if ( bd->kernel->fdt_cmdline ) > +s += builder_get_cmdline_size(bi, bd->kernel->cmdline_pa); > +else > +s += strlen(__va(bd->kernel->cmdline_pa)); ... you'll be hosed here (and elsewhere). > --- a/xen/include/xen/libfdt/libfdt-xen.h > +++ b/xen/include/xen/libfdt/libfdt-xen.h > @@ -28,6 +28,30 @@ static inline int __init fdt_cell_as_u64(const fdt32_t > *cell, uint64_t *val) > return 0; > } > > +static inline int __init fdt_get_prop_by_offset( > +const void *fdt, int node, const char *name, unsigned long *offset) > +{ > +int ret, poffset; > +const char *pname; > +size_t nsize = strlen(name); > + > +fdt_for_each_property_offset(poffset, fdt, node) > +{ > +fdt_getprop_by_offset(fdt, poffset, &pname, &ret); > + > +if ( ret < 0 || strlen(pname) != nsize ) > +continue; > + > +if ( !strncmp(pname, name, nsize) ) I find this slightly odd: By now we know strlen(name) == strlen(pname) == nsize. You could then use the usually more efficient memcmp() or the easier to invoke strcmp(). Jan
Re: [PATCH v2 10/15] x86/hyperlaunch: locate dom0 initrd with hyperlaunch
On 26.12.2024 17:57, Daniel P. Smith wrote: > Look for a subnode of type `multiboot,ramdisk` within a domain node. If > found, process the reg property for the MB1 module index. Unlike for cmdline it doesn't look to be mix-and-match here. > --- a/xen/arch/x86/domain-builder/fdt.c > +++ b/xen/arch/x86/domain-builder/fdt.c > @@ -119,6 +119,32 @@ static int __init process_domain_node( > if ( ret > 0 ) > bd->kernel->fdt_cmdline = true; > } > + > +continue; > +} > +else if ( > +fdt_node_check_compatible(fdt, node, "multiboot,ramdisk") == 0 ) I'm sorry, but this isn't style we use. Perhaps else if ( fdt_node_check_compatible( fdt, node, "multiboot,ramdisk") == 0 ) if you dislike else if ( fdt_node_check_compatible(fdt, node, "multiboot,ramdisk") == 0 ) > +{ > +int idx = dom0less_module_node(fdt, node, size_size, > address_size); > +if ( idx < 0 ) Nit: Blank line between declaration(s) and statement(s) please. (Again at least once elsewhere in this patch.) > +{ > +printk(" failed processing ramdisk module for domain %s\n", > + name); > +return -EINVAL; > +} > + > +if ( idx > bi->nr_modules ) > +{ > +printk(" invalid ramdisk module index for domain node > (%d)\n", > + bi->nr_domains); > +return -EINVAL; > +} See comments on similar printk()s in an earlier patch. > @@ -2141,22 +2141,25 @@ void asmlinkage __init noreturn __start_xen(void) > cpu_has_nx ? XENLOG_INFO : XENLOG_WARNING "Warning: ", > cpu_has_nx ? "" : "not "); > > -/* > - * At this point all capabilities that consume boot modules should have > - * claimed their boot modules. Find the first unclaimed boot module and > - * claim it as the initrd ramdisk. Do a second search to see if there are > - * any remaining unclaimed boot modules, and report them as unusued > initrd > - * candidates. > - */ > -initrdidx = first_boot_module_index(bi, BOOTMOD_UNKNOWN); > -if ( initrdidx < MAX_NR_BOOTMODS ) > +if ( !bi->hyperlaunch_enabled ) Can't this be "if ( !bi->hyperlaunch_enabled && initrdidx < MAX_NR_BOOTMODS )" and then all of the churn here can be avoided? An unnecessary call to first_boot_module_index() is unlikely to be the end of the world. Otherwise ... > { > -bi->mods[initrdidx].type = BOOTMOD_RAMDISK; > -bi->domains[0].ramdisk = &bi->mods[initrdidx]; > -if ( first_boot_module_index(bi, BOOTMOD_UNKNOWN) < MAX_NR_BOOTMODS ) > -printk(XENLOG_WARNING > - "Multiple initrd candidates, picking module #%u\n", > - initrdidx); > +/* > + * At this point all capabilities that consume boot modules should > have > + * claimed their boot modules. Find the first unclaimed boot module > and > + * claim it as the initrd ramdisk. Do a second search to see if > there are > + * any remaining unclaimed boot modules, and report them as unusued > initrd > + * candidates. > + */ > +unsigned int initrdidx = first_boot_module_index(bi, > BOOTMOD_UNKNOWN); > +if ( initrdidx < MAX_NR_BOOTMODS ) > +{ > +bi->mods[initrdidx].type = BOOTMOD_RAMDISK; > +bi->domains[0].ramdisk = &bi->mods[initrdidx]; > +if ( first_boot_module_index(bi, BOOTMOD_UNKNOWN) < > MAX_NR_BOOTMODS ) > +printk(XENLOG_WARNING > + "Multiple initrd candidates, picking module #%u\n", > + initrdidx); > +} ... please pay attention to line length when re-indenting. (If you still need to re-indent, perhaps also s/unusued/unused/ in the comment, while you touch it.) Jan
Re: [PATCH 2/3] tools/hvmloader: Replace LAPIC_ID() with cpu_to_apicid[]
On 28.01.2025 17:33, Alejandro Vallejo wrote: > Replace uses of the LAPIC_ID() macro with accesses to the > cpu_to_apicid[] lookup table. This table contains the APIC IDs of each > vCPU as probed at runtime rather than assuming a predefined relation. > > Moved smp_initialise() ahead of apic_setup() in order to initialise > cpu_to_apicid ASAP and avoid using it uninitialised. Note that bringing > up the APs doesn't need the APIC in hvmloader becasue it always runs > virtualized and uses the PV interface. > > Signed-off-by: Alejandro Vallejo > --- > Changes from v7 of the longer topology series: > * Removed ASSERT() for the MP tables and merely skipped writing them > if any vCPU has an APIC ID >=255. Throughout the patch I can't anything matching this; ... > --- a/tools/firmware/hvmloader/config.h > +++ b/tools/firmware/hvmloader/config.h > @@ -48,8 +48,9 @@ extern uint8_t ioapic_version; > > #define IOAPIC_ID 0x01 > > +extern uint32_t *cpu_to_apicid; > + > #define LAPIC_BASE_ADDRESS 0xfee0 > -#define LAPIC_ID(vcpu_id) ((vcpu_id) * 2) > > #define PCI_ISA_DEVFN 0x08/* dev 1, fn 0 */ > #define PCI_ISA_IRQ_MASK0x0c20U /* ISA IRQs 5,10,11 are PCI connected */ > --- a/tools/firmware/hvmloader/hvmloader.c > +++ b/tools/firmware/hvmloader/hvmloader.c > @@ -224,7 +224,7 @@ static void apic_setup(void) > > /* 8259A ExtInts are delivered through IOAPIC pin 0 (Virtual Wire Mode). > */ > ioapic_write(0x10, APIC_DM_EXTINT); > -ioapic_write(0x11, SET_APIC_ID(LAPIC_ID(0))); > +ioapic_write(0x11, SET_APIC_ID(cpu_to_apicid[0])); > } > > struct bios_info { > @@ -341,11 +341,11 @@ int main(void) > > printf("CPU speed is %u MHz\n", get_cpu_mhz()); > > +smp_initialise(); > + > apic_setup(); > pci_setup(); > > -smp_initialise(); > - > perform_tests(); > > if ( bios->bios_info_setup ) > --- a/tools/firmware/hvmloader/mp_tables.c > +++ b/tools/firmware/hvmloader/mp_tables.c > @@ -199,7 +199,7 @@ static void fill_mp_config_table(struct mp_config_table > *mpct, int length) > static void fill_mp_proc_entry(struct mp_proc_entry *mppe, int vcpu_id) > { > mppe->type = ENTRY_TYPE_PROCESSOR; > -mppe->lapic_id = LAPIC_ID(vcpu_id); > +mppe->lapic_id = cpu_to_apicid[vcpu_id]; > mppe->lapic_version = 0x11; > mppe->cpu_flags = CPU_FLAG_ENABLED; > if ( vcpu_id == 0 ) > --- a/tools/firmware/hvmloader/util.c > +++ b/tools/firmware/hvmloader/util.c > @@ -827,7 +827,7 @@ static void acpi_mem_free(struct acpi_ctxt *ctxt, > > static uint32_t acpi_lapic_id(unsigned cpu) > { > -return LAPIC_ID(cpu); > +return cpu_to_apicid[cpu]; > } > > void hvmloader_acpi_build_tables(struct acpi_config *config, ... no conditional is being added anywhere. What am I missing? Thing is - this way I'm uncertain whether the wording above is imprecise, or whether I should really ask that we continue to write all entries with at most 8-bit-wide APIC IDs, omitting just those with wider ones. Jan
Re: [PATCH 3/3] tools/hvmloader: Skip writing MP tables if any CPU has an APIC ID >= 255
On 28.01.2025 17:33, Alejandro Vallejo wrote: > MP Tables have no means of representing APIC IDs wider than 255. At the > moment the effect is wrapping around which would give a corrupted table > with duplicate APIC IDs (some of them possibly the broadcast 255). > > Skip writing it altogether. While it would be possible to restrict the > APs shown it's just not worth the work. Any OS that needs such > adjustments should not have been booted with that many vCPUs to begin > with. > > Signed-off-by: Alejandro Vallejo Oh, here we go. Yet then indeed ... > --- a/tools/firmware/hvmloader/config.h > +++ b/tools/firmware/hvmloader/config.h > @@ -49,6 +49,7 @@ extern uint8_t ioapic_version; > #define IOAPIC_ID 0x01 > > extern uint32_t *cpu_to_apicid; > +extern uint32_t max_apicid; .. I don't think we need this, as ... > --- a/tools/firmware/hvmloader/hvmloader.c > +++ b/tools/firmware/hvmloader/hvmloader.c > @@ -389,7 +389,11 @@ int main(void) > > if ( (hvm_info->nr_vcpus > 1) || hvm_info->apic_mode ) > { > -if ( bios->create_mp_tables ) > +/* > + * Legacy MP tables hold strictly xAPIC IDs. Skip writing > + * the tables altogether if we have IDs wider than 8bits. > + */ > +if ( max_apicid < 0xFF && bios->create_mp_tables ) > bios->create_mp_tables(); I think we ought to continue to write MP tables in this case, merely omitting processor entries with APIC IDs that don't fit. Jan
Re: [PATCH v3 20/24] x86/hvm: add HVM-specific Kconfig
On 04.01.2025 02:58, Denis Mukhin via B4 Relay wrote: > From: Denis Mukhin > > Add separate menu for configuring HVM build-time settings to help organizing > HVM-specific options. > > Signed-off-by: Denis Mukhin Largely: Why not. Question is whether what is being moved now may eventually require moving back, if support was extended to PV (MEM_PAGING and MEM_SHARING). That doesn't look very likely though. > --- a/xen/arch/x86/Kconfig > +++ b/xen/arch/x86/Kconfig > @@ -30,7 +30,6 @@ config X86 > select HAS_SCHED_GRANULARITY > select HAS_UBSAN > select HAS_VMAP > - select HAS_VPCI if HVM > select NEEDS_LIBELF I don't mind the movement of this line, but I'd like to point out that it may be beneficial to have all selects of HAS_* in a central place. Views of other maintainers (or of course anyone else) appreciated. > --- /dev/null > +++ b/xen/arch/x86/hvm/Kconfig > @@ -0,0 +1,74 @@ > +menuconfig HVM > + bool "HVM support" > + depends on !PV_SHIM_EXCLUSIVE > + default !PV_SHIM > + select COMPAT > + select IOREQ_SERVER > + select MEM_ACCESS_ALWAYS_ON > + select HAS_VPCI We strive to have such lists of selects sorted alphabetically, preventing everyone to add to the end of the list (in turn reducing the risk of patches conflicting). Jan
Re: [PATCH v2 3/4] automation: rename CONFIG_MEM_ACCESS -> CONFIG_VM_EVENT
On 21.01.2025 11:23, Sergiy Kibrik wrote: > Following the renaming of Xen build option. > > Signed-off-by: Sergiy Kibrik > --- > automation/eclair_analysis/xen_arm_config | 2 +- > automation/eclair_analysis/xen_x86_config | 2 +- > automation/gitlab-ci/build.yaml | 2 +- > 3 files changed, 3 insertions(+), 3 deletions(-) This can't really be separated from the changes doing the actual rename, can it? Aiui the build (randconfig ones in particular) may break between the two patches, or what is being tested may end up being different. Jan
Re: [PATCH v2 04/15] kconfig: introduce option to independently enable libfdt
On 26.12.2024 17:57, Daniel P. Smith wrote: > Currently, the inclusion of libfdt is controlled by the CONFIG_HAS_DEVICE_TREE > kconfig flag. This flag also changes behavior in a few places, such as boot > module processing for XSM. To support the ability to include libfdt without > changing these behaviors, introduce CONFIG_LIB_DEVICE_TREE. The inclusion of > libfdt is then moved under CONFIG_LIB_DEVICE_TREE. Hmm. I'm not a DT maintainer (imo approval here needs to come from one of its maintainers, despite the files being touched not saying so; I notice you have the larger Cc list already), but to me the 'f' in libfdt is lost with CONFIG_LIB_DEVICE_TREE. What's wrong with CONFIG_LIBFDT when that's the code that you want to cover? Jan
Re: [PATCH v2 05/15] kconfig: introduce domain builder config option
On 26.12.2024 17:57, Daniel P. Smith wrote: > Hyperlaunch domain builder will be the consolidated boot time domain building > logic framework. Introduces the config option to enable this domain builder to > and turn on the ability to load the domain configuration via a flattened > device > tree. s/and/eventually/? Else I fear I'm not getting what is being said here. There's no turning on of anything just yet, afaics. Jan
Re: [PATCH v2 11/15] x86/hyperlaunch: add domain id parsing to domain config
On 26.12.2024 17:57, Daniel P. Smith wrote: > Introduce the ability to specify the desired domain id for the domain > definition. The domain id will be populated in the domid property of the > domain > node in the device tree configuration. > > Signed-off-by: Daniel P. Smith (Not going to repeat style remarks already made on earlier patches. Please apply throughout the series.) > @@ -61,10 +62,40 @@ static int __init dom0less_module_index( > static int __init process_domain_node( > struct boot_info *bi, void *fdt, int dom_node) > { > -int node; > +int node, property; > struct boot_domain *bd = &bi->domains[bi->nr_domains]; > const char *name = fdt_get_name(fdt, dom_node, NULL) ?: "unknown"; > > +fdt_for_each_property_offset(property, fdt, dom_node) > +{ > +const struct fdt_property *prop; > +const char *prop_name; > +int name_len; > + > +prop = fdt_get_property_by_offset(fdt, property, NULL); > +if ( !prop ) > +continue; /* silently skip */ > + > +prop_name = fdt_get_string(fdt, fdt32_to_cpu(prop->nameoff), > &name_len); > + > +if ( strncmp(prop_name, "domid", name_len) == 0 ) Isn't this going to (wrongly) match when e.g. the property has just "d" (and hence name_len is 1). > +{ > +uint32_t val = DOMID_INVALID; > +if ( fdt_prop_as_u32(prop, &val) != 0 ) > +{ > +printk(" failed processing domain id for domain %s\n", > name); > +return -EINVAL; > +} > +if ( val >= DOMID_FIRST_RESERVED ) > +{ > +printk(" invalid domain id for domain %s\n", name); > +return -EINVAL; > +} > +bd->domid = (domid_t)val; > +printk(" domid: %d\n", bd->domid); > +} > +} Perhaps the question comes too early (will be taken care of in later patches), but still: What if multiple domains have the same ID specified? > @@ -125,7 +156,29 @@ static int __init process_domain_node( > else if ( > fdt_node_check_compatible(fdt, node, "multiboot,ramdisk") == 0 ) > { > -int idx = dom0less_module_node(fdt, node, size_size, > address_size); > +unsigned int idx; > +int ret = 0; > + > +if ( bd->ramdisk ) > +{ > +printk(XENLOG_ERR "Duplicate ramdisk module for domain > %s)\n", > + name); > +continue; > +} > + > +/* Try hyperlaunch property, fall back to dom0less property. */ > +if ( hl_module_index(fdt, node, &idx) < 0 ) > +{ > +int address_size = fdt_address_cells(fdt, dom_node); > +int size_size = fdt_size_cells(fdt, dom_node); > + > +if ( address_size < 0 || size_size < 0 ) > +ret = -EINVAL; > +else > +ret = dom0less_module_index( > +fdt, node, size_size, address_size, &idx); > +} Doesn't this belong into the earlier patch? > @@ -154,6 +207,12 @@ static int __init process_domain_node( > return -EFAULT; > } > > +if ( bd->domid == DOMID_INVALID ) > +bd->domid = get_initial_domain_id(); Isn't this redundant with ... > --- a/xen/arch/x86/setup.c > +++ b/xen/arch/x86/setup.c > @@ -1029,8 +1029,9 @@ static struct domain *__init create_dom0(struct > boot_info *bi) > if ( iommu_enabled ) > dom0_cfg.flags |= XEN_DOMCTL_CDF_iommu; > > -/* Create initial domain. Not d0 for pvshim. */ > -bd->domid = get_initial_domain_id(); > +if ( bd->domid == DOMID_INVALID ) > +/* Create initial domain. Not d0 for pvshim. */ > +bd->domid = get_initial_domain_id(); ... this? Jan
Re: [PATCH v2 12/15] x86/hyperlaunch: specify dom0 mode with device tree
On 26.12.2024 17:57, Daniel P. Smith wrote: > Enable selecting the mode in which the domain will be built and ran. This > includes: > > - whether it will be either a 32/64 bit domain I can't spot anything like this in the code changes. > --- a/xen/arch/x86/domain-builder/fdt.c > +++ b/xen/arch/x86/domain-builder/fdt.c > @@ -94,6 +94,25 @@ static int __init process_domain_node( > bd->domid = (domid_t)val; > printk(" domid: %d\n", bd->domid); > } > +else if ( strncmp(prop_name, "mode", name_len) == 0 ) > +{ > +if ( fdt_prop_as_u32(prop, &bd->mode) != 0 ) > +{ > +printk(" failed processing mode for domain %s\n", name); > +return -EINVAL; > +} > + > +printk(" mode: "); > +if ( !(bd->mode & BUILD_MODE_PARAVIRT) ) > +{ > +if ( bd->mode & BUILD_MODE_ENABLE_DM ) > +printk("HVM\n"); > +else > +printk("PVH\n"); > +} > +else > +printk("PV\n"); Shorter and less indentation as if ( bd->mode & BUILD_MODE_PARAVIRT ) printk("PV\n"); else if ( bd->mode & BUILD_MODE_ENABLE_DM ) printk("HVM\n"); else printk("PVH\n"); > --- a/xen/arch/x86/setup.c > +++ b/xen/arch/x86/setup.c > @@ -1016,7 +1016,8 @@ static struct domain *__init create_dom0(struct > boot_info *bi) > struct boot_domain *bd = &bi->domains[0]; > struct domain *d; > > -if ( opt_dom0_pvh ) > +if ( opt_dom0_pvh || > + (bi->hyperlaunch_enabled && !(bd->mode & BUILD_MODE_PARAVIRT)) ) And then dropping BUILD_MODE_ENABLE_DM on the floor? Also shouldn't this be if ( bi->hyperlaunch_enabled ? bd->mode & BUILD_MODE_PARAVIRT : opt_dom0_pvh ) as it can't do any good to honor a conflicting command line option. Command line doc then will want amending to clarify that the option will be ignored in certain cases. Jan
Re: [PATCH v2 13/15] x86/hyperlaunch: add memory parsing to domain config
On 26.12.2024 17:57, Daniel P. Smith wrote: > --- a/xen/arch/x86/dom0_build.c > +++ b/xen/arch/x86/dom0_build.c > @@ -609,6 +609,14 @@ int __init construct_dom0(struct boot_domain *bd) > > process_pending_softirqs(); > > +/* If param dom0_size was not set and HL config provided memory size */ > +if ( !get_memsize(&dom0_size, LONG_MAX) && bd->mem_pages ) > +dom0_size.nr_pages = bd->mem_pages; > +if ( !get_memsize(&dom0_min_size, LONG_MAX) && bd->min_pages ) > +dom0_size.nr_pages = bd->min_pages; > +if ( !get_memsize(&dom0_max_size, LONG_MAX) && bd->max_pages ) > +dom0_size.nr_pages = bd->max_pages; Again I don't think it should be mix-and-match: Either everything is taken from DT, or everything is taken from the command line. Yet I'm open to be convinced otherwise. > --- a/xen/include/xen/libfdt/libfdt-xen.h > +++ b/xen/include/xen/libfdt/libfdt-xen.h > @@ -37,6 +37,15 @@ static inline int __init fdt_prop_as_u32( > return fdt_cell_as_u32((fdt32_t *)prop->data, val); > } > > +static inline int __init fdt_prop_as_u64( > +const struct fdt_property *prop, uint64_t *val) > +{ > +if ( !prop || fdt32_to_cpu(prop->len) < sizeof(u64) ) No new uses of u64 (and alike) please. Jan
Re: [PATCH v2 02/15] x86/boot: introduce domid field to struct boot_domain
On 26.12.2024 17:57, Daniel P. Smith wrote: > Add a domid field to struct boot_domain to hold the assigned domain id for the > domain. During initialization, ensure all instances of struct boot_domain have > the invalid domid to ensure that the domid must be set either by convention or > configuration. I'm still missing justification for the duplication between the struct domain * that's already in struct boot_domain and this new member. Iirc you responded to this earlier question of mine, but nothing was put here. > --- a/xen/arch/x86/setup.c > +++ b/xen/arch/x86/setup.c > @@ -292,6 +292,7 @@ static const char *cmdline_cook(const char *p, const char > *loader_name); > struct boot_info __initdata xen_boot_info = { > .loader = "unknown", > .cmdline = "", > +.domains = { {.domid = DOMID_INVALID} }, Please can you fully use designated initializers here, thus also protecting against MAX_NR_BOOTDOMS increasing without and update being done here (and the compiler still being happy)? .domains = { [0 ... MAX_NR_BOOTDOMS - 1] = { .domid = DOMID_INVALID } }, Nit: Note also the blanks I added. Jan
Re: [PATCH v2 03/15] x86/boot: add cmdline to struct boot_domain
On 26.12.2024 17:57, Daniel P. Smith wrote: > @@ -759,9 +758,10 @@ static int __init pvh_load_kernel( > /* Free temporary buffers. */ > free_boot_modules(); > > -if ( cmdline != NULL ) > +if ( bd->cmdline != NULL ) > { > -rc = hvm_copy_to_guest_phys(last_addr, cmdline, strlen(cmdline) + 1, > v); > +rc = hvm_copy_to_guest_phys( > +last_addr, bd->cmdline, strlen(bd->cmdline) + 1, v); Nit: Indentation. The anchor point for this kind of increased indentation is the function name being called, so you want to add one more blank. (It is not N times the usual indentation of 4, until "it looks okay".) > --- a/xen/arch/x86/include/asm/bootdomain.h > +++ b/xen/arch/x86/include/asm/bootdomain.h > @@ -11,6 +11,8 @@ > #define __XEN_X86_BOOTDOMAIN_H__ > > struct boot_domain { > +const char *cmdline; > + > domid_t domid; > > struct boot_module *kernel; I can see why in the earlier patch you added domid at the top. But cmdline? > --- a/xen/arch/x86/setup.c > +++ b/xen/arch/x86/setup.c > @@ -975,10 +975,29 @@ static unsigned int __init copy_bios_e820(struct > e820entry *map, unsigned int li > return n; > } > > -static struct domain *__init create_dom0(struct boot_info *bi) > +static size_t __init domain_cmdline_size( > +struct boot_info *bi, struct boot_domain *bd) > { > -static char __initdata cmdline[MAX_GUEST_CMDLINE]; > +size_t s = bi->kextra ? strlen(bi->kextra) : 0; > + > +s += bd->kernel->cmdline_pa ? strlen(__va(bd->kernel->cmdline_pa)) : 0; > + > +if ( s == 0 ) > +return s; > + > +/* > + * Certain parameters from the Xen command line may be added to the dom0 > + * command line. Add additional space for the possible cases along with > one > + * extra char to hold \0. > + */ > +s += strlen(" noapic") + strlen(" acpi=") + sizeof(acpi_param) + 1; See below; I question this all being necessary for PVH Dom0. > @@ -1018,39 +1037,52 @@ static struct domain *__init create_dom0(struct > boot_info *bi) > panic("Error creating d%uv0\n", bd->domid); > > /* Grab the DOM0 command line. */ > -if ( bd->kernel->cmdline_pa || bi->kextra ) > +if ( (bd->kernel->cmdline_pa && > + ((char *)__va(bd->kernel->cmdline_pa))[0]) || > + bi->kextra ) > { > -if ( bd->kernel->cmdline_pa ) > -safe_strcpy(cmdline, > -cmdline_cook(__va(bd->kernel->cmdline_pa), > bi->loader)); > +size_t cmdline_size = domain_cmdline_size(bi, bd); > + > +if ( cmdline_size ) > +{ > +if ( !(cmdline = xzalloc_array(char, cmdline_size)) ) > +panic("Error allocating cmdline buffer for %pd\n", d); > > -if ( bi->kextra ) > -/* kextra always includes exactly one leading space. */ > -safe_strcat(cmdline, bi->kextra); > +if ( bd->kernel->cmdline_pa ) > +strlcpy(cmdline, > +cmdline_cook(__va(bd->kernel->cmdline_pa), > bi->loader), > +cmdline_size); > > -/* Append any extra parameters. */ > -if ( skip_ioapic_setup && !strstr(cmdline, "noapic") ) > -safe_strcat(cmdline, " noapic"); > +if ( bi->kextra ) > +/* kextra always includes exactly one leading space. */ > +strlcat(cmdline, bi->kextra, cmdline_size); > > -if ( (strlen(acpi_param) == 0) && acpi_disabled ) > -{ > -printk("ACPI is disabled, notifying Domain 0 (acpi=off)\n"); > -safe_strcpy(acpi_param, "off"); > -} > +/* Append any extra parameters. */ > +if ( skip_ioapic_setup && !strstr(cmdline, "noapic") ) > +strlcat(cmdline, " noapic", cmdline_size); Roger - this isn't going to work very well with PVH Dom0, is it? > -if ( (strlen(acpi_param) != 0) && !strstr(cmdline, "acpi=") ) > -{ > -safe_strcat(cmdline, " acpi="); > -safe_strcat(cmdline, acpi_param); > -} > +if ( (strlen(acpi_param) == 0) && acpi_disabled ) Not sure whether the compiler will do that transformation anyway, but this check looks odd to me. Why not simply check whether the first char is the nul one? > +{ > +printk("ACPI is disabled, notifying Domain 0 (acpi=off)\n"); > +safe_strcpy(acpi_param, "off"); > +} Here I'm doubtful, too, when it comes to PVH Dom0. If Xen's even works in this mode anymore at all, I don't think we can sensibly start a PVH Dom0 then. (This is leaving aside that all of this is Linux-centric anyway.) > -bd->kernel->cmdline_pa = __pa(cmdline); > +if ( (strlen(acpi_param) != 0) && !strstr(cmdline, "acpi=") ) > +{ > +strlcat(cmdline, " acpi=", cmdline_size); > +strlcat(cmdline, acpi_param, cmdline_size);
Re: v5.4.289 failed to boot with error megasas_build_io_fusion 3219 sge_count (-12) is out of range
On 29.01.25 19:46, Harshvardhan Jha wrote: On 30/01/25 12:13 AM, Jürgen Groß wrote: On 29.01.25 19:35, Harshvardhan Jha wrote: On 29/01/25 4:52 PM, Juergen Gross wrote: On 29.01.25 10:15, Harshvardhan Jha wrote: On 29/01/25 2:34 PM, Greg KH wrote: On Wed, Jan 29, 2025 at 02:29:48PM +0530, Harshvardhan Jha wrote: Hi Greg, On 29/01/25 2:18 PM, Greg KH wrote: On Wed, Jan 29, 2025 at 02:13:34PM +0530, Harshvardhan Jha wrote: Hi there, On 29/01/25 2:05 PM, Greg KH wrote: On Wed, Jan 29, 2025 at 02:03:51PM +0530, Harshvardhan Jha wrote: Hi All, +stable There seems to be some formatting issues in my log output. I have attached it as a file. Confused, what are you wanting us to do here in the stable tree? thanks, greg k-h Since, this is reproducible on 5.4.y I have added stable. The culprit commit which upon getting reverted fixes this issue is also present in 5.4.y stable. What culprit commit? I see no information here :( Remember, top-posting is evil... My apologies, The stable tag v5.4.289 seems to fail to boot with the following prompt in an infinite loop: [ 24.427217] megaraid_sas :65:00.0: megasas_build_io_fusion 3273 sge_count (-12) is out of range. Range is: 0-256 Reverting the following patch seems to fix the issue: stable-5.4 : v5.4.285 - 5df29a445f3a xen/swiotlb: add alignment check for dma buffers I tried changing swiotlb grub command line arguments but that didn't seem to help much unfortunately and the error was seen again. Ok, can you submit this revert with the information about why it should not be included in the 5.4.y tree and cc: everyone involved and then we will be glad to queue it up. thanks, greg k-h This might be reproducible on other stable trees and mainline as well so we will get it fixed there and I will submit the necessary fix to stable when everything is sorted out on mainline. Right. Just reverting my patch will trade one error with another one (the one which triggered me to write the patch). There are two possible ways to fix the issue: - allow larger DMA buffers in xen/swiotlb (today 2MB are the max. supported size, the megaraid_sas driver seems to effectively request 4MB) This seems relatively simpler to implement but I'm not sure whether it's the most optimal approach Just making the static array larger used to hold the frame numbers for the buffer seems to be a waste of memory for most configurations. Yep definitely not required in most cases. I'm thinking of an allocated array using the max needed size (replace a former buffer with a larger one if needed). This seems like the right way to go. Can you try the attached patch, please? I don't have a system at hand showing the problem. Juergen From cff43e997f79a95dc44e02debaeafe5f127f40bb Mon Sep 17 00:00:00 2001 From: Juergen Gross Date: Thu, 30 Jan 2025 09:56:57 +0100 Subject: [PATCH] x86/xen: allow larger contiguous memory regions in PV guests Today a PV guest (including dom0) can create 2MB contiguous memory regions for DMA buffers at max. This has led to problems at least with the megaraid_sas driver, which wants to allocate a 2.3MB DMA buffer. The limiting factor is the frame array used to do the hypercall for making the memory contiguous, which has 512 entries and is just a static array in mmu_pv.c. In case a contiguous memory area larger than the initially supported 2MB is requested, allocate a larger buffer for the frame list. Note that such an allocation is tried only after memory management has been initialized properly, which is tested via the early_boot_irqs_disabled flag. Fixes: 9f40ec84a797 ("xen/swiotlb: add alignment check for dma buffers") Signed-off-by: Juergen Gross --- Note that the "Fixes:" tag is not really correct, as that patch didn't introduce the problem, but rather made it visible. OTOH it is the best indicator we have to identify kernel versions this patch should be backported to. --- arch/x86/xen/mmu_pv.c | 44 --- 1 file changed, 37 insertions(+), 7 deletions(-) diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c index 55a4996d0c04..62aec29b8174 100644 --- a/arch/x86/xen/mmu_pv.c +++ b/arch/x86/xen/mmu_pv.c @@ -2200,8 +2200,10 @@ void __init xen_init_mmu_ops(void) } /* Protected by xen_reservation_lock. */ -#define MAX_CONTIG_ORDER 9 /* 2MB */ -static unsigned long discontig_frames[1< MAX_CONTIG_ORDER)) - return -ENOMEM; + if (unlikely(order > discontig_frames_order)) { + if (early_boot_irqs_disabled) + return -ENOMEM; + + new_array = vmalloc(sizeof(unsigned long) * (1UL << order)); + + if (!new_array) + return -ENOMEM; + + spin_lock_irqsave(&xen_reservation_lock, flags); + + if (order > discontig_frames_order) { + if (discontig_frames == discontig_frames_early) +old_array = NULL; + else +old_array = discontig_frames; + discontig_frames = new_array; + discontig_frames_order = order; + } else + old_array = new_array; + + spin_unlock
Re: [PATCH v2 01/15] x86/boot: introduce boot domain
On 26.12.2024 17:57, Daniel P. Smith wrote: > @@ -596,9 +597,10 @@ int __init dom0_setup_permissions(struct domain *d) > return rc; > } > > -int __init construct_dom0(struct boot_info *bi, struct domain *d) > +int __init construct_dom0(struct boot_domain *bd) Pointer-to-const? Domain construction should only be consuming data supplied, I expect. > --- /dev/null > +++ b/xen/arch/x86/include/asm/bootdomain.h Maybe boot-domain.h? Or was that suggested before and discarded for whatever reason? > @@ -0,0 +1,28 @@ > +/* SPDX-License-Identifier: GPL-2.0-or-later */ > +/* > + * Copyright (c) 2024 Apertus Solutions, LLC > + * Author: Daniel P. Smith > + * Copyright (c) 2024 Christopher Clark > + */ > + > +#ifndef __XEN_X86_BOOTDOMAIN_H__ > +#define __XEN_X86_BOOTDOMAIN_H__ > + > +struct boot_domain { > +struct boot_module *kernel; > +struct boot_module *ramdisk; "ramdisk" is Linux-centric, I think. Can we name this more generically? "module" perhaps, despite it then being the same name as we use for the modules Xen is passed? Also, are consumers of this struct supposed to be able to modify what the pointers point to? I'd expect they aren't, in which case const will want adding here, too. Jan
[PATCH for-4.20 2/3] x86/PCI: init segments earlier
In order for amd_iommu_detect_one_acpi()'s call to pci_ro_device() to have permanent effect, pci_segments_init() needs to be called ahead of making it there. Without this we're losing segment 0's r/o map, and thus we're losing write-protection of the PCI devices representing IOMMUs. Which in turn means that half-way recent Linux Dom0 will, as it boots, turn off MSI on these devices, thus preventing any IOMMU events (faults in particular) from being reported on pre-x2APIC hardware. As the acpi_iommu_init() invocation was moved ahead of acpi_mmcfg_init()'s by the offending commit, move the call to pci_segments_init() accordingly. Fixes: 3950f2485bbc ("x86/x2APIC: defer probe until after IOMMU ACPI table parsing") Signed-off-by: Jan Beulich --- Of course it would have been quite a bit easier to notice this issue if radix_tree_insert() wouldn't work fine ahead of radix_tree_init() being invoked for a given radix tree, when the index inserted at is 0. While hunting down various other dead paths to actually find the root cause, it occurred to me that it's probably not a good idea to fully disallow config space writes for r/o devices: Dom0 won't be able to size their BARs (luckily the IOMMU "devices" don't have any, but e.g. serial ones generally will have at least one), for example. Without being able to size BARs it also will likely be unable to correctly account for the address space taken by these BARs. However, outside of vPCI it's not really clear to me how we could reasonably emulate such BAR sizing writes - we can't, after all, allow Dom0 to actually write to the underlying physical registers, yet we don't intercept reads (i.e. we can't mimic expected behavior then). --- a/xen/arch/x86/x86_64/mmconfig-shared.c +++ b/xen/arch/x86/x86_64/mmconfig-shared.c @@ -402,8 +402,6 @@ void __init acpi_mmcfg_init(void) { bool valid = true; -pci_segments_init(); - /* MMCONFIG disabled */ if ((pci_probe & PCI_PROBE_MMCONF) == 0) return; --- a/xen/drivers/passthrough/x86/iommu.c +++ b/xen/drivers/passthrough/x86/iommu.c @@ -55,6 +55,8 @@ void __init acpi_iommu_init(void) { int ret = -ENODEV; +pci_segments_init(); + if ( !iommu_enable && !iommu_intremap ) return;
[PATCH for-4.20? 3/3] AMD/IOMMU: log IVHD contents
Despite all the verbosity with "iommu=debug", information on the IOMMUs themselves was missing. Signed-off-by: Jan Beulich --- a/xen/drivers/passthrough/amd/iommu_acpi.c +++ b/xen/drivers/passthrough/amd/iommu_acpi.c @@ -911,6 +911,11 @@ static int __init parse_ivhd_block(const return -ENODEV; } +AMD_IOMMU_DEBUG("IVHD: IOMMU @ %#lx cap @ %#x seg 0x%04x info %#x attr %#x\n", +ivhd_block->base_address, ivhd_block->capability_offset, +ivhd_block->pci_segment_group, ivhd_block->info, +ivhd_block->iommu_attr); + iommu = find_iommu_from_bdf_cap(ivhd_block->pci_segment_group, ivhd_block->header.device_id, ivhd_block->capability_offset);
[PATCH for-4.20 0/3] AMD/IOMMU: assorted corrections
The three patches are functionally independent, and they're presented here merely in the order I came to notice the respective issues. At least patch 2 wants seriously considering for 4.20. 1: AMD/IOMMU: drop stray MSI enabling 2: x86/PCI: init segments earlier 3: AMD/IOMMU: log IVHD contents Jan
[PATCH for-4.20? 1/3] AMD/IOMMU: drop stray MSI enabling
While the 2nd of the commits referenced below should have moved the call to amd_iommu_msi_enable() instead of adding another one, the situation wasn't quite right even before: It can't have done any good to enable MSI when no IRQ was allocated for it, yet. Fixes: 5f569f1ac50e ("AMD/IOMMU: allow enabling with IRQ not yet set up") Fixes: d9e49d1afe2e ("AMD/IOMMU: adjust setup of internal interrupt for x2APIC mode") Signed-off-by: Jan Beulich --- a/xen/drivers/passthrough/amd/iommu_init.c +++ b/xen/drivers/passthrough/amd/iommu_init.c @@ -902,8 +902,6 @@ static void enable_iommu(struct amd_iomm } } -amd_iommu_msi_enable(iommu, IOMMU_CONTROL_ENABLED); - set_iommu_ht_flags(iommu); set_iommu_command_buffer_control(iommu, IOMMU_CONTROL_ENABLED);
Re: [PATCH v2 1/4] xen: kconfig: rename MEM_ACCESS -> VM_EVENT
On 21.01.2025 11:19, Sergiy Kibrik wrote: > Use more generic CONFIG_VM_EVENT name throughout Xen code instead of > CONFIG_MEM_ACCESS. This reflects the fact that vm_event is a higher level > feature, with mem_access & monitor depending on it. > > Suggested-by: Jan Beulich I don't think this is applicable; my suggestion went in a different direction. > Suggested-by: Tamas K Lengyel > Signed-off-by: Sergiy Kibrik Before considering to ack this, I'd like you, Tamas, to confirm this is really what you had thought of. In particular ... > --- a/xen/arch/arm/Makefile > +++ b/xen/arch/arm/Makefile > @@ -37,7 +37,7 @@ obj-y += irq.o > obj-y += kernel.init.o > obj-$(CONFIG_LIVEPATCH) += livepatch.o > obj-$(CONFIG_LLC_COLORING) += llc-coloring.o > -obj-$(CONFIG_MEM_ACCESS) += mem_access.o > +obj-$(CONFIG_VM_EVENT) += mem_access.o ... changes like this one look somewhat odd to me. > --- a/xen/common/Kconfig > +++ b/xen/common/Kconfig > @@ -92,7 +92,7 @@ config HAS_VMAP > config MEM_ACCESS_ALWAYS_ON > bool > > -config MEM_ACCESS > +config VM_EVENT > def_bool MEM_ACCESS_ALWAYS_ON > prompt "Memory Access and VM events" if !MEM_ACCESS_ALWAYS_ON > depends on HVM What about MEM_ACCESS_ALWAYS_ON (visible in patch context)? Shouldn't that become VM_EVENT_ALWAYS_ON then, too? Further, what about MEM_PAGING and MEM_SHARING? Shouldn't those, at least documentation purposes, then also gain a dependency on VM_EVENT? Jan
Re: [PATCH v2 2/4] x86:monitor: control monitor.c build with CONFIG_VM_EVENT option
On 21.01.2025 11:21, Sergiy Kibrik wrote: > Replace more general CONFIG_HVM option with CONFIG_VM_EVENT which is more > relevant and specific to monitoring. This is only to clarify at build level > to which subsystem this file belongs. > > No functional change here, as VM_EVENT depends on HVM. > > Signed-off-by: Sergiy Kibrik As long as patch 1 stays roughly as it is right now: Acked-by: Jan Beulich Jan
Re: [PATCH v2 1/4] xen: kconfig: rename MEM_ACCESS -> VM_EVENT
On 21.01.2025 11:19, Sergiy Kibrik wrote: > @@ -58,7 +58,7 @@ p2m_mem_access_check_and_get_page(vaddr_t gva, unsigned > long flag, > return NULL; > } > > -#endif /*CONFIG_MEM_ACCESS*/ > +#endif /*CONFIG_VM_EVENT*/ Oh, also - as you touch this anyway: Would you mind adding the mising blanks, just like we have them ... > #endif /* _ASM_ARM_MEM_ACCESS_H */ ... on the immediately following line? Jan
Re: [PATCH v2 4/4] xen: mem_access: conditionally compile vm_event.c & monitor.c
On 21.01.2025 11:25, Sergiy Kibrik wrote: > --- a/xen/include/xen/monitor.h > +++ b/xen/include/xen/monitor.h > @@ -27,8 +27,17 @@ > struct domain; > struct xen_domctl_monitor_op; > > +#ifdef CONFIG_VM_EVENT > int monitor_domctl(struct domain *d, struct xen_domctl_monitor_op *mop); > void monitor_guest_request(void); > +#else > +static inline int monitor_domctl(struct domain *d, > + struct xen_domctl_monitor_op *mop) > +{ > +return -EINVAL; EOPNOTSUPP perhaps? Otherwise looks okay to me, but first and foremost requires Arm side approval. Jan
Re: [PATCH v2 06/15] x86/hyperlaunch: introduce the domain builder
On 26.12.2024 17:57, Daniel P. Smith wrote: > --- a/xen/arch/x86/Makefile > +++ b/xen/arch/x86/Makefile > @@ -81,6 +81,8 @@ obj-$(CONFIG_COMPAT) += x86_64/platform_hypercall.o > obj-y += sysctl.o > endif > > +obj-y += domain-builder/ The set of subdirs needed in $(obj-y) is specified at the top of the file. Also shouldn't this be obj-$(CONFIG_DOMAIN_BUILDER)? > --- /dev/null > +++ b/xen/arch/x86/domain-builder/core.c > @@ -0,0 +1,57 @@ > +/* SPDX-License-Identifier: GPL-2.0-only */ > +/* > + * Copyright (C) 2024, Apertus Solutions, LLC > + */ > +#include > +#include > +#include > +#include > + > +#include > + > +#include "fdt.h" > + > +void __init builder_init(struct boot_info *bi) > +{ > +if ( IS_ENABLED(CONFIG_DOMAIN_BUILDER) ) > +{ > +int ret; > + > +switch ( ret = has_hyperlaunch_fdt(bi) ) > +{ > +case 0: > +printk("Hyperlaunch device tree detected\n"); > +bi->hyperlaunch_enabled = true; > +bi->mods[0].type = BOOTMOD_FDT; > +break; > + > +case -EINVAL: > +printk("Hyperlaunch device tree was not detected\n"); > +bi->hyperlaunch_enabled = false; > +break; > + > +case -ENOENT: > +case -ENODATA: > +printk("Device tree found, but not hyperlaunch (%d)\n", ret); > +bi->hyperlaunch_enabled = false; > +bi->mods[0].type = BOOTMOD_FDT; > +break; > + > +default: > +printk("Unknown error (%d) occured checking for hyperlaunch > device tree\n", > + ret); > +bi->hyperlaunch_enabled = false; > +break; > +} > +} > +} What is it that's x86-specific in here? > --- /dev/null > +++ b/xen/arch/x86/domain-builder/fdt.c > @@ -0,0 +1,37 @@ > +/* SPDX-License-Identifier: GPL-2.0-only */ > +/* > + * Copyright (C) 2024, Apertus Solutions, LLC > + */ > +#include > +#include > +#include > +#include > + > +#include > +#include > +#include > + > +#include "fdt.h" > + > +int __init has_hyperlaunch_fdt(struct boot_info *bi) Pointer-to-const? > +{ > +int ret = 0; > +const void *fdt = bootstrap_map_bm(&bi->mods[HYPERLAUNCH_MODULE_IDX]); > + > +if ( fdt_check_header(fdt) < 0 ) > +ret = -EINVAL; > + > +bootstrap_unmap(); > + > +return ret; > +} Is this function intended to later be extended? Aiui anything fitting the hyperlaunch-agnostic fdt_check_header() will do here, despite the name of the function. And again - what is it that's x86-specific in here? > --- /dev/null > +++ b/xen/arch/x86/domain-builder/fdt.h > @@ -0,0 +1,21 @@ > +/* SPDX-License-Identifier: (GPL-2.0-or-later OR BSD-2-Clause) */ > +#ifndef __XEN_X86_FDT_H__ > +#define __XEN_X86_FDT_H__ > + > +#include > + > +#include This isn't needed here, nor ... > --- /dev/null > +++ b/xen/arch/x86/include/asm/domainbuilder.h > @@ -0,0 +1,8 @@ > +#ifndef __XEN_X86_DOMBUILDER_H__ > +#define __XEN_X86_DOMBUILDER_H__ > + > +#include ... here, is it? Forward decls of struct boot_info are going to do. > @@ -1285,9 +1286,12 @@ void asmlinkage __init noreturn __start_xen(void) > bi->nr_modules); > } > > -/* Dom0 kernel is always first */ > -bi->mods[0].type = BOOTMOD_KERNEL; > -bi->domains[0].kernel = &bi->mods[0]; > +builder_init(bi); > + > +/* Find first unknown boot module to use as Dom0 kernel */ > +i = first_boot_module_index(bi, BOOTMOD_UNKNOWN); > +bi->mods[i].type = BOOTMOD_KERNEL; > +bi->domains[0].kernel = &bi->mods[i]; This is going to change again later? Or else what about there already being a module marked BOOTMOD_KERNEL? Jan
Re: slow start of Pod HVM domU with qemu 9.1
On Wed, 29 Jan 2025, Stefano Stabellini wrote: > On Wed, 29 Jan 2025, Edgar E. Iglesias wrote: > > On Tue, Jan 28, 2025 at 03:58:14PM -0800, Stefano Stabellini wrote: > > > On Tue, 28 Jan 2025, Edgar E. Iglesias wrote: > > > > On Tue, Jan 28, 2025 at 03:15:44PM +0100, Olaf Hering wrote: > > > > > Hello, > > > > > > > > > > starting with qemu 9.1 a PoD HVM domU takes a long time to start. > > > > > Depending on the domU kernel, it may trigger a warning, which > > > > > prompted me > > > > > to notice this change in behavior: > > > > > > > > > > [0.00] Linux version 4.12.14-120-default (geeko@buildhost) > > > > > (gcc version 4.8.5 (SUSE Linux) ) #1 SMP Thu Nov 7 16:39:09 UTC 2019 > > > > > (fd9dc36) > > > > > ... > > > > > [1.096432] HPET: 3 timers in total, 0 timers will be used for > > > > > per-cpu timer > > > > > [1.101636] hpet0: at MMIO 0xfed0, IRQs 2, 8, 0 > > > > > [1.104051] hpet0: 3 comparators, 64-bit 62.50 MHz counter > > > > > [ 16.136086] random: crng init done > > > > > [ 31.712052] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 > > > > > nice=0 stuck for 30s! > > > > > [ 31.716029] Showing busy workqueues and worker pools: > > > > > [ 31.721164] workqueue events: flags=0x0 > > > > > [ 31.724054] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256 > > > > > [ 31.728000] in-flight: 17:balloon_process > > > > > [ 31.728000] pending: hpet_work > > > > > [ 31.728023] workqueue mm_percpu_wq: flags=0x8 > > > > > [ 31.732987] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 > > > > > [ 31.736000] pending: vmstat_update > > > > > [ 31.736024] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=30s > > > > > workers=2 idle: 34 > > > > > [ 50.400102] clocksource: Switched to clocksource xen > > > > > [ 50.441153] VFS: Disk quotas dquot_6.6.0 > > > > > ... > > > > > > > > > > With qemu 9.0 and older, this domU will start the /init task after 8 > > > > > seconds. > > > > > > > > > > The change which caused this commit is qemu.git commit > > > > > 9ecdd4bf08dfe4a37e16b8a8b228575aff641468 > > > > > Author: Edgar E. Iglesias > > > > > AuthorDate: Tue Apr 30 10:26:45 2024 +0200 > > > > > Commit: Edgar E. Iglesias > > > > > CommitDate: Sun Jun 9 20:16:14 2024 +0200 > > > > > > > > > > xen: mapcache: Add support for grant mappings > > > > > > > > > > As you can see, v4 instead of v5 was apparently applied. > > > > > This was probably unintentional, but would probably not change the > > > > > result. > > > > > > > > Hi Olaf, > > > > > > > > It looks like v8 was applied, or am I missing something? > > > > > > > > > > > > > > > > > > With this change the domU starts fast again: > > > > > > > > > > --- a/hw/xen/xen-mapcache.c > > > > > +++ b/hw/xen/xen-mapcache.c > > > > > @@ -522,6 +522,7 @@ ram_addr_t xen_ram_addr_from_mapcache(void *ptr) > > > > > ram_addr_t addr; > > > > > > > > > > addr = xen_ram_addr_from_mapcache_single(mapcache, ptr); > > > > > +if (1) > > > > > if (addr == RAM_ADDR_INVALID) { > > > > > addr = xen_ram_addr_from_mapcache_single(mapcache_grants, > > > > > ptr); > > > > > } > > > > > @@ -626,6 +627,7 @@ static void > > > > > xen_invalidate_map_cache_entry_single(MapCache *mc, uint8_t *buffer) > > > > > static void xen_invalidate_map_cache_entry_all(uint8_t *buffer) > > > > > { > > > > > xen_invalidate_map_cache_entry_single(mapcache, buffer); > > > > > +if (1) > > > > > xen_invalidate_map_cache_entry_single(mapcache_grants, buffer); > > > > > } > > > > > > > > > > @@ -700,6 +702,7 @@ void xen_invalidate_map_cache(void) > > > > > bdrv_drain_all(); > > > > > > > > > > xen_invalidate_map_cache_single(mapcache); > > > > > +if (0) > > > > > xen_invalidate_map_cache_single(mapcache_grants); > > > > > } > > > > > > > > > > I did the testing with libvirt, the domU.cfg equivalent looks like > > > > > this: > > > > > maxmem = 4096 > > > > > memory = 2048 > > > > > maxvcpus = 4 > > > > > vcpus = 2 > > > > > pae = 1 > > > > > acpi = 1 > > > > > apic = 1 > > > > > viridian = 0 > > > > > rtc_timeoffset = 0 > > > > > localtime = 0 > > > > > on_poweroff = "destroy" > > > > > on_reboot = "destroy" > > > > > on_crash = "destroy" > > > > > device_model_override = "/usr/lib64/qemu-9.1/bin/qemu-system-i386" > > > > > sdl = 0 > > > > > vnc = 1 > > > > > vncunused = 1 > > > > > vnclisten = "127.0.0.1" > > > > > vif = [ "mac=52:54:01:23:63:29,bridge=br0,script=vif-bridge" ] > > > > > parallel = "none" > > > > > serial = "pty" > > > > > builder = "hvm" > > > > > kernel = "/bug1236329/linux" > > > > > ramdisk = "/bug1236329/initrd" > > > > > cmdline = "console=ttyS0,115200n8 quiet ignore_loglevel"" > > > > > boot = "c" > > > > > disk = [ > > > > > "format=qcow2,vdev=hda,access=rw,backendtype=qdisk,target=/bug1236329/sles12sp5.qcow2" > > > > > ] > > > > > usb = 1 > > > > > usbdevice = "tablet" > > > > > > > > > > Any idea what can be
Re: [PATCH v2 14/15] x86/hyperlaunch: add max vcpu parsing of hyperlaunch device tree
On 26.12.2024 17:57, Daniel P. Smith wrote: > --- a/xen/arch/x86/domain-builder/fdt.c > +++ b/xen/arch/x86/domain-builder/fdt.c > @@ -147,6 +147,17 @@ static int __init process_domain_node( > bd->max_pages = PFN_DOWN(kb * SZ_1K); > printk(" max memory: %ld kb\n", kb); > } > +else if ( strncmp(prop_name, "cpus", name_len) == 0 ) > +{ > +uint32_t val = UINT_MAX; It's not the first time I see such an initializer, yet it's even more pronounced here, as the call ... > +if ( fdt_prop_as_u32(prop, &val) != 0 ) ... is coming right next. If that function succeeds, it surely should set its output? And if it didn't, you're as hosed with initializer as you're without. Jan
Re: Config space access to Mediatek MT7922 doesn't work after device reset in Xen PV dom0 (regression, Linux 6.12)
On Thu, Jan 30, 2025 at 10:30:33AM +0100, Jan Beulich wrote: > On 30.01.2025 05:55, Marek Marczykowski-Górecki wrote: > > I've added logging of all config read/write to this device. Full log at > > [1]. > ... I suspect there's something wrong with the Root Port RRS SV configuration. Can you add the two patches below? I don't *think* either should make a difference. The first enables RRS SV earlier and maybe in a cleaner place; the second should avoid some pointless capability searches that clutter the logs. What does d0v1/d0v2/d0v3 mean? Can you add 00:02.2, the Root Port leading to bus 01, so we can see the RRS SV configuration? Maybe also lspci -vv for both 00:02.2 and 01:00.0? Maybe include timestamps and try an FLR without Xen (which I assume works correctly) so we can see how long the device typically takes to become ready? Notes below on the snippet that you commented on, Jan. I think it makes sense until the read after FLR returns 0x. > > (XEN) d0v1 conf read cf8 0x80010034 bytes 1 offset 0 data 0x80 PCI_CAPABILITY_LIST, first cap at 0x80 > > (XEN) d0v1 conf read cf8 0x80010080 bytes 2 offset 0 data 0xe010 PCI_CAP_ID_EXP (0x10) at 0x80, next cap at 0xe0 > > (XEN) d0v1 conf read cf8 0x800100e0 bytes 2 offset 0 data 0xf805 PCI_CAP_ID_MSI (0x05) at 0xe0, next cap at 0xf8 > > (XEN) d0v1 conf read cf8 0x800100f8 bytes 2 offset 0 data 0x1 PCI_CAP_ID_PM (0x01) at 0xf8, end of cap list These caps match the offsets from the lspci output in the full log: 1:00.0 Network controller: MEDIATEK Corp. MT7922 802.11ax PCI Express Wireless Network Adapter Subsystem: MEDIATEK Corp. Device e616 Flags: fast devsel, IRQ 255 Memory at 801090 (64-bit, prefetchable) [disabled] [size=1M] Memory at 90b0 (64-bit, non-prefetchable) [disabled] [size=32K] Capabilities: [80] Express Endpoint, IntMsgNum 0 Capabilities: [e0] MSI: Enable- Count=1/32 Maskable+ 64bit+ Capabilities: [f8] Power Management version 3 > > (XEN) d0v1 conf write cf8 0x80010004 bytes 2 offset 0 data 0x400 Set PCI_COMMAND_INTX_DISABLE, disable BARs, Bus Master. Looks like the end of pci_dev_save_and_disable(). > > (XEN) d0v1 conf read cf8 0x80010088 bytes 2 offset 2 data 0x9 PCIe Cap at 0x80, PCI_EXP_DEVCTL is 0x08, PCI_EXP_DEVSTA is 0x0a. 0x80010088 would be PCI_EXP_DEVCTL (a 2-byte register), maybe offset 2 gets us to PCI_EXP_DEVSTA? Not sure. 0x0001 PCI_EXP_DEVSTA_CED /* Correctable Error Detected */ 0x0008 PCI_EXP_DEVSTA_URD /* Unsupported Request Detected */ Not impossible that these would be set. Lots of URs happen during enumeration and we're not very good about cleaning these up. Correctable errors are common for some devices. lspci -vv would decode the PCIe cap registers, including this. > > (XEN) d0v1 conf read cf8 0x80010088 bytes 2 offset 0 data 0x2910 PCI_EXP_DEVCTL: 0x2000 PCI_EXP_DEVCTL_READRQ_512B 0x0800 PCI_EXP_DEVCTL_NOSNOOP_EN 0x0100 PCI_EXP_DEVCTL_EXT_TAG 0x0010 PCI_EXP_DEVCTL_RELAX_EN > > (XEN) d0v1 conf write cf8 0x80010088 bytes 2 offset 0 data 0xa910 PCI_EXP_DEVCTL: set 0x8000 PCI_EXP_DEVCTL_BCR_FLR This looks like the actual FLR being initiated. > This is the express capability's Link Control 2 Register afaict. Unless I'm missing something this is actually Device Control. So far I think this all looks OK. The next part: > > (XEN) d0v2 conf read cf8 0x8001 bytes 4 offset 0 data 0x looks like a problem. The normal successful read gets 0x061614c3. After the FLR, if RRS SV is enabled, we should get either 0x0001 or 0x061614c3. Here we would exit the loop in pci_dev_wait() because we didn't see 0x0001 and we expect that the device is ready and we got a valid Vendor ID. So we proceed to restoring config space via pci_restore_state(), where we restore some PCIe registers and the header (first 64 bytes). My *guess* is the device isn't ready (or at least not responding) since all the reads return ~0. > > [1] https://gist.github.com/marmarek/b4391c71801145e52590e877c559c5e0 commit c2fd12204dcb ("PCI: Enable Configuration RRS SV early") Author: Bjorn Helgaas Date: Thu Jan 30 15:16:40 2025 -0600 PCI: Enable Configuration RRS SV early diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index b6536ed599c3..0b013b196d00 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -1373,8 +1373,6 @@ static int pci_scan_bridge_extend(struct pci_bus *bus, struct pci_dev *dev, pci_write_config_word(dev, PCI_BRIDGE_CONTROL, bctl & ~PCI_BRIDGE_CTL_MASTER_ABORT); - pci_enable_rrs_sv(dev); - if ((secondary || subordinate) && !pcibios_assign_all_busses() && !is_cardbus && !broken) { unsigned int cmax, buses; @@ -1615,6 +1613,11 @@ void set_pcie_port_type(struct pci_dev *pdev) pdev->pcie_cap = pos; pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, ®16); pdev->pcie_flags_reg = reg
Re: [PATCH v2 4/4] xen: mem_access: conditionally compile vm_event.c & monitor.c
On Tue, Jan 21, 2025 at 5:25 AM Sergiy Kibrik wrote: > > From: Stefano Stabellini > > Extend coverage of CONFIG_VM_EVENT option and make the build of VM events > and monitoring support optional. > This is to reduce code size on Arm when this option isn't enabled. > > CC: Jan Beulich > CC: Tamas K Lengyel > Reviewed-by: Ayan Kumar Halder > Signed-off-by: Stefano Stabellini > Signed-off-by: Sergiy Kibrik > --- > changes in v2: > - rename CONFIG_MEM_ACCESS -> CONFIG_VM_EVENT > - tags > --- > xen/arch/arm/Makefile | 4 ++-- > xen/arch/arm/vsmc.c| 3 ++- > xen/common/Makefile| 4 ++-- > xen/include/xen/monitor.h | 9 + > xen/include/xen/vm_event.h | 14 +++--- > 5 files changed, 26 insertions(+), 8 deletions(-) > > diff --git a/xen/arch/arm/Makefile b/xen/arch/arm/Makefile > index ad29316df1..e61238c4d0 100644 > --- a/xen/arch/arm/Makefile > +++ b/xen/arch/arm/Makefile > @@ -39,7 +39,7 @@ obj-$(CONFIG_LIVEPATCH) += livepatch.o > obj-$(CONFIG_LLC_COLORING) += llc-coloring.o > obj-$(CONFIG_VM_EVENT) += mem_access.o > obj-y += mm.o > -obj-y += monitor.o > +obj-$(CONFIG_VM_EVENT) += monitor.o > obj-y += p2m.o > obj-y += platform.o > obj-y += platform_hypercall.o > @@ -65,7 +65,7 @@ obj-$(CONFIG_VGICV2) += vgic-v2.o > obj-$(CONFIG_GICV3) += vgic-v3.o > obj-$(CONFIG_HAS_ITS) += vgic-v3-its.o > endif > -obj-y += vm_event.o > +obj-$(CONFIG_VM_EVENT) += vm_event.o > obj-y += vtimer.o > obj-$(CONFIG_SBSA_VUART_CONSOLE) += vpl011.o > obj-y += vsmc.o > diff --git a/xen/arch/arm/vsmc.c b/xen/arch/arm/vsmc.c > index 62d8117a12..1ea75cd7f1 100644 > --- a/xen/arch/arm/vsmc.c > +++ b/xen/arch/arm/vsmc.c > @@ -330,7 +330,8 @@ void do_trap_smc(struct cpu_user_regs *regs, const union > hsr hsr) > } > > /* If monitor is enabled, let it handle the call. */ > -if ( current->domain->arch.monitor.privileged_call_enabled ) > +if ( IS_ENABLED(CONFIG_VM_EVENT) && > + current->domain->arch.monitor.privileged_call_enabled ) > rc = monitor_smc(); Why not wrap this entire if block above in an #ifdef CONFIG_VM_EVENT? I think it would be more explicit what code is being compiled that way instead of just relying on the compiler optimization to take care of removing it. The rest of the patch looks fine to me. Tamas
Re: [PATCH v2 08/15] x86/hyperlaunch: locate dom0 kernel with hyperlaunch
On Thu, 30 Jan 2025, Jan Beulich wrote: > On 26.12.2024 17:57, Daniel P. Smith wrote: > > Look for a subnode of type `multiboot,kernel` within a domain node. If > > found, > > process the reg property for the MB1 module index. If the bootargs property > > is > > present and there was not an MB1 string, then use the command line from the > > device tree definition. > > While multiboot is apparently the first x86-specific part (as far as Xen goes) > to be put under domain-builder/, I wonder: > - Wouldn't looking for "multiboot,kernel" simply yield nothing on non-x86, > so having the code under common/ would still be okay? One small clarification: multiboot,kernel is actually common between both ARM and x86. It is "module-index" which is x86-specific and would "simply yield nothing on non-x86", as you wrote. I'll let Dan address your point that "having the code under common/ would still be okay". > - What's "multiboot" describing here? The origin of the module? (What other > origins would then be possible? How would MB1 and MB2 be distinguished? > What about a native xen.efi boot?) A property of the kernel (when Linux > doesn't use MB)? Each device tree node has a compatible string to qualify what kind of information the node is describing. The compatible string for device tree nodes describing a kernel binary or a ramdisk previously loaded into memory by a bootloader have a "multiboot," prefix. See docs/misc/arm/device-tree/booting.txt. This is unrelated to the binary multiboot protocol Grub uses on x86 to boot Xen. A distinction between MB1 and MB2 is not needed in device tree, that information is retrieved via the Grub multiboot protocol as usual. The only thing needed here in device tree is the location of the kernel, either by RAM address, or by Grub multiboot module index. This last option (Grub multiboot module index) is the "module-index" property I mentioned above.
Re: [XEN RFC PATCH v5 3/5] xen/public: Introduce PV-IOMMU hypercall interface
Hi Teddy, Thanks for working on this. I'm curious about your plans for this: On 2025-01-21 11:13, Teddy Astie wrote: +/** + * IOMMU_alloc_nested + * Create a nested IOMMU context (needs IOMMUCAP_nested). + * + * This context uses a platform-specific page table from domain address space + * specified in pgtable_gfn and use it for nested translations. + * + * Explicit flushes needs to be submited with IOMMU_flush_nested on + * modification of the nested pagetable to ensure coherency between IOTLB and + * nested page table. + * + * This context can be destroyed using IOMMU_free_context. + * This context cannot be modified using map_pages, unmap_pages. + */ +struct pv_iommu_alloc_nested { +/* OUT: allocated IOMMU context number */ +uint16_t ctx_no; + +/* IN: guest frame number of the nested page table */ +uint64_aligned_t pgtable_gfn; + +/* IN: nested mode flags */ +uint64_aligned_t nested_flags; +}; +typedef struct pv_iommu_alloc_nested pv_iommu_alloc_nested_t; +DEFINE_XEN_GUEST_HANDLE(pv_iommu_alloc_nested_t); Is this command intended to be used for GVA -> GPA translation? Would you need some way to associate with another iommu context for GPA -> HPA translation? Maybe more broadly, what are your goals for enabling PV-IOMMU? The examples on your blog post cover a domain restrict device access to specific portions of the the GPA space. Are you also interested in GVA -> GPA? Does VFIO required GVA -> GPA? And, sorry to bike shed, but ctx_no reads like "Context No" to me. I think ctxid/ctx_id might be clearer. Others probably have their own opinions :) Thanks, Jason
Re: [PATCH 0/3] tools/hvmloader: Decouple APIC IDs from vCPU IDs
On 29.01.2025 17:25, Roger Pau Monné wrote: > On Tue, Jan 28, 2025 at 06:42:38PM +, Alejandro Vallejo wrote: >> On Tue Jan 28, 2025 at 5:45 PM GMT, Roger Pau Monné wrote: >>> On Tue, Jan 28, 2025 at 04:33:39PM +, Alejandro Vallejo wrote: The hypervisor, hvmloader and the toolstack currently engage in a shared assumption that for every vCPU apicid == 2 * vcpuid. This series removes such assumption from hvmloader, by making it read the APIC ID of each vCPU and storing it for later use. The last patch prevents writing an MP Tables should we have vCPUs that can not be represented there. That's at the moment dead code because all vCPUs are currently representable in 8 bits. This will inavitably stop being true in the future after we increase the maximum number of guest vCPUs. >>> >>> While I'm fine with the MP Table change, should it also come together >>> with a patch that introduces the code to create x2APIC entries in >>> libacpi construct_madt() helper? (and bumping the MADT revision, as >>> I'm quite sure version 2 didn't have x2APIC entries in the >>> specification). >> >> That's a lot more involved though. Matt started something in that direction >> last year, but testing it was (and still is) effectively impossible until >> HVM_MAX_VCPUS increases. >> >> >> https://lore.kernel.org/xen-devel/cd1a3ce14790af8c1bb4372ef0be5a6cbbb50b1c.1710338145.git.matthew.bar...@cloud.com/ >> >> The rest of the topo series can be used to test that (with a hack to >> artificially bump the width of thread_id space), I'd rather not test a patch >> with a long and still uncommitted series. >> >>> >>> Otherwise the MP Table change seems like a red herring, because the >>> MADT created by libacpi will also be incorrect and APIC IDs will wrap in >>> local APIC entries, just like it would on MP Tables. >>> >>> Thanks, Roger. >> >> My take is that this is strictly better than what we have today by virtue of >> going down from 2 latent bugs to just 1. That said, I don't strictly need it >> for the topo series to advance, so it is (in a sense) optional. > > I'm fine with the patch, but it probably wants to mention in the > commit message that MADT tables will still wrap when using APIC IDs > > 255, as otherwise it seems MADT is not taken into consideration. I think we simply should not add MADT entries with wrapped (truncated) APIC IDs. Which can be done when they truly are at risk of wrapping, or right here. Jan
Re: Config space access to Mediatek MT7922 doesn't work after device reset in Xen PV dom0 (regression, Linux 6.12)
On 30.01.2025 05:55, Marek Marczykowski-Górecki wrote: > I've added logging of all config read/write to this device. Full log at > [1]. > > A little explanation: > - it's done in pci_conf_read/pci_conf_write in > https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/pci.c;h=97b792e578f1093194466081ad3651ade21cae7d;hb=HEAD > - cf8 means cf8 port value (BDF + register) > - bytes is read/write size (1/2/4) > - offset is the offset in the register (on top of cf8), but not in data > - data is either retrieved value, or written value, depending on > function > - it's logging only accesses to 01:00.0 > > interesting part: > > lspci before reset: > (XEN) d0v3 conf read cf8 0x8001 bytes 4 offset 0 data 0x61614c3 > (XEN) d0v3 conf read cf8 0x80010004 bytes 4 offset 0 data 0x10 > (XEN) d0v3 conf read cf8 0x80010008 bytes 4 offset 0 data 0x280 > (XEN) d0v3 conf read cf8 0x8001000c bytes 4 offset 0 data 0x10 > (XEN) d0v3 conf read cf8 0x80010010 bytes 4 offset 0 data 0x109c > (XEN) d0v3 conf read cf8 0x80010014 bytes 4 offset 0 data 0x80 > (XEN) d0v3 conf read cf8 0x80010018 bytes 4 offset 0 data 0x90b4 > (XEN) d0v3 conf read cf8 0x8001001c bytes 4 offset 0 data 0 > (XEN) d0v3 conf read cf8 0x80010020 bytes 4 offset 0 data 0 > (XEN) d0v3 conf read cf8 0x80010024 bytes 4 offset 0 data 0 > (XEN) d0v3 conf read cf8 0x80010028 bytes 4 offset 0 data 0 > (XEN) d0v3 conf read cf8 0x8001002c bytes 4 offset 0 data 0xe61614c3 > (XEN) d0v3 conf read cf8 0x80010030 bytes 4 offset 0 data 0 > (XEN) d0v3 conf read cf8 0x80010034 bytes 4 offset 0 data 0x80 > (XEN) d0v3 conf read cf8 0x80010038 bytes 4 offset 0 data 0 > (XEN) d0v3 conf read cf8 0x8001003c bytes 4 offset 0 data 0x1ff > (XEN) d0v3 conf read cf8 0x80010080 bytes 4 offset 0 data 0x2e010 > (XEN) d0v3 conf read cf8 0x800100e0 bytes 4 offset 0 data 0x18af805 > (XEN) d0v3 conf read cf8 0x800100f8 bytes 4 offset 0 data 0xc8030001 > > reset: > (XEN) d0v1 conf read cf8 0x800100fc bytes 2 offset 0 data 0x8 > (XEN) d0v1 conf read cf8 0x800100fc bytes 2 offset 0 data 0x8 > (XEN) d0v1 conf read cf8 0x8001008c bytes 4 offset 0 data 0x145dc12 > (XEN) d0v1 conf read cf8 0x8001 bytes 4 offset 0 data 0x61614c3 > (XEN) d0v1 conf read cf8 0x80010004 bytes 4 offset 0 data 0x10 > (XEN) d0v1 conf read cf8 0x80010008 bytes 4 offset 0 data 0x280 > (XEN) d0v1 conf read cf8 0x8001000c bytes 4 offset 0 data 0x10 > (XEN) d0v1 conf read cf8 0x80010010 bytes 4 offset 0 data 0x109c > (XEN) d0v1 conf read cf8 0x80010014 bytes 4 offset 0 data 0x80 > (XEN) d0v1 conf read cf8 0x80010018 bytes 4 offset 0 data 0x90b4 > (XEN) d0v1 conf read cf8 0x8001001c bytes 4 offset 0 data 0 > (XEN) d0v1 conf read cf8 0x80010020 bytes 4 offset 0 data 0 > (XEN) d0v1 conf read cf8 0x80010024 bytes 4 offset 0 data 0 > (XEN) d0v1 conf read cf8 0x80010028 bytes 4 offset 0 data 0 > (XEN) d0v1 conf read cf8 0x8001002c bytes 4 offset 0 data 0xe61614c3 > (XEN) d0v1 conf read cf8 0x80010030 bytes 4 offset 0 data 0 > (XEN) d0v1 conf read cf8 0x80010034 bytes 4 offset 0 data 0x80 > (XEN) d0v1 conf read cf8 0x80010038 bytes 4 offset 0 data 0 > (XEN) d0v1 conf read cf8 0x8001003c bytes 4 offset 0 data 0x1ff > (XEN) d0v1 conf read cf8 0x80010088 bytes 2 offset 0 data 0x2910 > (XEN) d0v1 conf read cf8 0x80010090 bytes 2 offset 0 data 0x1c2 > (XEN) d0v1 conf read cf8 0x800100a8 bytes 2 offset 0 data 0x400 > (XEN) d0v1 conf read cf8 0x800100b0 bytes 2 offset 0 data 0x2 > (XEN) d0v1 conf read cf8 0x80010004 bytes 2 offset 2 data 0x10 > (XEN) d0v1 conf read cf8 0x80010034 bytes 1 offset 0 data 0x80 > (XEN) d0v1 conf read cf8 0x80010080 bytes 2 offset 0 data 0xe010 > (XEN) d0v1 conf read cf8 0x800100e0 bytes 2 offset 0 data 0xf805 > (XEN) d0v1 conf read cf8 0x800100f8 bytes 2 offset 0 data 0x1 > (XEN) d0v1 conf write cf8 0x80010004 bytes 2 offset 0 data 0x400 > (XEN) d0v1 conf read cf8 0x80010088 bytes 2 offset 2 data 0x9 > (XEN) d0v1 conf read cf8 0x80010088 bytes 2 offset 0 data 0x2910 > (XEN) d0v1 conf write cf8 0x80010088 bytes 2 offset 0 data 0xa910 This is the express capability's Link Control 2 Register afaict. As per the copy of the 6.0 spec that I have the top 4 bits have only two defined encodings - 0b and 0b0001. 0b1000, as is being set here, is not defined. Yet then the earlier questions remain: Why has this suddenly become a problem? And why would this depend on how the present session was started, and what was running in the previous session? Is this write perhaps conditional upon something that has changed? Jan > (XEN) d0v2 conf read cf8 0x8001 bytes 4 offset 0 data 0x > (XEN) d0v2 conf read cf8 0x80010090 bytes 2 offset 0 data 0x > (XEN) d0v2 conf write cf8 0x80010090 bytes 2 offset 0 data 0xfffc > (XEN) d0v2 conf write cf8 0x80010090 bytes 2 offset 0 data 0x > (XEN) d0v2 conf write cf8 0x80010088 bytes 2 offset 0 data 0x2910 > (XEN) d0v2 conf write cf8 0x80010090 bytes 2 offset 0 data 0x1c2 > (XEN) d0v2 conf write cf8 0x800100a8 bytes 2 offse
[PATCH v2 0/2] tests/qtest: Make qtest_has_accel() generic
(Series fully reviewed) Since v1: - Use g_strconcat (Akihiko) In preparation of running QTests using HVF on Darwin, make qtest_has_accel() generic. Note, this also allow running other accelerators such Xen, WHPX, ... Philippe Mathieu-Daudé (2): tests/qtest: Extract qtest_qom_has_concrete_type() helper tests/qtest: Make qtest_has_accel() generic tests/qtest/libqtest.c | 110 +++-- 1 file changed, 61 insertions(+), 49 deletions(-) -- 2.47.1
[PATCH v2 2/2] tests/qtest: Make qtest_has_accel() generic
Since commit b14a0b7469f ("accel: Use QOM classes for accel types") accelerators are registered as QOM objects. Use QOM as a generic API to query for available accelerators. This is in particular useful to query hardware accelerators such HFV, Xen or WHPX which otherwise have their definitions poisoned in "exec/poison.h". Reviewed-by: Thomas Huth Signed-off-by: Philippe Mathieu-Daudé --- tests/qtest/libqtest.c | 21 ++--- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/tests/qtest/libqtest.c b/tests/qtest/libqtest.c index 7e9366ad6d5..a55ac57ff7e 100644 --- a/tests/qtest/libqtest.c +++ b/tests/qtest/libqtest.c @@ -30,6 +30,7 @@ #include "libqtest.h" #include "libqmp.h" +#include "qemu/accel.h" #include "qemu/ctype.h" #include "qemu/cutils.h" #include "qemu/sockets.h" @@ -1030,13 +1031,10 @@ static bool qtest_qom_has_concrete_type(const char *parent_typename, bool qtest_has_accel(const char *accel_name) { -if (g_str_equal(accel_name, "tcg")) { -#if defined(CONFIG_TCG) -return true; -#else -return false; -#endif -} else if (g_str_equal(accel_name, "kvm")) { +static QList *list; +g_autofree char *accel_type = NULL; + +if (g_str_equal(accel_name, "kvm")) { int i; const char *arch = qtest_get_arch(); const char *targets[] = { CONFIG_KVM_TARGETS }; @@ -1048,11 +1046,12 @@ bool qtest_has_accel(const char *accel_name) } } } -} else { -/* not implemented */ -g_assert_not_reached(); +return false; } -return false; + +accel_type = g_strconcat(accel_name, ACCEL_CLASS_SUFFIX, NULL); + +return qtest_qom_has_concrete_type("accel", accel_type, &list); } bool qtest_get_irq(QTestState *s, int num) -- 2.47.1
[PATCH v2 1/2] tests/qtest: Extract qtest_qom_has_concrete_type() helper
Extract qtest_qom_has_concrete_type() out of qtest_has_device() in order to re-use it in the following commit. Signed-off-by: Philippe Mathieu-Daudé Reviewed-by: Thomas Huth --- tests/qtest/libqtest.c | 89 -- 1 file changed, 51 insertions(+), 38 deletions(-) diff --git a/tests/qtest/libqtest.c b/tests/qtest/libqtest.c index a1e105f27f9..7e9366ad6d5 100644 --- a/tests/qtest/libqtest.c +++ b/tests/qtest/libqtest.c @@ -978,6 +978,56 @@ const char *qtest_get_arch(void) return end + 1; } +static bool qtest_qom_has_concrete_type(const char *parent_typename, +const char *child_typename, +QList **cached_list) +{ +QList *list = cached_list ? *cached_list : NULL; +const QListEntry *p; +QObject *qobj; +QString *qstr; +QDict *devinfo; +int idx; + +if (!list) { +QDict *resp; +QDict *args; +QTestState *qts = qtest_init("-machine none"); + +args = qdict_new(); +qdict_put_bool(args, "abstract", false); +qdict_put_str(args, "implements", parent_typename); + +resp = qtest_qmp(qts, "{'execute': 'qom-list-types', 'arguments': %p }", + args); +g_assert(qdict_haskey(resp, "return")); +list = qdict_get_qlist(resp, "return"); +qobject_ref(list); +qobject_unref(resp); + +qtest_quit(qts); + +if (cached_list) { +*cached_list = list; +} +} + +for (p = qlist_first(list), idx = 0; p; p = qlist_next(p), idx++) { +devinfo = qobject_to(QDict, qlist_entry_obj(p)); +g_assert(devinfo); + +qobj = qdict_get(devinfo, "name"); +g_assert(qobj); +qstr = qobject_to(QString, qobj); +g_assert(qstr); +if (g_str_equal(qstring_get_str(qstr), child_typename)) { +return true; +} +} + +return false; +} + bool qtest_has_accel(const char *accel_name) { if (g_str_equal(accel_name, "tcg")) { @@ -1757,45 +1807,8 @@ bool qtest_has_machine(const char *machine) bool qtest_has_device(const char *device) { static QList *list; -const QListEntry *p; -QObject *qobj; -QString *qstr; -QDict *devinfo; -int idx; -if (!list) { -QDict *resp; -QDict *args; -QTestState *qts = qtest_init("-machine none"); - -args = qdict_new(); -qdict_put_bool(args, "abstract", false); -qdict_put_str(args, "implements", "device"); - -resp = qtest_qmp(qts, "{'execute': 'qom-list-types', 'arguments': %p }", - args); -g_assert(qdict_haskey(resp, "return")); -list = qdict_get_qlist(resp, "return"); -qobject_ref(list); -qobject_unref(resp); - -qtest_quit(qts); -} - -for (p = qlist_first(list), idx = 0; p; p = qlist_next(p), idx++) { -devinfo = qobject_to(QDict, qlist_entry_obj(p)); -g_assert(devinfo); - -qobj = qdict_get(devinfo, "name"); -g_assert(qobj); -qstr = qobject_to(QString, qobj); -g_assert(qstr); -if (g_str_equal(qstring_get_str(qstr), device)) { -return true; -} -} - -return false; +return qtest_qom_has_concrete_type("device", device, &list); } /* -- 2.47.1