On 13.05.2024 15:40, Elias El Yandouzi wrote:
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1751,6 +1751,22 @@ void asmlinkage __init noreturn __start_xen(unsigned 
> long mbi_p)
>  
>      numa_initmem_init(0, raw_max_page);
>  
> +    /*
> +     * When we do not have a direct map, memory for metadata of heap nodes in
> +     * init_node_heap() is allocated from xenheap, which needs to be mapped 
> and
> +     * unmapped on demand. However, we cannot just take memory from the boot
> +     * allocator to create the PTEs while we are passing memory to the heap
> +     * allocator during end_boot_allocator().
> +     *
> +     * To solve this race, we need to leave early boot before
> +     * end_boot_allocator() so that Xen PTE pages are allocated from the heap
> +     * instead of the boot allocator. We can do this because the metadata for
> +     * the 1st node is statically allocated, and by the time we need memory 
> to
> +     * create mappings for the 2nd node, we already have enough memory in the
> +     * heap allocator in the 1st node.
> +     */
> +    system_state = SYS_STATE_boot;
> +
>      if ( max_page - 1 > virt_to_mfn(HYPERVISOR_VIRT_END - 1) )
>      {
>          unsigned long lo = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
> @@ -1782,8 +1798,6 @@ void asmlinkage __init noreturn __start_xen(unsigned 
> long mbi_p)
>      else
>          end_boot_allocator();
>  
> -    system_state = SYS_STATE_boot;
> -
>      bsp_stack = cpu_alloc_stack(0);
>      if ( !bsp_stack )
>          panic("No memory for BSP stack\n");

I'm pretty wary of this movement, even more so when Arm isn't switched at
the same time. It has (virtually?) always been the case that this state
switch happens _after_ end_boot_allocator(), and I wouldn't be surprised
if there was a dependency on that somewhere. I realize you've been telling
use that at Amazon you've been running with an earlier variant of these
changes for a long time, and you not having hit issues with this is a good
sign. But I'm afraid it's not a proof.

As to possible alternatives - as pointed out by Roger, the comment / patch
description aren't entirely clear as to what exactly needs working around.
One possibility might be to introduce an x86-only boolean controlling from
when on to use the heap allocator for page table allocations, thus
decoupling that from system_state.

Jan

Reply via email to