Re: [RFC PATCH 01/20] kho: generalize radix tree APIs

sashiko-bot Wed, 27 May 2026 18:23:03 -0700

Thank you for your contribution! Sashiko AI review found 8 potential issue(s) 
to consider:


New issues:
- [High] Refactoring of scratch_size_update() skips CMA alignment for fixed 
scratch sizes.
- [High] Unsigned integer underflow when calculating scratch_size_global for 
unaligned lowmem sizes.
- [High] The generalized radix tree API silently truncates bits from generic 
64-bit keys.
- [Low] Unnecessary duplication of __kho_preserve_pages_order() logic in 
kho_preserve_pages().

Pre-existing issues:
- [Critical] Heap buffer overflow in radix tree node allocation on 32-bit 
platforms with 4-byte phys_addr_t.
- [High] Sleeping memory allocation (GFP_KERNEL) is performed while holding a 
mutex required in reclaim paths.
- [High] Bitwise shift operations using a 32-bit signed integer literal '1' 
overflow for large memory preservations.
- [High] Lack of bounds checking on reconstructed radix tree keys causes an 
underflow of 'order', leading to severe memory corruption.
--

commit 8b45214cacd5b180aa13f584e115cd95e44f26d2
Author: Pratyush Yadav (Google) <[email protected]>

kho: generalize radix tree APIs

This patch generalizes the KHO radix tree APIs to track arbitrary keys directly
instead of hardcoding page tracking operations (PFN and order). It renames the
functions to kho_radix_add_key() and kho_radix_del_key() and adjusts the
callback structures accordingly.

> diff --git a/kernel/liveupdate/kexec_handover.c 
> b/kernel/liveupdate/kexec_handover.c
> index 4834a809985ab..05a6eb56e1766 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
[ ... ]
> @@ -106,17 +106,17 @@ static unsigned long kho_radix_encode_key(phys_addr_t 
> phys, unsigned int order)
>  }
>  
>  /**
> - * kho_radix_decode_key - Decodes a radix key back into a physical address 
> and order.
> + * kho_decode_radix_key - Decodes a radix key back into a physical address 
> and order.
>   * @key: The unsigned long key to decode.
>   * @order: An output parameter, a pointer to an unsigned int where the 
> decoded
>   *         page order will be stored.
>   *
> - * This function reverses the encoding performed by kho_radix_encode_key(),
> + * This function reverses the encoding performed by kho_encode_radix_key(),
>   * extracting the original physical address and page order from a given key.
>   *
>   * Return: The decoded physical address.
>   */
> -static phys_addr_t kho_radix_decode_key(unsigned long key, unsigned int 
> *order)
> +static phys_addr_t kho_decode_radix_key(unsigned long key, unsigned int 
> *order)
>  {
>       unsigned int order_bit = fls64(key);
>       phys_addr_t phys;
>  
>       /* order_bit is numbered starting at 1 from fls64 */
>       *order = KHO_ORDER_0_LOG2 - order_bit + 1;

[Severity: High]
This isn't a bug introduced by this patch, but does this calculation risk an
unsigned integer underflow?

If a corrupted tree (for instance, an FDT from a crashing kernel) provides
entries at inappropriately high indices, the reconstructed key in
__kho_radix_walk_tree() might have bits set above KHO_ORDER_0_LOG2.

If fls64(key) returns an order_bit greater than KHO_ORDER_0_LOG2 + 1, this
calculation will underflow and produce an enormous order. This value is later
used in loops bounded by (1UL << order), which could lead to out-of-bounds
iteration.

[ ... ]
> @@ -144,24 +144,21 @@ static unsigned long kho_radix_get_table_index(unsigned 
> long key,
>  }
>  
>  /**
> - * kho_radix_add_page - Marks a page as preserved in the radix tree.
> + * kho_radix_add_key - Add a key to the radix tree.
>   * @tree: The KHO radix tree.
> - * @pfn: The page frame number of the page to preserve.
> - * @order: The order of the page.
> + * @key: The key to add.
>   *
> - * This function traverses the radix tree based on the key derived from @pfn
> - * and @order. It sets the corresponding bit in the leaf bitmap to mark the
> - * page for preservation. If intermediate nodes do not exist along the path,
> - * they are allocated and added to the tree.
> + * This function traverses the radix tree based on the key provided. It sets 
> the
> + * corresponding bit in the leaf bitmap to mark the key as present. If
> + * intermediate nodes do not exist along the path, they are allocated and 
> added
> + * to the tree.
>   *
>   * Return: 0 on success, or a negative error code on failure.
>   */
> -int kho_radix_add_page(struct kho_radix_tree *tree,
> -                    unsigned long pfn, unsigned int order)
> +int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key)

[Severity: High]
Does this new arbitrary key API truncate bits from generic 64-bit keys?

The KHO radix tree depth and shifting logic appear to be statically sized
based on physical address and order bounds. When an arbitrary 64-bit key
is passed here, any bits that exceed the maximum addressable depth might
be silently ignored (e.g., bits shifted out or lost in the modulo operations
inside kho_radix_get_table_index()).

Could this cause distinct large 64-bit keys to silently alias to the same
path and overwrite each other?
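
As a rough illustration of the concern, the sketch below models a fixed-depth
radix tree that selects a slot per level by shift and mask. The geometry
(DEMO_BITS_PER_LEVEL, DEMO_DEPTH) and all helper names are assumptions made up
for this example and do not reflect the actual KHO layout or the body of
kho_radix_get_table_index():

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry: 4 levels of 9 bits cover only 36 key bits. */
#define DEMO_BITS_PER_LEVEL 9u
#define DEMO_DEPTH 4u

/* Slot selected at a given level, using shift and mask. */
static unsigned int demo_table_index(uint64_t key, unsigned int level)
{
	assert(level < DEMO_DEPTH);
	return (key >> (level * DEMO_BITS_PER_LEVEL)) &
	       ((1u << DEMO_BITS_PER_LEVEL) - 1);
}

/* Returns 1 if both keys select the same slot at every level. */
static int demo_same_path(uint64_t a, uint64_t b)
{
	for (unsigned int level = 0; level < DEMO_DEPTH; level++) {
		if (demo_table_index(a, level) != demo_table_index(b, level))
			return 0;
	}
	return 1;
}

/* One possible guard: reject keys with bits above the covered range. */
static int demo_key_in_range(uint64_t key)
{
	return (key >> (DEMO_BITS_PER_LEVEL * DEMO_DEPTH)) == 0;
}
```

In this model, two keys that differ only in bit 40 walk the same path, so a
range check of this kind at the API boundary would turn silent aliasing into an
explicit error.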

>  {
>       /* Newly allocated nodes for error cleanup */
>       struct kho_radix_node *intermediate_nodes[KHO_TREE_MAX_DEPTH] = { 0 };
> -     unsigned long key = kho_radix_encode_key(PFN_PHYS(pfn), order);
>       struct kho_radix_node *anchor_node = NULL;

[Severity: Critical]
This is a pre-existing issue, but can the node allocation later in this
function cause an out-of-bounds heap access on 32-bit platforms where
sizeof(phys_addr_t) is 4?

The struct kho_radix_node table size is derived from KHO_TABLE_SIZE_LOG2,
which is calculated as ilog2(PAGE_SIZE / sizeof(phys_addr_t)).
On 32-bit systems with 4096-byte pages, this evaluates to 10, giving 1024
elements of u64 that require 8192 bytes. However, the allocation relies on
get_zeroed_page():

        new_node = (struct kho_radix_node *)get_zeroed_page(GFP_KERNEL);

That call allocates only a single 4096-byte page. Could writes to indices
>= 512, or reads of those indices while walking the tree via
__kho_radix_walk_tree(), run past the end of this allocation?
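
The size mismatch can be checked with a small userspace calculation. This is
only a sketch: DEMO_PAGE_SIZE and the helper names are assumptions, and the
u64 slot width is taken from the description above rather than from the kernel
headers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for the example. */
#define DEMO_PAGE_SIZE 4096u

/* Number of table slots when derived from PAGE_SIZE / sizeof(phys_addr_t). */
static size_t demo_table_slots(size_t phys_addr_size)
{
	assert(phys_addr_size > 0);
	return DEMO_PAGE_SIZE / phys_addr_size;
}

/* Bytes needed when every slot is stored as a 64-bit value. */
static size_t demo_node_bytes(size_t phys_addr_size)
{
	return demo_table_slots(phys_addr_size) * sizeof(uint64_t);
}

/* Returns 1 when a node of this shape fits in a single page. */
static int demo_node_fits_in_page(size_t phys_addr_size)
{
	return demo_node_bytes(phys_addr_size) <= DEMO_PAGE_SIZE;
}
```

With an 8-byte phys_addr_t the node is exactly one page, while a 4-byte
phys_addr_t doubles the slot count and the node needs two pages.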

[Severity: High]
This is also a pre-existing issue, but can using GFP_KERNEL for the node
allocation lead to an ABBA deadlock?

This allocation happens while holding tree->lock:

kho_radix_add_key()
    mutex_lock(&tree->lock)
    ...
    get_zeroed_page(GFP_KERNEL)

Since GFP_KERNEL can sleep and trigger direct memory reclaim, if a page cache
folio preserved for KHO is evicted, the filesystem's eviction path could call
kho_unpreserve_folio().

That function would attempt to acquire the exact same tree->lock, resulting
in a deadlock. Should this allocation be moved outside the lock or downgraded
to GFP_NOWAIT?

[ ... ]
> @@ -484,13 +474,16 @@ static struct page *__init 
> kho_get_preserved_page(phys_addr_t phys,
>       return pfn_to_page(pfn);
>  }
>  
> -static int __init kho_preserved_memory_reserve(phys_addr_t phys,
> -                                            unsigned int order)
> +static int __init kho_preserved_memory_reserve(unsigned long key)
>  {
>       union kho_page_info info;
>       struct page *page;
> +     unsigned int order;
> +     phys_addr_t phys;
>       u64 sz;
>  
> +     phys = kho_decode_radix_key(key, &order);
> +
>       sz = 1 << (order + PAGE_SHIFT);

[Severity: High]
This is a pre-existing issue, but will this bitwise shift overflow for large
memory preservations?

Since the literal 1 is a 32-bit signed integer, a shift amount of 31 or
greater (which can happen for contiguous preservations of 2GB or more, like
hugepages or CMA areas) results in a signed integer overflow.

This would produce a negative 32-bit value that is sign-extended into the
64-bit sz variable, leading to incorrect memblock reservations.
Does this need to use 1UL or 1ULL?
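
The following userspace sketch models the effect. Shifting the int literal past
bit 30 is undefined in ISO C, so the model performs the shift on a 32-bit
unsigned value and then reinterprets it as signed, which is how the expression
typically behaves on two's-complement targets (the unsigned-to-signed
conversion is implementation-defined). The helper names are hypothetical, and
the example assumes PAGE_SHIFT is 12, so order 19 gives a shift of 31:

```c
#include <assert.h>
#include <stdint.h>

/* Model of `u64 sz = 1 << shift`: a 32-bit result that is sign-extended
 * when widened to 64 bits. */
static uint64_t demo_size_int_literal(unsigned int shift)
{
	uint32_t narrow;

	assert(shift < 32);
	narrow = UINT32_C(1) << shift;
	return (uint64_t)(int64_t)(int32_t)narrow;
}

/* Same calculation with a 64-bit literal, as `1ULL << shift` would do. */
static uint64_t demo_size_wide(unsigned int shift)
{
	assert(shift < 64);
	return UINT64_C(1) << shift;
}
```

For a shift of 31 the modelled narrow variant yields 0xFFFFFFFF80000000 instead
of 0x80000000, while both variants agree for shifts of 30 and below.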

[ ... ]
> @@ -618,30 +611,20 @@ early_param("kho_scratch", kho_parse_scratch_size);
>  
>  static void __init scratch_size_update(void)
>  {
> -     /*
> -      * If fixed sizes are not provided via command line, calculate them
> -      * now.
> -      */
> -     if (scratch_scale) {
> -             phys_addr_t size;
> +     phys_addr_t size;
>  
> -             size = memblock_reserved_kern_size(ARCH_LOW_ADDRESS_LIMIT,
> -                                                NUMA_NO_NODE);
> -             size = size * scratch_scale / 100;
> -             scratch_size_lowmem = size;
> +     if (!scratch_scale)
> +             return;

[Severity: High]
Does this early return skip CMA alignment for fixed scratch sizes?

In the previous implementation, scratch_size_lowmem and scratch_size_global
were unconditionally aligned using round_up(..., CMA_MIN_ALIGNMENT_BYTES)
at the end of the function, regardless of whether scratch_scale was set.

By returning early when scratch_scale is 0 (which occurs when fixed sizes
are provided via the command line), the alignment step is bypassed.
Will this cause CMA initialization to fail when iterating over partial
pageblocks?

>  
> -             size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE,
> -                                                NUMA_NO_NODE);
> -             size = size * scratch_scale / 100 - scratch_size_lowmem;
> -             scratch_size_global = size;
> -     }
> +     size = memblock_reserved_kern_size(ARCH_LOW_ADDRESS_LIMIT,
> +                                        NUMA_NO_NODE);
> +     size = size * scratch_scale / 100;
> +     scratch_size_lowmem = round_up(size, CMA_MIN_ALIGNMENT_BYTES);
>  
> -     /*
> -      * Scratch areas are released as MIGRATE_CMA. Round them up to the right
> -      * size.
> -      */
> -     scratch_size_lowmem = round_up(scratch_size_lowmem, 
> CMA_MIN_ALIGNMENT_BYTES);
> -     scratch_size_global = round_up(scratch_size_global, 
> CMA_MIN_ALIGNMENT_BYTES);
> +     size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE,
> +                                        NUMA_NO_NODE);
> +     size = size * scratch_scale / 100 - scratch_size_lowmem;

[Severity: High]
Can this subtraction result in an unsigned integer underflow?

Because the round_up() for scratch_size_lowmem now occurs before the
calculation of scratch_size_global, scratch_size_lowmem may be larger
than the unaligned size (size * scratch_scale / 100).

On a system where all memory is below ARCH_LOW_ADDRESS_LIMIT, both calls to
memblock_reserved_kern_size() return the same value. Whenever the scaled size
is not already a multiple of CMA_MIN_ALIGNMENT_BYTES, the subtraction then
takes the rounded-up value away from the smaller unaligned value, which is
mathematically negative.

Since size is a phys_addr_t (unsigned), this will underflow to an enormous
positive value and likely lead to an unfulfillable CMA reservation attempt.
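
A minimal userspace sketch of this ordering problem, with one possible guard,
is shown here. DEMO_CMA_ALIGN (4 MiB) is an assumed alignment, uint64_t stands
in for phys_addr_t, and the helper names are made up for the example:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical alignment for illustration: 4 MiB. */
#define DEMO_CMA_ALIGN (UINT64_C(4) << 20)

static uint64_t demo_round_up(uint64_t value, uint64_t align)
{
	assert(align > 0);
	return (value + align - 1) / align * align;
}

/* Order of operations in the new hunk: lowmem is rounded up first, then
 * subtracted from the unaligned scaled total. */
static uint64_t demo_global_new(uint64_t reserved_low, uint64_t reserved_all,
				unsigned int scale)
{
	uint64_t lowmem = demo_round_up(reserved_low * scale / 100,
					DEMO_CMA_ALIGN);

	return reserved_all * scale / 100 - lowmem;
}

/* One possible guard: clamp the difference at zero. */
static uint64_t demo_global_clamped(uint64_t reserved_low,
				    uint64_t reserved_all, unsigned int scale)
{
	uint64_t lowmem = demo_round_up(reserved_low * scale / 100,
					DEMO_CMA_ALIGN);
	uint64_t total = reserved_all * scale / 100;

	return total > lowmem ? total - lowmem : 0;
}
```

With 11 MiB reserved and all of it in lowmem, the unguarded variant wraps to
2^64 - 1 MiB, while the clamped variant returns 0. Subtracting the unaligned
lowmem size and rounding afterwards would be another way to avoid the wrap.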

[ ... ]
> @@ -937,9 +922,19 @@ int kho_preserve_pages(struct page *page, unsigned long 
> nr_pages)
>       }
>  
>       while (pfn < end_pfn) {
> -             unsigned int order = __kho_preserve_pages_order(pfn, end_pfn);
> +             unsigned int order =
> +                     min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
> +
> +             /*
> +              * Make sure all the pages in a single preservation are in the
> +              * same NUMA node. The restore machinery can not cope with a
> +              * preservation spanning multiple NUMA nodes.
> +              */
> +             while (pfn_to_nid(pfn) != pfn_to_nid(pfn + (1UL << order) - 1))
> +                     order--;

[Severity: Low]
Is there a specific reason to inline and duplicate the logic from
__kho_preserve_pages_order() here?

The helper function __kho_preserve_pages_order() remains in the file and
is still actively used by __kho_unpreserve(). Duplicating this complex
order calculation and NUMA node alignment logic increases the risk of the
two implementations diverging in the future.

-- 
Sashiko AI review · 
https://sashiko.dev/#/patchset/[email protected]?part=1
