Thank you for your contribution! Sashiko AI review found 8 potential issue(s) to consider:
New issues: - [High] Refactoring of scratch_size_update() skips CMA alignment for fixed scratch sizes. - [High] Unsigned integer underflow when calculating scratch_size_global for unaligned lowmem sizes. - [High] The generalized radix tree API silently truncates bits from generic 64-bit keys. - [Low] Unnecessary duplication of __kho_preserve_pages_order() logic in kho_preserve_pages(). Pre-existing issues: - [Critical] Heap buffer overflow in radix tree node allocation on 32-bit platforms with 4-byte phys_addr_t. - [High] Sleeping memory allocation (GFP_KERNEL) is performed while holding a mutex required in reclaim paths. - [High] Bitwise shift operations using a 32-bit signed integer literal '1' overflow for large memory preservations. - [High] Lack of bounds checking on reconstructed radix tree keys causes an underflow of 'order', leading to severe memory corruption. -- commit 8b45214cacd5b180aa13f584e115cd95e44f26d2 Author: Pratyush Yadav (Google) <[email protected]> kho: generalize radix tree APIs This patch generalizes the KHO radix tree APIs to track arbitrary keys directly instead of hardcoding page tracking operations (PFN and order). It renames the functions to kho_radix_add_key() and kho_radix_del_key() and adjusts the callback structures accordingly. > diff --git a/kernel/liveupdate/kexec_handover.c > b/kernel/liveupdate/kexec_handover.c > index 4834a809985ab..05a6eb56e1766 100644 > --- a/kernel/liveupdate/kexec_handover.c > +++ b/kernel/liveupdate/kexec_handover.c [ ... ] > @@ -106,17 +106,17 @@ static unsigned long kho_radix_encode_key(phys_addr_t > phys, unsigned int order) > } > > /** > - * kho_radix_decode_key - Decodes a radix key back into a physical address > and order. > + * kho_decode_radix_key - Decodes a radix key back into a physical address > and order. > * @key: The unsigned long key to decode. > * @order: An output parameter, a pointer to an unsigned int where the > decoded > * page order will be stored. > * > - * This function reverses the encoding performed by kho_radix_encode_key(), > + * This function reverses the encoding performed by kho_encode_radix_key(), > * extracting the original physical address and page order from a given key. > * > * Return: The decoded physical address. > */ > -static phys_addr_t kho_radix_decode_key(unsigned long key, unsigned int > *order) > +static phys_addr_t kho_decode_radix_key(unsigned long key, unsigned int > *order) > { > unsigned int order_bit = fls64(key); > phys_addr_t phys; > > /* order_bit is numbered starting at 1 from fls64 */ > *order = KHO_ORDER_0_LOG2 - order_bit + 1; [Severity: High] This isn't a bug introduced by this patch, but does this calculation risk an unsigned integer underflow? If a corrupted tree (for instance, an FDT from a crashing kernel) provides entries at inappropriately high indices, the reconstructed key in __kho_radix_walk_tree() might have bits set above KHO_ORDER_0_LOG2. If fls64(key) returns an order_bit greater than KHO_ORDER_0_LOG2 + 1, this calculation will underflow and produce an enormous order. This value is later used in loops bounded by (1UL << order), which could lead to out-of-bounds iteration. [ ... ] > @@ -144,24 +144,21 @@ static unsigned long kho_radix_get_table_index(unsigned > long key, > } > > /** > - * kho_radix_add_page - Marks a page as preserved in the radix tree. > + * kho_radix_add_key - Add a key to the radix tree. > * @tree: The KHO radix tree. > - * @pfn: The page frame number of the page to preserve. > - * @order: The order of the page. > + * @key: The key to add. > * > - * This function traverses the radix tree based on the key derived from @pfn > - * and @order. It sets the corresponding bit in the leaf bitmap to mark the > - * page for preservation. If intermediate nodes do not exist along the path, > - * they are allocated and added to the tree. > + * This function traverses the radix tree based on the key provided. It sets > the > + * corresponding bit in the leaf bitmap to mark the key as present. If > + * intermediate nodes do not exist along the path, they are allocated and > added > + * to the tree. > * > * Return: 0 on success, or a negative error code on failure. > */ > -int kho_radix_add_page(struct kho_radix_tree *tree, > - unsigned long pfn, unsigned int order) > +int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key) [Severity: High] Does this new arbitrary key API truncate bits from generic 64-bit keys? The KHO radix tree depth and shifting logic appear to be statically sized based on physical address and order bounds. When an arbitrary 64-bit key is passed here, any bits that exceed the maximum addressable depth might be silently ignored (e.g., bits shifted out or lost in the modulo operations inside kho_radix_get_table_index()). Could this cause distinct large 64-bit keys to silently alias to the same path and overwrite each other? > { > /* Newly allocated nodes for error cleanup */ > struct kho_radix_node *intermediate_nodes[KHO_TREE_MAX_DEPTH] = { 0 }; > - unsigned long key = kho_radix_encode_key(PFN_PHYS(pfn), order); > struct kho_radix_node *anchor_node = NULL; [Severity: Critical] This is a pre-existing issue, but can the node allocation later in this function cause an out-of-bounds heap access on 32-bit platforms where sizeof(phys_addr_t) is 4? The struct kho_radix_node array size is based on KHO_TABLE_SIZE_LOG2, which is calculated as ilog2(PAGE_SIZE / sizeof(phys_addr_t)). On 32-bit systems, this evaluates to 1024 elements of u64, requiring 8192 bytes. However, the allocation relies on get_zeroed_page(): new_node = (struct kho_radix_node *)get_zeroed_page(GFP_KERNEL); This only allocates a single 4096-byte page. Could writes to indices >= 512, or walking the tree via __kho_radix_walk_tree(), overflow this allocation? [Severity: High] This is also a pre-existing issue, but can using GFP_KERNEL for the node allocation lead to an ABBA deadlock? This allocation happens while holding tree->lock: kho_radix_add_key() mutex_lock(&tree->lock) ... get_zeroed_page(GFP_KERNEL) Since GFP_KERNEL can sleep and trigger direct memory reclaim, if a page cache folio preserved for KHO is evicted, the filesystem's eviction path could call kho_unpreserve_folio(). That function would attempt to acquire the exact same tree->lock, resulting in a deadlock. Should this allocation be moved outside the lock or downgraded to GFP_NOWAIT? [ ... ] > @@ -484,13 +474,16 @@ static struct page *__init > kho_get_preserved_page(phys_addr_t phys, > return pfn_to_page(pfn); > } > > -static int __init kho_preserved_memory_reserve(phys_addr_t phys, > - unsigned int order) > +static int __init kho_preserved_memory_reserve(unsigned long key) > { > union kho_page_info info; > struct page *page; > + unsigned int order; > + phys_addr_t phys; > u64 sz; > > + phys = kho_decode_radix_key(key, &order); > + > sz = 1 << (order + PAGE_SHIFT); [Severity: High] This is a pre-existing issue, but will this bitwise shift overflow for large memory preservations? Since the literal 1 is a 32-bit signed integer, a shift amount of 31 or greater (which can happen for contiguous preservations of 2GB or more, like hugepages or CMA areas) results in a signed integer overflow. This would produce a negative 32-bit value that is sign-extended into the 64-bit sz variable, leading to incorrect memblock reservations. Does this need to use 1UL or 1ULL? [ ... ] > @@ -618,30 +611,20 @@ early_param("kho_scratch", kho_parse_scratch_size); > > static void __init scratch_size_update(void) > { > - /* > - * If fixed sizes are not provided via command line, calculate them > - * now. > - */ > - if (scratch_scale) { > - phys_addr_t size; > + phys_addr_t size; > > - size = memblock_reserved_kern_size(ARCH_LOW_ADDRESS_LIMIT, > - NUMA_NO_NODE); > - size = size * scratch_scale / 100; > - scratch_size_lowmem = size; > + if (!scratch_scale) > + return; [Severity: High] Does this early return skip CMA alignment for fixed scratch sizes? In the previous implementation, scratch_size_lowmem and scratch_size_global were unconditionally aligned using round_up(..., CMA_MIN_ALIGNMENT_BYTES) at the end of the function, regardless of whether scratch_scale was set. By returning early when scratch_scale is 0 (which occurs when fixed sizes are provided via the command line), the alignment step is bypassed. Will this cause CMA initialization to fail when iterating over partial pageblocks? > > - size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE, > - NUMA_NO_NODE); > - size = size * scratch_scale / 100 - scratch_size_lowmem; > - scratch_size_global = size; > - } > + size = memblock_reserved_kern_size(ARCH_LOW_ADDRESS_LIMIT, > + NUMA_NO_NODE); > + size = size * scratch_scale / 100; > + scratch_size_lowmem = round_up(size, CMA_MIN_ALIGNMENT_BYTES); > > - /* > - * Scratch areas are released as MIGRATE_CMA. Round them up to the right > - * size. > - */ > - scratch_size_lowmem = round_up(scratch_size_lowmem, > CMA_MIN_ALIGNMENT_BYTES); > - scratch_size_global = round_up(scratch_size_global, > CMA_MIN_ALIGNMENT_BYTES); > + size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE, > + NUMA_NO_NODE); > + size = size * scratch_scale / 100 - scratch_size_lowmem; [Severity: High] Can this subtraction result in an unsigned integer underflow? Because the round_up() for scratch_size_lowmem now occurs before the calculation of scratch_size_global, scratch_size_lowmem may be larger than the unaligned size (size * scratch_scale / 100). On a system where all memory is below ARCH_LOW_ADDRESS_LIMIT, both calls to memblock_reserved_kern_size() return the same value. The subtraction will then subtract the rounded-up value from the unaligned value, resulting in a negative number. Since size is a phys_addr_t (unsigned), this will underflow to an enormous positive value and likely lead to an unfulfillable CMA reservation attempt. [ ... ] > @@ -937,9 +922,19 @@ int kho_preserve_pages(struct page *page, unsigned long > nr_pages) > } > > while (pfn < end_pfn) { > - unsigned int order = __kho_preserve_pages_order(pfn, end_pfn); > + unsigned int order = > + min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn)); > + > + /* > + * Make sure all the pages in a single preservation are in the > + * same NUMA node. The restore machinery can not cope with a > + * preservation spanning multiple NUMA nodes. > + */ > + while (pfn_to_nid(pfn) != pfn_to_nid(pfn + (1UL << order) - 1)) > + order--; [Severity: Low] Is there a specific reason to inline and duplicate the logic from __kho_preserve_pages_order() here? The helper function __kho_preserve_pages_order() remains in the file and is still actively used by __kho_unpreserve(). Duplicating this complex order calculation and NUMA node alignment logic increases the risk of the two implementations diverging in the future. -- Sashiko AI review ยท https://sashiko.dev/#/patchset/[email protected]?part=1
