On Wed, May 14, 2025 at 2:40 AM Alexander Graf <g...@amazon.com> wrote:
>
> When booting a new kernel with kexec_file, the kernel picks a target
> location that the kernel should live at, then allocates random pages,
> checks whether any of those pages magically happens to coincide with
> a target address range and if so, uses them for that range.
>
> For every page allocated this way, it then creates a page list that the
> relocation code - code that executes while all CPUs are off and we are
> just about to jump into the new kernel - copies to their final memory
> location. We cannot put them there earlier, because chances are pretty
> good that at least some page in the target range is already in use by
> the currently running Linux environment. Copying happens on a single
> CPU at RAM rate, which takes around 4-50 ms per 100 MiB.
>
> All of this is inefficient and error-prone.
>
> To successfully kexec, we need to quiesce all devices of the outgoing
> kernel so they don't scribble over the new kernel's memory. We have seen
> cases where that does not happen properly (*cough* GIC *cough*) and hence
> the new kernel was corrupted. This started a month-long journey to root
> cause failing kexecs, which eventually turned out to be memory corruption:
> the new kernel was corrupted severely enough that it could not even emit
> output to tell us that it was corrupted. By allocating memory for the
> next kernel from a memory range that is guaranteed to be scribble-free,
> we can boot the next kernel up to a point where it is at least able to
> detect corruption and maybe even stop it before it becomes severe. This
> increases the chance for successful kexecs.
>
> Since kexec got introduced, Linux has gained the CMA framework, which
> can provide physically contiguous memory allocations while keeping that
> memory available for movable allocations when it is not needed for
> contiguous ones. The default CMA allocator is for DMA allocations.
>
> This patch adds logic to the kexec file loader to attempt to place the
> target payload at a location allocated from CMA. If successful, it uses
> that memory range directly instead of creating copy instructions during
> the hot phase. To ensure that there is a safety net in case anything goes
> wrong with the CMA allocation, it also adds a flag for user space to
> force-disable CMA allocations.
>
> Using CMA allocations has two advantages:
>
>   1) Faster by 4-50 ms per 100 MiB. There is no more need to copy in the
>      hot phase.
>   2) More robust. Even if by accident some page is still in use for DMA,
>      the new kernel image will be safe from that access because it resides
>      in a memory region that is considered allocated in the old kernel and
>      has a chance to reinitialize that component.
>
> Signed-off-by: Alexander Graf <g...@amazon.com>
>
> ---
>
> v1 -> v2:
>
>   - Clarify patch description
>   - Move cma pointer out of kexec_segment. That is a sneaky UAPI struct
>     we cannot modify. Fixes the non-kexec_file path
>   - Coding style
>   - Move memset(0) to only clear the remainder
>   - Move kexec_alloc_contig() into kexec_locate_mem_hole(). Makes the
>     code flow easier to read.
>   - Sanitize return values
>
> v2 -> v3:
>
>   - Fix refactoring bug which meant we never exercised the new code path
> ---
>  arch/riscv/kernel/elf_kexec.c |   1 +
>  include/linux/kexec.h         |  10 ++++
>  include/uapi/linux/kexec.h    |   1 +
>  kernel/kexec.c                |   2 +-
>  kernel/kexec_core.c           | 100 +++++++++++++++++++++++++++++++---
>  kernel/kexec_file.c           |  47 +++++++++++++++-
>  kernel/kexec_internal.h       |   2 +-
>  7 files changed, 152 insertions(+), 11 deletions(-)
>
> diff --git a/arch/riscv/kernel/elf_kexec.c b/arch/riscv/kernel/elf_kexec.c
> index e783a72d051f..d81647c98c92 100644
> --- a/arch/riscv/kernel/elf_kexec.c
> +++ b/arch/riscv/kernel/elf_kexec.c
> @@ -109,6 +109,7 @@ static int elf_find_pbase(struct kimage *image, unsigned long kernel_len,
>          kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
>          kbuf.memsz = ALIGN(kernel_len, PAGE_SIZE);
>          kbuf.top_down = false;
> +        kbuf.cma = NULL;
>          ret = arch_kexec_locate_mem_hole(&kbuf);
>          if (!ret) {
>                  *old_pbase = lowest_paddr;
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index c8971861521a..7821b23bd1e9 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -75,6 +75,12 @@ extern note_buf_t __percpu *crash_notes;
>
>  typedef unsigned long kimage_entry_t;
>
> +/*
> + * This is a copy of the UAPI struct kexec_segment and must be identical
> + * to it because it gets copied straight from user space into kernel
> + * memory. Do not modify this structure unless you change the way segments
> + * get ingested from user space.
> + */
>  struct kexec_segment {
>          /*
>           * This pointer can point to user memory if kexec_load() system
> @@ -169,6 +175,7 @@ int kexec_image_post_load_cleanup_default(struct kimage *image);
>   * @buf_min:   The buffer can't be placed below this address.
>   * @buf_max:   The buffer can't be placed above this address.
>   * @top_down:  Allocate from top of memory.
> + * @cma:       CMA page if the buffer is backed by CMA.
>   */
>  struct kexec_buf {
>          struct kimage *image;
> @@ -180,6 +187,7 @@ struct kexec_buf {
>          unsigned long buf_min;
>          unsigned long buf_max;
>          bool top_down;
> +        struct page *cma;
>  };
>
>  int kexec_load_purgatory(struct kimage *image, struct kexec_buf *kbuf);
> @@ -310,6 +318,7 @@ struct kimage {
>
>          unsigned long nr_segments;
>          struct kexec_segment segment[KEXEC_SEGMENT_MAX];
> +        struct page *segment_cma[KEXEC_SEGMENT_MAX];
>
>          struct list_head control_pages;
>          struct list_head dest_pages;
> @@ -331,6 +340,7 @@ struct kimage {
>           */
>          unsigned int hotplug_support:1;
>  #endif
> +        unsigned int no_cma:1;
>
>  #ifdef ARCH_HAS_KIMAGE_ARCH
>          struct kimage_arch arch;
> diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
> index 5ae1741ea8ea..8958ebfcff94 100644
> --- a/include/uapi/linux/kexec.h
> +++ b/include/uapi/linux/kexec.h
> @@ -27,6 +27,7 @@
>  #define KEXEC_FILE_ON_CRASH     0x00000002
>  #define KEXEC_FILE_NO_INITRAMFS 0x00000004
>  #define KEXEC_FILE_DEBUG        0x00000008
> +#define KEXEC_FILE_NO_CMA       0x00000010
>
>  /* These values match the ELF architecture values.
>   * Unless there is a good reason that should continue to be the case.
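
A note for anyone trying this from user space: with the new flag, opting
out of CMA placement should look roughly like the sketch below
(hypothetical snippet; kernel_fd, initrd_fd and cmdline are assumed to be
set up already, and the syscall signature is the existing kexec_file_load
one):

  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/kexec.h>

  /* Load the new kernel, but force the legacy page-list copy path. */
  long ret = syscall(SYS_kexec_file_load, kernel_fd, initrd_fd,
                     cmdline_len, cmdline,
                     (unsigned long)KEXEC_FILE_NO_CMA);

(As usual for kexec_file_load, cmdline_len must count the terminating
NUL byte.)
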
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index a6b3f96bb50c..28008e3d462e 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -152,7 +152,7 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
>                  goto out;
>
>          for (i = 0; i < nr_segments; i++) {
> -                ret = kimage_load_segment(image, &image->segment[i]);
> +                ret = kimage_load_segment(image, i);
>                  if (ret)
>                          goto out;
>          }
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 3e62b944c883..b5b680dd1796 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -40,6 +40,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/objtool.h>
>  #include <linux/kmsg_dump.h>
> +#include <linux/dma-map-ops.h>
>
>  #include <asm/page.h>
>  #include <asm/sections.h>
> @@ -553,6 +554,24 @@ static void kimage_free_entry(kimage_entry_t entry)
>          kimage_free_pages(page);
>  }
>
> +static void kimage_free_cma(struct kimage *image)
> +{
> +        unsigned long i;
> +
> +        for (i = 0; i < image->nr_segments; i++) {
> +                struct page *cma = image->segment_cma[i];
> +                u32 nr_pages = image->segment[i].memsz >> PAGE_SHIFT;
> +
> +                if (!cma)
> +                        continue;
> +
> +                arch_kexec_pre_free_pages(page_address(cma), nr_pages);
> +                dma_release_from_contiguous(NULL, cma, nr_pages);
> +                image->segment_cma[i] = NULL;
> +        }
> +
> +}
> +
>  void kimage_free(struct kimage *image)
>  {
>          kimage_entry_t *ptr, entry;
> @@ -591,6 +610,9 @@ void kimage_free(struct kimage *image)
>          /* Free the kexec control pages... */
>          kimage_free_page_list(&image->control_pages);
>
> +        /* Free CMA allocations */
> +        kimage_free_cma(image);
> +
>          /*
>           * Free up any temporary buffers allocated. This might hit if
>           * error occurred much later after buffer allocation.
> @@ -716,9 +738,69 @@ static struct page *kimage_alloc_page(struct kimage *image,
>          return page;
>  }
>
> -static int kimage_load_normal_segment(struct kimage *image,
> -                                      struct kexec_segment *segment)
> +static int kimage_load_cma_segment(struct kimage *image, int idx)
> +{
> +        struct kexec_segment *segment = &image->segment[idx];
> +        struct page *cma = image->segment_cma[idx];
> +        char *ptr = page_address(cma);
> +        unsigned long maddr;
> +        size_t ubytes, mbytes;
> +        int result = 0;
> +        unsigned char __user *buf = NULL;
> +        unsigned char *kbuf = NULL;
> +
> +        if (image->file_mode)
> +                kbuf = segment->kbuf;
> +        else
> +                buf = segment->buf;
> +        ubytes = segment->bufsz;
> +        mbytes = segment->memsz;
> +        maddr = segment->mem;
> +
> +        /* Then copy from source buffer to the CMA one */
> +        while (mbytes) {
> +                size_t uchunk, mchunk;
> +
> +                ptr += maddr & ~PAGE_MASK;
> +                mchunk = min_t(size_t, mbytes,
> +                               PAGE_SIZE - (maddr & ~PAGE_MASK));
> +                uchunk = min(ubytes, mchunk);
> +
> +                if (uchunk) {
> +                        /* For file based kexec, source pages are in kernel memory */
> +                        if (image->file_mode)
> +                                memcpy(ptr, kbuf, uchunk);
> +                        else
> +                                result = copy_from_user(ptr, buf, uchunk);
> +                        ubytes -= uchunk;
> +                        if (image->file_mode)
> +                                kbuf += uchunk;
> +                        else
> +                                buf += uchunk;
> +                }
> +
> +                if (result) {
> +                        result = -EFAULT;
> +                        goto out;
> +                }
> +
> +                ptr += mchunk;
> +                maddr += mchunk;
> +                mbytes -= mchunk;
> +
> +                cond_resched();
> +        }
> +
> +        /* Clear any remainder */
> +        memset(ptr, 0, mbytes);
> +
> +out:
> +        return result;
> +}
> +
> +static int kimage_load_normal_segment(struct kimage *image, int idx)
>  {
> +        struct kexec_segment *segment = &image->segment[idx];
>          unsigned long maddr;
>          size_t ubytes, mbytes;
>          int result;
> @@ -733,6 +815,9 @@ static int kimage_load_normal_segment(struct kimage *image,
>          mbytes = segment->memsz;
>          maddr = segment->mem;
>
> +        if (image->segment_cma[idx])
> +                return kimage_load_cma_segment(image, idx);
> +
>          result = kimage_set_destination(image, maddr);
>          if (result < 0)
>                  goto out;
> @@ -787,13 +872,13 @@ static int kimage_load_normal_segment(struct kimage *image,
>  }
>
>  #ifdef CONFIG_CRASH_DUMP
> -static int kimage_load_crash_segment(struct kimage *image,
> -                                     struct kexec_segment *segment)
> +static int kimage_load_crash_segment(struct kimage *image, int idx)
>  {
>          /* For crash dumps kernels we simply copy the data from
>           * user space to it's destination.
>           * We do things a page at a time for the sake of kmap.
>           */
> +        struct kexec_segment *segment = &image->segment[idx];
>          unsigned long maddr;
>          size_t ubytes, mbytes;
>          int result;
> @@ -858,18 +943,17 @@ static int kimage_load_crash_segment(struct kimage *image,
>  }
>  #endif
>
> -int kimage_load_segment(struct kimage *image,
> -                        struct kexec_segment *segment)
> +int kimage_load_segment(struct kimage *image, int idx)
>  {
>          int result = -ENOMEM;
>
>          switch (image->type) {
>          case KEXEC_TYPE_DEFAULT:
> -                result = kimage_load_normal_segment(image, segment);
> +                result = kimage_load_normal_segment(image, idx);
>                  break;
>  #ifdef CONFIG_CRASH_DUMP
>          case KEXEC_TYPE_CRASH:
> -                result = kimage_load_crash_segment(image, segment);
> +                result = kimage_load_crash_segment(image, idx);
>                  break;
>  #endif
>          }
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index fba686487e3b..916beae68fb6 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -27,6 +27,7 @@
>  #include <linux/kernel_read_file.h>
>  #include <linux/syscalls.h>
>  #include <linux/vmalloc.h>
> +#include <linux/dma-map-ops.h>
>  #include "kexec_internal.h"
>
>  #ifdef CONFIG_KEXEC_SIG
> @@ -230,6 +231,8 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
>                  ret = 0;
>          }
>
> +        image->no_cma = !!(flags & KEXEC_FILE_NO_CMA);
> +
>          if (cmdline_len) {
>                  image->cmdline_buf = memdup_user(cmdline_ptr, cmdline_len);
>                  if (IS_ERR(image->cmdline_buf)) {
> @@ -406,7 +409,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
>                          i, ksegment->buf, ksegment->bufsz,
>                          ksegment->mem, ksegment->memsz);
>
> -                ret = kimage_load_segment(image, &image->segment[i]);
> +                ret = kimage_load_segment(image, i);
>                  if (ret)
>                          goto out;
>          }
> @@ -632,6 +635,39 @@ static int kexec_walk_resources(struct kexec_buf *kbuf,
>                  return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
>  }
>
> +static int kexec_alloc_contig(struct kexec_buf *kbuf)
> +{
> +        size_t nr_pages = kbuf->memsz >> PAGE_SHIFT;
> +        unsigned long mem;
> +        struct page *p;
> +
> +        /* User space disabled CMA allocations, bail out. */
> +        if (kbuf->image->no_cma)
> +                return -EPERM;
> +
> +        p = dma_alloc_from_contiguous(NULL, nr_pages, get_order(kbuf->buf_align), true);
> +        if (!p)
> +                return -ENOMEM;
> +
> +        pr_debug("allocated %zu DMA pages at 0x%lx", nr_pages, page_to_boot_pfn(p));
> +
> +        mem = page_to_boot_pfn(p) << PAGE_SHIFT;
> +
> +        if (kimage_is_destination_range(kbuf->image, mem, mem + kbuf->memsz)) {
> +                /* Our region is already in use by a statically defined one. Bail out. */
> +                pr_debug("CMA overlaps existing mem: 0x%lx+0x%lx\n", mem, kbuf->memsz);
> +                dma_release_from_contiguous(NULL, p, nr_pages);
> +                return -EBUSY;
> +        }
> +
> +        kbuf->mem = page_to_boot_pfn(p) << PAGE_SHIFT;
> +        kbuf->cma = p;
> +
> +        arch_kexec_post_alloc_pages(page_address(p), (int)nr_pages, 0);
> +
> +        return 0;
> +}
> +
>  /**
>   * kexec_locate_mem_hole - find free memory for the purgatory or the next kernel
>   * @kbuf:   Parameters for the memory search.
> @@ -648,6 +684,13 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
>          if (kbuf->mem != KEXEC_BUF_MEM_UNKNOWN)
>                  return 0;
>
> +        /*
> +         * Try to find a free physically contiguous block of memory first.
> +         * With that, we can avoid any copying at kexec time.
> +         */
> +        if (!kexec_alloc_contig(kbuf))
> +                return 0;
IIUC, kexec_locate_mem_hole() is also used when loading a KEXEC_TYPE_CRASH
kimage, but kexec_alloc_contig() does not skip that case. This can make
kdump fail to load and leaks CMA memory. I ran some tests, listed below:
the free CMA memory shrinks with every kdump-config reload operation.
(A possible fix is sketched at the end of this mail.)

/home/hezhongkun.hzk# kdump-config reload
unloaded kdump kernel.
Creating symlink /var/lib/kdump/vmlinuz.
Creating symlink /var/lib/kdump/initrd.img.
kexec_file_load failed: Cannot assign requested address
failed to load kdump kernel ... failed!

[ 1387.536346] kexec_file: allocated 1817 DMA pages at 0x11b16e
[ 1399.147677] kexec_file: allocated 113 DMA pages at 0x119087
[ 1399.148915] kexec_file: allocated 5 DMA pages at 0x1140f8
[ 1399.150266] kexec_file: allocated 2 DMA pages at 0x1118fd
[ 1399.151474] kexec_file: allocated 8302 DMA pages at 0x11b900

/home/hezhongkun.hzk# cat /proc/meminfo | grep Cma
CmaTotal:        1048576 kB
CmaFree:          679972 kB
/home/hezhongkun.hzk# cat /proc/meminfo | grep Cma
CmaTotal:        1048576 kB
CmaFree:          639016 kB
/home/hezhongkun.hzk# cat /proc/meminfo | grep Cma
CmaTotal:        1048576 kB
CmaFree:          557104 kB

> +
>          if (!IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
>                  ret = kexec_walk_resources(kbuf, locate_mem_hole_callback);
>          else
> @@ -693,6 +736,7 @@ int kexec_add_buffer(struct kexec_buf *kbuf)
>          /* Ensure minimum alignment needed for segments. */
>          kbuf->memsz = ALIGN(kbuf->memsz, PAGE_SIZE);
>          kbuf->buf_align = max(kbuf->buf_align, PAGE_SIZE);
> +        kbuf->cma = NULL;
>
>          /* Walk the RAM ranges and allocate a suitable range for the buffer */
>          ret = arch_kexec_locate_mem_hole(kbuf);
> @@ -705,6 +749,7 @@ int kexec_add_buffer(struct kexec_buf *kbuf)
>          ksegment->bufsz = kbuf->bufsz;
>          ksegment->mem = kbuf->mem;
>          ksegment->memsz = kbuf->memsz;
> +        kbuf->image->segment_cma[kbuf->image->nr_segments] = kbuf->cma;
>          kbuf->image->nr_segments++;
>          return 0;
>  }
> diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
> index d35d9792402d..29e6cebe0c43 100644
> --- a/kernel/kexec_internal.h
> +++ b/kernel/kexec_internal.h
> @@ -10,7 +10,7 @@ struct kimage *do_kimage_alloc_init(void);
>  int sanity_check_segment_list(struct kimage *image);
>  void kimage_free_page_list(struct list_head *list);
>  void kimage_free(struct kimage *image);
> -int kimage_load_segment(struct kimage *image, struct kexec_segment *segment);
> +int kimage_load_segment(struct kimage *image, int idx);
>  void kimage_terminate(struct kimage *image);
>  int kimage_is_destination_range(struct kimage *image,
>                                  unsigned long start, unsigned long end);
> --
> 2.34.1
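
Coming back to the crash-kimage issue above: one way out would be to have
kexec_alloc_contig() refuse to run for crash images, so that the code falls
back to the regular memory hole walk inside the crashkernel reservation.
A rough, untested sketch on top of this patch (the placement and error
code are only suggestions; kbuf->image->type and KEXEC_TYPE_CRASH already
exist in the tree):

  /* At the top of kexec_alloc_contig(), before anything is allocated: */

  /*
   * Crash kernel segments must live inside the reserved crashkernel
   * region, which a CMA allocation can never satisfy. Bail out early
   * so the kdump load neither fails nor leaks the CMA range; any
   * non-zero return makes kexec_locate_mem_hole() fall back to the
   * regular memory hole search.
   */
  if (kbuf->image->type == KEXEC_TYPE_CRASH)
          return -EOPNOTSUPP;

With such a guard, kdump-config reload should again place the kdump kernel
via the crashkernel reservation, and CmaFree should stay constant across
reloads.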