On Wed, Feb 8, 2023 at 2:54 PM Jason A. Donenfeld <ja...@zx2c4.com> wrote: > > Hi Nathan (and MST), > > On Wed, Feb 8, 2023 at 2:45 PM Nathan Chancellor <nat...@kernel.org> wrote: > > > > Hi Jason, > > > > On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote: > > > The setup_data links are appended to the compressed kernel image. Since > > > the kernel image is typically loaded at 0x100000, setup_data lives at > > > `0x100000 + compressed_size`, which does not get relocated during the > > > kernel's boot process. > > > > > > The kernel typically decompresses the image starting at address > > > 0x1000000 (note: there's one more zero there than the compressed image > > > above). This usually is fine for most kernels. > > > > > > However, if the compressed image is actually quite large, then > > > setup_data will live at a `0x100000 + compressed_size` that extends into > > > the decompressed zone at 0x1000000. In other words, if compressed_size > > > is larger than `0x1000000 - 0x100000`, then the decompression step will > > > clobber setup_data, resulting in crashes. > > > > > > Visually, what happens now is that QEMU appends setup_data to the kernel > > > image: > > > > > > kernel image setup_data > > > |--------------------------||----------------| > > > 0x100000 0x100000+l1 0x100000+l1+l2 > > > > > > The problem is that this decompresses to 0x1000000 (one more zero). So > > > if l1 is > (0x1000000-0x100000), then this winds up looking like: > > > > > > kernel image setup_data > > > |--------------------------||----------------| > > > 0x100000 0x100000+l1 0x100000+l1+l2 > > > > > > d e c o m p r e s s e d k e r n e l > > > > > > |-------------------------------------------------------------| > > > 0x1000000 > > > 0x1000000+l3 > > > > > > The decompressed kernel seemingly overwriting the compressed kernel > > > image isn't a problem, because that gets relocated to a higher address > > > early on in the boot process, at the end of startup_64. setup_data, > > > however, stays in the same place, since those links are self referential > > > and nothing fixes them up. So the decompressed kernel clobbers it. > > > > > > Fix this by appending setup_data to the cmdline blob rather than the > > > kernel image blob, which remains at a lower address that won't get > > > clobbered. > > > > > > This could have been done by overwriting the initrd blob instead, but > > > that poses big difficulties, such as no longer being able to use memory > > > mapped files for initrd, hurting performance, and, more importantly, the > > > initrd address calculation is hard coded in qboot, and it always grows > > > down rather than up, which means lots of brittle semantics would have to > > > be changed around, incurring more complexity. In contrast, using cmdline > > > is simple and doesn't interfere with anything. > > > > > > The microvm machine has a gross hack where it fiddles with fw_cfg data > > > after the fact. So this hack is updated to account for this appending, > > > by reserving some bytes. > > > > > > Cc: x...@kernel.org > > > Cc: Philippe Mathieu-Daudé <phi...@linaro.org> > > > Cc: H. Peter Anvin <h...@zytor.com> > > > Cc: Borislav Petkov <b...@alien8.de> > > > Cc: Eric Biggers <ebigg...@kernel.org> > > > Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> > > > > I apologize if this has already been reported/fixed already (I did a > > brief search on lore.kernel.org) or if my terminology is not as precise > > as it could be, I am a little out of my element here :) > > > > After this change as commit eac7a7791b ("x86: don't let decompressed > > kernel image clobber setup_data") in QEMU master, I am no longer able to > > boot a kernel directly through OVMF using '-append' + '-initrd' + > > '-kernel'. Instead, systemd-boot tries to start the distribution's > > kernel, which fails with: > > > > Error registering initrd: Already started > > > > This can be reproduced with just a defconfig Linux kernel (I used > > 6.2-rc7), the simple buildroot images that ClangBuiltLinux uses for > > boot testing [1], and OVMF. Prior to this change, the kernel would start > > up but after, I am dumped into the UEFI shell (as there is no > > bootloader). > > > > The QEMU command I used was: > > > > $ qemu-system-x86_64 \ > > -kernel arch/x86_64/boot/bzImage \ > > -append "console=ttyS0 earlycon=uart8250,io,0x3f8" \ > > -drive > > if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on > > -drive > > if=pflash,format=raw,file=../boot-utils/images/x86_64/OVMF_VARS.fd \ > > Oh no... Without jumping into it, at first glance, I have absolutely > no idea. I suppose I could start debugging it and probably come up > with a solution, but... > > @mst - I'm beginning to think that this whole setup_data route is > cursed. This is accumulating hacks within hacks within hacks. What > would you think if I just send a patch *removing* all use of > setup_data (the rng seed and the dtb thing), and then we can gradually > add that back with an actual overarching design. For example, it'd > probably make sense to have a separate fwcfg file for setup_data > rather than trying to mangle and existing one, etc. This way, we > unbreak the tree, and let the new approach be reviewed more > reasonably.
Sent as https://lore.kernel.org/qemu-devel/20230208180835.234638-1-ja...@zx2c4.com/