Hi Justin,

On 3/20/23 16:47, Justin Churchey wrote:
> Hello Laszlo,
>
> Thank you for the rundown. I enabled the
> additional LIBGUESTFS_BACKEND_SETTINGS, and I have attached a follow up
> to the libguestfs-test-tool output.

Your computer has faulty RAM.

Your libguestfs-test-tool log file contains the following line (read it
very carefully):

  LIBGUESTFS_BANKEND_SETTINGS=force_tcg

I was staring my eyes out at your log, not understanding why the
"force_tcg" setting wouldn't take effect -- because it didn't, the log
file confirms the repeated test run still used KVM.

That was when I copied and pasted the above line (before the equal sign)
into a git-grep, and then a "git log -S". It turns out that "whatever"
variable name was captured in the libguestfs-test-tool log, libguestfs
never checks that variable, worse, libguestfs has *never* checked it
over its entire history.

So then I thought, "aha, Justin must have typed the variable name from
memory, instead of using the clipboard". But that's not possible: even
if you mistyped the variable name when setting the environment,
libguestfs-test-tool would not look for that (misnamed) variable, and
log it!

So the only explanation is that your RAM is faulty; a single character
in the variable name got corrupted in this instance (C -> N):

  LIBGUESTFS_BACKEND_SETTINGS
  LIBGUESTFS_BANKEND_SETTINGS
               ^

With faulty RAM, there's nothing more to investigate here; the guest
kernel crash (page fault) can be trivially explained by a pointer
getting corrupted and pointing into outer space.

I suggest running MemTest86 or MemTest86+.

(NB, faulty RAM is not as infrequent as one would think. In my life, if
I count right, this is actually the third occasion that I've determined
faulty RAM for a user -- not necessarily via the same program /
misbehavior, of course. Also I think a faulty disk is much less likely:
non-ECC RAM exists, but disks without redundancy checks don't /
shouldn't exist, as far as I know.)

Laszlo

>
> I also checked out my CPU settings (cat /proc/cpuinfo output attached),
> and the host does appear to support PCLMULQDQ (AMD Ryzen 7 5700X).
> I also checked the cpuinfo in one of the guests I have created (Ubuntu
> 18.04, unstable due to intermittent kernel panics), and the cpuinfo
> indicates that this feature seems to be passed down to my guest as well.
>
> I noticed that the libguestfs-test-tool didn't seem to like the qemu
> settings it tried to boot with. So, I went back to basics and built a
> disk using qemu-img (qcow2) and utilized qemu-system-x86_64 to do the
> base install (Ubuntu 18.04). The resulting image boots and I import the
> resulting image with virt-install. However, the GUI/console seems to
> want to lock up shortly after boot if I am using virt-tools. The guest
> seems more stable when I boot it directly with `qemu-system,` and this
> may be my workaround for now.
>
> In virt-tools, I can consistently get a panic on the guest by trying to
> enable the qemu-guest-agent: `systemctl enable qemu-guest-agent.`
> Unfortunately, I cannot get the full output from that panic (attached).
> It would seem that this problem is more than just libguestfs-tools. Is
> there a KVM listserv that this might be more appropriate for?
>
> Sincerely,
>
> On Mon, Mar 20, 2023 at 1:31 AM Laszlo Ersek <ler...@redhat.com> wrote:
>
> On 3/17/23 16:10, Justin Churchey wrote:
> > Hello Everyone,
> >
> > I was having some difficulties converting OVA images yesterday. At
> > first, I thought it may have been a compatibility issue with
> > VirtualBox 7.0. However, when I went to run libguestfs-test-tool, it
> > began failing with the exact same error as the conversions, which
> > leads me to believe the issue may lie with libguestfs and not the
> > images themselves.
> >
> > To test further, I created a fresh install of Ubuntu 22.04, and the
> > libguestfs-test-tool seems to fail with the same error, even on a
> > fresh install. I am attaching the libguestfs-test-tool output for
> > reference.
> >
> > Ubuntu 22.04 is running libguestfs-tools 1.46.2-10ubuntu3
> >
> > If anybody has any insight into the issue, or if you feel a bug report
> > needs to be filed, please let me know.
>
> Your appliance kernel crashes.
>
> Here's my theory on why this might happen, based on your log.
>
> The guestfish appliance runs with KVM acceleration.
>
> The crash happens after/while inserting the modules crc32-pclmul.ko and
> crct10dif-pclmul.ko.
>
> The "pclmul" in the names of those modules indicates that these modules
> calculate various (crc32) checksums with the PCLMULQDQ instruction. I
> believe that PCLMULQDQ is an advanced / accelerated instruction and not
> all CPUs may support it.
>
> Your appliance guest is started with "-cpu max" on the QEMU command line
> (from libguestfs commit 30f74f38bd6e, "appliance: Use -cpu max.",
> 2021-01-28). This is probably why the appliance kernel thinks PCLMULQDQ
> is available.
>
> I think the PCLMULQDQ instruction may cause an issue here. I don't know
> why it misbehaves under KVM, but that's my suspicion anyway.
>
> Note that the kernel crash log provides the following instruction
> (assembly binary) dump:
>
> 46 70 48 8b 56 68 48 03 97 90 01 00 00 48 c1 e0 06 48 03 46 20 48 89 97
> 08 02 00 00 48 be ab aa aa aa aa aa aa aa 48 8b 48 10 <48> 89 0a 48 8b
> 50 20 48 8b 8f 08 02 00 00 48 89 d0 48 f7 e6 48 c1
>
> with the instruction starting at <48> causing the page fault, as the
> direct symptom.
> Now, we can disassemble this:
>
> printf \
>   '%b' \
>   '\x46\x70\x48\x8b\x56\x68\x48\x03\x97\x90\x01\x00\x00\x48\xc1\xe0\x06\x48\x03\x46\x20\x48\x89\x97\x08\x02\x00\x00\x48\xbe\xab\xaa\xaa\xaa\xaa\xaa\xaa\xaa\x48\x8b\x48\x10\x48\x89\x0a\x48\x8b\x50\x20\x48\x8b\x8f\x08\x02\x00\x00\x48\x89\xd0\x48\xf7\xe6\x48\xc1' \
>   > bin
>
> $ ndisasm -b64 bin
>
> 00000000  467048            jo 0x4b
> 00000003  8B5668            mov edx,[rsi+0x68]
> 00000006  48039790010000    add rdx,[rdi+0x190]
> 0000000D  48C1E006          shl rax,byte 0x6
> 00000011  48034620          add rax,[rsi+0x20]
> 00000015  48899708020000    mov [rdi+0x208],rdx
> 0000001C  48BEABAAAAAAAAAA  mov rsi,0xaaaaaaaaaaaaaaab
>          -AAAA
> 00000026  488B4810          mov rcx,[rax+0x10]
> 0000002A  48890A            mov [rdx],rcx  <----------- crash
> 0000002D  488B5020          mov rdx,[rax+0x20]
> 00000031  488B8F08020000    mov rcx,[rdi+0x208]
> 00000038  4889D0            mov rax,rdx
> 0000003B  48F7E6            mul rsi
> 0000003E  48                rex.w
> 0000003F  C1                db 0xc1
>
> Note the constant 0xaaaaaaaaaaaaaaab; that seems very special. We can
> search the kernel tree for it (I'm not bothering about checking out the
> particular ubuntu kernel version for now):
>
> $ git grep -i aaaaaaaaaaaaaaab
> arch/x86/math-emu/poly_atan.c:/* 0xaaaaaaaaaaaaaaabLL, transferred to fixedpterm[] */
> arch/x86/math-emu/poly_sin.c: 0xaaaaaaaaaaaaaaabLL,
> arch/x86/math-emu/poly_tan.c:static const unsigned long long twothirds = 0xaaaaaaaaaaaaaaabLL;
>
> In particular, the last file (poly_tan.c) contains a snippet like
>
> mul64_Xsig(&accum, &twothirds);
>
> which seems vaguely related to
>
> 0000001C  48BEABAAAAAAAAAA  mov rsi,0xaaaaaaaaaaaaaaab
>          -AAAA
> ...
> 0000003B  48F7E6            mul rsi
>
> Now this does not seem connected to PCLMULQDQ, but it does somehow look
> connected to multiplication.
>
> I don't really know where to go with this, except for asking KVM
> experts.
>
> For now, can you try:
>
> export LIBGUESTFS_BACKEND_SETTINGS=force_tcg
>
> from <https://libguestfs.org/guestfs.3.html#backend-settings>, and see
> if that makes a difference?
>
> Laszlo
>

_______________________________________________
Libguestfs mailing list
Libguestfs@redhat.com
https://listman.redhat.com/mailman/listinfo/libguestfs
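
The name comparison described at the top of this thread can be reproduced with a small shell sketch. It only compares the variable name suggested in the earlier message with the name that appeared in the log line; `first_diff` is a hypothetical helper written for this illustration and is not part of libguestfs or libguestfs-test-tool.

```shell
#!/bin/sh
# first_diff A B: print "<1-based position> <char in A> <char in B>" for the
# first position at which the two strings differ. Returns 1 (and prints
# nothing) when the strings are identical.
# NOTE: hypothetical helper for illustration only, not a libguestfs tool.
first_diff() {
    a=$1
    b=$2
    i=1
    while [ "$i" -le "${#a}" ] || [ "$i" -le "${#b}" ]; do
        # Extract the i-th character of each string.
        ca=$(printf '%s' "$a" | cut -c"$i")
        cb=$(printf '%s' "$b" | cut -c"$i")
        if [ "$ca" != "$cb" ]; then
            printf '%s %s %s\n' "$i" "$ca" "$cb"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}

# Name from the suggested "export" command vs. name seen in the log line.
first_diff LIBGUESTFS_BACKEND_SETTINGS LIBGUESTFS_BANKEND_SETTINGS
```

For the two names quoted in this thread, the sketch prints `14 C N`, which is the position marked by the caret above.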