On 03/01/2025 12:42 am, Marek Marczykowski-Górecki wrote:
> On Fri, Jan 03, 2025 at 01:18:31AM +0100, Marek Marczykowski-Górecki wrote:
>> On Thu, Jan 02, 2025 at 08:39:16PM +0100, Marek Marczykowski-Górecki wrote:
>>> On Thu, Jan 02, 2025 at 08:17:00PM +0100, Jürgen Groß wrote:
>>>> On 02.01.25 19:54, Marek Marczykowski-Górecki wrote:
>>>>> On Thu, Jan 02, 2025 at 01:24:21PM +0100, Marek Marczykowski-Górecki wrote:
>>>>>> On Thu, Jan 02, 2025 at 12:30:10PM +0100, Juergen Gross wrote:
>>>>>>> On 02.01.25 11:20, Jürgen Groß wrote:
>>>>>>>> On 19.12.24 17:14, Marek Marczykowski-Górecki wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> It crashes on boot like below, most of the times. But sometimes
>>>>>>>>> (rarely) it manages to stay alive. Below I'm pasting few of the
>>>>>>>>> crashes that look distinctly different, if you follow the links,
>>>>>>>>> you can find more of them. IMHO it looks like some memory
>>>>>>>>> corruption bug somewhere. I tested also Linux 6.13-rc2 before,
>>>>>>>>> and it had very similar issue.
>>>>>>>> ...
>>>>>>>>
>>>>>>>>> Full log:
>>>>>>>>> https://openqa.qubes-os.org/tests/122879/logfile?filename=serial0.txt
>>>>>>>> I can reproduce a crash with 6.13-rc5 PV dom0.
>>>>>>>>
>>>>>>>> What is really interesting in the logs: most crashes seem to
>>>>>>>> happen right after a module being loaded (in my reproducer it was
>>>>>>>> right after loading the first module).
>>>>>>>>
>>>>>>>> I need to go through the 6.13 commits, but I think I remember
>>>>>>>> having seen a patch optimizing module loading by using large
>>>>>>>> pages for addressing the loaded modules. Maybe the case of no
>>>>>>>> large pages being available isn't handled properly.
>>>>>>> Seems I was right.
>>>>>>>
>>>>>>> For me the following diff fixes the issue. Marek, can you please
>>>>>>> confirm it fixes your crashes, too?
>>>>>> Thanks for looking into it!
>>>>>> Will do, I've pushed it to
>>>>>> https://github.com/QubesOS/qubes-linux-kernel/pull/662, CI will
>>>>>> build it and then I'll post it to openQA.
>>>>> It is much better!
>>>>>
>>>>> Tests are still running, but I already see that many are green.
>>>> So are you fine with me adding your "Tested-by:"?
>>> Yes.
>>>
>>>>> There is one issue (likely unrelated to this change) - sys-usb
>>>>> (HVM domU with USB controllers passed through) crashes on a system
>>>>> with Raptor Lake CPU (only, others, including ADL and MTL look
>>>>> fine):
>> Correction, it does happen on some others too, just got the crash on
>> the ADL system, although looks a bit different ("Corrupted page table
>> at ..."):
> I've collected some more of them at
> https://github.com/QubesOS/qubes-issues/issues/9681
>
> Should I start new thread for this? On one hand, it's a different
> domain type (HVM), but on the other hand, many of the crashes are
> around loading modules too.
https://lore.kernel.org/lkml/20241227072825.1288491-1-r...@kernel.org/T/#t
looks relevant. Probably worth following up.

~Andrew