I am unable to collect logs because when this occurs, the system does not boot.
** Changed in: linux (Ubuntu)
       Status: Incomplete => Confirmed

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2096860

Title:
  lvl 5 pagetable system hang

Status in linux package in Ubuntu:
  Confirmed
Status in linux-hwe-6.8 package in Ubuntu:
  New

Bug description:
  A hang occurs, with a possible kernel BUG at arch/x86/mm/init_64.c:154,
  during the memmap_init_zone_device initialization call in the AMDGPU
  init sequence.

  The log below shows the expected (good) sequence that follows the
  [drm] JPEG decode line: memmap_init_zone_device should execute,
  followed by the amdgpum HMM registration. In a failing boot, the
  kernel BUG happens at this point in the sequence.

  =========================
  Aug 09 00:07:09.659512 host-ruby-942e kernel: [drm] JPEG decode initialized successfully.
  Aug 09 00:07:09.659521 host-ruby-942e kernel: memmap_init_zone_device initialised 16777216 pages in 136ms
  Aug 09 00:07:09.659531 host-ruby-942e kernel: amdgpum HMM registered 65520MB device memory
  Aug 09 00:07:09.659694 host-ruby-942e kernel: kfd kfd: amdgpu: Allocated 3989536 bytes on gart
  Aug 09 00:07:09.659838 host-ruby-942e kernel: kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
  Aug 09 00:07:09.659849 host-ruby-942e kernel: amdgpu: Virtual CRAT table created for GPU
  Aug 09 00:07:09.659858 host-ruby-942e kernel: amdgpu: Topology: Add dGPU node [0x740f:0x1002]
  Aug 09 00:07:09.659985 host-ruby-942e kernel: kfd kfd: amdgpu: added device 1002:740f
  ====================

  The issue is a timing-related race condition when setting up the CPU
  page tables during AMDGPU driver initialization. Given the 5-level
  page table error, it may fall under Linux memory management.

  The issue occurs during a server reboot stress test. The server
  environment should have at least 1 x AMD MI210 GPU with the amdgpu
  driver installed and enabled. Use ipmitool to drive a chassis cold
  boot in a loop with the loop count set to 1000.
  We are able to reliably reproduce this issue beyond 500 boot cycles.
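
For anyone trying to reproduce this, the cold boot loop described in the bug could be driven roughly as follows. This is only a sketch, not the reporter's actual script: the BMC address, credentials, SSH port check, and timing values are all placeholder assumptions, and the helper names (build_power_cycle_cmd, run_cold_boot_loop, wait_for_ssh) are made up for illustration. The only ipmitool invocation assumed is `chassis power cycle` over the lanplus interface.

```python
import socket
import subprocess
import time
from typing import Callable, List


def build_power_cycle_cmd(bmc_host: str, user: str, password: str) -> List[str]:
    """Build the ipmitool command that power-cycles the chassis via the BMC."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user,
            "-P", password, "chassis", "power", "cycle"]


def wait_for_ssh(host: str, settle_s: float = 60.0, timeout_s: float = 900.0,
                 poll_s: float = 10.0) -> bool:
    """Assumed boot check: return True once TCP port 22 on the host accepts
    a connection, False if it does not within timeout_s."""
    time.sleep(settle_s)  # give the machine time to actually go down first
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, 22), timeout=5):
                return True
        except OSError:
            time.sleep(poll_s)
    return False


def run_cold_boot_loop(cycles: int,
                       power_cycle: Callable[[], None],
                       booted: Callable[[], bool]) -> int:
    """Power-cycle up to `cycles` times.

    Returns the 1-based cycle number at which the system failed to come
    back, or 0 if every cycle booted.
    """
    for cycle in range(1, cycles + 1):
        power_cycle()
        if not booted():
            return cycle
    return 0


# Example wiring (placeholder addresses and credentials, not from the bug):
#
# cmd = build_power_cycle_cmd("bmc.example", "admin", "secret")
# failed_at = run_cold_boot_loop(
#     1000,
#     power_cycle=lambda: subprocess.run(cmd, check=True),
#     booted=lambda: wait_for_ssh("server.example"),
# )
# print("hang at cycle", failed_at if failed_at else "none")
```

Stopping at the first cycle that fails to boot leaves the machine in the hung state for inspection, which matters here because no logs can be collected once the hang occurs.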