I would like to add some dmesg logs that were captured in the past when
we tried reproducing the issue at AMD

** Attachment added: "dropbox_error_instance_dmesglog.txt"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2096860/+attachment/5861120/+files/dropbox_error_instance_dmesglog.txt

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2096860

Title:
  lvl 5 pagetable system hang

Status in linux package in Ubuntu:
  Confirmed
Status in linux-hwe-6.8 package in Ubuntu:
  New

Bug description:
  A hang occurs with a possible kernel BUG at arch/x86/mm/init_64.c:154
  during the memmap_init_zone_device initialization call in the AMDGPU
  init sequence.

  When the kernel BUG error occurs, this is the expected good result
  after the [drm] JPEG decode line. memmap_init_zone_device should
  execute, then amdgpum HMM, and this is where the kernel BUG happens.

  =========================

  Aug 09 00:07:09.659512 host-ruby-942e kernel: [drm] JPEG decode initialized 
successfully.
  Aug 09 00:07:09.659521 host-ruby-942e kernel: memmap_init_zone_device 
initialised 16777216 pages in 136ms
  Aug 09 00:07:09.659531 host-ruby-942e kernel: amdgpum HMM registered 65520MB 
device memory
  Aug 09 00:07:09.659694 host-ruby-942e kernel: kfd kfd: amdgpu: Allocated 
3989536 bytes on gart
  Aug 09 00:07:09.659838 host-ruby-942e kernel: kfd kfd: amdgpu: Total number 
of KFD nodes to be created: 1
  Aug 09 00:07:09.659849 host-ruby-942e kernel: amdgpu: Virtual CRAT table 
created for GPU
  Aug 09 00:07:09.659858 host-ruby-942e kernel: amdgpu: Topology: Add dGPU node 
[0x740f:0x1002]
  Aug 09 00:07:09.659985 host-ruby-942e kernel: kfd kfd: amdgpu: added device 
1002:740f
  ====================

  The issue is a timing-related race condition when setting up the CPU
  page tables during the AMDGPU driver initialization. The potential
  issue could fall under Linux memory management for this 5-level page
  table error


  The issue occurs during a server reboot stress. Server environment
  should have at least 1 x AMD MI210 GPU with amd gpu driver installed
  and enabled. Use ipmitool to drive chassis cold boot in a loop with
  loop count set to 1000. We are able to reliably reproduce this issue
  beyond 500 boot cycles.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2096860/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to