[Kernel-packages] [Bug 2039368] Re: UBSAN: array-index-out-of-bounds in /build/linux-D15vQj/linux-6.5.0/drivers/md/bcache/bset.c:1098:3

2023-10-16 Thread KonishchevDmitry
I see similar messages, but with an AMD GPU:

```
Oct 16 18:41:04 server kernel: 

Oct 16 18:41:04 server kernel: UBSAN: array-index-out-of-bounds in /build/linux-D15vQj/linux-6.5.0/drivers/gpu/drm/amd/amdgpu/../pm/powerplay/hwmgr/processpptables.c:1249:61
Oct 16 18:41:04 server kernel: index 1 is out of range for type 'ATOM_PPLIB_VCE_Clock_Voltage_Limit_Record [1]'
Oct 16 18:41:04 server kernel: CPU: 3 PID: 128 Comm: (udev-worker) Not tainted 6.5.0-9-generic #9-Ubuntu
Oct 16 18:41:04 server kernel: Hardware name: HPE ProLiant MicroServer Gen10/ProLiant MicroServer Gen10, BIOS 5.12 06/26/2018
Oct 16 18:41:04 server kernel: Call Trace:
Oct 16 18:41:04 server kernel:  <TASK>
Oct 16 18:41:04 server kernel:  dump_stack_lvl+0x48/0x70
Oct 16 18:41:04 server kernel:  dump_stack+0x10/0x20
Oct 16 18:41:04 server kernel:  __ubsan_handle_out_of_bounds+0xc6/0x110
Oct 16 18:41:04 server kernel:  init_clock_voltage_dependency+0x9bb/0xa60 [amdgpu]
Oct 16 18:41:04 server kernel:  pp_tables_initialize+0x116/0x440 [amdgpu]
Oct 16 18:41:04 server kernel:  hwmgr_hw_init+0x7b/0x1e0 [amdgpu]
Oct 16 18:41:04 server kernel:  pp_hw_init+0x16/0x50 [amdgpu]
Oct 16 18:41:04 server kernel:  amdgpu_device_ip_init+0x48e/0x900 [amdgpu]
Oct 16 18:41:04 server kernel:  amdgpu_device_init+0x975/0x1160 [amdgpu]
Oct 16 18:41:04 server kernel:  amdgpu_driver_load_kms+0x1a/0x1c0 [amdgpu]
Oct 16 18:41:04 server kernel:  amdgpu_pci_probe+0x175/0x490 [amdgpu]
Oct 16 18:41:04 server kernel:  local_pci_probe+0x47/0xb0
Oct 16 18:41:04 server kernel:  pci_call_probe+0x55/0x190
Oct 16 18:41:04 server kernel:  pci_device_probe+0x84/0x120
Oct 16 18:41:04 server kernel:  really_probe+0x1c7/0x410
Oct 16 18:41:04 server kernel:  __driver_probe_device+0x8c/0x180
Oct 16 18:41:04 server kernel:  driver_probe_device+0x24/0xd0
Oct 16 18:41:04 server kernel:  __driver_attach+0x10b/0x210
Oct 16 18:41:04 server kernel:  ? __pfx___driver_attach+0x10/0x10
Oct 16 18:41:04 server kernel:  bus_for_each_dev+0x8d/0xf0
Oct 16 18:41:04 server kernel:  driver_attach+0x1e/0x30
Oct 16 18:41:04 server kernel:  bus_add_driver+0x127/0x240
Oct 16 18:41:04 server kernel:  driver_register+0x5e/0x130
Oct 16 18:41:04 server kernel:  ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
Oct 16 18:41:04 server kernel:  __pci_register_driver+0x62/0x70
Oct 16 18:41:04 server kernel:  amdgpu_init+0x69/0xff0 [amdgpu]
Oct 16 18:41:04 server kernel:  do_one_initcall+0x5e/0x340
Oct 16 18:41:04 server kernel:  do_init_module+0x91/0x290
Oct 16 18:41:04 server kernel:  load_module+0xba1/0xcf0
Oct 16 18:41:04 server kernel:  ? vfree+0xff/0x2d0
Oct 16 18:41:04 server kernel:  init_module_from_file+0x96/0x100
Oct 16 18:41:04 server kernel:  ? init_module_from_file+0x96/0x100
Oct 16 18:41:04 server kernel:  idempotent_init_module+0x11c/0x2b0
Oct 16 18:41:04 server kernel:  __x64_sys_finit_module+0x64/0xd0
Oct 16 18:41:04 server kernel:  do_syscall_64+0x5c/0x90
Oct 16 18:41:04 server kernel:  ? syscall_exit_to_user_mode+0x37/0x60
Oct 16 18:41:04 server kernel:  ? do_syscall_64+0x68/0x90
Oct 16 18:41:04 server kernel:  ? syscall_exit_to_user_mode+0x37/0x60
Oct 16 18:41:04 server kernel:  ? do_syscall_64+0x68/0x90
Oct 16 18:41:04 server kernel:  ? do_syscall_64+0x68/0x90
Oct 16 18:41:04 server kernel:  ? do_syscall_64+0x68/0x90
Oct 16 18:41:04 server kernel:  ? sysvec_call_function+0x4b/0xd0
Oct 16 18:41:04 server kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Oct 16 18:41:04 server kernel: RIP: 0033:0x7f566e6ccc7d
Oct 16 18:41:04 server kernel: Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6b 81 0d 00 f7 d8 64 89 01 48
Oct 16 18:41:04 server kernel: RSP: 002b:7ffde2a2a668 EFLAGS: 0246 ORIG_RAX: 0139
Oct 16 18:41:04 server kernel: RAX: ffda RBX: 56478d877030 RCX: 7f566e6ccc7d
Oct 16 18:41:04 server kernel: RDX: 0004 RSI: 7f566e85c44a RDI: 0018
Oct 16 18:41:04 server kernel: RBP: 7f566e85c44a R08: 0040 R09: fde0
Oct 16 18:41:04 server kernel: R10: fe18 R11: 0246 R12: 0002
Oct 16 18:41:04 server kernel: R13: 56478d878a90 R14:  R15: 56478d87bac0
Oct 16 18:41:04 server kernel:  </TASK>
Oct 16 18:41:04 server kernel: 

Oct 16 18:41:04 server kernel: 

Oct 16 18:41:04 server kernel: UBSAN: array-index-out-of-bounds in /build/linux-D15vQj/linux-6.5.0/drivers/gpu/drm/amd/amdgpu/../pm/powerplay/hwmgr/processpptables.c:1249:61
Oct 16 18:41:04 server kernel: index 1 is out of range for type 'ATOM_PPLIB_VCE_Clock_Voltage_Limit_Record [1]'
Oct 16 18:41:04 server kernel: CPU: 3 PID: 128 Comm: (udev-worker) Not tainted 
```

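For context, the type named in the report, `ATOM_PPLIB_VCE_Clock_Voltage_Limit_Record [1]`, has a declared bound of one element, and the report fires on index 1. A likely explanation is the legacy C idiom of using a one-element trailing array as a variable-length tail: the data may well be there, but a build that treats the declared bound strictly will flag every index past 0. Below is a minimal sketch of that idiom and of the C99 flexible-array alternative. All `demo_*` names are hypothetical and are not taken from the amdgpu sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record type, standing in for a firmware table entry. */
struct demo_record {
    uint16_t voltage;
    uint8_t  clock_index;
};

/* Legacy layout: a one-element array used as a variable-length tail.
 * The declared bound is 1, so a strict bounds check treats entries[1]
 * as out of range even if the allocation holds num_entries records. */
struct demo_table_legacy {
    uint8_t num_entries;
    struct demo_record entries[1];
};

/* C99 flexible array member: no declared bound on the tail, so
 * indexing up to num_entries - 1 does not exceed a declared size. */
struct demo_table_flex {
    uint8_t num_entries;
    struct demo_record entries[];
};

/* Allocate a zeroed table with room for n trailing records. */
struct demo_table_flex *demo_table_alloc(uint8_t n)
{
    struct demo_table_flex *t =
        calloc(1, sizeof(*t) + (size_t)n * sizeof(t->entries[0]));
    if (t != NULL)
        t->num_entries = n;
    return t;
}

/* Walk every record; the loop bound comes from num_entries. */
unsigned int demo_table_voltage_sum(const struct demo_table_flex *t)
{
    unsigned int sum = 0;

    assert(t != NULL);
    for (uint8_t i = 0; i < t->num_entries; i++)
        sum += t->entries[i].voltage;
    return sum;
}
```

If this is what is happening here, the message would be a diagnostic about how the table type is declared rather than proof of memory corruption. That is an assumption on my part, not something the log itself confirms.
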
[Kernel-packages] [Bug 1956401] Re: amdgpu hangs for 90 seconds at a time in 5.13.0-23, but 5.13.0-22 works

2022-01-08 Thread KonishchevDmitry
5.13.0-24.24 fixes it for me. With 5.13.0-23.23 my server doesn't boot
at all: it starts booting and then the monitor goes into an inactive
state, so I have no way to see an error message.

My configuration:
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Root Complex
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev 85)
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Host Bridge
00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Root Port
00:02.5 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Root Port
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Host Bridge
00:08.0 Encryption controller: Advanced Micro Devices, Inc. [AMD] Carrizo Platform Security Processor
00:09.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Carrizo Audio Dummy Host Bridge
00:10.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller (rev 20)
00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 49)
00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 49)
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 4a)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 11)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 5
01:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9230 PCIe 2.0 x2 4-port SATA 6 Gb/s RAID Controller (rev 11)
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1956401

Title:
  amdgpu hangs for 90 seconds at a time in 5.13.0-23, but 5.13.0-22
  works

Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Impish:
  Confirmed

Bug description:
  SRU Justification

  Impact:

  This does not occur with linux-image-5.13.0-22-generic, but does with
  linux-image-5.13.0-23-generic.
  On startup, I get about a 60 second hang, with the following in the
  kernel dmesg:
  Jan  4 15:26:36 inspiron-3505 kernel: [   34.160572] amdgpu 0000:04:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
  Jan  4 15:26:56 inspiron-3505 kernel: [   54.189055] amdgpu 0000:04:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
  Jan  4 15:27:16 inspiron-3505 kernel: [   74.329264] amdgpu 0000:04:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
  Jan  4 15:27:36 inspiron-3505 kernel: [   94.337904] amdgpu 0000:04:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
  I have the following GPU:
  04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev c2) (prog-if 00 [VGA controller])
  04:00.0 0300: 1002:15d8 (rev c2)
  (This is a Ryzen 5 3450U CPU with Radeon Vega Mobile.)

  I get a similar hang if I start firefox (when it's probing OpenGL
  contexts), and even with glxgears and glxinfo. Seems like anything
  that'd kick off an OpenGL context does it.  I had a freeze as well when
  I tried running firefox and glxgears together, along with odd BUG:
  messages logged (I have some in the attached log).

  I was running with "iommu=pt", but did try with this removed and still
  got the errors (I think the amdgpu driver uses the IOMMU even when
  it's set to iommu=pt, though).  See the attached log for some very odd
  "[Hardware Error]" messages that were logged on one test run.  I think
  this was when I tried to run firestorm (second life viewer) -- that
  had a large pause then opened to a black window.

  Per Google, I see there was a bug like this that turned up in kernel
  5.14.15 but was fixed in 5.14.17.  See
  https://gitlab.freedesktop.org/drm/amd/-/issues/1770

  Thanks!
  --Henry

  Fix:
  upstream commit afd18180c070 ("drm/amdkfd: fix boot failure when iommu
  is disabled in Picasso.")

  Patch was included in the Impish kernel in -proposed (5.13.0.24.24)
  from an upstream patch set. Multiple confirmations that the problem is
  resolved with the kernel in -proposed.

To manage notifications about