On Fri, May 14, 2021 at 06:01:50PM -0000, Thiago Jung Bauermann wrote:
> Ah, ok. This morning I went ahead and overwrote the whole of /lib/firmware/
> amdgpu/ with the files from the latest commit of the upstream linux-
> firmware git repo you mention below. It's been only a little over 2 hours, 
> but my impression is that the latest firmware does solve the problem. If no 
> retry page fault happens by the end of the day, then I think it's safe to 
> say that it fixes the issue.

Sounds good.

> So at the end of the day (or earlier if I get a retry page fault) I'll do 
> the procedure you mention to use firmware 1.197 and only overwrite the 
> picasso* files to confirm if those are the ones that need to be changed.
> 
> I don't know much about GPUs, but from looking at Wikipedia¹ I think my 
> model is a "Vega 10". Should I overwrite the vega10* files as well?
> 
> ¹ https://en.wikipedia.org/wiki/Radeon_RX_Vega_series#Picasso_(2019)_2

I'm basing it on the PCI id I see in dmesg, 1002:15d8, which the driver
appears to identify as CHIP_RAVEN and flag with AMD_APU_IS_PICASSO. Based
on that it will pick firmware files whose names begin with "picasso". It
would still be informative to have the output from "lspci -vvnn" on your
machine.
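
For reference, the vendor:device pair can be pulled out of a line of
"lspci -nn" style output with a few lines of Python. This is only a rough
sketch: the sample line is an assumption about what lspci would print for
this GPU (assembled from the PCI id above and the GPU description in the
bug report), and pci_ids is a hypothetical helper name.

```python
import re

# Matches a "[vvvv:dddd]" vendor:device pair as printed by lspci -nn.
# The class code, e.g. "[0300]", has no colon and is therefore skipped.
PCI_ID_RE = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]", re.IGNORECASE)


def pci_ids(lspci_line):
    """Return (vendor, device) as lowercase hex strings, or None if absent."""
    match = PCI_ID_RE.search(lspci_line)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).lower()


# Assumed sample line, not copied from the reporter's machine.
sample = ("03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, "
          "Inc. [AMD/ATI] Picasso [1002:15d8] (rev c1)")
print(pci_ids(sample))
```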

Let's try the picasso files first. If that doesn't work you can try the
vega10 files (without the new picasso files), then if that doesn't work,
try both. Then we'll have an idea about what the minimal set of files is
to fix the problem.
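
The per-prefix copy could be scripted roughly as below. Treat it as a
sketch under assumptions: copy_firmware_subset is a hypothetical helper,
and the directory paths in the trailing comment are placeholders for
wherever the upstream checkout and the installed firmware actually live.

```python
import shutil
from pathlib import Path


def copy_firmware_subset(src_dir, dst_dir, prefix):
    """Copy every file in src_dir whose name starts with prefix into dst_dir.

    Returns the sorted list of file names that were copied.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    copied = []
    for path in sorted(src.glob(prefix + "*")):
        if path.is_file():
            # copy2 overwrites an existing file and keeps timestamps.
            shutil.copy2(path, dst / path.name)
            copied.append(path.name)
    return copied


# Example call (paths are assumptions; back up the destination first):
# copy_firmware_subset("linux-firmware/amdgpu", "/lib/firmware/amdgpu", "picasso")
```

Running it once with "picasso", once with "vega10", and once with both
(each time starting from the stock 1.197 files) covers the three cases.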

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-firmware in Ubuntu.
https://bugs.launchpad.net/bugs/1928393

Title:
  linux-firmware 1.197 causes kernel to report error "amdgpu: [gfxhub0]
  retry page fault"

Status in linux-firmware package in Ubuntu:
  Incomplete

Bug description:
  After upgrading linux-firmware from 1.190.5 to 1.197 (as part of the
  upgrade from Ubuntu 20.10 to 21.04), I started experiencing frequent
  and severe GPU instability. When this happens, I see this error in
  dmesg:

  [20061.061069] amdgpu 0000:03:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32769, for process Xorg pid 1141 thread Xorg:cs0 pid 1236)
  [20061.061103] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x800000401000 from client 27
  [20061.061135] amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00101031
  [20061.061147] amdgpu 0000:03:00.0: amdgpu:      Faulty UTCL2 client ID: TCP (0x8)
  [20061.061157] amdgpu 0000:03:00.0: amdgpu:      MORE_FAULTS: 0x1
  [20061.061167] amdgpu 0000:03:00.0: amdgpu:      WALKER_ERROR: 0x0
  [20061.061174] amdgpu 0000:03:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
  [20061.061183] amdgpu 0000:03:00.0: amdgpu:      MAPPING_ERROR: 0x0
  [20061.061189] amdgpu 0000:03:00.0: amdgpu:      RW: 0x0
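
The decoded fields in the log can be reproduced from the raw
VM_L2_PROTECTION_FAULT_STATUS value with a little bit arithmetic. The
sketch below assumes a particular field layout (bit offsets and widths);
that layout is not stated anywhere in this report, although it does yield
the same values the driver printed above, including vmid:1. The names
FIELDS and decode_fault_status are hypothetical.

```python
# Assumed field layout: name -> (lowest bit, width in bits).
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),   # the "Faulty UTCL2 client ID"
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Split a raw status word into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


for name, value in decode_fault_status(0x00101031).items():
    print(f"{name}: {value:#x}")
```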

  I'll attach a couple of full dmesgs that I collected.

  Many of the times when this happens, the screen and keyboard freeze
  irreversibly (I tried waiting for more than 30 minutes, but it doesn't
  help). I can still log in via ssh though. When there's no freeze, I
  can continue using the computer normally, but the laptop fans are
  always running and the battery depletes fast. There's
  probably something on a permanent loop either in the kernel or in the
  GPU.

  This bug happens several times a day, rendering the machine so
  unstable as to be almost unusable. It is a severe regression and I'm
  aghast that it passed AMD's Quality Assurance.

  After downgrading back to linux-firmware 1.190.5, the machine is back
  to the previous, mostly-reliable state. That is to say, this bug is
  gone and I'm just left with the other amdgpu suspend bug I've learned
  to live with since I bought this computer.

  Please revert the amdgpu firmware in this package as soon as possible.
  This is unbearable.

  Relevant information:
  Ubuntu version: 21.04
  Linux kernel: 5.11.0-17-generic x86_64
  CPU model: AMD Ryzen 7 3700U with Radeon Vega Mobile Gfx
  GPU: 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev c1)
  Laptop model: Lenovo Ideapad S145

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-firmware/+bug/1928393/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
