On 5/5/25 12:06 PM, John Olender wrote:
> On 5/5/25 5:02 AM, Christian König wrote:
>>> Simply changing the uvd vcpu bo (and therefore the firmware) to always
>>> be allocated in vram does *not* solve #3851.
>>>
>>> Let me go into a bit of depth about how I arrived at this patch.
>>>
>>> First, what sort of system configuration changes result in the uvd init
>>> failure? It looks like having a display connected and changing the BAR
>>> size have an impact. Next, which kernel change reliably triggers the
>>> issue? The change is the switch to the buddy allocator.
>>
>> Well, that is not a resizable BAR; rather, the "VRAM" is just stolen
>> system memory and we completely bypass the BAR to access it.
>>
>> But the effect is the same. E.g., you have more CPU-accessible memory
>> than otherwise.
>>
>>> Now that the issue can be reliably triggered, where does the error code,
>>> -110 / -ETIMEDOUT, come from? It turns out it's in
>>> amdgpu_uvd_ring_test_ib(), specifically a timeout while waiting on the
>>> ring's fence.
>>>
>>> With that out of the way, what allocator-related change happens when a
>>> display is connected at startup? The 'stolen_vga_memory' and related
>>> bos are created. Adding a one-page dummy bo at the same place in the
>>> driver allows a headless configuration to pass the uvd ring ib test.
>>>
>>> Why does having these extra objects allocated result in a change in
>>> behavior? Well, the switch to the buddy allocator drastically changes
>>> *where* in vram various objects end up being placed. What about the BAR
>>> size change? That ends up influencing where the objects are placed too.
>>>
>>> Which objects related to uvd end up being moved around? The uvd code
>>> has a function to force its objects into a specific segment, after all.
>>> Well, it turns out the vcpu bo doesn't go through this function and is
>>> therefore being moved around.
>>
>> That function is there because independent buffers (the message and the
>> feedback, for example) need to be in the same 256MB segment.
>>
>>> When the system configuration results in a ring ib timeout, the uvd vcpu
>>> bo is pinned *outside* the uvd segment. When uvd init succeeds, the uvd
>>> vcpu bo is pinned *inside* the uvd segment.
>>>
>>> So, it appears there's a relationship between *where* the vcpu bo ends
>>> up and the fence timeout. But why does the issue manifest as a ring
>>> fence timeout while testing the ib? Unfortunately, I'm unable to find
>>> something like a datasheet or developer's guide containing the finer
>>> details of uvd.
>>
>> Mhm, there must be something wrong with programming bits 28-31 of the
>> VCPU BO base address.
>>
>> Forcing the VCPU into the first 256MB segment just makes those bits zero
>> and so makes it work on your system.
>>
>> The problem is that this is basically just a coincidence. On other
>> systems the base address can be completely different.
>>
>> See the function uvd_v4_2_mc_resume(), where the mmUVD_LMI_ADDR_EXT and
>> mmUVD_LMI_EXT40_ADDR registers are programmed, and try to hack those two
>> register writes and see if they really end up in the HW.
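For reference, this is how I read the two writes in uvd_v4_2_mc_resume()
before instrumenting them (a paraphrased sketch rather than the verbatim
driver code; the field path is from memory, but the packed values match
the log below):

        uint32_t addr;

        /* bits 28-31 of the VCPU BO base address, replicated into the
         * two address-extension fields of mmUVD_LMI_ADDR_EXT */
        addr = (adev->uvd.inst->gpu_addr >> 28) & 0xf;
        WREG32(mmUVD_LMI_ADDR_EXT, (addr << 12) | addr);

        /* bits 32-39 of the base address, plus what look like fixed
         * control fields */
        addr = (adev->uvd.inst->gpu_addr >> 32) & 0xff;
        WREG32(mmUVD_LMI_EXT40_ADDR, addr | (0x9 << 16) | (0x1U << 31));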
Okay, I did a read and compare after each write (the read-back hack is
sketched at the bottom of this mail, below the quoted text). Both writes
seem to go through on both the Kaveri and the s9150:

Kaveri (512MB UMA Buffer):

amdgpu 0000:00:01.0: amdgpu: [drm] uvd_v4_2_mc_resume: mmUVD_LMI_ADDR_EXT: gpu_addr=0xF41FA00000, addr=0x00000001, wrote 0x00001001, read 0x00001001 [same]
amdgpu 0000:00:01.0: amdgpu: [drm] uvd_v4_2_mc_resume: mmUVD_LMI_EXT40_ADDR: gpu_addr=0xF41FA00000, addr=0x000000F4, wrote 0x800900F4, read 0x800900F4 [same]

s9150:

amdgpu 0000:41:00.0: amdgpu: [drm] uvd_v4_2_mc_resume: mmUVD_LMI_ADDR_EXT: gpu_addr=0xF7FFA00000, addr=0x0000000F, wrote 0x0000F00F, read 0x0000F00F [same]
amdgpu 0000:41:00.0: amdgpu: [drm] uvd_v4_2_mc_resume: mmUVD_LMI_EXT40_ADDR: gpu_addr=0xF7FFA00000, addr=0x000000F7, wrote 0x800900F7, read 0x800900F7 [same]

Thanks,
John

>>
>> I will try to find a Kaveri system which is still working to reproduce
>> the issue.
>>
>> Thanks,
>> Christian.
>>
>
> I first saw this issue with an s9150. I had serious reservations about
> reporting it because, in its default configuration, the s9150 has no
> display output. I needed to make sure that this was a real issue and
> that I hadn't just shot myself in the foot by enabling broken display
> hardware.
>
> The issue affects all s9150s in a system, occurs in different slots and
> on different numa nodes, still occurs when other hardware is added or
> removed, and follows the s9150 from an x399 system to a significantly
> newer b650 system.
>
> The Kaveri iGPU, while also impacted, mainly serves to show that this
> issue is happening on more than just some dodgy s9150 setup.
>
> Anyway, hopefully these extra configuration details help narrow down
> the problem.
>
> Thanks,
> John
>
>>>
>>> Well, what seems related in the code? Where is the ring fence located?
>>> It's placed inside the vcpu bo by amdgpu_fence_driver_start_ring().
>>>
>>> So, does this patch provide the correct solution to the problem? Maybe
>>> not. But the solution seems plausible enough to at least send in the
>>> patch for review.
>>>
>>> Thanks,
>>> John
>
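For completeness, the read-back hack referenced above was roughly the
following (a sketch of my debug instrumentation, not the exact patch;
the dev_info() call and format string are paraphrased from the log
output, and the field path is from memory):

        uint64_t gpu_addr = adev->uvd.inst->gpu_addr;
        uint32_t addr, wrote, read;

        /* mmUVD_LMI_ADDR_EXT: bits 28-31 of the VCPU BO base address,
         * written, read back, and compared */
        addr = (gpu_addr >> 28) & 0xf;
        wrote = (addr << 12) | addr;
        WREG32(mmUVD_LMI_ADDR_EXT, wrote);
        read = RREG32(mmUVD_LMI_ADDR_EXT);
        dev_info(adev->dev,
                 "[drm] uvd_v4_2_mc_resume: mmUVD_LMI_ADDR_EXT: gpu_addr=0x%llX, addr=0x%08X, wrote 0x%08X, read 0x%08X [%s]\n",
                 (unsigned long long)gpu_addr, addr, wrote, read,
                 read == wrote ? "same" : "DIFFERENT");

        /* mmUVD_LMI_EXT40_ADDR: bits 32-39 of the base address, plus
         * fixed control fields, same write/read/compare pattern */
        addr = (gpu_addr >> 32) & 0xff;
        wrote = addr | (0x9 << 16) | (0x1U << 31);
        WREG32(mmUVD_LMI_EXT40_ADDR, wrote);
        read = RREG32(mmUVD_LMI_EXT40_ADDR);
        dev_info(adev->dev,
                 "[drm] uvd_v4_2_mc_resume: mmUVD_LMI_EXT40_ADDR: gpu_addr=0x%llX, addr=0x%08X, wrote 0x%08X, read 0x%08X [%s]\n",
                 (unsigned long long)gpu_addr, addr, wrote, read,
                 read == wrote ? "same" : "DIFFERENT");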