On Wed, Nov 23, 2022 at 2:50 AM Khatri, Sunil <sunil.kha...@amd.com> wrote:
>
> Hello Alex, Robert
>
> I am facing similar issues on Chrome. Are there any tools in the Linux 
> environment that can help debug issues such as page faults or kernel panics 
> caused by invalid pointer access?
>
> I have used tools like the ramdump parser, which loads a ramdump taken after 
> a crash and lets you inspect much of the static data in memory; even the page 
> tables can be checked by walking them manually. We used to load the kernel 
> symbols along with the ramdump to go through it line by line.
>
> I would appreciate it if you could point to documentation or tools already 
> used by the Linux graphics teams, for either UMD or KMD drivers, so the 
> Chrome team can also use them to debug issues.
>

UMR has a number of tools for dumping GPU page tables and debugging page faults:
https://gitlab.freedesktop.org/tomstdenis/umr

Alex


> Regards
> Sunil Khatri
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Alex 
> Deucher
> Sent: Tuesday, November 22, 2022 7:42 PM
> To: Robert Beckett <bob.beck...@collabora.com>
> Cc: Dmitrii Osipenko <dmitry.osipe...@collabora.com>; Adrián Martínez Larumbe 
> <adrian.laru...@collabora.com>; Koenig, Christian <christian.koe...@amd.com>; 
> amd-gfx@lists.freedesktop.org; Daniel Stone <dani...@collabora.com>
> Subject: Re: Help debug amdgpu faults
>
> On Tue, Nov 22, 2022 at 6:53 AM Robert Beckett <bob.beck...@collabora.com> 
> wrote:
> >
> > Hi,
> >
> >
> > Does anyone know of any documentation, or can anyone provide advice, on 
> > debugging amdgpu fault reports?
>
> This is a GPU page fault, so it refers to the GPU virtual address space of 
> the application.  Each process (well, each fd really) gets its own GPU 
> virtual address space into which system memory, system MMIO space, or VRAM 
> can be mapped.  The user mode drivers control their GPU virtual address space.
>
> >
> >
> > e.g:
> >
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: [gfxhub]
> > page fault (src_id:0 ring:8 vmid:1 pasid:32769, for process vkcube pid
> > 999 thread vkcube pid 999)
>
> This is the process that caused the fault.
>
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:   in page 
> > starting at address 0x0000800100700000 from client 0x1b (UTCL2)
>
> This is the virtual address that faulted.
>
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: 
> > GCVM_L2_PROTECTION_FAULT_STATUS:0x00101A10
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:          Faulty 
> > UTCL2 client ID: SDMA0 (0xd)
>
> The fault came from the SDMA engine.
>
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:          
> > MORE_FAULTS: 0x0
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:          
> > WALKER_ERROR: 0x0
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:          
> > PERMISSION_FAULTS: 0x1
>
> The page was not marked as valid in the GPU page table.
>
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:          
> > MAPPING_ERROR: 0x0
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:          RW: 0x0
>
> SDMA attempted to read an invalid page.
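
As a rough sketch of how the fields above relate to the raw status value
0x00101A10, the register can be split up by hand. The bit layout below is an
assumption (a gfx10-style layout); it happens to reproduce the values the
kernel printed in this report, but check the register headers for your ASIC
before relying on it. The names FIELDS and decode_fault_status are only
illustrative.

```python
# Assumed bit layout of GCVM_L2_PROTECTION_FAULT_STATUS (gfx10-style).
# Each entry is (shift, width). This is an assumption; verify it against
# the register headers for the ASIC you are debugging.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),    # UTCL2 client ID
    "RW": (18, 1),    # 0 = read, 1 = write
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Split a raw status value into named fields using FIELDS."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


# Decode the status value from the fault report in this thread.
for name, value in decode_fault_status(0x00101A10).items():
    print(f"{name}: {value:#x}")
```

With that layout, the decoded CID is 0xd (SDMA0), PERMISSION_FAULTS is 0x1,
RW is 0x0 (a read) and VMID is 1, matching the log lines quoted above.
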
>
> >
> >
> >
> > see https://gitlab.freedesktop.org/drm/amd/-/issues/2267 for more context.
> >
> > We have a complicated setup involving rendering then blitting to virtio-gpu 
> > exported dmabufs, with plenty of hacks in the mesa and xwayland stacks, so 
> > we are considering this our issue to debug, and not an issue with the 
> > driver at this point.
> > However, having debugged all the interesting parts leading to these faults, 
> > I am unable to decode the fault report to correlate to a buffer.
> >
> > In the fault report, what address space is the address from?
> > Given that the fault handler shifts the reported address up by 12, I 
> > assume it is a 4K pfn, which makes me assume it is a physical address. Is 
> > this correct?
> > If so, is that a VRAM PA or a host system memory PA?
> > Is there any documentation for the other fields reported, like the fault 
> > status etc.?
>
> See the comments above.  There is some kernel doc as well:
> https://docs.kernel.org/gpu/amdgpu/driver-core.html#amdgpu-virtual-memory
>
> >
> > I'd appreciate any advice you could give to help us debug further.
>
> Some operation you are doing in the user mode driver is reading an invalid 
> page.  Possibly reading past the end of a buffer or something mis-aligned.  
> Compare the faulting GPU address to the GPU virtual address space in the 
> application and you should be able to track down what is happening.
>
> Alex
>
> >
> > Thanks
> >
> > Bob
> >
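
One way to do the comparison suggested above (faulting GPU address against the
application's GPU virtual address space) is sketched here. It assumes you can
log the GPU VA range of each buffer from the user mode driver; the Mapping
type, the locate_fault helper, and the buffer names and ranges are all made
up for illustration. Only the faulting address is taken from the report.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    name: str   # hypothetical label recorded when the UMD maps the buffer
    start: int  # first GPU virtual address of the mapping
    size: int   # length in bytes


def locate_fault(addr, mappings):
    """Return (mapping, offset) for the mapping with the highest start
    address that is not above addr, or (None, None) if there is none.
    An offset >= mapping.size means addr lies past the end of that mapping.
    Assumes the mappings do not overlap."""
    below = [m for m in mappings if m.start <= addr]
    if not below:
        return None, None
    nearest = max(below, key=lambda m: m.start)
    return nearest, addr - nearest.start


# Made-up mappings; real ranges would come from your own UMD logging.
mappings = [
    Mapping("src_image", 0x800100000000, 0x700000),
    Mapping("dst_dmabuf", 0x800100800000, 0x400000),
]

# Page address from the fault report in this thread.
fault_addr = 0x0000800100700000

mapping, offset = locate_fault(fault_addr, mappings)
if mapping is None:
    print("no mapping at or below the faulting address")
elif offset < mapping.size:
    print(f"fault inside {mapping.name} at offset {offset:#x}")
else:
    print(f"fault {offset - mapping.size:#x} bytes past the end of {mapping.name}")
```

With these made-up ranges the fault lands on the first page after src_image,
which is the "reading past the end of a buffer" case. Note that the kernel
reports the start of the faulting page, so the offset is only page-accurate.
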
