On Wed, Nov 23, 2022 at 2:50 AM Khatri, Sunil <sunil.kha...@amd.com> wrote:
>
> Hello Alex, Robert
>
> I am facing similar issues on Chrome. Are there any tools in the Linux
> environment that can help debug issues such as page faults or kernel
> panics caused by invalid pointer access?
>
> I have used tools like ramdump parser, which can load a ramdump after a
> crash and check a lot of static data in memory; even the page tables can
> be checked by walking through them manually. We used to load the kernel
> symbols along with the ramdump to go line by line.
>
> I would appreciate it if you could point to some documents or tools that
> are already used by the Linux graphics teams, for either UMD or KMD
> drivers, so the Chrome team can also use them to debug issues.
>

UMR has a number of tools for dumping GPU page tables and debugging page
faults:
https://gitlab.freedesktop.org/tomstdenis/umr

Alex

> Regards
> Sunil Khatri
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Alex Deucher
> Sent: Tuesday, November 22, 2022 7:42 PM
> To: Robert Beckett <bob.beck...@collabora.com>
> Cc: Dmitrii Osipenko <dmitry.osipe...@collabora.com>; Adrián Martínez Larumbe <adrian.laru...@collabora.com>; Koenig, Christian <christian.koe...@amd.com>; amd-gfx@lists.freedesktop.org; Daniel Stone <dani...@collabora.com>
> Subject: Re: Help debug amdgpu faults
>
> On Tue, Nov 22, 2022 at 6:53 AM Robert Beckett <bob.beck...@collabora.com> wrote:
> >
> > Hi,
> >
> >
> > does anyone know any documentation, or can provide advice on debugging
> > amdgpu fault reports?
>
> This is a GPU page fault, so it refers to the GPU virtual address space of
> the application. Each process (well, fd really) gets its own GPU virtual
> address space into which system memory, system mmio space, or vram can be
> mapped. The user mode drivers control their GPU virtual address space.
>
> >
> >
> > e.g:
> >
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:8 vmid:1 pasid:32769, for process vkcube pid 999 thread vkcube pid 999)
>
> This is the process that caused the fault.
>
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: in page starting at address 0x0000800100700000 from client 0x1b (UTCL2)
>
> This is the virtual address that faulted.
>
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00101A10
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: Faulty UTCL2 client ID: SDMA0 (0xd)
>
> The fault came from the SDMA engine.
>
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: MORE_FAULTS: 0x0
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: WALKER_ERROR: 0x0
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: PERMISSION_FAULTS: 0x1
>
> The page was not marked as valid in the GPU page table.
>
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: MAPPING_ERROR: 0x0
> > Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: RW: 0x0
>
> SDMA attempted to read an invalid page.
>
> >
> >
> >
> > see
> > https://gitlab.freedesktop.org/drm/amd/-/issues/2267
> > for more context.
> >
> > We have a complicated setup involving rendering then blitting to virtio-gpu
> > exported dmabufs, with plenty of hacks in the mesa and xwayland stacks, so
> > we are considering this our issue to debug, and not an issue with the
> > driver at this point.
> > However, having debugged all the interesting parts leading to these faults,
> > I am unable to decode the fault report to correlate to a buffer.
> >
> > in the fault report, what address space is the address from?
> > given that the fault handler shifts the reported address up by 12, I assume
> > it is a 4K pfn, which makes me assume a physical address. is this correct?
> > if so, is that a vram pa or a host system memory pa?
> > is there any documentation for the other fields reported like the fault
> > status etc?
>
> See the comments above.
> There is some kernel doc as well:
> https://docs.kernel.org/gpu/amdgpu/driver-core.html#amdgpu-virtual-memory
>
> >
> > I'd appreciate any advice you could give to help us debug further.
>
> Some operation you are doing in the user mode driver is reading an invalid
> page. Possibly reading past the end of a buffer or something mis-aligned.
> Compare the faulting GPU address to the GPU virtual address space in the
> application and you should be able to track down what is happening.
>
> Alex
>
> >
> > Thanks
> >
> > Bob
> >
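
To make the last suggestion in the thread concrete (compare the faulting GPU
address to the application's GPU virtual address space), here is a minimal
Python sketch. It assumes you can get a list of GPU VA mappings out of your
own user mode driver, for example by logging them at allocation time. The
Mapping type, the buffer names, and the address ranges below are hypothetical
and not taken from the report; only the fault address comes from the log
quoted above. The regex expects the log message on a single line, as dmesg
prints it.

```python
import re
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence, Tuple


class Mapping(NamedTuple):
    """One GPU VA mapping as recorded by the application (hypothetical)."""
    name: str   # label chosen by whoever logged the allocation
    start: int  # GPU virtual address of the first byte
    size: int   # length in bytes


# Matches the "in page starting at address 0x..." part of the fault report.
FAULT_RE = re.compile(r"in page\s+starting at address\s+(0x[0-9a-fA-F]+)")


def parse_fault_address(log_text: str) -> Optional[int]:
    """Return the faulting GPU virtual address, or None if not found."""
    match = FAULT_RE.search(log_text)
    return int(match.group(1), 16) if match else None


def locate(addr: int, mappings: Sequence[Mapping]) -> Optional[Tuple[Mapping, int]]:
    """Find the mapping with the highest start <= addr.

    Returns (mapping, offset from its start), or None if addr lies below
    every mapping. The offset may be >= mapping.size, meaning addr is past
    the end of that mapping.
    """
    ordered = sorted(mappings, key=lambda m: m.start)
    idx = bisect_right([m.start for m in ordered], addr) - 1
    if idx < 0:
        return None
    nearest = ordered[idx]
    return nearest, addr - nearest.start


def describe(addr: int, mappings: Sequence[Mapping]) -> str:
    """Summarize where addr falls relative to the known mappings."""
    hit = locate(addr, mappings)
    if hit is None:
        return f"0x{addr:x}: below every known mapping"
    mapping, offset = hit
    if offset < mapping.size:
        return f"0x{addr:x}: inside {mapping.name} at offset 0x{offset:x}"
    return f"0x{addr:x}: 0x{offset - mapping.size:x} bytes past the end of {mapping.name}"


# Fault line from the report above, joined onto one line.
LOG = ("amdgpu 0000:01:00.0: amdgpu: in page starting at address "
       "0x0000800100700000 from client 0x1b (UTCL2)")

# Hypothetical mappings, made up for illustration only.
MAPPINGS = [
    Mapping("blit_src", 0x800100600000, 0x100000),
    Mapping("blit_dst", 0x800100900000, 0x100000),
]

if __name__ == "__main__":
    fault = parse_fault_address(LOG)
    if fault is not None:
        print(describe(fault, MAPPINGS))
```

The reported address is the start of the faulting page, so the access itself
may be anywhere within that page. An address that lands just past the end of
a mapping would be consistent with the "reading past the end of a buffer"
case mentioned above, but the sketch cannot tell that apart from a buffer
that was unmapped or never mapped.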