On Mon, Mar 16, 2026 at 5:44 AM Ritesh Harjani <[email protected]> wrote:
>
> Dan Horák <[email protected]> writes:
>
> +cc Gaurav,
>
> > Hi,
> >
> > starting with 7.0-rc1 (meaning 6.19 is OK) the amdgpu driver fails to
> > initialize on my Linux/ppc64le Power9 based system (with Radeon Pro WX4100)
> > with the following in the log
> >
> > ...
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: GART: 256M 
> > 0x000000FF00000000 - 0x000000FF0FFFFFFF
>
>                   ^^^^
> So looks like this is a PowerNV (Power9) machine.
>
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] Detected 
> > VRAM RAM=4096M, BAR=4096M
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] RAM width 
> > 128bits GDDR5
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: iommu: 64-bit 
> > OK but direct DMA is limited by 0
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: 
> > dma_iommu_get_required_mask: returning bypass mask 0xfffffffffffffff
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0:  4096M of VRAM 
> > memory ready
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0:  32570M of GTT 
> > memory ready.
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: (-12) failed to 
> > allocate kernel bo
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] Debug 
> > VRAM access will use slowpath MM access
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] GART: num 
> > cpu pages 4096, num gpu pages 65536
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] PCIE GART 
> > of 256M enabled (table at 0x000000F4FFF80000).
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: (-12) failed to 
> > allocate kernel bo
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: (-12) create WB 
> > bo failed
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: 
> > amdgpu_device_wb_init failed -12
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: 
> > amdgpu_device_ip_init failed
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: Fatal error 
> > during GPU init
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: finishing 
> > device.
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: probe with 
> > driver amdgpu failed with error -12
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0:  ttm finalized
> > ...
> >
> > After some hints from Alex, plus bisecting and further investigation, I
> > found that
> > https://github.com/torvalds/linux/commit/1471c517cf7dae1a6342fb821d8ed501af956dd0
> > is the culprit; reverting it makes amdgpu load (and work) again.
>
> Thanks for confirming this. Yes, this was recently added [1].
>
> [1]: 
> https://lore.kernel.org/linuxppc-dev/[email protected]/
>
>
> @Gaurav,
>
> I am not too familiar with this area, but looking at the logs Dan shared,
> it seems we may always be taking the DMA direct allocation path, even
> though the device may not support the required address range.

The device only supports a 40-bit DMA mask.

Alex

>
>  bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: iommu: 64-bit OK 
> but direct DMA is limited by 0
>  bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: 
> dma_iommu_get_required_mask: returning bypass mask 0xfffffffffffffff
>
> Looking at the code..
>
> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> index fe7472f13b10..d5743b3c3ab3 100644
> --- a/kernel/dma/mapping.c
> +++ b/kernel/dma/mapping.c
> @@ -654,7 +654,7 @@ void *dma_alloc_attrs(struct device *dev, size_t size, 
> dma_addr_t *dma_handle,
>         /* let the implementation decide on the zone to allocate from: */
>         flag &= ~(__GFP_DMA | __GFP_DMA32 | __GFP_HIGHMEM);
>
> -       if (dma_alloc_direct(dev, ops)) {
> +       if (dma_alloc_direct(dev, ops) || arch_dma_alloc_direct(dev)) {
>                 cpu_addr = dma_direct_alloc(dev, size, dma_handle, flag, 
> attrs);
>         } else if (use_dma_iommu(dev)) {
>                 cpu_addr = iommu_dma_alloc(dev, size, dma_handle, flag, 
> attrs);
>
> Now, do we need arch_dma_alloc_direct() here? It always returns true when
> dev->dma_ops_bypass is set, without performing the checks that
> dma_go_direct() does.
>
> whereas...
>
> /*
>  * Check if the devices uses a direct mapping for streaming DMA operations.
>  * This allows IOMMU drivers to set a bypass mode if the DMA mask is large
>  * enough.
>  */
> static inline bool
> dma_alloc_direct(struct device *dev, const struct dma_map_ops *ops)
> {
>         return dma_go_direct(dev, dev->coherent_dma_mask, ops);
> }
>
> ...and dma_go_direct() has:
>
> #ifdef CONFIG_DMA_OPS_BYPASS
>         if (dev->dma_ops_bypass)
>                 return min_not_zero(mask, dev->bus_dma_limit) >=
>                         dma_direct_get_required_mask(dev);
> #endif
>
> dma_alloc_direct() already checks dev->dma_ops_bypass and also verifies
> that dev->coherent_dma_mask >= dma_direct_get_required_mask(). So...
>
> Do we really need the machinery of arch_dma_{alloc|free}_direct()?
> Aren't the checks in dma_alloc_direct() sufficient?
>
> Thoughts?
>
> -ritesh
>
>
> >
> > for the record, I have originally opened 
> > https://gitlab.freedesktop.org/drm/amd/-/issues/5039
> >
> >
> >       With regards,
> >
> >               Dan
