Hi Ritesh,

On Tue, 17 Mar 2026 17:13:31 +0530
Ritesh Harjani (IBM) <[email protected]> wrote:

> Dan Horák <[email protected]> writes:
> 
> > Hi Ritesh,
> >
> > On Sun, 15 Mar 2026 09:55:11 +0530
> > Ritesh Harjani (IBM) <[email protected]> wrote:
> >
> >> Dan Horák <[email protected]> writes:
> >> 
> >> +cc Gaurav,
> >> 
> >> > Hi,
> >> >
> >> > starting with 7.0-rc1 (meaning 6.19 is OK) the amdgpu driver fails to
> >> > initialize on my Linux/ppc64le Power9 based system (with Radeon Pro 
> >> > WX4100)
> >> > with the following in the log
> >> >
> >> > ...
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: GART: 256M 
> >> > 0x000000FF00000000 - 0x000000FF0FFFFFFF
> >> 
> >>                   ^^^^
> >> So looks like this is a PowerNV (Power9) machine.
> >
> > correct :-)
> >  
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] 
> >> > Detected VRAM RAM=4096M, BAR=4096M
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] RAM 
> >> > width 128bits GDDR5
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: iommu: 
> >> > 64-bit OK but direct DMA is limited by 0
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: 
> >> > dma_iommu_get_required_mask: returning bypass mask 0xfffffffffffffff
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0:  4096M of 
> >> > VRAM memory ready
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0:  32570M of 
> >> > GTT memory ready.
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: (-12) failed 
> >> > to allocate kernel bo
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] Debug 
> >> > VRAM access will use slowpath MM access
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] GART: 
> >> > num cpu pages 4096, num gpu pages 65536
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] PCIE 
> >> > GART of 256M enabled (table at 0x000000F4FFF80000).
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: (-12) failed 
> >> > to allocate kernel bo
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: (-12) create 
> >> > WB bo failed
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: 
> >> > amdgpu_device_wb_init failed -12
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: 
> >> > amdgpu_device_ip_init failed
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: Fatal error 
> >> > during GPU init
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: finishing 
> >> > device.
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: probe with 
> >> > driver amdgpu failed with error -12
> >> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0:  ttm 
> >> > finalized
> >> > ...
> >> >
> >> > After some hints from Alex and bisecting and other investigation I have
> >> > found that 
> >> > https://github.com/torvalds/linux/commit/1471c517cf7dae1a6342fb821d8ed501af956dd0
> >> > is the culprit and reverting it makes amdgpu load (and work) again.
> >> 
> >> Thanks for confirming this. Yes, this was recently added [1]
> >> 
> >> [1]: 
> >> https://lore.kernel.org/linuxppc-dev/[email protected]/
> >>  
> >> 
> >> 
> >> @Gaurav,
> >> 
> >> I am not too familiar with the area, however looking at the logs shared
> >> by Dan, it looks like we might be always going for dma direct allocation
> >> path and maybe the device doesn't support this address limit. 
> >> 
> >>  bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: iommu: 64-bit 
> >> OK but direct DMA is limited by 0
> >>  bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: 
> >> dma_iommu_get_required_mask: returning bypass mask 0xfffffffffffffff
> >
> > a complete kernel log is at
> > https://gitlab.freedesktop.org/-/project/4522/uploads/c4935bca6f37bbd06bb4045c07d00b5b/kernel.log
> >
> > Please let me know if you need more info.
> 
> Hi Dan,
> 
> Thanks for sharing the kernel log. Is it also possible to kindly share
> your full kernel config with which you saw this issue.

the log is from an official Fedora kernel, thus the config is
https://src.fedoraproject.org/rpms/kernel/blob/8477f609d4875a2c20717519243fb2e6fb1cdb8f/f/kernel-ppc64le-fedora.config

and yes, Fedora, like RHEL, uses a 64k kernel page size for ppc64le, and
apart from some problems years ago I haven't had a 64k-related issue with
my card. IIRC there were page-size-related issues with the newer (Navi?)
cards, but those have also been solved.
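
FWIW the GART line in the log quoted above looks consistent with 64k CPU
pages. A quick sketch of the arithmetic, where the 4k GPU page size is my
assumption inferred from the numbers and the names are only for
illustration, not taken from the driver:

```python
# Values from the log: "PCIE GART of 256M enabled" and
# "GART: num cpu pages 4096, num gpu pages 65536".
GART_SIZE = 256 * 1024 * 1024   # 256M GART aperture
CPU_PAGE_SIZE = 64 * 1024       # 64k kernel pages (Fedora ppc64le)
GPU_PAGE_SIZE = 4 * 1024        # assumption: 4k GPU pages


def gart_pages(gart_size, page_size):
    """Number of pages of page_size needed to cover gart_size bytes."""
    return gart_size // page_size


num_cpu_pages = gart_pages(GART_SIZE, CPU_PAGE_SIZE)
num_gpu_pages = gart_pages(GART_SIZE, GPU_PAGE_SIZE)
print(num_cpu_pages, num_gpu_pages)
```

Both results match the counts the driver printed, so each CPU page backs
16 GPU pages on this system.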

 
> I think Gaurav, is still looking into reported issue. However I was
> interested in this kernel log output..
> 
> bře 05 08:35:34 talos.danny.cz kernel: radix-mmu: Mapped 
> 0x00002007fad00000-0x00002007fcd00000 with 64.0 KiB pages
> 
> This shows that the system is using 64K pagesize. So I was interested in
> knowing the kernel configs you have enabled. Donet has recently posted
> 64K pagesize support with amdgpu [1][2] on Power. However, I think, we
> can still use it w/o Donet's changes if we have CONFIG_HSA_AMD_SVM
> disabled.
> 
> So, can you kindly share the kernel configs and the AMD GPU HW details
> attached to your Power9 baremetal system, if it's possible?

output of "lspci -nn -vvv"

0000:01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. 
[AMD/ATI] Baffin [Radeon Pro WX 4100] [1002:67e3] (prog-if 00 [VGA controller])
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:0b0d]
        Device tree node: 
/sys/firmware/devicetree/base/pciex@600c3c0000000/pci@0/vga@0
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ 
Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- 
<MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 194
        NUMA node: 0
        IOMMU group: 0
        Region 0: Memory at 6000000000000 (64-bit, prefetchable) [size=4G]
        Region 2: Memory at 6000100000000 (64-bit, prefetchable) [size=2M]
        Region 5: Memory at 600c000000000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at 600c000040000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA 
PME(D0-,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Legacy Endpoint, IntMsgNum 0
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 
unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- TEE-IO-
                DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- 
TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency 
L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- 
FltModeDis-
                LnkSta: Speed 8GT/s, Width x8
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis- 
NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ 
EETLPPrefix+, MaxEETLPPrefixes 1
                         EmergencyPowerReduction Not Supported, 
EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                         AtomicOpsCtl: ReqEn-
                         IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
                         10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 
2Retimers- DRS-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, 
EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB 
preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ 
EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ 
LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported, FltMode-
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 1000000000000000  Data: 0000
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 
Len=010 <?>
        Capabilities: [150 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- 
RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- 
AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- 
PCRC_CHECK- TLPXlatBlocked-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- 
RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- 
AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- 
PCRC_CHECK- TLPXlatBlocked-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- 
RxOF+ MalfTLP+
                        ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- 
AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- 
PCRC_CHECK- TLPXlatBlocked-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
AdvNonFatalErr- CorrIntErr- HeaderOF-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
AdvNonFatalErr+ CorrIntErr- HeaderOF-
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn+ 
ECRCChkCap+ ECRCChkEn+
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [200 v1] Physical Resizable BAR
                BAR 0: current size: 4GB, supported: 256MB 512MB 1GB 2GB 4GB
        Capabilities: [270 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [2b0 v1] Address Translation Service (ATS)
                ATSCap: Invalidate Queue Depth: 00
                ATSCtl: Enable-, Smallest Translation Unit: 00
        Capabilities: [2c0 v1] Page Request Interface (PRI)
                PRICtl: Enable- Reset-
                PRISta: RF- UPRGI- Stopped+ PASID-
                Page Request Capacity: 00000020, Page Request Allocation: 
00000000
        Capabilities: [2d0 v1] Process Address Space ID (PASID)
                PASIDCap: Exec+ Priv+, Max PASID Width: 10
                PASIDCtl: Enable- Exec- Priv-
        Capabilities: [320 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [370 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ 
L1_PM_Substates+
                          PortCommonModeRestoreTime=0us PortTPowerOnTime=170us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu


> [1]: 
> https://lore.kernel.org/amd-gfx/[email protected]/#t
>      #merged
> [2]: 
> https://lore.kernel.org/amd-gfx/[email protected]/  
>      #in-review
> 
> -ritesh

if anything else is needed, let me know


                Dan
