On 3/24/25 23:14, Christian König wrote:
> Am 24.03.25 um 12:23 schrieb Bert Karwatzki:
>> Am Sonntag, dem 23.03.2025 um 17:51 +1100 schrieb Balbir Singh:
>>> On 3/22/25 23:23, Bert Karwatzki wrote:
>>>> ...
>>>> So why is use_dma32 enabled with nokaslr? Some more printk()s give this 
>>>> result:
>>>>
>>>> The GPUs:
>>>> built-in:
>>>> 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] 
>>>> Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c5)
>>>> discrete:
>>>> 03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 
>>>> [Radeon RX 6600/6600 XT/6600M] (rev c3)
>>>>
>>>> With nokaslr:
>>>> [    1.266517] [    T328] dma_addressing_limited: mask = 0xfffffffffff 
>>>> bus_dma_limit = 0x0 required_mask = 0xfffffffff
>>>> [    1.266519] [    T328] dma_addressing_limited: ops = 0000000000000000 
>>>> use_dma_iommu(dev) = 0
>>>> [    1.266520] [    T328] dma_direct_all_ram_mapped: returning true
>>>> [    1.266521] [    T328] dma_addressing_limited: returning ret = 0
>>>> [    1.266521] [    T328] amdgpu 0000:03:00.0: amdgpu: amdgpu_ttm_init: 
>>>> calling ttm_device_init() with use_dma32 = 0
>>>> [    1.266525] [    T328] entering ttm_device_init, use_dma32 = 0
>>>> [    1.267115] [    T328] entering ttm_pool_init, use_dma32 = 0
>>>>
>>>> [    3.965669] [    T328] dma_addressing_limited: mask = 0xfffffffffff 
>>>> bus_dma_limit = 0x0 required_mask = 0x3fffffffffff
>>>> [    3.965671] [    T328] dma_addressing_limited: returning true
>>>> [    3.965672] [    T328] amdgpu 0000:08:00.0: amdgpu: amdgpu_ttm_init: 
>>>> calling ttm_device_init() with use_dma32 = 1
>>>> [    3.965674] [    T328] entering ttm_device_init, use_dma32 = 1
>>>> [    3.965747] [    T328] entering ttm_pool_init, use_dma32 = 1
>>>>
>>>> Without nokaslr:
>>>> [    1.300907] [    T351] dma_addressing_limited: mask = 0xfffffffffff 
>>>> bus_dma_limit = 0x0 required_mask = 0xfffffffff
>>>> [    1.300909] [    T351] dma_addressing_limited: ops = 0000000000000000 
>>>> use_dma_iommu(dev) = 0
>>>> [    1.300910] [    T351] dma_direct_all_ram_mapped: returning true
>>>> [    1.300910] [    T351] dma_addressing_limited: returning ret = 0
>>>> [    1.300911] [    T351] amdgpu 0000:03:00.0: amdgpu: amdgpu_ttm_init: 
>>>> calling ttm_device_init() with use_dma32 = 0
>>>> [    1.300915] [    T351] entering ttm_device_init, use_dma32 = 0
>>>> [    1.301210] [    T351] entering ttm_pool_init, use_dma32 = 0
>>>>
>>>> [    4.000602] [    T351] dma_addressing_limited: mask = 0xfffffffffff 
>>>> bus_dma_limit = 0x0 required_mask = 0xfffffffffff
>>>> [    4.000603] [    T351] dma_addressing_limited: ops = 0000000000000000 
>>>> use_dma_iommu(dev) = 0
>>>> [    4.000604] [    T351] dma_direct_all_ram_mapped: returning true
>>>> [    4.000605] [    T351] dma_addressing_limited: returning ret = 0
>>>> [    4.000606] [    T351] amdgpu 0000:08:00.0: amdgpu: amdgpu_ttm_init: 
>>>> calling ttm_device_init() with use_dma32 = 0
>>>> [    4.000610] [    T351] entering ttm_device_init, use_dma32 = 0
>>>> [    4.000687] [    T351] entering ttm_pool_init, use_dma32 = 0
>>>>
>>>> So with nokaslr the required mask for the built-in GPU changes from
>>>> 0xfffffffffff to 0x3fffffffffff, which causes dma_addressing_limited()
>>>> to return true, which in turn causes ttm_device_init() to be called
>>>> with use_dma32 = true.
>>> Thanks, this is really the root cause, from what I understand.
> 
> Yeah, completely agree.
> 
>>>
>>>> It also shows that for the discrete GPU nothing changes, so the bug
>>>> does not occur there.
>>>>
>>>> I was also able to work around the bug by calling ttm_device_init()
>>>> with use_dma32=false from amdgpu_ttm_init()
>>>> (drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c), but I'm not sure if this
>>>> has unwanted side effects.
>>>>
>>>> int amdgpu_ttm_init(struct amdgpu_device *adev)
>>>> {
>>>>    uint64_t gtt_size;
>>>>    int r;
>>>>
>>>>    mutex_init(&adev->mman.gtt_window_lock);
>>>>
>>>>    dma_set_max_seg_size(adev->dev, UINT_MAX);
>>>>    /* No other users of the address space, so set it to 0 */
>>>>    dev_info(adev->dev, "%s: calling ttm_device_init() with use_dma32 = 0 
>>>> ignoring %d\n", __func__, dma_addressing_limited(adev->dev));
>>>>    r = ttm_device_init(&adev->mman.bdev, &amdgpu_bo_driver, adev->dev,
>>>>                           adev_to_drm(adev)->anon_inode->i_mapping,
>>>>                           adev_to_drm(adev)->vma_offset_manager,
>>>>                           adev->need_swiotlb,
>>>>                           false /* use_dma32 */);
>>>>    if (r) {
>>>>            DRM_ERROR("failed initializing buffer object driver(%d).\n", r);
>>>>            return r;
>>>>    }
>>>>
>>> I think this brings us really close. Instead of forcing use_dma32 to
>>> false, I wonder if we need something like
>>>
>>> uint64_t dma_bits = fls64(dma_get_mask(adev->dev));
>>>
>>> and pass the last argument (use_dma32) to ttm_device_init() as
>>> dma_bits < 32?
> 
> The handling is completely correct as far as I can see.
> 
>>>
>>>
>>> Thanks,
>>> Balbir Singh
>>>
>> Do these address bits have to shift when using nokaslr or PCI_P2PDMA? I think
>> this shift causes the increase of the required_mask to 0x3fffffffffff:
>>
>> @@ -104,4 +104,4 @@
>>        fe30300000-fe303fffff : 0000:04:00.0
>>      fe30400000-fe30403fff : 0000:04:00.0
>>      fe30404000-fe30404fff : 0000:04:00.0
>> -afe00000000-affffffffff : 0000:03:00.0
>> +3ffe00000000-3fffffffffff : 0000:03:00.0
>>
>> And what memory is this? It's 8G in size, so it could be the RAM of the
>> discrete GPU (which is at PCI 0000:03:00.0), but that is already here
>> (part of /proc/iomem):
>>
>> 1010000000-ffffffffff : PCI Bus 0000:00
>>   fc00000000-fe0fffffff : PCI Bus 0000:01
>>     fc00000000-fe0fffffff : PCI Bus 0000:02
>>       fc00000000-fe0fffffff : PCI Bus 0000:03
>>         fc00000000-fdffffffff : 0000:03:00.0  GPU RAM
>>         fe00000000-fe0fffffff : 0000:03:00.0
>>
>> lspci -v reports 8G of memory at 0xfc00000000, so I assumed that is the
>> GPU RAM:
>> 03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23
>> [Radeon RX 6600/6600 XT/6600M] (rev c3)
>>      Subsystem: Micro-Star International Co., Ltd. [MSI] Device 1313
>>      Flags: bus master, fast devsel, latency 0, IRQ 107, IOMMU group 14
>>      Memory at fc00000000 (64-bit, prefetchable) [size=8G]
>>      Memory at fe00000000 (64-bit, prefetchable) [size=256M]
>>      Memory at fca00000 (32-bit, non-prefetchable) [size=1M]
>>      Expansion ROM at fcb00000 [disabled] [size=128K]
> 
> Well, when you set nokaslr, that moves the BAR address of the dGPU above
> the limit the integrated GPU can access on the bus (usually 40 bits).
> 
> Because of this the integrated GPU starts to fall back to system memory
> below the 4GB limit, to make sure that the stuff is always accessible by
> everyone.

Why does it fall back to GPU_DMA32? Is the rest of system memory not usable 
(up to 40 bits)?
I did not realize that the iGPU is using the BAR memory of the dGPU.

I guess the issue goes away when amdgpu.gttsize is set to 2GB, because 2GB 
fits in the DMA32 window.

> 
> Since the memory below 4GB is very, very limited we are now starting to 
> constantly swap things in and out of that area, basically completely 
> killing the performance of your Steam game.
> 
> As far as I can see, up to that point the handling is completely 
> intentional and working as expected.
> 
> The only thing which eludes me is why setting nokaslr changes the BAR of 
> the dGPU. Can I get the full dmesg with and without nokaslr?
> 

IIRC, the iGPU does not work correctly but the dGPU does, so it's an iGPU 
addressing constraint?

Balbir
