On 3/24/25 22:23, Bert Karwatzki wrote:
> On Sunday, 23.03.2025 at 17:51 +1100, Balbir Singh wrote:
>> On 3/22/25 23:23, Bert Karwatzki wrote:
>>> The problem occurs in this part of ttm_tt_populate(): in the nokaslr case
>>> the loop is entered and run repeatedly because ttm_dma32_pages_allocated
>>> exceeds ttm_dma32_pages_limit, which leads to lots of calls to
>>> ttm_global_swapout().
>>>
>>> 	if (!strcmp(get_current()->comm, "stellaris"))
>>> 		printk(KERN_INFO "%s: ttm_pages_allocated=0x%llx ttm_pages_limit=0x%lx ttm_dma32_pages_allocated=0x%llx ttm_dma32_pages_limit=0x%lx\n",
>>> 		       __func__, ttm_pages_allocated.counter, ttm_pages_limit,
>>> 		       ttm_dma32_pages_allocated.counter, ttm_dma32_pages_limit);
>>> 	while (atomic_long_read(&ttm_pages_allocated) > ttm_pages_limit ||
>>> 	       atomic_long_read(&ttm_dma32_pages_allocated) > ttm_dma32_pages_limit) {
>>>
>>> 		if (!strcmp(get_current()->comm, "stellaris"))
>>> 			printk(KERN_INFO "%s: count=%d ttm_pages_allocated=0x%llx ttm_pages_limit=0x%lx ttm_dma32_pages_allocated=0x%llx ttm_dma32_pages_limit=0x%lx\n",
>>> 			       __func__, count++, ttm_pages_allocated.counter, ttm_pages_limit,
>>> 			       ttm_dma32_pages_allocated.counter, ttm_dma32_pages_limit);
>>> 		ret = ttm_global_swapout(ctx, GFP_KERNEL);
>>> 		if (ret == 0)
>>> 			break;
>>> 		if (ret < 0)
>>> 			goto error;
>>> 	}
>>>
>>> In the case without nokaslr, the number of ttm_dma32_pages_allocated is 0,
>>> because use_dma32 == false in that case.
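
As an aside: if I read ttm_global_init() in drivers/gpu/drm/ttm/ttm_device.c
correctly, the two limits compared in that loop are very different in size:
ttm_pages_limit is about half of system RAM, while ttm_dma32_pages_limit is
capped at roughly 2 GiB. Paraphrased from memory rather than quoted verbatim,
something like:

	si_meminfo(&si);

	/* general pool limit: roughly 50% of system RAM */
	num_pages = ((u64)si.totalram * si.mem_unit) >> PAGE_SHIFT;
	num_pages /= 2;						/* -> ttm_pages_limit */

	/* the DMA32 accounting is capped at about 2 GiB */
	num_dma32 = (u64)(si.totalram - si.totalhigh) * si.mem_unit >> PAGE_SHIFT;
	num_dma32 = min(num_dma32, 2UL << (30 - PAGE_SHIFT));	/* -> ttm_dma32_pages_limit */

which is why running with use_dma32 == true makes it so easy to stay above
ttm_dma32_pages_limit and spin in the swapout loop above.
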
>>>
>>> So why is use_dma32 enabled with nokaslr? Some more printk()s give this
>>> result:
>>>
>>> The GPUs:
>>> built-in:
>>> 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
>>>         Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c5)
>>> discrete:
>>> 03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23
>>>         [Radeon RX 6600/6600 XT/6600M] (rev c3)
>>>
>>> With nokaslr:
>>> [ 1.266517] [ T328] dma_addressing_limited: mask = 0xfffffffffff bus_dma_limit = 0x0 required_mask = 0xfffffffff
>>> [ 1.266519] [ T328] dma_addressing_limited: ops = 0000000000000000 use_dma_iommu(dev) = 0
>>> [ 1.266520] [ T328] dma_direct_all_ram_mapped: returning true
>>> [ 1.266521] [ T328] dma_addressing_limited: returning ret = 0
>>> [ 1.266521] [ T328] amdgpu 0000:03:00.0: amdgpu: amdgpu_ttm_init: calling ttm_device_init() with use_dma32 = 0
>>> [ 1.266525] [ T328] entering ttm_device_init, use_dma32 = 0
>>> [ 1.267115] [ T328] entering ttm_pool_init, use_dma32 = 0
>>>
>>> [ 3.965669] [ T328] dma_addressing_limited: mask = 0xfffffffffff bus_dma_limit = 0x0 required_mask = 0x3fffffffffff
>>> [ 3.965671] [ T328] dma_addressing_limited: returning true
>>> [ 3.965672] [ T328] amdgpu 0000:08:00.0: amdgpu: amdgpu_ttm_init: calling ttm_device_init() with use_dma32 = 1
>>> [ 3.965674] [ T328] entering ttm_device_init, use_dma32 = 1
>>> [ 3.965747] [ T328] entering ttm_pool_init, use_dma32 = 1
>>>
>>> Without nokaslr:
>>> [ 1.300907] [ T351] dma_addressing_limited: mask = 0xfffffffffff bus_dma_limit = 0x0 required_mask = 0xfffffffff
>>> [ 1.300909] [ T351] dma_addressing_limited: ops = 0000000000000000 use_dma_iommu(dev) = 0
>>> [ 1.300910] [ T351] dma_direct_all_ram_mapped: returning true
>>> [ 1.300910] [ T351] dma_addressing_limited: returning ret = 0
>>> [ 1.300911] [ T351] amdgpu 0000:03:00.0: amdgpu: amdgpu_ttm_init: calling ttm_device_init() with use_dma32 = 0
>>> [ 1.300915] [ T351] entering ttm_device_init, use_dma32 = 0
>>> [ 1.301210] [ T351] entering ttm_pool_init, use_dma32 = 0
>>>
>>> [ 4.000602] [ T351] dma_addressing_limited: mask = 0xfffffffffff bus_dma_limit = 0x0 required_mask = 0xfffffffffff
>>> [ 4.000603] [ T351] dma_addressing_limited: ops = 0000000000000000 use_dma_iommu(dev) = 0
>>> [ 4.000604] [ T351] dma_direct_all_ram_mapped: returning true
>>> [ 4.000605] [ T351] dma_addressing_limited: returning ret = 0
>>> [ 4.000606] [ T351] amdgpu 0000:08:00.0: amdgpu: amdgpu_ttm_init: calling ttm_device_init() with use_dma32 = 0
>>> [ 4.000610] [ T351] entering ttm_device_init, use_dma32 = 0
>>> [ 4.000687] [ T351] entering ttm_pool_init, use_dma32 = 0
>>>
>>> So with nokaslr the required mask for the built-in GPU changes from
>>> 0xfffffffffff to 0x3fffffffffff, which causes dma_addressing_limited() to
>>> return true, which in turn causes ttm_device_init() to be called with
>>> use_dma32 = true.
>>
>> Thanks, this is really the root cause, from what I understand.
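
For reference, dma_addressing_limited() has nothing 32-bit specific in it; it
only compares the device's DMA mask (and bus_dma_limit) against
dma_get_required_mask() and then checks whether dma-direct can reach all of
RAM. Roughly, paraphrased from kernel/dma/mapping.c (check your tree for the
exact code):

bool dma_addressing_limited(struct device *dev)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	/* "limited" just means the device cannot address everything the
	 * platform says it needs to reach */
	if (min_not_zero(dma_get_mask(dev), dev->bus_dma_limit) <
	    dma_get_required_mask(dev))
		return true;

	/* otherwise check whether dma-direct can reach all of RAM without
	 * bounce buffering */
	if (unlikely(ops) || use_dma_iommu(dev))
		return false;
	return !dma_direct_all_ram_mapped(dev);
}

In your nokaslr log the 44-bit device mask (0xfffffffffff) is simply smaller
than the 46-bit required_mask (0x3fffffffffff), which is enough to make this
return true.
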
>>> It also shows that for the discrete GPU nothing changes, so the bug does
>>> not occur there.
>>>
>>> I was also able to work around the bug by calling ttm_device_init() with
>>> use_dma32=false from amdgpu_ttm_init()
>>> (drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c), but I'm not sure if this has
>>> unwanted side effects.
>>>
>>> int amdgpu_ttm_init(struct amdgpu_device *adev)
>>> {
>>> 	uint64_t gtt_size;
>>> 	int r;
>>>
>>> 	mutex_init(&adev->mman.gtt_window_lock);
>>>
>>> 	dma_set_max_seg_size(adev->dev, UINT_MAX);
>>> 	/* No others user of address space so set it to 0 */
>>> 	dev_info(adev->dev, "%s: calling ttm_device_init() with use_dma32 = 0 ignoring %d\n",
>>> 		 __func__, dma_addressing_limited(adev->dev));
>>> 	r = ttm_device_init(&adev->mman.bdev, &amdgpu_bo_driver, adev->dev,
>>> 			    adev_to_drm(adev)->anon_inode->i_mapping,
>>> 			    adev_to_drm(adev)->vma_offset_manager,
>>> 			    adev->need_swiotlb,
>>> 			    false /* use_dma32 */);
>>> 	if (r) {
>>> 		DRM_ERROR("failed initializing buffer object driver(%d).\n", r);
>>> 		return r;
>>> 	}
>>>
>>
>> I think this brings us really close. Instead of forcing use_dma32 to false,
>> I wonder if we need something like
>>
>> 	uint64_t dma_bits = fls64(dma_get_mask(adev->dev));
>>
>> and pass the last argument to ttm_device_init() (use_dma32) as
>> dma_bits < 32?
>>
>> Thanks,
>> Balbir Singh
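
Spelled out, that suggestion would look something like this in
amdgpu_ttm_init() (untested; whether the comparison should be "< 32" or
"<= 32" is exactly the kind of detail that needs checking — "<=" is used
below so that a plain 32-bit mask still selects the DMA32 pool):

	/* sketch only, not a tested patch: derive use_dma32 from the width
	 * of the device's DMA mask instead of from dma_addressing_limited() */
	int dma_bits = fls64(dma_get_mask(adev->dev));

	r = ttm_device_init(&adev->mman.bdev, &amdgpu_bo_driver, adev->dev,
			    adev_to_drm(adev)->anon_inode->i_mapping,
			    adev_to_drm(adev)->vma_offset_manager,
			    adev->need_swiotlb,
			    dma_bits <= 32 /* use_dma32 */);
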
>
> Do these address bits have to shift when using nokaslr or PCI_P2PDMA? I
> think this shift causes the increase of the required_dma_mask to
> 0x3fffffffffff.
>

That depends on the DMA ops. As per dma-api.rst:

	dma_get_required_mask(struct device *dev)

	This API returns the mask that the platform requires to operate
	efficiently. Usually this means the returned mask is the minimum
	required to cover all of memory.

I think the assumption that dma_addressing_limited() returning true (because
the device's dma_mask is smaller than the required_mask) implies
use_dma32 = true is incorrect.

> @@ -104,4 +104,4 @@
> fe30300000-fe303fffff : 0000:04:00.0
> fe30400000-fe30403fff : 0000:04:00.0
> fe30404000-fe30404fff : 0000:04:00.0
> -afe00000000-affffffffff : 0000:03:00.0
> +3ffe00000000-3fffffffffff : 0000:03:00.0
>
> And what memory is this? It's 8G in size, so it could be the RAM of the
> discrete GPU (which is at PCI 0000:03:00.0), but that is already here (part
> of /proc/iomem):
>

I think the mask is independent of what is mapped there; all it says is that
the device needs to address up to 46 bits in the mask.

Balbir Singh
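
PS: just to put numbers on the 46-bit point — this is plain arithmetic on the
values from the logs and the /proc/iomem diff above, nothing kernel-specific:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		/* values taken from the nokaslr dmesg output for 0000:08:00.0 */
		uint64_t dev_mask      = 0xfffffffffffULL;   /* device DMA mask, 44 bits     */
		uint64_t required_mask = 0x3fffffffffffULL;  /* matches the new resource end */

		/* 64 - clz gives the index of the highest set bit + 1, i.e. fls64() */
		int dev_bits = 64 - __builtin_clzll(dev_mask);
		int req_bits = 64 - __builtin_clzll(required_mask);

		printf("device mask covers %d bits, required mask needs %d bits\n",
		       dev_bits, req_bits);
		printf("mask check (first test in dma_addressing_limited()): %s\n",
		       dev_mask < required_mask ? "true" : "false");
		return 0;
	}

which prints 44 and 46: the device cannot cover the 46-bit required mask, but
that has nothing to do with needing the 32-bit DMA pool.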