On 3/26/25 21:10, Bert Karwatzki wrote:
> On Wednesday, 26.03.2025 at 12:50 +1100, Balbir Singh wrote:
>> On 3/26/25 10:43, Balbir Singh wrote:
>>> On 3/26/25 10:21, Bert Karwatzki wrote:
>>>> On Wednesday, 26.03.2025 at 09:45 +1100, Balbir Singh wrote:
>>>>>
>>>>>
>>>>> The second region seems to be additional, I suspect that is HMM mapping
>>>>> from kgd2kfd_init_zone_device()
>>>>>
>>>>> Balbir Singh
>>>>>
>>>> Good guess! I inserted a printk into kgd2kfd_init_zone_device():
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>> index d05d199b5e44..201220e2ac42 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>> @@ -1049,6 +1049,8 @@ int kgd2kfd_init_zone_device(struct amdgpu_device *adev)
>>>>  		pgmap->range.end = res->end;
>>>>  		pgmap->type = MEMORY_DEVICE_PRIVATE;
>>>>  	}
>>>> +	dev_info(adev->dev, "%s: range.start = 0x%llx ranges.end = 0x%llx\n",
>>>> +		 __func__, pgmap->range.start, pgmap->range.end);
>>>>
>>>>  	pgmap->nr_range = 1;
>>>>  	pgmap->ops = &svm_migrate_pgmap_ops;
>>>>
>>>>
>>>> and get this in the case without nokaslr:
>>>>
>>>> [ T367] amdgpu 0000:03:00.0: kfd_migrate: kgd2kfd_init_zone_device: range.start = 0xafe00000000 ranges.end = 0xaffffffffff
>>>>
>>>> and this in the case with nokaslr:
>>>>
>>>> [ T365] amdgpu 0000:03:00.0: kfd_migrate: kgd2kfd_init_zone_device: range.start = 0x3ffe00000000 ranges.end = 0x3fffffffffff
>>>>
>>>
>>> So we should ignore the second region then for the purposes of this issue.
>>>
>>> I think this now boils down to
>>>
>>> Why is the dma_get_required_mask set to all of addressable memory (46 bits)
>>> when we have nokaslr?
>>>
>>
>> I think I know the root cause of the required_mask going up and hence the
>> use of DMA32
>>
>> 1. HMM calls add_pages()
>> 2. add_pages calls update_end_of_memory_vars()
>> 3. This updates max_pfn and that causes required_mask to go up to 46 bits
>>
>> Do you have CONFIG_HSA_AMD_SVM enabled? Does turning it off fix the issue?
>>
>> The actual issue is the update of max_pfn.
>>
>> Balbir Singh
>>
>
> Yes, turning off CONFIG_HSA_AMD_SVM fixes the issue; the strange memory
> resource
> afe00000000-affffffffff : 0000:03:00.0
> is gone.
>
> If one added a max_phys_addr argument to devm_request_free_mem_region()
> (which returns the resource addr in kgd2kfd_init_zone_device()), one could keep
> the memory below the 44-bit limit with CONFIG_HSA_AMD_SVM enabled.
>
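
The mask arithmetic in the quoted discussion can be illustrated with a small userspace sketch. It assumes the required mask is obtained by rounding the highest address implied by max_pfn up to the next all-ones mask, which is roughly how the kernel's direct-mapping path derives it. The function names below are hypothetical stand-ins, not kernel symbols, and the two addresses are the range.end values from the log output above.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* 1-based position of the highest set bit, 0 for x == 0.
 * Hypothetical userspace stand-in modelled on the kernel's fls64(). */
int fls64_sketch(uint64_t x)
{
	int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

/* Smallest all-ones mask that still covers max_addr (assumed model). */
uint64_t required_mask_sketch(uint64_t max_addr)
{
	int bits = fls64_sketch(max_addr);

	if (bits == 0)
		return 0;
	if (bits >= 64)
		return UINT64_MAX;
	return (UINT64_C(1) << bits) - 1;
}

/* Print the mask width for the two range ends reported in the thread. */
void print_required_masks(void)
{
	const uint64_t range_end[] = {
		UINT64_C(0xaffffffffff),   /* without nokaslr */
		UINT64_C(0x3fffffffffff),  /* with nokaslr */
	};

	for (size_t i = 0; i < sizeof(range_end) / sizeof(range_end[0]); i++)
		printf("end 0x%" PRIx64 " -> %d-bit mask 0x%" PRIx64 "\n",
		       range_end[i], fls64_sketch(range_end[i]),
		       required_mask_sketch(range_end[i]));
}
```

Under this model the range ending at 0xaffffffffff needs a 44-bit mask, while the one ending at 0x3fffffffffff needs 46 bits, which is consistent with the 44-bit limit and the 46-bit required_mask mentioned above.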
Thanks for reporting the result. Does this patch work?

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 01ea7c6df303..14f42f8012ab 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -968,8 +968,9 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 	WARN_ON_ONCE(ret);
 
 	/* update max_pfn, max_low_pfn and high_memory */
-	update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
-				  nr_pages << PAGE_SHIFT);
+	if (!params->pgmap)
+		update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
+					  nr_pages << PAGE_SHIFT);
 
 	return ret;
 }

It basically prevents max_pfn from moving when the inserted memory is
zone_device.

FYI: It's a test patch and will still create issues if the amount of
physically present memory is very high, because the driver needs to enable
use_dma32 in that case.

If you could try this with everything back to the original config, with both
kaslr and nokaslr, that would be very helpful.

Thanks,
Balbir Singh
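
The intended effect of the test patch can be modelled in userspace. This is only a sketch under stated assumptions: the struct, the global and the function names are hypothetical simplifications (the real add_pages() takes a struct mhp_params and also updates max_low_pfn and high_memory), and page-frame numbers are used directly instead of shifted byte addresses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the hotplug parameters; only records
 * whether the range belongs to a device pagemap (ZONE_DEVICE). */
struct params_sketch {
	bool has_pgmap;
};

/* Simplified model of the kernel's max_pfn. */
uint64_t max_pfn_sketch;

/* Raise max_pfn_sketch if the new range ends beyond it. */
void update_end_sketch(uint64_t start_pfn, uint64_t nr_pages)
{
	uint64_t end_pfn = start_pfn + nr_pages;

	if (end_pfn > max_pfn_sketch)
		max_pfn_sketch = end_pfn;
}

/* Model of the patched logic: device-private ranges no longer
 * move the end-of-memory marker. */
void add_pages_sketch(uint64_t start_pfn, uint64_t nr_pages,
		      const struct params_sketch *params)
{
	if (!params->has_pgmap)
		update_end_sketch(start_pfn, nr_pages);
}
```

With this model, adding the nokaslr range from the thread (start PFN 0x3ffe00000, 0x200000 pages, assuming 4 KiB pages) with has_pgmap set leaves max_pfn_sketch unchanged, while the same call without a pagemap raises it to 0x400000000.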