On 12/16/25 9:36 PM, Christian König wrote:
On 12/16/25 11:08, Donet Tom wrote:
The GPU page tables are 4k in size no matter what the CPU page size is, and 
there is some special handling so that we can allocate them even under memory 
pressure. The background is that you sometimes need to split up higher order 
pages (1G, 2M) into lower order pages (2M, 4k), for example to be able to swap 
things to system memory, and for that you need an extra layer of page tables.

The problem now is that those 4k pages are rounded up to your CPU page size, 
which both wastes quite a bit of memory and messes up the special handling 
that keeps us from running into OOM situations when swapping things to system 
memory...
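
To put a rough number on the first issue, assuming a 64k CPU page size (an 
assumption for illustration, e.g. an ARM64 or PPC64 kernel built with 64k 
pages):

         64k / 4k = 16, so each 4k GPU page table occupies a full 64k
         allocation, i.e. 15/16 (~94%) of that memory is wasted.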

What we could potentially do is to switch to 64k pages on the GPU as well (the 
HW is flexible enough to be re-configurable), but that is tons of changes and 
probably not easily testable.

If possible, could you share the steps to change the hardware page size? I can 
try testing it on our system.
Just typing this down off the top of my head, so don't nail me on 100% 
correctness.

Modern HW, e.g. gfx9/Vega and newer including all MI* products, has a maximum 
of 48 bits of address space.

Those 48 bits are divided over multiple page directories (PDs) and a leaf page 
table (PT).

IIRC the vm_block_size module parameter controls the size of the PDs. If you 
set that to 13 instead of the default 9 you should already get 64k PDs instead 
of 4k PDs. But take that with a grain of salt; I think we haven't tested that 
parameter in the last 10 years or so.
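
If you want to experiment with it, the parameter can be given on the kernel 
command line or at module load time, e.g. (hypothetical invocation, untested 
as said above):

         # on the kernel command line
         amdgpu.vm_block_size=13

         # or when (re)loading the module
         modprobe amdgpu vm_block_size=13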

Then each page directory entry on level 0 (PDE0) has a field called block 
fragment size (see AMDGPU_PDE_BFS for MI products). This controls how much 
memory each page table entry (PTE) finally points to.
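
For reference, the encoding lives in amdgpu_vm.h and looks roughly like this 
(from memory, check your tree):

         /* PDE Block Fragment Size for VEGA10 */
         #define AMDGPU_PDE_BFS(a) ((uint64_t)a << 59)

Assuming the field uses the same log2-of-4k convention as the PTE fragment 
size (where the default of 9 means 2^(12+9) = 2M), a BFS value of 4 should 
give 2^(12+4) = 64k per PTE.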

So putting it all together you should be able to have a configuration with two 
levels of PDs, each covering 13 bits of address space and 64k in size, plus a 
PT covering 18 bits of address space and 2M in size, where each PTE points to 
a 64k block.
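
As a quick sanity check on those sizes, assuming 8-byte entries:

         64k PD:   65536 / 8 =   8192 entries -> 13 bits of index
         2M  PT: 2097152 / 8 = 262144 entries -> 18 bits of index
         each PTE -> 64k block (16 bits), so one PT spans 2^(18+16) = 16G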

Here are the relevant bits from the function amdgpu_vm_adjust_size():
...
         tmp = roundup_pow_of_two(adev->vm_manager.max_pfn);
         if (amdgpu_vm_block_size != -1)
                 tmp >>= amdgpu_vm_block_size - 9;
         tmp = DIV_ROUND_UP(fls64(tmp) - 1, 9) - 1;
         adev->vm_manager.num_level = min_t(unsigned int, max_level, tmp);
         switch (adev->vm_manager.num_level) {
         case 3:
                 adev->vm_manager.root_level = AMDGPU_VM_PDB2;
                 break;
         case 2:
                 adev->vm_manager.root_level = AMDGPU_VM_PDB1;
                 break;
         case 1:
                 adev->vm_manager.root_level = AMDGPU_VM_PDB0;
                 break;
         default:
                 dev_err(adev->dev, "VMPT only supports 2~4+1 levels\n");
         }
         /* block size depends on vm size and hw setup */
         if (amdgpu_vm_block_size != -1)
                 adev->vm_manager.block_size =
                         min((unsigned)amdgpu_vm_block_size, max_bits
                             - AMDGPU_GPU_PAGE_SHIFT
                             - 9 * adev->vm_manager.num_level);
         else if (adev->vm_manager.num_level > 1)
                 adev->vm_manager.block_size = 9;
         else
                 adev->vm_manager.block_size = amdgpu_vm_get_block_size(tmp);

         if (amdgpu_vm_fragment_size == -1)
                 adev->vm_manager.fragment_size = fragment_size_default;
         else
                 adev->vm_manager.fragment_size = amdgpu_vm_fragment_size;
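
If you want to sanity-check that arithmetic without booting a kernel, here is 
a minimal standalone sketch of the calculation above. The inputs are 
assumptions for a gfx9-like setup (max_pfn = 2^36, i.e. 48 bits of VA with 4k 
GPU pages, max_level = 3, max_bits = 48), and GPU_PAGE_SHIFT and fls64_bits() 
are just stand-ins for AMDGPU_GPU_PAGE_SHIFT and the kernel's fls64(), so 
don't take it as driver code:

#include <stdio.h>
#include <stdint.h>

#define GPU_PAGE_SHIFT 12	/* stands in for AMDGPU_GPU_PAGE_SHIFT */

/* gives the same result as the kernel's fls64() for the values used here */
static unsigned int fls64_bits(uint64_t x)
{
	unsigned int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

int main(void)
{
	uint64_t max_pfn = 1ULL << 36;	/* 2^48 bytes / 4k GPU pages */
	int vm_block_size = 13;		/* instead of the default 9 */
	unsigned int max_level = 3, max_bits = 48;
	unsigned int num_level, block_size, cap;
	uint64_t tmp;

	tmp = max_pfn;			/* already a power of two here */
	if (vm_block_size != -1)
		tmp >>= vm_block_size - 9;
	/* DIV_ROUND_UP(fls64(tmp) - 1, 9) - 1 */
	tmp = ((fls64_bits(tmp) - 1) + 9 - 1) / 9 - 1;

	num_level = tmp < max_level ? tmp : max_level;

	/* the block size is capped by what is left of the address space */
	cap = max_bits - GPU_PAGE_SHIFT - 9 * num_level;
	block_size = (unsigned int)vm_block_size < cap ? vm_block_size : cap;

	printf("num_level = %u, block_size = %u\n", num_level, block_size);
	return 0;
}

Running it shows how the min() against the remaining address bits can clamp 
whatever vm_block_size you ask for, so the value that actually sticks depends 
on max_bits and the resulting number of levels.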


Thanks Christian

I will try it.


...

But again, that is probably tons of work, since the AMDGPU_PAGE_SIZE macro 
needs to change as well, and I'm not sure whether the FW internally assumes 
4k pages somewhere.

Regards,
Christian.
