Hi Christian,

Can you elaborate on the mirror-on-demand/userfaultfd idea?

userfaultfd is a way for user space to take over page fault handling of a 
user-registered range. At first look, it seems you want a user space page fault 
handler to mirror a large chunk of memory to the GPU. I would imagine this 
handler lives in the UMD, because the whole purpose of the system SVM allocator 
is to let the user use a CPU address (such as a malloc'ed one) in a GPU program 
without any extra driver API call. So the registration and mirroring of this 
large chunk can't be in the user program. With this, I picture the sequence 
below:

During process initialization, the UMD registers a large chunk (let's say 1GiB) 
of memory using userfaultfd. This includes (a rough C sketch follows the list):

  1.  mem = mmap(NULL, 1GiB, MAP_ANON)
  2.  register the range [mem, mem + 1GiB] through userfaultfd
  3.  after that, the UMD can wait on page fault events. When a page fault 
happens, the UMD calls vm_bind to mirror the [mem, mem+1GiB] range to the GPU
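
For concreteness, a minimal sketch of that sequence in C. Error handling is 
omitted, the fault itself is not resolved (a real handler would also need 
UFFDIO_COPY or similar to unblock the faulting thread), and xe_vm_bind_mirror() 
is a hypothetical placeholder for the vm_bind call, not an existing uAPI:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define CHUNK_SIZE (1ULL << 30)  /* 1 GiB */

static void umd_register_chunk(void)
{
    long uffd = syscall(SYS_userfaultfd, O_CLOEXEC);

    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    /* step 1: reserve the 1 GiB chunk */
    void *mem = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* step 2: register [mem, mem + 1GiB] for missing-page events */
    struct uffdio_register reg = {
        .range = { .start = (uint64_t)(uintptr_t)mem, .len = CHUNK_SIZE },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    /* step 3: wait for a CPU page fault in the chunk, then mirror to GPU */
    struct uffd_msg msg;
    while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
        if (msg.event == UFFD_EVENT_PAGEFAULT) {
            /* hypothetical: mirror the whole chunk to the GPU VM */
            /* xe_vm_bind_mirror(vm_fd, mem, CHUNK_SIZE); */
            break;
        }
    }
}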

Now in a user program:
                ptr = malloc(size);
                submit a GPU program which uses ptr

This is what I can picture. It doesn't work, because ptr is returned by malloc 
and is not guaranteed to fall in the [mem, mem+1GiB] range, so you can't 
vm_bind/mirror ptr on demand to the GPU.

Also, the page fault event in step 3 above can't happen at all. A page fault 
only happens when the *CPU* accesses mem, but in our case it could be that 
*only the GPU* touches the memory.

The point is, with the system SVM allocator, the user can use *any* valid CPU 
address in a GPU program. This address can be anything in the range [0~2^57-1]. 
This design requirement is quite simple and clean. I don't see how to solve it 
with userfaultfd/on-demand mirroring.

Regards,
Oak

From: Christian König <christian.koe...@amd.com>
Sent: Thursday, February 29, 2024 4:41 AM
To: Zeng, Oak <oak.z...@intel.com>; Danilo Krummrich <d...@redhat.com>; Dave 
Airlie <airl...@redhat.com>; Daniel Vetter <dan...@ffwll.ch>; Felix Kuehling 
<felix.kuehl...@amd.com>; jgli...@redhat.com
Cc: Welty, Brian <brian.we...@intel.com>; dri-devel@lists.freedesktop.org; 
intel...@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bo...@intel.com>; 
Ghimiray, Himal Prasad <himal.prasad.ghimi...@intel.com>; 
thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana 
<niranjana.vishwanathap...@intel.com>; Brost, Matthew 
<matthew.br...@intel.com>; Gupta, saurabhg <saurabhg.gu...@intel.com>
Subject: Re: Making drm_gpuvm work across gpu devices

On 28.02.24 20:51, Zeng, Oak wrote:


The mail wasn’t indented/prefaced correctly; I have manually reformatted it.


From: Christian König <christian.koe...@amd.com>
Sent: Tuesday, February 27, 2024 1:54 AM
To: Zeng, Oak <oak.z...@intel.com>; Danilo Krummrich <d...@redhat.com>; Dave 
Airlie <airl...@redhat.com>; Daniel Vetter <dan...@ffwll.ch>; Felix Kuehling 
<felix.kuehl...@amd.com>; jgli...@redhat.com
Cc: Welty, Brian <brian.we...@intel.com>; dri-devel@lists.freedesktop.org; 
intel...@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bo...@intel.com>; 
Ghimiray, Himal Prasad <himal.prasad.ghimi...@intel.com>; 
thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana 
<niranjana.vishwanathap...@intel.com>; Brost, Matthew 
<matthew.br...@intel.com>; Gupta, saurabhg <saurabhg.gu...@intel.com>
Subject: Re: Making drm_gpuvm work across gpu devices

Hi Oak,
On 23.02.24 21:12, Zeng, Oak wrote:
Hi Christian,

I'm going back to this old email to ask a question.

Sorry, I totally missed that one.



Quote from your email:
“Those ranges can then be used to implement the SVM feature required for higher 
level APIs and not something you need at the UAPI or even inside the low level 
kernel memory management.”
“SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have 
any influence on the design of the kernel UAPI.”

There are two categories of SVM:

1.       Driver SVM allocator: this is implemented in user space, e.g., 
cudaMallocManaged (CUDA), zeMemAllocShared (L0) or clSVMAlloc (OpenCL). Intel 
already has gem_create/vm_bind in xekmd, and our UMD implements clSVMAlloc and 
zeMemAllocShared on top of gem_create/vm_bind. A range A..B of the process 
address space is mapped into a range C..D of the GPU address space, exactly as 
you said.

2.       System SVM allocator: this doesn't introduce an extra driver API for 
memory allocation. Any valid CPU virtual address can be used directly and 
transparently in a GPU program without any extra driver API call. Quoting the 
kernel's Documentation/vm/hmm.rst: “Any application memory region (private 
anonymous, shared memory, or regular file backed memory) can be used by a 
device transparently” and “to share the address space by duplicating the CPU 
page table in the device page table so the same address points to the same 
physical memory for any valid main memory address in the process address 
space”. With the system SVM allocator, we don't need that A..B to C..D mapping. 
(A short code contrast of the two categories follows.)
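
For illustration, a minimal contrast of the two categories from the 
application's point of view. The clSVMAlloc call is real OpenCL 2.0 API; the 
ctx and size parameters are just assumed to exist in the caller:

#include <CL/cl.h>   /* for clSVMAlloc (OpenCL 2.0+) */
#include <stdlib.h>

void svm_examples(cl_context ctx, size_t size)
{
    /* 1) driver SVM allocator: explicit allocation through the runtime/UMD,
     *    which backs it with gem_create/vm_bind under the hood. */
    void *p1 = clSVMAlloc(ctx, CL_MEM_READ_WRITE, size, 0);

    /* 2) system SVM allocator: no driver API call at all; any valid CPU
     *    pointer (malloc'ed, stack, mmap'ed file, ...) can be used by a
     *    GPU program transparently. */
    void *p2 = malloc(size);

    /* both p1 and p2 would be passed to GPU kernels; only p1 required a
     * driver-specific allocation call. */
    (void)p1; (void)p2;
}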

It looks like you were talking of 1). Were you?

No, even when you fully mirror the whole address space from a process into the 
GPU you still need to enable this somehow with an IOCTL.

And while enabling this you absolutely should specify to which part of the 
address space this mirroring applies and where it maps to.


[Zeng, Oak]
Let's say we have a hardware platform where both CPU and GPU support a 57-bit 
virtual address range (I use 57 bits as an example; the statement applies to 
any address range). How do you decide “which part of the address space this 
mirroring applies” to? You have to mirror the whole address space [0~2^57-1], 
don't you? As you designed it, the gigantic vm_bind/mirroring happens at 
process initialization time, and at that time you don't know which part of the 
address space will be used for the GPU program. Remember, for the system 
allocator, *any* valid CPU address can be used for a GPU program. If you add an 
offset to [0~2^57-1], you get an address outside the 57-bit address range. Is 
this a valid concern?

Well, you can perfectly well mirror on demand. You just need something similar 
to userfaultfd() for the GPU. This way you don't need to mirror the full 
address space, but can rather work with large chunks created on demand, let's 
say 1GiB or something like that.
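
A very rough sketch of what such on-demand chunk mirroring could look like on 
the kernel side. This is purely illustrative; struct gpu_vm and the helper 
functions are made up for the sketch and are not existing kernel or driver 
APIs:

#include <linux/types.h>

struct gpu_vm;  /* hypothetical per-process GPU VM */

/* hypothetical helpers, not existing APIs */
bool gpu_vm_range_is_mirrored(struct gpu_vm *vm, u64 start, u64 size);
int  gpu_vm_mirror_range(struct gpu_vm *vm, u64 start, u64 size);
int  gpu_vm_populate(struct gpu_vm *vm, u64 addr);

#define MIRROR_CHUNK_SIZE (1ULL << 30)          /* 1 GiB chunks */

static int gpu_handle_pagefault(struct gpu_vm *vm, u64 fault_addr)
{
	u64 start = fault_addr & ~(MIRROR_CHUNK_SIZE - 1);
	int ret;

	/* chunk already mirrored? then only fault in the accessed pages */
	if (gpu_vm_range_is_mirrored(vm, start, MIRROR_CHUNK_SIZE))
		return gpu_vm_populate(vm, fault_addr);

	/* otherwise create the 1 GiB mirror range on demand, 1:1 with the
	 * CPU address space, then fault in the accessed pages */
	ret = gpu_vm_mirror_range(vm, start, MIRROR_CHUNK_SIZE);
	if (ret)
		return ret;
	return gpu_vm_populate(vm, fault_addr);
}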

The virtual address space is basically just a hardware functionality to route 
memory accesses. While the mirroring approach is a very common use case for 
data centers and high performance computing, there are quite a number of 
different use cases which make use of the virtual address space in a 
non-"standard" fashion. The native context approach for VMs is just one 
example; databases and emulators are another.




I see the system SVM allocator as just a special case of the driver allocator 
where no fully backed buffer objects are allocated, but rather sparse ones 
which are filled and migrated on demand.


[Zeng, Oak]
The above statement is true to me. We don't have a BO for the system SVM 
allocator. It is a sparse one in the sense that we can sparsely map a VMA to 
the GPU. Our migration policy decides which pages/how much of the VMA is 
migrated/mapped to the GPU page table.

The difference between your view and mine is that you want a gigantic VMA 
(created during the gigantic vm_bind) to be sparsely populated to the GPU, 
while I thought of a VMA (xe_vma in the xekmd code) as a place to store memory 
attributes (such as caching, user-preferred placement, etc.). All those memory 
attributes are range based, i.e., the user can specify that range1 is cached 
while range2 is uncached. So I don't see how you can manage that with the 
gigantic VMA. Do you split your gigantic VMA later to store range-based memory 
attributes?

Yes, exactly that. I mean, the splitting and eventual merging of ranges is 
standard functionality of the GPUVM code.

So when you need to store additional attributes per range, I would strongly 
suggest making use of this splitting and merging functionality as well.
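
A hedged sketch of one way a driver could hang such range-based attributes off 
the GPUVM-managed ranges, following the existing pattern of embedding struct 
drm_gpuva in a driver-private structure. The field names below are made up:

#include <drm/drm_gpuvm.h>
#include <linux/types.h>

struct my_svm_range {
	struct drm_gpuva gpuva;		/* split/merged by the drm_gpuvm code */

	/* per-range attributes set by the user; a sub-range with different
	 * attributes is simply split off into its own my_svm_range */
	u32 cache_mode;			/* e.g. cached vs. uncached */
	u32 preferred_placement;	/* e.g. VRAM vs. system memory */
};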

So basically an IOCTL which says that range A..B of the GPU address space is 
mapped to offset X of the CPU address space with parameters Y (caching, 
migration behavior, etc.). That is essentially the same as what we have for 
mapping GEM objects; the provider of the backing store is just something 
different.
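
Spelled out as a hypothetical uAPI payload, that could look roughly like the 
struct below. This is illustrative only and not an existing Xe or amdgpu ioctl:

#include <linux/types.h>

/* "map range A..B of the GPU address space to offset X of the CPU address
 * space with parameters Y" */
struct drm_example_svm_bind {
	__u64 gpu_addr;		/* A: start of the GPU VA range */
	__u64 range;		/* B - A: size of the range */
	__u64 cpu_offset;	/* X: CPU VA the range mirrors */
	__u32 cache_mode;	/* Y: caching behavior */
	__u32 migration_mode;	/* Y: migration/placement policy */
};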

Regards,
Christian.



Regards,
Oak


Regards,
Christian.



