Hi Dave,

Let me step back. When I wrote "shared virtual address space b/t cpu and all 
gpu devices is a hard requirement for our system allocator design", I meant 
that this is not only Intel's design requirement. Rather, it is a common 
requirement for Intel, AMD and Nvidia alike. Take a look at the CUDA driver 
API definition of cuMemAllocManaged (search for this API on 
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM),
 which says: 

"The pointer is valid on the CPU and on all GPUs in the system that support 
managed memory."

This means the program's virtual address space is shared between the CPU and 
all GPU devices on the system. The system allocator we are discussing is just 
one step beyond cuMemAllocManaged: it allows plain malloc'ed memory to be 
shared between the CPU and all GPU devices.
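
To make the difference concrete, here is a rough sketch against the CUDA 
driver API (error handling omitted; gpu_use() is just a hypothetical stand-in 
for submitting GPU work that dereferences the pointer, not a real API):

  #include <cuda.h>      /* CUDA driver API */
  #include <stdint.h>
  #include <stdlib.h>

  /* Hypothetical helper: submits GPU work that dereferences ptr. */
  extern void gpu_use(void *ptr, size_t bytes);

  static void example(size_t bytes)
  {
      /* Today: the managed pointer returned by the driver API is valid on
       * the CPU and on all GPUs in the system that support managed memory. */
      CUdeviceptr managed;
      cuMemAllocManaged(&managed, bytes, CU_MEM_ATTACH_GLOBAL);
      gpu_use((void *)(uintptr_t)managed, bytes);
      cuMemFree(managed);

      /* System allocator: a plain malloc'ed pointer (or a stack variable,
       * or a global) is directly usable in the GPU program, with no
       * special allocation call at all. */
      void *plain = malloc(bytes);
      gpu_use(plain, bytes);
      free(plain);
  }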

I hope we all agree with this point.

With that, I agree with Christian that in the kmd we should make the driver 
code per-device based instead of managing all devices in one driver instance. 
Our system allocator (and xekmd in general) follows this rule: we make xe_vm 
per-device based - one device is *not* aware of another device's address 
space, as I explained in my previous email. I started this email thread 
seeking one drm_gpuvm instance covering all GPU devices. I have given up that 
approach (at least for now) per Danilo's and Christian's feedback: we will 
continue to have per-device drm_gpuvm instances. I hope this is aligned with 
Christian, but I will have to wait for Christian's reply to my previous email.
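
In other words, the per-process picture looks roughly like this (a conceptual 
sketch with made-up names, not actual xekmd or drm_gpuvm code):

  #include <stdint.h>

  struct gpu_device;                /* opaque handle for one GPU device */

  /* One VM object per (process, device) pair. Every VM mirrors the same
   * process virtual address range, but it is tied to exactly one device
   * and holds no reference to other devices or their VMs. */
  struct svm_vm {
      struct gpu_device *dev;       /* the single device this VM serves */
      uint64_t va_start;            /* start of the mirrored CPU VA range */
      uint64_t va_size;             /* size of the mirrored CPU VA range */
  };

  /* The driver creates one such VM (backed by one drm_gpuvm instance) per
   * device; nothing aggregates multiple devices behind a single VM. */
  struct svm_vm *svm_vm_create(struct gpu_device *dev,
                               uint64_t va_start, uint64_t va_size);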

I hope this clarifies things a little.

Regards,
Oak 

> -----Original Message-----
> From: dri-devel <dri-devel-boun...@lists.freedesktop.org> On Behalf Of David
> Airlie
> Sent: Wednesday, January 24, 2024 8:25 PM
> To: Zeng, Oak <oak.z...@intel.com>
> Cc: Ghimiray, Himal Prasad <himal.prasad.ghimi...@intel.com>;
> thomas.hellst...@linux.intel.com; Winiarski, Michal
> <michal.winiar...@intel.com>; Felix Kuehling <felix.kuehl...@amd.com>; Welty,
> Brian <brian.we...@intel.com>; Shah, Ankur N <ankur.n.s...@intel.com>; dri-
> de...@lists.freedesktop.org; intel...@lists.freedesktop.org; Gupta, saurabhg
> <saurabhg.gu...@intel.com>; Danilo Krummrich <d...@redhat.com>; Daniel
> Vetter <dan...@ffwll.ch>; Brost, Matthew <matthew.br...@intel.com>; Bommu,
> Krishnaiah <krishnaiah.bo...@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathap...@intel.com>; Christian König
> <christian.koe...@amd.com>
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> >
> >
> > For us, Xekmd doesn't need to know whether it is running on bare metal
> > or in a virtualized environment. Xekmd is always a guest driver. All the
> > virtual addresses used in xekmd are guest virtual addresses. For SVM, we
> > require all the VF devices to share one single address space with the
> > guest CPU program. So any design that works in a bare-metal environment
> > automatically works in a virtualized environment. +@Shah, Ankur N
> > +@Winiarski, Michal to back me up if I am wrong.
> >
> >
> >
> > Again, shared virtual address space b/t cpu and all gpu devices is a hard
> > requirement for our system allocator design (which means malloc’ed memory,
> > cpu stack variables, globals can be directly used in gpu program. Same
> > requirement as kfd SVM design). This was aligned with our user space
> > software stack.
> 
> Just to make a very general point here (I'm hoping you listen to
> Christian a bit more and hoping he replies in more detail), but just
> because you have a system allocator design done, it doesn't in any way
> enforce the requirements on the kernel driver to accept that design.
> Bad system design should be pushed back on, not enforced in
> implementation stages. It's a trap Intel falls into regularly since
> they say well we already agreed this design with the userspace team
> and we can't change it now. This isn't acceptable. Design includes
> upstream discussion and feedback; if you, say, misdesigned the system
> allocator (and I'm not saying you definitely have), and this is
> pushing back on that, then you have to go fix your system
> architecture.
> 
> KFD was an experiment like this, I pushed back on AMD at the start
> saying it was likely a bad plan, we let it go and got a lot of
> experience in why it was a bad design.
> 
> Dave.
