On 2016-11-23 03:51 AM, Christian König wrote:
> On 2016-11-23 08:49, Daniel Vetter wrote:
>> On Tue, Nov 22, 2016 at 01:21:03PM -0800, Dan Williams wrote:
>>> On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter <daniel at ffwll.ch> wrote:
>>>> On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch
>>>> <serguei.sagalovitch at amd.com> wrote:
>>>>> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>>>>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams
>>>>>> <dan.j.williams at intel.com> wrote:
>>>>>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>>>>>>> <serguei.sagalovitch at amd.com> wrote:
>>>>>>>> I personally like the "device-DAX" idea, but my concerns are:
>>>>>>>>
>>>>>>>> - How well will it co-exist with the DRM infrastructure /
>>>>>>>>   implementations, in the part dealing with CPU pointers?
>>>>>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>>>>>> that you can perform pfn_to_page() on it and issue I/O, but the vma
>>>>>>> is not migratable. To be honest I do not know how well that
>>>>>>> co-exists with drm infrastructure.
>>>>>>>
>>>>>>>> - How well will we be able to handle the case when we need to
>>>>>>>>   "move"/"evict" memory/data to a new location, so the CPU pointer
>>>>>>>>   should point to the new physical location/address
>>>>>>>>   (which may not be in PCI device memory at all)?
>>>>>>> So, device-DAX deliberately avoids support for in-kernel migration
>>>>>>> or overcommit. Those cases are left to the core mm or drm. The
>>>>>>> device-dax interface is for cases where all that is needed is a
>>>>>>> direct mapping to a statically-allocated physical-address range, be
>>>>>>> it persistent memory or some other special reserved memory range.
>>>>>> For some of the fancy use-cases (e.g. to be comparable to what HMM
>>>>>> can pull off) I think we want all the magic in core mm, i.e.
>>>>>> migration and overcommit. At least that seems to be the very strong
>>>>>> drive in all general-purpose gpu abstractions and implementations,
>>>>>> where memory is allocated with malloc, and then mapped/moved into
>>>>>> vram/gpu address space through some magic,
>>>>> It is possible that it is the other way around: memory is requested
>>>>> to be allocated and should be kept in vram for performance reasons,
>>>>> but due to a possible overcommit case we need, at least temporarily,
>>>>> to "move" such an allocation to system memory.
>>>> With migration I meant migrating both ways of course. And with stuff
>>>> like numactl we can also influence where exactly the malloc'ed memory
>>>> is allocated originally, at least if we'd expose the vram range as a
>>>> very special numa node that happens to be far away and not hold any
>>>> cpu cores.
>>> I don't think we should be using numa distance to reverse engineer a
>>> certain allocation behavior. The latency data should be truthful, but
>>> you're right we'll need a mechanism to keep general purpose
>>> allocations out of that range by default. Btw, strict isolation is
>>> another design point of device-dax, but I think in this case we're
>>> describing something between the two extremes of full isolation and
>>> full compatibility with existing numactl apis.
>> Yes, agreed. My idea with exposing vram sections using numa nodes
>> wasn't to reuse all the existing allocation policies directly; those
>> won't work. So at boot-up your default numa policy would exclude any
>> vram nodes.
>>
>> But I think (as an -mm layman) that numa gives us a lot of the tools
>> and policy interface that we need to implement what we want for gpus.
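To make the numactl/libnuma side of that idea a bit more concrete, here is
a minimal user-space sketch. It assumes the vram range showed up as NUMA
node 2 (the node number is made up) and that the default policy already
keeps general-purpose allocations off that node; build with -lnuma.

/* Sketch only: node 2 stands in for a hypothetical vram node. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    /* Explicitly place one buffer on the (hypothetical) vram node. */
    size_t sz = 1 << 20;
    void *vram_buf = numa_alloc_onnode(sz, 2);
    if (!vram_buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    memset(vram_buf, 0, sz);    /* touch the pages so they are faulted in on that node */

    /* Ordinary allocations keep following the default policy, which
     * would not include the vram node. */
    void *sys_buf = malloc(sz);

    free(sys_buf);
    numa_free(vram_buf, sz);
    return 0;
}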
>
> Agree completely. From a ten-mile-high view our GPUs are just command
> processors with local memory as well.
>
> Basically this is also the whole idea of what AMD has been pushing with
> HSA for a while.
>
> It's just that a lot of problems start to pop up when you look at all
> the nasty details. For example, only part of the GPU memory is usually
> accessible by the CPU.
>
> So even when numa nodes expose a good foundation for this, I think
> there is still a lot of code to write.
>
> BTW: I should probably start to read into the numa code of the kernel.
> Any good pointers for that?
I would assume that the "page" allocation logic itself should be inside
the graphics driver, due to possibly different requirements, especially
from graphics: alignment, etc.
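On the device-DAX interface Dan described further up: from user space it
is just a character device that gets mmap'ed to obtain the direct mapping.
A minimal sketch follows; the /dev/dax0.0 path and the 2M mapping size are
assumptions for illustration (device-dax instances want suitably aligned
mapping sizes).

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/dax0.0", O_RDWR);
    if (fd < 0) {
        perror("open /dev/dax0.0");
        return 1;
    }

    /* Direct window onto the statically reserved physical range:
     * no page cache, and the resulting vma is not migratable. */
    size_t len = 2UL << 20;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    memset(p, 0, len);

    munmap(p, len);
    close(fd);
    return 0;
}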
>
> Regards,
> Christian.
>
>> Wrt isolation: There's a sliding scale of what different users expect,
>> from full-auto everything, including migrating pages around if needed,
>> to full isolation; all of it seems to be on the table. As long as we
>> keep vram nodes out of any default allocation numasets, full isolation
>> should be possible.
>> -Daniel
>

Sincerely yours,
Serguei Sagalovitch