On 11/30/2012 10:07 PM, Jerome Glisse wrote:
> On Fri, Nov 30, 2012 at 09:35:49PM +0100, Thomas Hellstrom wrote:
>> On 11/30/2012 08:25 PM, Jerome Glisse wrote:
>>> On Fri, Nov 30, 2012 at 07:31:04PM +0100, Thomas Hellstrom wrote:
>>>> On 11/30/2012 07:07 PM, Jerome Glisse wrote:
>>>>> On Fri, Nov 30, 2012 at 06:43:36PM +0100, Thomas Hellstrom wrote:
>>>>>> On 11/30/2012 06:18 PM, Jerome Glisse wrote:
>>>>>>> On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom <thomas at shipmail.org> wrote:
>>>>>>>> On 11/30/2012 05:30 PM, Jerome Glisse wrote:
>>>>>>>>> On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom <thomas at shipmail.org> wrote:
>>>>>>>>>> On 11/29/2012 10:58 PM, Marek Olšák wrote:
>>>>>>>>>>> What I tried to point out was that the synchronization shouldn't be needed, because the CPU shouldn't do anything with the contents of evicted buffers. The GPU moves the buffers, not the CPU. What does the CPU do besides updating some kernel structures?
>>>>>>>>>>>
>>>>>>>>>>> Also, buffer deletion is something where you don't need to wait for the buffer to become idle if you know the memory area won't be mapped by the CPU, ever. The memory can be reclaimed right away. It would be the GPU to move new data in, and once that happens, the old buffer will be trivially idle, because single-ring GPUs execute commands in order.
>>>>>>>>>>>
>>>>>>>>>>> Marek
>>>>>>>>>> Actually, asynchronous eviction / deletion is something I have been prototyping for a while but never gotten around to implementing in TTM.
>>>>>>>>>>
>>>>>>>>>> There are a few minor caveats:
>>>>>>>>>>
>>>>>>>>>> With buffer deletion, what you say is true for fixed memory, but not for TT memory, where pages are reclaimed by the system after buffer destruction. That means that we don't have to wait for idle to free GPU space, but we need to wait before pages are handed back to the system.
>>>>>>>>>>
>>>>>>>>>> Swapout needs to access the contents of evicted buffers, but synchronizing doesn't need to happen until just before swapout.
>>>>>>>>>>
>>>>>>>>>> Multi-ring / CPU support: If another ring / engine or the CPU is about to move buffer contents into VRAM or a GPU aperture that was previously evicted by another ring, it needs to sync with that eviction, but it doesn't know what buffer or even which buffers occupied the space previously. Trivially, one can attach a sync object to the memory type manager that represents the last eviction from that memory type, and *any* engine (CPU or GPU) that moves buffer contents in needs to order that movement with respect to that fence. As you say, with a single ring and no CPU fallbacks, that ordering is a no-op, but any common (non-driver based) implementation needs to support this.
>>>>>>>>>>
>>>>>>>>>> A single fence attached to the memory type manager is the simplest solution, but a solution with a fence for each free region in the free list is also possible. Then TTM needs a driver callback to be able to order fences w.r.t. each other.
>>>>>>>>>>
>>>>>>>>>> /Thomas
>>>>>>>>>>
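To make the "sync object attached to the memory type manager" idea concrete, here is a minimal sketch in pseudo-kernel C. Every name below (my_mem_manager, my_fence, the helpers) is made up for illustration and is not existing TTM or driver API; the helpers are assumed to be NULL-safe and return negative errno on failure:

#include <linux/spinlock.h>

/* Hypothetical types and helpers -- not the real TTM structures. */
struct my_fence;

struct my_fence *my_fence_get(struct my_fence *f);
void my_fence_put(struct my_fence *f);
int my_fence_wait(struct my_fence *f);
int my_ring_add_semaphore_wait(int ring, struct my_fence *f);

struct my_mem_manager {
	spinlock_t lock;
	struct my_fence *last_evict_fence; /* last eviction from this memory type */
};

/* Record the fence of the latest eviction from this memory type. */
static void my_manager_note_eviction(struct my_mem_manager *man,
				     struct my_fence *evict_fence)
{
	spin_lock(&man->lock);
	/* Ordering the old fence against the new one is the driver's
	 * problem; see the order_fences() discussion further down. */
	my_fence_put(man->last_evict_fence);
	man->last_evict_fence = my_fence_get(evict_fence);
	spin_unlock(&man->lock);
}

/* Called before *any* engine (GPU ring, or CPU with dst_ring < 0) moves
 * buffer contents into space owned by this memory type. */
static int my_manager_sync_move_in(struct my_mem_manager *man, int dst_ring)
{
	struct my_fence *fence;
	int ret = 0;

	spin_lock(&man->lock);
	fence = my_fence_get(man->last_evict_fence);
	spin_unlock(&man->lock);

	if (!fence)
		return 0;

	if (dst_ring < 0)
		ret = my_fence_wait(fence);	/* CPU fallback: block */
	else
		ret = my_ring_add_semaphore_wait(dst_ring, fence); /* GPU barrier */

	my_fence_put(fence);
	return ret;
}

With a single ring and no CPU fallbacks the barrier degenerates to a no-op, as noted above.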
>>>>>>>>> Radeon already handles multi-ring and TTM interaction with what we call semaphores. Semaphores are created to synchronize with fences across different rings. I think the easiest solution is to just remove the bo wait in TTM and let the driver handle this.
>>>>>>>> The wait can be removed, but only conditioned on a driver flag that says it supports asynchronous buffer moves.
>>>>>>>>
>>>>>>>> The multi-ring case I'm talking about is:
>>>>>>>>
>>>>>>>> Ring 1 evicts buffer A, emits fence 0
>>>>>>>> Ring 2 evicts buffer B, emits fence 1
>>>>>>>> ...Other evictions take place by various rings, perhaps including ring 1 and ring 2.
>>>>>>>> Ring 3 moves buffer C into the space which happens to be the union of the space previously occupied by buffer A and buffer B.
>>>>>>>>
>>>>>>>> The question is: which fence do you want to order this move with? The answer is whichever of fence 0 and fence 1 signals last.
>>>>>>>>
>>>>>>>> I think it's a reasonable thing for TTM to keep track of this, but in order to do so it needs a driver callback that can order two fences, and can order a job in the current ring w.r.t. a fence. In radeon's case that driver callback would probably insert a barrier / semaphore. In the case of simpler hardware it would wait on one of the fences.
>>>>>>>>
>>>>>>>> /Thomas
>>>>>>>>
>>>>>>> I don't think we can order fences easily with a clean API. I would rather see TTM provide a list of fences to the driver and tell the driver that before moving this object, all the fences on this list need to be completed. I think it's as easy as associating fences with drm_mm (well, nouveau has its own mm stuff), but the idea would basically be that fences are associated both with the bo and with the mm object, so you know when a segment of memory is idle/available for use.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Jerome
>>>>>> Hmm. Agreed, that would save a lot of barriers.
>>>>>>
>>>>>> Even if TTM tracks fences by free mm regions or a single fence for the whole memory type, it's a simple fact that fences from the same ring are trivially ordered, which means such a list should contain at most as many fences as there are rings.
>>>>> Yes, one function callback is needed to know which fence is necessary. Also, TTM needs to know the number of rings (note that I think newer hw will have something like 1024 rings or even more; even today's hw might have as many, as I think an nvidia channel is pretty much what I define to be a ring).
>>>>>
>>>>> But I think the most common case will be a few fences across a few rings. Like 1 ring is the dma ring, and then you have a ring for one of the GL contexts that is using the memory and another ring for the new context that wants to use the memory.
>>>>>
>>>>>> So, whatever approach is chosen, TTM needs to be able to determine that trivial ordering, and I think the upcoming cross-device fencing work will face the exact same problem.
>>>>>>
>>>>>> My proposed ordering API would look something like
>>>>>>
>>>>>> struct fence *order_fences(struct fence *fence_a, struct fence *fence_b, bool trivial_order, bool interruptible, bool no_wait_gpu)
>>>>>>
>>>>>> Returns whichever of the fences @fence_a and @fence_b that, when signaled, guarantees that the other fence has also signaled. If @trivial_order is true and the driver cannot trivially order the fences, it may return ERR_PTR(-EAGAIN). If @interruptible is true, any wait should be performed interruptibly, and if @no_wait_gpu is true, the function is not allowed to wait on the GPU but should return ERR_PTR(-EBUSY) if it needs to do so to order the fences.
>>>>>>
>>>>>> (Hardware without semaphores can't order fences without waiting on them.)
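To make those semantics concrete, a minimal sketch of what a driver-side implementation might look like for fences that carry a ring id and an increasing per-ring sequence number. The struct layout and helper names below are made up for illustration, not existing TTM, radeon, or fence API, and the sketch assumes hardware without semaphores:

#include <linux/types.h>
#include <linux/err.h>
#include <linux/errno.h>

/* Hypothetical driver fence: a ring id plus a per-ring increasing seqno. */
struct my_fence {
	unsigned ring;
	u64 seq;
};

bool my_fence_signaled(struct my_fence *f);		    /* assumed helper */
int my_fence_wait(struct my_fence *f, bool interruptible); /* assumed helper */

/* Return the fence that, once signaled, guarantees the other one has
 * signaled too, following the semantics proposed above. */
struct my_fence *my_order_fences(struct my_fence *a, struct my_fence *b,
				 bool trivial_order, bool interruptible,
				 bool no_wait_gpu)
{
	int ret;

	/* Same ring: fences signal in submission order, so the higher
	 * sequence number is the later one. */
	if (a->ring == b->ring)
		return (a->seq >= b->seq) ? a : b;

	/* A fence that has already signaled is ordered before anything. */
	if (my_fence_signaled(a))
		return b;
	if (my_fence_signaled(b))
		return a;

	/* Cross-ring, both still busy: no trivial ordering possible
	 * without semaphores. */
	if (trivial_order)
		return ERR_PTR(-EAGAIN);
	if (no_wait_gpu)
		return ERR_PTR(-EBUSY);

	/* Wait for one fence; once it has signaled, the other one is by
	 * definition the one that signals last. */
	ret = my_fence_wait(a, interruptible);
	if (ret)
		return ERR_PTR(ret);
	return b;
}

A driver with semaphores, like radeon, would presumably insert a barrier / semaphore in the cross-ring case instead of waiting, as mentioned earlier in the thread.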
>>>>>>
>>>>>> The list approach you suggest would use @trivial_order = true; the single fence approach would use @trivial_order = false.
>>>>>>
>>>>>> And a first simple implementation in TTM would perhaps use your list approach with a single list for the whole memory type.
>>>>>>
>>>>>> /Thomas
>>>>> I would rather add a callback like:
>>>>>
>>>>> ttm_reduce_fences(unsigned *nfences, fence **fencearray)
>>>> I don't agree here. I think the fence ordering function is more versatile and a good abstraction that can be applied to this problem in a number of cases. Anyway, we should sync this with Maarten and his fence work. The same problem applies to attaching shared fences to a bo.
>>> Radeon already handles the bo case on multi-ring, and I think it should be left to the driver to do what's necessary there.
>>>
>>> I don't think more versatility is bad as such, but I am pretty sure that no hw can give fence ordering, and I also think that if the driver has to track multi-ring fence ordering it will just waste resources for no good reason. For tracking that in radeon I would need to keep a list of semaphores and know which semaphores ensure synchronization between which rings and, for each, which fences are concerned; just thinking about it, it would be a messy graph with tons of nodes.
>>>
>>> Of course, here I am thinking in terms of newer GPUs with tons of rings. GPUs with one ring, which are a vanishing category as newer OpenCL requires more rings, are easy to handle, but I really don't think we should design for those.
>> The biggest problem I see with ttm_reduce_fences() is that it seems to have high complexity, since it doesn't know which fence is the new one. And if it did, it would only be a multi-fence version of order_fences(trivial=true).
> I am sure all drivers with multi-ring will store the ring id in their fence structure; that is, from my pov, a requirement. So when you get a list of fences, you first go over the fences that are on the same ring, and so far all fence implementations use an increasing sequence number per fence. So reducing becomes as easy as only leaving the most recent fence for each ring that has an active fence. At the same time it could check (assuming it's a quick operation) whether a fence is already signaled or not.
>
> For radeon this function would be very small, a simple nested loop with a couple of tests in the loop.
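As a rough illustration of that nested loop, a reduce step along the lines Jerome describes might look like the following. It reuses the made-up fence layout and helpers from the earlier sketch, mirrors the suggested callback signature with hypothetical names, and ignores reference counting and locking for brevity:

/* Hypothetical in-place reduce: keep at most one (the newest, still
 * unsignaled) fence per ring. */
static void my_reduce_fences(unsigned *nfences, struct my_fence **fences)
{
	unsigned i, j, n = *nfences;

	for (i = 0; i < n; ) {
		/* Drop fences that have already signaled. */
		if (my_fence_signaled(fences[i])) {
			fences[i] = fences[--n];
			continue;
		}
		/* Keep only the most recent fence for this ring. */
		for (j = i + 1; j < n; ) {
			if (fences[j]->ring == fences[i]->ring) {
				if (fences[j]->seq > fences[i]->seq)
					fences[i] = fences[j];
				fences[j] = fences[--n];
				continue;
			}
			j++;
		}
		i++;
	}
	*nfences = n;
}

As written, this is the pairwise scan that the complexity remark below refers to.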
What you describe sounds like an O(n²) complexity algorithm, whereas an algorithm based on order_fences is O(n).

Thanks,
Thomas
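One possible reading of that O(n) claim, sketched under the same assumptions as the earlier snippets (single fence per memory type manager, hypothetical names, locking and refcounting omitted): every new eviction fence is folded into the manager fence with one order_fences() call, so n evictions cost n calls rather than a pairwise scan.

/* Uses the hypothetical my_mem_manager and my_order_fences() from the
 * earlier sketches. */
#include <linux/err.h>

static int my_manager_add_eviction(struct my_mem_manager *man,
				   struct my_fence *new_fence,
				   bool interruptible, bool no_wait_gpu)
{
	struct my_fence *ordered;

	if (!man->last_evict_fence) {
		man->last_evict_fence = new_fence;
		return 0;
	}

	/* Non-trivial ordering allowed: the driver may insert a semaphore
	 * or wait if the two fences are on different rings. */
	ordered = my_order_fences(man->last_evict_fence, new_fence,
				  false, interruptible, no_wait_gpu);
	if (IS_ERR(ordered))
		return PTR_ERR(ordered);

	/* Once 'ordered' has signaled, both the previous manager fence and
	 * the new eviction fence are guaranteed to have signaled. */
	man->last_evict_fence = ordered;
	return 0;
}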