On 11/30/2012 07:07 PM, Jerome Glisse wrote:
> On Fri, Nov 30, 2012 at 06:43:36PM +0100, Thomas Hellstrom wrote:
>> On 11/30/2012 06:18 PM, Jerome Glisse wrote:
>>> On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom <thomas at shipmail.org> wrote:
>>>> On 11/30/2012 05:30 PM, Jerome Glisse wrote:
>>>>> On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom <thomas at shipmail.org> wrote:
>>>>>> On 11/29/2012 10:58 PM, Marek Olšák wrote:
>>>>>>> What I tried to point out was that the synchronization shouldn't be needed, because the CPU shouldn't do anything with the contents of evicted buffers. The GPU moves the buffers, not the CPU. What does the CPU do besides updating some kernel structures?
>>>>>>>
>>>>>>> Also, buffer deletion is something where you don't need to wait for the buffer to become idle if you know the memory area won't be mapped by the CPU, ever. The memory can be reclaimed right away. It would be the GPU to move new data in, and once that happens, the old buffer will be trivially idle, because single-ring GPUs execute commands in order.
>>>>>>>
>>>>>>> Marek
>>>>>> Actually, asynchronous eviction / deletion is something I have been prototyping for a while but never gotten around to implementing in TTM.
>>>>>>
>>>>>> There are a few minor caveats:
>>>>>>
>>>>>> With buffer deletion, what you say is true for fixed memory, but not for TT memory, where pages are reclaimed by the system after buffer destruction. That means that we don't have to wait for idle to free GPU space, but we need to wait before pages are handed back to the system.
>>>>>>
>>>>>> Swapout needs to access the contents of evicted buffers, but synchronizing doesn't need to happen until just before swapout.
>>>>>>
>>>>>> Multi-ring - CPU support: If another ring / engine or the CPU is about to move buffer contents into VRAM or a GPU aperture that was previously evicted by another ring, it needs to sync with that eviction, but it doesn't know what buffer or even which buffers occupied the space previously. Trivially, one can attach a sync object to the memory type manager that represents the last eviction from that memory type, and *any* engine (CPU or GPU) that moves buffer contents in needs to order that movement with respect to that fence. As you say, with a single ring and no CPU fallbacks, that ordering is a no-op, but any common (non-driver-based) implementation needs to support this.
>>>>>>
>>>>>> A single fence attached to the memory type manager is the simplest solution, but a solution with a fence for each free region in the free list is also possible. Then TTM needs a driver callback to be able to order fences w.r.t. each other.
>>>>>>
>>>>>> /Thomas
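
(To make the single-fence variant above a bit more concrete, here's a rough sketch of what I mean. None of this is existing TTM code; the struct, the helpers and the fence_get()/fence_put()/fence_wait() calls are just placeholders for whatever the generic fence API ends up providing.)

/*
 * Rough sketch only -- nothing here exists in TTM today; all names
 * are made up for illustration.
 */
struct sketch_mem_type_manager {
        spinlock_t lock;
        struct fence *last_eviction; /* last eviction from this memory type */
};

/* Called when an eviction from this memory type has been scheduled. */
static void sketch_note_eviction(struct sketch_mem_type_manager *man,
                                 struct fence *eviction_fence)
{
        struct fence *old;

        spin_lock(&man->lock);
        old = man->last_eviction;
        /*
         * With a single ring this simple replace is enough, since the new
         * eviction fence signals after the old one.  With multiple rings
         * we'd need the driver to combine the two fences (keep whichever
         * signals last), which is exactly where an ordering callback or a
         * per-ring fence list comes in.
         */
        man->last_eviction = fence_get(eviction_fence);
        spin_unlock(&man->lock);
        if (old)
                fence_put(old);
}

/*
 * Any engine (or the CPU) about to move buffer contents *into* this
 * memory type orders the move after the last eviction.  A GPU with
 * semaphores would emit a barrier instead of waiting here.
 */
static int sketch_sync_move_in(struct sketch_mem_type_manager *man,
                               bool interruptible)
{
        struct fence *f;
        int ret = 0;

        spin_lock(&man->lock);
        f = man->last_eviction ? fence_get(man->last_eviction) : NULL;
        spin_unlock(&man->lock);

        if (f) {
                ret = fence_wait(f, interruptible); /* assumed: 0 or -errno */
                fence_put(f);
        }
        return ret;
}

(The weak spot, and the reason the ordering callback comes up below, is the comment in sketch_note_eviction(): with multiple rings you can't just replace the old fence, you need whichever of the two signals last.)
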
>>>>> Radeon already handles multi-ring and ttm interaction with what we call semaphores. Semaphores are created to synchronize with fences across different rings. I think the easiest solution is to just remove the bo wait in ttm and let the driver handle this.
>>>> The wait can be removed, but only conditioned on a driver flag that says it supports asynchronous buffer moves.
>>>>
>>>> The multi-ring case I'm talking about is:
>>>>
>>>> Ring 1 evicts buffer A, emits fence 0
>>>> Ring 2 evicts buffer B, emits fence 1
>>>> ...Other evictions take place by various rings, perhaps including ring 1 and ring 2.
>>>> Ring 3 moves buffer C into the space which happens to be the union of the space previously occupied by buffer A and buffer B.
>>>>
>>>> Question is: which fence do you want to order this move with?
>>>> The answer is whichever of fence 0 and 1 signals last.
>>>>
>>>> I think it's a reasonable thing for TTM to keep track of this, but in order to do so it needs a driver callback that can order two fences, and can order a job in the current ring w.r.t. a fence. In radeon's case that driver callback would probably insert a barrier / semaphore. In the case of simpler hardware it would wait on one of the fences.
>>>>
>>>> /Thomas
>>>>
>>> I don't think we can order fences easily with a clean api. I would rather see ttm provide a list of fences to the driver and tell the driver that before moving this object, all the fences on this list need to be completed. I think it's as easy as associating fences with drm_mm (well, nouveau has its own mm stuff), but the idea would basically be that fences are associated both with the bo and with the mm object, so you know when a segment of memory is idle/available for use.
>>>
>>> Cheers,
>>> Jerome
>>
>> Hmm. Agreed, that would save a lot of barriers.
>>
>> Even if TTM tracks fences by free mm regions or a single fence for the whole memory type, it's a simple fact that fences from the same ring are trivially ordered, which means such a list should contain at most as many fences as there are rings.
> Yes, one function callback is needed to know which fence is necessary. Also, ttm needs to know the number of rings (note that I think newer hw will have something like 1024 rings or even more; even today's hw might have as many, since I think an nvidia channel is pretty much what I define to be a ring).
>
> But I think most cases will involve a few fences across a few rings. Like: one ring is the dma ring, and then you have a ring for one of the GL contexts that is using the memory and another ring for the new context that wants to use the memory.
>
>> So, whatever approach is chosen, TTM needs to be able to determine that trivial ordering, and I think the upcoming cross-device fencing work will face the exact same problem.
>>
>> My proposed ordering API would look something like
>>
>> struct fence *order_fences(struct fence *fence_a, struct fence *fence_b, bool trivial_order, bool interruptible, bool no_wait_gpu)
>>
>> It returns whichever of the fences @fence_a and @fence_b that, when signaled, guarantees that the other fence has also signaled. If @trivial_order is true and the driver cannot trivially order the fences, it may return ERR_PTR(-EAGAIN). If @interruptible is true, any wait should be performed interruptibly, and if @no_wait_gpu is true, the function is not allowed to wait on the gpu but should return ERR_PTR(-EBUSY) if it needs to do so to order the fences.
>>
>> (Hardware without semaphores can't order fences without waiting on them.)
>>
>> The list approach you suggest would use @trivial_order = true; the single fence approach would use @trivial_order = false.
>>
>> And a first simple implementation in TTM would perhaps use your list approach with a single list for the whole memory type.
>>
>> /Thomas
> I would rather add a callback like:
>
> ttm_reduce_fences(unsigned *nfences, fence **fencearray)

I don't agree here. I think the fence order function is more versatile and a good abstraction that can be applied to this problem in a number of cases. Anyway, we should sync this with Maarten and his fence work; the same problem applies to attaching shared fences to a bo.
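
To make it a bit more concrete, this is roughly the kind of implementation I'd expect from simpler hardware, that is, hardware that can only trivially order fences from the same ring and otherwise has to wait. The driver fence layout and the fence_wait() semantics below are made up for the sketch; it's not existing code:

/*
 * Sketch only: assumes a driver fence that carries a ring id and a
 * per-ring monotonically increasing sequence number.
 */
struct sketch_fence {
        struct fence base;      /* assumed common fence object */
        unsigned int ring;
        uint32_t seqno;
};

static struct fence *sketch_order_fences(struct fence *fence_a,
                                         struct fence *fence_b,
                                         bool trivial_order,
                                         bool interruptible,
                                         bool no_wait_gpu)
{
        struct sketch_fence *a = container_of(fence_a, struct sketch_fence, base);
        struct sketch_fence *b = container_of(fence_b, struct sketch_fence, base);
        long ret;

        /* Fences from the same ring are trivially ordered by seqno. */
        if (a->ring == b->ring)
                return (int32_t)(a->seqno - b->seqno) >= 0 ? fence_a : fence_b;

        /* Different rings and no semaphores: we can only order by waiting. */
        if (trivial_order)
                return ERR_PTR(-EAGAIN);
        if (no_wait_gpu)
                return ERR_PTR(-EBUSY);

        /* Once fence_a has signaled, fence_b signaling implies both have. */
        ret = fence_wait(fence_a, interruptible);   /* assumed: 0 or -errno */
        if (ret)
                return ERR_PTR(ret);
        return fence_b;
}

A driver with semaphores, like radeon, would presumably handle the different-rings case by inserting a semaphore / barrier and returning one of the fences instead of blocking.
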
>
> In the ttm bo move callback you provide the list of mm blocks (each having its array of fences), and the move callback is responsible for using whatever mechanism it wants to properly schedule and synchronize the move.

Agreed.

>
> One thing I am not sure about is whether we should merge free mm blocks and merge/reduce their fence arrays, or whether we should provide a list of mm blocks to the move callback. I think there is a tradeoff here: you probably want to merge small mm blocks up to a certain point, but you don't want to merge so much that any allocation will have to wait on a zillion fences.

As mentioned previously, we can choose the complexity here, but the simplest approach would be to have a single list for the whole manager. I think if we don't merge mm blocks immediately when freed, we're going to use up a lot of resources.

/Thomas
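
PS. To put the "at most as many fences as there are rings" point in code form: whenever free mm blocks are merged (or a single per-manager list is maintained), the combined fence array could be reduced by keeping only the latest fence from each ring. Again just a sketch, reusing the made-up sketch_fence from the sketch above:

#define to_sketch_fence(f) container_of((f), struct sketch_fence, base)

/* Reduce an array of fences to at most one per ring, in place. */
static void sketch_reduce_fences(unsigned *nfences, struct fence **fences)
{
        unsigned n = *nfences, out = 0, i, j;

        for (i = 0; i < n; i++) {
                struct sketch_fence *fi = to_sketch_fence(fences[i]);
                bool superseded = false;

                for (j = 0; j < out; j++) {
                        struct sketch_fence *fj = to_sketch_fence(fences[j]);

                        if (fj->ring != fi->ring)
                                continue;
                        /* Keep whichever of the two is later on this ring. */
                        if ((int32_t)(fi->seqno - fj->seqno) > 0) {
                                fence_put(fences[j]);
                                fences[j] = fences[i];
                        } else {
                                fence_put(fences[i]);
                        }
                        superseded = true;
                        break;
                }
                if (!superseded)
                        fences[out++] = fences[i];
        }
        *nfences = out;
}

Whether that reduction happens eagerly at merge time or lazily when the move callback walks the list is exactly the tradeoff you mention.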