On 11/30/2012 06:18 PM, Jerome Glisse wrote:
> On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom <thomas at shipmail.org> wrote:
>> On 11/30/2012 05:30 PM, Jerome Glisse wrote:
>>> On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom <thomas at shipmail.org> wrote:
>>>> On 11/29/2012 10:58 PM, Marek Olšák wrote:
>>>>>
>>>>> What I tried to point out was that the synchronization shouldn't be
>>>>> needed, because the CPU shouldn't do anything with the contents of
>>>>> evicted buffers. The GPU moves the buffers, not the CPU. What does the
>>>>> CPU do besides updating some kernel structures?
>>>>>
>>>>> Also, buffer deletion is something where you don't need to wait for
>>>>> the buffer to become idle if you know the memory area won't be
>>>>> mapped by the CPU, ever. The memory can be reclaimed right away. It
>>>>> would be the GPU that moves new data in, and once that happens, the
>>>>> old buffer will be trivially idle, because single-ring GPUs execute
>>>>> commands in order.
>>>>>
>>>>> Marek
>>>>
>>>> Actually, asynchronous eviction / deletion is something I have been
>>>> prototyping for a while but never gotten around to implementing in TTM.
>>>>
>>>> There are a few minor caveats:
>>>>
>>>> With buffer deletion, what you say is true for fixed memory, but not
>>>> for TT memory, where pages are reclaimed by the system after buffer
>>>> destruction. That means that we don't have to wait for idle to free
>>>> GPU space, but we need to wait before pages are handed back to the
>>>> system.
>>>>
>>>> Swapout needs to access the contents of evicted buffers, but
>>>> synchronizing doesn't need to happen until just before swapout.
>>>>
>>>> Multi-ring / CPU support: If another ring / engine or the CPU is about
>>>> to move buffer contents into VRAM or a GPU aperture that was
>>>> previously evicted by another ring, it needs to sync with that
>>>> eviction, but it doesn't know what buffer or even which buffers
>>>> occupied the space previously. Trivially, one can attach a sync object
>>>> to the memory type manager that represents the last eviction from that
>>>> memory type, and *any* engine (CPU or GPU) that moves buffer contents
>>>> in needs to order that movement with respect to that fence. As you
>>>> say, with a single ring and no CPU fallbacks, that ordering is a
>>>> no-op, but any common (non-driver-based) implementation needs to
>>>> support this.
>>>>
>>>> A single fence attached to the memory type manager is the simplest
>>>> solution, but a solution with a fence for each free region in the free
>>>> list is also possible. Then TTM needs a driver callback to be able to
>>>> order fences w.r.t. each other.
>>>>
>>>> /Thomas
>>>>
>>> Radeon already handles multi-ring and TTM interaction with what we call
>>> semaphores. Semaphores are created to synchronize with fences across
>>> different rings. I think the easiest solution is to just remove the bo
>>> wait in TTM and let the driver handle this.
>>
>> The wait can be removed, but only conditioned on a driver flag that says
>> it supports unsynchronized buffer moves.
>>
>> The multi-ring case I'm talking about is:
>>
>> Ring 1 evicts buffer A, emits fence 0
>> Ring 2 evicts buffer B, emits fence 1
>> ...Other evictions take place on various rings, perhaps including ring 1
>> and ring 2.
>> Ring 3 moves buffer C into the space which happens to be the union of
>> the space previously occupied by buffer A and buffer B.
>>
>> Question is: which fence do you want to order this move with?
>> The answer is whichever of fence 0 and fence 1 signals last.
>>
>> I think it's a reasonable thing for TTM to keep track of this, but in
>> order to do so it needs a driver callback that can order two fences, and
>> can order a job in the current ring w.r.t. a fence. In radeon's case
>> that driver callback would probably insert a barrier / semaphore. In the
>> case of simpler hardware it would wait on one of the fences.
>>
>> /Thomas
>>
> I don't think we can order fences easily with a clean API. I would
> rather see TTM provide a list of fences to the driver and tell the
> driver that, before moving this object, all the fences on this list need
> to be completed. I think it's as easy as associating fences with drm_mm
> (well, nouveau has its own mm stuff), but the idea would basically be
> that fences are associated both with the bo and with the mm object, so
> you know when a segment of memory is idle/available for use.
>
> Cheers,
> Jerome
Hmm. Agreed, that would save a lot of barriers. Even if TTM tracks fences
per free mm region, or keeps a single fence for the whole memory type, it's
a simple fact that fences from the same ring are trivially ordered, which
means such a list should contain at most as many fences as there are rings.
So, whatever approach is chosen, TTM needs to be able to determine that
trivial ordering, and I think the upcoming cross-device fencing work will
face the exact same problem.

My proposed ordering API would look something like

struct fence *order_fences(struct fence *fence_a, struct fence *fence_b,
                           bool trivial_order, bool interruptible,
                           bool no_wait_gpu);

Returns whichever of the fences @fence_a and @fence_b that, when signaled,
guarantees that the other fence has also signaled. If @trivial_order is
true and the driver cannot trivially order the fences, it may return
ERR_PTR(-EAGAIN). If @interruptible is true, any wait should be performed
interruptibly, and if @no_wait_gpu is true, the function is not allowed to
wait on the GPU but should return ERR_PTR(-EBUSY) if it needs to do so to
order the fences. (Hardware without semaphores can't order fences without
waiting on them.)

The list approach you suggest would use @trivial_order = true; the
single-fence approach would use @trivial_order = false. And a first simple
implementation in TTM would perhaps use your list approach with a single
list for the whole memory type.

/Thomas
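
PS. For the simple-hardware case, a rough sketch of what such a driver
callback might look like under the API proposed above. This is only an
illustration: fence_is_signaled(), fences_on_same_ring(), fence_seq_after()
and driver_wait_fence() are hypothetical stand-ins, not existing TTM or
driver functions.

/* Illustrative sketch only; hypothetical helpers, not actual TTM code. */
static struct fence *simple_order_fences(struct fence *fence_a,
                                         struct fence *fence_b,
                                         bool trivial_order,
                                         bool interruptible,
                                         bool no_wait_gpu)
{
        int ret;

        /*
         * A missing or already signaled fence is trivially ordered:
         * the other fence is the one that signals last.
         */
        if (!fence_a || fence_is_signaled(fence_a))
                return fence_b;
        if (!fence_b || fence_is_signaled(fence_b))
                return fence_a;

        /*
         * Fences emitted on the same ring are ordered by emission:
         * the one with the higher sequence number signals last.
         */
        if (fences_on_same_ring(fence_a, fence_b))
                return fence_seq_after(fence_a, fence_b) ? fence_a : fence_b;

        /* Caller only asked for trivial ordering (the list approach). */
        if (trivial_order)
                return ERR_PTR(-EAGAIN);

        /* No semaphores, so ordering requires a CPU wait on one fence. */
        if (no_wait_gpu)
                return ERR_PTR(-EBUSY);

        ret = driver_wait_fence(fence_a, interruptible);
        if (ret)
                return ERR_PTR(ret);

        /* fence_a has now signaled, so once fence_b signals, both have. */
        return fence_b;
}

A semaphore-capable driver like radeon could presumably avoid the CPU wait
and instead order the upcoming move with a barrier / semaphore, as noted
above.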