On Wed, Aug 13, 2014 at 09:36:04AM -0400, Jerome Glisse wrote:
> On Wed, Aug 13, 2014 at 10:28:22AM +0200, Daniel Vetter wrote:
> > On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
> > > Hi,
> > >
> > > So I went over the whole fence and sync point stuff as it's becoming a pressing issue. I think we first need to agree on what problem we want to solve and what the requirements are to solve it.
> > >
> > > Problem:
> > > Explicit synchronization between different hardware blocks over a buffer object.
> > >
> > > Requirements:
> > > Share common infrastructure.
> > > Allow optimal hardware command stream scheduling across hardware blocks.
> > > Allow Android sync points to be implemented on top of it.
> > > Handle/acknowledge exceptions (like the good old gpu lockup).
> > > Minimize driver changes.
> > >
> > > Glossary:
> > > hardware timeline: timeline bound to a specific hardware block.
> > > pipeline timeline: timeline bound to a userspace rendering pipeline; each point on that timeline can be a composite of several different hardware timeline points.
> > > pipeline: abstract object representing a userspace application's graphics pipeline, i.e. each of the application's graphics operations.
> > > fence: specific point in a timeline where synchronization needs to happen.
> > >
> > >
> > > So now, the current include/linux/fence.h implementation is, I believe, missing the objective by confusing hardware and pipeline timelines and by bolting fences to buffer objects, while what is really needed is a true and proper timeline for both hardware and pipeline. But before going further down that road let me look at things and explain how I see them.
> >
> > Fences can be used free-standing and no one forces you to integrate them with buffers. We actually plan to go this way with the intel svm stuff. Ofc for dma-buf the plan is to synchronize using such fences, but that's somewhat orthogonal I think. At least you only talk about fences and timelines and not dma-buf here.
> >
> > > The current ttm fence has one and a sole purpose: allow synchronization for buffer object moves, even though some drivers like radeon slightly abuse it and use it for things like lockup detection.
> > >
> > > The new fence wants to expose an API that would allow some implementation of a timeline. For that it introduces callbacks and some hard requirements on what the driver has to expose:
> > >   enable_signaling
> > >   [signaled]
> > >   wait
> > >
> > > Each of those has to do work inside the driver to which the fence belongs, and each of those can be called from more or less unexpected contexts (with restrictions, like outside irq). So we end up with things like:
> > >
> > > Process 1                 Process 2                 Process 3
> > > I_A_schedule(fence0)
> > >                           CI_A_F_B_signaled(fence0)
> > > I_A_signal(fence0)
> > >                                                     CI_B_F_A_callback(fence0)
> > > CI_A_F_B_wait(fence0)
> > >
> > > Legend:
> > > I_x: in driver x (I_A == in driver A)
> > > CI_x_F_y: call in driver X from driver Y (CI_A_F_B == call in driver A from driver B)
> > >
> > > So this is a happy mess where everyone calls everyone, and it is bound to get messy. Yes, I know there are all kinds of requirements on what happens once a fence is signaled. But those requirements only look like they are trying to atone for whatever mess can result from the whole callback dance.
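
For reference, the driver-side hooks criticized above look roughly like the sketch below under the include/linux/fence.h scheme being discussed. This is only an illustration: every foo_* name and helper is hypothetical, the other mandatory ops (driver/timeline name callbacks, etc.) are left out, and seqno wraparound is ignored for brevity.

	struct foo_ring;						/* hypothetical hardware block */
	bool foo_ring_arm_seq_irq(struct foo_ring *ring, u32 seq);	/* hypothetical hw helpers */
	u32 foo_ring_read_seq(struct foo_ring *ring);

	struct foo_fence {
		struct fence base;
		struct foo_ring *ring;		/* hw block this fence was emitted on */
		u32 seq;			/* hw sequence number of this fence */
	};
	#define to_foo_fence(f) container_of(f, struct foo_fence, base)

	/* Called with fence->lock held: arm an interrupt so the driver will
	 * eventually call fence_signal() once the hw passes this seqno. */
	static bool foo_enable_signaling(struct fence *fence)
	{
		struct foo_fence *f = to_foo_fence(fence);

		return foo_ring_arm_seq_irq(f->ring, f->seq);
	}

	/* Optional lockless check of the hw-written sequence number. */
	static bool foo_signaled(struct fence *fence)
	{
		struct foo_fence *f = to_foo_fence(fence);

		return foo_ring_read_seq(f->ring) >= f->seq;
	}

	static signed long foo_wait(struct fence *fence, bool intr, signed long timeout)
	{
		return fence_default_wait(fence, intr, timeout);
	}

	static const struct fence_ops foo_fence_ops = {
		.enable_signaling = foo_enable_signaling,
		.signaled = foo_signaled,
		.wait = foo_wait,
		/* .get_driver_name, .get_timeline_name, ... omitted */
	};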
> > >
> > > While I too was seduced by the whole callback idea a long time ago, I think it is a highly dangerous path to take, where the combinatorics of what could happen are bound to explode with the increase in the number of players.
> > >
> > >
> > > So now back to how to solve the problem we are trying to address. First I want to make an observation: almost all GPUs that exist today have a command ring onto which userspace command buffers are executed, and inside the command ring you can do something like:
> > >
> > >   if (condition) execute_command_buffer else skip_command_buffer
> > >
> > > where condition is a simple expression (memory_address cop value), with cop one of the generic comparisons (==, <, >, <=, >=). I think it is a safe assumption that any gpu that slightly matters can do that. Those that can not should fix their command ring processor.
> > >
> > >
> > > With that in mind, I think the proper solution is implementing timelines and having the fence be a timeline object with a way simpler API. For each hardware timeline the driver provides a system memory address at which the latest signaled fence sequence number can be read. Each fence object is uniquely associated with both a hardware and a pipeline timeline. Each pipeline timeline has a wait queue.
> > >
> > > When scheduling something that requires synchronization on a hardware timeline, a fence is created and associated with the pipeline timeline and the hardware timeline. Other hardware blocks that need to wait on a fence can use their command ring conditional execution to directly check the fence sequence number from the other hw block, so you do optimistic scheduling. If optimistic scheduling fails (which would be reported by a hw-block-specific solution and hidden), then things can fall back to a software cpu wait inside what could be considered the kernel thread of the pipeline timeline.
> > >
> > >
> > > From an API point of view there is no inter-driver call. All the driver needs to do is wake up the pipeline timeline wait_queue when things are signaled or when things go sideways (gpu lockup).
> > >
> > >
> > > So how to implement that with current drivers? Well, easy. Currently we assume implicit synchronization, so all we need is an implicit pipeline timeline per userspace process (note this does not prevent inter-process synchronization). Every time a command buffer is submitted it is added to the implicit timeline with the simple fence object:
> > >
> > >   struct fence {
> > >     struct list_head list_hwtimeline;
> > >     struct list_head list_pipetimeline;
> > >     struct hw_timeline *hw_timeline;
> > >     uint64_t seq_num;
> > >     work_t timedout_work;
> > >     void *csdata;
> > >   };
> > >
> > > So with a set of helper functions called by each driver's command execution ioctl you have the implicit timeline properly populated, and each driver's command execution gets its dependencies from the implicit timeline.
> > >
> > >
> > > Of course, to take full advantage of all the flexibility this could offer, we would need to allow userspace to create pipeline timelines and to schedule against the pipeline timeline of their choice. We could create a file for each pipeline timeline and have file operations to wait/query progress.
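
To make the proposal above concrete, here is a rough sketch of what the hardware/pipeline timeline objects, the lockless sequence-number check, the conditional-execution wait and the submission helper could look like. This is only one interpretation of the proposal: apart from the quoted struct fence, every name here (hw_timeline, pipe_timeline, cmd_ring, cmd_ring_emit_cond_exec, fence_signaled, timeline_*) is hypothetical.

	struct cmd_ring;					/* hypothetical consumer hw ring */
	/* Hypothetical: emit "if (*gpu_addr >= value) execute next buffer else skip". */
	void cmd_ring_emit_cond_exec(struct cmd_ring *ring, uint64_t gpu_addr, uint64_t value);

	struct hw_timeline {
		uint64_t *signaled_seq;		/* latest signaled seqno, written by the
						 * hw block to system memory */
		uint64_t signaled_seq_gpu_addr;	/* same location as seen by other hw blocks */
		uint64_t last_emitted_seq;
		bool wedged;			/* set on gpu lockup, see further below */
		struct list_head fences;
	};

	struct pipe_timeline {
		wait_queue_head_t wait_queue;	/* woken on signal or on gpu lockup */
		struct list_head fences;
		struct mutex lock;
	};

	/* Lockless check: plain CPU read of the seqno the hw block wrote to system
	 * memory, compared against the fence's seqno (wraparound ignored for brevity). */
	static bool fence_signaled(struct fence *fence)
	{
		return *fence->hw_timeline->signaled_seq >= fence->seq_num;
	}

	/* Called from each driver's command submission ioctl: emit a fence on the
	 * hardware timeline and queue it on the submitting process's (implicit)
	 * pipeline timeline. */
	static void timeline_add_fence(struct pipe_timeline *pipe, struct hw_timeline *hw,
				       struct fence *fence)
	{
		fence->hw_timeline = hw;
		fence->seq_num = ++hw->last_emitted_seq;

		mutex_lock(&pipe->lock);
		list_add_tail(&fence->list_pipetimeline, &pipe->fences);
		list_add_tail(&fence->list_hwtimeline, &hw->fences);
		mutex_unlock(&pipe->lock);
	}

	/* Optimistic scheduling: a consumer hw block checks the producer's seqno
	 * directly in its command ring instead of blocking on the CPU. */
	static void timeline_emit_wait(struct cmd_ring *ring, struct fence *fence)
	{
		cmd_ring_emit_cond_exec(ring, fence->hw_timeline->signaled_seq_gpu_addr,
					fence->seq_num);
	}

	/* Fallback CPU wait, running in the context of the process (or pipeline
	 * kernel thread) that owns the pipeline timeline -- no callbacks. */
	static long timeline_wait_fence(struct pipe_timeline *pipe, struct fence *fence,
					long timeout)
	{
		return wait_event_interruptible_timeout(pipe->wait_queue,
							fence_signaled(fence), timeout);
	}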
> > >
> > > Note that gpu lockups are considered an exceptional event; the implicit timeline will probably want to continue with other jobs on other hardware blocks, but the explicit one will probably want to decide whether to continue, abort, or retry without the faulty hw block.
> > >
> > >
> > > I realize I am late to the party and that I should have taken a serious look at all this a long time ago. I apologize for that, and if you consider this to be too late then just ignore me, modulo the big warning about the craziness that callbacks will introduce and how bad things are bound to happen. I am not saying that bad things cannot happen with what I propose, just that, because everything happens inside the process context that is the one asking/requiring synchronization, there will be no interprocess kernel callbacks (a callback that was registered by one process and that is called inside another process's time slice because fence signaling happens inside that other process's time slice).
> >
> > So I read through it all and, presuming I understand it correctly, your proposal and what we currently have are about the same. The big difference is that you make the timeline a first-class object and move the callback queue from the fence to the timeline, which requires callers to check the fence/seqno/whatever themselves instead of pushing that responsibility onto the fence code.
>
> No, the big difference is that there is no callback; thus, when waiting for a fence, you are either inside the process context that needs to wait for it or inside a kernel thread's process context. Which means in both cases you can do whatever you want. What I hate about the fence code as it is, is the callback stuff, because you never know in which context fences are signaled, and then you never know in which context callbacks are executed.
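
Continuing the hypothetical sketch from above, the lockup path could be as simple as flagging the hardware timeline and waking the pipeline timeline; the waiter then runs in its own process (or pipeline kernel thread) context and can decide whether to continue on other hardware blocks, abort, or retry without the faulty block, with no cross-driver callback involved. All names besides struct fence remain hypothetical.

	/* Report a gpu lockup on one hardware block, reusing the hw_timeline and
	 * pipe_timeline names from the sketch above. */
	static void hw_timeline_report_lockup(struct hw_timeline *hw,
					      struct pipe_timeline *pipe)
	{
		hw->wedged = true;			/* remember the faulty hw block */
		wake_up_all(&pipe->wait_queue);		/* wake the pipeline timeline */
	}

	/* Waiter side: the wait condition also returns once the hw block is wedged,
	 * so the continue/abort/retry decision is taken by the waiting context. */
	static long timeline_wait_fence_or_lockup(struct pipe_timeline *pipe,
						  struct fence *fence, long timeout)
	{
		return wait_event_interruptible_timeout(pipe->wait_queue,
				fence_signaled(fence) || fence->hw_timeline->wedged,
				timeout);
	}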
Look at waitqueues a bit closer. They're implemented with callbacks ;-) The only difference is that you're allowed to have spurious wakeups and need to handle that somehow, so you need a separate check function.

> > If you actually mandate that the fence is just a seqno or similar which can be read locklessly, then I could register my own special callback into that waitqueue (stuff other than waking up threads is allowed), and from hard-irq context check the seqno and re-add my own callback if that hasn't happened yet (that needs to go through some other context for hilarity).
>
> Yes, mandating a simple number that can be read from anywhere without a lock; I am pretty sure all hw can write to a system page and can write a value alongside their command buffer. So either your hw supports reading and testing the value, or you can do it in atomic context right before scheduling.

Imo that's a step in the wrong direction, since reading a bit of system memory, checking a bit of irq-safe spinlock-protected data, or reading a register shouldn't matter. You just arbitrarily disallow that. And allowing random other kernel subsystems to read mmio or page mappings not under their control is an idea that freaks /me/ out.

> > So from that pov (presuming I didn't miss anything) your proposal is identical to what we have, minus some different color choices (like where to place the callback queue).
>
> No callbacks is the mantra here, and instead of bolting free-living fences to buffer objects, they are associated with a timeline, which means you do not need to go over all buffer objects to know what you need to wait for.

Ok, then I guess I didn't understand that part of your proposal. Can you please elaborate a bit more on how you want to synchronize multiple drivers accessing a dma-buf object, and what piece of state we need to associate with the dma-buf to make this happen?

Thanks, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
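
As a rough sketch of the waitqueue mechanism described above (a custom callback hooked into a waitqueue, with spurious wakeups handled by a separate lockless check), assuming the pipe_timeline and fence_signaled() names from the earlier sketches; everything prefixed my_ is hypothetical, and the types use the 2014-era waitqueue names.

	struct my_waiter {
		wait_queue_t wait;		/* waitqueue entry with a custom wake function */
		struct fence *fence;		/* fence whose seqno we check */
		struct completion done;		/* what we complete once signaled */
	};

	/* Runs in whatever context woke the queue, possibly hardirq.  On a spurious
	 * wakeup the seqno has not advanced far enough: stay on the queue and try
	 * again at the next wakeup. */
	static int my_wake_func(wait_queue_t *wait, unsigned mode, int flags, void *key)
	{
		struct my_waiter *w = container_of(wait, struct my_waiter, wait);

		if (!fence_signaled(w->fence))
			return 0;

		/* Signaled: do the irq-safe follow-up and drop off the queue (the
		 * waitqueue lock is held by the caller, so list_del_init is safe). */
		complete(&w->done);
		list_del_init(&wait->task_list);
		return 1;
	}

	static void my_register_waiter(struct pipe_timeline *pipe,
				       struct my_waiter *w, struct fence *fence)
	{
		w->fence = fence;
		init_completion(&w->done);
		init_waitqueue_func_entry(&w->wait, my_wake_func);
		add_wait_queue(&pipe->wait_queue, &w->wait);
	}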