On Fri, 6 Sep 2019 07:40:17 -0400 Alyssa Rosenzweig <aly...@rosenzweig.io> wrote:
> I think we can simplify `panfrost_flush_draw_deps`. We need to flush
> any BOs that write where we read/write and any BOs that read where we
> write. Since we collect this information via add_bo, we can implement
> this logic generically, without requiring a special case for every
> kind of BO we might need to flush, which is verbose and easy to forget
> when adding new BOs later. You might need some extra tables in
> panfrost_batch.

With the current design, where deps are flushed before issuing a
draw/clear job, the existing add_bo() calls happen too late. This being
said, we could add BOs earlier and store the type of access in
batch->bos (turning it into a hash table where the key is the BO and
the data contains the access flags). With that in place, we'd be able
to automatically add BOs to the ctx->{write,read}_bos hash tables. Now,
if we go for the dep graph solution, that's probably a non-issue, since
deps can be added at any point as long as they are described before the
flush happens.
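To make that a bit more concrete, here is a rough, untested sketch of
what I have in mind. The PAN_BO_ACCESS_* flags and the helper below are
made-up names, and I'm assuming util/hash_table.h for the table itself:

#include <stdint.h>

#include "util/hash_table.h"

/* Access flags stored in the batch->bos entry data (made-up names). */
enum pan_bo_access {
        PAN_BO_ACCESS_READ  = (1 << 0),
        PAN_BO_ACCESS_WRITE = (1 << 1),
};

/* batch->bos would be created with:
 *   _mesa_hash_table_create(batch, _mesa_hash_pointer,
 *                           _mesa_key_pointer_equal);
 */
static void
panfrost_batch_add_bo(struct panfrost_batch *batch,
                      struct panfrost_bo *bo,
                      uint32_t access_flags)
{
        struct hash_entry *entry = _mesa_hash_table_search(batch->bos, bo);

        if (entry) {
                /* BO already tracked by this batch: accumulate flags. */
                entry->data = (void *)((uintptr_t)entry->data | access_flags);
                return;
        }

        /* BO refcounting omitted for brevity. */
        _mesa_hash_table_insert(batch->bos, bo,
                                (void *)(uintptr_t)access_flags);

        /* Since we know the access type here, this is also the place
         * where ctx->{write,read}_bos could be updated automatically,
         * instead of having one special case per BO type in
         * panfrost_flush_draw_deps(). */
}

The draw/clear path would then call this helper as soon as a BO is
attached to the batch, which is early enough to have the dependencies
described before panfrost_flush_draw_deps() runs.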
> ----
>
> On design more generally:
>
> I don't think we want to trigger any flushes at draw time. Rather, we
> want to trigger at flush time. Conceptually, right before we send a
> batch to the GPU, we ensure all of the other batches it needs have
> been sent first and there is a barrier between them (via wait_bo).

I agree, and I actually had this rework on my TODO list.

> The first consequence of delaying is that CPU-side logic can proceed
> without being stalled on results.
>
> The second is that different batches can be _totally_ independent.
> Consider an app that does the following passes:
>
> [FBO 1:   Render a depth map of an object ]
> [FBO 2:   Render a normal map of that object ]
> [Scanout: Render using the depth/normal maps as textures ]
>
> In this case, the app should generate CPU-side batches for all three
> render targets at once. Then, when flush() is called, FBO #1 and
> FBO #2 should be submitted and waited upon so they execute
> concurrently, then scanout is submitted and waited on.

Yes, I also thought about that. We'd need to move the out_sync object
to the batch to make that possible, but that's definitely an
improvement I had in mind.

> This should be a little faster, especially paired with the _NEXT
> changes in the kernel. CC'ing Steven to ensure the principle is sound.

I haven't looked at that patch yet.

> We can model this with a dependency graph, where batches are nodes
> and the dependency of a batch X on a batch Y is represented as an
> edge from Y to X. So this is a directed graph. For well-behaved apps,
> the graph must be acyclic (why?).
>
> This touches on the idea of topological sorting: a topological sort
> of the dependency graph is a valid order to submit the batches in.
> So, hypothetically, you could delay everything to flush time,
> construct the dependency graph, do a topological sort, and then
> submit/wait in the order of the sort.
>
> But more interesting is to extend this to the concurrent FBO case,
> with an algorithm that follows directly from topological sorting:
>
> ---
>
> 0. Create the dependency graph. Cull nodes that are not connected to
> the node we're trying to flush (the scanout batch). In other words,
> reduce the graph to its component containing the flushed node. See
> also
> https://en.wikipedia.org/wiki/Connected_component_(graph_theory)#Algorithms
>
> 1. For each node with no incoming edges (= a batch with no
> dependencies), submit that batch. Remove it from the dependency
> graph, removing all of its outgoing edges. Add it to a set of
> submitted batches.
>
> 2. For each submitted batch, wait on that batch.
>
> 3. Jump back to step #1 until there are no more nodes with no
> incoming edges.
>
> ---
>
> Intuitively, the idea is "submit as much as we can all at once, then
> wait for it. Keep doing that until we have submitted everything we
> need."
>
> A bit more formally, nodes with no incoming edges have no unsatisfied
> dependencies by definition, so we can submit them in any order. We
> choose to submit these first. We are allowed to submit a wait at any
> time. Once we wait on a batch, it is complete, so any batches that
> depend on it have that dependency satisfied, which is represented by
> removing the edge from the dependency graph.
>
> Do note the subtlety of the termination condition: no more nodes with
> no incoming edges. This makes it easy to prove that the algorithm
> halts, since every iteration either removes a node or halts, and
> there is a finite, non-negative number of nodes.
>
> * Whether this is a useful optimization greatly depends on the
> hardware. The Arm guys can chime in here, but I do know the GPU has
> some parallel execution capabilities, so this shouldn't be a total
> waste.

Thanks for the detailed explanation, I'll look into that. This being
said, I was wondering if we shouldn't merge this patch (after
addressing your first comment, maybe) before getting involved in a
more advanced solution (which I agree is what we should aim for).
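For the record, here is roughly how I picture the flush-time loop of
the dep-graph solution, just a sketch to check my understanding: none
of the fields or helpers below (dep_count, dependents, per-batch
out_sync, panfrost_batch_submit/wait, ...) exist in the current code.

#include "util/list.h"
#include "util/u_dynarray.h"

/* Flush-time submission loop following the algorithm above. Everything
 * here is hypothetical.
 */
static void
panfrost_flush_batches(struct panfrost_context *ctx)
{
        /* Step 0 (culling the graph down to the component containing
         * the batch being flushed) is left out for brevity. */

        for (;;) {
                struct util_dynarray submitted;
                util_dynarray_init(&submitted, NULL);

                /* Step 1: submit every batch with no unsatisfied deps.
                 * Those jobs can run concurrently on the GPU. */
                list_for_each_entry_safe(struct panfrost_batch, batch,
                                         &ctx->batches, link) {
                        if (batch->dep_count)
                                continue;

                        panfrost_batch_submit(batch); /* signals batch->out_sync */
                        list_del(&batch->link);
                        util_dynarray_append(&submitted,
                                             struct panfrost_batch *, batch);
                }

                if (!util_dynarray_num_elements(&submitted,
                                                struct panfrost_batch *)) {
                        /* No node with no incoming edges left: we're
                         * done (the graph is acyclic). */
                        util_dynarray_fini(&submitted);
                        break;
                }

                /* Step 2: wait on what we just submitted and drop the
                 * corresponding edges so that dependents become ready. */
                util_dynarray_foreach(&submitted, struct panfrost_batch *, b) {
                        panfrost_batch_wait(*b); /* waits on (*b)->out_sync */

                        util_dynarray_foreach(&(*b)->dependents,
                                              struct panfrost_batch *, dep)
                                (*dep)->dep_count--;
                }

                util_dynarray_fini(&submitted);
                /* Step 3: jump back to step 1. */
        }
}

That's also where moving out_sync to the batch becomes necessary: we
need to wait on each submitted batch individually before releasing the
edges to its dependents.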