I think we can simplify `panfrost_flush_draw_deps`. We need to flush any batch that writes a BO we read or write, and any batch that reads a BO we write. Since we collect this information via add_bo, we can implement this logic generically, without requiring a special case for every kind of BO we might need to flush; special-casing is verbose and easy to forget when adding new BOs later. You might need some extra tables in panfrost_batch.
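To illustrate the generic rule, here is a minimal sketch. The names are hypothetical (`struct batch`, the `reads`/`writes` tables, `MAX_BOS` are illustrative stand-ins, not the real panfrost structures); the point is that one conflict check derived from the recorded access sets replaces per-BO-kind special cases:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tracking tables -- the real panfrost_batch would record
 * this per-BO access information as add_bo() is called. */
#define MAX_BOS 64

struct batch {
        bool reads[MAX_BOS];  /* BO i is read by this batch */
        bool writes[MAX_BOS]; /* BO i is written by this batch */
};

/* One generic rule replaces per-BO-kind special cases: batch `a` must
 * wait for batch `b` iff `b` writes a BO that `a` reads or writes
 * (read-after-write / write-after-write), or `b` reads a BO that `a`
 * writes (write-after-read). */
static bool
batch_depends_on(const struct batch *a, const struct batch *b)
{
        for (size_t i = 0; i < MAX_BOS; ++i) {
                if (b->writes[i] && (a->reads[i] || a->writes[i]))
                        return true;
                if (b->reads[i] && a->writes[i])
                        return true;
        }
        return false;
}
```

Note that two pure readers of the same BO never conflict under this rule, which is the behavior we want for, e.g., two batches sampling the same texture.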
----

On design more generally: I don't think we want to trigger any flushes at draw time. Rather, we want to trigger them at flush time. Conceptually, right before we send a batch to the GPU, we ensure all of the other batches it needs have been sent first and that there is a barrier between them (via wait_bo).

The first consequence of delaying is that CPU-side logic can proceed without stalling on results. The second is that different batches can be _totally_ independent. Consider an app that does the following passes:

  [FBO 1: Render a depth map of an object]
  [FBO 2: Render a normal map of that object]
  [Scanout: Render using the depth/normal maps as textures]

In this case, the app should generate CPU-side batches for all three render targets at once. Then, when flush() is called, FBO #1 and FBO #2 should be submitted and waited upon so that they execute concurrently, and then scanout is submitted and waited upon. This should be a little faster, especially paired with the _NEXT changes in the kernel. CC'ing Steven to ensure the principle is sound.

We can model this with a dependency graph, where batches are nodes and the dependency of a batch X on a batch Y is represented as an edge from Y to X, making this a directed graph. For well-behaved apps, the graph must be acyclic (why?). This touches on the idea of topological sorting: a topological sort of the dependency graph is a valid order in which to submit the batches. So hypothetically, you could delay everything to flush time, construct the dependency graph, do a topological sort, and then submit/wait in the order of the sort.

More interesting is to extend this to the concurrent FBO case, with an algorithm that follows simply from topological sorting:

---

0. Create the dependency graph. Cull nodes that are not connected to the node we're trying to flush (the scanout batch). In other words, reduce the graph to its component containing the flushed node.
   See also https://en.wikipedia.org/wiki/Connected_component_(graph_theory)#Algorithms

1. For each node with no incoming edges (= a batch with no unsatisfied dependencies), submit that batch. Remove it from the dependency graph, removing all of its outgoing edges. Add it to a set of submitted batches.

2. For each submitted batch, wait on that batch.

3. Jump back to step #1 until there are no more nodes with no incoming edges.

---

Intuitively, the idea is "submit as much as we can all at once, then wait for it. Keep doing that until we've submitted everything we need."

A bit more formally: nodes with no incoming edges have no unsatisfied dependencies by definition, so we can submit them in any order, and we choose to submit them first. We are allowed to issue a wait at any time. Once we wait on a batch, it is complete, so any batch that depends on it has that dependency satisfied, which we represent by removing the edge from the dependency graph.

Do note the subtlety of the termination condition: no more nodes with no incoming edges. This makes proving that the algorithm halts easy, since every iteration either removes a node or halts, and the number of nodes is a finite non-negative integer.

* Whether this is a useful optimization depends greatly on the hardware. The Arm guys can chime in here, but I do know the GPU has some parallel execution capabilities, so this shouldn't be a total waste.

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
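Returning to the algorithm above, the wave-based loop can be sketched as follows. This is a toy model, not driver code: the adjacency-matrix graph and batch indices are illustrative, and submit/wait are reduced to recording a submission order (in the real driver these would be the kernel submit and wait calls). Step 0's culling is assumed to have happened already.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BATCHES 8

/* dep[y][x] = true encodes an edge Y -> X: batch X depends on batch Y,
 * so Y must be submitted and waited upon before X is submitted. */
static bool dep[MAX_BATCHES][MAX_BATCHES];
static int n_batches;

/* Number of unsatisfied dependencies (incoming edges) of batch x. */
static int
indegree(int x)
{
        int d = 0;
        for (int y = 0; y < n_batches; ++y)
                if (dep[y][x])
                        d++;
        return d;
}

/* Submit batches in topological order, one "wave" at a time: every
 * batch with no unsatisfied dependencies is submitted together so the
 * wave can run concurrently, then each is waited upon and its outgoing
 * edges are removed, satisfying its dependents. Writes the submission
 * order into `order` and returns the number of batches flushed. */
static int
flush_batches(int *order)
{
        bool done[MAX_BATCHES] = { false };
        int n_done = 0;

        while (n_done < n_batches) {
                int wave[MAX_BATCHES], wave_len = 0;

                /* Step 1: gather and "submit" every node with no
                 * incoming edges. */
                for (int x = 0; x < n_batches; ++x)
                        if (!done[x] && indegree(x) == 0)
                                wave[wave_len++] = x;

                /* An empty wave with work remaining would mean a cycle,
                 * which well-behaved apps cannot produce. */
                assert(wave_len > 0);

                /* Steps 2-3: "wait" on each batch in the wave, then
                 * drop its outgoing edges and loop back. */
                for (int i = 0; i < wave_len; ++i) {
                        int x = wave[i];
                        order[n_done++] = x;
                        done[x] = true;
                        for (int z = 0; z < n_batches; ++z)
                                dep[x][z] = false;
                }
        }
        return n_done;
}
```

For the depth-map/normal-map/scanout example, the first wave contains both FBO batches (they have no incoming edges), and scanout lands in the second wave once both of its edges have been removed.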