So, games/apps that are aware of how a tiler gpu works will make an
effort to avoid mid-batch (tile pass) updates to textures, UBOs, etc,
since this will force a flush, plus an extra resolve (tile->mem) and
restore (mem->tile) in the next batch. They also avoid unnecessary
framebuffer switches, for the same reason.

But it turns out that many games, benchmarks, etc, aren't very good at
this. But what if we could re-order the batches (and potentially shadow
texture/UBO/etc resources) to minimize the tile passes and unnecessary
resolve/restore?

This is based on a rough idea that Eric suggested a while back, and a
few other experiments that I have been trying recently. It boils down
to three parts:

1) Add an fd_batch object, which tracks the cmdstream being built for
   that particular tile pass. State that is global to the tile pass is
   moved from fd_context to fd_batch. (Mostly the framebuffer state,
   but also some internal tracking that is done to decide whether to
   use GMEM or sysmem/bypass mode, etc.) Tracking of resources
   written/read in the batch is also moved from ctx to batch.

2) Add a batch-cache. Previously, whenever new framebuffer state was
   set, it forced a flush. Now (if reordering is enabled), we use the
   framebuffer state as key into a hashtable to map it to an existing
   batch (if there is one, otherwise construct a new batch and add it
   to the table).

   When a resource is marked as read/written by a batch, and it is
   already pending access by another batch, a dependency between the
   two batches is added.

   TODO there is probably a bit more room for improvement here. See
   the analysis of supertuxkart below.

3) Shadow resources. Mid-batch UBO updates or uploading new contents
   to an in-use texture is sadly too common. Traditional (non-tiler)
   gpu's could solve this with a staging buffer, and blitting from the
   staging to real buffer at the appropriate spot in the cmdstream.
   But this doesn't work for a tiling gpu, since we'll need the old
   contents again when we move on to the next tile. To solve this,
   allocate a new buffer and back-blit the previous contents to the new
   buffer. The existing buffer becomes a shadow and is unref'd (the
   backing GEM object is kept alive since it is referenced by the
   cmdstream).

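
To make parts 1) and 2) a bit more concrete, here is a rough sketch of
the lookup and dependency tracking. It is not the code from the patches:
fb_key, batch_from_fb(), batch_resource_read() and
batch_resource_write() are hypothetical names, the key is reduced to two
surface ids, and a linear scan over a small array stands in for the
hashtable.

```c
#include <assert.h>
#include <string.h>

#define MAX_BATCHES 8   /* arbitrary limit for this sketch */

/* Hypothetical, reduced stand-in for the framebuffer state used as key. */
struct fb_key {
    int zsbuf;          /* id of the depth/stencil surface, 0 = none */
    int cbuf0;          /* id of color buffer 0, 0 = none */
};

struct batch {
    int in_use;
    struct fb_key key;
    unsigned deps;      /* bitmask of batch indices to flush first */
};

struct batch_cache {
    struct batch batches[MAX_BATCHES];
};

/* Hypothetical resource: remembers which batch has a pending write. */
struct resource {
    int writer;         /* batch index with a pending write, -1 if none */
};

static void batch_cache_init(struct batch_cache *cache)
{
    memset(cache, 0, sizeof(*cache));
}

/* Map fb state to an existing batch, or claim a free slot for a new one.
 * Returns -1 when the cache is full (a driver would have to flush). */
static int batch_from_fb(struct batch_cache *cache, struct fb_key key)
{
    int free_slot = -1;
    for (int i = 0; i < MAX_BATCHES; i++) {
        struct batch *b = &cache->batches[i];
        if (b->in_use && b->key.zsbuf == key.zsbuf &&
            b->key.cbuf0 == key.cbuf0)
            return i;
        if (!b->in_use && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0) {
        struct batch *b = &cache->batches[free_slot];
        b->in_use = 1;
        b->key = key;
        b->deps = 0;
    }
    return free_slot;
}

/* A read of a resource with a pending write in another batch makes the
 * reading batch depend on the writing batch. */
static void batch_resource_read(struct batch_cache *cache, int idx,
                                const struct resource *rsc)
{
    assert(idx >= 0 && idx < MAX_BATCHES);
    if (rsc->writer >= 0 && rsc->writer != idx)
        cache->batches[idx].deps |= 1u << rsc->writer;
}

/* A write also depends on any earlier pending writer, and then becomes
 * the pending writer itself. */
static void batch_resource_write(struct batch_cache *cache, int idx,
                                 struct resource *rsc)
{
    assert(idx >= 0 && idx < MAX_BATCHES);
    if (rsc->writer >= 0 && rsc->writer != idx)
        cache->batches[idx].deps |= 1u << rsc->writer;
    rsc->writer = idx;
}
```

In this sketch, setting the same fb state twice returns the same batch,
while a batch that samples a texture still being rendered by another
batch picks up a dependency bit for it. Tracking of pending readers (so
that a later write waits for them) is left out to keep it short.
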

As an example of shadowing, a texture upload + mipmap gen turns into
transfer_map for level zero (glTexSubImage*, etc), followed by blits to
the remaining mipmap levels (glGenerateMipmap()). So in transfer_map(),
if writing new contents into the buffer would trigger a flush or stall,
we shadow the existing buffer, and blit the remaining levels from old to
new. Each blit turns into a batch (different framebuffer state), and is
not immediately flushed, but just hangs out in the batch cache. When
the next blit (from glGenerateMipmap()) overwrites the contents from the
back-blit, we realize this and drop the previous rendering to the batch,
so in many cases the back-blit ends up discarded.

Results:

supertuxkart was a big winner, with an overall ~30% boost, making the
new render engine finally playable on most levels. Fps varies a lot by
level, but on average it goes from 14-19fps to 20-25fps.

(Sadly, the old render engine, which was much faster on lower end hw,
seems to be in disrepair.)

I did also add some instrumentation to collect some stats on # of
different sorts of batches. Since supertuxkart --profile-laps is not
repeatable, I could not directly compare results there, but I could
compare an apitrace replay of an stk level:

  normal:  batch_sysmem=10398, batch_gmem=6958, batch_restore=3864
  reorder: batch_sysmem=16825, batch_gmem=6956, batch_restore=3863
  (for 792 frames)

I was expecting a drop in gmem batches, and restores, because stk does
two problematic things: (1) render target switches, ie. clear, switch
fb, clear, switch fb, draw, etc., and (2) mid-batch UBO update.

I've looked a bit into the render target switches, but it seems like it
is mixing/matching zsbuf and cbuf's in a way that makes them map to
different batches. Ie:

  set fb: zsbuf=A, cbuf[0]=B
  clear color0
  clear stencil
  set fb: zsbuf=A, cbuf[0]=C
  draw

Not entirely sure what to do about that. I suppose I could track the
cmdstream for the clears individually, and juggle them between batches
somehow to avoid the flush?

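
To illustrate why the sequence above lands in two batches, here is a
tiny sketch. It assumes the cache key is just the (zsbuf, cbuf[0]) pair,
whereas the real key is the full framebuffer state. fb_key and
count_batches() are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Hypothetical reduced framebuffer key: surface names only. */
struct fb_key {
    char zsbuf;
    char cbuf0;
};

static int fb_key_equal(struct fb_key a, struct fb_key b)
{
    return a.zsbuf == b.zsbuf && a.cbuf0 == b.cbuf0;
}

/* Count how many distinct batches a sequence of "set fb" calls maps to
 * when the whole key must match for a batch to be reused. */
static size_t count_batches(const struct fb_key *seq, size_t n)
{
    size_t distinct = 0;
    for (size_t i = 0; i < n; i++) {
        int seen = 0;
        for (size_t j = 0; j < i; j++) {
            if (fb_key_equal(seq[i], seq[j])) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

Feeding it { {'A','B'}, {'A','C'} } gives two batches: the clears go to
the first and the draw to the second, even though both share zsbuf A.
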

The mid-batch UBO update seems to actually happen between two fb states
with the same color0 and zs, but the first treats color0 as
R8G8B8A8_SRGB and the next as R8G8B8A8_UNORM. Probably we need a flush
here anyways, but use of glDiscardFramebuffer() in the app (and wiring
up the driver bits) could avoid a lot of restores. Most of the gain
seems to come from simply not stalling on the UBO update.

xonotic also seems to be a winner, although I haven't analyzed it as
closely:

  med:   48fps -> 52fps
  high:  25fps -> 31fps
  ultra: 15fps -> 19fps

and the batch stats show more of an improvement:

  med:
    normal:  batch_sysmem=0, batch_gmem=18055, batch_restore=3748
    reorder: batch_sysmem=2220, batch_gmem=14483, batch_restore=174
    (10510 frames)

  high:
    normal:  batch_sysmem=63072, batch_gmem=62692, batch_restore=48384
    reorder: batch_sysmem=65429, batch_gmem=58284, batch_restore=43971
    (10510 frames)

  ultra:
    normal:  batch_sysmem=63072, batch_gmem=81318, batch_restore=66863
    reorder: batch_sysmem=65869, batch_gmem=71360, batch_restore=56939
    (10510 frames)

So in all cases a nice drop in tile passes (batch_gmem) and reduction in
number of times we need to move back from system memory to tile buffer
(batch_restore).

High/ultra still has a lot of restores per frame, so maybe there is
still some room for improvement. Not sure yet if it is the same sort of
thing going on as supertuxkart.

I would expect to see some gains in manhattan and possibly trex, but
unfortunately they are mostly using compressed textures that
util_blitter cannot blit, so the resource shadowing back-blit ends up on
the CPU (which ends up flushing previous mipmap generation and stalling,
which kind of defeats the purpose).

I'm not entirely sure what to do here. Since we don't need
scaling/filtering/etc we could map things to a different format which
can be rendered to, but I think we end up needing to also lie about the
width/height. Which works ok for fb state (we take w/h from the
pipe_surface, not the pipe_resource).
But not on the src (tex state) side. Possibly we could add w/h to
pipe_sampler_view to solve this? Solving this should at least bring
about +15% in manhattan, and maybe a bit in trex.

At any rate, the freedreno bits end up depending on some libdrm
patches[1] which in turn depend on some kernel stuff I have queued up
for 4.8. So it will be some time before it lands. But I'd like to get
the first three patches reviewed and pushed. And suggestions about the
remaining issues welcome, since there is still some room for further
gains.

[1] https://github.com/freedreno/libdrm/commits/fd-next

Rob Clark (12):
  gallium/util: make util_copy_framebuffer_state(src=NULL) work
  gallium: un-inline pipe_surface_desc
  list: fix list_replace() for empty lists
  freedreno: introduce fd_batch
  freedreno: push resource tracking down into batch
  freedreno: dynamically sized/growable cmd buffers
  freedreno: move more batch related tracking to fd_batch
  freedreno: add batch-cache
  freedreno: batch re-ordering support
  freedreno: spiff up some debug traces
  freedreno: shadow textures if possible to avoid stall/flush
  freedreno: support discarding previous rendering in special cases

 src/gallium/auxiliary/util/u_framebuffer.c         |  37 ++-
 src/gallium/drivers/freedreno/Makefile.sources     |   4 +
 src/gallium/drivers/freedreno/a2xx/fd2_draw.c      |  12 +-
 src/gallium/drivers/freedreno/a2xx/fd2_emit.c      |  15 +-
 src/gallium/drivers/freedreno/a2xx/fd2_gmem.c      |  63 ++---
 src/gallium/drivers/freedreno/a3xx/fd3_context.c   |   4 -
 src/gallium/drivers/freedreno/a3xx/fd3_context.h   |   5 -
 src/gallium/drivers/freedreno/a3xx/fd3_draw.c      |  23 +-
 src/gallium/drivers/freedreno/a3xx/fd3_emit.c      |  23 +-
 src/gallium/drivers/freedreno/a3xx/fd3_emit.h      |   2 +-
 src/gallium/drivers/freedreno/a3xx/fd3_gmem.c      | 146 +++++------
 src/gallium/drivers/freedreno/a4xx/fd4_draw.c      |  41 +--
 src/gallium/drivers/freedreno/a4xx/fd4_draw.h      |  13 +-
 src/gallium/drivers/freedreno/a4xx/fd4_emit.c      |  24 +-
 src/gallium/drivers/freedreno/a4xx/fd4_emit.h      |   2 +-
 src/gallium/drivers/freedreno/a4xx/fd4_gmem.c      | 122 ++++-----
 src/gallium/drivers/freedreno/freedreno_batch.c    | 280 +++++++++++++++++++++
 src/gallium/drivers/freedreno/freedreno_batch.h    | 152 +++++++++++
 .../drivers/freedreno/freedreno_batch_cache.c      | 246 ++++++++++++++++++
 .../drivers/freedreno/freedreno_batch_cache.h      |  51 ++++
 src/gallium/drivers/freedreno/freedreno_context.c  | 131 ++--------
 src/gallium/drivers/freedreno/freedreno_context.h  | 123 ++-------
 src/gallium/drivers/freedreno/freedreno_draw.c     | 132 +++++-----
 src/gallium/drivers/freedreno/freedreno_draw.h     |  15 +-
 src/gallium/drivers/freedreno/freedreno_gmem.c     | 110 ++++----
 src/gallium/drivers/freedreno/freedreno_gmem.h     |   6 +-
 src/gallium/drivers/freedreno/freedreno_query_hw.c |   8 +-
 src/gallium/drivers/freedreno/freedreno_resource.c | 242 ++++++++++++++++--
 src/gallium/drivers/freedreno/freedreno_resource.h |  10 +-
 src/gallium/drivers/freedreno/freedreno_screen.c   |   9 +
 src/gallium/drivers/freedreno/freedreno_screen.h   |   2 +
 src/gallium/drivers/freedreno/freedreno_state.c    |  19 +-
 src/gallium/drivers/freedreno/freedreno_util.h     |  43 ++--
 src/gallium/include/pipe/p_state.h                 |  23 +-
 src/util/list.h                                    |  14 +-
 35 files changed, 1486 insertions(+), 666 deletions(-)
 create mode 100644 src/gallium/drivers/freedreno/freedreno_batch.c
 create mode 100644 src/gallium/drivers/freedreno/freedreno_batch.h
 create mode 100644 src/gallium/drivers/freedreno/freedreno_batch_cache.c
 create mode 100644 src/gallium/drivers/freedreno/freedreno_batch_cache.h

-- 
2.7.4