On Sat, Jul 2, 2016 at 12:52 PM, Rob Clark <robdcl...@gmail.com> wrote:
> So, games/apps that are aware of how a tiler gpu works will make an
> effort to avoid mid-batch (tile pass) updates to textures, UBOs, etc,
> since this will force a flush, and extra resolve (tile->mem) and
> restore (mem->tile) in the next batch.  They also avoid unnecessary
> framebuffer switches, for the same reason.
>
> But turns out that many games, benchmarks, etc, aren't very good at
> this.  But what if we could re-order the batches (and potentially
> shadow texture/UBO/etc resources) to minimize the tile passes and
> unnecessary resolve/restore?
>
> This is based on a rough idea that Eric suggested a while back, and
> a few other experiments that I have been trying recently.  It boils
> down to three parts:
>
> 1) Add an fd_batch object, which tracks cmdstream being built for that
>    particular tile pass.  State that is global to the tile pass is
>    moved from fd_context to fd_batch.  (Mostly the framebuffer state,
>    but also some internal tracking that is done to decide whether to
>    use GMEM or sysmem/bypass mode, etc.)
>
>    Tracking of resources written/read in the batch is also moved from
>    ctx to batch.
>
> 2) Add a batch-cache.  Previously, whenever new framebuffer state is
>    set, it forced a flush.  Now (if reordering is enabled), we use
>    the framebuffer state as key into a hashtable to map it to an
>    existing batch (if there is one, otherwise construct a new batch
>    and add it to the table).
>
>    When a resource is marked as read/written by a batch, which is
>    already pending access by another batch, a dependency between the
>    two batches is added.
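
For anyone following along, here is a rough sketch of the dependency
tracking described in part 2) above.  It is not the code from the
patches: sketch_batch, sketch_resource, batch_read(), batch_write() and
batch_flush() are made-up names, batches are plain array indices, and
dependencies are a 32-bit mask, all assumed only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BATCHES 32

/* Hypothetical stand-in for a batch; not the real fd_batch. */
struct sketch_batch {
    uint32_t deps;     /* bit i set: batch i must be flushed before this one */
    int flushed;
};

/* Hypothetical per-resource tracking of pending accesses. */
struct sketch_resource {
    int writer;        /* index of the batch with a pending write, or -1 */
    uint32_t readers;  /* bitmask of batches with pending reads */
};

/* Batch idx reads rsc: it must run after any other batch writing rsc. */
static void batch_read(struct sketch_batch *batches, int idx,
                       struct sketch_resource *rsc)
{
    assert(idx >= 0 && idx < MAX_BATCHES);
    if (rsc->writer >= 0 && rsc->writer != idx)
        batches[idx].deps |= UINT32_C(1) << rsc->writer;
    rsc->readers |= UINT32_C(1) << idx;
}

/* Batch idx writes rsc: it must run after the previous writer and after
 * every other batch that still has a pending read. */
static void batch_write(struct sketch_batch *batches, int idx,
                        struct sketch_resource *rsc)
{
    assert(idx >= 0 && idx < MAX_BATCHES);
    uint32_t self = UINT32_C(1) << idx;
    if (rsc->writer >= 0 && rsc->writer != idx)
        batches[idx].deps |= UINT32_C(1) << rsc->writer;
    batches[idx].deps |= rsc->readers & ~self;
    rsc->writer = idx;
}

/* Flush idx after everything it depends on.  The flush order is appended
 * to out[] (caller provides room for MAX_BATCHES entries); returns the new
 * entry count.  Marking the batch before recursing only guards against
 * infinite recursion; this sketch does not try to resolve cycles. */
static int batch_flush(struct sketch_batch *batches, int idx, int *out, int n)
{
    assert(idx >= 0 && idx < MAX_BATCHES);
    if (batches[idx].flushed)
        return n;
    batches[idx].flushed = 1;
    for (int i = 0; i < MAX_BATCHES; i++)
        if (batches[idx].deps & (UINT32_C(1) << i))
            n = batch_flush(batches, i, out, n);
    out[n++] = idx;
    return n;
}
```

With this, if batch 0 writes a texture and batch 1 later samples it,
flushing batch 1 first flushes batch 0.  It also shows why stale
sampler state matters: every extra batch_read() on a texture that is
not really used adds a dependency that is not needed.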

so I noticed one slightly annoying thing..  u_blitter doesn't clear
the vtx sampler state, and only updates the first sampler for frag
state..  resulting in us adding a bunch of unneeded dependencies
during a blitter draw..  not sure if it is better to fix u_blitter or
hack around it in the driver..

BR,
-R

>    TODO there is probably a bit more room for improvement here.  See
>    below analysis of supertuxkart.
>
> 3) Shadow resources.  Mid-batch UBO updates or uploading new contents
>    to an in-use texture is sadly too common.  Traditional (non-tiler)
>    gpus could solve this with a staging buffer, and blitting from the
>    staging to real buffer at the appropriate spot in the cmdstream.
>    But this doesn't work for a tiling gpu, since we'll need the old
>    contents again when we move on to the next tile.  To solve this,
>    allocate a new buffer and back-blit the previous contents to the
>    new buffer.  The existing buffer becomes a shadow and is unref'd
>    (the backing GEM object is kept alive since it is referenced by
>    the cmdstream).
>
>    For example, a texture upload + mipmap gen turns into transfer_map
>    for level zero (glTexSubImage*, etc), followed by blits to the
>    remaining mipmap levels (glGenerateMipmap()).  So in transfer_map()
>    if writing new contents into the buffer would trigger a flush or
>    stall, we shadow the existing buffer, and blit the remaining levels
>    from old to new.  Each blit turns into a batch (different
>    framebuffer state), and is not immediately flushed, but just hangs
>    out in the batch cache.  When the next blit (from glGenerateMipmap())
>    overwrites the contents from the back-blit, we realize this and
>    drop the previous rendering to the batch, so in many cases the
>    back-blit ends up discarded.
>
>
>
> Results:
>
> supertuxkart was a big winner, with an overall ~30% boost, making the
> new render engine finally playable on most levels.  Fps varies a lot
> by level, but on average going from 14-19fps to 20-25fps.
>
> (Sadly, the old render engine, which was much faster on lower end hw,
> seems to be in disrepair.)
>
> I did also add some instrumentation to collect some stats on # of
> different sorts of batches.  Since supertuxkart --profile-laps is
> not repeatable, I could not directly compare results there, but I
> could compare an apitrace replay of stk level:
>
>   normal:  batch_sysmem=10398, batch_gmem=6958, batch_restore=3864
>   reorder: batch_sysmem=16825, batch_gmem=6956, batch_restore=3863
>   (for 792 frames)
>
> I was expecting a drop in gmem batches, and restores, because stk
> does two problematic things: (1) render target switches, ie. clear,
> switch fb, clear, switch fb, draw, etc., and (2) mid-batch UBO
> update.
>
> I've looked a bit into the render target switches, but it seems like
> it is mixing/matching zsbuf and cbuf's in a way that makes them map
> to different batches.  Ie:
>
>   set fb: zsbuf=A, cbuf[0]=B
>   clear color0
>   clear stencil
>   set fb: zsbuf=A, cbuf[0]=C
>   draw
>
> Not entirely sure what to do about that.  I suppose I could track the
> cmdstream for the clears individually, and juggle them between batches
> somehow to avoid the flush?
>
> The mid-batch UBO update seems to actually happen between two fb states
> with the same color0 and zs, but first treats color0 as R8G8B8A8_SRGB
> and the next R8G8B8A8_UNORM.  Probably we need a flush here anyways,
> but use of glDiscardFramebuffer() in the app (and wiring up the driver
> bits) could avoid a lot of restores.
>
> Most of the gain seems to come from simply not stalling on the UBO
> update.
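
To make the render target switch case above concrete, here is a small
sketch of a cache keyed on framebuffer state.  It is only an
illustration: sketch_fb_key, sketch_cache and cache_lookup() are
hypothetical names, the key is reduced to two surface ids, and a linear
scan stands in for the hashtable used in the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_KEYS 16

/* Hypothetical, reduced framebuffer state: just two surface ids. */
struct sketch_fb_key {
    uint32_t zsbuf;   /* depth/stencil surface id */
    uint32_t cbuf0;   /* first color surface id */
};

struct sketch_cache {
    struct sketch_fb_key keys[MAX_KEYS];
    size_t count;
};

/* Return the batch slot for this fb state, adding a new slot if the exact
 * same state has not been seen yet.  Linear scan instead of a hashtable,
 * purely to keep the sketch short. */
static size_t cache_lookup(struct sketch_cache *c, struct sketch_fb_key key)
{
    for (size_t i = 0; i < c->count; i++)
        if (c->keys[i].zsbuf == key.zsbuf && c->keys[i].cbuf0 == key.cbuf0)
            return i;
    assert(c->count < MAX_KEYS);
    c->keys[c->count] = key;
    return c->count++;
}
```

Looking up (zsbuf=A, cbuf0=B) for the clears and then (zsbuf=A, cbuf0=C)
for the draw gives two different slots, even though both share zsbuf A.
So the clears and the draw end up in separate batches, which matches
the sequence in the trace.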
>
>
> xonotic also seems to be a winner, although I haven't analyzed it as
> closely:
>
>   med:   48fps -> 52fps
>   high:  25fps -> 31fps
>   ultra: 15fps -> 19fps
>
> and the batch stats show more of an improvement:
>
>   med:
>     normal:  batch_sysmem=0, batch_gmem=18055, batch_restore=3748
>     reorder: batch_sysmem=2220, batch_gmem=14483, batch_restore=174
>     (10510 frames)
>
>   high:
>     normal:  batch_sysmem=63072, batch_gmem=62692, batch_restore=48384
>     reorder: batch_sysmem=65429, batch_gmem=58284, batch_restore=43971
>     (10510 frames)
>
>   ultra:
>     normal:  batch_sysmem=63072, batch_gmem=81318, batch_restore=66863
>     reorder: batch_sysmem=65869, batch_gmem=71360, batch_restore=56939
>     (10510 frames)
>
> So in all cases a nice drop in tile passes (batch_gmem) and reduction
> in number of times we need to move back from system memory to tile
> buffer (batch_restore).  High/ultra still has a lot of restores per
> frame, so maybe there is still some room for improvement.  Not sure
> yet if it is the same sort of thing going on as supertuxkart.
>
> I would expect to see some gains in manhattan and possibly trex, but
> unfortunately it is mostly using compressed textures that util_blitter
> cannot blit, so the resource shadowing back-blit ends up on the CPU
> (which ends up flushing previous mipmap generation and stalling, which
> kind of defeats the purpose).  I'm not entirely sure what to do here.
> Since we don't need scaling/filtering/etc we could map things to a
> different format which can be rendered to, but I think we end up
> needing to also lie about the width/height.  Which works ok for fb
> state (we take w/h from the pipe_surface, not the pipe_resource).  But
> not on the src (tex state) side.  Possibly we could add w/h to
> pipe_sampler_view to solve this?  Solving this should at least bring
> about +15% in manhattan, and maybe a bit in trex.
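
On the "lie about the width/height" point, a small sketch of the
arithmetic, assuming a block-compressed format where each block_w x
block_h block would be treated as one texel of some renderable format
of the same size in bytes.  sketch_extent, block_extent() and minify()
are hypothetical helpers written for this example, not existing Mesa
functions.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_extent {
    uint32_t width;
    uint32_t height;
};

/* Size of one mipmap level dimension, never smaller than 1. */
static uint32_t minify(uint32_t size, unsigned level)
{
    assert(level < 32);
    uint32_t s = size >> level;
    return s ? s : 1;
}

/* Dimensions to use if each compressed block is viewed as a single texel:
 * round up, since a partially covered block is still stored whole. */
static struct sketch_extent
block_extent(uint32_t width, uint32_t height,
             uint32_t block_w, uint32_t block_h)
{
    assert(block_w > 0 && block_h > 0);
    struct sketch_extent e;
    e.width  = (width  + block_w - 1) / block_w;
    e.height = (height + block_h - 1) / block_h;
    return e;
}
```

For a 256x128 texture with 4x4 blocks the level-zero view would be
64x32, and the smallest levels (2x1, 1x1) would each still be a single
1x1 block.  So the w/h seen by the blit does not match the w/h of the
resource, which is why it has to come from the surface or sampler view.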
>
>
> At any rate, the freedreno bits end up depending on some libdrm
> patches[1] which in turn depend on some kernel stuff I have queued up
> for 4.8.  So it will be some time before it lands.  But I'd like to
> get the first three patches reviewed and pushed.  And suggestions
> about the remaining issues welcome, since there is still some room
> for further gains.
>
> [1] https://github.com/freedreno/libdrm/commits/fd-next
>
> Rob Clark (12):
>   gallium/util: make util_copy_framebuffer_state(src=NULL) work
>   gallium: un-inline pipe_surface_desc
>   list: fix list_replace() for empty lists
>   freedreno: introduce fd_batch
>   freedreno: push resource tracking down into batch
>   freedreno: dynamically sized/growable cmd buffers
>   freedreno: move more batch related tracking to fd_batch
>   freedreno: add batch-cache
>   freedreno: batch re-ordering support
>   freedreno: spiff up some debug traces
>   freedreno: shadow textures if possible to avoid stall/flush
>   freedreno: support discarding previous rendering in special cases
>
>  src/gallium/auxiliary/util/u_framebuffer.c | 37 ++-
>  src/gallium/drivers/freedreno/Makefile.sources | 4 +
>  src/gallium/drivers/freedreno/a2xx/fd2_draw.c | 12 +-
>  src/gallium/drivers/freedreno/a2xx/fd2_emit.c | 15 +-
>  src/gallium/drivers/freedreno/a2xx/fd2_gmem.c | 63 ++---
>  src/gallium/drivers/freedreno/a3xx/fd3_context.c | 4 -
>  src/gallium/drivers/freedreno/a3xx/fd3_context.h | 5 -
>  src/gallium/drivers/freedreno/a3xx/fd3_draw.c | 23 +-
>  src/gallium/drivers/freedreno/a3xx/fd3_emit.c | 23 +-
>  src/gallium/drivers/freedreno/a3xx/fd3_emit.h | 2 +-
>  src/gallium/drivers/freedreno/a3xx/fd3_gmem.c | 146 +++++------
>  src/gallium/drivers/freedreno/a4xx/fd4_draw.c | 41 +--
>  src/gallium/drivers/freedreno/a4xx/fd4_draw.h | 13 +-
>  src/gallium/drivers/freedreno/a4xx/fd4_emit.c | 24 +-
>  src/gallium/drivers/freedreno/a4xx/fd4_emit.h | 2 +-
>  src/gallium/drivers/freedreno/a4xx/fd4_gmem.c | 122 ++++-----
>  src/gallium/drivers/freedreno/freedreno_batch.c | 280 +++++++++++++++++++++
>  src/gallium/drivers/freedreno/freedreno_batch.h | 152 +++++++++++
>  .../drivers/freedreno/freedreno_batch_cache.c | 246 ++++++++++++++++++
>  .../drivers/freedreno/freedreno_batch_cache.h | 51 ++++
>  src/gallium/drivers/freedreno/freedreno_context.c | 131 ++--------
>  src/gallium/drivers/freedreno/freedreno_context.h | 123 ++-------
>  src/gallium/drivers/freedreno/freedreno_draw.c | 132 +++++-----
>  src/gallium/drivers/freedreno/freedreno_draw.h | 15 +-
>  src/gallium/drivers/freedreno/freedreno_gmem.c | 110 ++++----
>  src/gallium/drivers/freedreno/freedreno_gmem.h | 6 +-
>  src/gallium/drivers/freedreno/freedreno_query_hw.c | 8 +-
>  src/gallium/drivers/freedreno/freedreno_resource.c | 242 ++++++++++++++++--
>  src/gallium/drivers/freedreno/freedreno_resource.h | 10 +-
>  src/gallium/drivers/freedreno/freedreno_screen.c | 9 +
>  src/gallium/drivers/freedreno/freedreno_screen.h | 2 +
>  src/gallium/drivers/freedreno/freedreno_state.c | 19 +-
>  src/gallium/drivers/freedreno/freedreno_util.h | 43 ++--
>  src/gallium/include/pipe/p_state.h | 23 +-
>  src/util/list.h | 14 +-
>  35 files changed, 1486 insertions(+), 666 deletions(-)
>  create mode 100644 src/gallium/drivers/freedreno/freedreno_batch.c
>  create mode 100644 src/gallium/drivers/freedreno/freedreno_batch.h
>  create mode 100644 src/gallium/drivers/freedreno/freedreno_batch_cache.c
>  create mode 100644 src/gallium/drivers/freedreno/freedreno_batch_cache.h
>
> --
> 2.7.4