On Fri, Jun 24, 2016 at 1:09 PM, Nicolai Hähnle <nhaeh...@gmail.com> wrote: > On 22.06.2016 20:29, Marek Olšák wrote: >> >> From: Marek Olšák <marek.ol...@amd.com> >> >> DCC for displayable surfaces is allocated in a separate buffer and is >> enabled or disabled based on PS invocations from 2 frames ago (to let >> queries go idle) and the number of slow clears from the current frame. >> >> At least an equivalent of 5 fullscreen draws or slow clears must be done >> to enable DCC. (PS invocations / (width * height) + num_slow_clears >= 5) >> >> Pipeline statistic queries are always active if a color buffer that can >> have separate DCC is bound, even if separate DCC is disabled. That means >> the window color buffer is always monitored and DCC is enabled only when >> the situation is right. >> >> The tracking of per-texture queries in r600_common_context is quite ugly, >> but I don't see a better way. >> >> The first fast clear always enables DCC. DCC decompression can disable it. >> A later fast clear can enable it again. Enable/disable typically happens >> only once per frame. >> >> The impact is expected to be negligible because games usually don't have >> a high level of overdraw. DCC usually activates when too much blending >> is happening (smoke rendering) or when testing glClear performance and >> CMASK isn't supported (Stoney). > > > Nice stuff. One corner case to think of: what happens when DCC is enabled > for a texture that is currently bound? Needs the same treatment as when DCC > is disabled, right?
Not sure what you mean, can you be more specific? Note that it can't be bound as a sampler by apps because it's a window framebuffer. > > More comments below... > > >> --- >> src/gallium/drivers/radeon/r600_pipe_common.c | 15 ++ >> src/gallium/drivers/radeon/r600_pipe_common.h | 40 +++++ >> src/gallium/drivers/radeon/r600_texture.c | 239 >> ++++++++++++++++++++++++++ >> src/gallium/drivers/radeonsi/si_blit.c | 14 +- >> src/gallium/drivers/radeonsi/si_state.c | 15 ++ >> src/gallium/drivers/radeonsi/si_state_draw.c | 5 +- >> 6 files changed, 326 insertions(+), 2 deletions(-) >> >> diff --git a/src/gallium/drivers/radeon/r600_pipe_common.c >> b/src/gallium/drivers/radeon/r600_pipe_common.c >> index 5d4a679..66afcfa 100644 >> --- a/src/gallium/drivers/radeon/r600_pipe_common.c >> +++ b/src/gallium/drivers/radeon/r600_pipe_common.c >> @@ -397,6 +397,21 @@ bool r600_common_context_init(struct >> r600_common_context *rctx, >> >> void r600_common_context_cleanup(struct r600_common_context *rctx) >> { >> + unsigned i,j; >> + >> + /* Release DCC stats. */ >> + for (i = 0; i < ARRAY_SIZE(rctx->dcc_stats); i++) { >> + assert(!rctx->dcc_stats[i].query_active); >> + >> + for (j = 0; j < ARRAY_SIZE(rctx->dcc_stats[i].ps_stats); >> j++) >> + if (rctx->dcc_stats[i].ps_stats[j]) >> + rctx->b.destroy_query(&rctx->b, >> + >> rctx->dcc_stats[i].ps_stats[j]); >> + >> + pipe_resource_reference((struct pipe_resource**) >> + &rctx->dcc_stats[i].tex, NULL); >> + } >> + >> if (rctx->gfx.cs) >> rctx->ws->cs_destroy(rctx->gfx.cs); >> if (rctx->dma.cs) >> diff --git a/src/gallium/drivers/radeon/r600_pipe_common.h >> b/src/gallium/drivers/radeon/r600_pipe_common.h >> index 92cba13..cdec907 100644 >> --- a/src/gallium/drivers/radeon/r600_pipe_common.h >> +++ b/src/gallium/drivers/radeon/r600_pipe_common.h >> @@ -272,6 +272,25 @@ struct r600_texture { >> * dcc_offset contains the absolute GPUVM address, not the >> relative one. >> */ >> struct r600_resource *dcc_separate_buffer; >> + /* When DCC is temporarily disabled, the separate buffer is here. >> */ >> + struct r600_resource *last_dcc_separate_buffer; >> + /* We need to track DCC dirtiness, because st/dri usually calls >> + * flush_resource twice per frame (not a bug) and we don't wanna >> + * decompress DCC twice. Also, the dirty tracking must be done >> even >> + * if DCC isn't used, because it's required by the DCC usage >> analysis >> + * for a possible future enablement. >> + */ >> + bool separate_dcc_dirty; >> + /* Statistics gathering for the DCC enablement heuristic. */ >> + bool dcc_gather_statistics; >> + /* Estimate of how much this color buffer is written to in units >> of >> + * full-screen draws: ps_invocations / (width * height) >> + * Shader kills, late Z, and blending with trivial discards make >> it >> + * inaccurate (we need to count CB updates, not PS invocations). >> + */ >> + unsigned ps_draw_ratio; >> + /* The number of clears since the last DCC usage analysis. */ >> + unsigned num_slow_clears; >> >> /* Counter that should be non-zero if the texture is bound to a >> * framebuffer. Implemented in radeonsi only. >> @@ -536,6 +555,21 @@ struct r600_common_context { >> float sample_locations_8x[8][2]; >> float sample_locations_16x[16][2]; >> >> + /* Statistics gathering for the DCC enablement heuristic. It can't >> be >> + * in r600_texture because r600_texture can be shared by multiple >> + * contexts. This is for back buffers only. We shouldn't get too >> many >> + * of those. >> + */ >> + struct { >> + struct r600_texture *tex; >> + /* Query queue: 0 = usually active, 1 = waiting, 2 = >> readback. */ >> + struct pipe_query *ps_stats[3]; >> + /* If all slots are used and another slot is needed, >> + * the least recently used slot is evicted based on this. >> */ >> + int64_t last_use_timestamp; >> + bool query_active; >> + } dcc_stats[2]; >> + >> /* The list of all texture buffer objects in this context. >> * This list is walked when a buffer is invalidated/reallocated >> and >> * the GPU addresses are updated. */ >> @@ -688,6 +722,12 @@ struct pipe_surface >> *r600_create_surface_custom(struct pipe_context *pipe, >> const struct pipe_surface >> *templ, >> unsigned width, unsigned >> height); >> unsigned r600_translate_colorswap(enum pipe_format format, bool >> do_endian_swap); >> +void vi_separate_dcc_start_query(struct pipe_context *ctx, >> + struct r600_texture *tex); >> +void vi_separate_dcc_stop_query(struct pipe_context *ctx, >> + struct r600_texture *tex); >> +void vi_separate_dcc_analyze_stats(struct pipe_context *ctx, >> + struct r600_texture *tex); >> void vi_dcc_clear_level(struct r600_common_context *rctx, >> struct r600_texture *rtex, >> unsigned level, unsigned clear_value); >> diff --git a/src/gallium/drivers/radeon/r600_texture.c >> b/src/gallium/drivers/radeon/r600_texture.c >> index 23be5ed..7295ab6 100644 >> --- a/src/gallium/drivers/radeon/r600_texture.c >> +++ b/src/gallium/drivers/radeon/r600_texture.c >> @@ -26,9 +26,11 @@ >> */ >> #include "r600_pipe_common.h" >> #include "r600_cs.h" >> +#include "r600_query.h" >> #include "util/u_format.h" >> #include "util/u_memory.h" >> #include "util/u_pack_color.h" >> +#include "os/os_time.h" >> #include <errno.h> >> #include <inttypes.h> >> >> @@ -567,6 +569,7 @@ static void r600_texture_destroy(struct pipe_screen >> *screen, >> } >> pb_reference(&resource->buf, NULL); >> r600_resource_reference(&rtex->dcc_separate_buffer, NULL); >> + r600_resource_reference(&rtex->last_dcc_separate_buffer, NULL); >> FREE(rtex); >> } >> >> @@ -1017,6 +1020,7 @@ r600_texture_create_object(struct pipe_screen >> *screen, >> rtex->non_disp_tiling = rtex->is_depth && >> rtex->surface.level[0].mode >= RADEON_SURF_MODE_1D; >> /* Applies to GCN. */ >> rtex->last_msaa_resolve_target_micro_mode = >> rtex->surface.micro_tile_mode; >> + rtex->ps_draw_ratio = 100; /* start with a sufficiently high >> number */ >> >> if (rtex->is_depth) { >> if (!(base->flags & (R600_RESOURCE_FLAG_TRANSFER | >> @@ -1705,6 +1709,224 @@ unsigned r600_translate_colorswap(enum pipe_format >> format, bool do_endian_swap) >> return ~0U; >> } >> >> +/* PIPELINE_STAT-BASED DCC ENABLEMENT FOR DISPLAYABLE SURFACES */ >> + >> +/** >> + * Return the per-context slot where DCC statistics queries for the >> texture live. >> + */ >> +static unsigned vi_get_context_dcc_stats_index(struct r600_common_context >> *rctx, >> + struct r600_texture *tex) >> +{ >> + int i, empty_slot = -1; >> + >> + for (i = 0; i < ARRAY_SIZE(rctx->dcc_stats); i++) { >> + /* Return if found. */ >> + if (rctx->dcc_stats[i].tex == tex) { >> + rctx->dcc_stats[i].last_use_timestamp = >> os_time_get(); >> + return i; >> + } >> + >> + /* Record the first seen empty slot. */ >> + if (empty_slot == -1 && !rctx->dcc_stats[i].tex) >> + empty_slot = i; >> + } >> + >> + /* Not found. Remove the oldest member to make space in the array. >> */ >> + if (empty_slot == -1) { >> + int oldest_slot = 0; >> + >> + /* Find the oldest slot. */ >> + for (i = 1; i < ARRAY_SIZE(rctx->dcc_stats); i++) >> + if >> (rctx->dcc_stats[oldest_slot].last_use_timestamp > >> + rctx->dcc_stats[i].last_use_timestamp) >> + oldest_slot = i; >> + >> + /* Clean up the oldest slot. */ >> + if (rctx->dcc_stats[oldest_slot].query_active) >> + vi_separate_dcc_stop_query(&rctx->b, >> + >> rctx->dcc_stats[oldest_slot].tex); >> + >> + for (i = 0; i < >> ARRAY_SIZE(rctx->dcc_stats[oldest_slot].ps_stats); i++) >> + if (rctx->dcc_stats[oldest_slot].ps_stats[i]) { >> + rctx->b.destroy_query(&rctx->b, >> + >> rctx->dcc_stats[oldest_slot].ps_stats[i]); >> + rctx->dcc_stats[oldest_slot].ps_stats[i] = >> NULL; >> + } >> + >> + pipe_resource_reference((struct pipe_resource**) >> + &rctx->dcc_stats[oldest_slot].tex, >> NULL); >> + empty_slot = oldest_slot; >> + } >> + >> + /* Add the texture to the new slot. */ >> + pipe_resource_reference((struct >> pipe_resource**)&rctx->dcc_stats[empty_slot].tex, >> + &tex->resource.b.b); >> + rctx->dcc_stats[empty_slot].last_use_timestamp = os_time_get(); >> + return empty_slot; >> +} >> + >> +static struct pipe_query * >> +vi_create_resuming_pipestats_query(struct pipe_context *ctx) >> +{ >> + struct r600_query_hw *query = (struct r600_query_hw*) >> + ctx->create_query(ctx, PIPE_QUERY_PIPELINE_STATISTICS, 0); >> + >> + query->flags |= R600_QUERY_HW_FLAG_BEGIN_RESUMES; >> + return (struct pipe_query*)query; >> +} >> + >> +/** >> + * Called when binding a color buffer. >> + */ >> +void vi_separate_dcc_start_query(struct pipe_context *ctx, >> + struct r600_texture *tex) >> +{ >> + struct r600_common_context *rctx = (struct >> r600_common_context*)ctx; >> + unsigned i = vi_get_context_dcc_stats_index(rctx, tex); >> + >> + assert(!rctx->dcc_stats[i].query_active); >> + >> + if (!rctx->dcc_stats[i].ps_stats[0]) >> + rctx->dcc_stats[i].ps_stats[0] = >> vi_create_resuming_pipestats_query(ctx); >> + >> + /* begin or resume the query */ >> + ctx->begin_query(ctx, rctx->dcc_stats[i].ps_stats[0]); >> + rctx->dcc_stats[i].query_active = true; >> +} >> + >> +/** >> + * Called when unbinding a color buffer. >> + */ >> +void vi_separate_dcc_stop_query(struct pipe_context *ctx, >> + struct r600_texture *tex) >> +{ >> + struct r600_common_context *rctx = (struct >> r600_common_context*)ctx; >> + unsigned i = vi_get_context_dcc_stats_index(rctx, tex); >> + >> + assert(rctx->dcc_stats[i].query_active); >> + assert(rctx->dcc_stats[i].ps_stats[0]); >> + >> + /* pause or end the query */ >> + ctx->end_query(ctx, rctx->dcc_stats[i].ps_stats[0]); >> + rctx->dcc_stats[i].query_active = false; >> +} >> + >> +static bool vi_can_enable_separate_dcc(struct r600_texture *tex) >> +{ >> + /* The minimum number of fullscreen draws per frame that is >> required >> + * to enable DCC. */ >> + return tex->ps_draw_ratio + tex->num_slow_clears >= 5; >> +} > > > Rename this function to vi_should_enable_separate_dcc or similar. OK. > > >> + >> +/* Called by fast clear. */ >> +static void vi_separate_dcc_try_enable(struct r600_common_context *rctx, >> + struct r600_texture *tex) >> +{ >> + /* The intent is to use this with shared displayable back buffers, >> + * but it's not strictly limited only to them. >> + */ >> + if (!tex->resource.is_shared || >> + !(tex->resource.external_usage & >> PIPE_HANDLE_USAGE_EXPLICIT_FLUSH) || >> + tex->resource.b.b.target != PIPE_TEXTURE_2D || >> + tex->surface.last_level > 0 || >> + !tex->surface.dcc_size) >> + return; >> + >> + if (tex->dcc_offset) >> + return; /* already enabled */ >> + >> + if (!vi_can_enable_separate_dcc(tex)) >> + return; /* stats show that DCC decompression is too >> expensive */ >> + >> + assert(tex->surface.level[0].dcc_enabled); >> + assert(!tex->dcc_separate_buffer); >> + >> + r600_texture_discard_cmask(rctx->screen, tex); >> + >> + /* Get a DCC buffer. */ >> + if (tex->last_dcc_separate_buffer) { >> + tex->dcc_separate_buffer = tex->last_dcc_separate_buffer; >> + tex->last_dcc_separate_buffer = NULL; >> + } else { >> + tex->dcc_separate_buffer = (struct r600_resource*) >> + r600_aligned_buffer_create(rctx->b.screen, 0, >> + PIPE_USAGE_DEFAULT, >> + tex->surface.dcc_size, >> + >> tex->surface.dcc_alignment); >> + if (!tex->dcc_separate_buffer) >> + return; >> + >> + /* Enabling for the first time, so start the query. */ >> + tex->dcc_gather_statistics = true; >> + vi_separate_dcc_start_query(&rctx->b, tex); > > > I think it would be cleaner to put the statistics enablement above the > vi_can_enable_separate_dcc/vi_should_enable_separate_dcc check, since then > the code doesn't rely on a magic initialization of ps_ratio. Generally it > would separate the statistics gathering more cleanly from the > enable/disable. The magic initilization of ps_ratio is required for tests that don't call SwapBuffers or call it too late. I understand the idea about separating the statistics gathering, but do we need it separate? There is no reason to gather statistics before the first clear. > >> + } >> + >> + /* dcc_offset is the absolute GPUVM address. */ >> + tex->dcc_offset = tex->dcc_separate_buffer->gpu_address; >> + >> + /* no need to flag anything since this is called by fast clear >> that >> + * flags framebuffer state >> + */ >> +} >> + >> +/** >> + * Called by pipe_context::flush_resource, the place where DCC >> decompression >> + * takes place. >> + */ >> +void vi_separate_dcc_analyze_stats(struct pipe_context *ctx, >> + struct r600_texture *tex) > > > Bike-shedding: I'd prefer a name like _gather_stats or _collect_stats. > >> +{ >> + struct r600_common_context *rctx = (struct >> r600_common_context*)ctx; >> + unsigned i = vi_get_context_dcc_stats_index(rctx, tex); >> + bool query_active = rctx->dcc_stats[i].query_active; >> + bool disable = false; >> + >> + if (rctx->dcc_stats[i].ps_stats[2]) { >> + union pipe_query_result result; >> + >> + /* Read the results. */ >> + ctx->get_query_result(ctx, rctx->dcc_stats[i].ps_stats[2], >> + true, &result); > > > What if this stalls? If flush_resource / DCC decompression is done only once per frame, there is a latency of two frames. That should be more than enough time even if it stalls. Do you have any concerns about that? > > >> + ctx->destroy_query(ctx, rctx->dcc_stats[i].ps_stats[2]); >> + rctx->dcc_stats[i].ps_stats[2] = NULL; >> + >> + /* Compute the approximate number of fullscreen draws. */ >> + tex->ps_draw_ratio = >> + result.pipeline_statistics.ps_invocations / >> + (tex->resource.b.b.width0 * >> tex->resource.b.b.height0); >> + >> + disable = tex->dcc_separate_buffer && >> + !vi_can_enable_separate_dcc(tex); >> + } >> + >> + tex->num_slow_clears = 0; >> + >> + /* stop the statistics query for ps_stats[0] */ >> + if (query_active) >> + vi_separate_dcc_stop_query(ctx, tex); >> + >> + /* Move the queries in the queue by one. */ >> + rctx->dcc_stats[i].ps_stats[2] = rctx->dcc_stats[i].ps_stats[1]; >> + rctx->dcc_stats[i].ps_stats[1] = rctx->dcc_stats[i].ps_stats[0]; >> + rctx->dcc_stats[i].ps_stats[0] = NULL; >> + >> + /* create and start a new query as ps_stats[0] */ >> + if (query_active) >> + vi_separate_dcc_start_query(ctx, tex); >> + >> + if (disable) { >> + assert(!tex->last_dcc_separate_buffer); >> + tex->last_dcc_separate_buffer = tex->dcc_separate_buffer; >> + tex->dcc_separate_buffer = NULL; >> + tex->dcc_offset = 0; >> + /* no need to flag anything since this is called after >> + * decompression that re-sets framebuffer state >> + */ >> + } > > > DCC disabling logic shouldn't be in a function called *_analyze_stats. Also, > I find it clearer to reset tex->num_slow_clears outside, in flush_resource. What about *_process_and_reset_stats? Marek _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev