Kristian Høgsberg <k...@bitplanet.net> writes:

> On Thu, Nov 19, 2015 at 4:24 AM, Francisco Jerez <curroje...@riseup.net> wrote:
>> Kristian Høgsberg <k...@bitplanet.net> writes:
>>
>>> On Tue, Nov 17, 2015 at 9:54 PM, Jordan Justen
>>> <jordan.l.jus...@intel.com> wrote:
>>>> From: Francisco Jerez <curroje...@riseup.net>
>>>>
>>>> This calculates a rather conservative partitioning of the L3 cache
>>>> based on the shaders currently bound to the pipeline and whether they
>>>> use SLM, atomics, images or scratch space.  The result is intended to
>>>> be fine-tuned later on based on other pipeline state.
>>>> ---
>>>>  src/mesa/drivers/dri/i965/brw_compiler.h  |  1 +
>>>>  src/mesa/drivers/dri/i965/gen7_l3_state.c | 53 +++++++++++++++++++++++++++++++
>>>>  2 files changed, 54 insertions(+)
>>>>
>>>> diff --git a/src/mesa/drivers/dri/i965/brw_compiler.h b/src/mesa/drivers/dri/i965/brw_compiler.h
>>>> index 8f147d3..ef8bddb 100644
>>>> --- a/src/mesa/drivers/dri/i965/brw_compiler.h
>>>> +++ b/src/mesa/drivers/dri/i965/brw_compiler.h
>>>> @@ -300,6 +300,7 @@ struct brw_stage_prog_data {
>>>>
>>>>     unsigned curb_read_length;
>>>>     unsigned total_scratch;
>>>> +   unsigned total_shared;
>>>>
>>>>     /**
>>>>      * Register where the thread expects to find input data from the URB
>>>> diff --git a/src/mesa/drivers/dri/i965/gen7_l3_state.c b/src/mesa/drivers/dri/i965/gen7_l3_state.c
>>>> index 4d0cfcd..1a88261 100644
>>>> --- a/src/mesa/drivers/dri/i965/gen7_l3_state.c
>>>> +++ b/src/mesa/drivers/dri/i965/gen7_l3_state.c
>>>> @@ -258,6 +258,59 @@ get_l3_config(const struct brw_device_info *devinfo, struct brw_l3_weights w0)
>>>>  }
>>>>
>>>>  /**
>>>> + * Return a reasonable default L3 configuration for the specified device based
>>>> + * on whether SLM and DC are required.  In the non-SLM non-DC case the result
>>>> + * is intended to approximately resemble the hardware defaults.
>>>> + */
>>>> +static struct brw_l3_weights
>>>> +get_default_l3_weights(const struct brw_device_info *devinfo,
>>>> +                       bool needs_dc, bool needs_slm)
>>>> +{
>>>> +   struct brw_l3_weights w = {{ 0 }};
>>>> +
>>>> +   w.w[L3P_SLM] = needs_slm;
>>>> +   w.w[L3P_URB] = 1.0;
>>>> +
>>>> +   if (devinfo->gen >= 8) {
>>>> +      w.w[L3P_ALL] = 1.0;
>>>> +   } else {
>>>> +      w.w[L3P_DC] = needs_dc ? 0.1 : 0;
>>>> +      w.w[L3P_RO] = devinfo->is_baytrail ? 0.5 : 1.0;
>>>> +   }
>>>> +
>>>> +   return norm_l3_weights(w);
>>>> +}
>>>> +
>>>> +/**
>>>> + * Calculate the desired L3 partitioning based on the current state of the
>>>> + * pipeline.  For now this simply returns the conservative defaults calculated
>>>> + * by get_default_l3_weights(), but we could probably do better by gathering
>>>> + * more statistics from the pipeline state (e.g. guess of expected URB usage
>>>> + * and bound surfaces), or by using feed-back from performance counters.
>>>> + */
>>>> +static struct brw_l3_weights
>>>> +get_pipeline_state_l3_weights(const struct brw_context *brw)
>>>> +{
>>>> +   const struct brw_stage_state *stage_states[] = {
>>>> +      &brw->vs.base, &brw->gs.base, &brw->wm.base, &brw->cs.base
>>>> +   };
>>>> +   bool needs_dc = false, needs_slm = false;
>>>
>>> This doesn't seem optimal - we should evaluate the 3D pipe and the
>>> compute pipe separately depending on which one is active.  For
>>> example, if we have a current compute program that uses SLM, but are
>>> using the 3D pipeline, we'll get a partition that includes SLM even
>>> for the 3D pipe.
>>>
>> The intention of this patch is not to provide an optimal heuristic, but
>> to implement a simple heuristic that calculates conservative defaults in
>> order to guarantee functional correctness.
>> It would be possible to base
>> the result on the currently active pipeline with minimal changes to this
>> function (and making sure that the L3 config atom is invalidated while
>> switching pipelines), but I don't think we want to switch back and forth
>> between SLM and non-SLM configurations if the application interleaves
>> draw and compute operations, because the L3 partitioning is global
>> rather than per-pipeline and transitions are expensive -- switching
>> between L3 configurations requires a full pipeline stall and flushing and
>> invalidation of all L3-backed caches, so doing what you suggest would
>> likely lead to cache thrashing and might prevent pipelining of compute
>> and render workloads [at least on BDW+ -- on Gen7 it might be impossible
>> to achieve pipelining of render and compute workloads due to hardware
>> issues].  If the result of the heuristic is based on the currently
>> active pipeline some sort of hysteresis will likely be desirable, which
>> in this patch I get for free by basing the calculation on the context
>> state rather than on the currently active pipeline.  I agree with you
>> that we'll probably need a more sophisticated heuristic in the future,
>> but that falls outside the scope of this series -- if anything, we need
>> to be able to run benchmarks exercising CS and SLM before we can find
>> out what kind of heuristic we want and what degree of hysteresis is
>> desirable.
>
> Switching between the compute and 3D pipelines (PIPELINE_SELECT) also
> requires flush+stall followed by invalidate PIPE_CONTROLs.  It's
> similar to the BRW_NEW_BATCH condition that's already in there in that
> it makes reprogramming the L3 partition free.
>
The stall is only necessary on Gen7; since Gen8 it's not strictly
required AFAIK, and even if you do stall there's no need to drain any
L3-backed caches.

> This series is already way past what I would call a simple and correct
> solution.  If you want 'simple and correct' we can scale this back to
> something that just enables DC and SLM when needed and do all the
> weighting and heuristics when we have more data.  On the other hand,
> if you think the weighting machinery here is required, I don't think
> you can argue that evaluating 3D and compute L3 requirements
> separately is too complicated.
>
I didn't mean that making the heuristic depend on the currently active
pipeline the naive way would be too complicated; as I said, the changes
to this patch would be minimal.  I'm just saying that I have no evidence
that it would be an optimization rather than a pessimization --
otherwise I'd be glad to implement it.

> Kristian
>
>>> Kristian
>>>
>>>> +   for (unsigned i = 0; i < ARRAY_SIZE(stage_states); i++) {
>>>> +      const struct gl_shader_program *prog =
>>>> +         brw->ctx._Shader->CurrentProgram[stage_states[i]->stage];
>>>> +      const struct brw_stage_prog_data *prog_data = stage_states[i]->prog_data;
>>>> +
>>>> +      needs_dc |= (prog && prog->NumAtomicBuffers) ||
>>>> +                  (prog_data && (prog_data->total_scratch ||
>>>> +                                 prog_data->nr_image_params));
>>>> +      needs_slm |= prog_data && prog_data->total_shared;
>>>> +   }
>>>> +
>>>> +   return get_default_l3_weights(brw->intelScreen->devinfo,
>>>> +                                 needs_dc, needs_slm);
>>>> +}
>>>> +
>>>> +/**
>>>>  * Program the hardware to use the specified L3 configuration.
>>>>  */
>>>> static void
>>>> --
>>>> 2.6.2
>>>>
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev