fs: Add reuse_16bit_conversions_register optimization

Francisco Jerez Fri, 25 Aug 2017 11:48:54 -0700

Alejandro Piñeiro <apinhe...@igalia.com> writes:

> On 24/08/17 21:07, Connor Abbott wrote:
>>
>> Hi Alejandro,
>
> Hi Connor,
>
>>
>> This seems really suspicious. If the live ranges are really
>> independent, then the register allocator should be able to assign the
>> two virtual registers to the same physical register if it needs to.
>
> Yes, it is true, the register allocator should be able to assign two
> virtual registers to the same physical register. But that is done at the
> end (or really near the end), so late for the problem this optimization
> is trying to fix. We are also reducing the amount of instructions used.
>
> Probably not really clear on the commit message. When I say "reduce the
> pressure of the register allocator" I mean having a code that the
> register allocator would be able to handle without using too much time.
> The problem this optimization tries to solve is that for some 16 bit CTS
> tests (some with matrices and geometry shaders), the amount of virtual
> registers used and instructions was really big. For the record,
> initially, some tests needed 24 min just to compile. Right now, thanks
> to other optimizations, the slower test without this optimization needs
> 1min 30 seconds. Adding some hacky timestamps, the time used  at
> fs_visitor::allocate_registers (brw_fs.cpp:6096) is:
>
> * While trying to schedule using the three available pre mode
> heuristics: 7 seconds
> * Allocation with spilling: 63 seconds
> * Final schedule using SCHEDULE_POST: 19 seconds
>
> With this optimization, the total time goes down to 14 seconds (10 + 0 +
> 3 on the previous bullet point list).
>
> One could argue that 1min 30 seconds is okish. But taking into account
> that it goes down to 14 seconds, even with some caveats (see below), I
> still think that it is worth to use the optimization.
>
> And a final comment. For that same test, this is the final stats (using
> INTEL_DEBUG):
>
>  * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
> 130320 cycles. 15:9 spills:fills.
>  * Without the optimization: SIMD8 shader: 12312 instructions. 0 loops.
> 174816 cycles. 751:1851 spills:fills.
>
>> This change forces the two to be the same, which constrains the
>> register allocator unecessarily and should make it worse, so I'm
>> confused as to why this would help at all.
>
> I didn't check that issue specifically, but I recently found that this
> optimization affects copy propagation/dead code eliminate. So there are
> still some room for improvement. But in any case, take into account that
> this custom optimization is only used if there is a 32 to 16 bit
> conversion, so only affects shaders with this specific feature.
>
>>
>> IIRC there were some issues where we unnecessarily made the sources
>> and destination of an instruction interefere with each other, but if
>> that's what's causing this, then we should fix that underlying issue.
>>
>> (From what I remember, a lot of SIMD16 were expanded to SIMD8 in the
>> generator, in which case the second half of the source is read after
>> the first half of the destination is written, and we falsely thought
>> that the HW did that too, so we had some code to add a fake
>> interference between them, but a while ago Curro moved the expansion
>> to happen before register allocation. I don't have the code in front
>> of me, but I think we still have this useless code lying around, and I
>> would guess this is the source of the problem.)
>
> Taking into account what I explained before, I don't think that the
> problem is the interference or this code you mention (although perhaps
> Im wrong).
>


I agree with Connor's feed-back on this change, this really smells like
a hack working around register allocator brokenness.  If the register
allocator is failing to assign two variables with disjoint live ranges
to the same register it has a bug.  If you forcefully merge the live
ranges of source and destination it might turn out that that wasn't the
optimal decision to take after all register pressure-wise (because it's
frequently harder to find room in the GRF for a variable with 2x the
live range than for two independent variables), so you will be
pessimizing register usage in some cases -- The register allocator
should know better than you.

>> hhhh
>> Connor
>>
>> On Aug 24, 2017 2:59 PM, "Alejandro Piñeiro" <apinhe...@igalia.com
>> <mailto:apinhe...@igalia.com>> wrote:
>>
>>     When dealing with HF/U/UW, it is usual having a register with a
>>     F/D/UD, and then convert it to HF/U/UW, and not use again the F/D/UD
>>     value. In those cases it would be possible to reuse the register where
>>     the F value is initially stored instead of having two. Take also into
>>     account that when operating with HF/U/UW, you would need to use the
>>     full register (so stride 2). Packs/unpacks would be only useful when
>>     loading/storing several HF/W/UW.
>>
>>     Note that no instruction is removed. The main benefict is reducing the
>>     amoung of registers used, so the pressure on the register allocator is
>>     decreased with big shaders.
>>
>>     Possibly this could be integrated into an existing optimization, at it
>>     is even already done by the register allocator, but was far easier to
>>     write and cleaner to read as a separate optimization.
>>
>>     We found this issue when dealing with some Vulkan CTS tests that
>>     needed several minutes to compile. Most of the time was spent on the
>>     register allocator.
>>
>>     Right now the optimization only handles 32 to 16 bit conversion. It
>>     could be possible to do the equivalent for 16 to 32 bit too, but in
>>     practice, we didn't need it.
>>     ---
>>      src/intel/compiler/brw_fs.cpp | 77
>>     +++++++++++++++++++++++++++++++++++++++++++
>>      src/intel/compiler/brw_fs.h   |  1 +
>>      2 files changed, 78 insertions(+)
>>
>>     diff --git a/src/intel/compiler/brw_fs.cpp
>>     b/src/intel/compiler/brw_fs.cpp
>>     index b6013a5ce85..1342150b44e 100644
>>     --- a/src/intel/compiler/brw_fs.cpp
>>     +++ b/src/intel/compiler/brw_fs.cpp
>>     @@ -39,6 +39,7 @@
>>      #include "compiler/glsl_types.h"
>>      #include "compiler/nir/nir_builder.h"
>>      #include "program/prog_parameter.h"
>>     +#include "brw_fs_live_variables.h"
>>
>>      using namespace brw;
>>
>>     @@ -3133,6 +3134,81 @@ fs_visitor::remove_extra_rounding_modes()
>>         return progress;
>>      }
>>
>>     +/**
>>     + * When dealing with HF/W/UW, it is usual having a register with
>>     a F/D/UD, and
>>     + * then convert it to HF/W/UW, and not use again the F/D/UD
>>     value. In those
>>     + * cases it would be possible to reuse the register where the F
>>     value is
>>     + * initially stored instead of having two. Take also into account
>>     that when
>>     + * operating with HF/W/UW, you would need to use the full
>>     register (so stride
>>     + * 2). Packs/unpacks would be only useful when loading/storing
>>     several
>>     + * HF/W/UWs.
>>     + *
>>     + * So something like this:
>>     + *  mov(8) vgrf14<2>:HF, vgrf39:F
>>     + *
>>     + * Became:
>>     + *  mov(8) vgrf39<2>:HF, vgrf39:F
>>     + *
>>     + * Note that no instruction is removed. The main benefict is
>>     reducing the
>>     + * amoung of registers used, so the pressure on the register
>>     allocator is
>>     + * decreased with big shaders.
>>     + */
>>     +bool
>>     +fs_visitor::reuse_16bit_conversions_vgrf()
>>     +{
>>     +   bool progress = false;
>>     +   int ip = -1;
>>     +
>>     +   calculate_live_intervals();
>>     +
>>     +   foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
>>     +      ip++;
>>     +
>>     +      if (inst->dst.file != VGRF || inst->src[0].file != VGRF)
>>     +         continue;
>>     +
>>     +      if (inst->opcode != BRW_OPCODE_MOV)
>>     +         continue;
>>     +
>>     +      if (type_sz(inst->dst.type) != 2 || inst->dst.stride != 2 ||
>>     +          type_sz(inst->src[0].type) != 4 || inst->src[0].stride
>>     != 1) {
>>     +         continue;
>>     +      }
>>     +
>>     +      int src_reg = inst->src[0].nr;
>>     +      int src_offset = inst->src[0].offset;
>>     +      unsigned src_var = live_intervals->var_from_vgrf[src_reg];
>>     +      int src_end = live_intervals->end[src_var];
>>     +      int dst_reg = inst->dst.nr <http://dst.nr>;
>>     +
>>     +      if (src_end > ip)
>>     +         continue;
>>     +
>>     +      foreach_block_and_inst(block, fs_inst, scan_inst, cfg) {
>>     +         if (scan_inst->dst.file == VGRF &&
>>     +             scan_inst->dst.nr <http://dst.nr> == dst_reg) {
>>     +            scan_inst->dst.nr <http://dst.nr> = src_reg;
>>     +            scan_inst->dst.offset = src_offset;
>>     +            progress = true;
>>     +         }
>>     +
>>     +         for (int i = 0; i < scan_inst->sources; i++) {
>>     +            if (scan_inst->src[i].file == VGRF &&
>>     +                scan_inst->src[i].nr == dst_reg) {
>>     +               scan_inst->src[i].nr = src_reg;
>>     +               scan_inst->src[i].offset = src_offset;
>>     +               progress = true;
>>     +            }
>>     +         }
>>     +      }
>>     +   }
>>     +
>>     +   if (progress)
>>     +      invalidate_live_intervals();
>>     +
>>     +   return progress;
>>     +}
>>     +
>>
>>      static void
>>      clear_deps_for_inst_src(fs_inst *inst, bool *deps, int first_grf,
>>     int grf_len)
>>     @@ -5829,6 +5905,7 @@ fs_visitor::optimize()
>>
>>         OPT(opt_drop_redundant_mov_to_flags);
>>         OPT(remove_extra_rounding_modes);
>>     +   OPT(reuse_16bit_conversions_vgrf);
>>
>>         do {
>>            progress = false;
>>     diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h
>>     index b9476e69edb..2685f748b87 100644
>>     --- a/src/intel/compiler/brw_fs.h
>>     +++ b/src/intel/compiler/brw_fs.h
>>     @@ -151,6 +151,7 @@ public:
>>         bool dead_code_eliminate();
>>         bool remove_duplicate_mrf_writes();
>>         bool remove_extra_rounding_modes();
>>     +   bool reuse_16bit_conversions_vgrf();
>>
>>         bool opt_sampler_eot();
>>         bool virtual_grf_interferes(int a, int b);
>>     --
>>     2.11.0
>>
>>     _______________________________________________
>>     mesa-dev mailing list
>>     mesa-dev@lists.freedesktop.org <mailto:mesa-dev@lists.freedesktop.org>
>>     https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>     <https://lists.freedesktop.org/mailman/listinfo/mesa-dev>
>>
>>
>>
>
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev

signature.asc
Description: PGP signature

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

Re: [Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization

Reply via email to