vec4: Don't unspill the same register in consecutive instructions

Francisco Jerez Fri, 07 Aug 2015 04:15:26 -0700

Iago Toral <ito...@igalia.com> writes:

> On Thu, 2015-08-06 at 18:27 +0300, Francisco Jerez wrote:
>> Iago Toral Quiroga <ito...@igalia.com> writes:
>> 
>> > If we have spilled/unspilled a register in the current instruction, avoid
>> > emitting unspills for the same register in the same instruction or 
>> > consecutive
>> > instructions following the current one as long as they keep reading the 
>> > spilled
>> > register. This should allow us to avoid emitting costy unspills that come 
>> > with
>> > little benefit to register allocation.
>> >
>> > Also, update evaluate_spill_costs so that we account for the saved 
>> > unspills.
>> > ---
>> >  .../drivers/dri/i965/brw_vec4_reg_allocate.cpp     | 129 
>> > +++++++++++++++++++--
>> >  1 file changed, 121 insertions(+), 8 deletions(-)
>> >
>> > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp 
>> > b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>> > index 617c988..fed5f4d 100644
>> > --- a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>> > +++ b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>> > @@ -264,6 +264,95 @@ vec4_visitor::reg_allocate()
>> >     return true;
>> >  }
>> >  
>> > +/**
>> > + * When we decide to spill a register, instead of blindly spilling every 
>> > use,
>> > + * save unspills when the spill register is used (read) in consecutive
>> > + * instructions. This can potentially save a bunch of unspills that would
>> > + * have very little impact in register allocation anyway.
>> > + *
>> > + * Notice that we need to account for this behavior when spilling a 
>> > register
>> > + * and when evaluating spilling costs. This function is designed so it can
>> > + * be called from both places and avoid repeating the logic.
>> > + *
>> > + *  - When we call this function from spill_reg, we pass in scratch_reg 
>> > the
>> > + *    actual unspill/spill register that we want to reuse in the current
>> > + *    instruction.
>> > + *
>> > + *  - When we call this from evaluate_spill_costs, we pass the register 
>> > for
>> > + *    which we are evaluating spilling costs.
>> > + *
>> > + * In either case, we check if the previous instructions read scratch_reg 
>> > until
>> > + * we find an instruction that writes to it (in which case we can reuse
>> > + * scratch_reg as long as the writemask is compatible with the channels 
>> > we need
>> > + * to read in the current instruction) or we hit an instruction that does 
>> > not
>> > + * read scratch_reg at all. The latter can only happen when we call this 
>> > from
>> > + * evaluate_spill_costs,
>> 
>> Strictly speaking it can also happen when called from spill_reg() for
>> the first time in a given sequence of consecutive instructions (in which
>> case you correctly return false).
>
> True, I'll fix the comment.
>
>> >  and means that this is the point at which we first
>> > + * need the unspill this register for our current instruction. Since all 
>> > our
>> > + * unspills read a full vec4, we know that in this case we will have all
>> > + * the channels available in scratch_reg and we can reuse it.
>> > + *
>> > + * In any other case, we can't reuse scratch_reg in the current 
>> > instruction,
>> > + * meaning that we will need to unspill it.
>> > + */
>> > +static bool
>> > +can_use_scratch_for_source(const vec4_instruction *inst, unsigned i,
>> > +                           unsigned scratch_reg)
>> > +{
>> > +   assert(inst->src[i].file == GRF);
>> > +
>> > +   /* If the current instruction is already using scratch_reg in src[n] 
>> > with
>> > +    * n < i, then we know we can reuse it for src[i] too.
>> > +    */
>> > +   for (unsigned n = 0; n < i; n++) {
>> > +      if (inst->src[n].file == GRF && inst->src[n].reg == scratch_reg)
>> > +         return true;
>> > +   }
>> 
>> I don't think this is correct in cases where the previous source reused
>> the temporary of a previously spilled register with incompatible
>> writemask.  You probably want to handle the current instruction
>> consistently with the previous ones, i.e. as part of the loop below.
>
> Yes, you're right.
>
>> I suggest you define a variable (e.g. n as you've called it) initially
>> equal to i that would determine the number of sources to check for the
>> next instruction.  At the end of the loop body it would be re-set to 3,
>> what would also cause the destination registers to be checked in
>> subsequent iterations.
>
> Sounds good to me.
>
>> > +
>> > +   bool prev_inst_read_scratch_reg = false;
>> > +   vec4_instruction *prev_inst = (vec4_instruction *) inst->prev;
>> 
>> You can move this declaration into the init statement of the for loop to
>> limit its scope.
>
> Ok.
>
>> > +   for (; !prev_inst->is_head_sentinel();
>> > +        prev_inst = (vec4_instruction *) prev_inst->prev) {
>> > +      /* If any previous instruction does not read from or write to 
>> > scratch_reg
>> > +       * inconditonally we cannot reuse scratch_reg
>> > +       */
>> > +      if (prev_inst->predicate && prev_inst->opcode != BRW_OPCODE_SEL)
>> > +         return false;
>> 
>> I think this is somewhat pessimistic, register fills for a predicated
>> instruction won't be predicated AFAIK, so it should be possible to reuse
>> them, only the destination of a predicated write cannot be reused.
>
> Yeah, makes sense.
>
>> > +
>> > +      /* If the previous instruction writes to scratch_reg then we can 
>> > reuse
>> > +       * it if the channels we wrote are compatible with our read mask.
>> > +       */
>> > +      if (prev_inst->dst.file == GRF && prev_inst->dst.reg == 
>> > scratch_reg) {
>> > +         return (brw_mask_for_swizzle(inst->src[i].swizzle) &
>> > +                 ~prev_inst->dst.writemask) == 0;
>> > +     }
>> > +
>> > +      if (prev_inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_WRITE ||
>> > +          prev_inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_READ)
>> > +         continue;
>> > +
>> I'm not sure I see why you need to treat scratch reads and writes
>> specially here.  AFAICT if you come across one for the same scratch_reg
>> it won't make a difference for the code below, and if you come across
>> one for a different register the condition you wanted to check (the same
>> register is reused for a number of consecutive instructions) may no
>> longer hold, so you may as well return.
>
> I had my doubts about that, but the reason was the following. Say we
> have this code:
>
> r1 = r2 + r3
> r4 = r2 * r3
>
> In theory, if r2 and r3 were to be spilled we would want to reuse the
> scratch registers in the second instruction, but the truth is that if we
> spill them we end up with code like this:
>
> r2 = scratch_load(offset_r2)
> r3 = scratch_load(offset_r3)
> r1 = r2 + r3 
> r4 = r2 + r3
>
> When we call this function for the instruction that writes r1, it will
> see that the previous instruction uses r3, but _not_ r2, and decide that
> it can't use the scratch result for r2 triggering a new load.


Hm, I'm not following.  Say we first decide to spill r2, the result will
be:

| r2 = scratch_load(offset_r2)
| r1 = r2 + r3 
| r4 = r2 + r3

If we then decide to spill r3 your code would notice that it's read from
in two consecutive instructions and then go on and reuse it regardless
of whether scratch reads and writes are treated specially or not, like:

| r2 = scratch_load(offset_r2)
| r3 = scratch_load(offset_r3)
| r1 = r2 + r3 
| r4 = r2 + r3

> This happens because each scratch load only pulls one register, so as
> soon as we inject these in the code we are effectively limiting the
> optimization to just one register.
>
> It can get worse. If we spill r1 before these, we would have:
>
> r2 = scratch_load(offset_r2)
> r3 = scratch_load(offset_r3)
> r1 = r2 + r3 
> write(r1, offset_r1)
> r4 = r2 + r3
>

Ah, this seems like a better example.  In this case r2 and/or r3 will be
reused or not depending on the order in which the spills are done, what
indeed is slightly disconcerting -- but note that in all such cases the
original assumption you started from will be violated (the temporary is
re-used if it was already read or written in a sequence of consecutive
instructions).  That said I don't really have any objection against
special-casing scratch reads and writes, I just wanted to understand why
you had decided to break your own rule. ;)

> Now the instruction that writes to r4 can't reuse neither r2 nor r3
> because the scratch write in between won't read any of them.
>
>> > +      /* Check that the previous instruction reads scratch_reg, if so, 
>> > continue
>> > +       * looping. Otherwise it means that we got here from
>> > +       * evaluate_spill_costs and this is the point at which we will emit 
>> > an
>> > +       * unspill for the register passed in scratch_reg, in which case we 
>> > can
>> > +       * only reuse it if all other instructions in between have read
>> > +       * scratch_reg.
>> > +       */
>> > +      int n;
>> > +      for (n = 0; n < 3; n++) {
>> > +         if (prev_inst->src[n].file == GRF &&
>> > +             prev_inst->src[n].reg == scratch_reg) {
>> > +            prev_inst_read_scratch_reg = true;
>> > +            break;
>> > +         }
>> > +      }
>> > +      if (n == 3) {
>> > +         return prev_inst_read_scratch_reg;
>> > +      }
>> > +   }
>> > +
>> > +   return false;
>> 
>> Shouldn't this return prev_inst_read_scratch_reg?
>
> Yes.
>
>> > +}
>> > +
>> >  void
>> >  vec4_visitor::evaluate_spill_costs(float *spill_costs, bool *no_spill)
>> >  {
>> > @@ -281,9 +370,15 @@ vec4_visitor::evaluate_spill_costs(float 
>> > *spill_costs, bool *no_spill)
>> >     foreach_block_and_inst(block, vec4_instruction, inst, cfg) {
>> >        for (unsigned int i = 0; i < 3; i++) {
>> >           if (inst->src[i].file == GRF) {
>> > -            spill_costs[inst->src[i].reg] += loop_scale;
>> > -            if (inst->src[i].reladdr)
>> > -               no_spill[inst->src[i].reg] = true;
>> > +            /* We will only unspill src[i] it it wasn't unspilled for the
>> > +             * previous instruction, in which case we'll just reuse the 
>> > scratch
>> > +             * reg for this instruction.
>> > +             */
>> > +            if (!can_use_scratch_for_source(inst, i, inst->src[i].reg)) {
>> > +               spill_costs[inst->src[i].reg] += loop_scale;
>> > +               if (inst->src[i].reladdr)
>> > +                  no_spill[inst->src[i].reg] = true;
>> > +            }
>> >           }
>> >        }
>> >  
>> > @@ -342,19 +437,37 @@ vec4_visitor::spill_reg(int spill_reg_nr)
>> >     unsigned int spill_offset = last_scratch++;
>> >  
>> >     /* Generate spill/unspill instructions for the objects being spilled. 
>> > */
>> > +   int scratch_reg = -1;
>> >     foreach_block_and_inst(block, vec4_instruction, inst, cfg) {
>> > +      /* We don't spill registers used for scratch */
>> > +      if (inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_READ ||
>> > +          inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_WRITE)
>> > +         continue;
>> > +
>> For that reason you'll never see a match in the loop below for scratch
>> reads and writes, and it should be a harmless no-op.
>
> I figured we would rather skip loop for the 3 srcs and the check for the
> dst in this case since we already know that we won't see a match. I can
> remove the check if you think it is not worth it though.
>
>
>> >        for (unsigned int i = 0; i < 3; i++) {
>> >           if (inst->src[i].file == GRF && inst->src[i].reg == 
>> > spill_reg_nr) {
>> > -            src_reg spill_reg = inst->src[i];
>> > -            inst->src[i].reg = alloc.allocate(1);
>> > -            dst_reg temp = dst_reg(inst->src[i]);
>> > -
>> > -            emit_scratch_read(block, inst, temp, spill_reg, spill_offset);
>> > +            if (scratch_reg == -1 ||
>> > +                !can_use_scratch_for_source(inst, i, scratch_reg)) {
>> > +               /* We need to unspill anyway so make sure we read the full 
>> > vec4
>> > +                * in any case. This way, the cached register can be reused
>> > +                * for consecutive instructions that read different 
>> > channels of
>> > +                * the same vec4.
>> > +                */
>> > +               scratch_reg = alloc.allocate(1);
>> > +               src_reg temp = inst->src[i];
>> > +               temp.reg = scratch_reg;
>> > +               temp.swizzle = BRW_SWIZZLE_XYZW;
>> > +               emit_scratch_read(block, inst,
>> > +                                 dst_reg(temp), inst->src[i], 
>> > spill_offset);
>> > +            }
>> > +            assert(scratch_reg != -1);
>> > +            inst->src[i].reg = scratch_reg;
>> >           }
>> >        }
>> >  
>> >        if (inst->dst.file == GRF && inst->dst.reg == spill_reg_nr) {
>> >           emit_scratch_write(block, inst, spill_offset);
>> > +         scratch_reg = inst->dst.reg;
>> >        }
>> >     }
>> >  
>> > -- 
>> > 1.9.1

signature.asc
Description: PGP signature

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev

Re: [Mesa-dev] [PATCH v3 4/4] i965/vec4: Don't unspill the same register in consecutive instructions

Reply via email to