On 12.06.2017 14:34, Gert Wollny wrote:
Am Montag, den 12.06.2017, 12:28 +0200 schrieb Nicolai Hähnle:
On 10.06.2017 01:15, Gert Wollny wrote:

Plenty of style issues aside, can you explain where and why you get
tighter lifetimes?

In the original code if a temporary is used within a loop it gets the
whole life time of the loop assigned.

With this patch I track in more detail where a temporary is accesses
and base the lifetime on this: For instance, if a variable is first
unconditionally written and later read for the last time in the same
scope (loop, if, or switch branch), then the lifetime can be restricted
to that first-written - last-read range.

The code gets more complex because it tries to resolve this also for
nested scopes, and one also has to take care about whether a variable
is written only conditionally within a loop, or conditionally read
before it is written (in the source code sense, but not in the program
flow sense).

Shaders that profit from this better lifetime estimation are the ones
that have many short living values within long loops, think

for () {
   float4 t[1] = f(in2);
   float4 t[2] = g(in1);
   float4 t[3] = op(t[1], t[2]);
    ...
   sum += t[200];
}
Here the old code would keep all of the t[i] alive for the whole loop,
and in fact with the GpuTest Piano benchmark I have seen a shader with
2000 temporaries where many were used in a loop but only required for
two or three lines so that my code could merge them to less then 100
temporaries, and this made it possible for the tgsi to bytecode layer
in r600g to actually translate the shader.

Okay. I think you should seriously re-think your algorithm in a way that makes it a more natural evolution from the algorithm that's already there.

Basically, the current algorithm tracks (first_write, last_read), so think about what you need to track in order to obtain a single-pass algorithm that computes lifetime (first, last) for every temporary. I think the following should do it for a first cut:

   struct st_calculate_lifetime_state {
      /* First and last instruction that has this temporary */
      unsigned first;
      unsigned last;

      /* First instruction of the outer-most in-active loop that
       * contains this temporary. (A loop is active if we're
       * currently processing instructions in it.)
      unsigned loop_first;

      /* Position of a read without preceding dominating write. */
      unsigned undef_read[4];

      /* First write in the program that is dominating our
       * current position, per channel.
       */
      unsigned first_dominating_write[4];
   };

In addition, you need to keep a stack of active scopes (loops and ifs), but you really only need to remember the start of the scope (and for loops, probably the position of the first BREAK).

Here's a sketch of the "state machine" that you need to run while traversing the program, assuming no BRK and CONT:

Init: last = 0, everything else = ~0

These are updates on individual variables on use:

On any use (source or dest):
- if first > cur_pos, set first = loop_first = cur_pos
- if loop_first < first, set first = loop_first
- update last

On use as source:
- if first_dominating_write > cur_pos and undef_read > cur_pos, set undef_read = cur_pos

On use as dest:
- if first_dominating_write > cur_pos, set first_dominating_write = cur_pos

These are updates of all temporaries on scope change:

On ENDLOOP, for all temps:
- if loop_first > start of loop, set loop_first = start of loop
- first < start of loop, update last to end of loop
- if undef_read between start and end of loop: set first = MIN(first, start of loop) and last = end of loop
- if first_dominating_write < end of loop, set undef_read = ~0

On ELSE and ENDIF, for all temps:
- if first_dominating_write > start of scope, set first_dominating_write = ~0

I'm not sure right now whether BREAK / CONT need special treatment at all. I think what you need is:

On ENDLOOP:
- if first_dominating_write is between the first BREAK in the loop and the end of the loop, set first_dominating_write = ~0

And for CONT, you probably don't really need anything, because CONTs cannot make you skip code forever.

What this state machine doesn't yet cover is

IF ..
  MOV TEMP[0], ...
ELSE
  MOV TEMP[0], ...
ENDIF

Still, I'd start with it and see whether you need to cover that case.

And even that case can probably be dealt with in a fairly efficient and pragmatic way. The idea is to keep track of the nesting level of IFs, plus an

   uint32 dominating_write_in_true_block[4];

per temp. Then:

On ELSE:
- if first_dominating_write < cur_pos, set the bit corresponding to the current nesting level in dominating_write_in_true_block

On ENDIF:
- don't reset first_dominating_write if the bit corresponding to the current nesting level in dominating_write_in_true_block is set
- unconditionally clear that bit

It won't cover cases with nesting level > 32, but do we really care?

I hope I didn't miss anything, because after all this is admittedly subtle stuff. Still, I think this kind of state-machine approach should work and allow you to avoid *lots* of allocations and pointer-chasing.

Cheers,
Nicolai



Best,
Gert

PS: Regarding style, I am fully aware that I have to iterate over this
code a few more times. I tried to adhere to the way the existing code
represents itself to me, but I'm happy to listen to any advise I can
get.

In any case, I though it might be best to send this patch out now for
discussion. Now, with the unit tests in place I will rework it and
focus also more on style questions. One thing that comes up immediately
up is that I will try to reduce the use of dynamically allocated
memory, since 60% of the run time of my code is with memory allocation
and de-allocation.




--
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

Reply via email to