vec4: adding vec4_cmod_propagation optimization

Alejandro Piñeiro Wed, 14 Oct 2015 04:57:30 -0700

On 14/10/15 10:15, Francisco Jerez wrote:
> Alejandro Piñeiro <apinhe...@igalia.com> writes:
>
>> On 13/10/15 23:36, Matt Turner wrote:
>>> On Tue, Oct 13, 2015 at 1:49 AM, Alejandro Piñeiro <apinhe...@igalia.com> 
>>> wrote:
>>>> On 13/10/15 03:10, Matt Turner wrote:
>>>>> Looks like this is causing an intermittent failure on HSW in our
>>>>> Jenkins system (but I'm not able to reproduce locally) and a
>>>>> consistent failure on G45. I'll investigate that.
>>>> Ok, will hold on, and meanwhile I will try to reproduce the problem on
>>>> HSW. Unfortunately I don't have any G45 available, so I will not be able
>>>> to help on that. Thanks for taking a look there.
>>> Okay, I think I see what's going on:
>>>
>>> --- good        2015-10-13 13:34:40.753357253 -0700
>>> +++ bad 2015-10-13 13:36:06.114290094 -0700
>>> -and(8)          g6<1>.xD        g6<4,4,1>.xD    1D              { align16 
>>> };
>>> -cmp.nz.f0(8)    null            -g6<4,4,1>.xD   0D              { align16 
>>> };
>>> +and.nz.f0(8)    g6<1>.xD        g6<4,4,1>.xD    1D              { align16 
>>> };
>>>
>>> We're propagating a .nz conditional mod from a CMP with a null dest
>>> (that has a writemask of .xyzw) to an AND that has a writemask of only
>>> .x, so only the .x channel of the flag is now being updated.
>> That is really common. And that is the idea behind the first patch of
>> the series that changes how if's are emitted, and allowing to optimize
>> if scan_inst->writemask is WRITEMASK_X. I will use the piglit test
>> glsl-1.30/execution/switch/vs-pervertex.shader_test to explain this.
>>
>> If we use current master, we get the following assembly:
>> 1: cmp.z.f0(8)     g20<1>.xD       g12<4,4,1>.xD   g19<4,4,1>.xD   {
>> align16 1Q compacted };
>> 2: mov.nz.f0(8)    null            g20<4,4,1>.xD                   {
>> align16 1Q };
>> 3: (+f0) if(8)     JIP: 6          UIP: 6                          {
>> align16 1Q };
>>
>> If we just simplify out the mov.nz at #2, the test would fail, because
>> as in the example you mentioned, null has a XYZW writemask, but g19 a
>> writemask X. So this optimized version, f0 only have x updated, and #3
>> is doing a normal if, that uses all the channels.
>>
>> But right now the if just needs to check against one component, the x.
>> So that is the reason (among others) that my patch 1 changes how if's
>> are emitted. So, using both my patches 1 and 2, the assembly outcome
>> would be the following:
>> 1: cmp.z.f0(8)     g20<1>.xD       g12<4,4,1>.xD   g19<4,4,1>.xD   {
>> align16 1Q compacted };
>> 2: (+f0.x) if(8)   JIP: 6          UIP: 6                          {
>> align16 1Q };
>>
>> It is true that now #1 only updates the x channel. But the if at #2 only
>> use that channel now, so the test still passes.
>>
>>> I think for now the thing to do is add
>>>
>>>    (inst->dst.writemask & ~scan_inst->dst.writemask) != 0)
>> This would bail out basically all if's.
>>
> I agree with Matt that this check is necessary, otherwise
> cmod_propagation will replace the pair of instruction with another
> instruction which doesn't have equivalent semantics.


Ok.

> If you still want
> cases like the one you pasted below involving an if conditional that
> only reads the x components to be simplified, I suggest you have a look
> at dead code elimination and fix it so that writemask bits of an
> instruction that writes a flag register (e.g. the "mov.nz.f0") are
> turned off for components which are determined to be dead, kind of like
> is done already for the vec4 components of GRFs.
>
> With that change the assembly from your example would look like this
> after DCE:
>
> | 1: cmp.z.f0(8)     g20<1>.xD       g12<4,4,1>.xD   g19<4,4,1>.xD     { 
> align16 1Q compacted };
> | 2: mov.nz.f0(8)    null.x          g20<4,4,1>.xD                     { 
> align16 1Q };
> | 3: (+f0.x) if(8)   JIP: 6          UIP: 6                            { 
> align16 1Q };
>
> Which cmod propagation would be allowed to simplify further.

Ok, thanks, that sounds like a good idea. Will work on that, and send a
new patch to this series.

>
>> So guessing about what is happening here, as I mentioned, we are
>> allowing to pass that condition if scan_inst->dst.writemask equal to
>> WRITEMASK_X assuming that they come from an if, so we don't need to
>> update all f0 channels. But as Jason mentioned in our off-list chat,
>> that would not be the case if we are dealing with a sel. So something
>> like this:
>> 1: cmp.z.f0(8)     g20<1>.xD       g12<4,4,1>.xD   g19<4,4,1>.xD   {
>> align16 1Q compacted };
>> 2: mov.nz.f0(8)    null            g20<4,4,1>.xD                   {
>> align16 1Q };
>> 3: (+f0.x) if(8)     JIP: 6          UIP: 6                          {
>> align16 1Q };
>>
>> #2 can be optimized out. But if we replace that if for an sel at #3, we
>> can't.
>>
>> If that is the case, in addition to check scan_inst and inst, we would
>> need to check in which kind of instruction the flag register is used later.
>>
>> But as I said, Im just guessing until I see the full failing test.
>>
>>> to the list of conditions under which we bail out. That is, if the
>>> instruction we want to propagate the cmod onto writes fewer channels,
>>> we can't do the optimization.
>>>
>>> With that, the block should look like:
>>>
>>>      if ((scan_inst->predicate && scan_inst->opcode != BRW_OPCODE_SEL) ||
>>>          scan_inst->dst.reg_offset != inst->src[0].reg_offset ||
>>>          (scan_inst->dst.writemask != WRITEMASK_X &&
>>>           scan_inst->dst.writemask != WRITEMASK_XYZW) ||
>>>          (scan_inst->dst.writemask == WRITEMASK_XYZW &&
>>>           inst->src[0].swizzle != BRW_SWIZZLE_XYZW) ||
>>>          (inst->dst.writemask & ~scan_inst->dst.writemask) != 0)
>>>         break;
>>>
>>> The good news is that, unless I've done something wrong, this doesn't
>>> affect any shaders in shader-db on ILK or HSW (I only tested those
>>> two, but I expect the results are the same everywhere). Apparently
>>> this is a pretty rare case.
>> Are you sure? I have made a run adding your condition, and now comparing
>> master vs having the optimization I get this:
>> total instructions in shared programs: 6240631 -> 6240471 (-0.00%)
>> instructions in affected programs:     18965 -> 18805 (-0.84%)
>> helped:                                160
>> HURT:                                  0
>>
>> That is a really small gain. Or put in other way, if we compare the
>> conditions I have on the original patches vs adding the condition you
>> are proposing, I get this:
>>
>> total instructions in shared programs: 6223900 -> 6240471 (0.27%)
>> instructions in affected programs:     477537 -> 494108 (3.47%)
>> helped:                                0
>> HURT:                                  3047
>>
>>> With that change, my R-b still stands, though we should have a unit
>>> test for this case as well in 3/5.
Taking into account Francisco's email, I will add that condition to the
cmod propagation optimization, adding your review, and I will update the
unit tests handling this cases.

BR


-- 
Alejandro Piñeiro (apinhe...@igalia.com)

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev

Re: [Mesa-dev] [PATCH 2/5] i965/vec4: adding vec4_cmod_propagation optimization

Reply via email to