[PING**4] [PATCH, ARM] correctly encode the CC reg data flow

Bernd Edlinger Thu, 01 Jun 2017 09:01:53 -0700

Ping...


On 05/12/17 18:49, Bernd Edlinger wrote:
> Ping...
> 
> On 04/29/17 19:21, Bernd Edlinger wrote:
>> Ping...
>>
>> On 04/20/17 20:11, Bernd Edlinger wrote:
>>> Ping...
>>>
>>> for this patch:
>>> https://gcc.gnu.org/ml/gcc-patches/2017-01/msg01351.html
>>>
>>> On 01/18/17 16:36, Bernd Edlinger wrote:
>>>> On 01/13/17 19:28, Bernd Edlinger wrote:
>>>>> On 01/13/17 17:10, Bernd Edlinger wrote:
>>>>>> On 01/13/17 14:50, Richard Earnshaw (lists) wrote:
>>>>>>> On 18/12/16 12:58, Bernd Edlinger wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> this is related to PR77308, the follow-up patch will depend on this
>>>>>>>> one.
>>>>>>>>
>>>>>>>> When trying to split the *arm_cmpdi_insn and *arm_cmpdi_unsigned
>>>>>>>> before reload, a mis-compilation in libgcc function
>>>>>>>> __gnu_satfractdasq
>>>>>>>> was discovered, see [1] for more details.
>>>>>>>>
>>>>>>>> The reason seems to be that when the *arm_cmpdi_insn is directly
>>>>>>>> followed by a *arm_cmpdi_unsigned instruction, both are split
>>>>>>>> up into this:
>>>>>>>>
>>>>>>>>    [(set (reg:CC CC_REGNUM)
>>>>>>>>          (compare:CC (match_dup 0) (match_dup 1)))
>>>>>>>>     (parallel [(set (reg:CC CC_REGNUM)
>>>>>>>>                     (compare:CC (match_dup 3) (match_dup 4)))
>>>>>>>>                (set (match_dup 2)
>>>>>>>>                     (minus:SI (match_dup 5)
>>>>>>>>                              (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))])]
>>>>>>>>
>>>>>>>>    [(set (reg:CC CC_REGNUM)
>>>>>>>>          (compare:CC (match_dup 2) (match_dup 3)))
>>>>>>>>     (cond_exec (eq:SI (reg:CC CC_REGNUM) (const_int 0))
>>>>>>>>                (set (reg:CC CC_REGNUM)
>>>>>>>>                     (compare:CC (match_dup 0) (match_dup 1))))]
>>>>>>>>
>>>>>>>> The problem is that the reg:CC set by *subsi3_carryin_compare
>>>>>>>> does not mention that it also depends on the reg:CC value
>>>>>>>> from before.  Therefore the *arm_cmpsi_insn appears to be
>>>>>>>> redundant and thus gets removed, because the data values are
>>>>>>>> identical.
>>>>>>>>
>>>>>>>> I think that applies to a number of similar patterns where data
>>>>>>>> flow is happening through the CC reg.
>>>>>>>>
>>>>>>>> So this is a kind of correctness issue, and should be fixed
>>>>>>>> independently from the optimization issue PR77308.
>>>>>>>>
>>>>>>>> Therefore I think the patterns need to specify the true
>>>>>>>> value that will be in the CC reg, in order for cse to
>>>>>>>> know what the instructions are really doing.
>>>>>>>>
>>>>>>>>
>>>>>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf.
>>>>>>>> Is it OK for trunk?
>>>>>>>>
>>>>>>>
>>>>>>> I agree you've found a valid problem here, but I have some issues
>>>>>>> with the patch itself.
>>>>>>>
>>>>>>>
>>>>>>> (define_insn_and_split "subdi3_compare1"
>>>>>>>   [(set (reg:CC_NCV CC_REGNUM)
>>>>>>>     (compare:CC_NCV
>>>>>>>       (match_operand:DI 1 "register_operand" "r")
>>>>>>>       (match_operand:DI 2 "register_operand" "r")))
>>>>>>>    (set (match_operand:DI 0 "register_operand" "=&r")
>>>>>>>     (minus:DI (match_dup 1) (match_dup 2)))]
>>>>>>>   "TARGET_32BIT"
>>>>>>>   "#"
>>>>>>>   "&& reload_completed"
>>>>>>>   [(parallel [(set (reg:CC CC_REGNUM)
>>>>>>>            (compare:CC (match_dup 1) (match_dup 2)))
>>>>>>>           (set (match_dup 0) (minus:SI (match_dup 1) (match_dup 2)))])
>>>>>>>    (parallel [(set (reg:CC_C CC_REGNUM)
>>>>>>>            (compare:CC_C
>>>>>>>              (zero_extend:DI (match_dup 4))
>>>>>>>              (plus:DI (zero_extend:DI (match_dup 5))
>>>>>>>                   (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0)))))
>>>>>>>           (set (match_dup 3)
>>>>>>>            (minus:SI (minus:SI (match_dup 4) (match_dup 5))
>>>>>>>                  (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))])]
>>>>>>>
>>>>>>>
>>>>>>> This pattern is now no longer self-consistent, in that before the
>>>>>>> split the overall result for the condition register is in mode
>>>>>>> CC_NCV, but afterwards it is just CC_C.
>>>>>>>
>>>>>>> I think CC_NCV is the correct mode (the N, C and V bits all correctly
>>>>>>> reflect the result of the 64-bit comparison), but that then implies
>>>>>>> that the cc mode of subsi3_carryin_compare is incorrect as well and
>>>>>>> should in fact also be CC_NCV.  Thinking about this pattern, I'm
>>>>>>> inclined to agree that CC_NCV is the correct mode for this operation.
>>>>>>>
>>>>>>> I'm not sure if there are other consequences that will fall out from
>>>>>>> fixing this (it's possible that we might need a change to
>>>>>>> select_cc_mode as well).
>>>>>>>
>>>>>>
>>>>>> Yes, this is still a bit awkward...
>>>>>>
>>>>>> The N and V bits will be correct for subdi3_compare1, which is
>>>>>> a 64-bit comparison, but the compare of
>>>>>> (zero_extend:DI (match_dup 4)) with (plus:DI ...) only gets the
>>>>>> C bit correct; the expression for N and V is a different one.
>>>>>>
>>>>>> It probably works, because the subsi3_carryin_compare instruction
>>>>>> sets more CC bits than the pattern explicitly specifies.
>>>>>> We know the subsi3_carryin_compare also computes the N and V bits,
>>>>>> but it is hard to write down the correct rtl expression for it.
>>>>>>
>>>>>> In theory the pattern should describe everything correctly,
>>>>>> maybe, like:
>>>>>>
>>>>>> (set (reg:CC_C CC_REGNUM)
>>>>>>      (compare:CC_C
>>>>>>        (zero_extend:DI (match_dup 4))
>>>>>>        (plus:DI (zero_extend:DI (match_dup 5))
>>>>>>                 (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0)))))
>>>>>> (set (reg:CC_NV CC_REGNUM)
>>>>>>      (compare:CC_NV
>>>>>>        (match_dup 4)
>>>>>>        (plus:SI (match_dup 5)
>>>>>>                 (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0)))))
>>>>>> (set (match_dup 3)
>>>>>>      (minus:SI (minus:SI (match_dup 4) (match_dup 5))
>>>>>>                (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))
>>>>>>
>>>>>>
>>>>>> But I doubt that it will work to set CC_REGNUM with two different
>>>>>> modes in parallel?
>>>>>>
>>>>>> Another idea would be to invent a CC_CNV_NOOV mode, that implicitly
>>>>>> defines C from the DImode result, and N and V from the SImode result,
>>>>>> similar to CC_NOOVmode, which also leaves open which bits it
>>>>>> really defines?
>>>>>>
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>> Bernd.
>>>>>
>>>>> I think maybe the right solution is to invent a new CCmode
>>>>> that defines C as if the comparison is done in DImode
>>>>> but N and V as if the comparison is done in SImode.
>>>>>
>>>>> I thought maybe I would call it CC_NCV_CIC (CIC = Carry-In-Compare),
>>>>> furthermore I think the CC_NOOV should be renamed to CC_NZ (because
>>>>> only N and Z are set correctly), but in a different patch of course.
>>>>>
>>>>> Attached is a new version that implements the new CCmode.
>>>>>
>>>>> How do you like this new version?
>>>>>
>>>>> It seems to be able to build a cross-compiler at least.
>>>>>
>>>>> I will start a new bootstrap with this new patch, but it can take
>>>>> some time until I have definitive results.
>>>>>
>>>>> Is there still a chance that it can go into gcc-7 or should it wait
>>>>> for the next stage1?
>>>>>
>>>>> Thanks
>>>>> Bernd.
>>>>
>>>>
>>>> I thought I should also look at where the subdi_compare1 and the
>>>> negdi2_compare patterns are used, and check whether the caller is
>>>> fine with not having all CC bits available.
>>>>
>>>> And indeed usubv<mode>4 turns out to be questionable, because it
>>>> emits gen_sub<mode>3_compare1 and uses arm_gen_unlikely_cbranch (LTU,
>>>> CCmode) which is inconsistent when subdi3_compare1 no longer uses
>>>> CCmode.
>>>>
>>>> To correct this, the branch should use CC_Cmode which is always
>>>> defined.
>>>>
>>>> So I tried to test this pattern, with the following test programs,
>>>> and found that the code actually improves when the branch uses CC_Cmode
>>>> instead of CCmode, both for SImode and DImode data, which was a bit
>>>> surprising.
>>>>
>>>> I used this test program to see how the usubv<mode>4 pattern works:
>>>>
>>>> cat test.c (DImode)
>>>> unsigned long long x, y, z;
>>>> int b;
>>>> void test()
>>>> {
>>>>   b = __builtin_sub_overflow (y,z, &x);
>>>> }
>>>>
>>>>
>>>> The unpatched code used 8 bytes more stack than the patched code,
>>>> because the DImode subtraction is effectively done twice.
>>>>
>>>> cat test1.c (SImode)
>>>> unsigned long x, y, z;
>>>> int b;
>>>> void test()
>>>> {
>>>>   b = __builtin_sub_overflow (y,z, &x);
>>>> }
>>>>
>>>> which generates (unpatched):
>>>>         cmp     r3, r0
>>>>         sub     ip, r3, r0
>>>>
>>>> instead of the expected (patched):
>>>>         subs    r3, r3, r2
>>>>
>>>>
>>>> The condition is extracted by if-conversion and/or combine,
>>>> which complicates the resulting code instead of simplifying it.
>>>>
>>>> I think this happens only when the branch and the subsi/di3_compare1
>>>> pattern use the same CC mode.
>>>>
>>>> That does not happen when the CC modes disagree, as with the
>>>> proposed patch.  All other uses of the pattern are already using
>>>> CC_Cmode or CC_Vmode in the branch, and these do not change.
>>>>
>>>> Attached is an updated version of the patch, which happens to
>>>> improve the code generation of the usubsi4 and usubdi4 patterns
>>>> as a side effect.
>>>>
>>>>
>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf.
>>>> Is it OK for trunk?
>>>>
>>>>
>>>> Thanks
>>>> Bernd.
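
The flag discussion quoted above (which CC bits a split 64-bit compare
really defines) can be checked with a small C model.  This is only a
sketch and not part of the patch: the names struct flags,
sub_with_carry, split_compare, wide_compare, ncv_agree,
borrow_matches_builtin and self_check are all hypothetical, and the
flag arithmetic is a hand-written model of ARM SUBS/SBCS as assumed
here, not code taken from GCC.  It compares the flags left by a
"cmp lo; sbcs hi" sequence against those of a true 64-bit subtraction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the four ARM condition flags. */
struct flags { int n, z, c, v; };

/* Assumed model of SUBS/SBCS: computes a - b - (1 - carry_in) as
   a + ~b + carry_in.  carry_in == 1 corresponds to plain SUBS/CMP. */
static struct flags sub_with_carry(uint32_t a, uint32_t b, int carry_in)
{
    uint64_t wide = (uint64_t)a + (uint32_t)~b + (uint32_t)carry_in;
    uint32_t r = (uint32_t)wide;
    struct flags f;
    f.n = (int)(r >> 31);                   /* sign of the 32-bit result */
    f.z = (r == 0);                         /* 32-bit result is zero */
    f.c = (int)((wide >> 32) & 1);          /* carry out, i.e. no borrow */
    f.v = (int)(((a ^ b) & (a ^ r)) >> 31); /* signed overflow */
    return f;
}

/* Flags left in the CC register after "cmp lo; sbcs hi". */
static struct flags split_compare(uint64_t a, uint64_t b)
{
    struct flags lo = sub_with_carry((uint32_t)a, (uint32_t)b, 1);
    return sub_with_carry((uint32_t)(a >> 32), (uint32_t)(b >> 32), lo.c);
}

/* Reference: flags of a real 64-bit subtraction a - b. */
static struct flags wide_compare(uint64_t a, uint64_t b)
{
    uint64_t r = a - b;
    struct flags f;
    f.n = (int)(r >> 63);
    f.z = (r == 0);
    f.c = (a >= b);
    f.v = (int)(((a ^ b) & (a ^ r)) >> 63);
    return f;
}

/* Returns 1 when N, C and V of the split sequence match the 64-bit ones. */
static int ncv_agree(uint64_t a, uint64_t b)
{
    struct flags s = split_compare(a, b), w = wide_compare(a, b);
    return s.n == w.n && s.c == w.c && s.v == w.v;
}

/* Returns 1 when the unsigned overflow reported by the GCC builtin
   equals "carry clear" (the LTU condition) after the split sequence. */
static int borrow_matches_builtin(uint64_t a, uint64_t b)
{
    unsigned long long diff;
    int overflow = __builtin_sub_overflow((unsigned long long)a,
                                          (unsigned long long)b, &diff);
    return overflow == !split_compare(a, b).c;
}

/* A few hand-picked cases; Z is deliberately not compared because the
   split sequence only sees the high word when setting Z. */
static void self_check(void)
{
    assert(ncv_agree(0, 1));
    assert(ncv_agree(0x8000000000000000ULL, 1));
    assert(borrow_matches_builtin(0, 1));
    assert(borrow_matches_builtin(5, 3));
}
```

In this model the Z flag is the one that does not carry over: for
0x100000005 compared with 0x100000003 the split sequence sets Z
(the high words cancel) although the 64-bit values differ.  Calling
self_check() from a main function runs the sample cases.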
