Richard Earnshaw <richard.earns...@foss.arm.com> writes:
> On 28/11/2023 12:52, Surya Kumari Jangala wrote:
>> Hi Richard,
>> Thanks a lot for your response!
>> 
>> Another failure reported by the Linaro CI is as follows :
>> (Note: I am planning to send a separate mail for each failure, as this will 
>> make
>> the discussion easy to track)
>> 
>> FAIL: gcc.target/aarch64/sve/acle/general/cpy_1.c -march=armv8.2-a+sve 
>> -moverride=tune=none  check-function-bodies dup_x0_m
>> 
>> Expected code:
>> 
>>        ...
>>        add     (x[0-9]+), x0, #?1
>>        mov     (p[0-7])\.b, p15\.b
>>        mov     z0\.d, \2/m, \1
>>        ...
>>        ret
>> 
>> 
>> Code obtained w/o patch:
>>          addvl   sp, sp, #-1
>>          str     p15, [sp]
>>          add     x0, x0, 1
>>          mov     p3.b, p15.b
>>          mov     z0.d, p3/m, x0
>>          ldr     p15, [sp]
>>          addvl   sp, sp, #1
>>          ret
>> 
>> Code obtained w/ patch:
>>      addvl   sp, sp, #-1
>>          str     p15, [sp]
>>          mov     p3.b, p15.b
>>          add     x0, x0, 1
>>          mov     z0.d, p3/m, x0
>>          ldr     p15, [sp]
>>          addvl   sp, sp, #1
>>          ret
>> 
>> As we can see, with the patch, the following two instructions are 
>> interchanged:
>>          add     x0, x0, 1
>>          mov     p3.b, p15.b
>
> Indeed, both look acceptable results to me, especially given that we 
> don't schedule results at -O1.
>
> There's two ways of fixing this:
> 1) Simply swap the order to what the compiler currently generates (which 
> is a little fragile, since it might flip back someday).
> 2) Write the test as
>
>
> ** (
> **       add     (x[0-9]+), x0, #?1
> **       mov     (p[0-7])\.b, p15\.b
> **       mov     z0\.d, \2/m, \1
> ** |
> **       mov     (p[0-7])\.b, p15\.b
> **       add     (x[0-9]+), x0, #?1
> **       mov     z0\.d, \1/m, \2
> ** )
>
> Note, we need to swap the match names in the third insn to account for 
> the different order of the earlier instructions.
>
> Neither is ideal, but the second is perhaps a little more bomb proof.
>
> I don't really have a strong feeling either way, but perhaps the second 
> is slightly preferable.
>
> Richard S: thoughts?

Yeah, I agree the second is probably better.  The | doesn't reset the
capture numbers, so I think the final instruction needs to be:

**       mov     z0\.d, \3/m, \4

Thanks,
Richard

>
> R.
>
>> I believe that this is fine and the test can be modified to allow it to pass 
>> on
>> aarch64. Please let me know what you think.
>> 
>> Regards,
>> Surya
>> 
>> 
>> On 24/11/23 4:18 pm, Richard Earnshaw wrote:
>>>
>>>
>>> On 24/11/2023 08:09, Surya Kumari Jangala via Gcc wrote:
>>>> Hi Richard,
>>>> Ping. Please let me know if the test failure that I mentioned in the mail 
>>>> below can be handled by changing the expected generated code. I am not 
>>>> conversant with arm, and hence would appreciate your help.
>>>>
>>>> Regards,
>>>> Surya
>>>>
>>>> On 03/11/23 4:58 pm, Surya Kumari Jangala wrote:
>>>>> Hi Richard,
>>>>> I had submitted a patch for review 
>>>>> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html)
>>>>> regarding scaling save/restore costs of callee save registers with block
>>>>> frequency in the IRA pass (PR111673).
>>>>>
>>>>> This patch has been approved by VMakarov
>>>>> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632089.html).
>>>>>
>>>>> With this patch, we are seeing performance improvements with spec on x86
>>>>> (exchange: 5%, xalancbmk: 2.5%) and on Power (perlbench: 5.57%).
>>>>>
>>>>> I received a mail from Linaro about some failures seen in the CI pipeline 
>>>>> with
>>>>> this patch. I have analyzed the failures and I wish to discuss the 
>>>>> analysis with you.
>>>>>
>>>>> One failure reported by the Linaro CI is:
>>>>>
>>>>> FAIL: gcc.target/arm/pr111235.c scan-assembler-times ldrexd\tr[0-9]+, 
>>>>> r[0-9]+, \\[r[0-9]+\\] 2
>>>>>
>>>>> The diff in the assembly between trunk and patch is:
>>>>>
>>>>> 93c93
>>>>> <       push    {r4, r5}
>>>>> ---
>>>>>>         push    {fp}
>>>>> 95c95
>>>>> <       ldrexd  r4, r5, [r0]
>>>>> ---
>>>>>>         ldrexd  fp, ip, [r0]
>>>>> 99c99
>>>>> <       pop     {r4, r5}
>>>>> ---
>>>>>>         ldr     fp, [sp], #4
>>>>>
>>>>>
>>>>> The test fails with patch because the ldrexd insn uses fp & ip registers 
>>>>> instead
>>>>> of r[0-9]+
>>>>>
>>>>> But the code produced by patch is better because it is pushing and 
>>>>> restoring only
>>>>> one register (fp) instead of two registers (r4, r5). Hence, this test can 
>>>>> be
>>>>> modified to allow it to pass on arm. Please let me know what you think.
>>>>>
>>>>> If you need more information, please let me know. I will be sending 
>>>>> separate mails
>>>>> for the other test failures.
>>>>>
>>>
>>> Thanks for looking at this.
>>>
>>>
>>> The key part of this test is that the compiler generates LDREXD.  The 
>>> registers used for that are pretty much irrelevant as we don't match them 
>>> to any other operations within the test.  So I'd recommend just testing for 
>>> the mnemonic and not for any of the operands (ie just match "ldrexd\t").
>>>
>>> R.
>>>
>>>>> Regards,
>>>>> Surya
>>>>>
>>>>>
>>>>>

Reply via email to