Re: [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308)

Christophe Lyon Tue, 05 Sep 2017 01:47:56 -0700

Hi Bernd,


On 4 September 2017 at 16:52, Kyrill  Tkachov
<kyrylo.tkac...@foss.arm.com> wrote:
>
> On 29/04/17 18:45, Bernd Edlinger wrote:
>>
>> Ping...
>>
>> I attached a rebased version since there was a merge conflict in
>> the xordi3 pattern, otherwise the patch is still identical.
>> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2
>> early when the target has no neon or iwmmxt.
>>
>>
>> Thanks
>> Bernd.
>>
>>
>>
>> On 11/28/16 20:42, Bernd Edlinger wrote:
>>>
>>> On 11/25/16 12:30, Ramana Radhakrishnan wrote:
>>>>
>>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger
>>>> <bernd.edlin...@hotmail.de> wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>> This improves the stack usage on the sha512 test case for the case
>>>>> without hardware fpu and without iwmmxt by splitting all di-mode
>>>>> patterns right while expanding which is similar to what the
>>>>> shift-pattern
>>>>> does.  It does nothing in the case iwmmxt and fpu=neon or vfp as well
>>>>> as
>>>>> thumb1.
>>>>>
>>>> I would go further and do this in the absence of Neon, the VFP unit
>>>> being there doesn't help with DImode operations i.e. we do not have 64
>>>> bit integer arithmetic instructions without Neon. The main reason why
>>>> we have the DImode patterns split so late is to give a chance for
>>>> folks who want to do 64 bit arithmetic in Neon a chance to make this
>>>> work as well as support some of the 64 bit Neon intrinsics which IIRC
>>>> map down to these instructions. Doing this just for soft-float doesn't
>>>> improve the default case only. I don't usually test iwmmxt and I'm not
>>>> sure who has the ability to do so, thus keeping this restriction for
>>>> iwMMX is fine.
>>>>
>>>>
>>> Yes I understand, thanks for pointing that out.
>>>
>>> I was not aware what iwmmxt exists at all, but I noticed that most
>>> 64bit expansions work completely different, and would break if we split
>>> the pattern early.
>>>
>>> I can however only look at the assembler outout for iwmmxt, and make
>>> sure that the stack usage does not get worse.
>>>
>>> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as
>>> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack
>>> for the test cases, and vfp and soft-float at around 270 bytes stack
>>> usage.
>>>
>>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!).
>>>>>
>>>>> Note this also splits many ldrd/strd instructions and therefore I will
>>>>> post a followup-patch that mitigates this effect by enabling the
>>>>> ldrd/strd
>>>>> peephole optimization after the necessary reg-testing.
>>>>>
>>>>>
>>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf.
>>>>
>>>> What do you mean by arm-linux-gnueabihf - when folks say that I
>>>> interpret it as --with-arch=armv7-a --with-float=hard
>>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon).
>>>>
>>>> If you've really bootstrapped and regtested it on armhf, doesn't this
>>>> patch as it stand have no effect there i.e. no change ?
>>>> arm-linux-gnueabihf usually means to me someone has configured with
>>>> --with-float=hard, so there are no regressions in the hard float ABI
>>>> case,
>>>>
>>> I know it proves little.  When I say arm-linux-gnueabihf
>>> I do in fact mean --enable-languages=all,ada,go,obj-c++
>>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16
>>> --with-float=hard.
>>>
>>> My main interest in the stack usage is of course not because of linux,
>>> but because of eCos where we have very small task stacks and in fact
>>> no fpu support by the O/S at all, so that patch is exactly what we need.
>>>
>>>
>>> Bootstrapped and reg-tested on arm-linux-gnueabihf
>>> Is it OK for trunk?
>
>
> The code is ok.
> AFAICS testing this with --with-fpu=vfpv3-d16 does exercise the new code as
> the splits
> will happen for !TARGET_NEON (it is of course !TARGET_IWMMXT and
> TARGET_IWMMXT2
> is irrelevant here).
>
> So this is ok for trunk.
> Thanks, and sorry again for the delay.
> Kyrill
>

This patch (r251663) causes a regression on armeb-none-linux-gnueabihf
--with-mode arm
--with-cpu cortex-a9
--with-fpu vfpv3-d16-fp16
FAIL:    gcc.dg/vect/vect-singleton_1.c (internal compiler error)
FAIL:    gcc.dg/vect/vect-singleton_1.c -flto -ffat-lto-objects
(internal compiler error)

the test passes if gcc is configured --with-fpu neon-fp16


Christophe

>>>
>>> Thanks
>>> Bernd.
>
>

Re: [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308)

Reply via email to