[PING**3] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308)

Bernd Edlinger Fri, 12 May 2017 09:50:21 -0700

Ping...


On 04/29/17 19:45, Bernd Edlinger wrote:
> Ping...
>
> I attached a rebased version since there was a merge conflict in
> the xordi3 pattern, otherwise the patch is still identical.
> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2
> early when the target has no neon or iwmmxt.
>
>
> Thanks
> Bernd.
>
>
>
> On 11/28/16 20:42, Bernd Edlinger wrote:
>> On 11/25/16 12:30, Ramana Radhakrishnan wrote:
>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger
>>> <bernd.edlin...@hotmail.de> wrote:
>>>> Hi!
>>>>
>>>> This improves the stack usage on the sha512 test case for the case
>>>> without hardware fpu and without iwmmxt by splitting all DImode
>>>> patterns right at expand time, which is similar to what the
>>>> shift pattern does.  It does nothing in the case of iwmmxt,
>>>> fpu=neon or vfp, or thumb1.
>>>>
>>>
>>> I would go further and do this in the absence of Neon; the VFP unit
>>> being there doesn't help with DImode operations, i.e. we do not have
>>> 64 bit integer arithmetic instructions without Neon. The main reason
>>> why we have the DImode patterns split so late is to give folks who
>>> want to do 64 bit arithmetic in Neon a chance to make this work, as
>>> well as to support some of the 64 bit Neon intrinsics which IIRC
>>> map down to these instructions. Doing this only for soft-float doesn't
>>> improve the default case. I don't usually test iwmmxt and I'm not
>>> sure who has the ability to do so, thus keeping this restriction for
>>> iwMMX is fine.
>>>
>>>
>>
>> Yes I understand, thanks for pointing that out.
>>
>> I was not aware that iwmmxt exists at all, but I noticed that most
>> 64bit expansions work completely differently, and would break if we
>> split the pattern early.
>>
>> I can however only look at the assembler output for iwmmxt, and make
>> sure that the stack usage does not get worse.
>>
>> Thus the new version of the patch leaves only thumb1, neon and iwmmxt
>> unchanged: around 1570 (thumb1), 2300 (neon) and 2200 (iwmmxt) bytes
>> stack for the test cases, and vfp and soft-float at around 270 bytes
>> stack usage.
>>
>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!).
>>>>
>>>> Note this also splits many ldrd/strd instructions and therefore I will
>>>> post a follow-up patch that mitigates this effect by enabling the
>>>> ldrd/strd peephole optimization after the necessary reg-testing.
>>>>
>>>>
>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf.
>>>
>>> What do you mean by arm-linux-gnueabihf - when folks say that I
>>> interpret it as --with-arch=armv7-a --with-float=hard
>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon).
>>>
>>> If you've really bootstrapped and regtested it on armhf, doesn't this
>>> patch as it stands have no effect there, i.e. no change?
>>> arm-linux-gnueabihf usually means to me someone has configured with
>>> --with-float=hard, so there are no regressions in the hard float ABI
>>> case.
>>>
>>
>> I know it proves little.  When I say arm-linux-gnueabihf
>> I do in fact mean --enable-languages=all,ada,go,obj-c++
>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16
>> --with-float=hard.
>>
>> My main interest in the stack usage is of course not because of linux,
>> but because of eCos where we have very small task stacks and in fact
>> no fpu support by the O/S at all, so that patch is exactly what we need.
>>
>>
>> Bootstrapped and reg-tested on arm-linux-gnueabihf.
>> Is it OK for trunk?
>>
>>
>> Thanks
>> Bernd.
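
For readers following the thread, here is a rough sketch in C of what "splitting" the DImode operations named above (adddi3, subdi3, xordi3, one_cmpldi2) amounts to on a target without 64 bit integer instructions: each 64 bit operation becomes two 32 bit operations on the low and high halves, with a carry or borrow passed between them where needed. This is only an illustration of the idea, not code from the patch; the type `pair32` and all function names are made up for this example.

```c
#include <stdint.h>

/* Hypothetical type, not from the patch: a 64 bit value held as two
   32 bit halves, the way a DImode value occupies a register pair. */
typedef struct { uint32_t lo, hi; } pair32;

pair32 split64(uint64_t v)
{
    pair32 p = { (uint32_t)v, (uint32_t)(v >> 32) };
    return p;
}

uint64_t join64(pair32 p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* adddi3 as two 32 bit steps: add the low halves, then add the high
   halves plus the carry out of the low add (adds/adc on ARM). */
pair32 add_split(pair32 a, pair32 b)
{
    pair32 r;
    r.lo = a.lo + b.lo;
    uint32_t carry = r.lo < a.lo;   /* unsigned wrap-around means carry */
    r.hi = a.hi + b.hi + carry;
    return r;
}

/* subdi3 as two 32 bit steps with a borrow (subs/sbc on ARM). */
pair32 sub_split(pair32 a, pair32 b)
{
    pair32 r;
    uint32_t borrow = a.lo < b.lo;
    r.lo = a.lo - b.lo;
    r.hi = a.hi - b.hi - borrow;
    return r;
}

/* xordi3: the halves are independent, no carry is involved.
   anddi3 and iordi3 have the same shape. */
pair32 xor_split(pair32 a, pair32 b)
{
    pair32 r = { a.lo ^ b.lo, a.hi ^ b.hi };
    return r;
}

/* one_cmpldi2: complement each half separately. */
pair32 not_split(pair32 a)
{
    pair32 r = { ~a.lo, ~a.hi };
    return r;
}
```

Once the halves are separate 32 bit values early on, the register allocator can treat them individually instead of having to reserve a register pair for every 64 bit value, which is a plausible reading of why the stack usage drops so much in the numbers quoted above.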
