[Bug target/77308] surprisingly large stack usage for sha512 on arm

bernd.edlinger at hotmail dot de Tue, 25 Oct 2016 13:49:11 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308


--- Comment #15 from Bernd Edlinger <bernd.edlinger at hotmail dot de> ---
(In reply to Wilco from comment #14)
> (In reply to Bernd Edlinger from comment #13)
> > I am still trying to understand why thumb1 seems to outperform thumb2.
> > 
> > Obviously thumb1 does not have the shiftdi3 pattern,
> > but even if I remove these from thumb2, the result is still
> > not on par with thumb1.  Apparently other patterns that are not
> > enabled with thumb1 still produce DI values: xordi3 and anddi3,
> > which are used often.  Then there is
> > adddi3 that is enabled in thumb1 and thumb2, I also disabled
> > this one, and now the sha512 gets down to an incredible 1152
> > byte frame (-Os -march=armv7 -mthumb -mfloat-abi=soft):
> > 
> > I know this is a hack, but 1K stack is what we should expect...
> > 
> > --- arm.md      2016-10-25 19:54:16.425736721 +0200
> > +++ arm.md.orig 2016-10-17 19:46:59.000000000 +0200
> > @@ -448,7 +448,7 @@
> >           (plus:DI (match_operand:DI 1 "s_register_operand" "")
> >                    (match_operand:DI 2 "arm_adddi_operand"  "")))
> >      (clobber (reg:CC CC_REGNUM))])]
> > -  "TARGET_EITHER && !TARGET_THUMB2"
> > +  "TARGET_EITHER"
> 
> So you're actually turning these instructions off for Thumb-2? What does
> it do instead then? Does the number of instructions go down?
> 
> I noticed that with or without -mfpu=neon, using -marm is significantly
> smaller than -mthumb. Most of the extra instructions appear to be moves,
> which means something is wrong (I would expect Thumb-2 to do better as it
> supports LDRD with larger offsets than ARM).

LDRD may be another detail that contributes to this mess.
Just a guess: maybe LDRD simply does not work with DI registers, but only
with two SI registers.  At least the pattern looks like it targets two SI moves.

I would expect n DI mode registers to fall apart into 2n SI mode registers.
That should happen when the expansion finds no DI pattern and falls back
to the SI patterns.  Each SI mode register can then be spilled independently,
and can be dead independently of the other half.
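
To illustrate the expectation above, here is a minimal C sketch of what
"falling apart into SI halves" means for the three operations discussed
(xordi3, anddi3, adddi3).  This is not GCC code and not what the compiler
emits; the names di_pair, split_di, join_di, xor_di, and_di, add_di and
check_halves_match_64bit are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: one 64-bit (DI-like) value held as two
   independent 32-bit (SI-like) halves. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} di_pair;

di_pair split_di(uint64_t v)
{
    di_pair p;
    p.lo = (uint32_t)v;
    p.hi = (uint32_t)(v >> 32);
    return p;
}

uint64_t join_di(di_pair p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* XOR and AND: the two halves never interact, so each half can be
   computed, spilled, or become dead without regard to the other. */
di_pair xor_di(di_pair a, di_pair b)
{
    di_pair r = { a.lo ^ b.lo, a.hi ^ b.hi };
    return r;
}

di_pair and_di(di_pair a, di_pair b)
{
    di_pair r = { a.lo & b.lo, a.hi & b.hi };
    return r;
}

/* ADD: the only coupling between the halves is the carry out of
   the low word. */
di_pair add_di(di_pair a, di_pair b)
{
    di_pair r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo); /* r.lo < a.lo means carry */
    return r;
}

/* Self-check: the split operations agree with native 64-bit ones. */
void check_halves_match_64bit(uint64_t a, uint64_t b)
{
    di_pair pa = split_di(a), pb = split_di(b);
    assert(join_di(xor_di(pa, pb)) == (a ^ b));
    assert(join_di(and_di(pa, pb)) == (a & b));
    assert(join_di(add_di(pa, pb)) == a + b);
}
```

In this model only the addition ties the two halves together, and only
through the carry, which is why independent liveness of the halves looks
plausible for the logical operations.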

And frankly I am still puzzled by what my brutal patch did to the stack size.
It also reduced the code size:

frame       2328 ->   1152
code size 0x4188 -> 0x3ab8


I have not tested whether the generated code works, but I assume it does;
otherwise I would expect an ICE, which apparently does not happen.
