[Bug target/77308] surprisingly large stack usage for sha512 on arm

bernd.edlinger at hotmail dot de Tue, 01 Nov 2016 07:31:57 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308


Bernd Edlinger <bernd.edlinger at hotmail dot de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #39898|0                           |1
        is obsolete|                            |

--- Comment #38 from Bernd Edlinger <bernd.edlinger at hotmail dot de> ---
Created attachment 39939
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39939&action=edit
proposed patch, v2

Hi,

this is a new version that tries to fix the fallout of
the previous attempt.

I will attempt a bootstrap and reg-test later this week.

It splits the logical di3 pattern right at the expansion,
but only when !TARGET_HARD_FLOAT or !TARGET_IWMMXT, in order
not to break the neon/iwmmxt patterns that seem to depend on it.

Simply disabling the logical di3 pattern made it impossible
to merge the ldrd/strd later because the ldr/str got expanded
too far away from each other.
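
To illustrate what splitting a logical di3 operation means, here is
a rough C sketch (not part of the patch; the helper name or64_split
is made up for this example). A 64-bit OR has no dependency between
its halves, so it can be done as two independent 32-bit ORs:

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only: a 64-bit OR computed as
   two independent 32-bit ORs, the way a split DImode logical op looks
   once it works on SImode halves. */
static uint64_t or64_split(uint64_t a, uint64_t b)
{
    uint32_t lo = (uint32_t)a | (uint32_t)b;                 /* low words  */
    uint32_t hi = (uint32_t)(a >> 32) | (uint32_t)(b >> 32); /* high words */
    return ((uint64_t)hi << 32) | lo;
}
```

The same holds for AND and XOR; only the operator changes.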

It splits the adddi3/subdi3 in the split1 pass but only when
!TARGET_HARD_FLOAT, because other hard-float patterns seem
to depend on it.
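
Unlike the logical operations, the halves of a 64-bit add are linked
by the carry. A rough C sketch of the split form (illustration only,
not taken from the patch; add64_split is a made-up name):

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only: a 64-bit add built from
   32-bit halves, mirroring an adds/adc pair. */
static uint64_t add64_split(uint64_t a, uint64_t b)
{
    uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
    uint32_t blo = (uint32_t)b, bhi = (uint32_t)(b >> 32);
    uint32_t lo = alo + blo;          /* low words, wraps modulo 2^32     */
    uint32_t carry = lo < alo;        /* 1 if the low add wrapped around  */
    uint32_t hi = ahi + bhi + carry;  /* high words plus the carry-in     */
    return ((uint64_t)hi << 32) | lo;
}
```

Subtraction is analogous, with a borrow in place of the carry.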

Note that the setting of the out register in the shift
expansion is only necessary in the case -mfpu=vfp -mhard-float;
in all other configurations this is now unnecessary.
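
For reference, a rough C sketch of a 64-bit left shift by a small
constant in terms of 32-bit halves (illustration only; shl64_split is
a made-up name, and this does not show the out register setting
itself, only how the output halves are derived from the input halves):

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only: 64-bit left shift by n,
   valid only for 0 < n < 32, computed on 32-bit halves. */
static uint64_t shl64_split(uint64_t a, unsigned n)
{
    uint32_t lo = (uint32_t)a, hi = (uint32_t)(a >> 32);
    uint32_t out_hi = (hi << n) | (lo >> (32 - n)); /* bits carried up from lo */
    uint32_t out_lo = lo << n;
    return ((uint64_t)out_hi << 32) | out_lo;
}
```

The high output word depends on both input words, while the low
output word depends only on the low input word.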

So far I have only benchmarked with the sha512 test case
and a modified sha512 with the Sigma blocks decorated with bit-not (~).

Checked that the pr53447-*.c test cases work again.

Checked that this test case emits all ldrd/strd where expected:

cat test.c
void foo(long long* p)
{
  p[1] |= 0x100000001;
  p[2] &= 0x100000001;
  p[3] ^= 0x100000001;
  p[4] += 0x100000001;
  p[5] -= 0x100000001;
  p[6] = ~p[6];
  p[7] <<= 5;
  p[8] >>= 5;
  p[9] -= p[10];
}

At -Os -mthumb -march=armv7-a with -msoft-float / -mhard-float,
this patch improves the number of ldrd/strd emitted to 100%.

I wonder if it is OK to emit ldrd at all when optimizing
for speed, given that they are considered slower than ldm / 2x ldr?

With -Os -mfpu=neon / -mfpu=vfp / -march=iwmmxt: checked that the stack usage
is still the same, around 2328 bytes.

With -Os -marm / thumb2: made sure that the stack usage is still 272 bytes.

Unlike the previous patch, thumb1 stack usage stays at 1588 bytes,
because thumb1 cannot split the adddi3 pattern once it is emitted.
