https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308
Bernd Edlinger <bernd.edlinger at hotmail dot de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #39898|0                           |1
        is obsolete|                            |

--- Comment #38 from Bernd Edlinger <bernd.edlinger at hotmail dot de> ---
Created attachment 39939
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39939&action=edit
proposed patch, v2

Hi,

this is a new version that tries to fix the fallout of the previous attempt.
I will attempt a bootstrap and reg-test later this week.

It splits the logical di3 pattern right at the expansion, but only when
!TARGET_HARD_FLOAT or !TARGET_IWMMXT, in order not to break the neon/iwmmxt
patterns that seem to depend on it.
Simply disabling the logical di3 pattern made it impossible to merge the
ldrd/strd later, because the ldr/str got expanded too far away from each other.

It splits the adddi3/subdi3 in the split1 pass, but only when
!TARGET_HARD_FLOAT, because other hard-float patterns seem to depend on it.

Note that the setting of the out register in the shift expansion is only
necessary with -mfpu=vfp -mhard-float; in all other configurations it is
now unnecessary.

So far I have only benchmarked with the sha512 test case and a modified sha512
with the Sigma blocks decorated with bit-not (~).

Checked that the pr53447-*.c test cases work again.

Checked that this test case emits all ldrd/strd where expected:

cat test.c
void foo(long long* p)
{
  p[1] |= 0x100000001;
  p[2] &= 0x100000001;
  p[3] ^= 0x100000001;
  p[4] += 0x100000001;
  p[5] -= 0x100000001;
  p[6] = ~p[6];
  p[7] <<= 5;
  p[8] >>= 5;
  p[9] -= p[10];
}

At -Os -mthumb -march=armv7-a with -msoft-float or -mhard-float, this patch
improves the number of ldrd/strd emitted to 100%.

I wonder if it is OK to emit ldrd at all when optimizing for speed, given they
are considered slower than ldm / 2x ldr?

With -Os -mfpu=neon / -mfpu=vfp / -march=iwmmxt: checked that the stack usage
is still the same, around 2328 bytes.
With -Os -marm / thumb2: made sure that the stack usage is still 272 bytes.

Unlike the previous patch, thumb1 stack usage stays at 1588 bytes, because
thumb1 cannot split the adddi3 pattern once it is emitted.
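
For readers less familiar with what splitting a di3 pattern means on a 32-bit target, here is a rough C-level sketch of the idea: a 64-bit logical operation decomposes into two independent 32-bit operations, while a 64-bit add needs a carry from the low word into the high word. This is only an illustration and not code from the patch; the names or64_split, add64_split and check_split_ops are hypothetical, and the real transformation happens on RTL inside the compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration, not part of the patch: a 64-bit OR
   expressed as two independent 32-bit ORs, one per word.  */
static uint64_t or64_split(uint64_t x, uint64_t c)
{
    uint32_t lo = (uint32_t)x | (uint32_t)c;                 /* low word  */
    uint32_t hi = (uint32_t)(x >> 32) | (uint32_t)(c >> 32); /* high word */
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical illustration: a 64-bit add expressed as a 32-bit add
   of the low words plus a 32-bit add-with-carry of the high words.
   Unlike the logical case, the two halves are not independent.  */
static uint64_t add64_split(uint64_t x, uint64_t c)
{
    uint32_t xl = (uint32_t)x;
    uint32_t lo = xl + (uint32_t)c;      /* wraps modulo 2^32 */
    uint32_t carry = lo < xl;            /* 1 if the low add overflowed */
    uint32_t hi = (uint32_t)(x >> 32) + (uint32_t)(c >> 32) + carry;
    return ((uint64_t)hi << 32) | lo;
}

/* Compare the split forms against the native 64-bit operators,
   using the constant from test.c above.  */
static void check_split_ops(void)
{
    const uint64_t k = 0x100000001ULL;
    const uint64_t v = 0xFFFFFFFFFFFFFFFFULL;
    assert(or64_split(v, k) == (v | k));
    assert(add64_split(v, k) == v + k);
}
```

The dependency through the carry is one way to see why the add/sub case is handled separately from the logical operations.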