https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308
--- Comment #12 from Wilco <wdijkstr at arm dot com> ---
It looks like we need a different approach. I've seen the extra SETs use up more registers in some cases, and in other cases being optimized away early on...

Doing shift expansion at the same time as all other DI mode operations should result in the same stack size as -mfpu=neon. However that's still well behind Thumb-1, and I would expect ARM/Thumb-2 to beat Thumb-1 easily with 6 extra registers.

The spill code for Thumb-2 seems incorrect:

(insn 11576 8090 9941 5 (set (reg:SI 3 r3 [11890])
        (plus:SI (reg/f:SI 13 sp)
            (const_int 480 [0x1e0]))) sha512.c:147 4 {*arm_addsi3}
     (nil))
(insn 9941 11576 2978 5 (set (reg:DI 2 r2 [4210])
        (mem/c:DI (reg:SI 3 r3 [11890]) [5 %sfpD.4158+-3112 S8 A64])) sha512.c:147 170 {*arm_movdi}
     (nil))

LDRD has an offset range of 1020 on Thumb-2, so I would expect this load (offset 480 from sp) to be a single instruction.
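
The LDRD range argument can be illustrated with a small check. This is only a sketch, not code from GCC: the function name thumb2_ldrd_offset_ok is made up for illustration, and it assumes the Thumb-2 LDRD immediate form, where the offset is an 8-bit value scaled by 4 that can be added or subtracted.

```c
#include <stdbool.h>

/* Hypothetical helper (not part of GCC): returns true if OFFSET can be
   encoded directly as the immediate of a Thumb-2 LDRD.  Assumes an 8-bit
   immediate scaled by 4 with an add/subtract bit, so the offset must be
   a multiple of 4 in the range -1020..1020.  */
bool thumb2_ldrd_offset_ok(int offset)
{
    return offset >= -1020 && offset <= 1020 && (offset % 4) == 0;
}
```

Under that assumption the offset 480 from the dump passes the check, so the separate add into r3 should not be needed and the load could address the slot relative to sp directly.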