https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308
ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2016-08-23
                 CC|                            |ktkachov at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #9 from ktkachov at gcc dot gnu.org ---
Note that the fpu option plays a role here as well.
When I compile with -O3 -S -mfloat-abi=hard -march=armv7-a -mthumb
-mtune=cortex-a8 -mfpu=neon I get:

sha512_block_data_order:
        @ args = 0, pretend = 0, frame = 2384
        @ frame_needed = 0, uses_anonymous_args = 0
        push    {r4, r5, r6, r7, r8, r9, r10, fp, lr}
        subw    sp, sp, #2388
        subs    r4, r2, #1

whereas if you leave out the -mfpu you get the default which is probably 'vfp'
if you didn't configure gcc with an explicit --with-fpu.
This is usually not a good fit for recent targets.
With -mfpu=vfp I get the terrible:

sha512_block_data_order:
        @ args = 0, pretend = 0, frame = 3568
        @ frame_needed = 0, uses_anonymous_args = 0
        push    {r4, r5, r6, r7, r8, r9, r10, fp, lr}
        subw    sp, sp, #3572
        subs    r4, r2, #1

That said, I bet there's still room for improvement.
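
For anyone wanting to compare frame sizes across -mfpu settings without reading the assembly by eye, here is a minimal sketch that pulls the `frame = N` value out of the `@ args = ...` annotation GCC emits for ARM. The helper name `frame_size` and the regular expression are illustrative assumptions, not part of GCC or of this report; the two embedded snippets are the prologue annotations quoted above.

```python
import re

# Hypothetical helper: matches the ARM prologue annotation quoted in this comment.
FRAME_RE = re.compile(r"@ args = \d+, pretend = \d+, frame = (\d+)")


def frame_size(asm_text):
    """Return the first 'frame = N' value found in GCC ARM assembly output."""
    match = FRAME_RE.search(asm_text)
    if match is None:
        raise ValueError("no frame annotation found")
    return int(match.group(1))


# Prologue annotations as quoted above for the two -mfpu settings.
NEON_ASM = """sha512_block_data_order:
    @ args = 0, pretend = 0, frame = 2384
    @ frame_needed = 0, uses_anonymous_args = 0
"""

VFP_ASM = """sha512_block_data_order:
    @ args = 0, pretend = 0, frame = 3568
    @ frame_needed = 0, uses_anonymous_args = 0
"""

neon = frame_size(NEON_ASM)
vfp = frame_size(VFP_ASM)
extra = vfp - neon  # additional stack bytes used with -mfpu=vfp

print(f"-mfpu=neon frame: {neon} bytes")
print(f"-mfpu=vfp  frame: {vfp} bytes")
print(f"difference:       {extra} bytes")
```

In practice the text would come from the `.s` file produced by `-S` rather than from embedded strings; the sketch only reads the first annotation it finds, so a file with several functions would need splitting per function first.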