https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53513
--- Comment #16 from Oleg Endo <olegendo at gcc dot gnu.org> ---
I've tried a modified example from PR 5360, using floats instead of doubles:

void
loop_p (int np, int non0, float coeff[][2048], float tmp1)
{
  int j, k;
  for (j = non0; j < np; j++)
    for (k = 0; k < j; k++)
      coeff[j][j] -= tmp1 * coeff[j][k];
}

with -O2 -m4a (double mode default) and the patch from comment #15 applied:

        (loop setup code omitted)
        ...
.L6:
        cmp/pl  r5              ! outer loop, set to single
        bf/s    .L7
        sts     fpscr,r7
        mov.l   .L16,r4
        mov     r0,r2
        fmov.s  @r3,fr1
        mov     r5,r1
        and     r4,r7
        lds     r7,fpscr
        .align 2
.L5:
        fmov.s  @r2+,fr0        ! inner loop, no switch
        dt      r1
        fneg    fr0
        fmac    fr0,fr5,fr1
        bf/s    .L5
        fmov.s  fr1,@r3
.L7:
        dt      r6
        add     #1,r5
        add     r9,r0
        bf/s    .L6
        add     r8,r3
        sts     fpscr,r1        ! function return, set to double
        mov.l   .L17,r2
        mov.l   @r15+,r9
        or      r2,r1
        mov.l   @r15+,r8
        rts
        lds     r1,fpscr

Obviously, if the inner loop count is small the mode set in the outer loop will
dominate. Something seems to be missing in the mode-switch optimization. The
mode switch should be just hoisted above all loops, which then can use the
fpchg insn on SH4A.
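
To make the cost argument concrete, here is a rough C model of the two placements. It is only a sketch: set_fp_single, set_fp_double and the mode_sets counter are hypothetical stand-ins that count FPSCR writes, not real SH intrinsics, and the code says nothing about how the mode-switching pass itself works. The array sizes and fill values are likewise made up for the illustration.

```c
#include <assert.h>
#include <string.h>

#define ROWS 8
#define COLS 2048

/* Hypothetical stand-ins for an FPSCR write; they only count mode sets. */
static unsigned long mode_sets;
static void set_fp_single (void) { mode_sets++; }
static void set_fp_double (void) { mode_sets++; }

/* Placement seen in the generated code: the mode is set in the outer loop. */
static void
loop_p_per_outer (int np, int non0, float coeff[][COLS], float tmp1)
{
  int j, k;
  for (j = non0; j < np; j++)
    {
      if (j > 0)                /* like cmp/pl r5 + bf/s .L7: empty inner loop */
        set_fp_single ();
      for (k = 0; k < j; k++)
        coeff[j][j] -= tmp1 * coeff[j][k];
    }
  set_fp_double ();             /* back to the default mode before returning */
}

/* Placement suggested above: a single switch hoisted above all loops. */
static void
loop_p_hoisted (int np, int non0, float coeff[][COLS], float tmp1)
{
  int j, k;
  set_fp_single ();
  for (j = non0; j < np; j++)
    for (k = 0; k < j; k++)
      coeff[j][j] -= tmp1 * coeff[j][k];
  set_fp_double ();
}

static float a[ROWS][COLS], b[ROWS][COLS];

/* Fill the top-left np x np corner with values that are exact in float. */
static void
fill (float m[][COLS], int np)
{
  int i, k;
  assert (np <= ROWS);
  memset (m, 0, sizeof (float) * ROWS * COLS);
  for (i = 0; i < np; i++)
    for (k = 0; k < np; k++)
      m[i][k] = (float) (i + k + 1) * 0.5f;
}

/* Number of mode sets performed by each variant for non0 = 0. */
static unsigned long
sets_per_outer (int np)
{
  fill (a, np);
  mode_sets = 0;
  loop_p_per_outer (np, 0, a, 0.25f);
  return mode_sets;
}

static unsigned long
sets_hoisted (int np)
{
  fill (b, np);
  mode_sets = 0;
  loop_p_hoisted (np, 0, b, 0.25f);
  return mode_sets;
}

/* Nonzero if both variants compute the same coefficients. */
static int
same_result (int np)
{
  sets_per_outer (np);
  sets_hoisted (np);
  return memcmp (a, b, sizeof a) == 0;
}
```

With np = 6 and non0 = 0 the per-outer variant performs 6 mode sets (five in the loop plus the restore) while the hoisted variant performs 2, and the number in the first case grows with np even though the inner loops are short.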