While compiling LZMA compression code for an embedded ARM target, I have discovered a regression from previous versions of GCC.
I have pared this down to a trivial example (attached) which boils down to an application-specific modulus operation. Please note this is the *minimal* test case; the real code is obviously a bit more complex and is buried in the middle of the compression system. The behavior exhibited is the same in both the large and small systems.

The simple test case is compiled with

  arm-unknown-linux-gnu-gcc -Os -o foo test.c

and the resulting objdump is:

000083fc <foo>:
    83fc:  e92d4010  push  {r4, lr}
    8400:  e5d11000  ldrb  r1, [r1]
    8404:  e1a04000  mov   r4, r0
    8408:  e1a02001  mov   r2, r1
    840c:  ea000002  b     841c <foo+0x20>
    8410:  e5943004  ldr   r3, [r4, #4]
    8414:  e2833001  add   r3, r3, #1    ; 0x1
    8418:  e5843004  str   r3, [r4, #4]
    841c:  e242302d  sub   r3, r2, #45   ; 0x2d
    8420:  e352002c  cmp   r2, #44       ; 0x2c
    8424:  e20320ff  and   r2, r3, #255  ; 0xff
    8428:  8afffff8  bhi   8410 <foo+0x14>
    842c:  e1a00001  mov   r0, r1
    8430:  e3a0102d  mov   r1, #45       ; 0x2d
    8434:  eb000003  bl    8448 <__umodsi3>
    8438:  e20000ff  and   r0, r0, #255  ; 0xff
    843c:  e5840000  str   r0, [r4]
    8440:  e8bd8010  pop   {r4, pc}

If a different optimisation level is used:

  arm-unknown-linux-gnu-gcc -O2 -o foo test.c

000083fc <foo>:
    83fc:  e92d4070  push  {r4, r5, r6, lr}
    8400:  e5d14000  ldrb  r4, [r1]
    8404:  e354002c  cmp   r4, #44       ; 0x2c
    8408:  e1a06000  mov   r6, r0
    840c:  9a00000e  bls   844c <foo+0x50>
    8410:  e244402d  sub   r4, r4, #45   ; 0x2d
    8414:  e20440ff  and   r4, r4, #255  ; 0xff
    8418:  e5905004  ldr   r5, [r0, #4]
    841c:  e3a0102d  mov   r1, #45       ; 0x2d
    8420:  e1a00004  mov   r0, r4
    8424:  eb00004f  bl    8568 <__umodsi3>
    8428:  e3a0102d  mov   r1, #45       ; 0x2d
    842c:  e1a03000  mov   r3, r0
    8430:  e1a00004  mov   r0, r4
    8434:  e20340ff  and   r4, r3, #255  ; 0xff
    8438:  eb000006  bl    8458 <__aeabi_uidiv>
    843c:  e2855001  add   r5, r5, #1    ; 0x1
    8440:  e20000ff  and   r0, r0, #255  ; 0xff
    8444:  e0855000  add   r5, r5, r0
    8448:  e5865004  str   r5, [r6, #4]
    844c:  e5864000  str   r4, [r6]
    8450:  e8bd8070  pop   {r4, r5, r6, pc}

Actually several optimization levels were tried and all produced similar output.

GCC 4.2.2 and 4.2.4 (which are our current compilers), invoked as
  arm-unknown-linux-gnueabi-gcc -Os -o foo test.c

produce:

00008328 <foo>:
    8328:  e5d12000  ldrb  r2, [r1]
    832c:  ea000003  b     8340 <foo+0x18>
    8330:  e5903004  ldr   r3, [r0, #4]
    8334:  e20120ff  and   r2, r1, #255  ; 0xff
    8338:  e2833001  add   r3, r3, #1    ; 0x1
    833c:  e5803004  str   r3, [r0, #4]
    8340:  e352002c  cmp   r2, #44       ; 0x2c
    8344:  e242102d  sub   r1, r2, #45   ; 0x2d
    8348:  8afffff8  bhi   8330 <foo+0x8>
    834c:  e5802000  str   r2, [r0]
    8350:  e12fff1e  bx    lr

As can be seen in the 4.3.2 -Os output, the trivial loop is performed and the quotient and remainder are found, but then the __umodsi3 builtin is called to do the operation *again*, and its return value is stored as the result even though that value is already available from the loop!

This odd behavior is seen in cross-built (and native) GCC 4.3.2 but not in 4.2.4. It seems to be present in current development builds; however, I have issues building those reliably so cannot give definite results.

The behavior is especially obvious as a large performance and code size degradation in compression code on small embedded systems. The additional need to link in the __umodsi3 implementation also causes more space to be lost.

This has also been observed in some circumstances within ARM kernels when using modulus on powers of two! The obvious optimisation using shifts is performed and then the value is recomputed using __modsi3.

Just for completeness, here is the GCC 4.3.2 compiler used for the tests (4.3.4 produces identical compiled output but has other undesirable behaviors not relevant to this report):

arm-unknown-linux-gnu-gcc -v
Using built-in specs.
Target: arm-unknown-linux-gnu
Configured with: /opt/simtec/crosstool-ng/targets/src/gcc-4.3.2/configure --build=x86_64-build_unknown-linux-gnu --host=x86_64-build_unknown-linux-gnu --target=arm-unknown-linux-gnu --prefix=/opt/simtec/arm-unknown-linux-gnu --with-sysroot=/opt/simtec/arm-unknown-linux-gnu/arm-unknown-linux-gnu/sys-root --enable-languages=c,c++,fortran,java --disable-multilib --with-float=soft --with-gmp=/opt/simtec/arm-unknown-linux-gnu --with-mpfr=/opt/simtec/arm-unknown-linux-gnu --with-pkgversion=crosstool-NG-1.3.0 --enable-__cxa_atexit --with-local-prefix=/opt/simtec/arm-unknown-linux-gnu/arm-unknown-linux-gnu/sys-root --disable-nls --enable-threads=posix --enable-symvers=gnu --enable-c99 --enable-long-long --enable-target-optspace
Thread model: posix
gcc version 4.3.2 (crosstool-NG-1.3.0)

--
           Summary: Output code optimisation excessive use of builtins
           Product: gcc
           Version: 4.3.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: vince at simtec dot co dot uk
 GCC build triplet: x86_64-build_unknown-linux-gnu
  GCC host triplet: x86_64-build_unknown-linux-gnu
GCC target triplet: arm-unknown-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38453
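
For readers without the attachment, the kind of loop discussed in this report can be sketched roughly as follows. This is a hypothetical reconstruction inferred from the disassembly listings, not the attached test.c. The names struct state, rem, count, foo_closed and check_forms_agree are assumptions made for illustration; only the constant 45, the byte-sized input and the two word-sized stores at offsets 0 and 4 are taken from the listings.

```c
#include <assert.h>

/* Hypothetical layout: the listings store one word at offset 0
 * (the remainder) and update one word at offset 4 (a counter). */
struct state {
    unsigned int rem;
    unsigned int count;
};

/* Loop form: repeated subtraction, which appears to be what the
 * GCC 4.2.4 output implements directly, with no library call. */
void foo(struct state *st, const unsigned char *p)
{
    unsigned char v = *p;

    while (v >= 45) {
        v -= 45;
        st->count++;
    }
    st->rem = v;
}

/* Closed form: the same result written with / and %. On an ARM core
 * without a hardware divider these operators typically become calls
 * to __aeabi_uidiv and __umodsi3, as in the 4.3.2 -O2 listing. */
void foo_closed(struct state *st, const unsigned char *p)
{
    unsigned char v = *p;

    st->count += v / 45u;
    st->rem = v % 45u;
}

/* Checks that both forms agree for every possible input byte. */
void check_forms_agree(void)
{
    int i;

    for (i = 0; i < 256; i++) {
        unsigned char b = (unsigned char)i;
        struct state a = { 0, 7 };
        struct state c = { 0, 7 };

        foo(&a, &b);
        foo_closed(&c, &b);
        assert(a.rem == c.rem);
        assert(a.count == c.count);
    }
}
```

Since the input is a single byte, the loop body runs at most five times (255 / 45 = 5), which is why the loop form can be both smaller and faster than a library division on a small ARM target.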