https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101225
Bug ID: 101225 Summary: Example where y % 16 == 0 seems more expensive than y % 400 == 0. Product: gcc Version: 11.1.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: cassio.neri at gmail dot com Target Milestone: --- Consider this implementation of is_leap_year: bool is_leap_year_1(short year) { return year % 100 == 0 ? year % 400 == 0 : year % 4 == 0; } If a number is multiple of 100, then it's divisible by 400 if and only if it's divisible by 16. Since checking divisibility by 16 is cheap, one would expect the following version to be more efficient (at least, not worse): bool is_leap_year_2(short year) { return year % 100 == 0 ? year % 16 == 0 : year % 4 == 0; } According to [1] the latter is 1.4x slower than the former. The emitted instructions with -O3 [2] don't seem bad and, except for a leal x addw, the difference is a localized strength-reduction from "y % 400 == 0" to "y % 16 == 0": is_leap_year_1(short): imulw $23593, %di, %ax leal 1308(%rax), %edx rorw $2, %dx cmpw $654, %dx ja .L2 addw $1296, %ax # Begin: year % 400 == 0 rorw $4, %ax # cmpw $162, %ax # setbe %al # End : year % 400 == 0 ret .L2: andl $3, %edi sete %al ret is_leap_year_2(short): imulw $23593, %di, %ax addw $1308, %ax rorw $2, %ax cmpw $654, %ax ja .L6 andl $15, %edi # Begin: y % 16 == 0 sete %al # End : y % 16 == 0 ret .L6: andl $3, %edi sete %al ret FWIW: My educated **guess** is that the issue is the choice of registers: for version 1 just after leal, the register rax/ax/al is free and regardless of the branch taken, the CPU can continue the calculation of "y % 100 == 0" in parallel with the other divisibility check, up to "sete %al". For version 2, rax/ax/al is busy during the whole execution of "y % 100" and "sete %al" can't be preemptively executed. As a test for my theory I reimplemented half of is_leap_year_2 in inline asm (see in [1] and [2]) using similar choices of registers as in is_leap_year_1 and I got the performance boost that I was expecting. [1] https://quick-bench.com/q/3U8t4qzXxtSpsehbWNOh3SWxBGQ [2] https://godbolt.org/z/jfK3j5777 Note: [1] runs GCC 10.2 but the same happens on GCC 11.0.0.