https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118072
Bug ID: 118072 Summary: n % 7 on ARM is bad and unstable. Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: cassio.neri at gmail dot com Target Milestone: --- Consider: unsigned mod3(unsigned n) { return n % 3; } The emitted code for ARM64 with -O3 is: mod3: mov w1, 43691 movk w1, 0xaaaa, lsl 16 umull x1, w0, w1 lsr x1, x1, 33 add w1, w1, w1, lsl 1 sub w0, w0, w1 ret Basically, it returns n - 3 * (n / 3) and to get n / 3 it applies some variant of Granlund & Montgomery (G&M) to avoid udiv. The cases of n % 5 and n % 9 are similar. Now, for n % 7 the emitted code is: mod7: mov w1, 7 udiv w1, w0, w1 lsl w2, w1, 3 sub w1, w2, w1 sub w0, w0, w1 ret Similarly to the previous cases, it returns n - 7 * (n / 7) but, in contrast, to get n / 7 the compiler falls back to udiv. Surprisingly, for this function: unsigned div7(unsigned n) { return n / 7; }; The compiler does use G&M: div7: mov w1, 18725 movk w1, 0x2492, lsl 16 umull x1, w0, w1 lsr x1, x1, 32 sub w0, w0, w1 add w0, w1, w0, lsr 1 lsr w0, w0, 2 ret Even more surprising is that if div7 appears in the source code just above mod7, then the compiler uses G&M for mod7 (even if div7 is unused!): mod7: mov w2, 18725 movk w2, 0x2492, lsl 16 umull x2, w0, w2 lsr x2, x2, 32 sub w1, w0, w2 add w1, w2, w1, lsr 1 lsr w1, w1, 2 lsl w2, w1, 3 sub w1, w2, w1 sub w0, w0, w1 ret Therefore, there's something special about the divisor 7. The same happens for divisors that are multiple of 7. Same for 19 and its multiples. And certainly others. Here is my educated guess (I don't know GCC's code but I know a thing or two about G&M). Roughly speaking, G&M replaces n / D (where D is a constant) with n * M >> S, where M and S depend on D. It turns out that for some divisors, the constant M doesn't fit in 32-bits (*) and a slightly more complicated version of G&M is used. Now, 7 and 19 are two such divisors. See: https://godbolt.org/z/5Esr3ozf4 (*) I don't know ARM64 but on x64, it's possible to lift the 32-bits limitation. However, IIUC, most larger-than-32-bits constants can't be used as immediate values except in mov. This might be the reason for holding the limitation on x64 or, perhaps, this is just a historical thing from x86 times and a technical revisitation of G&M for x64 (and ARM64?) could be valuable. Sorry, I'm digressing.