https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118072
            Bug ID: 118072
           Summary: n % 7 on ARM is bad and unstable.
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cassio.neri at gmail dot com
  Target Milestone: ---

Consider:

unsigned mod3(unsigned n) {
  return n % 3;
}

The emitted code for ARM64 with -O3 is:

mod3:
        mov     w1, 43691
        movk    w1, 0xaaaa, lsl 16
        umull   x1, w0, w1
        lsr     x1, x1, 33
        add     w1, w1, w1, lsl 1
        sub     w0, w0, w1
        ret

Basically, it returns n - 3 * (n / 3) and to get n / 3 it applies some variant
of Granlund & Montgomery (G&M) to avoid udiv. The cases of n % 5 and n % 9 are
similar. Now, for n % 7 the emitted code is:

mod7:
        mov     w1, 7
        udiv    w1, w0, w1
        lsl     w2, w1, 3
        sub     w1, w2, w1
        sub     w0, w0, w1
        ret

Similarly to the previous cases, it returns n - 7 * (n / 7) but, in contrast,
to get n / 7 the compiler falls back to udiv. Surprisingly, for this function:

unsigned div7(unsigned n) {
  return n / 7;
};

The compiler does use G&M:

div7:
        mov     w1, 18725
        movk    w1, 0x2492, lsl 16
        umull   x1, w0, w1
        lsr     x1, x1, 32
        sub     w0, w0, w1
        add     w0, w1, w0, lsr 1
        lsr     w0, w0, 2
        ret

Even more surprising is that if div7 appears in the source code just above
mod7, then the compiler uses G&M for mod7 (even if div7 is unused!):

mod7:
        mov     w2, 18725
        movk    w2, 0x2492, lsl 16
        umull   x2, w0, w2
        lsr     x2, x2, 32
        sub     w1, w0, w2
        add     w1, w2, w1, lsr 1
        lsr     w1, w1, 2
        lsl     w2, w1, 3
        sub     w1, w2, w1
        sub     w0, w0, w1
        ret

Therefore, there's something special about the divisor 7. The same happens for
divisors that are multiple of 7. Same for 19 and its multiples. And certainly
others.

Here is my educated guess (I don't know GCC's code but I know a thing or two
about G&M). 
Roughly speaking, G&M replaces n / D (where D is a constant) with n * M >> S,
where M and S depend on D. It turns out that for some divisors, the constant M
doesn't fit in 32-bits  (*) and a slightly more complicated version of G&M is
used. Now, 7 and 19 are two such divisors. 

See: https://godbolt.org/z/5Esr3ozf4 

(*) I don't know ARM64 but on x64, it's possible to lift the 32-bits
limitation. However, IIUC, most larger-than-32-bits constants can't be used as
immediate values except in mov. This might be the reason for holding the
limitation on x64 or, perhaps, this is just a  historical thing from x86 times
and a technical revisitation of G&M for x64 (and ARM64?) could be valuable.
Sorry, I'm digressing.

Reply via email to