https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89895
Bug ID: 89895 Summary: Unable to sink high half of widening multiply out of loop Product: gcc Version: 8.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: lkml at sdf dot org Target Milestone: --- This is part of gcc's general problem with double-word values, but I was encouraged to submit a PR, since it's a particularly simple but real-world-applicable test case. Lemire's algorithm for uniform random number generation in a range (https://arxiv.org/abs/1805.10941) has the following core: static uint64_t __attribute__((noinline)) get_random_u64(void); u64 get_random_range(uint64_t range, uint64_t lim) { unsigned __int128 prod; do { prod = (unsigned __int128)range * get_random_u64(); } while ((uint64_t)prod < lim); return prod >> 64; } (In practice, get_random_u64() would be inlined, but I've left it out of line for exposition.) GCC's isn't sinking generation of the high half of the product out of the loop. This particularly applies on platforms with a separate multiply-high instruction like alpha: $L9: bsr $26,get_random_u64 !samegp mulq $0,$9,$1 umulh $0,$9,$0 cmpule $10,$1,$1 beq $1,$L9 and PowerPC: .L12: bl get_random_u64 mulld 9,3,31 mulhdu 3,3,31 cmpld 7,30,9 bgt+ 7,.L12 But is also applies to MIPS, where the mfhi could be sunk out of the loop: .L10: jal get_random_u64 nop dmultu $2,$17 mflo $2 sltu $6,$2,$16 bne $6,$0,.L10 mfhi $3 In this case, there's nothing *better* to do in the delay slot than mfhi, but that's kind of an accident. The code I'd hope to see is Alpha: $L9: bsr $26,get_random_u64 mulq $0,$9,$1 cmpule $10,$1,$1 beq $1,$L9 umulh $0,$9,$0 PowerPC: .L12: bl get_random_u64 mulld 9,3,31 cmpld 7,30,9 bgt+ 7,.L12 mulhdu 3,3,31 and (when the mulditi3 expander is added) MIPS r6: .L10: balc get_random_u64 dmulu $3, $2, $17 sltu $3, $3, $16 bnezc $3, .L10 dmuhu $2, $2, $17 In these cases, since the low-half multiply is the last multiply in the loop, the high half will still catch the hardware-optimized case for both halves of a multiply.