https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90839
Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pinskia at gcc dot gnu.org

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Dmitrij Pochepko from comment #2)
> aarch64 won't be necessarily faster with such fix.
> 531.deepsjeng_r on ThunderX2 shows about 0.5% slower numbers with 31-clz(a).

This sounds like we only pass 0 or 1 to this function in deepsjeng_r?
Have you figured out which values deepsjeng_r uses for these loops?

Because 31-clz would be:
        clz     w0, w0
        mov     w1, 31
        sub     w0, w1, w0
--- CUT ---
While the loop version would be:
        asr     w1, w0, 1
        mov     w0, 0
        cbz     w1, .L3
        .p2align 2
.L5:
        add     w0, w0, 1
        asr     w1, w1, 1
        cbnz    w1, .L5
.L3:

If the first branch was predicted as taken (and it was actually taken; that
is, the loop is skipped), the loop version would be a few cycles faster than
the non-loop one. This would also mean the value of w0 is either 0 or 1.

Did you analyze why it was worse for TX2?
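
For reference, here is a minimal C sketch of the two forms compared above. It is an illustration only: the function names log2_loop, log2_clz and check_equivalence are hypothetical, the actual deepsjeng_r source is not quoted in this thread, and the sketch assumes a 32-bit int and a non-negative argument. The guard for a == 0 in the clz form is an addition made here, because __builtin_clz(0) is undefined; the three-instruction sequence above has no such guard.

```c
#include <assert.h>

/* Loop form, mirroring the asr/cbz/cbnz sequence: shift right once,
   then count further shifts until the value reaches 0.
   Assumes a >= 0 (a negative value would never shift down to 0). */
int log2_loop(int a)
{
    int n = 0;
    a >>= 1;
    while (a != 0) {   /* skipped entirely when a was 0 or 1 */
        n++;
        a >>= 1;
    }
    return n;
}

/* Closed form, 31 - clz(a), assuming a 32-bit int.
   The a > 0 guard is an assumption added for this sketch, since
   __builtin_clz(0) has an undefined result. */
int log2_clz(int a)
{
    return a > 0 ? 31 - __builtin_clz((unsigned int)a) : 0;
}

/* Hypothetical self-check: both forms agree on a range of inputs. */
void check_equivalence(void)
{
    for (int a = 0; a <= 65536; a++)
        assert(log2_loop(a) == log2_clz(a));
    assert(log2_loop(0x7fffffff) == log2_clz(0x7fffffff));
}
```

For inputs 0 and 1 the loop form returns 0 without entering the loop body, which is the case where a correctly predicted first branch could beat the clz sequence.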