https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153
--- Comment #13 from ncm at cantrip dot org ---
The diff below is essentially the entire difference in generated code
between the versions of puzzlegen-int.cc without and with the added
"++count;" line referenced above (modulo register assignments and
branch labels). That added line sidesteps the +50% pessimization:
(Asm is from "g++ -fverbose-asm -std=c++14 -O3 -Wall -S $SRC.cc" using
g++ (Debian 5.2.1-15) 5.2.1 20150808, with no instruction-set extensions
specified. Output with "-mbmi -mbmi2" has different instructions, but
they do not noticeably affect run time on Haswell i7-4770.)
@@ -793,25 +793,26 @@
.L141:
movl (%rdi), %esi # MEM[base: _244, offset: 0], word
testl %r11d, %esi # D.66634, word
jne .L138 #,
xorl %eax, %eax # tmp419
cmpl %esi, %r12d # word, seven
leaq 208(%rsp), %rcx #, tmp574
sete %al #, tmp419
movl %r12d, %edx # seven, seven
leal 1(%rax,%rax), %r8d #, D.66619
.p2align 4,,10
.p2align 3
.L140:
movl %edx, %eax # seven, D.66634
negl %eax # D.66634
andl %edx, %eax # seven, D.66622
testl %eax, %esi # D.66622, word
je .L139 #,
addl %r8d, 24(%rcx) # D.66619, MEM[base: _207, offset: 24B]
+ addl $1, %ebx #, count
.L139:
notl %eax # D.66622
subq $4, %rcx #, ivtmp.424
andl %eax, %edx # D.66622, seven
jne .L140 #,
addq $4, %rdi #, ivtmp.428
cmpq %rdi, %r10 # ivtmp.428, D.66637
jne .L141 #,
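For readers who prefer source to asm, here is a rough C++ reading of the
loop above. It is a reconstruction from the instructions, not the actual
puzzlegen-int.cc source. The names `word`, `seven`, and `count` come from
the -fverbose-asm comments; `Tally`, `score`, `excluded`, and `tally_words`
are made up for illustration, and treating %r11d as a mask of excluded
letters is an assumption.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical container for the results; not from the original program.
struct Tally {
    std::array<int, 7> score{};  // one slot per set bit of `seven`
    int count = 0;               // the extra counter that changes the timing
};

inline Tally tally_words(const unsigned* words, std::size_t n,
                         unsigned seven, unsigned excluded)
{
    // Assumed precondition: at most 7 bits set, so `place` stays in 6..0.
    assert(std::bitset<32>(seven).count() <= 7);
    Tally t;
    for (std::size_t i = 0; i != n; ++i) {          // .L141
        unsigned word = words[i];
        if (word & excluded) continue;              // testl / jne .L138
        int points = (word == seven) ? 3 : 1;       // sete + leal 1(%rax,%rax)
        int place = 6;
        for (unsigned rest = seven; rest != 0; --place) {  // .L140
            unsigned bit = rest & -rest;            // lowest set bit of rest
            if (word & bit) {                       // testl / je .L139
                t.score[place] += points;           // addl %r8d, 24(%rcx)
                ++t.count;                          // the added "++count;"
            }
            rest &= ~bit;                           // notl / andl: clear it
        }
    }
    return t;
}
```

The inner loop's exit test depends on `rest`, which in turn depends on the
bit just cleared, so the loop-carried chain is short but every branch hangs
off it.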
I also tried a version of the program with a fixed-length loop (over
'place' in [6..0]), so that no branch depends on the result of
"rest &= ~-rest". The compiler unrolled the loop, but the program
ran at the pessimized speed with or without the "++count;" line.
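A minimal sketch of what such a fixed-length variant might look like, for
one word. This is an illustration under assumptions, not the code that was
actually timed: the function name `tally_fixed`, its signature, and the
precondition are hypothetical; only `place`, `rest`, `word`, `seven`,
`count`, and the expression "rest &= ~-rest" come from this report.

```cpp
#include <cassert>

// Hypothetical helper: scores one word against `seven` using a loop with a
// fixed trip count of 7, so the loop branch no longer depends on `rest`.
inline int tally_fixed(unsigned word, unsigned seven, int points,
                       int (&score)[7])
{
    // Assumed precondition: the caller has already rejected words that use
    // letters outside `seven`.
    assert((word & ~seven) == 0);
    int count = 0;
    unsigned rest = seven;
    for (int place = 6; place >= 0; --place) {
        unsigned bit = rest & -rest;   // lowest set bit of rest (0 if none)
        if (word & bit) {
            score[place] += points;
            ++count;
        }
        rest &= ~-rest;                // same as rest &= rest - 1
    }
    return count;
}
```

For unsigned `rest`, `~-rest` equals `rest - 1`, so the update clears the
lowest set bit and leaves a zero `rest` at zero; the trailing iterations
are then harmless if `seven` has fewer than 7 bits set.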
I am very curious whether this has been reproduced on others' Haswells,
and on Ivy Bridge and Skylake.