Hi,

For x86, shift instructions automatically mask the count to 5 bits in 32-bit mode and to 6 bits in 64-bit mode, so in the testcase below, buf_ << (-end & 63) could be optimized to buf_ << -end. With the trunk compiler, however, the mask is only removed in some of the places in the testcase.
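As a minimal illustration of the pattern, separate from the full testcase below (this reduction and the function name are my own):

typedef unsigned long long uint64;

/* The shift count is computed with an explicit "& 63".  Since the
   x86-64 salq/shrq instructions only look at the low 6 bits of %cl,
   the andl is redundant once the masked value feeds nothing but the
   shift, and the compiler can drop it.  */
uint64 shift_masked_count(uint64 x, uint64 n) {
  return x << (-n & 63);
}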
typedef unsigned long long uint64;
typedef unsigned int uint32;

class Decoder {
 public:
  Decoder() : k_minus_1_(0), buf_(0), bits_left_(0) {}
  ~Decoder() {}

  uint32 ExtractBits(uint64 end, uint64 start);

  inline uint32 GetBits(int bits) {
    uint32 val = ExtractBits(bits, 0);
    buf_ >>= bits;
    bits_left_ -= bits;
    return val;
  }

  uint32 Get(uint32 bits);

  uint32 k_minus_1_;
  uint64 buf_;
  unsigned long bits_left_;
};

uint32 Decoder::ExtractBits(uint64 end, uint64 start) {
  return (buf_ << (-end & 63)) >> ((start - end) & 63);
}

uint32 Decoder::Get(uint32 bits) {
  bits += k_minus_1_;
  uint32 msbit = (bits > (k_minus_1_ + 1));
  return GetBits(bits - msbit) | (msbit << (bits - 1));
}

The assembly generated by "g++ -O2 -S 1.C":

        .file   "1.c"
        .text
        .align 2
        .p2align 4,,15
        .globl  _ZN7Decoder11ExtractBitsEyy
        .type   _ZN7Decoder11ExtractBitsEyy, @function
_ZN7Decoder11ExtractBitsEyy:
.LFB7:
        .cfi_startproc
        movq    8(%rdi), %rax
        movl    %esi, %ecx
        negl    %ecx
        salq    %cl, %rax        ===> Here (-end & 63) is optimized away.
        movl    %edx, %ecx
        subl    %esi, %ecx
        shrq    %cl, %rax
        ret
        .cfi_endproc
.LFE7:
        .size   _ZN7Decoder11ExtractBitsEyy, .-_ZN7Decoder11ExtractBitsEyy
        .align 2
        .p2align 4,,15
        .globl  _ZN7Decoder3GetEj
        .type   _ZN7Decoder3GetEj, @function
_ZN7Decoder3GetEj:
.LFB8:
        .cfi_startproc
        movl    (%rdi), %eax
        movq    8(%rdi), %r8
        addl    %eax, %esi
        addl    $1, %eax
        movq    %r8, %r10
        cmpl    %esi, %eax
        movl    %esi, %ecx
        setb    %al
        movzbl  %al, %eax
        subl    %eax, %ecx
        movl    %ecx, %edx
        shrq    %cl, %r10
        movslq  %ecx, %r9
        negl    %edx
        movq    %r10, 8(%rdi)
        subq    %r9, 16(%rdi)
        andl    $63, %edx        ==> Inst A: the (-end & 63) is not optimized away.
        movq    %r8, %rdi
        movl    %edx, %ecx
        salq    %cl, %rdi        ==> Inst B: uses the result of (-end & 63).
        shrq    %cl, %rdi        ==> Inst C: uses the result of (-end & 63).
        leal    -1(%rsi), %ecx
        sall    %cl, %eax
        orl     %edi, %eax
        ret
        .cfi_endproc
.LFE8:

In Decoder::Get(), the (-end & 63) in Inst A is not optimized away because the two (-end & 63) expressions are CSEd after ExtractBits() is inlined into GetBits() and then into Get(). That leaves a single def feeding multiple downstream uses, which the combine phase cannot handle; this is an old problem in combine. The same limitation explains why (-end & 63) is optimized away in _ZN7Decoder11ExtractBitsEyy: there the masked count has a single use (the salq), so combine can fold the mask into the shift.

To overcome this weakness of combine for this testcase, I am wondering whether we can extend fwprop to solve the problem, because fwprop handles these "single def, multiple down uses" cases. The existing fwprop is restrictive: it only tries to propagate when the src of the def insn is a reg, subreg, or constant, and it only propagates when the simplification after propagation can collapse the expression to a constant. Can we extend it to try to propagate more complex exprs from def to use (here, propagate the andl operation from Inst A into the shift count operands of Inst B and Inst C), and let try_fwprop_subst in fwprop.c decide whether to keep the change by comparing costs? Or is there a better way to solve the problem?

Any suggestions are appreciated a lot!

Thanks,
Wei.
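P.S. For reference, here is a reduced standalone case, hand-written rather than extracted from the original program, that I believe has the same shape as Decoder::Get() after inlining: a single andl whose result feeds two shifts.

typedef unsigned long long uint64;

uint64 two_uses(uint64 buf, uint64 n) {
  /* Single def of the masked count ...  */
  uint64 c = -n & 63;
  /* ... feeding two uses (the left and the right shift).  As I
     understand it, combine only substitutes defs with a single use,
     so it cannot merge the andl into either shift here.  */
  return (buf << c) | (buf >> c);
}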