https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86352

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
As for the memset issue with vectors, with -march=skylake on the trunk we get:

BucketMap::acquireBucket():
        movq    %rdi, %rax
        movq    %rsi, %rcx
.L2:
        movq    (%rsi), %rdx
        andl    $1, %edx
        lock btsq       %rdx, (%rcx)
        jc      .L2
        vpxor   %xmm15, %xmm15, %xmm15
        vmovdqu %ymm15, (%rax)
        vmovdqu %ymm15, 32(%rax)
        vmovdqu %ymm15, 64(%rax)
        vmovdqu %ymm15, 96(%rax)
        ret

I think that is close to optimal code: there are two extra register moves, caused by the way atomics are represented internally, but otherwise it is still decent code.

It is also much better than what LLVM produces:
BucketMap::acquireBucket():         # @BucketMap::acquireBucket()
        movq    %rdi, %r8
        movl    $1, %ecx
.LBB1_1:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB1_2 Depth 2
        movq    (%rsi), %rax
        andb    $1, %al
        shlxq   %rax, %rcx, %rdx
        movq    (%rsi), %rax
        .p2align        4, 0x90
.LBB1_2:                                #   Parent Loop BB1_1 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        movq    %rax, %rdi
        orq     %rdx, %rdi
        lock            cmpxchgq        %rdi, (%rsi)
        jne     .LBB1_2
# %bb.3:                                #   in Loop: Header=BB1_1 Depth=1
        testl   %eax, %edx
        jne     .LBB1_1
# %bb.4:
        vxorps  %xmm0, %xmm0, %xmm0
        vmovups %ymm0, 96(%r8)
        vmovups %ymm0, 64(%r8)
        vmovups %ymm0, 32(%r8)
        vmovups %ymm0, (%r8)
        movq    %r8, %rax
        vzeroupper
        ret

Several things are badly wrong with the LLVM code above: a compare-exchange loop where a single lock btsq would do, and an extra vzeroupper which is not needed since the upper parts of the ymm registers are already zero.
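For reference, a minimal sketch of the kind of source that can produce this pattern, assuming a spin loop that sets a bit in an atomic word and then zeroes a 128-byte bucket; the names (BucketMap, acquireBucket, Bucket) follow the symbol in the assembly, but the exact original test case is in the PR, not reconstructed here:

```cpp
#include <atomic>
#include <cassert>
#include <cstring>

// Illustrative sketch only, not the reporter's original test case.
struct Bucket { char data[128]; };

struct BucketMap {
    std::atomic<unsigned long> bits{0};
    Bucket bucket{};

    Bucket* acquireBucket() {
        // Compute a bit index from the word, then try to set that bit.
        // GCC lowers the single-bit fetch_or to "lock bts" and branches
        // on the carry flag; LLVM instead emits a cmpxchg retry loop.
        unsigned long idx = bits.load(std::memory_order_relaxed) & 1;
        while (bits.fetch_or(1ul << idx) & (1ul << idx)) {
            // Bit was already set: recompute the index and spin.
            idx = bits.load(std::memory_order_relaxed) & 1;
        }
        // The 128-byte memset is what becomes the four 32-byte
        // vmovdqu/vmovups stores in the assembly above.
        std::memset(&bucket, 0, sizeof(bucket));
        return &bucket;
    }
};
```

With the bit initially clear, the fetch_or succeeds on the first try and the returned bucket is fully zeroed.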
