https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86352
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---

As for the memset issue with vectors, with -march=skylake on trunk we get:

BucketMap::acquireBucket():
        movq    %rdi, %rax
        movq    %rsi, %rcx
.L2:
        movq    (%rsi), %rdx
        andl    $1, %edx
        lock btsq       %rdx, (%rcx)
        jc      .L2
        vpxor   %xmm15, %xmm15, %xmm15
        vmovdqu %ymm15, (%rax)
        vmovdqu %ymm15, 32(%rax)
        vmovdqu %ymm15, 64(%rax)
        vmovdqu %ymm15, 96(%rax)
        ret

which I think is close to the best code. There are two extra moves, due to the way atomics are represented, but otherwise it is still decent code, and much better than what LLVM produces:

BucketMap::acquireBucket():             # @BucketMap::acquireBucket()
        movq    %rdi, %r8
        movl    $1, %ecx
.LBB1_1:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB1_2 Depth 2
        movq    (%rsi), %rax
        andb    $1, %al
        shlxq   %rax, %rcx, %rdx
        movq    (%rsi), %rax
        .p2align        4, 0x90
.LBB1_2:                                #   Parent Loop BB1_1 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        movq    %rax, %rdi
        orq     %rdx, %rdi
        lock cmpxchgq   %rdi, (%rsi)
        jne     .LBB1_2
# %bb.3:                                #   in Loop: Header=BB1_1 Depth=1
        testl   %eax, %edx
        jne     .LBB1_1
# %bb.4:
        vxorps  %xmm0, %xmm0, %xmm0
        vmovups %ymm0, 96(%r8)
        vmovups %ymm0, 64(%r8)
        vmovups %ymm0, 32(%r8)
        vmovups %ymm0, (%r8)
        movq    %r8, %rax
        vzeroupper
        ret

There are several things badly wrong with the LLVM code above: a compare-exchange (cmpxchg) loop instead of a single lock btsq, and an extra vzeroupper which is not needed since the upper parts of the ymm registers are already zero.
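For reference, a minimal C++ sketch of the kind of source that produces this shape of code. This is an assumed reconstruction, not the actual testcase from the bug report: the Bucket layout, the lock/key parameters, and the use of fetch_or are all guesses based on the assembly (an atomic RMW that sets one bit and retries while it was already set, followed by zeroing a 128-byte bucket):

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Hypothetical 128-byte bucket, matching the four 32-byte ymm stores.
struct Bucket {
    unsigned char data[128];
};

// Hypothetical reconstruction: spin until the bit selected by *key is
// acquired, then return a zero-initialized bucket.  GCC can compile the
// single-bit fetch_or-and-test pattern to "lock btsq; jc", while a generic
// lowering falls back to a cmpxchg loop.
Bucket acquireBucket(std::atomic<std::uint64_t>* lock, const std::uint64_t* key) {
    for (;;) {
        std::uint64_t bit = *key & 1;               // movq (%rsi),%rdx / andl $1
        std::uint64_t mask = std::uint64_t(1) << bit;
        // fetch_or returns the previous value; if our bit was already set,
        // another thread holds it, so retry.
        if (!(lock->fetch_or(mask) & mask))
            break;
    }
    Bucket b;
    std::memset(&b, 0, sizeof b);   // vectorized into the ymm stores above
    return b;
}
```

With no contention the loop body runs once, which is why the GCC output is essentially the straight-line bts-plus-stores sequence.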