Robert Haas <robertmh...@gmail.com> writes: > It's not a question of whether the return value is used, but of > whether the updated value of *old is used.
Right, but if we re-read "old" in the loop, and if the primitive doesn't return "old" (or does, but call site ignores it) then in principle the CAS might be strength-reduced a bit. Whether that can win enough to be better than removing the unlocked read, I dunno. In the case at hand, looking at a loop like while (count-- > 0) { (void) pg_atomic_fetch_or_u32(myptr, 1); } with our HEAD code and RHEL6's gcc I get this for the inner loop: .L9: movl (%rdx), %eax movl %eax, %ecx orl $1, %ecx lock cmpxchgl %ecx,(%rdx) setz %cl testb %cl, %cl je .L9 subq $1, %rbx testq %rbx, %rbx jg .L9 Applying the proposed generic.h patch, it becomes .L10: movl (%rsi), %eax .L9: movl %eax, %ecx orl $1, %ecx lock cmpxchgl %ecx,(%rdx) setz %cl testb %cl, %cl je .L9 subq $1, %rbx testq %rbx, %rbx jg .L10 Note that in both cases the cmpxchgl is coming out of the asm construct in pg_atomic_compare_exchange_u32_impl from atomics/arch-x86.h, so that even if a simpler assembly instruction were possible given that we don't need %eax to be updated, there's no chance of that actually happening. This gets back to the point I made in the other thread that maybe the arch-x86.h asm sequences are not an improvement over the gcc intrinsics. The code is the same if the loop is modified to use the result of pg_atomic_fetch_or_u32, so I won't bother showing that. Adding the proposed generic-gcc.h patch, the loop reduces to .L11: lock orl $1, (%rax) subq $1, %rbx testq %rbx, %rbx jg .L11 or if we make the loop do junk += pg_atomic_fetch_or_u32(myptr, 1); then we get .L13: movl (%rsi), %eax .L10: movl %eax, %edi movl %eax, %ecx orl $1, %ecx lock cmpxchgl %ecx, (%rdx) jne .L10 addl %edi, %r8d subq $1, %rbx testq %rbx, %rbx jg .L13 So that last is slightly better than the generic.h + asm CAS version, perhaps not meaningfully so --- but it's definitely better when the compiler can see the result isn't used. In short then, given the facts on the ground here, in particular the asm-based CAS primitive at the heart of these loops, it's clear that taking the read out of the loop can't hurt. But that should be read as a rather narrow conclusion. With a different compiler and/or a different version of pg_atomic_compare_exchange_u32_impl, maybe the answer would be different. And of course it's moot once the generic-gcc.h patch is applied. Anyway, I don't have a big objection to applying this. My concern is more that we need to be taking a harder look at other parts of the atomics infrastructure, because tweaks there are likely to buy much more. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers