On Fri, Feb 17, 2017 at 5:30 PM, Jakub Jelinek <ja...@redhat.com> wrote: > On Sun, May 29, 2016 at 11:10:15PM +0200, Uros Bizjak wrote: >> As explained in PR71245, comment #3 [1], it is better to use offset -4 >> to a %esp to implement a non-SSE memory fence instruction: >> >> -q- >> >> I guess it costs a code byte for a disp8 in the addressing mode, but >> it avoids adding a lot of latency to a critical path involving a >> spill/reload to (%esp), in functions where there is something at >> (%esp). >> >> If it's an object larger than 4B, the lock orl could even cause a >> store-forwarding stall when the object is reloaded. (e.g. a double or >> a vector). >> >> Ideally we could do the lock orl on some padding between two locals, >> or on something in memory that wasn't going to be loaded soon, to >> avoid touching more stack memory (which might be in the next page >> down). But we still want to do it on a cache line that's hot, so >> going way up above our own stack frame isn't good either. > > Unfortunately this makes valgrind unhappy about that: > https://bugzilla.redhat.com/show_bug.cgi?id=1423434 > I assume it will complain now on anything pre-SSE2 that contains the memory > barrier in 32-bit code. > Perhaps we should decrement and increment %esp around it or something > similar (or push/pop)? Of course, that would mean we need to take care > of async unwind info.
Or, we can simply revert the patch? Not that the barrier performance of non-SSE 32bit targets matter... Uros.