On Sat, Dec 27, 2014 at 11:57 AM, Andrew Haley <[email protected]> wrote:
> Is it faster? Have you measured it? Is it so much faster that it's
> critical for your application?
Well, I couldn't really leave this be: I did a little bit of benchmarking
using my company's proprietary benchmarking library, which I'll try to get
open sourced. It follows Intel's recommendations for using RDTSCP/CPUID
etc., and I've also spent some time looking at Agner Fog's techniques. I
believe it to be pretty accurate, to within a clock cycle or two.

On my laptop (Core i5 M520) the volatile and non-volatile increments are so
fast as to be within the noise - 1-2 clock cycles. So that certainly lends
support to your theory, Andrew, that it's probably not worth the effort
(other than offending my aesthetic sensibilities!). Obviously this doesn't
really take into account the extra i-cache pressure.

As a comparison, the "lock xaddl" versions come out at 18 cycles. Obviously
this is also pretty much "free" by any reasonable metric, but it's hard to
measure the impact of the bus lock on other processors' memory accesses in
a highly multi-threaded environment.

For completeness I also tried it on a few other machines:

X5670     : 0-2 clocks for normal, 28 clocks for lock xadd
E5-2667 v2: as above, 27 clocks for lock xadd
E5-2667 v3: as above, 15 clocks for lock xadd

On Sat, Dec 27, 2014 at 11:57 AM, Andrew Haley <[email protected]> wrote:
> Well, in this case you now know: it's a bug! But one that it's fairly
> hard to care deeply about, although it might get fixed now.

Understood completely!

Thanks again,

Matt
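
For anyone wanting to reproduce rough numbers like these, below is a
minimal C sketch of the RDTSCP/CPUID serialization pattern Intel
recommends. It is not the proprietary library mentioned above (no
overhead calibration, no statistics, no core pinning), and the measured
snippets - a plain increment and a lock-prefixed atomic increment - are
only stand-ins for what the JIT actually emits. The C "volatile" is there
solely to keep the compiler from folding the loop; it is not equivalent
to a Java volatile field. GCC/Clang on x86-64 assumed.

/*
 * Sketch of the RDTSCP/CPUID timing pattern: CPUID;RDTSC before the
 * measured code, RDTSCP;CPUID after, so nothing leaks into or out of
 * the measured region.
 */
#include <stdint.h>
#include <stdio.h>

/* CPUID serializes, then RDTSC reads the time-stamp counter. */
static inline uint64_t tsc_begin(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("cpuid\n\t"
                         "rdtsc\n\t"
                         : "=a"(lo), "=d"(hi)
                         :
                         : "%rbx", "%rcx");
    return ((uint64_t)hi << 32) | lo;
}

/* RDTSCP waits for the measured code to retire before reading the TSC;
 * the trailing CPUID stops later code being hoisted into the region. */
static inline uint64_t tsc_end(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtscp\n\t"
                         "mov %%edx, %1\n\t"
                         "mov %%eax, %0\n\t"
                         "cpuid\n\t"
                         : "=r"(lo), "=r"(hi)
                         :
                         : "%rax", "%rbx", "%rcx", "%rdx");
    return ((uint64_t)hi << 32) | lo;
}

static volatile long plain_counter;  /* "volatile" only to defeat loop folding */
static long atomic_counter;

int main(void) {
    enum { ITERS = 1000000 };
    uint64_t t0, t1;
    long consumed = 0;

    /* Plain (non-atomic) increment; per-op figure includes loop overhead. */
    t0 = tsc_begin();
    for (int i = 0; i < ITERS; i++)
        plain_counter++;
    t1 = tsc_end();
    printf("plain increment: %.1f cycles/op\n", (double)(t1 - t0) / ITERS);

    /* Atomic fetch-add; using the result encourages a lock xadd encoding. */
    t0 = tsc_begin();
    for (int i = 0; i < ITERS; i++)
        consumed += __atomic_fetch_add(&atomic_counter, 1, __ATOMIC_SEQ_CST);
    t1 = tsc_end();
    printf("atomic fetch-add: %.1f cycles/op\n", (double)(t1 - t0) / ITERS);

    printf("counters: %ld %ld %ld\n", (long)plain_counter, atomic_counter,
           consumed);
    return 0;
}

To get anywhere near single-cycle resolution you'd also want to pin the
process to one core, measure and subtract the timing harness's own
overhead, and take a minimum or median over many runs, as the Intel
guidance describes.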
