On Mon, Nov 25, 2013 at 2:11 AM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Fri, Nov 22, 2013 at 10:49 PM, Rong Xu <x...@google.com> wrote:
>> On Fri, Nov 22, 2013 at 4:03 AM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Fri, Nov 22, 2013 at 4:51 AM, Rong Xu <x...@google.com> wrote:
>>>> Hi,
>>>>
>>>> This patch injects a condition into the instrumented code for edge
>>>> counter update. The counter value will not be updated after reaching
>>>> value 1.
>>>>
>>>> The feature is under a new parameter --param=coverage-exec_once.
>>>> It is disabled by default; set it to 1 to enable.
>>>>
>>>> This extra check usually slows the program down. For SPEC 2006
>>>> benchmarks (all single-threaded programs), we usually see around
>>>> 20%-35% slowdown in -O2 coverage build. This feature, however, is
>>>> expected to improve the coverage run speed for multi-threaded
>>>> programs, because there is virtually no data race and false sharing
>>>> in updating counters. The improvement can be significant for highly
>>>> threaded programs -- we are seeing 7x speedup in coverage test run
>>>> for some non-trivial Google applications.
>>>>
>>>> Tested with bootstrap.
>>>
>>> Err - why not simply emit
>>>
>>>   counter = 1
>>>
>>> for the counter update itself with that --param (I don't like a --param
>>> for this either).
>>>
>>> I assume that CPUs can avoid data-races and false sharing for
>>> non-changing accesses?
>>>
>>
>> I'm not aware of any CPU having this feature. I think a write to the
>> shared cache line invalidates all the shared copies. I cannot find
>> any reference on checking the value of the write. Do you have any
>> pointer to the feature?
>
> I don't have any pointer - but I remember seeing this in the context
> of atomics, thus it may be only in the context of using a xchg
> or cmpxchg instruction. Which would make it non-portable to
> some extent (if you don't want to use atomic builtins here).
>

cmpxchg should work here -- it's a conditional write, so the data
race/false sharing can be avoided.
I'm comparing the performance between explicit branch and cmpxchg and
will report back.

-Rong

> Richard.
>
>> I just tested this implementation vs. simply setting to 1, using
>> Google search as the benchmark.
>> This one is 4.5x faster. The test was done on Intel Westmere systems.
>>
>> I can change the parameter to an option.
>>
>> -Rong
>>
>>> Richard.
>>>
>>>> Thanks,
>>>>
>>>> -Rong
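
For readers following the thread, here is a minimal sketch of the counter
update variants being compared. It is not the code the patch emits; the
type counter_t and all function names are hypothetical and chosen only for
illustration. The cmpxchg variant is written with C11 atomics, which a
compiler would typically lower to a lock cmpxchg on x86. Whether a failed
compare-exchange actually avoids taking the cache line exclusive is a
hardware question this thread leaves open.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical counter type, used for illustration only. */
typedef int64_t counter_t;

/* Ordinary coverage counter update: every execution writes the line. */
static void update_increment(counter_t *c)
{
    *c += 1;
}

/* Richard's suggestion: unconditional store of 1.
   Still a write on every execution, even when the value does not change. */
static void update_store_one(counter_t *c)
{
    *c = 1;
}

/* The variant described in the patch: explicit branch, write only while
   the counter is still 0. After the first hit the line is only read. */
static void update_exec_once(counter_t *c)
{
    if (*c == 0)
        *c = 1;
}

/* The cmpxchg alternative: atomically store 1 only if the value is 0.
   Logically a conditional write; the hardware cost of a failed compare
   is not established here. */
static void update_cmpxchg(_Atomic counter_t *c)
{
    counter_t expected = 0;
    atomic_compare_exchange_strong(c, &expected, 1);
}
```

All three "exec once" variants leave the counter at 1 no matter how many
times the edge executes; they differ only in how often they write to the
cache line, which is what the measurements above are about.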