On Fri, Dec 20, 2013 at 11:45 PM, Rong Xu <x...@google.com> wrote:
> Here are the results using our internal benchmarks, which are a mix of
> multi-threaded and single-threaded programs.
> This was collected about a month ago, but I did not get time to send it
> due to an unexpected trip.
>
> cmpxchg gives the worst performance due to the memory barriers it incurs.
> I'll send a patch that supports conditional_1 and unconditional_1.
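
For reference, a minimal sketch of the two update schemes named above, next
to the baseline increment. This is plain-C stand-in code, not the patch's
actual generated instrumentation; `counter' stands for one 64-bit gcov-style
arc counter:

/* Sketch only: the three non-atomic counter-update schemes discussed
   in this thread, applied to a single arc counter.  */
typedef long long gcov_type;

static gcov_type counter;

/* Baseline: what -coverage instrumentation emits today.  */
static void arc_base (void)
{
  counter++;
}

/* conditional_1: branch first, write only on the first execution,
   so a hot arc never dirties its cache line again.  */
static void arc_conditional_1 (void)
{
  if (counter == 0)
    counter = 1;
}

/* unconditional_1: always store 1; no branch, but every execution
   still writes (and thus invalidates) the line in other caches.  */
static void arc_unconditional_1 (void)
{
  counter = 1;
}

int main (void)
{
  arc_base ();
  arc_conditional_1 ();
  arc_unconditional_1 ();
  return (int) counter;
}
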
Bah, too bad. Ok, so another idea is to use a non-temporal unconditional
store of 1. According to the docs this is weakly ordered and side-steps
the cache (movnti/movntq on x86). Btw,

+@item coverage-exec_once
+Set to 1 to update each arc counter only once. This avoids false sharing
+and speeds up profile-generate run for multi-threaded programs.
+

for -fprofile-generate this certainly is a bad choice, but I can see that
it is useful for -coverage. Also avoid mixing - and _ here. Using a
--param here is not the proper thing for a non-developer feature, so I'd
suggest adding -fcoverage-exec-once instead. It is supposed to help for
-coverage, right? Not for -fprofile-generate?

Instead of using a pointer-set to record stmts to "instrument", just set
one of the pass-local flags on the stmts (gimple_set_plf/gimple_plf; you
have to clear the flags before using them). (A rough sketch of this
appears at the end of this message.)

Thanks,
Richard.
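
A sketch of the non-temporal variant Richard suggests, assuming x86-64 with
SSE2. _mm_stream_si64 compiles to movnti; the function names here are
illustrative, not patch code:

/* Sketch only: a non-temporal unconditional store of 1.  movnti is
   weakly ordered and bypasses the cache hierarchy, so repeated
   executions should not keep pulling the counter's cache line into
   the executing core.  Assumes x86-64 with SSE2.  */
#include <emmintrin.h>

typedef long long gcov_type;

static gcov_type counter;

static void arc_nontemporal_1 (void)
{
  _mm_stream_si64 ((long long *) &counter, 1);
}

int main (void)
{
  arc_nontemporal_1 ();
  /* Non-temporal stores are weakly ordered; fence before the counters
     are read elsewhere, e.g. by the gcov dumper at exit.  */
  _mm_sfence ();
  return (int) counter;
}
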
> --------------------------------- result -----------------------------
>
> base: original_coverage
> (1): using_conditional_1 -- using a branch (my original implementation)
> (2): using_unconditional_1 -- write a straight 1
> (3): using_cmpxchg -- using cmpxchg to write 1
>
> Values are performance ratios where 100.0 equals the performance of the
> -O2 build. Larger numbers are faster.
> "--" means the test failed due to running too slowly.
>
> arch : westmere
>
> Benchmark        Base       (1)        (2)        (3)
> ---------------------------------------------------------
> benchmark_1      26.4    +176.62%    +17.20%       --
> benchmark_2       --      [78.4]     [12.3]        --
> benchmark_3      86.3      +6.15%    +10.52%    -61.28%
> benchmark_4      88.4      +6.59%    +14.26%    -68.76%
> benchmark_5      89.6      +6.26%    +13.00%    -68.74%
> benchmark_6      76.7     +22.28%    +29.15%    -75.31%
> benchmark_7      89.0      -0.62%     +3.36%    -71.37%
> benchmark_8      84.5      -1.45%     +5.27%    -74.04%
> benchmark_9      81.3     +10.64%    +13.32%    -72.82%
> benchmark_10     59.1     +44.71%    +14.77%    -73.24%
> benchmark_11     90.3      -1.74%     +4.22%    -61.95%
> benchmark_12     98.9      +0.07%     +0.48%     -6.37%
> benchmark_13     74.0      -4.69%     +4.35%    -77.02%
> benchmark_14     21.4    +309.92%    +63.41%    -35.82%
> benchmark_15     21.4    +282.33%    +58.15%    -57.98%
> benchmark_16     85.1      -7.71%     +1.65%    -60.72%
> benchmark_17     81.7      +2.47%     +8.20%    -72.08%
> benchmark_18     83.7      +1.59%     +3.83%    -69.33%
> geometric mean            +30.30%    +14.41%    -65.66% (incomplete)
>
> arch : sandybridge
>
> Benchmark        Base       (1)        (2)        (3)
> ---------------------------------------------------------
> benchmark_1       --      [70.1]     [26.1]        --
> benchmark_2       --      [79.1]       --          --
> benchmark_3      84.3     +10.82%    +15.84%    -68.98%
> benchmark_4      88.5     +10.28%    +11.35%    -75.10%
> benchmark_5      89.4     +10.46%    +11.40%    -74.41%
> benchmark_6      65.5     +38.52%    +44.46%    -77.97%
> benchmark_7      87.7      -0.16%     +1.74%    -76.19%
> benchmark_8      89.6      -4.52%     +6.29%    -78.10%
> benchmark_9      79.9     +13.43%    +19.44%    -75.99%
> benchmark_10     52.6     +61.53%     +8.23%    -78.41%
> benchmark_11     89.9      -1.40%     +3.37%    -68.16%
> benchmark_12     99.0      +1.51%     +0.63%    -10.37%
> benchmark_13     74.3      -6.75%     +3.89%    -81.84%
> benchmark_14     21.8    +295.76%    +19.48%    -51.58%
> benchmark_15     23.5    +257.20%    +29.33%    -83.53%
> benchmark_16     84.4     -10.04%     +2.39%    -68.25%
> benchmark_17     81.6      +0.60%     +8.82%    -78.02%
> benchmark_18     87.4      -1.14%     +9.69%    -75.88%
> geometric mean            +25.64%    +11.76%    -72.96% (incomplete)
>
> arch : clovertown
>
> Benchmark        Base       (1)        (2)        (3)
> --------------------------------------------------------------
> benchmark_1       --      [83.4]       --          --
> benchmark_2       --      [82.3]       --          --
> benchmark_3      86.2      +7.58%    +13.10%    -81.74%
> benchmark_4      89.4      +5.69%    +11.70%    -82.97%
> benchmark_5      92.8      +4.67%     +7.48%    -80.02%
> benchmark_6      78.1     +13.28%    +22.21%    -86.92%
> benchmark_7      96.8      +0.25%     +5.44%    -84.94%
> benchmark_8      89.1      +0.66%     +3.60%    -85.89%
> benchmark_9      86.4      +8.42%     +9.95%    -82.30%
> benchmark_10     59.7     +44.95%    +21.79%       --
> benchmark_11     91.2      -0.29%     +4.35%    -76.05%
> benchmark_12     99.0      +0.31%     -0.05%    -25.19%
> benchmark_13     77.1      -3.34%     +5.75%       --
> benchmark_14      8.2   +1011.27%   +104.15%     +5.56%
> benchmark_15     11.7    +669.25%   +108.54%    -29.83%
> benchmark_16     85.7      -7.51%     +4.43%       --
> benchmark_17     87.7      +2.84%     +7.45%       --
> benchmark_18     87.9      +1.59%     +3.82%    -81.11%
> geometric mean            +37.89%    +17.54%    -74.47% (incomplete)
>
> arch : istanbul
>
> Benchmark        Base       (1)        (2)        (3)
> ----------------------------------------------------------
> benchmark_1       --      [73.2]       --          --
> benchmark_2       --      [82.9]       --          --
> benchmark_3      86.1      +4.56%    +11.68%    -61.04%
> benchmark_4      92.0      +3.47%     +4.63%    -64.84%
> benchmark_5      91.9      +4.18%     +4.90%    -64.77%
> benchmark_6      73.6     +23.36%    +27.13%    -72.64%
> benchmark_7      93.6      -3.57%     +4.76%    -68.54%
> benchmark_8      88.9      -3.01%     +2.87%    -75.50%
> benchmark_9      81.6      +9.91%    +14.10%    -69.81%
> benchmark_10     69.8     +23.60%     -1.53%       --
> benchmark_11     89.9      +0.34%     +4.32%    -59.62%
> benchmark_12     98.7      +0.96%     +0.88%    -10.52%
> benchmark_13     80.5      -8.31%     -0.55%    -75.90%
> benchmark_14     17.1    +429.47%     -5.21%     -4.87%
> benchmark_15     22.0    +295.70%     -1.29%    -46.36%
> benchmark_16     80.0      -6.31%     +0.54%    -62.61%
> benchmark_17     83.5      +4.84%    +10.71%    -70.48%
> benchmark_18     90.0      -1.27%     +3.18%    -72.30%
> geometric mean            +24.51%     +4.81%    -62.44% (incomplete)
>
> On Mon, Nov 25, 2013 at 11:19 AM, Rong Xu <x...@google.com> wrote:
>> On Mon, Nov 25, 2013 at 2:11 AM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Fri, Nov 22, 2013 at 10:49 PM, Rong Xu <x...@google.com> wrote:
>>>> On Fri, Nov 22, 2013 at 4:03 AM, Richard Biener
>>>> <richard.guent...@gmail.com> wrote:
>>>>> On Fri, Nov 22, 2013 at 4:51 AM, Rong Xu <x...@google.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> This patch injects a condition into the instrumented code for edge
>>>>>> counter updates. The counter value will not be updated after reaching
>>>>>> the value 1.
>>>>>>
>>>>>> The feature is under a new parameter --param=coverage-exec_once.
>>>>>> Default is disabled; set it to 1 to enable.
>>>>>>
>>>>>> This extra check usually slows the program down. For SPEC 2006
>>>>>> benchmarks (all single-threaded programs), we usually see around a
>>>>>> 20%-35% slowdown in an -O2 coverage build. This feature, however, is
>>>>>> expected to improve the coverage run speed for multi-threaded
>>>>>> programs, because there is virtually no data race or false sharing
>>>>>> in updating counters. The improvement can be significant for highly
>>>>>> threaded programs -- we are seeing a 7x speedup in coverage test runs
>>>>>> for some non-trivial google applications.
>>>>>>
>>>>>> Tested with bootstrap.
>>>>>
>>>>> Err - why not simply emit
>>>>>
>>>>>   counter = 1
>>>>>
>>>>> for the counter update itself with that --param (I don't like a --param
>>>>> for this either).
>>>>>
>>>>> I assume that CPUs can avoid data races and false sharing for
>>>>> non-changing accesses?
>>>>>
>>>>
>>>> I'm not aware of any CPU having this feature. I think a write to a
>>>> shared cache line will invalidate all the shared copies. I cannot find
>>>> any reference to checking the value of the write. Do you have any
>>>> pointer to the feature?
>>>
>>> I don't have any pointer - but I remember seeing this in the context
>>> of atomics, thus it may be only in the context of using an xchg
>>> or cmpxchg instruction. Which would make it non-portable to
>>> some extent (if you don't want to use atomic builtins here).
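
For completeness, a sketch of the portable form Richard alludes to, using
GCC's __atomic builtins rather than raw x86 cmpxchg. Illustrative only; note
that on x86 even this relaxed variant is emitted as lock cmpxchg, which
acquires the cache line exclusively and acts as a full barrier -- consistent
with the slowdowns in column (3) above:

/* Sketch only: compare-and-swap the counter from 0 to 1 via the
   portable __atomic builtins (GCC 4.7+).  A lock-prefixed cmpxchg
   still takes exclusive ownership of the line even when the compare
   fails, so it does not avoid the coherence traffic.  */
typedef long long gcov_type;

static gcov_type counter;

static void arc_cmpxchg_1 (void)
{
  gcov_type expected = 0;
  __atomic_compare_exchange_n (&counter, &expected, 1,
                               0 /* strong */,
                               __ATOMIC_RELAXED, __ATOMIC_RELAXED);
}

int main (void)
{
  arc_cmpxchg_1 ();
  arc_cmpxchg_1 ();  /* second call: compare fails, counter stays 1 */
  return (int) counter;
}
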
>>>
>>
>> cmpxchg should work here -- it's a conditional write, so the data race
>> / false sharing can be avoided.
>> I'm comparing the performance between the explicit branch and cmpxchg,
>> and will report back.
>>
>> -Rong
>>
>>> Richard.
>>>
>>>> I just tested this implementation vs. simply setting the counter to 1,
>>>> using google search as the benchmark.
>>>> This one is 4.5x faster. The test was done on Intel Westmere systems.
>>>>
>>>> I can change the parameter to an option.
>>>>
>>>> -Rong
>>>>
>>>>> Richard.
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -Rong
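
Circling back to Richard's gimple_plf note above: a rough sketch of how the
pointer-set could be replaced by a pass-local flag. This is a GCC-internal
fragment (API as of 2013), not a standalone program; instrument_set and
instrument_edge_counter are hypothetical names standing in for whatever the
patch actually uses:

/* Sketch only: mark stmts with a pass-local flag instead of keeping
   them in a pointer_set_t.  GF_PLF_1 must be cleared on every stmt
   first, since an earlier pass may have left it set.  */

/* While scanning all stmts of the function, clear the flag:  */
gimple_set_plf (stmt, GF_PLF_1, false);

/* Instead of pointer_set_insert (instrument_set, stmt):  */
gimple_set_plf (stmt, GF_PLF_1, true);

/* Instead of pointer_set_contains (instrument_set, stmt):  */
if (gimple_plf (stmt, GF_PLF_1))
  instrument_edge_counter (stmt);  /* hypothetical helper */
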