On Fri, Dec 20, 2013 at 11:45 PM, Rong Xu <x...@google.com> wrote:
> Here are the results using our internal benchmarks which are a mixed a
> multi-threaded and single-threaded programs.
> This was collected about a month ago but I did not got time to send
> due to an unexpected trip.
>
> cmpxchg gives the worst performance due to the memory barriers it incurs.
> I'll send a patch that supports conditional_1 and unconditional_1.

Bah, too bad.  Ok, so another idea is to use non-tempoal unconditional
store of 1.  According to docs this is weakly ordered and side-steps the
cache (movnti/movntq on x86).

Btw,

+@item coverage-exec_once
+Set to 1 to update each arc counter only once. This avoids false sharing
+and speeds up profile-generate run for multi-threaded programs.
+

for -fprofile-generate this certainly is a bad choice but I can see that
it is useful for -coverage.  Also avoid mixing - and _ here.  Using a
--param here is not the proper thing for a non-developer feature,
thus I'd suggest to add -fcoverage-exec-once instead.  It is supposed
to help for -coverage, right?  Not for -fprofile-generate?

Instead of using a pointer-set to record stmts to "instrument" just
set one of the pass-local flags on the stmts (gimple_set_plf/gimple_plf,
you have to clear flags before using them).

Thanks,
Richard.

> --------------------------------- result -----------------------------
>
> base: original_coverage
> (1): using_conditional_1 -- using branch (my original implementation)
> (2): using_unconfitional_1 -- write straight 1
> (3): using_cmpxchg -- using compxchg write 1
>
> Values are performance ratios where 100.0 equals the performance of
> O2. Larger numbers are faster.
> "--" means the test failed due to running too slowly.
>
> arch        : westmere
>   Benchmark           Base      (1)    (2)      (3)
> ---------------------------------------------------------
> benchmark_1            26.4  +176.62%  +17.20%        --
> benchmark_2              --    [78.4]   [12.3]        --
> benchmark_3            86.3    +6.15%  +10.52%   -61.28%
> benchmark_4            88.4    +6.59%  +14.26%   -68.76%
> benchmark_5            89.6    +6.26%  +13.00%   -68.74%
> benchmark_6            76.7   +22.28%  +29.15%   -75.31%
> benchmark_7            89.0    -0.62%   +3.36%   -71.37%
> benchmark_8            84.5    -1.45%   +5.27%   -74.04%
> benchmark_9            81.3   +10.64%  +13.32%   -72.82%
> benchmark_10           59.1   +44.71%  +14.77%   -73.24%
> benchmark_11           90.3    -1.74%   +4.22%   -61.95%
> benchmark_12           98.9    +0.07%   +0.48%    -6.37%
> benchmark_13           74.0    -4.69%   +4.35%   -77.02%
> benchmark_14           21.4  +309.92%  +63.41%   -35.82%
> benchmark_15           21.4  +282.33%  +58.15%   -57.98%
> benchmark_16           85.1    -7.71%   +1.65%   -60.72%
> benchmark_17           81.7    +2.47%   +8.20%   -72.08%
> benchmark_18           83.7    +1.59%   +3.83%   -69.33%
> geometric mean                 +30.30%  +14.41%  -65.66% (incomplete)
>
> arch        : sandybridge
>   Benchmark           Base    (1)       (2)      (3)
> ---------------------------------------------------------
> benchmark_1             --    [70.1]   [26.1]       --
> benchmark_2             --    [79.1]       --       --
> benchmark_3           84.3   +10.82%  +15.84%  -68.98%
> benchmark_4           88.5   +10.28%  +11.35%  -75.10%
> benchmark_5           89.4   +10.46%  +11.40%  -74.41%
> benchmark_6           65.5   +38.52%  +44.46%  -77.97%
> benchmark_7           87.7    -0.16%   +1.74%  -76.19%
> benchmark_8           89.6    -4.52%   +6.29%  -78.10%
> benchmark_9           79.9   +13.43%  +19.44%  -75.99%
> benchmark_10          52.6   +61.53%   +8.23%  -78.41%
> benchmark_11          89.9    -1.40%   +3.37%  -68.16%
> benchmark_12          99.0    +1.51%   +0.63%  -10.37%
> benchmark_13          74.3    -6.75%   +3.89%  -81.84%
> benchmark_14          21.8  +295.76%  +19.48%  -51.58%
> benchmark_15          23.5  +257.20%  +29.33%  -83.53%
> benchmark_16          84.4   -10.04%   +2.39%  -68.25%
> benchmark_17          81.6    +0.60%   +8.82%  -78.02%
> benchmark_18          87.4    -1.14%   +9.69%  -75.88%
> geometric mean               +25.64%  +11.76%  -72.96% (incomplete)
>
> arch        : clovertown
>   Benchmark         Base       (1)       (2)        (3)
> --------------------------------------------------------------
> benchmark_1             --     [83.4]        --         --
> benchmark_2             --     [82.3]        --         --
> benchmark_3           86.2     +7.58%   +13.10%    -81.74%
> benchmark_4           89.4     +5.69%   +11.70%    -82.97%
> benchmark_5           92.8     +4.67%    +7.48%    -80.02%
> benchmark_6           78.1    +13.28%   +22.21%    -86.92%
> benchmark_7           96.8     +0.25%    +5.44%    -84.94%
> benchmark_8           89.1     +0.66%    +3.60%    -85.89%
> benchmark_9           86.4     +8.42%    +9.95%    -82.30%
> benchmark_10          59.7    +44.95%   +21.79%         --
> benchmark_11          91.2     -0.29%    +4.35%    -76.05%
> benchmark_12          99.0     +0.31%    -0.05%    -25.19%
> benchmark_14           8.2  +1011.27%  +104.15%     +5.56%
> benchmark_15          11.7   +669.25%  +108.54%    -29.83%
> benchmark_16          85.7     -7.51%    +4.43%         --
> benchmark_17          87.7     +2.84%    +7.45%         --
> benchmark_18          87.9     +1.59%    +3.82%    -81.11%
> geometric mean                 +37.89%   +17.54%  -74.47% (incomplete)
>
> arch        : istanbul
>
>  Benchmark           Base    (1)       (2)       (3)
> ----------------------------------------------------------
> benchmark_1             --    [73.2]       --         --
> benchmark_2             --    [82.9]       --         --
> benchmark_3           86.1    +4.56%  +11.68%    -61.04%
> benchmark_4           92.0    +3.47%   +4.63%    -64.84%
> benchmark_5           91.9    +4.18%   +4.90%    -64.77%
> benchmark_6           73.6   +23.36%  +27.13%    -72.64%
> benchmark_7           93.6    -3.57%   +4.76%    -68.54%
> benchmark_8           88.9    -3.01%   +2.87%    -75.50%
> benchmark_9           81.6    +9.91%  +14.10%    -69.81%
> benchmark_10          69.8   +23.60%   -1.53%         --
> benchmark_11          89.9    +0.34%   +4.32%    -59.62%
> benchmark_12          98.7    +0.96%   +0.88%    -10.52%
> benchmark_13          80.5    -8.31%   -0.55%    -75.90%
> benchmark_14          17.1  +429.47%   -5.21%     -4.87%
> benchmark_15          22.0  +295.70%   -1.29%    -46.36%
> benchmark_16          80.0    -6.31%   +0.54%    -62.61%
> benchmark_17          83.5    +4.84%  +10.71%    -70.48%
> benchmark_18          90.0    -1.27%   +3.18%    -72.30%
> geometric mean               +24.51%   +4.81%  -62.44% (incomplete)
>
> benchmark_13          77.1     -3.34%    +5.75%         --
>
> On Mon, Nov 25, 2013 at 11:19 AM, Rong Xu <x...@google.com> wrote:
>> On Mon, Nov 25, 2013 at 2:11 AM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Fri, Nov 22, 2013 at 10:49 PM, Rong Xu <x...@google.com> wrote:
>>>> On Fri, Nov 22, 2013 at 4:03 AM, Richard Biener
>>>> <richard.guent...@gmail.com> wrote:
>>>>> On Fri, Nov 22, 2013 at 4:51 AM, Rong Xu <x...@google.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> This patch injects a condition into the instrumented code for edge
>>>>>> counter update. The counter value will not be updated after reaching
>>>>>> value 1.
>>>>>>
>>>>>> The feature is under a new parameter --param=coverage-exec_once.
>>>>>> Default is disabled and setting to 1 to enable.
>>>>>>
>>>>>> This extra check usually slows the program down. For SPEC 2006
>>>>>> benchmarks (all single thread programs), we usually see around 20%-35%
>>>>>> slow down in -O2 coverage build. This feature, however, is expected to
>>>>>> improve the coverage run speed for multi-threaded programs, because
>>>>>> there virtually no data race and false sharing in updating counters.
>>>>>> The improvement can be significant for highly threaded programs -- we
>>>>>> are seeing 7x speedup in coverage test run for some non-trivial google
>>>>>> applications.
>>>>>>
>>>>>> Tested with bootstrap.
>>>>>
>>>>> Err - why not simply emit
>>>>>
>>>>>   counter = 1
>>>>>
>>>>> for the counter update itself with that --param (I don't like a --param
>>>>> for this either).
>>>>>
>>>>> I assume that CPUs can avoid data-races and false sharing for
>>>>> non-changing accesses?
>>>>>
>>>>
>>>> I'm not aware of any CPU having this feature. I think a write to the
>>>> shared cache line to invalidate all the shared copies. I cannot find
>>>> any reference on checking the value of the write. Do you have any
>>>> pointer to the feature?
>>>
>>> I don't have any pointer - but I remember seeing this in the context
>>> of atomics thus it may be only in the context of using a xchg
>>> or cmpxchg instruction.  Which would make it non-portable to
>>> some extent (if you don't want to use atomic builtins here).
>>>
>>
>> cmpxchg should work here -- it's a conditional write so the data race
>> /false sharing can be avoided.
>> I'm comparing the performance b/w explicit branch vs cmpxchg and will
>> report back.
>>
>> -Rong
>>
>>
>>> Richard.
>>>
>>>> I just tested this implementation vs. simply setting to 1, using
>>>> google search as the benchmark.
>>>> This one is 4.5x faster. The test was done on Intel Westmere systems.
>>>>
>>>> I can change the parameter to an option.
>>>>
>>>> -Rong
>>>>
>>>>> Richard.
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -Rong

Reply via email to