https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119482

--- Comment #8 from ak at gcc dot gnu.org ---
The workload does a lot of bitmap manipulations:

     5.62%  cc1plus  cc1plus               [.] bitmap_and_into(bitmap_head*, bitmap_head const*)
     5.30%  cc1plus  cc1plus               [.] bitmap_element_allocate(bitmap_head*)
     3.93%  cc1plus  cc1plus               [.] bitmap_ior_into(bitmap_head*, bitmap_head const*)
     3.85%  cc1plus  cc1plus               [.] bitmap_list_find_element(bitmap_head*, unsigned int)
     2.09%  cc1plus  cc1plus               [.] bitmap_and(bitmap_head*, bitmap_head const*, bitmap_head const*)
     1.77%  cc1plus  cc1plus               [.] bitmap_elt_ior(bitmap_head*, bitmap_element*, bitmap_element*, bitmap_element const*, bitmap_element const*)
     1.44%  cc1plus  cc1plus               [.] bitmap_set_bit(bitmap_head*, int)
     1.43%  cc1plus  cc1plus               [.] bitmap_copy(bitmap_head*, bitmap_head const*)
     1.41%  cc1plus  cc1plus               [.] bitmap_ior_and_compl(bitmap_head*, bitmap_head const*, bitmap_head const*, bitmap_head const*)
     1.33%  cc1plus  cc1plus               [.] bitmap_elt_copy(bitmap_head*, bitmap_element*, bitmap_element*, bitmap_element const*, bool)


Looking at the samples, most of the time seems to be cache misses of some sort
(the working set is too big), but bitmap_set_bit stands out with a high branch
misprediction rate.

This simple patch improves runtime by 15%, which is more than I expected given
that the function only accounts for ~1.44% of the cycles, but I guess the
mispredicts caused some downstream effects.

  ../obj-fast/gcc/cc1plus-bitmap -std=gnu++20 -O2 pr119482.cc  -quiet ran
    1.15 ± 0.01 times faster than ../obj-fast/gcc/cc1plus -std=gnu++20 -O2 pr119482.cc  -quiet

Most callers of bitmap_set_bit don't need the return value, but with the
conditional store the CPU still has to predict the branch correctly, since gcc
doesn't know how to if-convert a store without APX (even though CMOV could do
it with a dummy target). If we make the write unconditional the memory
bandwidth increases, but that is more than made up for by fewer mispredictions.
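
For illustration, the dummy-target idea would look roughly like this (a
minimal sketch, not code gcc emits today; the set_bit_cmov name is made up):

/* Branch-free variant of the conditional store: instead of branching
   around the store, select the store *address*, which the compiler can
   lower to a CMOV.  The store itself becomes unconditional, so there
   is nothing left to predict.  */
static bool
set_bit_cmov (unsigned long *word, unsigned long mask)
{
  static unsigned long dummy;	/* scratch target for the already-set case */
  bool was_clear = (*word & mask) == 0;
  unsigned long *target = was_clear ? word : &dummy;
  *target = *word | mask;
  return was_clear;
}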

From the performance counter results it doesn't do much to the bandwidth, but
it reduces the number of branches drastically. Even though the misprediction
rate goes up, far fewer cycles are wasted because there are fewer branches
overall.

$ perf stat -e branches,branch-misses,uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ -a ../obj-fast/gcc/cc1plus -std=gnu++20 -O2 pr119482.cc  -quiet -w

 Performance counter stats for 'system wide':

    41,932,957,091      branches
        686,117,623      branch-misses                    #    1.64% of all branches
         43,690.47 MiB  uncore_imc/cas_count_read/
         12,362.56 MiB  uncore_imc/cas_count_write/

      49.328633365 seconds time elapsed

$ perf stat -e branches,branch-misses,uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ -a ../obj-fast/gcc/cc1plus-bitmap -std=gnu++20 -O2 pr119482.cc  -quiet -w

 Performance counter stats for 'system wide':

    37,092,113,179      branches
        663,641,708      branch-misses                    #    1.79% of all branches
         43,196.52 MiB  uncore_imc/cas_count_read/
         12,369.33 MiB  uncore_imc/cas_count_write/

      42.632458350 seconds time elapsed
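
Working through the numbers: the patch removes 41.93G - 37.09G ≈ 4.84G
branches (about 11.5% fewer), while branch misses also drop from 686.1M to
663.6M; the misprediction rate only rises from 1.64% to 1.79% because the
denominator (total branches) shrank faster than the misses did.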

Patch:

diff --git a/gcc/bitmap.cc b/gcc/bitmap.cc
index f5a64b495ab3..7744f8f8c2e4 100644
--- a/gcc/bitmap.cc
+++ b/gcc/bitmap.cc
@@ -969,8 +969,7 @@ bitmap_set_bit (bitmap head, int bit)
   if (ptr != 0)
     {
       bool res = (ptr->bits[word_num] & bit_val) == 0;
-      if (res)
-       ptr->bits[word_num] |= bit_val;
+      ptr->bits[word_num] |= bit_val;
       return res;
     }
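
To reproduce the effect in isolation, a micro-benchmark along these lines
should show it (a hypothetical stand-alone sketch, not from the PR; the
set_bit_* names are made up, and whether gcc keeps the branch in the branchy
variant depends on version and flags):

/* Sets randomly distributed bits in a table of words, once with a
   conditional store and once with an unconditional one.  The branch in
   the conditional version is data-dependent and hard to predict.  */
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

static bool
set_bit_branchy (uint64_t *w, uint64_t mask)
{
  bool res = (*w & mask) == 0;
  if (res)
    *w |= mask;
  return res;
}

static bool
set_bit_branchless (uint64_t *w, uint64_t mask)
{
  bool res = (*w & mask) == 0;
  *w |= mask;		/* idempotent, so storing unconditionally is safe */
  return res;
}

template <typename F>
static long long
run (F f, std::vector<uint64_t> &words, const std::vector<uint32_t> &bits,
     uint64_t *sum)
{
  std::fill (words.begin (), words.end (), 0);
  auto t0 = std::chrono::steady_clock::now ();
  for (uint32_t b : bits)
    *sum += f (&words[(b >> 6) & (words.size () - 1)], 1ULL << (b & 63));
  auto t1 = std::chrono::steady_clock::now ();
  return std::chrono::duration_cast<std::chrono::milliseconds> (t1 - t0).count ();
}

int
main ()
{
  std::mt19937 rng (42);
  std::vector<uint64_t> words (1 << 16);   /* power of two, for masking */
  std::vector<uint32_t> bits (1 << 24);
  for (uint32_t &b : bits)
    b = rng ();

  uint64_t sum = 0;
  std::printf ("branchy:    %lld ms\n", run (set_bit_branchy, words, bits, &sum));
  std::printf ("branchless: %lld ms\n", run (set_bit_branchless, words, bits, &sum));
  std::printf ("(checksum: %llu)\n", (unsigned long long) sum);
  return 0;
}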
