https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119482
--- Comment #8 from ak at gcc dot gnu.org ---
The workload does a lot of bitmap manipulations:

  5.62%  cc1plus  cc1plus  [.] bitmap_and_into(bitmap_head*, bitmap_head const*)
  5.30%  cc1plus  cc1plus  [.] bitmap_element_allocate(bitmap_head*)
  3.93%  cc1plus  cc1plus  [.] bitmap_ior_into(bitmap_head*, bitmap_head const*)
  3.85%  cc1plus  cc1plus  [.] bitmap_list_find_element(bitmap_head*, unsigned int)
  2.09%  cc1plus  cc1plus  [.] bitmap_and(bitmap_head*, bitmap_head const*, bitmap_head const*)
  1.77%  cc1plus  cc1plus  [.] bitmap_elt_ior(bitmap_head*, bitmap_element*, bitmap_element*, bitmap_element const*, bitmap_element const*)
  1.44%  cc1plus  cc1plus  [.] bitmap_set_bit(bitmap_head*, int)
  1.43%  cc1plus  cc1plus  [.] bitmap_copy(bitmap_head*, bitmap_head const*)
  1.41%  cc1plus  cc1plus  [.] bitmap_ior_and_compl(bitmap_head*, bitmap_head const*, bitmap_head const*, bitmap_head const*)
  1.33%  cc1plus  cc1plus  [.] bitmap_elt_copy(bitmap_head*, bitmap_element*, bitmap_element*, bitmap_element const*, bool)

Looking at the samples, most of the time seems to be cache misses of some
sort (the working set is too big), but bitmap_set_bit stands out with branch
mispredictions.

This simple patch improves runtime by 15%, which is more than I expected
given that it accounts for only ~1.44% of the cycles, but I guess the
mispredicts caused some downstream effects.

  ../obj-fast/gcc/cc1plus-bitmap -std=gnu++20 -O2 pr119482.cc -quiet ran
    1.15 ± 0.01 times faster than ../obj-fast/gcc/cc1plus -std=gnu++20 -O2 pr119482.cc -quiet

Most callers of bitmap_set_bit don't need the return value, but with the
conditional store the CPU still has to predict the branch correctly, since
gcc doesn't know how to emit a branchless conditional store without APX
(even though CMOV could do it with a dummy target). If we make the write
unconditional the memory bandwidth increases, but that is made up for by
fewer mispredictions (a standalone before/after sketch follows the patch
below).

From the performance counter results the change doesn't do much to the
bandwidth, but it reduces the number of branches drastically. Even though
the misprediction rate goes up, far fewer cycles are wasted because there
are fewer branches overall.

$ perf stat -e branches,branch-misses,uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ -a ../obj-fast/gcc/cc1plus -std=gnu++20 -O2 pr119482.cc -quiet -w

 Performance counter stats for 'system wide':

    41,932,957,091      branches
       686,117,623      branch-misses        #  1.64% of all branches
         43,690.47 MiB  uncore_imc/cas_count_read/
         12,362.56 MiB  uncore_imc/cas_count_write/

      49.328633365 seconds time elapsed

$ perf stat -e branches,branch-misses,uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ -a ../obj-fast/gcc/cc1plus-bitmap -std=gnu++20 -O2 pr119482.cc -quiet -w

 Performance counter stats for 'system wide':

    37,092,113,179      branches
       663,641,708      branch-misses        #  1.79% of all branches
         43,196.52 MiB  uncore_imc/cas_count_read/
         12,369.33 MiB  uncore_imc/cas_count_write/

      42.632458350 seconds time elapsed

Patch:

diff --git a/gcc/bitmap.cc b/gcc/bitmap.cc
index f5a64b495ab3..7744f8f8c2e4 100644
--- a/gcc/bitmap.cc
+++ b/gcc/bitmap.cc
@@ -969,8 +969,7 @@ bitmap_set_bit (bitmap head, int bit)
   if (ptr != 0)
     {
       bool res = (ptr->bits[word_num] & bit_val) == 0;
-      if (res)
-	ptr->bits[word_num] |= bit_val;
+      ptr->bits[word_num] |= bit_val;
       return res;
     }
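For illustration, here is a minimal standalone sketch of the two store
patterns the patch switches between. This is not GCC code; the array
layout, function names, and driver are made up for this example, assuming
a flat word array instead of GCC's linked bitmap elements:

// Before: the store is guarded by a data-dependent branch that the CPU
// must predict, even in callers that ignore the return value.
#include <cstdint>
#include <cstdio>

static bool set_bit_branchy (uint64_t *words, unsigned bit)
{
  uint64_t mask = uint64_t (1) << (bit % 64);
  uint64_t *word = &words[bit / 64];
  bool was_clear = (*word & mask) == 0;
  if (was_clear)
    *word |= mask;       // conditional store -> branch to predict
  return was_clear;
}

// After: the store happens unconditionally.  Writing an already-set bit
// is idempotent, so the result is identical; the branch disappears at
// the cost of a possibly redundant store.
static bool set_bit_branchless (uint64_t *words, unsigned bit)
{
  uint64_t mask = uint64_t (1) << (bit % 64);
  uint64_t *word = &words[bit / 64];
  bool was_clear = (*word & mask) == 0;
  *word |= mask;         // unconditional store -> no branch
  return was_clear;
}

int main ()
{
  uint64_t words[4] = {};
  // First call sets the bit and returns 1; second sees it set, returns 0.
  printf ("%d %d\n", set_bit_branchy (words, 5),
          set_bit_branchless (words, 5));
  return 0;
}

Both functions return the same values and leave the same bits set; the
only difference is whether the store sits behind a data-dependent branch,
which is exactly the trade-off the patch makes.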