On Mon, Jul 31, 2023 at 2:57 PM Changbin Du via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hello, folks.
> This is to discuss GCC's heuristic strategy for choosing between predicated
> instructions and branches, where something probably needs to be improved.
>
> [The story]
> Weeks ago, I built a Huffman encoding program with O2, O3, and PGO
> respectively. The program is nothing special, just random code I found on the
> internet. You can download it from http://cau.ac.kr/~bongbong/dsd08/huffman.c.
>
> Build it with O2/O3/PGO (my GCC is 13.1):
> $ gcc -O2 -march=native -g -o huffman huffman.c
> $ gcc -O3 -march=native -g -o huffman.O3 huffman.c
>
> $ gcc -O2 -march=native -g -fprofile-generate -o huffman.instrumented huffman.c
> $ ./huffman.instrumented test.data
> $ gcc -O2 -march=native -g -fprofile-use=huffman.instrumented.gcda -o huffman.pgo huffman.c
>
> Run them on my 12900H laptop:
> $ head -c 50M /dev/urandom > test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman.O3 test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman.pgo test.data
>
> The result (P-core, no HT, no turbo, performance mode):
>
>                                    O2              O3              PGO
> cycles                             2,581,832,749   8,638,401,568   9,394,200,585
>                                    (1.07s)         (3.49s)         (3.80s)
> instructions                       12,609,600,094  11,827,675,782  12,036,010,638
> branches                           2,303,416,221   2,671,184,833   2,723,414,574
> branch-misses                      0.00%           7.94%           8.84%
> cache-misses                       3,012,613       3,055,722       3,076,316
> L1-icache-load-misses              11,416,391      12,112,703      11,896,077
> icache_tag.stalls                  1,553,521       1,364,092       1,896,066
> itlb_misses.stlb_hit               6,856           21,756          22,600
> itlb_misses.walk_completed         14,430          4,454           15,084
> baclears.any                       131,573         140,355         131,644
> int_misc.clear_resteer_cycles      2,545,915       586,578,125     679,021,993
> machine_clears.count               22,235          39,671          37,307
> dsb2mite_switches.penalty_cycles   6,985,838       12,929,675      8,405,493
> frontend_retired.any_dsb_miss      28,785,677      28,161,724      28,093,319
> idq.dsb_cycles_any                 1,986,038,896   5,683,820,258   5,971,969,906
> idq.dsb_uops                       11,149,445,952  26,438,051,062  28,622,657,650
> idq.mite_uops                      207,881,687     216,734,007     212,003,064
>
> The data above shows:
> o O3/PGO cause a 2.3x/2.6x performance drop relative to O2 (1.07s vs. 3.49s
>   and 3.80s wall time).
> o O3/PGO reduce instructions by 6.2% and 4.5%. I attribute this to aggressive
>   inlining.
> o O3/PGO introduce very bad branch prediction. I will explain this later.
> o Code built with O3 has more iTLB misses that hit the sTLB, but far fewer
>   completed page walks. This is beyond my expectation.
> o O3/PGO introduce 78% and 68% more machine clears. This is interesting and I
>   don't know why. (The machine-clear subcategories are not measured yet.)
> o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
> o idq.mite_uops grows by about 4% for O3/PGO, while idq.dsb_uops more than
>   doubles. The DSB hits well, so frontend fetch and decode are not the
>   problem for O3/PGO.
> o The other events are mainly side effects of the bad branch misprediction.
>
> Additionally, here is the TMA level-2 analysis: the main changes in the
> pipeline slots are in the Bad Speculation and Frontend Bound categories. I
> doubt the accuracy of tma_fetch_bandwidth, given the
> frontend_retired.any_dsb_miss and idq.mite_uops data above.
>
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman test.data
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.O3 test.data
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.pgo test.data
>
> Each run prints "test.data.huf is 1.00% of test.data". The TMA level-2
> results per run:
>
>                           O2       O3       PGO
> tma_branch_mispredicts    0.0     38.2     40.3
> tma_core_bound            0.8      6.6      6.3
> tma_heavy_operations     11.4      3.5      3.6
> tma_light_operations     76.8     21.7     19.4
> tma_memory_bound          2.0      0.9      1.2
> tma_fetch_bandwidth       8.3     20.9     17.8
> tma_fetch_latency         0.8      7.5     10.7
> tma_machine_clears        0.0      0.8      0.8
>
> elapsed (s)              1.073    3.502    3.803
> user (s)                 0.945    3.379    3.686
> sys (s)                  0.096    0.084    0.080
>
> I also tried the same program with O2/O3 on arm64, and there O3 causes a
> *30%* performance drop compared to O2.
>
> [Predicated Ins vs Branches]
> Then I analyzed the Bad Speculation problem. 99% of the mispredictions in
> O3/PGO are caused by the branch below.
>
> @@ -264,7 +264,7 @@ void bitout (FILE *f, char b) {
>
>      /* put a one on the end of this byte if b is '1' */
>
> -    if (b == '1') current_byte |= 1;
> +    //if (b == '1') current_byte |= 1;
>
>      /* one more bit */
>
> If I comment it out as in the patch above, then O3/PGO get 16% and 12%
> performance improvements over O2 on x86:
>
>                 O2              O3              PGO
> cycles          2,497,674,824   2,104,993,224   2,199,753,593
> instructions    10,457,508,646  9,723,056,131   10,457,216,225
> branches        2,303,029,380   2,250,522,323   2,302,994,942
> branch-misses   0.00%           0.01%           0.01%
>
> The main difference in the compiled code around the mispredicted branch is:
> o At O2: a predicated instruction (a cmov here) is selected to eliminate the
>   branch. cmov is truly better than a branch here.
> o At O3/PGO: bitout() is inlined into encode_file(), and a branch instruction
>   is selected. But this branch is obviously *unpredictable*, and the compiler
>   doesn't know that. This is why O3/PGO are so bad for this program.
>
> GCC doesn't support __builtin_unpredictable(), which LLVM introduced for
> exactly this situation. I then tried whether
> __builtin_expect_with_probability(e, x, 0.5) can serve the same purpose.
> The result is negative.
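For comparison: the branch can also be removed at the source level,
independent of any compiler heuristic. A minimal sketch (not the patch quoted
above; it relies only on (b == '1') evaluating to 0 or 1 in C):

    /* branch-free form of:  if (b == '1') current_byte |= 1;  */
    current_byte |= (b == '1');

With this form there is no conditional control flow for the if-conversion
heuristic to get wrong, so it may be a useful baseline when measuring against
the cmov code GCC emits at O2.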
But does it appear to be predictable with your profiling data?

> I think we can conclude that there must be something to improve in GCC's
> heuristic strategy for predicated instructions vs. branches, at least for
> O3 and PGO.
>
> And can we add __builtin_unpredictable() support to GCC? It is usually hard
> for the compiler to detect unpredictable branches on its own.
>
> --
> Cheers,
> Changbin Du
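Such a builtin would also be easy to adopt incrementally. A sketch of a
portable wrapper (the macro name is made up; today only clang provides the
builtin, so the GCC side is a plain pass-through):

    #if defined(__clang__)
    #  define UNPREDICTABLE(c) __builtin_unpredictable(c)
    #else
    #  define UNPREDICTABLE(c) (c)   /* no hint on GCC for now */
    #endif

    /* usage at the hot site in bitout(): */
    if (UNPREDICTABLE(b == '1')) current_byte |= 1;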