On Wed, 27 Mar 2024, Michael Matz wrote:

> Hey,
>
> On Wed, 27 Mar 2024, Jakub Jelinek wrote:
>
> > > @@ -1712,12 +1711,9 @@ compute_idf (bitmap def_blocks, bitmap_head *dfs)
> > >        gcc_checking_assert (bb_index
> > > 			     < (unsigned) last_basic_block_for_fn (cfun));
> > >
> > > -      EXECUTE_IF_AND_COMPL_IN_BITMAP (&dfs[bb_index], phi_insertion_points,
> > > -				      0, i, bi)
> > > -	{
> > > +      EXECUTE_IF_SET_IN_BITMAP (&dfs[bb_index], 0, i, bi)
> > > +	if (bitmap_set_bit (phi_insertion_points, i))
> > > 	  bitmap_set_bit (work_set, i);
> > > -	  bitmap_set_bit (phi_insertion_points, i);
> > > -	}
> > >      }
> >
> > I don't understand why the above is better.
> > Wouldn't it be best to do
> >   bitmap_ior_and_compl_into (work_set, &dfs[bb_index],
> > 			     phi_insertion_points);
> >   bitmap_ior_into (phi_insertion_points, &dfs[bb_index]);
> > ?
>
> I had the same hunch, but:
>
> 1) One would have to make work_set be non-tree-view again (which with the
> current structure is a wash anyway, and that makes sense as accesses to
> work_set aren't heavily random here).
The tree-view is a wash indeed (I tried many things).

> 2) But doing that and using bitmap_ior.._into is still measurably slower:
> on a reduced testcase with -O0 -fno-checking, proposed structure
> (tree-view or not-tree-view workset doesn't matter):
>
>  tree SSA rewrite   :  14.93 ( 12%)   0.01 (  2%)  14.95 ( 12%)    27M (  8%)
>
> with non-tree-view, and your suggestion:
>
>  tree SSA rewrite   :  20.68 ( 12%)   0.02 (  4%)  20.75 ( 12%)    27M (  8%)
>
> I can only speculate that the usually extreme sparsity of the bitmaps in
> question makes the setup costs of the two bitmap_ior calls actually more
> expensive than the often skipped second call to bitmap_set_bit in Richi's
> proposed structure.  (That or cache effects)

So slightly "better" than Jakub's variant would be

  if (bitmap_ior_and_compl_into (work_set, &dfs[bb_index],
				 phi_insertion_points))
    bitmap_ior_into (phi_insertion_points, &dfs[bb_index]);

since phi_insertion_points grows, that IOR becomes more expensive over time.
The above for me (today Zen2, yesterday Zen4) is

 tree SSA rewrite                   : 181.02 ( 37%)

with unconditional ior_into:

 tree SSA rewrite                   : 180.93 ( 36%)

while my patch is

 tree SSA rewrite                   :  22.04 (  6%)

not sure what uarch Micha tested on.

I think the testcase simply has many variables we write into SSA (many
compute_idf calls) and many BBs, but very low popcount DFS[], so iterating
over DFS[] only is very beneficial here as opposed to also walking
phi_insertion_points and work_set.  I think low popcount DFS[] is quite
typical for a CFG - but for sure the popcount of DFS[] is going to be lower
than the popcount of the IDF (phi_insertion_points).

Btw, with my patch compute_idf is completely off the profile, so it's hard
to improve further (we do process blocks possibly twice, for example, but
that doesn't make a difference here).

Indeed, gathering statistics shows that the maximum popcount of a dominance
frontier is 8, but 99% have just a single block.  The popcount of the final
IDF, on the other hand, is more than 10000 half of the time and more than
1000 90% of the time.

I have pushed the patch now.

Richard.
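P.S.  For anyone who wants to play with the access pattern outside of GCC,
here is a minimal standalone sketch of the IDF worklist using the structure
from the patch.  std::unordered_set stands in for GCC's sparse bitmaps, and
compute_idf_sketch, dfs and seeds are illustrative names, not the actual GCC
interfaces:

#include <unordered_set>
#include <vector>

/* Return the iterated dominance frontier of the blocks in SEEDS, given
   per-block dominance frontiers DFS.  Not the GCC implementation; the
   containers are stand-ins for GCC's sparse bitmaps.  */
std::unordered_set<int>
compute_idf_sketch (const std::vector<std::unordered_set<int>> &dfs,
		    const std::vector<int> &seeds)
{
  std::unordered_set<int> idf;			/* phi_insertion_points  */
  std::vector<int> work_set (seeds.begin (), seeds.end ());

  while (!work_set.empty ())
    {
      int bb = work_set.back ();
      work_set.pop_back ();

      /* Per popped block the work is O(|DFS[bb]|) lookups/inserts, and
	 DFS[bb] almost always has a single member, while idf can grow to
	 thousands of blocks.  The rejected alternative,
	   work_set |= DFS[bb] - idf;  idf |= DFS[bb];
	 walks the ever-growing idf on every step.  */
      for (int d : dfs[bb])
	if (idf.insert (d).second)	/* like bitmap_set_bit returning true  */
	  work_set.push_back (d);
    }

  return idf;
}

The point is only the shape of the loop: per popped block it touches nothing
but DFS[bb], which is what keeps the per-step cost independent of how large
the IDF has already grown.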