On Thu, 5 Dec 2024, Filip Kastl wrote: > Hi Andi and Richi, > > sorry for the late reply. While looking for a testcase where the DP algorithm > performs better than Andi's greedy one I found out some new things about bit > test switch lowering and I had to think them through. I write about what I > found bellow. > > On Thu 2024-11-21 10:01:38, Andi Kleen wrote: > > On Fri, Nov 15, 2024 at 10:43:57AM +0100, Filip Kastl wrote: > > > Hi, > > > > > > Andi's greedy bit test finding algorithm was reverted. I found a fix for > > > the > > > problem that caused the revert. I made this patch to reintroduce the > > > greedy > > > alg into GCC. However I think we should keep the old slow but more > > > powerful > > > algorithm so I added a limit on the number of cases of the switch > > > statement > > > that decides which of the two algorithms to use. > > > > Do we actually have a case where the DP algorithm is better? > > > > In the bootstrap comparison greedy does produce less or the same clusters > > > > It seems reasonable to me to want a bit test cluster finding algorithm that > produces the minimal number of clusters (I count both the bit test clusters > and > simple clusters which correspond to single cases of the original switch). I > actually found out that, with respect to this metric, neither the DP algorithm > nor the greedy algorithm are optimal. > > I came up with this testcase > > int f0(); > int f1(); > int f2(); > int f3(); > int f4(); > > int main(int a) > { > switch (a) > { > case 0: > case 2: > case 4: > case 6: > return f0(); > case 8: > return f1(); > case 10: > case 14: > case 16: > case 18: > return f2(); > case 12: > return f3(); > case 20: > return f4(); > } > return -1; > } > > where the DP algorithm manages to cover all cases with two bit test clusters > but the greedy algorithm only finds one bit test cluster (and 4 simple > clusters > for a total of 5 clusters). > > But if you reverse the cases (meaning you map case i to case 20 - i), you get > this testcase > > int f0(); > int f1(); > int f2(); > int f3(); > int f4(); > > int main(int a) > { > switch (a) > { > case 20: > case 18: > case 16: > case 14: > return f0(); > case 12: > return f1(); > case 10: > case 6: > case 4: > case 2: > return f2(); > case 8: > return f3(); > case 0: > return f4(); > } > return -1; > } > > where the situation is reversed: Greedy manages to cover all cases with two > bit test clusters but DP only manages to find one bit test cluster. > > This surprised me. I thought that the DP algorithm is optimal and that the > greedy algorithm being better on some cases could be explained by the > differences in how the two algorithms used the 'is_beneficial' heuristic. But > that's not the case. > > I looked more thoroughly into the code of the DP algorithm. I'm now pretty > convinced the DP algorithm could give optimal solution but simply isn't > configured to do so. It first looks for minimal number of clusters and only > after it finds the clusters it goes through them and checks that they are > 'is_beneficial'. It breaks down the non-'is_beneficial' ones into simple > clusters. Instead of this the algorithm could just look for the minimal > number > of clusters such that each of them is 'is_beneficial' in the first place. So > we could have an optimal bit test clustering algorithm. We just don't at the > moment. > > Since we are currently in stage 3, I don't expect we want to go make the DP > algorithm optimal. We can do that in the next stage 1. However, I think we > still want to fix PR117091. I think we need to keep the limit I introduced at > least for the jump table clustering because that is still at least O(n^2). > For > the bit test clustering we can either use both the DP algorithm and the greedy > algorithm as I do in my patch or I guess we could just use the greedy > algorithm. Since none of the two is optimal, it isn't really clear which one > is better. On one hand the greedy algorithm is faster and Andi claims that he > saw it performing better. On the other hand, I would like to make the DP > algorithm optimal and use it in GCC 16 and it seems to me simpler to keep the > DP algorithm than to remove it and add it again later. > > What do you think, Andi and Richi? I myself slightly prefer keeping the DP > but > I would be fine with either option.
I think we can keep both, though I have no strong opinion. Richard. > > > > > + k = 0; > > > + while (i + k < l) > > > + { > > > + cluster *end_cluster = clusters[i + k]; > > > + > > > + /* Does value range fit into the BITS_PER_WORD window? */ > > > + HOST_WIDE_INT w = cluster::get_range (start_cluster->get_low (), > > > + end_cluster->get_high ()); > > > + if (w == 0 || w > BITS_PER_WORD) > > > + break; > > > + > > > + /* Check for max # of targets. */ > > > + if (targets.elements () == m_max_case_bit_tests > > > + && !targets.contains (end_cluster->m_case_bb)) > > > + break; > > > + > > > + targets.add (end_cluster->m_case_bb); > > > + k++; > > > + } > > > > > > + if (is_beneficial (k, targets.elements ())) > > > > Equivalent to the old test in DP would be k + 1 > > I'm not sure it makes a lot of difference though. > > > > I've double checked and I'm pretty sure that the k is used here the same way > it > is used in DP. In both cases, is_beneficial (unsigned, unsigned) ends up > being > called with the number of cases to be covered by a bit test cluster as the > first argument. In the DP code this is harder to see because it first calls > is_beneficial (vec<cluster> &, unsigned, unsigned) which then calls > is_beneficial (unsigned, unsigned) and also there's some +/-1 weirdness with > the end of the sequence of cases. > > > > > -Andi > > Cheers, > Filip Kastl > > P.S.: Inspired by Andi's algorithm I think I found a way to speed up the DP > from O(n^2) to O(n * BITS_PER_WORD) so effectively O(n)! So in GCC 16 we > could > have an optimal *and* fast bit test clustering algorithm. > > Btw, sorry for being sceptical about the existence of an optimal O(n) > algorithm > in the PR117091 bugzilla report. Though for jump table clustering we still > have only O(n^2) (or O(n^3)? not sure) and I don't think we can use the > "clusters are at most BITS_PER_WORD long" trick there. > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)