Re: [PATCH] gimple: Add limit after which slower switchlower algs are used [PR117091] [PR117352]

Richard Biener Thu, 05 Dec 2024 03:59:43 -0800

On Thu, 5 Dec 2024, Filip Kastl wrote:

> Hi Andi and Richi,
> 
> sorry for the late reply.  While looking for a testcase where the DP algorithm
> performs better than Andi's greedy one I found out some new things about bit
> test switch lowering and I had to think them through.  I write about what I
> found bellow.
> 
> On Thu 2024-11-21 10:01:38, Andi Kleen wrote:
> > On Fri, Nov 15, 2024 at 10:43:57AM +0100, Filip Kastl wrote:
> > > Hi,
> > > 
> > > Andi's greedy bit test finding algorithm was reverted.  I found a fix for 
> > > the
> > > problem that caused the revert.  I made this patch to reintroduce the 
> > > greedy
> > > alg into GCC.  However I think we should keep the old slow but more 
> > > powerful
> > > algorithm so I added a limit on the number of cases of the switch 
> > > statement
> > > that decides which of the two algorithms to use.
> > 
> > Do we actually have a case where the DP algorithm is better?
> > 
> > In the bootstrap comparison greedy does produce less or the same clusters
> > 
> 
> It seems reasonable to me to want a bit test cluster finding algorithm that
> produces the minimal number of clusters (I count both the bit test clusters 
> and
> simple clusters which correspond to single cases of the original switch).  I
> actually found out that, with respect to this metric, neither the DP algorithm
> nor the greedy algorithm are optimal.
> 
> I came up with this testcase
> 
> int f0();    
> int f1();    
> int f2();    
> int f3();    
> int f4();    
>      
> int main(int a)    
> {    
>     switch (a)    
>     {    
>         case 0:    
>         case 2:    
>         case 4:    
>         case 6:    
>             return f0();    
>         case 8:    
>             return f1();    
>         case 10:    
>         case 14:    
>         case 16:    
>         case 18:    
>             return f2();    
>         case 12:    
>             return f3();    
>         case 20:    
>             return f4();    
>     }    
>     return -1;    
> }
> 
> where the DP algorithm manages to cover all cases with two bit test clusters
> but the greedy algorithm only finds one bit test cluster (and 4 simple 
> clusters
> for a total of 5 clusters).
> 
> But if you reverse the cases (meaning you map case i to case 20 - i), you get
> this testcase
> 
> int f0();    
> int f1();    
> int f2();    
> int f3();    
> int f4();    
>     
> int main(int a)    
> {    
>     switch (a)    
>     {    
>         case 20:    
>         case 18:    
>         case 16:    
>         case 14:    
>             return f0();    
>         case 12:    
>             return f1();    
>         case 10:    
>         case 6:    
>         case 4:    
>         case 2:    
>             return f2();    
>         case 8:    
>             return f3();    
>         case 0:    
>             return f4();    
>     }    
>     return -1;    
> }
> 
> where the situation is reversed:  Greedy manages to cover all cases with two
> bit test clusters but DP only manages to find one bit test cluster.
> 
> This surprised me.  I thought that the DP algorithm is optimal and that the
> greedy algorithm being better on some cases could be explained by the
> differences in how the two algorithms used the 'is_beneficial' heuristic.  But
> that's not the case.
> 
> I looked more thoroughly into the code of the DP algorithm.  I'm now pretty
> convinced the DP algorithm could give optimal solution but simply isn't
> configured to do so.  It first looks for minimal number of clusters and only
> after it finds the clusters it goes through them and checks that they are
> 'is_beneficial'.  It breaks down the non-'is_beneficial' ones into simple
> clusters.  Instead of this the algorithm could just look for the minimal 
> number
> of clusters such that each of them is 'is_beneficial' in the first place.  So
> we could have an optimal bit test clustering algorithm.  We just don't at the
> moment.
> 
> Since we are currently in stage 3, I don't expect we want to go make the DP
> algorithm optimal.  We can do that in the next stage 1.  However, I think we
> still want to fix PR117091.  I think we need to keep the limit I introduced at
> least for the jump table clustering because that is still at least O(n^2).  
> For
> the bit test clustering we can either use both the DP algorithm and the greedy
> algorithm as I do in my patch or I guess we could just use the greedy
> algorithm.  Since none of the two is optimal, it isn't really clear which one
> is better.  On one hand the greedy algorithm is faster and Andi claims that he
> saw it performing better.  On the other hand, I would like to make the DP
> algorithm optimal and use it in GCC 16 and it seems to me simpler to keep the
> DP algorithm than to remove it and add it again later.
> 
> What do you think, Andi and Richi?  I myself slightly prefer keeping the DP 
> but
> I would be fine with either option.


I think we can keep both, though I have no strong opinion.

Richard.

> > 
> > > +      k = 0;
> > > +      while (i + k < l)
> > > + {
> > > +   cluster *end_cluster = clusters[i + k];
> > > +
> > > +   /* Does value range fit into the BITS_PER_WORD window?  */
> > > +   HOST_WIDE_INT w = cluster::get_range (start_cluster->get_low (),
> > > +                                         end_cluster->get_high ());
> > > +   if (w == 0 || w > BITS_PER_WORD)
> > > +     break;
> > > +
> > > +   /* Check for max # of targets.  */
> > > +   if (targets.elements () == m_max_case_bit_tests
> > > +       && !targets.contains (end_cluster->m_case_bb))
> > > +     break;
> > > +
> > > +   targets.add (end_cluster->m_case_bb);
> > > +   k++;
> > > + }
> > >  
> > > +      if (is_beneficial (k, targets.elements ()))
> > 
> > Equivalent to the old test in DP would be k + 1
> > I'm not sure it makes a lot of difference though.
> > 
> 
> I've double checked and I'm pretty sure that the k is used here the same way 
> it
> is used in DP.  In both cases, is_beneficial (unsigned, unsigned) ends up 
> being
> called with the number of cases to be covered by a bit test cluster as the
> first argument.  In the DP code this is harder to see because it first calls
> is_beneficial (vec<cluster> &, unsigned, unsigned) which then calls
> is_beneficial (unsigned, unsigned) and also there's some +/-1 weirdness with
> the end of the sequence of cases.
> 
> > 
> > -Andi
> 
> Cheers,
> Filip Kastl
> 
> P.S.: Inspired by Andi's algorithm I think I found a way to speed up the DP
> from O(n^2) to O(n * BITS_PER_WORD) so effectively O(n)!  So in GCC 16 we 
> could
> have an optimal *and* fast bit test clustering algorithm.
> 
> Btw, sorry for being sceptical about the existence of an optimal O(n) 
> algorithm
> in the PR117091 bugzilla report.  Though for jump table clustering we still
> have only O(n^2) (or O(n^3)? not sure) and I don't think we can use the
> "clusters are at most BITS_PER_WORD long" trick there.
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Re: [PATCH] gimple: Add limit after which slower switchlower algs are used [PR117091] [PR117352]

Reply via email to