RE: [RFC] Implementing detection of saturation and rounding arithmetic

Richard Biener Wed, 12 May 2021 03:28:57 -0700

On Wed, 12 May 2021, Tamar Christina wrote:

> 
> 
> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Tuesday, May 11, 2021 12:45 PM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: gcc@gcc.gnu.org; Richard Sandiford <richard.sandif...@arm.com>;
> > Jakub Jelinek <ja...@redhat.com>
> > Subject: Re: [RFC] Implementing detection of saturation and rounding
> > arithmetic
> > 
> > On Tue, 11 May 2021, Tamar Christina wrote:
> > 
> > > Hi All,
> > >
> > > We are looking to implement saturation support in the compiler.  The
> > > aim is to recognize both Scalar and Vector variant of typical saturating
> > expressions.
> > >
> > > As an example:
> > >
> > > 1. Saturating addition:
> > >    char sat (char a, char b)
> > >    {
> > >       int tmp = a + b;
> > >       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> > >    }
> > >
> > > 2. Saturating abs:
> > >    char sat (char a)
> > >    {
> > >       int tmp = abs (a);
> > >       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> > >    }
> > >
> > > 3. Rounding shifts
> > >    char rndshift (char dc)
> > >    {
> > >       int round_const = 1 << (shift - 1);
> > >       return (dc + round_const) >> shift;
> > >    }
> > >
> > > etc.
> > >
> > > Of course the first issue is that C does not really have a single
> > > idiom for expressing this.
> > >
> > > At the RTL level we have ss_truncate and us_truncate and
> > > float_truncate for truncation.
> > >
> > > At the Tree level we have nothing for truncation (I believe) for
> > > scalars. For Vector code there already seems to be VEC_PACK_SAT_EXPR
> > > but it looks like nothing actually generates this at the moment. it's 
> > > just an
> > unused tree code.
> > >
> > > For rounding there doesn't seem to be any existing infrastructure.
> > >
> > > The proposal to handle these are as follow, keep in mind that all of
> > > these also exist in their scalar form, as such detecting them in the
> > > vectorizer would be the wrong place.
> > >
> > > 1. Rounding:
> > >    a) Use match.pd to rewrite various rounding idioms to shifts.
> > >    b) Use backwards or forward prop to rewrite these to internal functions
> > >       where even if the target does not support these rounding 
> > > instructions
> > they
> > >       have a chance to provide a more efficient implementation than what
> > would
> > >       be generated normally.
> > >
> > > 2. Saturation:
> > >    a) Use match.pd to rewrite the various saturation expressions into
> > min/max
> > >       operations which opens up the expressions to further optimizations.
> > >    b) Use backwards or forward prop to convert to internal functions if 
> > > the
> > >       resulting min/max expression still meet the criteria for being a
> > >       saturating expression.  This follows the algorithm as outlined in 
> > > "The
> > >       Software Vectorization handbook" by Aart J.C. Bik.
> > >
> > >       We could get the right instructions by using combine if we don't 
> > > rewrite
> > >       the instructions to an internal function, however then during
> > Vectorization
> > >       we would overestimate the cost of performing the saturation.  The
> > constants
> > >       will the also be loaded into registers and so becomes a lot more 
> > > difficult
> > >       to cleanup solely in the backend.
> > >
> > > The one thing I am wondering about is whether we would need an
> > > internal function for all operations supported, or if it should be
> > > modelled as an internal FN which just "marks" the operation as
> > > rounding/saturating. After all, the only difference between a normal
> > > and saturating expression in RTL is the xx_truncate RTL surrounding
> > > the expression.  Doing so would also mean that all targets whom have
> > saturating instructions would automatically benefit from this.
> > >
> > > But it does mean a small adjustment to the costing, which would need
> > > to cost the internal function call and the argument together as a whole.
> > >
> > > Any feedback is appreciated to minimize the number of changes required
> > > to the final patch.  Any objections to the outlined approach?
> > 
> > I think it makes sense to pattern-match the operations on GIMPLE and follow
> > the approach take by __builtin_add_overflow & friends.
> > Maybe quickly check whether clang provides some builtins already which we
> > could implement.
> > 
> 
> Hmm so taking a look at __builtin_add_overflow and friends it looks like they 
> use
> Internal functions like ADD_OVERFLOW. Biggest difference being that it sets a 
> condition
> flag.  This does me wonder if something like
> 
> int f (int a, int b)
> {
>     int res;
>     if (__builtin_add_overflow (a, b, &res))
>       {
>           if (res >= 0)
>             return INT_MAX;
>           else
>             return INT_MIN;
>       }
>     return res;
> }
> 
> Should be recognized as saturating as well.  But yeah, following the same 
> approach we
> would rewrite the sequence to something like res = .ADD_SAT (a, b);
> 
> I know that in the past there was concerns about doing such pattern matching 
> early which
> is why I thought about following the same approach we do with the table based 
> ctz recognition
> and do it in backprop.


I think the danger with matching the saturation part late is that jump
optimizations break it apart.  That might be also good of course(?)
in case for example we can prove the value never is < -128 but only
possibly > 127 so we can use a single MAX_EXPR.  Which makes me wonder
whether you see us recognizing MIN/MAX and you're just pattern matching
that?

> For completeness, the OVERFLOW version of ADD in AArch64 maps to ADDS, but 
> the saturating to SQADD.
> 
> > There's some appeal to mimicing what RTL does - thus have the saturation be
> > represented as saturating truncation.
> > Maybe that's what users expect of builtins as well.
> 
> That would also certainly cut down on the number of internal FNs we would 
> introduce. But, the
> one technical issue I see is that I don't know how to ask the target if it 
> supports it or not.

We don't ask the target for overflow support either but have fallback
RTL expansion.  What I'd suggest is to match the saturation operation
separately and independent of target support and do a late combining
with the feeding arithmetic operations using direct internal functions
to the optabs when the target supports this.

So the saturation op would match (target-type)MIN<MAX<op, tt-min>, tt-max>
thus three GIMPLE stmts.  Thus we'd have a SAT_CONVERT_EXPR eventually
with restriction on sign-changes?

> Normally the internal-fn are ti
> The internal-fn and the next instruction into an optab so the target can 
> expand to it.  It's because of this
> that I wonder whether this added complexity is worth it.  Unless there's 
> already such a mechanism In place?

I think direct internal fns are prefered in case we look to restrict
things to what target support.

Richard.

> > 
> > I'm not sure what the rounding shift would do - 'shift' isn't an argument to
> > rndshift here.  It feels like it's a rounding division but only by powers 
> > of two.
> > Does ROUND_DIV_EXPR already provide the desired semantics?
> 
> Oops, yes shift should have been an argument to the function.
> For the given example indeed I think this is ROUND_DIV_EXPR but we have other 
> ones such as
> Rounding half-ing addition, Will need to check if that's just a 
> ROUND_DIV_EXPR over an ADD.
> 
> For completeness, the AArch64 instruction for this example is SRSHL.
> 
> Thanks!
> 
> Regards,
> Tamar
> 
> > 
> > Richard.
>

RE: [RFC] Implementing detection of saturation and rounding arithmetic

Reply via email to