> -----Original Message-----
> From: Tamar Christina
> Sent: Tuesday, June 14, 2022 4:58 PM
> To: Richard Sandiford <richard.sandif...@arm.com>; Richard Biener
> <rguent...@suse.de>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> Subject: RE: [PATCH 1/2]middle-end Support optimized division by pow2
> bitmask
> 
> 
> 
> > -----Original Message-----
> > From: Richard Sandiford <richard.sandif...@arm.com>
> > Sent: Tuesday, June 14, 2022 2:43 PM
> > To: Richard Biener <rguent...@suse.de>
> > Cc: Tamar Christina <tamar.christ...@arm.com>;
> > gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> > Subject: Re: [PATCH 1/2]middle-end Support optimized division by pow2
> > bitmask
> >
> > Richard Biener <rguent...@suse.de> writes:
> > > On Mon, 13 Jun 2022, Tamar Christina wrote:
> > >
> > >> > -----Original Message-----
> > >> > From: Richard Biener <rguent...@suse.de>
> > >> > Sent: Monday, June 13, 2022 12:48 PM
> > >> > To: Tamar Christina <tamar.christ...@arm.com>
> > >> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Sandiford
> > >> > <richard.sandif...@arm.com>
> > >> > Subject: RE: [PATCH 1/2]middle-end Support optimized division by
> > >> > pow2 bitmask
> > >> >
> > >> > On Mon, 13 Jun 2022, Tamar Christina wrote:
> > >> >
> > >> > > > -----Original Message-----
> > >> > > > From: Richard Biener <rguent...@suse.de>
> > >> > > > Sent: Monday, June 13, 2022 10:39 AM
> > >> > > > To: Tamar Christina <tamar.christ...@arm.com>
> > >> > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard
> > >> > > > Sandiford <richard.sandif...@arm.com>
> > >> > > > Subject: Re: [PATCH 1/2]middle-end Support optimized division
> > >> > > > by
> > >> > > > pow2 bitmask
> > >> > > >
> > >> > > > On Mon, 13 Jun 2022, Richard Biener wrote:
> > >> > > >
> > >> > > > > On Thu, 9 Jun 2022, Tamar Christina wrote:
> > >> > > > >
> > >> > > > > > Hi All,
> > >> > > > > >
> > >> > > > > > In plenty of image and video processing code it's common
> > >> > > > > > to modify pixel values by a widening operation and then
> > >> > > > > > scale them back into range
> > >> > > > by dividing by 255.
> > >> > > > > >
> > >> > > > > > This patch adds an optab to allow us to emit an optimized
> > >> > > > > > sequence when doing an unsigned division that is equivalent
> to:
> > >> > > > > >
> > >> > > > > >    x = y / (2 ^ (bitsize (y)/2)-1
> > >> > > > > >
> > >> > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > >> > > > > > x86_64-pc-linux-gnu and no issues.
> > >> > > > > >
> > >> > > > > > Ok for master?
> > >> > > > >
> > >> > > > > Looking at 2/2 it seems that this is the wrong way to
> > >> > > > > attack the problem.  The ISA doesn't have such instruction
> > >> > > > > so adding an optab looks premature.  I suppose that there's
> > >> > > > > no unsigned vector integer division and thus we open-code
> > >> > > > > that in a different
> > way?
> > >> > > > > Isn't the correct thing then to fixup that open-coding if
> > >> > > > > it is more
> > >> > efficient?
> > >> > > >
> > >> > >
> > >> > > The problem is that even if you fixup the open-coding it would
> > >> > > need to be something target specific? The sequence of
> > >> > > instructions we generate don't have a GIMPLE representation.
> > >> > > So whatever is generated I'd have to fixup in RTL then.
> > >> >
> > >> > What's the operation that doesn't have a GIMPLE representation?
> > >>
> > >> For NEON use two operations:
> > >> 1. Add High narrowing lowpart, essentially doing (a +w b) >>.n
> bitsize(a)/2
> > >>     Where the + widens and the >> narrows.  So you give it two
> > >> shorts, get a byte 2. Add widening add of lowpart so basically
> > >> lowpart (a +w b)
> > >>
> > >> For SVE2 we use a different sequence, we use two back-to-back
> > sequences of:
> > >> 1. Add narrow high part (bottom).  In SVE the Top and Bottom
> > >> instructions
> > select
> > >>    Even and odd elements of the vector rather than "top half" and
> > >> "bottom
> > half".
> > >>
> > >>    So this instruction does : Add each vector element of the first
> > >> source
> > vector to the
> > >>    corresponding vector element of the second source vector, and
> > >> place
> > the most
> > >>     significant half of the result in the even-numbered half-width
> > destination elements,
> > >>     while setting the odd-numbered elements to zero.
> > >>
> > >> So there's an explicit permute in there. The instructions are
> > >> sufficiently different that there wouldn't be a single GIMPLE
> > representation.
> > >
> > > I see.  Are these also useful to express scalar integer division?
> > >
> > > I'll defer to others to ack the special udiv_pow2_bitmask optab or
> > > suggest some piecemail things other targets might be able to do as
> > > well.  It does look very special.  I'd also bikeshed it to
> > > udiv_pow2m1 since 'bitmask' is less obvious than 2^n-1 (assuming I
> > > interpreted 'bitmask' correctly ;)).  It seems to be even less
> > > general since it is an unary op and the actual divisor is
> > > constrained by the mode itself?
> >
> > Yeah, those were my concerns as well.  For n-bit numbers, the same
> > kind of arithmetic transformation can be used for any 2^m-1 for m in
> > [n/2, n), so from a target-independent point of view, m==n/2 isn't
> particularly special.
> > Hard-coding one value of m would make sense if there was an underlying
> > instruction that did exactly this, but like you say, there isn't.
> >
> > Would a compromise be to define an optab for ADDHN and then add a
> > vector pattern for this division that (at least initially) prefers
> > ADDHN over the current approach whenever ADDHN is available?  We
> could
> > then adapt the conditions on the pattern if other targets also provide
> > ADDHN but don't want this transform.  (I think the other instructions
> > in the pattern already have
> > optabs.)
> >
> > That still leaves open the question about what to do about SVE2, but
> > the underlying problem there is that the vectoriser doesn't know about
> > the B/T layout.
> 
> Wouldn't it be better to just generalize the optab and to pass on the mask?
> I'd prefer to do that than teach the vectorizer about ADDHN (which can't be
> easily done now) let alone teaching it about B/T.   It also seems somewhat
> unnecessary to diverge the implementation here in the mid-end. After all,
> you can generate better SSE code here as well, so focusing on generating ISA
> specific code from here for each ISA seems like the wrong approach to me.

Ping, is there any consensus here? 

Thanks,
Tamar

> 
> Thanks,
> Tamar
> 
> >
> > Thanks,
> > Richard

Reply via email to