> -----Original Message----- > From: Tamar Christina > Sent: Tuesday, June 14, 2022 4:58 PM > To: Richard Sandiford <richard.sandif...@arm.com>; Richard Biener > <rguent...@suse.de> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com> > Subject: RE: [PATCH 1/2]middle-end Support optimized division by pow2 > bitmask > > > > > -----Original Message----- > > From: Richard Sandiford <richard.sandif...@arm.com> > > Sent: Tuesday, June 14, 2022 2:43 PM > > To: Richard Biener <rguent...@suse.de> > > Cc: Tamar Christina <tamar.christ...@arm.com>; > > gcc-patches@gcc.gnu.org; nd <n...@arm.com> > > Subject: Re: [PATCH 1/2]middle-end Support optimized division by pow2 > > bitmask > > > > Richard Biener <rguent...@suse.de> writes: > > > On Mon, 13 Jun 2022, Tamar Christina wrote: > > > > > >> > -----Original Message----- > > >> > From: Richard Biener <rguent...@suse.de> > > >> > Sent: Monday, June 13, 2022 12:48 PM > > >> > To: Tamar Christina <tamar.christ...@arm.com> > > >> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Sandiford > > >> > <richard.sandif...@arm.com> > > >> > Subject: RE: [PATCH 1/2]middle-end Support optimized division by > > >> > pow2 bitmask > > >> > > > >> > On Mon, 13 Jun 2022, Tamar Christina wrote: > > >> > > > >> > > > -----Original Message----- > > >> > > > From: Richard Biener <rguent...@suse.de> > > >> > > > Sent: Monday, June 13, 2022 10:39 AM > > >> > > > To: Tamar Christina <tamar.christ...@arm.com> > > >> > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard > > >> > > > Sandiford <richard.sandif...@arm.com> > > >> > > > Subject: Re: [PATCH 1/2]middle-end Support optimized division > > >> > > > by > > >> > > > pow2 bitmask > > >> > > > > > >> > > > On Mon, 13 Jun 2022, Richard Biener wrote: > > >> > > > > > >> > > > > On Thu, 9 Jun 2022, Tamar Christina wrote: > > >> > > > > > > >> > > > > > Hi All, > > >> > > > > > > > >> > > > > > In plenty of image and video processing code it's common > > >> > > > > > to modify pixel values by a widening operation and then > > >> > > > > > scale them back into range > > >> > > > by dividing by 255. > > >> > > > > > > > >> > > > > > This patch adds an optab to allow us to emit an optimized > > >> > > > > > sequence when doing an unsigned division that is equivalent > to: > > >> > > > > > > > >> > > > > > x = y / (2 ^ (bitsize (y)/2)-1 > > >> > > > > > > > >> > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu, > > >> > > > > > x86_64-pc-linux-gnu and no issues. > > >> > > > > > > > >> > > > > > Ok for master? > > >> > > > > > > >> > > > > Looking at 2/2 it seems that this is the wrong way to > > >> > > > > attack the problem. The ISA doesn't have such instruction > > >> > > > > so adding an optab looks premature. I suppose that there's > > >> > > > > no unsigned vector integer division and thus we open-code > > >> > > > > that in a different > > way? > > >> > > > > Isn't the correct thing then to fixup that open-coding if > > >> > > > > it is more > > >> > efficient? > > >> > > > > > >> > > > > >> > > The problem is that even if you fixup the open-coding it would > > >> > > need to be something target specific? The sequence of > > >> > > instructions we generate don't have a GIMPLE representation. > > >> > > So whatever is generated I'd have to fixup in RTL then. > > >> > > > >> > What's the operation that doesn't have a GIMPLE representation? > > >> > > >> For NEON use two operations: > > >> 1. Add High narrowing lowpart, essentially doing (a +w b) >>.n > bitsize(a)/2 > > >> Where the + widens and the >> narrows. So you give it two > > >> shorts, get a byte 2. Add widening add of lowpart so basically > > >> lowpart (a +w b) > > >> > > >> For SVE2 we use a different sequence, we use two back-to-back > > sequences of: > > >> 1. Add narrow high part (bottom). In SVE the Top and Bottom > > >> instructions > > select > > >> Even and odd elements of the vector rather than "top half" and > > >> "bottom > > half". > > >> > > >> So this instruction does : Add each vector element of the first > > >> source > > vector to the > > >> corresponding vector element of the second source vector, and > > >> place > > the most > > >> significant half of the result in the even-numbered half-width > > destination elements, > > >> while setting the odd-numbered elements to zero. > > >> > > >> So there's an explicit permute in there. The instructions are > > >> sufficiently different that there wouldn't be a single GIMPLE > > representation. > > > > > > I see. Are these also useful to express scalar integer division? > > > > > > I'll defer to others to ack the special udiv_pow2_bitmask optab or > > > suggest some piecemail things other targets might be able to do as > > > well. It does look very special. I'd also bikeshed it to > > > udiv_pow2m1 since 'bitmask' is less obvious than 2^n-1 (assuming I > > > interpreted 'bitmask' correctly ;)). It seems to be even less > > > general since it is an unary op and the actual divisor is > > > constrained by the mode itself? > > > > Yeah, those were my concerns as well. For n-bit numbers, the same > > kind of arithmetic transformation can be used for any 2^m-1 for m in > > [n/2, n), so from a target-independent point of view, m==n/2 isn't > particularly special. > > Hard-coding one value of m would make sense if there was an underlying > > instruction that did exactly this, but like you say, there isn't. > > > > Would a compromise be to define an optab for ADDHN and then add a > > vector pattern for this division that (at least initially) prefers > > ADDHN over the current approach whenever ADDHN is available? We > could > > then adapt the conditions on the pattern if other targets also provide > > ADDHN but don't want this transform. (I think the other instructions > > in the pattern already have > > optabs.) > > > > That still leaves open the question about what to do about SVE2, but > > the underlying problem there is that the vectoriser doesn't know about > > the B/T layout. > > Wouldn't it be better to just generalize the optab and to pass on the mask? > I'd prefer to do that than teach the vectorizer about ADDHN (which can't be > easily done now) let alone teaching it about B/T. It also seems somewhat > unnecessary to diverge the implementation here in the mid-end. After all, > you can generate better SSE code here as well, so focusing on generating ISA > specific code from here for each ISA seems like the wrong approach to me.
Ping, is there any consensus here? Thanks, Tamar > > Thanks, > Tamar > > > > > Thanks, > > Richard