On 6/21/2022 6:34 PM, Tamar Christina via Gcc-patches wrote:
-----Original Message-----
From: Tamar Christina
Sent: Tuesday, June 14, 2022 4:58 PM
To: Richard Sandiford <richard.sandif...@arm.com>; Richard Biener
<rguent...@suse.de>
Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
Subject: RE: [PATCH 1/2]middle-end Support optimized division by pow2
bitmask



-----Original Message-----
From: Richard Sandiford <richard.sandif...@arm.com>
Sent: Tuesday, June 14, 2022 2:43 PM
To: Richard Biener <rguent...@suse.de>
Cc: Tamar Christina <tamar.christ...@arm.com>;
gcc-patches@gcc.gnu.org; nd <n...@arm.com>
Subject: Re: [PATCH 1/2]middle-end Support optimized division by pow2
bitmask

Richard Biener <rguent...@suse.de> writes:
On Mon, 13 Jun 2022, Tamar Christina wrote:

-----Original Message-----
From: Richard Biener <rguent...@suse.de>
Sent: Monday, June 13, 2022 12:48 PM
To: Tamar Christina <tamar.christ...@arm.com>
Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Sandiford
<richard.sandif...@arm.com>
Subject: RE: [PATCH 1/2]middle-end Support optimized division by
pow2 bitmask

On Mon, 13 Jun 2022, Tamar Christina wrote:

-----Original Message-----
From: Richard Biener <rguent...@suse.de>
Sent: Monday, June 13, 2022 10:39 AM
To: Tamar Christina <tamar.christ...@arm.com>
Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard
Sandiford <richard.sandif...@arm.com>
Subject: Re: [PATCH 1/2]middle-end Support optimized division
by
pow2 bitmask

On Mon, 13 Jun 2022, Richard Biener wrote:

On Thu, 9 Jun 2022, Tamar Christina wrote:

Hi All,

In plenty of image and video processing code it's common
to modify pixel values by a widening operation and then
scale them back into range
by dividing by 255.
This patch adds an optab to allow us to emit an optimized
sequence when doing an unsigned division that is equivalent
to:
    x = y / (2 ^ (bitsize (y)/2)-1

Bootstrapped Regtested on aarch64-none-linux-gnu,
x86_64-pc-linux-gnu and no issues.

Ok for master?
Looking at 2/2 it seems that this is the wrong way to
attack the problem.  The ISA doesn't have such instruction
so adding an optab looks premature.  I suppose that there's
no unsigned vector integer division and thus we open-code
that in a different
way?
Isn't the correct thing then to fixup that open-coding if
it is more
efficient?
The problem is that even if you fixup the open-coding it would
need to be something target specific? The sequence of
instructions we generate don't have a GIMPLE representation.
So whatever is generated I'd have to fixup in RTL then.
What's the operation that doesn't have a GIMPLE representation?
For NEON use two operations:
1. Add High narrowing lowpart, essentially doing (a +w b) >>.n
bitsize(a)/2
     Where the + widens and the >> narrows.  So you give it two
shorts, get a byte 2. Add widening add of lowpart so basically
lowpart (a +w b)

For SVE2 we use a different sequence, we use two back-to-back
sequences of:
1. Add narrow high part (bottom).  In SVE the Top and Bottom
instructions
select
    Even and odd elements of the vector rather than "top half" and
"bottom
half".
    So this instruction does : Add each vector element of the first
source
vector to the
    corresponding vector element of the second source vector, and
place
the most
     significant half of the result in the even-numbered half-width
destination elements,
     while setting the odd-numbered elements to zero.

So there's an explicit permute in there. The instructions are
sufficiently different that there wouldn't be a single GIMPLE
representation.
I see.  Are these also useful to express scalar integer division?

I'll defer to others to ack the special udiv_pow2_bitmask optab or
suggest some piecemail things other targets might be able to do as
well.  It does look very special.  I'd also bikeshed it to
udiv_pow2m1 since 'bitmask' is less obvious than 2^n-1 (assuming I
interpreted 'bitmask' correctly ;)).  It seems to be even less
general since it is an unary op and the actual divisor is
constrained by the mode itself?
Yeah, those were my concerns as well.  For n-bit numbers, the same
kind of arithmetic transformation can be used for any 2^m-1 for m in
[n/2, n), so from a target-independent point of view, m==n/2 isn't
particularly special.
Hard-coding one value of m would make sense if there was an underlying
instruction that did exactly this, but like you say, there isn't.

Would a compromise be to define an optab for ADDHN and then add a
vector pattern for this division that (at least initially) prefers
ADDHN over the current approach whenever ADDHN is available?  We
could
then adapt the conditions on the pattern if other targets also provide
ADDHN but don't want this transform.  (I think the other instructions
in the pattern already have
optabs.)

That still leaves open the question about what to do about SVE2, but
the underlying problem there is that the vectoriser doesn't know about
the B/T layout.
Wouldn't it be better to just generalize the optab and to pass on the mask?
I'd prefer to do that than teach the vectorizer about ADDHN (which can't be
easily done now) let alone teaching it about B/T.   It also seems somewhat
unnecessary to diverge the implementation here in the mid-end. After all,
you can generate better SSE code here as well, so focusing on generating ISA
specific code from here for each ISA seems like the wrong approach to me.
Ping, is there any consensus here?
Not that I've seen.  The ongoing discussion has clarified a few things in my mind, but I'm still wrapping my brain around what you're doing here.

jeff

Reply via email to