On Mon, 13 Jun 2022, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Monday, June 13, 2022 12:48 PM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Sandiford
> > <richard.sandif...@arm.com>
> > Subject: RE: [PATCH 1/2]middle-end Support optimized division by pow2
> > bitmask
> > 
> > On Mon, 13 Jun 2022, Tamar Christina wrote:
> > 
> > > > -----Original Message-----
> > > > From: Richard Biener <rguent...@suse.de>
> > > > Sent: Monday, June 13, 2022 10:39 AM
> > > > To: Tamar Christina <tamar.christ...@arm.com>
> > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Sandiford
> > > > <richard.sandif...@arm.com>
> > > > Subject: Re: [PATCH 1/2]middle-end Support optimized division by
> > > > pow2 bitmask
> > > >
> > > > On Mon, 13 Jun 2022, Richard Biener wrote:
> > > >
> > > > > On Thu, 9 Jun 2022, Tamar Christina wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > In plenty of image and video processing code it's common to
> > > > > > modify pixel values by a widening operation and then scale them
> > > > > > back into range by dividing by 255.
> > > > > >
> > > > > > This patch adds an optab to allow us to emit an optimized
> > > > > > sequence when doing an unsigned division that is equivalent to:
> > > > > >
> > > > > >    x = y / (2 ^ (bitsize (y)/2) - 1)
> > > > > >
> > > > > > Bootstrapped and regtested on aarch64-none-linux-gnu and
> > > > > > x86_64-pc-linux-gnu with no issues.
> > > > > >
> > > > > > Ok for master?
> > > > >
> > > > > Looking at 2/2 it seems that this is the wrong way to attack the
> > > > > problem.  The ISA doesn't have such an instruction, so adding an
> > > > > optab looks premature.  I suppose that there's no unsigned vector
> > > > > integer division and thus we open-code that in a different way?
> > > > > Isn't the correct thing then to fix up that open-coding if it is
> > > > > more efficient?
> > > >
> > >
> > > The problem is that even if you fix up the open-coding it would need
> > > to be something target-specific.  The sequence of instructions we
> > > generate doesn't have a GIMPLE representation, so whatever is
> > > generated I'd have to fix up in RTL then.
> > 
> > What's the operation that doesn't have a GIMPLE representation?
> 
> For NEON we use two operations:
> 1. Add high narrowing lowpart, essentially doing (a +w b) >>.n bitsize(a)/2,
>    where the + widens and the >> narrows.  So you give it two shorts and
>    get a byte.
> 2. Widening add of the lowpart, so basically lowpart (a +w b).
> 
> For SVE2 we use a different sequence: two back-to-back sequences of:
> 1. Add narrowing high part (bottom).  In SVE the Top and Bottom
>    instructions select even and odd elements of the vector rather than
>    "top half" and "bottom half".
> 
>    So this instruction does: add each vector element of the first source
>    vector to the corresponding vector element of the second source
>    vector, and place the most significant half of the result in the
>    even-numbered half-width destination elements, while setting the
>    odd-numbered elements to zero.
> 
> So there's an explicit permute in there.  The instructions are
> sufficiently different that there wouldn't be a single GIMPLE
> representation.
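In scalar terms the NEON chain above computes x / 255 on each 16-bit lane via the identity x / 255 == (x + ((x + 257) >> 8)) >> 8.  A sketch (the function name is mine; the constant 257 and the three-step shape can be seen in the RTL dump further down; the identity needs x + 257 not to wrap at 16 bits, which holds here because the vectorized loop only feeds it products of two 8-bit values, at most 255 * 255):

```c
#include <stdint.h>

/* Scalar model of the NEON sequence for x / 255 on one 16-bit element:
     addhn  x, #257  ->  t = high half of the (wrapping) 16-bit add
     uaddw  x, t     ->  s = x + zero_extend (t)
     ushr   s, #8    ->  s >> 8
   Valid for x <= 65278, i.e. whenever x + 257 does not wrap.  */
static uint16_t div255_addhn_model (uint16_t x)
{
  uint8_t  t = (uint8_t) ((uint16_t) (x + 257u) >> 8);  /* addhn */
  uint16_t s = (uint16_t) (x + t);                      /* uaddw */
  return s >> 8;                                        /* ushr #8 */
}
```

Checking every input up to 65278 against plain division confirms the identity.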

I see.  Are these also useful to express scalar integer division?

I'll defer to others to ack the special udiv_pow2_bitmask optab
or suggest some piecemeal things other targets might be able to do as 
well.  It does look very special.  I'd also bikeshed it to
udiv_pow2m1 since 'bitmask' is less obvious than 2^n-1 (assuming
I interpreted 'bitmask' correctly ;)).  It seems to be even less
general since it is a unary op and the actual divisor is constrained
by the mode itself?
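To make that constraint concrete (my reading of the patch, so treat this as an assumption rather than the definitive semantics): for an element of width 2N bits the implied divisor is 2^N - 1, so the divisor follows from the mode alone and the operation is effectively unary:

```c
#include <stdint.h>

/* One function per element mode; the divisor is fixed by the mode,
   which is why the optab takes a single input operand.  */
static uint16_t udiv_pow2m1_hi (uint16_t x) { return x / 0xffu; }        /* 2^8  - 1 */
static uint32_t udiv_pow2m1_si (uint32_t x) { return x / 0xffffu; }      /* 2^16 - 1 */
static uint64_t udiv_pow2m1_di (uint64_t x) { return x / 0xffffffffu; }  /* 2^32 - 1 */
```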

Richard.

> > 
> > I think for costing you could resort to the *_cost functions as used by
> > synth_mult and friends.
> > 
> > > The problem with this is that it seemed fragile.  We generate from the
> > > vectorizer:
> > >
> > >   vect__3.8_35 = MEM <vector(16) unsigned char> [(uint8_t *)_21];
> > >   vect_patt_28.9_37 = WIDEN_MULT_LO_EXPR <vect__3.8_35, vect_cst__36>;
> > >   vect_patt_28.9_38 = WIDEN_MULT_HI_EXPR <vect__3.8_35, vect_cst__36>;
> > >   vect_patt_19.10_40 = vect_patt_28.9_37 h* { 32897, 32897, 32897, 32897, 32897, 32897, 32897, 32897 };
> > >   vect_patt_19.10_41 = vect_patt_28.9_38 h* { 32897, 32897, 32897, 32897, 32897, 32897, 32897, 32897 };
> > >   vect_patt_25.11_42 = vect_patt_19.10_40 >> 7;
> > >   vect_patt_25.11_43 = vect_patt_19.10_41 >> 7;
> > >   vect_patt_11.12_44 = VEC_PACK_TRUNC_EXPR <vect_patt_25.11_42, vect_patt_25.11_43>;
> > >
> > > and if the magic constants change then we miss the optimization.  I
> > > could rewrite the open coding to use shifts alone, but that might be a
> > > regression for some uarches, I would imagine.
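For reference, a scalar sketch of what the open-coded GIMPLE above computes per 16-bit lane (the magic constant 32897 = 0x8081 and the shift by 7 are taken from the dump; the function name is mine):

```c
#include <stdint.h>

/* Model of the open-coded division: highpart multiply by the magic
   constant 32897 (0x8081), then a logical shift right by 7, i.e.
   x / 255 == (x * 32897) >> 23 for any 16-bit x.  */
static uint16_t div255_mulh_model (uint16_t x)
{
  uint16_t hi = (uint16_t) (((uint32_t) x * 32897u) >> 16);  /* x h* 32897 */
  return (uint16_t) (hi >> 7);                               /* >> 7 */
}
```

An exhaustive check over all 65536 inputs against plain division confirms the magic constant.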
> > 
> > OK, so you do have a highpart multiply.  I suppose the pattern is too
> > deep to be recognized by combine?  What's the RTL good vs. bad before
> > combine of one of the expressions?
> 
> Yeah, combine only tries 2-3 instructions, but to use these sequences we
> have to match the entire chain, as the instructions do the narrowing
> themselves.  So the RTL for the bad case before combine is
> 
> (insn 39 37 42 4 (set (reg:V4SI 119)
>         (mult:V4SI (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 116 [ vect_patt_28.9D.3754 ])
>                     (parallel:V8HI [
>                             (const_int 4 [0x4])
>                             (const_int 5 [0x5])
>                             (const_int 6 [0x6])
>                             (const_int 7 [0x7])
>                         ])))
>             (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 118)
>                     (parallel:V8HI [
>                             (const_int 4 [0x4])
>                             (const_int 5 [0x5])
>                             (const_int 6 [0x6])
>                             (const_int 7 [0x7])
>                         ]))))) "/app/example.c":6:14 2114 {aarch64_simd_vec_umult_hi_v8hi}
>      (expr_list:REG_DEAD (reg:V8HI 116 [ vect_patt_28.9D.3754 ])
>         (expr_list:REG_EQUAL (mult:V4SI (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 116 [ vect_patt_28.9D.3754 ])
>                         (parallel:V8HI [
>                                 (const_int 4 [0x4])
>                                 (const_int 5 [0x5])
>                                 (const_int 6 [0x6])
>                                 (const_int 7 [0x7])
>                             ])))
>                 (const_vector:V4SI [
>                         (const_int 32897 [0x8081]) repeated x4
>                     ]))
>             (nil))))
> (insn 42 39 43 4 (set (reg:V8HI 121 [ vect_patt_19.10D.3755 ])
>         (unspec:V8HI [
>                 (subreg:V8HI (reg:V4SI 117) 0)
>                 (subreg:V8HI (reg:V4SI 119) 0)
>             ] UNSPEC_UZP2)) "/app/example.c":6:14 4096 {aarch64_uzp2v8hi}
>      (expr_list:REG_DEAD (reg:V4SI 119)
>         (expr_list:REG_DEAD (reg:V4SI 117)
>             (nil))))
> (insn 43 42 44 4 (set (reg:V8HI 124 [ vect_patt_25.11D.3756 ])
>         (lshiftrt:V8HI (reg:V8HI 121 [ vect_patt_19.10D.3755 ])
>             (const_vector:V8HI [
>                     (const_int 7 [0x7]) repeated x8
>                 ]))) "/app/example.c":6:14 1803 {aarch64_simd_lshrv8hi}
>      (expr_list:REG_DEAD (reg:V8HI 121 [ vect_patt_19.10D.3755 ])
>         (nil)))
> (insn 44 43 46 4 (set (reg:V8HI 125 [ vect_patt_28.9D.3754 ])
>         (mult:V8HI (zero_extend:V8HI (vec_select:V8QI (reg:V16QI 115 [ MEM <vector(16) unsigned charD.21> [(uint8_tD.3704 *)_21 clique 1 base 1] ])
>                     (parallel:V16QI [
>                             (const_int 8 [0x8])
>                             (const_int 9 [0x9])
>                             (const_int 10 [0xa])
>                             (const_int 11 [0xb])
>                             (const_int 12 [0xc])
>                             (const_int 13 [0xd])
>                             (const_int 14 [0xe])
>                             (const_int 15 [0xf])
>                         ])))
>             (zero_extend:V8HI (vec_select:V8QI (reg:V16QI 100 [ vect_cst__36 ])
>                     (parallel:V16QI [
>                             (const_int 8 [0x8])
>                             (const_int 9 [0x9])
>                             (const_int 10 [0xa])
>                             (const_int 11 [0xb])
>                             (const_int 12 [0xc])
>                             (const_int 13 [0xd])
>                             (const_int 14 [0xe])
>                             (const_int 15 [0xf])
>                         ]))))) "/app/example.c":6:14 2112 {aarch64_simd_vec_umult_hi_v16qi}
>      (expr_list:REG_DEAD (reg:V16QI 115 [ MEM <vector(16) unsigned charD.21> [(uint8_tD.3704 *)_21 clique 1 base 1] ])
>         (nil)))
> (insn 46 44 48 4 (set (reg:V4SI 126)
>         (mult:V4SI (zero_extend:V4SI (subreg:V4HI (reg:V8HI 125 [ vect_patt_28.9D.3754 ]) 0))
>             (zero_extend:V4SI (subreg:V4HI (reg:V8HI 118) 0)))) "/app/example.c":6:14 2108 {aarch64_intrinsic_vec_umult_lo_v4hi}
>      (expr_list:REG_EQUAL (mult:V4SI (zero_extend:V4SI (subreg:V4HI (reg:V8HI 125 [ vect_patt_28.9D.3754 ]) 0))
>             (const_vector:V4SI [
>                     (const_int 32897 [0x8081]) repeated x4
>                 ]))
>         (nil)))
> (insn 48 46 51 4 (set (reg:V4SI 128)
>         (mult:V4SI (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 125 [ vect_patt_28.9D.3754 ])
>                     (parallel:V8HI [
>                             (const_int 4 [0x4])
>                             (const_int 5 [0x5])
>                             (const_int 6 [0x6])
>                             (const_int 7 [0x7])
>                         ])))
>             (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 118)
>                     (parallel:V8HI [
>                             (const_int 4 [0x4])
>                             (const_int 5 [0x5])
>                             (const_int 6 [0x6])
>                             (const_int 7 [0x7])
>                         ]))))) "/app/example.c":6:14 2114 {aarch64_simd_vec_umult_hi_v8hi}
>      (expr_list:REG_DEAD (reg:V8HI 125 [ vect_patt_28.9D.3754 ])
>         (expr_list:REG_EQUAL (mult:V4SI (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 125 [ vect_patt_28.9D.3754 ])
>                         (parallel:V8HI [
>                                 (const_int 4 [0x4])
>                                 (const_int 5 [0x5])
>                                 (const_int 6 [0x6])
>                                 (const_int 7 [0x7])
>                             ])))
>                 (const_vector:V4SI [
>                         (const_int 32897 [0x8081]) repeated x4
>                     ]))
>             (nil))))
> (insn 51 48 52 4 (set (reg:V8HI 130 [ vect_patt_19.10D.3755 ])
>         (unspec:V8HI [
>                 (subreg:V8HI (reg:V4SI 126) 0)
>                 (subreg:V8HI (reg:V4SI 128) 0)
>             ] UNSPEC_UZP2)) "/app/example.c":6:14 4096 {aarch64_uzp2v8hi}
>      (expr_list:REG_DEAD (reg:V4SI 128)
>         (expr_list:REG_DEAD (reg:V4SI 126)
>             (nil))))
> (insn 52 51 53 4 (set (reg:V8HI 133 [ vect_patt_25.11D.3756 ])
>         (lshiftrt:V8HI (reg:V8HI 130 [ vect_patt_19.10D.3755 ])
>             (const_vector:V8HI [
>                     (const_int 7 [0x7]) repeated x8
>                 ]))) "/app/example.c":6:14 1803 {aarch64_simd_lshrv8hi}
>      (expr_list:REG_DEAD (reg:V8HI 130 [ vect_patt_19.10D.3755 ])
>         (nil)))
> 
> And for good:
> 
> (insn 32 30 34 4 (set (reg:V16QI 118)
>         (vec_concat:V16QI (unspec:V8QI [
>                     (reg:V8HI 114 [ vect_patt_28.9 ])
>                     (reg:V8HI 115)
>                 ] UNSPEC_ADDHN)
>             (const_vector:V8QI [
>                     (const_int 0 [0]) repeated x8
>                 ]))) "draw.c":6:35 2688 {aarch64_addhnv8hi_insn_le}
>      (expr_list:REG_EQUAL (vec_concat:V16QI (unspec:V8QI [
>                     (reg:V8HI 114 [ vect_patt_28.9 ])
>                     (const_vector:V8HI [
>                             (const_int 257 [0x101]) repeated x8
>                         ])
>                 ] UNSPEC_ADDHN)
>             (const_vector:V8QI [
>                     (const_int 0 [0]) repeated x8
>                 ]))
>         (nil)))
> (insn 34 32 35 4 (set (reg:V8HI 117)
>         (plus:V8HI (zero_extend:V8HI (subreg:V8QI (reg:V16QI 118) 0))
>             (reg:V8HI 114 [ vect_patt_28.9 ]))) "draw.c":6:35 2635 {aarch64_uaddwv8qi}
>      (expr_list:REG_DEAD (reg:V16QI 118)
>         (expr_list:REG_DEAD (reg:V8HI 114 [ vect_patt_28.9 ])
>             (nil))))
> (insn 35 34 37 4 (set (reg:V8HI 103 [ vect_patt_25.10 ])
>         (lshiftrt:V8HI (reg:V8HI 117)
>             (const_vector:V8HI [
>                     (const_int 8 [0x8]) repeated x8
>                 ]))) "draw.c":6:35 1741 {aarch64_simd_lshrv8hi}
>      (expr_list:REG_DEAD (reg:V8HI 117)
>         (nil)))
> (insn 37 35 39 4 (set (reg:V16QI 122)
>         (vec_concat:V16QI (unspec:V8QI [
>                     (reg:V8HI 102 [ vect_patt_28.9 ])
>                     (reg:V8HI 115)
>                 ] UNSPEC_ADDHN)
>             (const_vector:V8QI [
>                     (const_int 0 [0]) repeated x8
>                 ]))) "draw.c":6:35 2688 {aarch64_addhnv8hi_insn_le}
>      (expr_list:REG_EQUAL (vec_concat:V16QI (unspec:V8QI [
>                     (reg:V8HI 102 [ vect_patt_28.9 ])
>                     (const_vector:V8HI [
>                             (const_int 257 [0x101]) repeated x8
>                         ])
>                 ] UNSPEC_ADDHN)
>             (const_vector:V8QI [
>                     (const_int 0 [0]) repeated x8
>                 ]))
>         (nil)))
> (insn 39 37 40 4 (set (reg:V8HI 121)
>         (plus:V8HI (zero_extend:V8HI (subreg:V8QI (reg:V16QI 122) 0))
>             (reg:V8HI 102 [ vect_patt_28.9 ]))) "draw.c":6:35 2635 {aarch64_uaddwv8qi}
>      (expr_list:REG_DEAD (reg:V16QI 122)
>         (expr_list:REG_DEAD (reg:V8HI 102 [ vect_patt_28.9 ])
>             (nil))))
> (insn 40 39 41 4 (set (reg:V8HI 104 [ vect_patt_25.10 ])
>         (lshiftrt:V8HI (reg:V8HI 121)
>             (const_vector:V8HI [
>                     (const_int 8 [0x8]) repeated x8
>                 ]))) "draw.c":6:35 1741 {aarch64_simd_lshrv8hi}
> 
> Cheers,
> Tamar
> 
> > 
> > > > Btw, on x86 we use
> > > >
> > > > t.c:3:21: note:   replacing earlier pattern patt_25 = patt_28 / 255;
> > > > t.c:3:21: note:   with patt_25 = patt_19 >> 7;
> > > > t.c:3:21: note:   extra pattern stmt: patt_19 = patt_28 h* 32897;
> > > >
> > > > which translates to
> > > >
> > > >         vpmulhuw        %ymm4, %ymm0, %ymm0
> > > >         vpmulhuw        %ymm4, %ymm1, %ymm1
> > > >         vpsrlw  $7, %ymm0, %ymm0
> > > >         vpsrlw  $7, %ymm1, %ymm1
> > > >
> > > > there are odd
> > > >
> > > >         vpand   %ymm0, %ymm3, %ymm0
> > > >         vpand   %ymm1, %ymm3, %ymm1
> > > >
> > > > before (%ymm3 is all 0x00ff)
> > > >
> > > >         vpackuswb       %ymm1, %ymm0, %ymm0
> > > >
> > > > that's not visible in GIMPLE.  I guess aarch64 lacks a highpart
> > > > multiply here?  In any case, it seems that generic division
> > > > expansion could be improved here?  (choose_multiplier?)
> > >
> > > We do generate a multiply highpart here, but the patch completely
> > > avoids multiplies and shifts entirely by creative use of the ISA.
> > > Another reason I went for an optab is costing.  The chosen operations
> > > are significantly cheaper on all Arm uarches than shifts and
> > > multiplies.
> > >
> > > This means we get vectorization in some cases where the cost model
> > > would correctly say it's too expensive to vectorize, particularly
> > > around double precision.
> > >
> > > Thanks,
> > > Tamar
> > >
> > > >
> > > > Richard.
> > > >
> > > > > Richard.
> > > > >
> > > > > > Thanks,
> > > > > > Tamar
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > >     * internal-fn.def (DIV_POW2_BITMASK): New.
> > > > > >     * optabs.def (udiv_pow2_bitmask_optab): New.
> > > > > >     * doc/md.texi: Document it.
> > > > > >     * tree-vect-patterns.cc (vect_recog_divmod_pattern): Recognize
> > > > pattern.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > >     * gcc.dg/vect/vect-div-bitmask-1.c: New test.
> > > > > >     * gcc.dg/vect/vect-div-bitmask-2.c: New test.
> > > > > >     * gcc.dg/vect/vect-div-bitmask-3.c: New test.
> > > > > >     * gcc.dg/vect/vect-div-bitmask.h: New file.
> > > > > >
> > > > > > --- inline copy of patch --
> > > > > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > > > > > index f3619c505c025f158c2bc64756531877378b22e1..784c49d7d24cef7619e4d613f7b4f6e945866c38 100644
> > > > > > --- a/gcc/doc/md.texi
> > > > > > +++ b/gcc/doc/md.texi
> > > > > > @@ -5588,6 +5588,18 @@ signed op0, op1;
> > > > > >  op0 = op1 / (1 << imm);
> > > > > >  @end smallexample
> > > > > >
> > > > > > +@cindex @code{udiv_pow2_bitmask@var{m2}} instruction pattern
> > > > > > +@item @samp{udiv_pow2_bitmask@var{m2}}
> > > > > > +@cindex @code{udiv_pow2_bitmask@var{m2}} instruction pattern
> > > > > > +@itemx @samp{udiv_pow2_bitmask@var{m2}}
> > > > > > +Unsigned vector division by an immediate that is equivalent to
> > > > > > +@samp{2^(bitsize(m) / 2) - 1}.
> > > > > > +@smallexample
> > > > > > +unsigned short op0, op1;
> > > > > > +@dots{}
> > > > > > +op0 = op1 / 0xffU;
> > > > > > +@end smallexample
> > > > > > +
> > > > > >  @cindex @code{vec_shl_insert_@var{m}} instruction pattern
> > > > > >  @item @samp{vec_shl_insert_@var{m}}
> > > > > >  Shift the elements in vector input operand 1 left one element (i.e.@:
> > > > > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > > > > index d2d550d358606022b1cb44fa842f06e0be507bc3..a3e3cc1520f77683ebf6256898f916ed45de475f 100644
> > > > > > --- a/gcc/internal-fn.def
> > > > > > +++ b/gcc/internal-fn.def
> > > > > > @@ -159,6 +159,8 @@ DEF_INTERNAL_OPTAB_FN (VEC_SHL_INSERT, ECF_CONST | ECF_NOTHROW,
> > > > > >                    vec_shl_insert, binary)
> > > > > >
> > > > > >  DEF_INTERNAL_OPTAB_FN (DIV_POW2, ECF_CONST | ECF_NOTHROW,
> > > > > >                    sdiv_pow2, binary)
> > > > > > +DEF_INTERNAL_OPTAB_FN (DIV_POW2_BITMASK, ECF_CONST | ECF_NOTHROW,
> > > > > > +                  udiv_pow2_bitmask, unary)
> > > > > >
> > > > > >  DEF_INTERNAL_OPTAB_FN (FMS, ECF_CONST, fms, ternary)
> > > > > >  DEF_INTERNAL_OPTAB_FN (FNMA, ECF_CONST, fnma, ternary)
> > > > > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > > > > index 801310ebaa7d469520809bb7efed6820f8eb866b..3f0ac05ef5ad5aed8d6ca391f4eed71b0494e17f 100644
> > > > > > --- a/gcc/optabs.def
> > > > > > +++ b/gcc/optabs.def
> > > > > > @@ -372,6 +372,7 @@ OPTAB_D (smulhrs_optab, "smulhrs$a3")
> > > > > >  OPTAB_D (umulhs_optab, "umulhs$a3")
> > > > > >  OPTAB_D (umulhrs_optab, "umulhrs$a3")
> > > > > >  OPTAB_D (sdiv_pow2_optab, "sdiv_pow2$a3")
> > > > > > +OPTAB_D (udiv_pow2_bitmask_optab, "udiv_pow2_bitmask$a2")
> > > > > >  OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
> > > > > >  OPTAB_D (vec_pack_ssat_optab, "vec_pack_ssat_$a")
> > > > > >  OPTAB_D (vec_pack_trunc_optab, "vec_pack_trunc_$a")
> > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c
> > > > > > new file mode 100644
> > > > > > index 0000000000000000000000000000000000000000..a7ea3cce4764239c5d281a8f0bead1f6a452de3f
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c
> > > > > > @@ -0,0 +1,25 @@
> > > > > > +/* { dg-require-effective-target vect_int } */
> > > > > > +
> > > > > > +#include <stdint.h>
> > > > > > +#include "tree-vect.h"
> > > > > > +
> > > > > > +#define N 50
> > > > > > +#define TYPE uint8_t
> > > > > > +
> > > > > > +__attribute__((noipa, noinline, optimize("O1"))) void
> > > > > > +fun1(TYPE* restrict pixel, TYPE level, int n)
> > > > > > +{
> > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > +    pixel[i] = (pixel[i] * level) / 0xff;
> > > > > > +}
> > > > > > +
> > > > > > +__attribute__((noipa, noinline, optimize("O3"))) void
> > > > > > +fun2(TYPE* restrict pixel, TYPE level, int n)
> > > > > > +{
> > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > +    pixel[i] = (pixel[i] * level) / 0xff;
> > > > > > +}
> > > > > > +
> > > > > > +#include "vect-div-bitmask.h"
> > > > > > +
> > > > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern: detected" "vect" } } */
> > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c
> > > > > > new file mode 100644
> > > > > > index 0000000000000000000000000000000000000000..009e16e1b36497e5724410d9843f1ce122b26dda
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c
> > > > > > @@ -0,0 +1,25 @@
> > > > > > +/* { dg-require-effective-target vect_int } */
> > > > > > +
> > > > > > +#include <stdint.h>
> > > > > > +#include "tree-vect.h"
> > > > > > +
> > > > > > +#define N 50
> > > > > > +#define TYPE uint16_t
> > > > > > +
> > > > > > +__attribute__((noipa, noinline, optimize("O1"))) void
> > > > > > +fun1(TYPE* restrict pixel, TYPE level, int n)
> > > > > > +{
> > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > +    pixel[i] = (pixel[i] * level) / 0xffffU;
> > > > > > +}
> > > > > > +
> > > > > > +__attribute__((noipa, noinline, optimize("O3"))) void
> > > > > > +fun2(TYPE* restrict pixel, TYPE level, int n)
> > > > > > +{
> > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > +    pixel[i] = (pixel[i] * level) / 0xffffU;
> > > > > > +}
> > > > > > +
> > > > > > +#include "vect-div-bitmask.h"
> > > > > > +
> > > > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern: detected" "vect" } } */
> > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c
> > > > > > new file mode 100644
> > > > > > index 0000000000000000000000000000000000000000..bf35a0bda8333c418e692d94220df849cc47930b
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c
> > > > > > @@ -0,0 +1,26 @@
> > > > > > +/* { dg-require-effective-target vect_int } */
> > > > > > +/* { dg-additional-options "-fno-vect-cost-model" { target aarch64*-*-* } } */
> > > > > > +
> > > > > > +#include <stdint.h>
> > > > > > +#include "tree-vect.h"
> > > > > > +
> > > > > > +#define N 50
> > > > > > +#define TYPE uint32_t
> > > > > > +
> > > > > > +__attribute__((noipa, noinline, optimize("O1"))) void
> > > > > > +fun1(TYPE* restrict pixel, TYPE level, int n)
> > > > > > +{
> > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > +    pixel[i] = (pixel[i] * (uint64_t)level) / 0xffffffffUL;
> > > > > > +}
> > > > > > +
> > > > > > +__attribute__((noipa, noinline, optimize("O3"))) void
> > > > > > +fun2(TYPE* restrict pixel, TYPE level, int n)
> > > > > > +{
> > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > +    pixel[i] = (pixel[i] * (uint64_t)level) / 0xffffffffUL;
> > > > > > +}
> > > > > > +
> > > > > > +#include "vect-div-bitmask.h"
> > > > > > +
> > > > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern: detected" "vect" } } */
> > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h
> > > > > > new file mode 100644
> > > > > > index 0000000000000000000000000000000000000000..29a16739aa4b706616367bfd1832f28ebd07993e
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h
> > > > > > @@ -0,0 +1,43 @@
> > > > > > +#include <stdio.h>
> > > > > > +
> > > > > > +#ifndef N
> > > > > > +#define N 65
> > > > > > +#endif
> > > > > > +
> > > > > > +#ifndef TYPE
> > > > > > +#define TYPE uint32_t
> > > > > > +#endif
> > > > > > +
> > > > > > +#ifndef DEBUG
> > > > > > +#define DEBUG 0
> > > > > > +#endif
> > > > > > +
> > > > > > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> > > > > > +
> > > > > > +int main ()
> > > > > > +{
> > > > > > +  TYPE a[N];
> > > > > > +  TYPE b[N];
> > > > > > +
> > > > > > +  for (int i = 0; i < N; ++i)
> > > > > > +    {
> > > > > > +      a[i] = BASE + i * 13;
> > > > > > +      b[i] = BASE + i * 13;
> > > > > > +      if (DEBUG)
> > > > > > +        printf ("%d: 0x%x\n", i, a[i]);
> > > > > > +    }
> > > > > > +
> > > > > > +  fun1 (a, N / 2, N);
> > > > > > +  fun2 (b, N / 2, N);
> > > > > > +
> > > > > > +  for (int i = 0; i < N; ++i)
> > > > > > +    {
> > > > > > +      if (DEBUG)
> > > > > > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> > > > > > +
> > > > > > +      if (a[i] != b[i])
> > > > > > +        __builtin_abort ();
> > > > > > +    }
> > > > > > +  return 0;
> > > > > > +}
> > > > > > +
> > > > > > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > > > > > index 217bdfd7045a22578a35bb891a4318d741071872..a738558cb8d12296bff462d716310ca8d82957b5 100644
> > > > > > --- a/gcc/tree-vect-patterns.cc
> > > > > > +++ b/gcc/tree-vect-patterns.cc
> > > > > > @@ -3558,6 +3558,33 @@ vect_recog_divmod_pattern (vec_info *vinfo,
> > > > > >
> > > > > >        return pattern_stmt;
> > > > > >      }
> > > > > > +  else if ((TYPE_UNSIGNED (itype) || tree_int_cst_sgn (oprnd1) != 1)
> > > > > > +      && rhs_code != TRUNC_MOD_EXPR)
> > > > > > +    {
> > > > > > +      wide_int icst = wi::to_wide (oprnd1);
> > > > > > +      wide_int val = wi::add (icst, 1);
> > > > > > +      int pow = wi::exact_log2 (val);
> > > > > > +      if (pow == (prec / 2))
> > > > > > +   {
> > > > > > +     /* Pattern detected.  */
> > > > > > +     vect_pattern_detected ("vect_recog_divmod_pattern", last_stmt);
> > > > > > +
> > > > > > +     *type_out = vectype;
> > > > > > +
> > > > > > +     /* Check if the target supports this internal function.  */
> > > > > > +     internal_fn ifn = IFN_DIV_POW2_BITMASK;
> > > > > > +     if (direct_internal_fn_supported_p (ifn, vectype, OPTIMIZE_FOR_SPEED))
> > > > > > +       {
> > > > > > +         tree var_div = vect_recog_temp_ssa_var (itype, NULL);
> > > > > > +         gimple *div_stmt = gimple_build_call_internal (ifn, 1, oprnd0);
> > > > > > +         gimple_call_set_lhs (div_stmt, var_div);
> > > > > > +
> > > > > > +         gimple_set_location (div_stmt, gimple_location (last_stmt));
> > > > > > +
> > > > > > +         return div_stmt;
> > > > > > +       }
> > > > > > +   }
> > > > > > +    }
> > > > > >
> > > > > >    if (prec > HOST_BITS_PER_WIDE_INT
> > > > > >        || integer_zerop (oprnd1))
> > > > > >
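A scalar model of the new condition in the hunk above may make the constraint clearer (the function name is mine; the patch itself uses wide_int and wi::exact_log2): the divisor qualifies exactly when divisor + 1 is a power of two whose log2 equals half the type precision, i.e. divisor == 2^(prec/2) - 1.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirror of the check in vect_recog_divmod_pattern: d + 1 must be a
   power of two and its log2 must be exactly half the precision.  */
static bool divisor_is_pow2m1_halfprec (uint64_t d, int prec)
{
  uint64_t val = d + 1;
  if (val == 0 || (val & (val - 1)) != 0)   /* not a power of two */
    return false;
  int pow = 0;
  while ((val >>= 1) != 0)                  /* exact_log2 */
    pow++;
  return pow == prec / 2;
}
```

So 0xff qualifies for 16-bit types, 0xffff for 32-bit, and 0xffffffff for 64-bit, but nothing else does.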
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > > > --
> > > > Richard Biener <rguent...@suse.de>
> > > > SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461
> > > > Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald,
> > > > Boudien Moerman; HRB 36809 (AG Nuernberg)
> > >
> > 
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461
> > Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald,
> > Boudien Moerman; HRB 36809 (AG Nuernberg)
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
