Re: [PATCH] TEST: Fix dump FAIL of vect-multitypes-16.c for RVV

2023-10-09 Thread Robin Dapp
> Maybe I should pretend RVV supports vect_pack/vect_unpack and enable
> all the tests in target-supports.exp?

The problem is that vect_pack/unpack is an overloaded term at the
moment, meaning "vector conversion" (promotion/demotion) or so.  This
test does not require pack/unpack for successful vectorization; our
method of keeping the number of elements the same works as well.  The
naming probably predates vectorizer support for that.
I can't imagine cases where vectorization would fail because of this
as we can always work around it some way.  So from that point of view
"pretending" to support it would work.  However, in case somebody wants
to write a specific test case that really relies on pack/unpack
(maybe there are already some?), "pretending" would fail.

I lean towards "pretending" at the moment ;)  The other option would be
to rename that and audit all test cases.

Note there are also vect_intfloat_cvt as well as others that don't have
pack/unpack in the name (that we also probably still need to enable).

Regards
 Robin


Re: [PATCH V2] TEST: Fix vect_cond_arith_* dump checks for RVV

2023-10-09 Thread juzhe.zh...@rivai.ai
Hi, Richi and Robin.

Turns out COND(_LEN)?_ADD can't work.

Is this patch Ok ? Or do you have another solution to change the dump check for 
RVV?

Thanks.



juzhe.zh...@rivai.ai
 
From: Juzhe-Zhong
Date: 2023-10-08 09:33
To: gcc-patches
CC: rguenther; jeffreyalaw; rdapp.gcc; Juzhe-Zhong
Subject: [PATCH V2] TEST: Fix vect_cond_arith_* dump checks for RVV
This patch fixes the following dump FAILs:
FAIL: gcc.dg/vect/vect-cond-arith-2.c -flto -ffat-lto-objects  scan-tree-dump 
optimized " = \\.COND_SUB"
FAIL: gcc.dg/vect/vect-cond-arith-2.c -flto -ffat-lto-objects  scan-tree-dump 
vect " = \\.COND_ADD"
FAIL: gcc.dg/vect/vect-cond-arith-2.c scan-tree-dump optimized " = \\.COND_SUB"
FAIL: gcc.dg/vect/vect-cond-arith-2.c scan-tree-dump vect " = \\.COND_ADD"
FAIL: gcc.dg/vect/vect-cond-arith-4.c -flto -ffat-lto-objects  scan-tree-dump 
optimized " = \\.COND_ADD"
FAIL: gcc.dg/vect/vect-cond-arith-4.c -flto -ffat-lto-objects  scan-tree-dump 
optimized " = \\.COND_MUL"
FAIL: gcc.dg/vect/vect-cond-arith-4.c -flto -ffat-lto-objects  scan-tree-dump 
optimized " = \\.COND_RDIV"
FAIL: gcc.dg/vect/vect-cond-arith-4.c -flto -ffat-lto-objects  scan-tree-dump 
optimized " = \\.COND_SUB"
FAIL: gcc.dg/vect/vect-cond-arith-4.c scan-tree-dump optimized " = \\.COND_ADD"
FAIL: gcc.dg/vect/vect-cond-arith-4.c scan-tree-dump optimized " = \\.COND_MUL"
FAIL: gcc.dg/vect/vect-cond-arith-4.c scan-tree-dump optimized " = \\.COND_RDIV"
FAIL: gcc.dg/vect/vect-cond-arith-4.c scan-tree-dump optimized " = \\.COND_SUB"
FAIL: gcc.dg/vect/vect-cond-arith-5.c -flto -ffat-lto-objects  scan-tree-dump 
optimized " = \\.COND_ADD"
FAIL: gcc.dg/vect/vect-cond-arith-5.c -flto -ffat-lto-objects  scan-tree-dump 
optimized " = \\.COND_MUL"
FAIL: gcc.dg/vect/vect-cond-arith-5.c -flto -ffat-lto-objects  scan-tree-dump 
optimized " = \\.COND_RDIV"
FAIL: gcc.dg/vect/vect-cond-arith-5.c -flto -ffat-lto-objects  scan-tree-dump 
optimized " = \\.COND_SUB"
FAIL: gcc.dg/vect/vect-cond-arith-5.c scan-tree-dump optimized " = \\.COND_ADD"
FAIL: gcc.dg/vect/vect-cond-arith-5.c scan-tree-dump optimized " = \\.COND_MUL"
FAIL: gcc.dg/vect/vect-cond-arith-5.c scan-tree-dump optimized " = \\.COND_RDIV"
FAIL: gcc.dg/vect/vect-cond-arith-5.c scan-tree-dump optimized " = \\.COND_SUB"
FAIL: gcc.dg/vect/vect-cond-arith-6.c -flto -ffat-lto-objects  
scan-tree-dump-times optimized " = \\.COND_ADD" 1
FAIL: gcc.dg/vect/vect-cond-arith-6.c -flto -ffat-lto-objects  
scan-tree-dump-times optimized " = \\.COND_MUL" 1
FAIL: gcc.dg/vect/vect-cond-arith-6.c -flto -ffat-lto-objects  
scan-tree-dump-times optimized " = \\.COND_RDIV" 1
FAIL: gcc.dg/vect/vect-cond-arith-6.c -flto -ffat-lto-objects  
scan-tree-dump-times optimized " = \\.COND_SUB" 1
FAIL: gcc.dg/vect/vect-cond-arith-6.c scan-tree-dump-times optimized " = 
\\.COND_ADD" 1
FAIL: gcc.dg/vect/vect-cond-arith-6.c scan-tree-dump-times optimized " = 
\\.COND_MUL" 1
FAIL: gcc.dg/vect/vect-cond-arith-6.c scan-tree-dump-times optimized " = 
\\.COND_RDIV" 1
FAIL: gcc.dg/vect/vect-cond-arith-6.c scan-tree-dump-times optimized " = 
\\.COND_SUB" 1
 
For RVV, the expected dump IR is the COND_LEN_* pattern.
 
Also, we are still failing at this check:
 
FAIL: gcc.dg/vect/vect-cond-arith-2.c scan-tree-dump optimized " = 
\\.COND_LEN_SUB"
FAIL: gcc.dg/vect/vect-cond-arith-2.c -flto -ffat-lto-objects  scan-tree-dump 
optimized " = \\.COND_LEN_SUB"
 
This is because of a known bug in GIMPLE_FOLD that Robin is working on.
 
@Robin: Please make sure vect-cond-arith-2.c passes with this patch and your bug
fix patch.
 
Ok for trunk ?
 
gcc/testsuite/ChangeLog:
 
* gcc.dg/vect/vect-cond-arith-2.c: Fix dump check for RVV.
* gcc.dg/vect/vect-cond-arith-4.c: Ditto.
* gcc.dg/vect/vect-cond-arith-5.c: Ditto.
* gcc.dg/vect/vect-cond-arith-6.c: Ditto.
 
---
gcc/testsuite/gcc.dg/vect/vect-cond-arith-2.c | 4 ++--
gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c | 8 
gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c | 8 
gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c | 8 
4 files changed, 14 insertions(+), 14 deletions(-)
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-2.c 
b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-2.c
index 38994ea82a5..3832a660023 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-2.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-2.c
@@ -41,5 +41,5 @@ neg_xi (double *x)
   return res_3;
}
-/* { dg-final { scan-tree-dump { = \.COND_ADD} "vect" { target { 
vect_double_cond_arith && vect_fully_masked } } } } */
-/* { dg-final { scan-tree-dump { = \.COND_SUB} "optimized" { target { 
vect_double_cond_arith && vect_fully_masked } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_L?E?N?_?ADD} "vect" { target { 
vect_double_cond_arith && vect_fully_masked } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_L?E?N?_?SUB} "optimized" { target { 
vect_double_cond_arith && vect_fully_masked } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c 
b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
index 1af0fe642a

Re: [PATCH]middle-end match.pd: optimize fneg (fabs (x)) to x | (1 << signbit(x)) [PR109154]

2023-10-09 Thread Richard Biener
On Sat, 7 Oct 2023, Richard Sandiford wrote:

> Richard Biener  writes:
> >> On 07.10.2023 at 11:23, Richard Sandiford wrote:
> >> >> Richard Biener  writes:
> >>> On Thu, 5 Oct 2023, Tamar Christina wrote:
> >>> 
> > I suppose the idea is that -abs(x) might be easier to optimize with 
> > other
> > patterns (consider a - copysign(x,...), optimizing to a + abs(x)).
> > 
> > For abs vs copysign it's a canonicalization, but (negate (abs @0)) is 
> > less
> > canonical than copysign.
> > 
> >> Should I try removing this?
> > 
> > I'd say yes (and put the reverse canonicalization next to this pattern).
> > 
>  
>  This patch transforms fneg (fabs (x)) into copysign (x, -1) which is more
>  canonical and allows a target to expand this sequence efficiently.  Such
>  sequences are common in scientific code working with gradients.
>  
>  various optimizations in match.pd only happened on COPYSIGN but not 
>  COPYSIGN_ALL
>  which means they exclude IFN_COPYSIGN.  COPYSIGN however is restricted 
>  to only
> >>> 
> >>> That's not true:
> >>> 
> >>> (define_operator_list COPYSIGN
> >>>BUILT_IN_COPYSIGNF
> >>>BUILT_IN_COPYSIGN
> >>>BUILT_IN_COPYSIGNL
> >>>IFN_COPYSIGN)
> >>> 
> >>> but they miss the extended float builtin variants like
> >>> __builtin_copysignf16.  Also see below
> >>> 
>  the C99 builtins and so doesn't work for vectors.
>  
>  The patch expands these optimizations to work on COPYSIGN_ALL.
>  
>  There is an existing canonicalization of copysign (x, -1) to fneg (fabs 
>  (x))
>  which I remove since this is a less efficient form.  The testsuite is 
>  also
>  updated in light of this.
>  
>  Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>  
>  Ok for master?
>  
>  Thanks,
>  Tamar
>  
>  gcc/ChangeLog:
>  
> PR tree-optimization/109154
> * match.pd: Add new neg+abs rule, remove inverse copysign rule and
> expand existing copysign optimizations.
>  
>  gcc/testsuite/ChangeLog:
>  
> PR tree-optimization/109154
> * gcc.dg/fold-copysign-1.c: Updated.
> * gcc.dg/pr55152-2.c: Updated.
> * gcc.dg/tree-ssa/abs-4.c: Updated.
> * gcc.dg/tree-ssa/backprop-6.c: Updated.
> * gcc.dg/tree-ssa/copy-sign-2.c: Updated.
> * gcc.dg/tree-ssa/mult-abs-2.c: Updated.
> * gcc.target/aarch64/fneg-abs_1.c: New test.
> * gcc.target/aarch64/fneg-abs_2.c: New test.
> * gcc.target/aarch64/fneg-abs_3.c: New test.
> * gcc.target/aarch64/fneg-abs_4.c: New test.
> * gcc.target/aarch64/sve/fneg-abs_1.c: New test.
> * gcc.target/aarch64/sve/fneg-abs_2.c: New test.
> * gcc.target/aarch64/sve/fneg-abs_3.c: New test.
> * gcc.target/aarch64/sve/fneg-abs_4.c: New test.
>  
>  --- inline copy of patch ---
>  
>  diff --git a/gcc/match.pd b/gcc/match.pd
>  index 
>  4bdd83e6e061b16dbdb2845b9398fcfb8a6c9739..bd6599d36021e119f51a4928354f580ffe82c6e2
>   100644
>  --- a/gcc/match.pd
>  +++ b/gcc/match.pd
>  @@ -1074,45 +1074,43 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  
>  /* cos(copysign(x, y)) -> cos(x).  Similarly for cosh.  */
>  (for coss (COS COSH)
>  - copysigns (COPYSIGN)
>  - (simplify
>  -  (coss (copysigns @0 @1))
>  -   (coss @0)))
>  + (for copysigns (COPYSIGN_ALL)
> >>> 
> >>> So this ends up generating for example the match
> >>> (cosf (copysignl ...)) which doesn't make much sense.
> >>> 
> >>> The lock-step iteration did
> >>> (cosf (copysignf ..)) ... (ifn_cos (ifn_copysign ...))
> >>> which is leaner but misses the case of
> >>> (cosf (ifn_copysign ..)) - that's probably what you are
> >>> after with this change.
> >>> 
> >>> That said, there isn't a nice solution (without altering the match.pd
> >>> IL).  There's the explicit solution, spelling out all combinations.
> >>> 
> >>> So if we want to go with your pragmatic solution changing this
> >>> to use COPYSIGN_ALL isn't necessary, only changing the lock-step
> >>> for iteration to a cross product for iteration is.
> >>> 
> >>> Changing just this pattern to
> >>> 
> >>> (for coss (COS COSH)
> >>> (for copysigns (COPYSIGN)
> >>>  (simplify
> >>>   (coss (copysigns @0 @1))
> >>>   (coss @0
> >>> 
> >>> increases the total number of gimple-match-x.cc lines from
> >>> 234988 to 235324.
> >> 
> >> I guess the difference between this and the later suggestions is that
> >> this one allows builtin copysign to be paired with ifn cos, which would
> >> be potentially useful in other situations.  (It isn't here because
> >> ifn_cos is rarely provided.)  How much of the growth is due to that,
> >> and much of it is from nonsensical combinations like
> >> (builtin_cosf (builtin_copysignl ...))?
> >> 
> >> If it's mostly from nonsensical combinations then would i

Re: [PATCH V2] TEST: Fix vect_cond_arith_* dump checks for RVV

2023-10-09 Thread Richard Biener
On Mon, 9 Oct 2023, juzhe.zh...@rivai.ai wrote:

> Hi, Richi and Robin.
> 
> Turns out COND(_LEN)?_ADD can't work.

Did you try quoting?  Try (_LEN|) maybe.
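
A minimal sketch of what such a brace-quoted directive could look like for
vect-cond-arith-2.c, with the alternation written as (_LEN)?; the target
selector is simply taken over from the existing test (illustration only, not
the posted patch):

/* { dg-final { scan-tree-dump { = \.COND(_LEN)?_ADD} "vect" { target { vect_double_cond_arith && vect_fully_masked } } } } */
/* { dg-final { scan-tree-dump { = \.COND(_LEN)?_SUB} "optimized" { target { vect_double_cond_arith && vect_fully_masked } } } } */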

> Is this patch Ok ? Or do you have another solution to change the dump check 
> for RVV?
> 
> Thanks.
> 
> 
> 
> juzhe.zh...@rivai.ai
>  
> From: Juzhe-Zhong
> Date: 2023-10-08 09:33
> To: gcc-patches
> CC: rguenther; jeffreyalaw; rdapp.gcc; Juzhe-Zhong
> Subject: [PATCH V2] TEST: Fix vect_cond_arith_* dump checks for RVV
> This patch fixes the following dump FAILs:
> FAIL: gcc.dg/vect/vect-cond-arith-2.c -flto -ffat-lto-objects  scan-tree-dump 
> optimized " = \\.COND_SUB"
> FAIL: gcc.dg/vect/vect-cond-arith-2.c -flto -ffat-lto-objects  scan-tree-dump 
> vect " = \\.COND_ADD"
> FAIL: gcc.dg/vect/vect-cond-arith-2.c scan-tree-dump optimized " = 
> \\.COND_SUB"
> FAIL: gcc.dg/vect/vect-cond-arith-2.c scan-tree-dump vect " = \\.COND_ADD"
> FAIL: gcc.dg/vect/vect-cond-arith-4.c -flto -ffat-lto-objects  scan-tree-dump 
> optimized " = \\.COND_ADD"
> FAIL: gcc.dg/vect/vect-cond-arith-4.c -flto -ffat-lto-objects  scan-tree-dump 
> optimized " = \\.COND_MUL"
> FAIL: gcc.dg/vect/vect-cond-arith-4.c -flto -ffat-lto-objects  scan-tree-dump 
> optimized " = \\.COND_RDIV"
> FAIL: gcc.dg/vect/vect-cond-arith-4.c -flto -ffat-lto-objects  scan-tree-dump 
> optimized " = \\.COND_SUB"
> FAIL: gcc.dg/vect/vect-cond-arith-4.c scan-tree-dump optimized " = 
> \\.COND_ADD"
> FAIL: gcc.dg/vect/vect-cond-arith-4.c scan-tree-dump optimized " = 
> \\.COND_MUL"
> FAIL: gcc.dg/vect/vect-cond-arith-4.c scan-tree-dump optimized " = 
> \\.COND_RDIV"
> FAIL: gcc.dg/vect/vect-cond-arith-4.c scan-tree-dump optimized " = 
> \\.COND_SUB"
> FAIL: gcc.dg/vect/vect-cond-arith-5.c -flto -ffat-lto-objects  scan-tree-dump 
> optimized " = \\.COND_ADD"
> FAIL: gcc.dg/vect/vect-cond-arith-5.c -flto -ffat-lto-objects  scan-tree-dump 
> optimized " = \\.COND_MUL"
> FAIL: gcc.dg/vect/vect-cond-arith-5.c -flto -ffat-lto-objects  scan-tree-dump 
> optimized " = \\.COND_RDIV"
> FAIL: gcc.dg/vect/vect-cond-arith-5.c -flto -ffat-lto-objects  scan-tree-dump 
> optimized " = \\.COND_SUB"
> FAIL: gcc.dg/vect/vect-cond-arith-5.c scan-tree-dump optimized " = 
> \\.COND_ADD"
> FAIL: gcc.dg/vect/vect-cond-arith-5.c scan-tree-dump optimized " = 
> \\.COND_MUL"
> FAIL: gcc.dg/vect/vect-cond-arith-5.c scan-tree-dump optimized " = 
> \\.COND_RDIV"
> FAIL: gcc.dg/vect/vect-cond-arith-5.c scan-tree-dump optimized " = 
> \\.COND_SUB"
> FAIL: gcc.dg/vect/vect-cond-arith-6.c -flto -ffat-lto-objects  
> scan-tree-dump-times optimized " = \\.COND_ADD" 1
> FAIL: gcc.dg/vect/vect-cond-arith-6.c -flto -ffat-lto-objects  
> scan-tree-dump-times optimized " = \\.COND_MUL" 1
> FAIL: gcc.dg/vect/vect-cond-arith-6.c -flto -ffat-lto-objects  
> scan-tree-dump-times optimized " = \\.COND_RDIV" 1
> FAIL: gcc.dg/vect/vect-cond-arith-6.c -flto -ffat-lto-objects  
> scan-tree-dump-times optimized " = \\.COND_SUB" 1
> FAIL: gcc.dg/vect/vect-cond-arith-6.c scan-tree-dump-times optimized " = 
> \\.COND_ADD" 1
> FAIL: gcc.dg/vect/vect-cond-arith-6.c scan-tree-dump-times optimized " = 
> \\.COND_MUL" 1
> FAIL: gcc.dg/vect/vect-cond-arith-6.c scan-tree-dump-times optimized " = 
> \\.COND_RDIV" 1
> FAIL: gcc.dg/vect/vect-cond-arith-6.c scan-tree-dump-times optimized " = 
> \\.COND_SUB" 1
>  
> For RVV, the expected dump IR is the COND_LEN_* pattern.
>  
> Also, we are still failing at this check:
>  
> FAIL: gcc.dg/vect/vect-cond-arith-2.c scan-tree-dump optimized " = 
> \\.COND_LEN_SUB"
> FAIL: gcc.dg/vect/vect-cond-arith-2.c -flto -ffat-lto-objects  scan-tree-dump 
> optimized " = \\.COND_LEN_SUB"
>  
> This is because of a known bug in GIMPLE_FOLD that Robin is working on.
>  
> @Robin: Please make sure vect-cond-arith-2.c passes with this patch and your bug
> fix patch.
>  
> Ok for trunk ?
>  
> gcc/testsuite/ChangeLog:
>  
> * gcc.dg/vect/vect-cond-arith-2.c: Fix dump check for RVV.
> * gcc.dg/vect/vect-cond-arith-4.c: Ditto.
> * gcc.dg/vect/vect-cond-arith-5.c: Ditto.
> * gcc.dg/vect/vect-cond-arith-6.c: Ditto.
>  
> ---
> gcc/testsuite/gcc.dg/vect/vect-cond-arith-2.c | 4 ++--
> gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c | 8 
> gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c | 8 
> gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c | 8 
> 4 files changed, 14 insertions(+), 14 deletions(-)
>  
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-2.c 
> b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-2.c
> index 38994ea82a5..3832a660023 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-2.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-2.c
> @@ -41,5 +41,5 @@ neg_xi (double *x)
>return res_3;
> }
> -/* { dg-final { scan-tree-dump { = \.COND_ADD} "vect" { target { 
> vect_double_cond_arith && vect_fully_masked } } } } */
> -/* { dg-final { scan-tree-dump { = \.COND_SUB} "optimized" { target { 
> vect_double_cond_arith && vect_fully_masked } } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_L?E?N?_?ADD}

Re: [PATCH] TEST: Fix dump FAIL of vect-multitypes-16.c for RVV

2023-10-09 Thread Richard Biener
On Sun, 8 Oct 2023, Jeff Law wrote:

> 
> 
> On 10/8/23 05:35, Juzhe-Zhong wrote:
> > RVV (RISC-V Vector) doesn't enable vect_unpack, but we still vectorize this
> > case well.
> > So, adjust dump check for RVV.
> > 
> > gcc/testsuite/ChangeLog:
> > 
> >  * gcc.dg/vect/vect-multitypes-16.c: Fix dump FAIL of RVV.
> I'd hoped to avoid a bunch of risc-v special casing in the generic part of the
> testsuite.  Basically the more we have target specific conditionals rather
> than conditionals using properties, the more likely we are to keep revisiting
> this stuff over time and possibly for other architectures as well.
> 
> What is it about risc-v's vector support that allows it to optimize this case?
> Is it the same property that allows us to handle the outer loop vectorization
> tests that you changed in another patch?

I suspect for VLA vectorization we can use direct conversion from
char to long long here?  I also notice the testcase uses 'char',
not specifying its sign.  So either of [sz]extVxyzDIVxyzQI is possibly
provided by RISCV?  (or possibly via some intermediate types in a
multi-step conversion)
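
As a minimal illustration of the conversion in question (a sketch of the
shape of the loop only, not the actual vect-multitypes-16.c source; signed
char is picked here for definiteness), a VLA target with a direct QI -> DI
extension can vectorize this without any unpack steps:

void
widen (signed char *restrict x, long long *restrict y, int n)
{
  for (int i = 0; i < n; ++i)
    y[i] = x[i];  /* char -> long long widening; maps to vsext.vf8 on RVV.  */
}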

For non-VLA and with the single vector size restriction we'd need
unpacking.

So it might be better

 { target { vect_unpack || { vect_vla && vect_sext_char_longlong } } }

where I think neither vect_vla nor vect_sext_char_longlong exists.

Richard - didn't you run into similar things with SVE?

Richard.


> Neither an ACK nor NAK right now.
> 
> Jeff
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH V2] TEST: Fix vect_cond_arith_* dump checks for RVV

2023-10-09 Thread Andreas Schwab
On Oct 09 2023, juzhe.zh...@rivai.ai wrote:

> Turns out COND(_LEN)?_ADD can't work.

It should work though.  Tcl regexps are a superset of POSIX EREs.

-- 
Andreas Schwab, SUSE Labs, sch...@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."


Re: Re: [PATCH] TEST: Fix dump FAIL of vect-multitypes-16.c for RVV

2023-10-09 Thread juzhe.zh...@rivai.ai
Yes. We do have, and enable, the char -> long long conversion (vsext.vf8/vzext.vf8).

Thanks for the comment, I will adapt the test as you suggested.



juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-10-09 15:31
To: Jeff Law
CC: Juzhe-Zhong; gcc-patches; richard.sandiford
Subject: Re: [PATCH] TEST: Fix dump FAIL of vect-multitypes-16.c for RVV
On Sun, 8 Oct 2023, Jeff Law wrote:
 
> 
> 
> On 10/8/23 05:35, Juzhe-Zhong wrote:
> > RVV (RISC-V Vector) doesn't enable vect_unpack, but we still vectorize this
> > case well.
> > So, adjust dump check for RVV.
> > 
> > gcc/testsuite/ChangeLog:
> > 
> >  * gcc.dg/vect/vect-multitypes-16.c: Fix dump FAIL of RVV.
> I'd hoped to avoid a bunch of risc-v special casing in the generic part of the
> testsuite.  Basically the more we have target specific conditionals rather
> than conditionals using properties, the more likely we are to keep revisiting
> this stuff over time and possibly for other architectures as well.
> 
> What is it about risc-v's vector support that allows it to optimize this case?
> Is it the same property that allows us to handle the outer loop vectorization
> tests that you changed in another patch?
 
I suspect for VLA vectorization we can use direct conversion from
char to long long here?  I also notice the testcase uses 'char',
not specifying its sign.  So either of [sz]extVxyzDIVxyzQI is possibly
provided by RISCV?  (or possibly via some intermediate types in a
multi-step conversion)
 
For non-VLA and with the single vector size restriction we'd need
unpacking.
 
So it might be better
 
{ target { vect_unpack || { vect_vla && vect_sext_char_longlong } } }
 
where I think neither vect_vla nor vect_sext_char_longlong exists.
 
Richard - didn't you run into similar things with SVE?
 
Richard.
 
 
> Neither an ACK nor NAK right now.
> 
> Jeff
> 
> 
 
-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
 


Re: [PATCH]middle-end match.pd: optimize fneg (fabs (x)) to x | (1 << signbit(x)) [PR109154]

2023-10-09 Thread Andrew Pinski
On Mon, Oct 9, 2023 at 12:20 AM Richard Biener  wrote:
>
> On Sat, 7 Oct 2023, Richard Sandiford wrote:
>
> > Richard Biener  writes:
> > >> On 07.10.2023 at 11:23, Richard Sandiford wrote:
> > >> >> Richard Biener  writes:
> > >>> On Thu, 5 Oct 2023, Tamar Christina wrote:
> > >>>
> > > I suppose the idea is that -abs(x) might be easier to optimize with 
> > > other
> > > patterns (consider a - copysign(x,...), optimizing to a + abs(x)).
> > >
> > > For abs vs copysign it's a canonicalization, but (negate (abs @0)) is 
> > > less
> > > canonical than copysign.
> > >
> > >> Should I try removing this?
> > >
> > > I'd say yes (and put the reverse canonicalization next to this 
> > > pattern).
> > >
> > 
> >  This patch transforms fneg (fabs (x)) into copysign (x, -1) which is 
> >  more
> >  canonical and allows a target to expand this sequence efficiently.  
> >  Such
> >  sequences are common in scientific code working with gradients.
> > 
> >  various optimizations in match.pd only happened on COPYSIGN but not 
> >  COPYSIGN_ALL
> >  which means they exclude IFN_COPYSIGN.  COPYSIGN however is restricted 
> >  to only
> > >>>
> > >>> That's not true:
> > >>>
> > >>> (define_operator_list COPYSIGN
> > >>>BUILT_IN_COPYSIGNF
> > >>>BUILT_IN_COPYSIGN
> > >>>BUILT_IN_COPYSIGNL
> > >>>IFN_COPYSIGN)
> > >>>
> > >>> but they miss the extended float builtin variants like
> > >>> __builtin_copysignf16.  Also see below
> > >>>
> >  the C99 builtins and so doesn't work for vectors.
> > 
> >  The patch expands these optimizations to work on COPYSIGN_ALL.
> > 
> >  There is an existing canonicalization of copysign (x, -1) to fneg 
> >  (fabs (x))
> >  which I remove since this is a less efficient form.  The testsuite is 
> >  also
> >  updated in light of this.
> > 
> >  Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > 
> >  Ok for master?
> > 
> >  Thanks,
> >  Tamar
> > 
> >  gcc/ChangeLog:
> > 
> > PR tree-optimization/109154
> > * match.pd: Add new neg+abs rule, remove inverse copysign rule and
> > expand existing copysign optimizations.
> > 
> >  gcc/testsuite/ChangeLog:
> > 
> > PR tree-optimization/109154
> > * gcc.dg/fold-copysign-1.c: Updated.
> > * gcc.dg/pr55152-2.c: Updated.
> > * gcc.dg/tree-ssa/abs-4.c: Updated.
> > * gcc.dg/tree-ssa/backprop-6.c: Updated.
> > * gcc.dg/tree-ssa/copy-sign-2.c: Updated.
> > * gcc.dg/tree-ssa/mult-abs-2.c: Updated.
> > * gcc.target/aarch64/fneg-abs_1.c: New test.
> > * gcc.target/aarch64/fneg-abs_2.c: New test.
> > * gcc.target/aarch64/fneg-abs_3.c: New test.
> > * gcc.target/aarch64/fneg-abs_4.c: New test.
> > * gcc.target/aarch64/sve/fneg-abs_1.c: New test.
> > * gcc.target/aarch64/sve/fneg-abs_2.c: New test.
> > * gcc.target/aarch64/sve/fneg-abs_3.c: New test.
> > * gcc.target/aarch64/sve/fneg-abs_4.c: New test.
> > 
> >  --- inline copy of patch ---
> > 
> >  diff --git a/gcc/match.pd b/gcc/match.pd
> >  index 
> >  4bdd83e6e061b16dbdb2845b9398fcfb8a6c9739..bd6599d36021e119f51a4928354f580ffe82c6e2
> >   100644
> >  --- a/gcc/match.pd
> >  +++ b/gcc/match.pd
> >  @@ -1074,45 +1074,43 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > 
> >  /* cos(copysign(x, y)) -> cos(x).  Similarly for cosh.  */
> >  (for coss (COS COSH)
> >  - copysigns (COPYSIGN)
> >  - (simplify
> >  -  (coss (copysigns @0 @1))
> >  -   (coss @0)))
> >  + (for copysigns (COPYSIGN_ALL)
> > >>>
> > >>> So this ends up generating for example the match
> > >>> (cosf (copysignl ...)) which doesn't make much sense.
> > >>>
> > >>> The lock-step iteration did
> > >>> (cosf (copysignf ..)) ... (ifn_cos (ifn_copysign ...))
> > >>> which is leaner but misses the case of
> > >>> (cosf (ifn_copysign ..)) - that's probably what you are
> > >>> after with this change.
> > >>>
> > >>> That said, there isn't a nice solution (without altering the match.pd
> > >>> IL).  There's the explicit solution, spelling out all combinations.
> > >>>
> > >>> So if we want to go with your pragmatic solution changing this
> > >>> to use COPYSIGN_ALL isn't necessary, only changing the lock-step
> > >>> for iteration to a cross product for iteration is.
> > >>>
> > >>> Changing just this pattern to
> > >>>
> > >>> (for coss (COS COSH)
> > >>> (for copysigns (COPYSIGN)
> > >>>  (simplify
> > >>>   (coss (copysigns @0 @1))
> > >>>   (coss @0
> > >>>
> > >>> increases the total number of gimple-match-x.cc lines from
> > >>> 234988 to 235324.
> > >>
> > >> I guess the difference between this and the later suggestions is that
> > >> this one allows builtin copysign to be paired with ifn cos, which would
> > >

Re: [PATCH] TEST: Fix dump FAIL of vect-multitypes-16.c for RVV

2023-10-09 Thread Richard Biener
On Mon, 9 Oct 2023, Robin Dapp wrote:

> > Maybe I should pretend RVV supports vect_pack/vect_unpack and enable
> > all the tests in target-supports.exp?
> 
> The problem is that vect_pack/unpack is an overloaded term at the
> moment, meaning "vector conversion" (promotion/demotion) or so.  This
> test does not require pack/unpack for successful vectorization; our
> method of keeping the number of elements the same works as well.  The
> naming probably predates vectorizer support for that.
> I can't imagine cases where vectorization would fail because of this
> as we can always work around it some way.  So from that point of view
> "pretending" to support it would work.  However, in case somebody wants
> to write a specific test case that really relies on pack/unpack
> (maybe there are already some?), "pretending" would fail.

I suspect that for VLS you need to provide the respective patterns
because of the single vector-size restriction.

> I lean towards "pretending" at the moment ;)  The other option would be
> to rename that and audit all test cases.
> 
> Note there are also vect_intfloat_cvt as well as others that don't have
> pack/unpack in the name (that we also probably still need to enable).

Yeah well - the dejagnu "targets" are mostly too broad, but as usual
time is spent elsewhere instead of on cleaning up the mess ;)

It might be more useful to provide vect_.. dg targets
because then it's at least obvious what is meant.  Or group
things as vect_float vect_int.

Richard.

> Regards
>  Robin
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH] RISC-V: Support movmisalign of RVV VLA modes

2023-10-09 Thread Richard Biener
On Sun, Oct 8, 2023 at 9:22 AM Juzhe-Zhong  wrote:
>
> Previously, I removed the movmisalign pattern to fix the execution FAILs in 
> this commit:
> https://github.com/gcc-mirror/gcc/commit/f7bff24905a6959f85f866390db2fff1d6f95520
>
> I was thinking at the beginning that RVV doesn't allow misaligned accesses, so I
> removed that pattern.
> However, after deeper investigation, re-reading the RVV ISA and experimenting on
> SPIKE,
> I realized I was wrong.
>
> RVV ISA reference: 
> https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-memory-alignment-constraints
>
> "If an element accessed by a vector memory instruction is not naturally 
> aligned to the size of the element,
>  either the element is transferred successfully or an address misaligned 
> exception is raised on that element."

But you gobble the "or .." into an existing -mstrict-align flag - are
you sure all implementations are
self-consistent with handling non-vector memory instructions and
vector memory instructions here?
At least the above wording doesn't seem to impose such a requirement.

> It's obvious that RVV ISA does allow misaligned vector load/store.
>
> And experiment and confirm on SPIKE:
>
> [jzzhong@rios-cad122:/work/home/jzzhong/work/toolchain/riscv/gcc/gcc/testsuite/gcc.dg/vect]$~/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/install/bin/spike
>  --isa=rv64gcv --varch=vlen:128,elen:64 
> ~/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/install/riscv64-unknown-elf/bin/pk64
>   a.out
> bbl loader
> z   ra 00010158 sp 003ffb40 gp 
> 00012c48
> tp  t0 000110da t1 000f t2 
> 
> s0 00013460 s1  a0 00012ef5 a1 
> 00012018
> a2 00012a71 a3 000d a4 0004 a5 
> 00012a71
> a6 00012a71 a7 00012018 s2  s3 
> 
> s4  s5  s6  s7 
> 
> s8  s9  sA  sB 
> 
> t3  t4  t5  t6 
> 
> pc 00010258 va/inst 020660a7 sr 80026620
> Store/AMO access fault!
>
> [jzzhong@rios-cad122:/work/home/jzzhong/work/toolchain/riscv/gcc/gcc/testsuite/gcc.dg/vect]$~/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/install/bin/spike
>  --misaligned --isa=rv64gcv --varch=vlen:128,elen:64 
> ~/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/install/riscv64-unknown-elf/bin/pk64
>   a.out
> bbl loader
>
> We can see SPIKE can pass previous *FAILED* execution tests with specifying 
> --misaligned to SPIKE.
>
> So, to honor the RVV ISA spec, we should add the movmisalign pattern back based on the
> investigations I have done, since
> it can improve multiple vectorization tests and fix dump FAILs.
>
> This patch fixes these following dump FAILs:
>
> FAIL: gcc.dg/vect/vect-bitfield-read-2.c -flto -ffat-lto-objects  
> scan-tree-dump-not optimized "Invalid sum"
> FAIL: gcc.dg/vect/vect-bitfield-read-2.c scan-tree-dump-not optimized 
> "Invalid sum"
> FAIL: gcc.dg/vect/vect-bitfield-read-4.c -flto -ffat-lto-objects  
> scan-tree-dump-not optimized "Invalid sum"
> FAIL: gcc.dg/vect/vect-bitfield-read-4.c scan-tree-dump-not optimized 
> "Invalid sum"
> FAIL: gcc.dg/vect/vect-bitfield-write-2.c -flto -ffat-lto-objects  
> scan-tree-dump-not optimized "Invalid sum"
> FAIL: gcc.dg/vect/vect-bitfield-write-2.c scan-tree-dump-not optimized 
> "Invalid sum"
> FAIL: gcc.dg/vect/vect-bitfield-write-3.c -flto -ffat-lto-objects  
> scan-tree-dump-not optimized "Invalid sum"
> FAIL: gcc.dg/vect/vect-bitfield-write-3.c scan-tree-dump-not optimized 
> "Invalid sum"
>
> Consider this following case:
>
> struct s {
> unsigned i : 31;
> char a : 4;
> };
>
> #define N 32
> #define ELT0 {0x7FFFUL, 0}
> #define ELT1 {0x7FFFUL, 1}
> #define ELT2 {0x7FFFUL, 2}
> #define ELT3 {0x7FFFUL, 3}
> #define RES 48
> struct s A[N]
>   = { ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
>   ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
>   ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
>   ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3};
>
> int __attribute__ ((noipa))
> f(struct s *ptr, unsigned n) {
> int res = 0;
> for (int i = 0; i < n; ++i)
>   res += ptr[i].a;
> return res;
> }
>
> -O3 -S -fno-vect-cost-model (default strict-align):
>
> f:
> mv  a4,a0
> beq a1,zero,.L9
> addiw   a5,a1,-1
> li  a3,14
> vsetivlizero,16,e64,m8,ta,ma
> bleua5,a3,.L3
> andia5,a0,127
> bne a5,zero,.L3
> srliw   a3,a1,4
> sllia3,a3,7
> li  a0,15
> sllia0,a0,32
> add a3,a3,a4
> mv  a5,a4
>   

Re: Re: [PATCH] RISC-V: Support movmisalign of RVV VLA modes

2023-10-09 Thread juzhe.zh...@rivai.ai
>> But you gobble the "or .." into an existing -mstrict-align flag - are
>> you sure all implementations are
>> self-consistent with handling non-vector memory instructions and
>> vector memory instructions here?
>> At least the above wording doesn't seem to impose such requirement.

RVV ISA: 
"Support for misaligned vector memory accesses is independent of an 
implementation’s support for misaligned scalar memory accesses."
That is, support for misaligned vector memory access is independent of scalar memory access.
I think this patch (relying on -mno-strict-align) is not appropriate, since it means I
need an additional compile option.




juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-10-09 16:01
To: Juzhe-Zhong
CC: gcc-patches; kito.cheng; kito.cheng; jeffreyalaw; rdapp.gcc
Subject: Re: [PATCH] RISC-V: Support movmisalign of RVV VLA modes
On Sun, Oct 8, 2023 at 9:22 AM Juzhe-Zhong  wrote:
>
> Previously, I removed the movmisalign pattern to fix the execution FAILs in 
> this commit:
> https://github.com/gcc-mirror/gcc/commit/f7bff24905a6959f85f866390db2fff1d6f95520
>
> I was thinking at the beginning that RVV doesn't allow misaligned accesses, so I
> removed that pattern.
> However, after deeper investigation, re-reading the RVV ISA and experimenting on
> SPIKE,
> I realized I was wrong.
>
> RVV ISA reference: 
> https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-memory-alignment-constraints
>
> "If an element accessed by a vector memory instruction is not naturally 
> aligned to the size of the element,
>  either the element is transferred successfully or an address misaligned 
> exception is raised on that element."
 
But you gobble the "or .." into an existing -mstrict-align flag - are
you sure all implementations are
self-consistent with handling non-vector memory instructions and
vector memory instructions here?
At least the above wording doesn't seem to impose such a requirement.
 
> It's obvious that RVV ISA does allow misaligned vector load/store.
>
> And experiment and confirm on SPIKE:
>
> [jzzhong@rios-cad122:/work/home/jzzhong/work/toolchain/riscv/gcc/gcc/testsuite/gcc.dg/vect]$~/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/install/bin/spike
>  --isa=rv64gcv --varch=vlen:128,elen:64 
> ~/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/install/riscv64-unknown-elf/bin/pk64
>   a.out
> bbl loader
> z   ra 00010158 sp 003ffb40 gp 
> 00012c48
> tp  t0 000110da t1 000f t2 
> 
> s0 00013460 s1  a0 00012ef5 a1 
> 00012018
> a2 00012a71 a3 000d a4 0004 a5 
> 00012a71
> a6 00012a71 a7 00012018 s2  s3 
> 
> s4  s5  s6  s7 
> 
> s8  s9  sA  sB 
> 
> t3  t4  t5  t6 
> 
> pc 00010258 va/inst 020660a7 sr 80026620
> Store/AMO access fault!
>
> [jzzhong@rios-cad122:/work/home/jzzhong/work/toolchain/riscv/gcc/gcc/testsuite/gcc.dg/vect]$~/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/install/bin/spike
>  --misaligned --isa=rv64gcv --varch=vlen:128,elen:64 
> ~/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/install/riscv64-unknown-elf/bin/pk64
>   a.out
> bbl loader
>
> We can see SPIKE can pass previous *FAILED* execution tests with specifying 
> --misaligned to SPIKE.
>
> So, to honor the RVV ISA spec, we should add the movmisalign pattern back based on the
> investigations I have done, since
> it can improve multiple vectorization tests and fix dump FAILs.
>
> This patch fixes these following dump FAILs:
>
> FAIL: gcc.dg/vect/vect-bitfield-read-2.c -flto -ffat-lto-objects  
> scan-tree-dump-not optimized "Invalid sum"
> FAIL: gcc.dg/vect/vect-bitfield-read-2.c scan-tree-dump-not optimized 
> "Invalid sum"
> FAIL: gcc.dg/vect/vect-bitfield-read-4.c -flto -ffat-lto-objects  
> scan-tree-dump-not optimized "Invalid sum"
> FAIL: gcc.dg/vect/vect-bitfield-read-4.c scan-tree-dump-not optimized 
> "Invalid sum"
> FAIL: gcc.dg/vect/vect-bitfield-write-2.c -flto -ffat-lto-objects  
> scan-tree-dump-not optimized "Invalid sum"
> FAIL: gcc.dg/vect/vect-bitfield-write-2.c scan-tree-dump-not optimized 
> "Invalid sum"
> FAIL: gcc.dg/vect/vect-bitfield-write-3.c -flto -ffat-lto-objects  
> scan-tree-dump-not optimized "Invalid sum"
> FAIL: gcc.dg/vect/vect-bitfield-write-3.c scan-tree-dump-not optimized 
> "Invalid sum"
>
> Consider this following case:
>
> struct s {
> unsigned i : 31;
> char a : 4;
> };
>
> #define N 32
> #define ELT0 {0x7FFFUL, 0}
> #define ELT1 {0x7FFFUL, 1}
> #define ELT2 {0x7FFFUL, 2}
> #define ELT3 {0x7FFFUL, 3}
> #define RES 48
> struct s A[N]
>   = { ELT

Re: [PATCH] TEST: Fix dump FAIL for RVV (RISCV-V vector)

2023-10-09 Thread Richard Biener
On Sun, 8 Oct 2023, Juzhe-Zhong wrote:

> As this showed: https://godbolt.org/z/3K9oK7fx3
> 
> ARM SVE shows FOLD_EXTRACT_LAST 2 times whereas RVV shows it 4 times.
> 
> This is because RISC-V doesn't enable vec_pack_trunc, so the conversion and
> fold_extract_last fail at the first analysis attempt.
> Then we succeed at the second attempt.
> 
> So RVV shows "FOLD_EXTRACT_LAST" 4 times.

OK

> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/vect-cond-reduc-4.c: Add vect_pack_trunc variant.
> 
> ---
>  gcc/testsuite/gcc.dg/vect/vect-cond-reduc-4.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-reduc-4.c 
> b/gcc/testsuite/gcc.dg/vect/vect-cond-reduc-4.c
> index 8820075b1dc..8ea8c538713 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-cond-reduc-4.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cond-reduc-4.c
> @@ -42,6 +42,7 @@ main (void)
>  }
>  
>  /* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" } } */
> -/* { dg-final { scan-tree-dump-times "optimizing condition reduction with 
> FOLD_EXTRACT_LAST" 2 "vect" { target vect_fold_extract_last } } } */
> +/* { dg-final { scan-tree-dump-times "optimizing condition reduction with 
> FOLD_EXTRACT_LAST" 2 "vect" { target { vect_fold_extract_last && 
> vect_pack_trunc } } } } */
> +/* { dg-final { scan-tree-dump-times "optimizing condition reduction with 
> FOLD_EXTRACT_LAST" 4 "vect" { target { { vect_fold_extract_last } && { ! 
> vect_pack_trunc } } } } } */
>  /* { dg-final { scan-tree-dump-times "condition expression based on integer 
> induction." 2 "vect" { target { ! vect_fold_extract_last } } } } */
>  
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH] TEST: Fix dump FAIL for RVV

2023-10-09 Thread Richard Biener
On Sun, 8 Oct 2023, Juzhe-Zhong wrote:

> gcc/testsuite/ChangeLog:

OK

> 
>   * gcc.dg/vect/bb-slp-cond-1.c: Fix dump FAIL for RVV.
>   * gcc.dg/vect/pr57705.c: Ditto.
> 
> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c | 4 ++--
>  gcc/testsuite/gcc.dg/vect/pr57705.c   | 4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c 
> b/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
> index c8024429e9c..e1ebc23505f 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
> @@ -47,6 +47,6 @@ int main ()
>  }
>  
>  /* { dg-final { scan-tree-dump {(no need for alias check [^\n]* when VF is 
> 1|no alias between [^\n]* when [^\n]* is outside \(-16, 16\))} "vect" { 
> target vect_element_align } } } */
> -/* { dg-final { scan-tree-dump-times "loop vectorized" 1 "vect" { target { 
> vect_element_align && { ! amdgcn-*-* } } } } } */
> -/* { dg-final { scan-tree-dump-times "loop vectorized" 2 "vect" { target 
> amdgcn-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "loop vectorized" 1 "vect" { target { 
> vect_element_align && { { ! amdgcn-*-* } && { ! riscv_v } } } } } } */
> +/* { dg-final { scan-tree-dump-times "loop vectorized" 2 "vect" { target { 
> amdgcn-*-* || riscv_v } } } } */
>  
> diff --git a/gcc/testsuite/gcc.dg/vect/pr57705.c 
> b/gcc/testsuite/gcc.dg/vect/pr57705.c
> index 39c32946d74..2dacea0a7a7 100644
> --- a/gcc/testsuite/gcc.dg/vect/pr57705.c
> +++ b/gcc/testsuite/gcc.dg/vect/pr57705.c
> @@ -64,5 +64,5 @@ main ()
>return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 3 "vect" { target 
> vect_pack_trunc } } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 2 "vect" { target { 
> ! vect_pack_trunc } } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 3 "vect" { target { 
> vect_pack_trunc || riscv_v } } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 2 "vect" { target { 
> { ! vect_pack_trunc } && { ! riscv_v } } } } } */
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH] TEST: Fix XPASS of outer loop vectorization tests for RVV

2023-10-09 Thread Richard Biener
On Sun, 8 Oct 2023, Juzhe-Zhong wrote:

> Even though RVV doesn't enable vec_unpack/vec_pack, it succeeds on these outer loop
> vectorizations.

How so?  I think this maybe goes with the other similar change.

That is, when we already have specific target checks, adding riscv-*-*
looks sensible, but when we don't, we should figure out if there's a capability
we can (add and) test instead.

> Fix these following XPASS FAILs:
> 
> XPASS: gcc.dg/vect/no-scevccp-outer-16.c scan-tree-dump-times vect "OUTER 
> LOOP VECTORIZED." 1
> XPASS: gcc.dg/vect/no-scevccp-outer-17.c scan-tree-dump-times vect "OUTER 
> LOOP VECTORIZED." 1
> XPASS: gcc.dg/vect/no-scevccp-outer-19.c scan-tree-dump-times vect "OUTER 
> LOOP VECTORIZED." 1
> XPASS: gcc.dg/vect/no-scevccp-outer-21.c scan-tree-dump-times vect "OUTER 
> LOOP VECTORIZED." 1
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/no-scevccp-outer-16.c: Fix XPASS for RVV.
>   * gcc.dg/vect/no-scevccp-outer-17.c: Ditto.
>   * gcc.dg/vect/no-scevccp-outer-19.c: Ditto.
>   * gcc.dg/vect/no-scevccp-outer-21.c: Ditto.
> 
> ---
>  gcc/testsuite/gcc.dg/vect/no-scevccp-outer-16.c | 2 +-
>  gcc/testsuite/gcc.dg/vect/no-scevccp-outer-17.c | 2 +-
>  gcc/testsuite/gcc.dg/vect/no-scevccp-outer-19.c | 2 +-
>  gcc/testsuite/gcc.dg/vect/no-scevccp-outer-21.c | 2 +-
>  4 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-16.c 
> b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-16.c
> index c7c2fa8a504..12179949e00 100644
> --- a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-16.c
> +++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-16.c
> @@ -59,4 +59,4 @@ int main (void)
>return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
> xfail { ! {vect_unpack } } } } } */
> +/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
> xfail { { ! {vect_unpack } } && { ! {riscv_v } } } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-17.c 
> b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-17.c
> index ba904a6c03e..86554a98169 100644
> --- a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-17.c
> +++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-17.c
> @@ -65,4 +65,4 @@ int main (void)
>return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
> xfail { ! {vect_unpack } } } } } */
> +/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
> xfail { { ! {vect_unpack } } && { ! {riscv_v } } } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-19.c 
> b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-19.c
> index 5cd4049d08c..624b54accf4 100644
> --- a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-19.c
> +++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-19.c
> @@ -49,4 +49,4 @@ int main (void)
>return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
> xfail { ! {vect_unpack } } } } } */
> +/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
> xfail { { ! {vect_unpack } } && { ! {riscv_v } } } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-21.c 
> b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-21.c
> index 72e53c2bfb0..b30a5d78819 100644
> --- a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-21.c
> +++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-21.c
> @@ -59,4 +59,4 @@ int main (void)
>return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
> xfail { ! { vect_pack_trunc } } } } } */
> +/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
> xfail { { ! {vect_pack_trunc } } && { ! {riscv_v } } } } } } */
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH] ifcvt/vect: Emit COND_ADD for conditional scalar reduction.

2023-10-09 Thread Richard Biener
On Fri, 6 Oct 2023, Robin Dapp wrote:

> > So if you think you got everything correct the patch is OK as-is,
> > I just wasn't sure - maybe the neutral_element change deserves
> > a comment as to how MINUS_EXPR is handled.
> 
> Heh, I never think I got everything correct ;)
> 
> Added this now:
> 
>  static bool
>  fold_left_reduction_fn (code_helper code, internal_fn *reduc_fn)
>  {
> +  /* We support MINUS_EXPR by negating the operand.  This also preserves an
> + initial -0.0 since -0.0 - 0.0 (neutral op for MINUS_EXPR) == -0.0 +
> + (-0.0) = -0.0.  */
> 
> What I still found is that aarch64 ICEs at the assertion you added
> with -frounding-math.  Therefore I changed it to:
> 
> - gcc_assert (!HONOR_SIGN_DEPENDENT_ROUNDING (vectype_out));
> + if (HONOR_SIGN_DEPENDENT_ROUNDING (vectype_out))
> +   {
> + if (dump_enabled_p ())
> +   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +"cannot vectorize fold-left reduction 
> because"
> +" signed zeros cannot be preserved.\n");
> + return false;
> +   }
> 
> No code changes apart from that.  Will leave it until Monday and push then
> barring any objections.

Hmm, the function is called at transform time so this shouldn't help
avoid the ICE.  I expected we refuse to vectorize _any_ reduction
when sign-dependent rounding is in effect?  OTOH maybe sign-dependent
rounding is OK but only when we use an unconditional fold-left
(so a loop mask from fully masking is OK but not an original COND_ADD?).

Still the check should be done in vectorizable_reduction, not only
during transform (there the assert is proper, if we can distinguish
the loop mask vs. the COND_ADD here, otherwise just remove it).
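
A minimal sketch of what such an analysis-time rejection could look like in
vectorizable_reduction; the exact placement, the local variable names
(reduction_type, vectype_out) and the dump message (adapted from Robin's
snippet above) are assumptions here, not the final patch:

  /* Sketch only: refuse the fold-left path already at analysis time so
     the transform-time assert can stay.  */
  if (reduction_type == FOLD_LEFT_REDUCTION
      && HONOR_SIGN_DEPENDENT_ROUNDING (vectype_out))
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "cannot vectorize fold-left reduction when"
                         " sign dependent rounding is in effect.\n");
      return false;
    }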

Richard.


> Thanks for the pointers.
> 
> Regards
>  Robin
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: Re: [PATCH] TEST: Fix XPASS of outer loop vectorization tests for RVV

2023-10-09 Thread juzhe.zh...@rivai.ai
Thanks Richi.

I will try to figure out a better way to adapt the tests without adding a
riscv*-specific target variant.



juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-10-09 16:17
To: Juzhe-Zhong
CC: gcc-patches; jeffreyalaw
Subject: Re: [PATCH] TEST: Fix XPASS of outer loop vectorization tests for RVV
On Sun, 8 Oct 2023, Juzhe-Zhong wrote:
 
> Even though RVV doesn't enable vec_unpack/vec_pack, it succeeds on these outer loop
> vectorizations.
 
How so?  I think this maybe goes with the other similar change.
 
That is, when we already have specific target checks, adding riscv-*-*
looks sensible, but when we don't, we should figure out if there's a capability
we can (add and) test instead.
 
> Fix these following XPASS FAILs:
> 
> XPASS: gcc.dg/vect/no-scevccp-outer-16.c scan-tree-dump-times vect "OUTER 
> LOOP VECTORIZED." 1
> XPASS: gcc.dg/vect/no-scevccp-outer-17.c scan-tree-dump-times vect "OUTER 
> LOOP VECTORIZED." 1
> XPASS: gcc.dg/vect/no-scevccp-outer-19.c scan-tree-dump-times vect "OUTER 
> LOOP VECTORIZED." 1
> XPASS: gcc.dg/vect/no-scevccp-outer-21.c scan-tree-dump-times vect "OUTER 
> LOOP VECTORIZED." 1
> 
> gcc/testsuite/ChangeLog:
> 
> * gcc.dg/vect/no-scevccp-outer-16.c: Fix XPASS for RVV.
> * gcc.dg/vect/no-scevccp-outer-17.c: Ditto.
> * gcc.dg/vect/no-scevccp-outer-19.c: Ditto.
> * gcc.dg/vect/no-scevccp-outer-21.c: Ditto.
> 
> ---
>  gcc/testsuite/gcc.dg/vect/no-scevccp-outer-16.c | 2 +-
>  gcc/testsuite/gcc.dg/vect/no-scevccp-outer-17.c | 2 +-
>  gcc/testsuite/gcc.dg/vect/no-scevccp-outer-19.c | 2 +-
>  gcc/testsuite/gcc.dg/vect/no-scevccp-outer-21.c | 2 +-
>  4 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-16.c 
> b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-16.c
> index c7c2fa8a504..12179949e00 100644
> --- a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-16.c
> +++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-16.c
> @@ -59,4 +59,4 @@ int main (void)
>return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
> xfail { ! {vect_unpack } } } } } */
> +/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
> xfail { { ! {vect_unpack } } && { ! {riscv_v } } } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-17.c 
> b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-17.c
> index ba904a6c03e..86554a98169 100644
> --- a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-17.c
> +++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-17.c
> @@ -65,4 +65,4 @@ int main (void)
>return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
> xfail { ! {vect_unpack } } } } } */
> +/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
> xfail { { ! {vect_unpack } } && { ! {riscv_v } } } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-19.c 
> b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-19.c
> index 5cd4049d08c..624b54accf4 100644
> --- a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-19.c
> +++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-19.c
> @@ -49,4 +49,4 @@ int main (void)
>return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
> xfail { ! {vect_unpack } } } } } */
> +/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
> xfail { { ! {vect_unpack } } && { ! {riscv_v } } } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-21.c 
> b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-21.c
> index 72e53c2bfb0..b30a5d78819 100644
> --- a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-21.c
> +++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-21.c
> @@ -59,4 +59,4 @@ int main (void)
>return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
> xfail { ! { vect_pack_trunc } } } } } */
> +/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
> xfail { { ! {vect_pack_trunc } } && { ! {riscv_v } } } } } } */
> 
 
-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
 


[PATCH v1] RISC-V: Refine bswap16 auto vectorization code gen

2023-10-09 Thread pan2 . li
From: Pan Li 

This patch refines the code gen for bswap16.

We will have a VEC_PERM_EXPR after RTL expand when invoking
__builtin_bswap. It generates about 9 instructions in the
loop as below, no matter whether it is bswap16, bswap32 or bswap64.

  .L2:
1 vle16.v v4,0(a0)
2 vmv.v.x v2,a7
3 vand.vv v2,v6,v2
4 slli a2,a5,1
5 vrgatherei16.vv v1,v4,v2
6 sub a4,a4,a5
7 vse16.v v1,0(a3)
8 add a0,a0,a2
9 add a3,a3,a2
  bne a4,zero,.L2

But for bswap16 we may have an even simpler code gen, which
has only 7 instructions in the loop as below.

  .L5
1 vle8.v  v2,0(a5)
2 addi a5,a5,32
3 vsrl.vi v4,v2,8
4 vsll.vi v2,v2,8
5 vor.vv  v4,v4,v2
6 vse8.v  v4,0(a4)
7 addi a4,a4,32
  bne a5,a6,.L5

Unfortunately, this way the instruction count in the loop grows to
13 and 24 for bswap32 and bswap64. Thus, we refine the code
gen for bswap16 only, and leave both bswap32 and bswap64
as is.
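
For reference, a minimal loop of the kind this change targets (a sketch only,
not one of the new tests added below) would be:

void
bswap16_loop (unsigned short *restrict dst, unsigned short *restrict src, int n)
{
  for (int i = 0; i < n; ++i)
    dst[i] = __builtin_bswap16 (src[i]);  /* becomes a VEC_PERM_EXPR once vectorized.  */
}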

gcc/ChangeLog:

* config/riscv/riscv-v.cc (emit_vec_sll_scalar): New helper
function to emit vsll.vi/vsll.vx.
(emit_vec_srl_scalar): Likewise for vsrl.vi/vsrl.vx.
(emit_vec_or): Likewise for vor.vv.
(shuffle_bswap_pattern): New func impl for shuffle bswap.
(expand_vec_perm_const_1): Add shuffle bswap pattern.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vls/perm-4.c: Adjust checker.
* gcc.target/riscv/rvv/autovec/unop/bswap16-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/bswap16-0.c: New test.

Signed-off-by: Pan Li 
---
 gcc/config/riscv/riscv-v.cc   | 117 ++
 .../riscv/rvv/autovec/unop/bswap16-0.c|  17 +++
 .../riscv/rvv/autovec/unop/bswap16-run-0.c|  44 +++
 .../riscv/rvv/autovec/vls/bswap16-0.c |  34 +
 .../gcc.target/riscv/rvv/autovec/vls/perm-4.c |   4 +-
 5 files changed, 214 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-0.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/bswap16-0.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 23633a2a74d..3e3b5f2e797 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -878,6 +878,33 @@ emit_vlmax_decompress_insn (rtx target, rtx op0, rtx op1, 
rtx mask)
   emit_vlmax_masked_gather_mu_insn (target, op1, sel, mask);
 }
 
+static void
+emit_vec_sll_scalar (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx sll_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred_scalar (ASHIFT, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, sll_ops);
+}
+
+static void
+emit_vec_srl_scalar (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx srl_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred_scalar (LSHIFTRT, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, srl_ops);
+}
+
+static void
+emit_vec_or (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx or_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred (IOR, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, or_ops);
+}
+
 /* Emit merge instruction.  */
 
 static machine_mode
@@ -3030,6 +3057,94 @@ shuffle_decompress_patterns (struct expand_vec_perm_d *d)
   return true;
 }
 
+static bool
+shuffle_bswap_pattern (struct expand_vec_perm_d *d)
+{
+  HOST_WIDE_INT diff;
+  unsigned i, size, step;
+
+  if (!d->one_vector_p || !d->perm[0].is_constant (&diff) || !diff)
+return false;
+
+  step = diff + 1;
+  size = step * GET_MODE_UNIT_BITSIZE (d->vmode);
+
+  switch (size)
+{
+case 16:
+  break;
+case 32:
+case 64:
+  /* We will have VEC_PERM_EXPR after rtl expand when invoking
+__builtin_bswap. It will generate about 9 instructions in
+loop as below, no matter it is bswap16, bswap32 or bswap64.
+  .L2:
+1 vle16.v v4,0(a0)
+2 vmv.v.x v2,a7
+3 vand.vv v2,v6,v2
+4 slli a2,a5,1
+5 vrgatherei16.vv v1,v4,v2
+6 sub a4,a4,a5
+7 vse16.v v1,0(a3)
+8 add a0,a0,a2
+9 add a3,a3,a2
+  bne a4,zero,.L2
+
+But for bswap16 we may have an even simpler code gen, which
+has only 7 instructions in the loop as below.
+  .L5
+1 vle8.v  v2,0(a5)
+2 addi a5,a5,32
+3 vsrl.vi v4,v2,8
+4 vsll.vi v2,v2,8
+5 vor.vv  v4,v4,v2
+6 vse8.v  v4,0(a4)
+7 addi a4,a4,32
+  bne a5,a6,.L5
+
+Unfortunately, the instructions in loop will grow to 13 and 24
+for bswap32 and bswap64. Thus, we will leverage vrgather (9 insn)
+for both the bswap64 and bswap32, but take shift and or (7 insn)
+for bswap16.
+   */
+default:
+  return false;
+}
+
+  for (i = 0; i < step; i++)
+if (!d->p

Re: [PATCH v1] RISC-V: Refine bswap16 auto vectorization code gen

2023-10-09 Thread juzhe.zh...@rivai.ai
Remove these functions:

+static void
+emit_vec_sll_scalar (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx sll_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred_scalar (ASHIFT, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, sll_ops);
+}
+
+static void
+emit_vec_srl_scalar (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx srl_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred_scalar (LSHIFTRT, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, srl_ops);
+}
+
+static void
+emit_vec_or (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx or_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred (IOR, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, or_ops);
+}
+

Instead, 

For sll, you should use :
rtx tmp
= expand_binop (Pmode, ashl_optab, op_1,
gen_int_mode (8, Pmode), NULL_RTX, 0,
OPTAB_DIRECT);

For srl, you should use:
rtx tmp
= expand_binop (Pmode, lshiftrt_optab, op_1,
gen_int_mode (8, Pmode), NULL_RTX, 0,
OPTAB_DIRECT);


For or, you should use:
expand_binop (Pmode, ior_optab, tmp, dest, NULL_RTX, 0,
   OPTAB_DIRECT);
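
Putting the three together, a minimal sketch of how the 16-bit case of
shuffle_bswap_pattern might emit the sequence; viewing the byte-element vector
as a vector of 16-bit elements via related_vector_mode/gen_lowpart, using that
vector mode instead of Pmode for the element-wise operations, and omitting the
usual d->testing_p early-out are assumptions of this sketch, not part of the
review:

  /* Sketch only: bswap16 as (x << 8) | (x >> 8) on 16-bit elements,
     emitted through the generic expand_binop interface.  */
  machine_mode vhi_mode;
  if (!related_vector_mode (d->vmode, HImode,
                            exact_div (GET_MODE_NUNITS (d->vmode), 2))
         .exists (&vhi_mode))
    return false;
  rtx src = gen_lowpart (vhi_mode, d->op0);
  rtx shamt = gen_int_mode (8, Pmode);
  rtx hi = expand_binop (vhi_mode, ashl_optab, src, shamt,
                         NULL_RTX, 0, OPTAB_DIRECT);
  rtx lo = expand_binop (vhi_mode, lshiftrt_optab, src, shamt,
                         NULL_RTX, 0, OPTAB_DIRECT);
  if (!hi || !lo)
    return false;
  rtx res = expand_binop (vhi_mode, ior_optab, hi, lo,
                          NULL_RTX, 0, OPTAB_DIRECT);
  if (!res)
    return false;
  emit_move_insn (d->target, gen_lowpart (d->vmode, res));
  return true;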



juzhe.zh...@rivai.ai
 
From: pan2.li
Date: 2023-10-09 16:51
To: gcc-patches
CC: juzhe.zhong; pan2.li; yanzhang.wang; kito.cheng
Subject: [PATCH v1] RISC-V: Refine bswap16 auto vectorization code gen
From: Pan Li 
 
This patch refines the code gen for bswap16.
 
We will have a VEC_PERM_EXPR after RTL expand when invoking
__builtin_bswap. It generates about 9 instructions in the
loop as below, no matter whether it is bswap16, bswap32 or bswap64.
 
  .L2:
1 vle16.v v4,0(a0)
2 vmv.v.x v2,a7
3 vand.vv v2,v6,v2
4 slli a2,a5,1
5 vrgatherei16.vv v1,v4,v2
6 sub a4,a4,a5
7 vse16.v v1,0(a3)
8 add a0,a0,a2
9 add a3,a3,a2
  bne a4,zero,.L2
 
But for bswap16 we may have an even simpler code sequence, which
has only 7 instructions in the loop, as below.
 
  .L5
1 vle8.v  v2,0(a5)
2 addi    a5,a5,32
3 vsrl.vi v4,v2,8
4 vsll.vi v2,v2,8
5 vor.vv  v4,v4,v2
6 vse8.v  v4,0(a4)
7 addi    a4,a4,32
  bne a5,a6,.L5
 
Unfortunately, this approach grows the loop to 13 and 24 instructions
for bswap32 and bswap64 respectively.  Thus, we refine the code
generation for bswap16 only, and leave both bswap32 and bswap64
as is.
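
As a concrete illustration (an assumed example, not part of the patch), a
loop like the following is the kind of input whose VEC_PERM_EXPR the new
shuffle_bswap_pattern picks up:

  #include <stdint.h>

  void
  vec_bswap16 (uint16_t *restrict dst, uint16_t *restrict src, int n)
  {
    for (int i = 0; i < n; i++)
      dst[i] = __builtin_bswap16 (src[i]);
  }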
 
gcc/ChangeLog:
 
* config/riscv/riscv-v.cc (emit_vec_sll_scalar): New helper to
emit vsll.vi/vsll.vx.
(emit_vec_srl_scalar): Likewise for vsrl.vi/vsrl.vx.
(emit_vec_or): Likewise for vor.vv.
(shuffle_bswap_pattern): New function to handle the bswap shuffle.
(expand_vec_perm_const_1): Add shuffle bswap pattern.
 
gcc/testsuite/ChangeLog:
 
* gcc.target/riscv/rvv/autovec/vls/perm-4.c: Adjust checker.
* gcc.target/riscv/rvv/autovec/unop/bswap16-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/bswap16-0.c: New test.
 
Signed-off-by: Pan Li 
---
gcc/config/riscv/riscv-v.cc   | 117 ++
.../riscv/rvv/autovec/unop/bswap16-0.c|  17 +++
.../riscv/rvv/autovec/unop/bswap16-run-0.c|  44 +++
.../riscv/rvv/autovec/vls/bswap16-0.c |  34 +
.../gcc.target/riscv/rvv/autovec/vls/perm-4.c |   4 +-
5 files changed, 214 insertions(+), 2 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/bswap16-0.c
 
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 23633a2a74d..3e3b5f2e797 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -878,6 +878,33 @@ emit_vlmax_decompress_insn (rtx target, rtx op0, rtx op1, 
rtx mask)
   emit_vlmax_masked_gather_mu_insn (target, op1, sel, mask);
}
+static void
+emit_vec_sll_scalar (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx sll_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred_scalar (ASHIFT, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, sll_ops);
+}
+
+static void
+emit_vec_srl_scalar (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx srl_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred_scalar (LSHIFTRT, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, srl_ops);
+}
+
+static void
+emit_vec_or (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx or_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred (IOR, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, or_ops);
+}
+
/* Emit merge instruction.  */
static machine_mode
@@ -3030,6 +3057,94 @@ shuffle_decompress_patterns (struct expand_vec_perm_d *d)
   return true;
}
+static bool
+shuffle_bswap_pattern (struct expand_vec_perm_d *d)
+{
+  HOST_WIDE_INT diff;
+  unsigned i, size, step;
+
+  if (!d->one_vector_p || !d->perm[0].is_constant (&diff) || !diff)
+return false;
+
+  step = diff + 1;
+  size = step * GET_MODE_UNIT_BITSIZE (d->vmode);
+
+ 

Re: [PATCH]middle-end match.pd: optimize fneg (fabs (x)) to x | (1 << signbit(x)) [PR109154]

2023-10-09 Thread Richard Biener
On Mon, 9 Oct 2023, Andrew Pinski wrote:

> On Mon, Oct 9, 2023 at 12:20 AM Richard Biener  wrote:
> >
> > On Sat, 7 Oct 2023, Richard Sandiford wrote:
> >
> > > Richard Biener  writes:
> > > >> Am 07.10.2023 um 11:23 schrieb Richard Sandiford 
> > > >> >> Richard Biener  
> > > >> writes:
> > > >>> On Thu, 5 Oct 2023, Tamar Christina wrote:
> > > >>>
> > > > I suppose the idea is that -abs(x) might be easier to optimize with 
> > > > other
> > > > patterns (consider a - copysign(x,...), optimizing to a + abs(x)).
> > > >
> > > > For abs vs copysign it's a canonicalization, but (negate (abs @0)) 
> > > > is less
> > > > canonical than copysign.
> > > >
> > > >> Should I try removing this?
> > > >
> > > > I'd say yes (and put the reverse canonicalization next to this 
> > > > pattern).
> > > >
> > > 
> > >  This patch transforms fneg (fabs (x)) into copysign (x, -1) which is 
> > >  more
> > >  canonical and allows a target to expand this sequence efficiently.  
> > >  Such
> > >  sequences are common in scientific code working with gradients.
> > > 
> > >  various optimizations in match.pd only happened on COPYSIGN but not 
> > >  COPYSIGN_ALL
> > >  which means they exclude IFN_COPYSIGN.  COPYSIGN however is 
> > >  restricted to only
> > > >>>
> > > >>> That's not true:
> > > >>>
> > > >>> (define_operator_list COPYSIGN
> > > >>>BUILT_IN_COPYSIGNF
> > > >>>BUILT_IN_COPYSIGN
> > > >>>BUILT_IN_COPYSIGNL
> > > >>>IFN_COPYSIGN)
> > > >>>
> > > >>> but they miss the extended float builtin variants like
> > > >>> __builtin_copysignf16.  Also see below
> > > >>>
> > >  the C99 builtins and so doesn't work for vectors.
> > > 
> > >  The patch expands these optimizations to work on COPYSIGN_ALL.
> > > 
> > >  There is an existing canonicalization of copysign (x, -1) to fneg 
> > >  (fabs (x))
> > >  which I remove since this is a less efficient form.  The testsuite 
> > >  is also
> > >  updated in light of this.
> > > 
> > >  Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > > 
> > >  Ok for master?
> > > 
> > >  Thanks,
> > >  Tamar
> > > 
> > >  gcc/ChangeLog:
> > > 
> > > PR tree-optimization/109154
> > > * match.pd: Add new neg+abs rule, remove inverse copysign rule and
> > > expand existing copysign optimizations.
> > > 
> > >  gcc/testsuite/ChangeLog:
> > > 
> > > PR tree-optimization/109154
> > > * gcc.dg/fold-copysign-1.c: Updated.
> > > * gcc.dg/pr55152-2.c: Updated.
> > > * gcc.dg/tree-ssa/abs-4.c: Updated.
> > > * gcc.dg/tree-ssa/backprop-6.c: Updated.
> > > * gcc.dg/tree-ssa/copy-sign-2.c: Updated.
> > > * gcc.dg/tree-ssa/mult-abs-2.c: Updated.
> > > * gcc.target/aarch64/fneg-abs_1.c: New test.
> > > * gcc.target/aarch64/fneg-abs_2.c: New test.
> > > * gcc.target/aarch64/fneg-abs_3.c: New test.
> > > * gcc.target/aarch64/fneg-abs_4.c: New test.
> > > * gcc.target/aarch64/sve/fneg-abs_1.c: New test.
> > > * gcc.target/aarch64/sve/fneg-abs_2.c: New test.
> > > * gcc.target/aarch64/sve/fneg-abs_3.c: New test.
> > > * gcc.target/aarch64/sve/fneg-abs_4.c: New test.
> > > 
> > >  --- inline copy of patch ---
> > > 
> > >  diff --git a/gcc/match.pd b/gcc/match.pd
> > >  index 
> > >  4bdd83e6e061b16dbdb2845b9398fcfb8a6c9739..bd6599d36021e119f51a4928354f580ffe82c6e2
> > >   100644
> > >  --- a/gcc/match.pd
> > >  +++ b/gcc/match.pd
> > >  @@ -1074,45 +1074,43 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > > 
> > >  /* cos(copysign(x, y)) -> cos(x).  Similarly for cosh.  */
> > >  (for coss (COS COSH)
> > >  - copysigns (COPYSIGN)
> > >  - (simplify
> > >  -  (coss (copysigns @0 @1))
> > >  -   (coss @0)))
> > >  + (for copysigns (COPYSIGN_ALL)
> > > >>>
> > > >>> So this ends up generating for example the match
> > > >>> (cosf (copysignl ...)) which doesn't make much sense.
> > > >>>
> > > >>> The lock-step iteration did
> > > >>> (cosf (copysignf ..)) ... (ifn_cos (ifn_copysign ...))
> > > >>> which is leaner but misses the case of
> > > >>> (cosf (ifn_copysign ..)) - that's probably what you are
> > > >>> after with this change.
> > > >>>
> > > >>> That said, there isn't a nice solution (without altering the match.pd
> > > >>> IL).  There's the explicit solution, spelling out all combinations.
> > > >>>
> > > >>> So if we want to go with yout pragmatic solution changing this
> > > >>> to use COPYSIGN_ALL isn't necessary, only changing the lock-step
> > > >>> for iteration to a cross product for iteration is.
> > > >>>
> > > >>> Changing just this pattern to
> > > >>>
> > > >>> (for coss (COS COSH)
> > > >>> (for copysigns (COPYSIGN)
> > > >>>  (simplify
> > > >>>   (coss (copysi

Re: [PATCH] RISC-V: Support movmisalign of RVV VLA modes

2023-10-09 Thread Robin Dapp
Hi Juzhe,

I think an extra param might be too intrusive.  I would expect normal
hardware implementations to support unaligned accesses (but they might
be slow which should be covered by costs) and only rarely have hardware
that doesn't support it and raises exceptions.

Therefore I would suggest having a separate TARGET attribute like
TARGET_RVV_MISALIGNMENT_SUPPORTED (or so) that enables or disables the
movmisalign pattern.  This would be enabled by default for now and when
there is a uarch that wants different behavior it should do something
like

 #define TARGET_RVV_MISALIGNMENT_SUPPORTED (uarch != SPECIFIC_UARCH)

Regards
 Robin


Re: [X86 PATCH] Implement doubleword right shifts by 1 bit using s[ha]r+rcr.

2023-10-09 Thread Uros Bizjak
On Fri, Oct 6, 2023 at 3:59 PM Roger Sayle  wrote:
>
>
> Grr!  I've done it again.  ENOPATCH.
>
> > -Original Message-
> > From: Roger Sayle 
> > Sent: 06 October 2023 14:58
> > To: 'gcc-patches@gcc.gnu.org' 
> > Cc: 'Uros Bizjak' 
> > Subject: [X86 PATCH] Implement doubleword right shifts by 1 bit using
> s[ha]r+rcr.
> >
> >
> > This patch tweaks the i386 back-end's ix86_split_ashr and ix86_split_lshr
> > functions to implement doubleword right shifts by 1 bit, using a shift of
> the
> > highpart that sets the carry flag followed by a rotate-carry-right
> > (RCR) instruction on the lowpart.
> >
> > Conceptually this is similar to the recent left shift patch, but with two
> > complicating factors.  The first is that although the RCR sequence is
> shorter, and is
> > a ~3x performance improvement on AMD, my micro-benchmarking shows it
> > ~10% slower on Intel.  Hence this patch also introduces a new
> > X86_TUNE_USE_RCR tuning parameter.  The second is that I believe this is
> the
> > first time a "rotate-right-through-carry" and a right shift that sets the
> carry flag
> > from the least significant bit has been modelled in GCC RTL (on a MODE_CC
> > target).  For this I've used the i386 back-end's UNSPEC_CC_NE which seems
> > appropriate.  Finally rcrsi2 and rcrdi2 are separate define_insns so that
> we can
> > use their generator functions.
> >
> > For the pair of functions:
> > unsigned __int128 foo(unsigned __int128 x) { return x >> 1; }
> > __int128 bar(__int128 x) { return x >> 1; }
> >
> > with -O2 -march=znver4 we previously generated:
> >
> > foo:movq%rdi, %rax
> > movq%rsi, %rdx
> > shrdq   $1, %rsi, %rax
> > shrq%rdx
> > ret
> > bar:movq%rdi, %rax
> > movq%rsi, %rdx
> > shrdq   $1, %rsi, %rax
> > sarq%rdx
> > ret
> >
> > with this patch we now generate:
> >
> > foo:movq%rsi, %rdx
> > movq%rdi, %rax
> > shrq%rdx
> > rcrq%rax
> > ret
> > bar:movq%rsi, %rdx
> > movq%rdi, %rax
> > sarq%rdx
> > rcrq%rax
> > ret
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and
> > make -k check, both with and without --target_board=unix{-m32} with no new
> > failures.  And to provide additional testing, I've also bootstrapped and
> regression
> > tested a version of this patch where the RCR is always generated
> (independent of
> > the -march target) again with no regressions.  Ok for mainline?
> >
> >
> > 2023-10-06  Roger Sayle  
> >
> > gcc/ChangeLog
> > * config/i386/i386-expand.c (ix86_split_ashr): Split shifts by
> > one into ashr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR
> > or -Oz.
> > (ix86_split_lshr): Likewise, split shifts by one bit into
> > lshr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR or -Oz.
> > * config/i386/i386.h (TARGET_USE_RCR): New backend macro.
> > * config/i386/i386.md (rcrsi2): New define_insn for rcrl.
> > (rcrdi2): New define_insn for rcrq.
> > (3_carry): New define_insn for right shifts that
> > set the carry flag from the least significant bit, modelled using
> > UNSPEC_CC_NE.
> > * config/i386/x86-tune.def (X86_TUNE_USE_RCR): New tuning
> parameter
> > controlling use of rcr 1 vs. shrd, which is significantly faster
> on
> > AMD processors.
> >
> > gcc/testsuite/ChangeLog
> > * gcc.target/i386/rcr-1.c: New 64-bit test case.
> > * gcc.target/i386/rcr-2.c: New 32-bit test case.

OK.

Just don't set the new tune for generic. I hope Intel people notice
the performance difference...

Thanks,
Uros.


Re: [PATCH][_GLIBCXX_INLINE_VERSION] Fix

2023-10-09 Thread Iain Sandoe



> On 9 Oct 2023, at 06:06, François Dumont  wrote:
> 
> I think we can do the same without the symbol alias feature.  It's even
> simpler because it does not require any maintenance when the version
> symbol is bumped.
> 
> Here is what I'm testing, at least exported symbol is fine.

Thanks; works for me, (g++ tests with your patches + a local one to enable 
versioned-namespace on Darwin).

Iain

> 
> François
> 
> 
> On 08/10/2023 16:06, Iain Sandoe wrote:
>> Hi François,
>> 
>>> On 21 Sep 2023, at 05:41, François Dumont  wrote:
>>> 
>>> Tests were successful, ok to commit ?
>>> 
>>> On 20/09/2023 19:51, François Dumont wrote:
 libstdc++: [_GLIBCXX_INLINE_VERSION] Add handle_contract_violation symbol 
 alias
 
 libstdc++-v3/ChangeLog:
 
 * src/experimental/contract.cc
 [_GLIBCXX_INLINE_VERSION](handle_contract_violation): Provide symbol 
 alias
 without version namespace decoration for gcc.
>> This does not work in the source on targets without support for symbol 
>> aliases (Darwin is one)
>> “../experimental/contract.cc:79:8: warning: alias definitions not supported 
>> in Mach-O; ignored”
>> 
>> - there might be a way to do it at link-time (for one symbol not too bad); I 
>> will have to poke at
>>   it a bit.
>> Iain
>> 
 Here is what I'm testing eventually, ok to commit if successful ?
 
 François
 
 On 20/09/2023 11:32, Jonathan Wakely wrote:
> On Wed, 20 Sept 2023 at 05:51, François Dumont via Libstdc++
>  wrote:
>> libstdc++: Remove std::constract_violation from versioned namespace
> Spelling mistake in contract_violation, and it's not
> std::contract_violation, it's std::experimental::contract_violation
> 
>> GCC expects this type to be in std namespace directly.
> Again, it's in std::experimental not in std directly.
> 
> Will this change cause problems when including another experimental
> header, which does put experimental below std::__8?
> 
> I think std::__8::experimental and std::experimental will become 
> ambiguous.
> 
> Maybe we do want to remove the inline __8 namespace from all
> experimental headers. That needs a bit more thought though.
> 
>> libstdc++-v3/ChangeLog:
>> 
>>   * include/experimental/contract:
>>   Remove 
>> _GLIBCXX_BEGIN_NAMESPACE_VERSION/_GLIBCXX_END_NAMESPACE_VERSION.
> This line is too long for the changelog.
> 
>> It does fix 29 g++.dg/contracts in gcc testsuite.
>> 
>> Ok to commit ?
>> 
>> François
> 



Re: [PATCH] test: Isolate slp-1.c check of target supports vect_strided5

2023-10-09 Thread Andrew Stubbs

On 07/10/2023 02:04, juzhe.zh...@rivai.ai wrote:

Thanks for reporting it.

I think we may need to change it into:
+ /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 
"vect" { target {! vect_load_lanes } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 
"vect" { target vect_strided5 && vect_load_lanes } } } */


Could you verify whether it works for you?


You need an additional set of curly braces in the second line to avoid a 
syntax error message, but I get a pass with that change.
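
For reference, the second directive with the extra set of braces (a sketch
of the syntax fix only; counts and selectors as in the snippet above):

/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target { vect_strided5 && vect_load_lanes } } } } */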


Thanks

Andrew


RE: [PATCH]AArch64 Add SVE implementation for cond_copysign.

2023-10-09 Thread Tamar Christina
> -Original Message-
> From: Richard Sandiford 
> Sent: Saturday, October 7, 2023 10:58 AM
> To: Richard Biener 
> Cc: Tamar Christina ; gcc-patches@gcc.gnu.org;
> nd ; Richard Earnshaw ;
> Marcus Shawcroft ; Kyrylo Tkachov
> 
> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> 
> Richard Biener  writes:
> > On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina
>  wrote:
> >>
> >> > -Original Message-
> >> > From: Richard Sandiford 
> >> > Sent: Thursday, October 5, 2023 9:26 PM
> >> > To: Tamar Christina 
> >> > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> >> > ; Marcus Shawcroft
> >> > ; Kyrylo Tkachov
> 
> >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for
> cond_copysign.
> >> >
> >> > Tamar Christina  writes:
> >> > >> -Original Message-
> >> > >> From: Richard Sandiford 
> >> > >> Sent: Thursday, October 5, 2023 8:29 PM
> >> > >> To: Tamar Christina 
> >> > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> >> > >> ; Marcus Shawcroft
> >> > >> ; Kyrylo Tkachov
> >> > 
> >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for
> cond_copysign.
> >> > >>
> >> > >> Tamar Christina  writes:
> >> > >> > Hi All,
> >> > >> >
> >> > >> > This adds an implementation for masked copysign along with an
> >> > >> > optimized pattern for masked copysign (x, -1).
> >> > >>
> >> > >> It feels like we're ending up with a lot of AArch64-specific
> >> > >> code that just hard- codes the observation that changing the
> >> > >> sign is equivalent to changing the top bit.  We then need to
> >> > >> make sure that we choose the best way of changing the top bit for any
> given situation.
> >> > >>
> >> > >> Hard-coding the -1/negative case is one instance of that.  But
> >> > >> it looks like we also fail to use the best sequence for SVE2.  E.g.
> >> > >> [https://godbolt.org/z/ajh3MM5jv]:
> >> > >>
> >> > >> #include 
> >> > >>
> >> > >> void f(double *restrict a, double *restrict b) {
> >> > >> for (int i = 0; i < 100; ++i)
> >> > >> a[i] = __builtin_copysign(a[i], b[i]); }
> >> > >>
> >> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
> >> > >> for (int i = 0; i < 100; ++i)
> >> > >> a[i] = (a[i] & ~c) | (b[i] & c); }
> >> > >>
> >> > >> gives:
> >> > >>
> >> > >> f:
> >> > >> mov x2, 0
> >> > >> mov w3, 100
> >> > >> whilelo p7.d, wzr, w3
> >> > >> .L2:
> >> > >> ld1dz30.d, p7/z, [x0, x2, lsl 3]
> >> > >> ld1dz31.d, p7/z, [x1, x2, lsl 3]
> >> > >> and z30.d, z30.d, #0x7fff
> >> > >> and z31.d, z31.d, #0x8000
> >> > >> orr z31.d, z31.d, z30.d
> >> > >> st1dz31.d, p7, [x0, x2, lsl 3]
> >> > >> incdx2
> >> > >> whilelo p7.d, w2, w3
> >> > >> b.any   .L2
> >> > >> ret
> >> > >> g:
> >> > >> mov x3, 0
> >> > >> mov w4, 100
> >> > >> mov z29.d, x2
> >> > >> whilelo p7.d, wzr, w4
> >> > >> .L6:
> >> > >> ld1dz30.d, p7/z, [x0, x3, lsl 3]
> >> > >> ld1dz31.d, p7/z, [x1, x3, lsl 3]
> >> > >> bsl z31.d, z31.d, z30.d, z29.d
> >> > >> st1dz31.d, p7, [x0, x3, lsl 3]
> >> > >> incdx3
> >> > >> whilelo p7.d, w3, w4
> >> > >> b.any   .L6
> >> > >> ret
> >> > >>
> >> > >> I saw that you originally tried to do this in match.pd and that
> >> > >> the decision was to fold to copysign instead.  But perhaps
> >> > >> there's a compromise where isel does something with the (new)
> >> > >> copysign canonical
> >> > form?
> >> > >> I.e. could we go with your new version of the match.pd patch,
> >> > >> and add some isel stuff as a follow-on?
> >> > >>
> >> > >
> >> > > Sure if that's what's desired But..
> >> > >
> >> > > The example you posted above is for instance worse for x86
> >> > > https://godbolt.org/z/x9ccqxW6T where the first operation has a
> >> > > dependency chain of 2 and the latter of 3.  It's likely any open
> >> > > coding of this
> >> > operation is going to hurt a target.
> >> > >
> >> > > So I'm unsure what isel transform this into...
> >> >
> >> > I didn't mean that we should go straight to using isel for the
> >> > general case, just for the new case.  The example above was instead
> >> > trying to show the general point that hiding the logic ops in target 
> >> > code is
> a double-edged sword.
> >>
> >> I see.. but the problem here is that transforming copysign (x, -1)
> >> into (x | 0x800) would require an integer operation on an FP
> >> value.  I'm happy to do it but it seems like it'll be an AArch64 only thing
> anyway.
> >>
> >> If we want to do this we need to check can_change_mode_class or a hook.
> >> Most targets including x86 reject the conversion.  So it'll just be
> >> effectively an AArch64 thing.
> >>
> >> You're right that the actual equivalent transformation is this
> >> https://godbolt.org/z/KesfrMv5

Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.

2023-10-09 Thread Richard Biener
On Mon, Oct 9, 2023 at 11:39 AM Tamar Christina  wrote:
>
> > -Original Message-
> > From: Richard Sandiford 
> > Sent: Saturday, October 7, 2023 10:58 AM
> > To: Richard Biener 
> > Cc: Tamar Christina ; gcc-patches@gcc.gnu.org;
> > nd ; Richard Earnshaw ;
> > Marcus Shawcroft ; Kyrylo Tkachov
> > 
> > Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> >
> > Richard Biener  writes:
> > > On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina
> >  wrote:
> > >>
> > >> > -Original Message-
> > >> > From: Richard Sandiford 
> > >> > Sent: Thursday, October 5, 2023 9:26 PM
> > >> > To: Tamar Christina 
> > >> > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> > >> > ; Marcus Shawcroft
> > >> > ; Kyrylo Tkachov
> > 
> > >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for
> > cond_copysign.
> > >> >
> > >> > Tamar Christina  writes:
> > >> > >> -Original Message-
> > >> > >> From: Richard Sandiford 
> > >> > >> Sent: Thursday, October 5, 2023 8:29 PM
> > >> > >> To: Tamar Christina 
> > >> > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> > >> > >> ; Marcus Shawcroft
> > >> > >> ; Kyrylo Tkachov
> > >> > 
> > >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for
> > cond_copysign.
> > >> > >>
> > >> > >> Tamar Christina  writes:
> > >> > >> > Hi All,
> > >> > >> >
> > >> > >> > This adds an implementation for masked copysign along with an
> > >> > >> > optimized pattern for masked copysign (x, -1).
> > >> > >>
> > >> > >> It feels like we're ending up with a lot of AArch64-specific
> > >> > >> code that just hard- codes the observation that changing the
> > >> > >> sign is equivalent to changing the top bit.  We then need to
> > >> > >> make sure that we choose the best way of changing the top bit for 
> > >> > >> any
> > given situation.
> > >> > >>
> > >> > >> Hard-coding the -1/negative case is one instance of that.  But
> > >> > >> it looks like we also fail to use the best sequence for SVE2.  E.g.
> > >> > >> [https://godbolt.org/z/ajh3MM5jv]:
> > >> > >>
> > >> > >> #include 
> > >> > >>
> > >> > >> void f(double *restrict a, double *restrict b) {
> > >> > >> for (int i = 0; i < 100; ++i)
> > >> > >> a[i] = __builtin_copysign(a[i], b[i]); }
> > >> > >>
> > >> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
> > >> > >> for (int i = 0; i < 100; ++i)
> > >> > >> a[i] = (a[i] & ~c) | (b[i] & c); }
> > >> > >>
> > >> > >> gives:
> > >> > >>
> > >> > >> f:
> > >> > >> mov x2, 0
> > >> > >> mov w3, 100
> > >> > >> whilelo p7.d, wzr, w3
> > >> > >> .L2:
> > >> > >> ld1dz30.d, p7/z, [x0, x2, lsl 3]
> > >> > >> ld1dz31.d, p7/z, [x1, x2, lsl 3]
> > >> > >> and z30.d, z30.d, #0x7fff
> > >> > >> and z31.d, z31.d, #0x8000
> > >> > >> orr z31.d, z31.d, z30.d
> > >> > >> st1dz31.d, p7, [x0, x2, lsl 3]
> > >> > >> incdx2
> > >> > >> whilelo p7.d, w2, w3
> > >> > >> b.any   .L2
> > >> > >> ret
> > >> > >> g:
> > >> > >> mov x3, 0
> > >> > >> mov w4, 100
> > >> > >> mov z29.d, x2
> > >> > >> whilelo p7.d, wzr, w4
> > >> > >> .L6:
> > >> > >> ld1dz30.d, p7/z, [x0, x3, lsl 3]
> > >> > >> ld1dz31.d, p7/z, [x1, x3, lsl 3]
> > >> > >> bsl z31.d, z31.d, z30.d, z29.d
> > >> > >> st1dz31.d, p7, [x0, x3, lsl 3]
> > >> > >> incdx3
> > >> > >> whilelo p7.d, w3, w4
> > >> > >> b.any   .L6
> > >> > >> ret
> > >> > >>
> > >> > >> I saw that you originally tried to do this in match.pd and that
> > >> > >> the decision was to fold to copysign instead.  But perhaps
> > >> > >> there's a compromise where isel does something with the (new)
> > >> > >> copysign canonical
> > >> > form?
> > >> > >> I.e. could we go with your new version of the match.pd patch,
> > >> > >> and add some isel stuff as a follow-on?
> > >> > >>
> > >> > >
> > >> > > Sure if that's what's desired But..
> > >> > >
> > >> > > The example you posted above is for instance worse for x86
> > >> > > https://godbolt.org/z/x9ccqxW6T where the first operation has a
> > >> > > dependency chain of 2 and the latter of 3.  It's likely any open
> > >> > > coding of this
> > >> > operation is going to hurt a target.
> > >> > >
> > >> > > So I'm unsure what isel transform this into...
> > >> >
> > >> > I didn't mean that we should go straight to using isel for the
> > >> > general case, just for the new case.  The example above was instead
> > >> > trying to show the general point that hiding the logic ops in target 
> > >> > code is
> > a double-edged sword.
> > >>
> > >> I see.. but the problem here is that transforming copysign (x, -1)
> > >> into (x | 0x800) would require an integer operation on an FP
> > >> value.  I'm happy to do it but it seems like it'll be an AArch64 onl

RE: [PATCH]AArch64 Add SVE implementation for cond_copysign.

2023-10-09 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Monday, October 9, 2023 10:45 AM
> To: Tamar Christina 
> Cc: Richard Sandiford ; gcc-
> patc...@gcc.gnu.org; nd ; Richard Earnshaw
> ; Marcus Shawcroft
> ; Kyrylo Tkachov 
> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> 
> On Mon, Oct 9, 2023 at 11:39 AM Tamar Christina
>  wrote:
> >
> > > -Original Message-
> > > From: Richard Sandiford 
> > > Sent: Saturday, October 7, 2023 10:58 AM
> > > To: Richard Biener 
> > > Cc: Tamar Christina ;
> > > gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> > > ; Marcus Shawcroft
> > > ; Kyrylo Tkachov
> 
> > > Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> > >
> > > Richard Biener  writes:
> > > > On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina
> > >  wrote:
> > > >>
> > > >> > -Original Message-
> > > >> > From: Richard Sandiford 
> > > >> > Sent: Thursday, October 5, 2023 9:26 PM
> > > >> > To: Tamar Christina 
> > > >> > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> > > >> > ; Marcus Shawcroft
> > > >> > ; Kyrylo Tkachov
> > > 
> > > >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for
> > > cond_copysign.
> > > >> >
> > > >> > Tamar Christina  writes:
> > > >> > >> -Original Message-
> > > >> > >> From: Richard Sandiford 
> > > >> > >> Sent: Thursday, October 5, 2023 8:29 PM
> > > >> > >> To: Tamar Christina 
> > > >> > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard
> > > >> > >> Earnshaw ; Marcus Shawcroft
> > > >> > >> ; Kyrylo Tkachov
> > > >> > 
> > > >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for
> > > cond_copysign.
> > > >> > >>
> > > >> > >> Tamar Christina  writes:
> > > >> > >> > Hi All,
> > > >> > >> >
> > > >> > >> > This adds an implementation for masked copysign along with
> > > >> > >> > an optimized pattern for masked copysign (x, -1).
> > > >> > >>
> > > >> > >> It feels like we're ending up with a lot of AArch64-specific
> > > >> > >> code that just hard- codes the observation that changing the
> > > >> > >> sign is equivalent to changing the top bit.  We then need to
> > > >> > >> make sure that we choose the best way of changing the top
> > > >> > >> bit for any
> > > given situation.
> > > >> > >>
> > > >> > >> Hard-coding the -1/negative case is one instance of that.
> > > >> > >> But it looks like we also fail to use the best sequence for SVE2. 
> > > >> > >>  E.g.
> > > >> > >> [https://godbolt.org/z/ajh3MM5jv]:
> > > >> > >>
> > > >> > >> #include 
> > > >> > >>
> > > >> > >> void f(double *restrict a, double *restrict b) {
> > > >> > >> for (int i = 0; i < 100; ++i)
> > > >> > >> a[i] = __builtin_copysign(a[i], b[i]); }
> > > >> > >>
> > > >> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
> > > >> > >> for (int i = 0; i < 100; ++i)
> > > >> > >> a[i] = (a[i] & ~c) | (b[i] & c); }
> > > >> > >>
> > > >> > >> gives:
> > > >> > >>
> > > >> > >> f:
> > > >> > >> mov x2, 0
> > > >> > >> mov w3, 100
> > > >> > >> whilelo p7.d, wzr, w3
> > > >> > >> .L2:
> > > >> > >> ld1dz30.d, p7/z, [x0, x2, lsl 3]
> > > >> > >> ld1dz31.d, p7/z, [x1, x2, lsl 3]
> > > >> > >> and z30.d, z30.d, #0x7fff
> > > >> > >> and z31.d, z31.d, #0x8000
> > > >> > >> orr z31.d, z31.d, z30.d
> > > >> > >> st1dz31.d, p7, [x0, x2, lsl 3]
> > > >> > >> incdx2
> > > >> > >> whilelo p7.d, w2, w3
> > > >> > >> b.any   .L2
> > > >> > >> ret
> > > >> > >> g:
> > > >> > >> mov x3, 0
> > > >> > >> mov w4, 100
> > > >> > >> mov z29.d, x2
> > > >> > >> whilelo p7.d, wzr, w4
> > > >> > >> .L6:
> > > >> > >> ld1dz30.d, p7/z, [x0, x3, lsl 3]
> > > >> > >> ld1dz31.d, p7/z, [x1, x3, lsl 3]
> > > >> > >> bsl z31.d, z31.d, z30.d, z29.d
> > > >> > >> st1dz31.d, p7, [x0, x3, lsl 3]
> > > >> > >> incdx3
> > > >> > >> whilelo p7.d, w3, w4
> > > >> > >> b.any   .L6
> > > >> > >> ret
> > > >> > >>
> > > >> > >> I saw that you originally tried to do this in match.pd and
> > > >> > >> that the decision was to fold to copysign instead.  But
> > > >> > >> perhaps there's a compromise where isel does something with
> > > >> > >> the (new) copysign canonical
> > > >> > form?
> > > >> > >> I.e. could we go with your new version of the match.pd
> > > >> > >> patch, and add some isel stuff as a follow-on?
> > > >> > >>
> > > >> > >
> > > >> > > Sure if that's what's desired But..
> > > >> > >
> > > >> > > The example you posted above is for instance worse for x86
> > > >> > > https://godbolt.org/z/x9ccqxW6T where the first operation has
> > > >> > > a dependency chain of 2 and the latter of 3.  It's likely any
> > > >> > > open coding of this
> > > >> > operation is going to hurt a target.
> > > >> > >
> > > >> >

Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.

2023-10-09 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Saturday, October 7, 2023 10:58 AM
>> To: Richard Biener 
>> Cc: Tamar Christina ; gcc-patches@gcc.gnu.org;
>> nd ; Richard Earnshaw ;
>> Marcus Shawcroft ; Kyrylo Tkachov
>> 
>> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>> 
>> Richard Biener  writes:
>> > On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina
>>  wrote:
>> >>
>> >> > -Original Message-
>> >> > From: Richard Sandiford 
>> >> > Sent: Thursday, October 5, 2023 9:26 PM
>> >> > To: Tamar Christina 
>> >> > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> >> > ; Marcus Shawcroft
>> >> > ; Kyrylo Tkachov
>> 
>> >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for
>> cond_copysign.
>> >> >
>> >> > Tamar Christina  writes:
>> >> > >> -Original Message-
>> >> > >> From: Richard Sandiford 
>> >> > >> Sent: Thursday, October 5, 2023 8:29 PM
>> >> > >> To: Tamar Christina 
>> >> > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> >> > >> ; Marcus Shawcroft
>> >> > >> ; Kyrylo Tkachov
>> >> > 
>> >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for
>> cond_copysign.
>> >> > >>
>> >> > >> Tamar Christina  writes:
>> >> > >> > Hi All,
>> >> > >> >
>> >> > >> > This adds an implementation for masked copysign along with an
>> >> > >> > optimized pattern for masked copysign (x, -1).
>> >> > >>
>> >> > >> It feels like we're ending up with a lot of AArch64-specific
>> >> > >> code that just hard- codes the observation that changing the
>> >> > >> sign is equivalent to changing the top bit.  We then need to
>> >> > >> make sure that we choose the best way of changing the top bit for any
>> given situation.
>> >> > >>
>> >> > >> Hard-coding the -1/negative case is one instance of that.  But
>> >> > >> it looks like we also fail to use the best sequence for SVE2.  E.g.
>> >> > >> [https://godbolt.org/z/ajh3MM5jv]:
>> >> > >>
>> >> > >> #include 
>> >> > >>
>> >> > >> void f(double *restrict a, double *restrict b) {
>> >> > >> for (int i = 0; i < 100; ++i)
>> >> > >> a[i] = __builtin_copysign(a[i], b[i]); }
>> >> > >>
>> >> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
>> >> > >> for (int i = 0; i < 100; ++i)
>> >> > >> a[i] = (a[i] & ~c) | (b[i] & c); }
>> >> > >>
>> >> > >> gives:
>> >> > >>
>> >> > >> f:
>> >> > >> mov x2, 0
>> >> > >> mov w3, 100
>> >> > >> whilelo p7.d, wzr, w3
>> >> > >> .L2:
>> >> > >> ld1dz30.d, p7/z, [x0, x2, lsl 3]
>> >> > >> ld1dz31.d, p7/z, [x1, x2, lsl 3]
>> >> > >> and z30.d, z30.d, #0x7fff
>> >> > >> and z31.d, z31.d, #0x8000
>> >> > >> orr z31.d, z31.d, z30.d
>> >> > >> st1dz31.d, p7, [x0, x2, lsl 3]
>> >> > >> incdx2
>> >> > >> whilelo p7.d, w2, w3
>> >> > >> b.any   .L2
>> >> > >> ret
>> >> > >> g:
>> >> > >> mov x3, 0
>> >> > >> mov w4, 100
>> >> > >> mov z29.d, x2
>> >> > >> whilelo p7.d, wzr, w4
>> >> > >> .L6:
>> >> > >> ld1dz30.d, p7/z, [x0, x3, lsl 3]
>> >> > >> ld1dz31.d, p7/z, [x1, x3, lsl 3]
>> >> > >> bsl z31.d, z31.d, z30.d, z29.d
>> >> > >> st1dz31.d, p7, [x0, x3, lsl 3]
>> >> > >> incdx3
>> >> > >> whilelo p7.d, w3, w4
>> >> > >> b.any   .L6
>> >> > >> ret
>> >> > >>
>> >> > >> I saw that you originally tried to do this in match.pd and that
>> >> > >> the decision was to fold to copysign instead.  But perhaps
>> >> > >> there's a compromise where isel does something with the (new)
>> >> > >> copysign canonical
>> >> > form?
>> >> > >> I.e. could we go with your new version of the match.pd patch,
>> >> > >> and add some isel stuff as a follow-on?

[A]

>> >> > >>
>> >> > >
>> >> > > Sure if that's what's desired But..
>> >> > >
>> >> > > The example you posted above is for instance worse for x86
>> >> > > https://godbolt.org/z/x9ccqxW6T where the first operation has a
>> >> > > dependency chain of 2 and the latter of 3.  It's likely any open
>> >> > > coding of this
>> >> > operation is going to hurt a target.
>> >> > >
>> >> > > So I'm unsure what isel transform this into...
>> >> >
>> >> > I didn't mean that we should go straight to using isel for the
>> >> > general case, just for the new case.  The example above was instead
>> >> > trying to show the general point that hiding the logic ops in target 
>> >> > code is
>> a double-edged sword.
>> >>
>> >> I see.. but the problem here is that transforming copysign (x, -1)
>> >> into (x | 0x800) would require an integer operation on an FP
>> >> value.  I'm happy to do it but it seems like it'll be an AArch64 only 
>> >> thing
>> anyway.
>> >>
>> >> If we want to do this we need to check can_change_mode_class or a hook.
>> >> Most targets including x86 reject the conversi

[PATCH] c++: Improve diagnostics for constexpr cast from void*

2023-10-09 Thread Nathaniel Shead
Bootstrapped and regtested on x86_64-pc-linux-gnu with 
GXX_TESTSUITE_STDS=98,11,14,17,20,23,26,impcx.

-- >8 --

This patch improves the errors given when casting from void* in C++26 to
include the expected type if the type of the pointed-to object was
not similar to the cast-to type.

It also ensures (for all standard modes) that void* casts are checked
even for DECL_ARTIFICIAL declarations, such as lifetime-extended
temporaries, and are only skipped in cases we know are OK (heap
identifiers and source_location::current). This provides more accurate
diagnostics when using the pointer and ensures that some other casts
from void* are now correctly rejected.
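
For illustration only (a hypothetical snippet, not the new testcase), the
kind of code affected looks like:

  constexpr int
  bad_cast ()
  {
    double d = 1.0;
    const void *p = &d;
    return *static_cast<const int *>(p);  // pointed-to object is a double
  }
  constexpr int i = bad_cast ();  // error now also names the expected type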

gcc/cp/ChangeLog:

* constexpr.cc (is_std_source_location_current): New.
(cxx_eval_constant_expression): Only ignore cast from void* for
specific cases and improve other diagnostics.

gcc/testsuite/ChangeLog:

* g++.dg/cpp0x/constexpr-cast4.C: New test.

Signed-off-by: Nathaniel Shead 
---
 gcc/cp/constexpr.cc  | 83 +---
 gcc/testsuite/g++.dg/cpp0x/constexpr-cast4.C |  7 ++
 2 files changed, 78 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/cpp0x/constexpr-cast4.C

diff --git a/gcc/cp/constexpr.cc b/gcc/cp/constexpr.cc
index 0f948db7c2d..f38d541a662 100644
--- a/gcc/cp/constexpr.cc
+++ b/gcc/cp/constexpr.cc
@@ -2301,6 +2301,36 @@ is_std_allocator_allocate (const constexpr_call *call)
  && is_std_allocator_allocate (call->fundef->decl));
 }
 
+/* Return true if FNDECL is std::source_location::current.  */
+
+static inline bool
+is_std_source_location_current (tree fndecl)
+{
+  if (!decl_in_std_namespace_p (fndecl))
+return false;
+
+  tree name = DECL_NAME (fndecl);
+  if (name == NULL_TREE || !id_equal (name, "current"))
+return false;
+
+  tree ctx = DECL_CONTEXT (fndecl);
+  if (ctx == NULL_TREE || !CLASS_TYPE_P (ctx) || !TYPE_MAIN_DECL (ctx))
+return false;
+
+  name = DECL_NAME (TYPE_MAIN_DECL (ctx));
+  return name && id_equal (name, "source_location");
+}
+
+/* Overload for the above taking constexpr_call*.  */
+
+static inline bool
+is_std_source_location_current (const constexpr_call *call)
+{
+  return (call
+ && call->fundef
+ && is_std_source_location_current (call->fundef->decl));
+}
+
 /* Return true if FNDECL is __dynamic_cast.  */
 
 static inline bool
@@ -7850,33 +7880,62 @@ cxx_eval_constant_expression (const constexpr_ctx *ctx, 
tree t,
if (TYPE_PTROB_P (type)
&& TYPE_PTR_P (TREE_TYPE (op))
&& VOID_TYPE_P (TREE_TYPE (TREE_TYPE (op)))
-   /* Inside a call to std::construct_at or to
-  std::allocator::{,de}allocate, we permit casting from void*
+   /* Inside a call to std::construct_at,
+  std::allocator::{,de}allocate, or
+  std::source_location::current, we permit casting from void*
   because that is compiler-generated code.  */
&& !is_std_construct_at (ctx->call)
-   && !is_std_allocator_allocate (ctx->call))
+   && !is_std_allocator_allocate (ctx->call)
+   && !is_std_source_location_current (ctx->call))
  {
/* Likewise, don't error when casting from void* when OP is
   &heap uninit and similar.  */
tree sop = tree_strip_nop_conversions (op);
-   if (TREE_CODE (sop) == ADDR_EXPR
-   && VAR_P (TREE_OPERAND (sop, 0))
-   && DECL_ARTIFICIAL (TREE_OPERAND (sop, 0)))
+   tree decl = NULL_TREE;
+   if (TREE_CODE (sop) == ADDR_EXPR)
+ decl = TREE_OPERAND (sop, 0);
+   if (decl
+   && VAR_P (decl)
+   && DECL_ARTIFICIAL (decl)
+   && (DECL_NAME (decl) == heap_identifier
+   || DECL_NAME (decl) == heap_uninit_identifier
+   || DECL_NAME (decl) == heap_vec_identifier
+   || DECL_NAME (decl) == heap_vec_uninit_identifier))
  /* OK */;
/* P2738 (C++26): a conversion from a prvalue P of type "pointer to
   cv void" to a pointer-to-object type T unless P points to an
   object whose type is similar to T.  */
-   else if (cxx_dialect > cxx23
-&& (sop = cxx_fold_indirect_ref (ctx, loc,
- TREE_TYPE (type), sop)))
+   else if (cxx_dialect > cxx23)
  {
-   r = build1 (ADDR_EXPR, type, sop);
-   break;
+   r = cxx_fold_indirect_ref (ctx, loc, TREE_TYPE (type), sop);
+   if (r)
+ {
+   r = build1 (ADDR_EXPR, type, r);
+   break;
+ }
+   if (!ctx->quiet)
+ {
+   if (TREE_CODE (sop) == ADDR_EXPR)
+ {
+   error_at (loc, "cast from %qT is not allowed becau

Re: [pushed] analyzer: improvements to out-of-bounds diagrams [PR111155]

2023-10-09 Thread Tobias Burnus

Hi David,

your commit breaks compilation with GCC < 6, here with GCC 5.2:

gcc/analyzer/access-diagram.cc: In member function 'void ana::boundaries::add(const 
ana::access_range&, ana::boundaries::kind)':
gcc/analyzer/access-diagram.cc:655:20: error: 'kind' is not a class, namespace, 
or enumeration
   (kind == kind::HARD) ? "HARD" : "soft");
^
The problem is ...

On 09.10.23 00:58, David Malcolm wrote:


Update out-of-bounds diagrams to show existing string values,
diff --git a/gcc/analyzer/access-diagram.cc b/gcc/analyzer/access-diagram.cc
index a51d594b5b2..2197ec63f53 100644
--- a/gcc/analyzer/access-diagram.cc
+++ b/gcc/analyzer/access-diagram.cc
@@ -630,8 +630,8 @@ class boundaries
  public:
enum class kind { HARD, SOFT};


...


@@ -646,6 +646,15 @@ public:


Just above the following diff is the line:

  void add (const access_range &range, enum kind kind)


{
  add (range.m_start, kind);
  add (range.m_next, kind);
+if (m_logger)
+  {
+ m_logger->start_log_line ();
+ m_logger->log_partial ("added access_range: ");
+ range.dump_to_pp (m_logger->get_printer (), true);
+ m_logger->log_partial (" (%s)",
+(kind == kind::HARD) ? "HARD" : "soft");
+ m_logger->end_log_line ();


Actual problem:

Playing around with Compiler Explorer also shows that GCC 5.2, and likewise 5.5,
does not like the parameter (PARM_DECL) name "kind" combined with "kind::HARD".

The following works:
(A) Using "kind == boundaries::kind::HARD" - i.e. adding "boundaries::"
(B) Renaming the parameter name "kind" to something else - like "k" as used
in the other functions.
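
For instance, option (B) would look like this (only a sketch, reusing the
body from the diff above):

  void add (const access_range &range, enum kind k)
  {
    add (range.m_start, k);
    add (range.m_next, k);
    if (m_logger)
      {
        m_logger->start_log_line ();
        m_logger->log_partial ("added access_range: ");
        range.dump_to_pp (m_logger->get_printer (), true);
        m_logger->log_partial (" (%s)", (k == kind::HARD) ? "HARD" : "soft");
        m_logger->end_log_line ();
      }
  }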

Can you fix it?

Thanks,

Tobias



RE: [PATCH]AArch64 Add SVE implementation for cond_copysign.

2023-10-09 Thread Tamar Christina
> -Original Message-
> From: Richard Sandiford 
> Sent: Monday, October 9, 2023 10:56 AM
> To: Tamar Christina 
> Cc: Richard Biener ; gcc-patches@gcc.gnu.org;
> nd ; Richard Earnshaw ;
> Marcus Shawcroft ; Kyrylo Tkachov
> 
> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> 
> Tamar Christina  writes:
> >> -Original Message-
> >> From: Richard Sandiford 
> >> Sent: Saturday, October 7, 2023 10:58 AM
> >> To: Richard Biener 
> >> Cc: Tamar Christina ;
> >> gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> >> ; Marcus Shawcroft
> >> ; Kyrylo Tkachov
> 
> >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> >>
> >> Richard Biener  writes:
> >> > On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina
> >>  wrote:
> >> >>
> >> >> > -Original Message-
> >> >> > From: Richard Sandiford 
> >> >> > Sent: Thursday, October 5, 2023 9:26 PM
> >> >> > To: Tamar Christina 
> >> >> > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> >> >> > ; Marcus Shawcroft
> >> >> > ; Kyrylo Tkachov
> >> 
> >> >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for
> >> cond_copysign.
> >> >> >
> >> >> > Tamar Christina  writes:
> >> >> > >> -Original Message-
> >> >> > >> From: Richard Sandiford 
> >> >> > >> Sent: Thursday, October 5, 2023 8:29 PM
> >> >> > >> To: Tamar Christina 
> >> >> > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard
> >> >> > >> Earnshaw ; Marcus Shawcroft
> >> >> > >> ; Kyrylo Tkachov
> >> >> > 
> >> >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for
> >> cond_copysign.
> >> >> > >>
> >> >> > >> Tamar Christina  writes:
> >> >> > >> > Hi All,
> >> >> > >> >
> >> >> > >> > This adds an implementation for masked copysign along with
> >> >> > >> > an optimized pattern for masked copysign (x, -1).
> >> >> > >>
> >> >> > >> It feels like we're ending up with a lot of AArch64-specific
> >> >> > >> code that just hard- codes the observation that changing the
> >> >> > >> sign is equivalent to changing the top bit.  We then need to
> >> >> > >> make sure that we choose the best way of changing the top bit
> >> >> > >> for any
> >> given situation.
> >> >> > >>
> >> >> > >> Hard-coding the -1/negative case is one instance of that.
> >> >> > >> But it looks like we also fail to use the best sequence for SVE2.  
> >> >> > >> E.g.
> >> >> > >> [https://godbolt.org/z/ajh3MM5jv]:
> >> >> > >>
> >> >> > >> #include 
> >> >> > >>
> >> >> > >> void f(double *restrict a, double *restrict b) {
> >> >> > >> for (int i = 0; i < 100; ++i)
> >> >> > >> a[i] = __builtin_copysign(a[i], b[i]); }
> >> >> > >>
> >> >> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
> >> >> > >> for (int i = 0; i < 100; ++i)
> >> >> > >> a[i] = (a[i] & ~c) | (b[i] & c); }
> >> >> > >>
> >> >> > >> gives:
> >> >> > >>
> >> >> > >> f:
> >> >> > >> mov x2, 0
> >> >> > >> mov w3, 100
> >> >> > >> whilelo p7.d, wzr, w3
> >> >> > >> .L2:
> >> >> > >> ld1dz30.d, p7/z, [x0, x2, lsl 3]
> >> >> > >> ld1dz31.d, p7/z, [x1, x2, lsl 3]
> >> >> > >> and z30.d, z30.d, #0x7fff
> >> >> > >> and z31.d, z31.d, #0x8000
> >> >> > >> orr z31.d, z31.d, z30.d
> >> >> > >> st1dz31.d, p7, [x0, x2, lsl 3]
> >> >> > >> incdx2
> >> >> > >> whilelo p7.d, w2, w3
> >> >> > >> b.any   .L2
> >> >> > >> ret
> >> >> > >> g:
> >> >> > >> mov x3, 0
> >> >> > >> mov w4, 100
> >> >> > >> mov z29.d, x2
> >> >> > >> whilelo p7.d, wzr, w4
> >> >> > >> .L6:
> >> >> > >> ld1dz30.d, p7/z, [x0, x3, lsl 3]
> >> >> > >> ld1dz31.d, p7/z, [x1, x3, lsl 3]
> >> >> > >> bsl z31.d, z31.d, z30.d, z29.d
> >> >> > >> st1dz31.d, p7, [x0, x3, lsl 3]
> >> >> > >> incdx3
> >> >> > >> whilelo p7.d, w3, w4
> >> >> > >> b.any   .L6
> >> >> > >> ret
> >> >> > >>
> >> >> > >> I saw that you originally tried to do this in match.pd and
> >> >> > >> that the decision was to fold to copysign instead.  But
> >> >> > >> perhaps there's a compromise where isel does something with
> >> >> > >> the (new) copysign canonical
> >> >> > form?
> >> >> > >> I.e. could we go with your new version of the match.pd patch,
> >> >> > >> and add some isel stuff as a follow-on?
> 
> [A]
> 
> >> >> > >>
> >> >> > >
> >> >> > > Sure if that's what's desired But..
> >> >> > >
> >> >> > > The example you posted above is for instance worse for x86
> >> >> > > https://godbolt.org/z/x9ccqxW6T where the first operation has
> >> >> > > a dependency chain of 2 and the latter of 3.  It's likely any
> >> >> > > open coding of this
> >> >> > operation is going to hurt a target.
> >> >> > >
> >> >> > > So I'm unsure what isel transform this into...
> >> >> >
> >> >> > I didn't mean that we should go straight to using isel for the
> >> >> 

Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.

2023-10-09 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Monday, October 9, 2023 10:56 AM
>> To: Tamar Christina 
>> Cc: Richard Biener ; gcc-patches@gcc.gnu.org;
>> nd ; Richard Earnshaw ;
>> Marcus Shawcroft ; Kyrylo Tkachov
>> 
>> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>> 
>> Tamar Christina  writes:
>> >> -Original Message-
>> >> From: Richard Sandiford 
>> >> Sent: Saturday, October 7, 2023 10:58 AM
>> >> To: Richard Biener 
>> >> Cc: Tamar Christina ;
>> >> gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> >> ; Marcus Shawcroft
>> >> ; Kyrylo Tkachov
>> 
>> >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>> >>
>> >> Richard Biener  writes:
>> >> > On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina
>> >>  wrote:
>> >> >>
>> >> >> > -Original Message-
>> >> >> > From: Richard Sandiford 
>> >> >> > Sent: Thursday, October 5, 2023 9:26 PM
>> >> >> > To: Tamar Christina 
>> >> >> > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> >> >> > ; Marcus Shawcroft
>> >> >> > ; Kyrylo Tkachov
>> >> 
>> >> >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for
>> >> cond_copysign.
>> >> >> >
>> >> >> > Tamar Christina  writes:
>> >> >> > >> -Original Message-
>> >> >> > >> From: Richard Sandiford 
>> >> >> > >> Sent: Thursday, October 5, 2023 8:29 PM
>> >> >> > >> To: Tamar Christina 
>> >> >> > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard
>> >> >> > >> Earnshaw ; Marcus Shawcroft
>> >> >> > >> ; Kyrylo Tkachov
>> >> >> > 
>> >> >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for
>> >> cond_copysign.
>> >> >> > >>
>> >> >> > >> Tamar Christina  writes:
>> >> >> > >> > Hi All,
>> >> >> > >> >
>> >> >> > >> > This adds an implementation for masked copysign along with
>> >> >> > >> > an optimized pattern for masked copysign (x, -1).
>> >> >> > >>
>> >> >> > >> It feels like we're ending up with a lot of AArch64-specific
>> >> >> > >> code that just hard- codes the observation that changing the
>> >> >> > >> sign is equivalent to changing the top bit.  We then need to
>> >> >> > >> make sure that we choose the best way of changing the top bit
>> >> >> > >> for any
>> >> given situation.
>> >> >> > >>
>> >> >> > >> Hard-coding the -1/negative case is one instance of that.
>> >> >> > >> But it looks like we also fail to use the best sequence for SVE2. 
>> >> >> > >>  E.g.
>> >> >> > >> [https://godbolt.org/z/ajh3MM5jv]:
>> >> >> > >>
>> >> >> > >> #include 
>> >> >> > >>
>> >> >> > >> void f(double *restrict a, double *restrict b) {
>> >> >> > >> for (int i = 0; i < 100; ++i)
>> >> >> > >> a[i] = __builtin_copysign(a[i], b[i]); }
>> >> >> > >>
>> >> >> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
>> >> >> > >> for (int i = 0; i < 100; ++i)
>> >> >> > >> a[i] = (a[i] & ~c) | (b[i] & c); }
>> >> >> > >>
>> >> >> > >> gives:
>> >> >> > >>
>> >> >> > >> f:
>> >> >> > >> mov x2, 0
>> >> >> > >> mov w3, 100
>> >> >> > >> whilelo p7.d, wzr, w3
>> >> >> > >> .L2:
>> >> >> > >> ld1dz30.d, p7/z, [x0, x2, lsl 3]
>> >> >> > >> ld1dz31.d, p7/z, [x1, x2, lsl 3]
>> >> >> > >> and z30.d, z30.d, #0x7fff
>> >> >> > >> and z31.d, z31.d, #0x8000
>> >> >> > >> orr z31.d, z31.d, z30.d
>> >> >> > >> st1dz31.d, p7, [x0, x2, lsl 3]
>> >> >> > >> incdx2
>> >> >> > >> whilelo p7.d, w2, w3
>> >> >> > >> b.any   .L2
>> >> >> > >> ret
>> >> >> > >> g:
>> >> >> > >> mov x3, 0
>> >> >> > >> mov w4, 100
>> >> >> > >> mov z29.d, x2
>> >> >> > >> whilelo p7.d, wzr, w4
>> >> >> > >> .L6:
>> >> >> > >> ld1dz30.d, p7/z, [x0, x3, lsl 3]
>> >> >> > >> ld1dz31.d, p7/z, [x1, x3, lsl 3]
>> >> >> > >> bsl z31.d, z31.d, z30.d, z29.d
>> >> >> > >> st1dz31.d, p7, [x0, x3, lsl 3]
>> >> >> > >> incdx3
>> >> >> > >> whilelo p7.d, w3, w4
>> >> >> > >> b.any   .L6
>> >> >> > >> ret
>> >> >> > >>
>> >> >> > >> I saw that you originally tried to do this in match.pd and
>> >> >> > >> that the decision was to fold to copysign instead.  But
>> >> >> > >> perhaps there's a compromise where isel does something with
>> >> >> > >> the (new) copysign canonical
>> >> >> > form?
>> >> >> > >> I.e. could we go with your new version of the match.pd patch,
>> >> >> > >> and add some isel stuff as a follow-on?
>> 
>> [A]
>> 
>> >> >> > >>
>> >> >> > >
>> >> >> > > Sure if that's what's desired But..
>> >> >> > >
>> >> >> > > The example you posted above is for instance worse for x86
>> >> >> > > https://godbolt.org/z/x9ccqxW6T where the first operation has
>> >> >> > > a dependency chain of 2 and the latter of 3.  It's likely any
>> >> >> > > open coding of this
>> >> >> > operation is going to hurt a target.
>> >> >> 

Re: [PATCH V2] TEST: Fix vect_cond_arith_* dump checks for RVV

2023-10-09 Thread Robin Dapp
On 10/9/23 09:32, Andreas Schwab wrote:
> On Okt 09 2023, juzhe.zh...@rivai.ai wrote:
> 
>> Turns out COND(_LEN)?_ADD can't work.
> 
> It should work though.  Tcl regexps are a superset of POSIX EREs.
> 

The problem is that COND(_LEN)?_ADD matches two times against
COND_LEN_ADD and a scan-tree-dump-times 1 will fail.  So for those
checks in vect-cond-arith-6.c we either need to switch to
scan-tree-dump or change the pattern to "\.(?:COND|COND_LEN)_ADD".
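
For vect-cond-arith-6.c the latter variant would be, as a sketch:

/* { dg-final { scan-tree-dump-times { = \.(?:COND|COND_LEN)_ADD} 1 "optimized" { target vect_double_cond_arith } } } */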

Juzhe, something like the attached works for me.

Regards
 Robin

diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c 
b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
index 1af0fe642a0..7d26dbedc5e 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
@@ -52,8 +52,8 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump { = \.COND_ADD} "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump { = \.COND_SUB} "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump { = \.COND_MUL} "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump { = \.COND_RDIV} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target 
vect_double_cond_arith } } } */
 /* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target 
vect_double_cond_arith } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c 
b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
index ec3d9db4202..f7daa13685c 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
@@ -54,8 +54,8 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump { = \.COND_ADD} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
-/* { dg-final { scan-tree-dump { = \.COND_SUB} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
-/* { dg-final { scan-tree-dump { = \.COND_MUL} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
-/* { dg-final { scan-tree-dump { = \.COND_RDIV} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
 /* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c 
b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
index 2aeebd44f83..a80c30a50b2 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
@@ -56,8 +56,8 @@ main (void)
 }
 
 /* { dg-final { scan-tree-dump-times {vectorizing stmts using SLP} 4 "vect" { 
target vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump-times { = \.COND_ADD} 1 "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump-times { = \.COND_SUB} 1 "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump-times { = \.COND_MUL} 1 "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump-times { = \.COND_RDIV} 1 "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target 
vect_double_cond_arith } } } */
 /* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target 
vect_double_cond_arith } } } */


Re: Re: [PATCH V2] TEST: Fix vect_cond_arith_* dump checks for RVV

2023-10-09 Thread juzhe.zh...@rivai.ai
Thanks Robin. Could you send V3 to Richi ? And commit it if Richi is ok with 
that.



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-10-09 18:26
To: Andreas Schwab; juzhe.zhong
CC: rdapp.gcc; gcc-patches; rguenther; jeffreyalaw
Subject: Re: [PATCH V2] TEST: Fix vect_cond_arith_* dump checks for RVV
On 10/9/23 09:32, Andreas Schwab wrote:
> On Okt 09 2023, juzhe.zh...@rivai.ai wrote:
> 
>> Turns out COND(_LEN)?_ADD can't work.
> 
> It should work though.  Tcl regexps are a superset of POSIX EREs.
> 
 
The problem is that COND(_LEN)?_ADD matches two times against
COND_LEN_ADD and a scan-tree-dump-times 1 will fail.  So for those
checks in vect-cond-arith-6.c we either need to switch to
scan-tree-dump or change the pattern to "\.(?:COND|COND_LEN)_ADD".
 
Juzhe, something like the attached works for me.
 
Regards
Robin
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c 
b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
index 1af0fe642a0..7d26dbedc5e 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
@@ -52,8 +52,8 @@ main (void)
   return 0;
}
-/* { dg-final { scan-tree-dump { = \.COND_ADD} "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump { = \.COND_SUB} "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump { = \.COND_MUL} "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump { = \.COND_RDIV} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target 
vect_double_cond_arith } } } */
/* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target 
vect_double_cond_arith } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c 
b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
index ec3d9db4202..f7daa13685c 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
@@ -54,8 +54,8 @@ main (void)
   return 0;
}
-/* { dg-final { scan-tree-dump { = \.COND_ADD} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
-/* { dg-final { scan-tree-dump { = \.COND_SUB} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
-/* { dg-final { scan-tree-dump { = \.COND_MUL} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
-/* { dg-final { scan-tree-dump { = \.COND_RDIV} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
/* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c 
b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
index 2aeebd44f83..a80c30a50b2 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
@@ -56,8 +56,8 @@ main (void)
}
/* { dg-final { scan-tree-dump-times {vectorizing stmts using SLP} 4 "vect" { 
target vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump-times { = \.COND_ADD} 1 "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump-times { = \.COND_SUB} 1 "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump-times { = \.COND_MUL} 1 "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump-times { = \.COND_RDIV} 1 "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target 
vect_double_cond_arith } } } */
/* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target 
vect_double_cond_arith } } } */
 


RE: [PATCH v1] RISC-V: Refine bswap16 auto vectorization code gen

2023-10-09 Thread Li, Pan2
Sure thing, will send V2 for this change.

Pan

From: juzhe.zh...@rivai.ai 
Sent: Monday, October 9, 2023 5:04 PM
To: Li, Pan2 ; gcc-patches 
Cc: Li, Pan2 ; Wang, Yanzhang ; 
kito.cheng 
Subject: Re: [PATCH v1] RISC-V: Refine bswap16 auto vectorization code gen

Remove these functions:


+static void
+emit_vec_sll_scalar (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx sll_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred_scalar (ASHIFT, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, sll_ops);
+}
+
+static void
+emit_vec_srl_scalar (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx srl_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred_scalar (LSHIFTRT, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, srl_ops);
+}
+
+static void
+emit_vec_or (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx or_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred (IOR, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, or_ops);
+}
+

Instead,

For sll, you should use:
rtx tmp
= expand_binop (Pmode, ashl_optab, op_1,
gen_int_mode (8, Pmode), NULL_RTX, 0,
OPTAB_DIRECT);

For srl, you should use:
rtx tmp
= expand_binop (Pmode, lshiftrt_optab, op_1,
gen_int_mode (8, Pmode), NULL_RTX, 0,
OPTAB_DIRECT);


For or, you should use:
expand_binop (Pmode, ior_optab, tmp, dest, NULL_RTX, 0,
   OPTAB_DIRECT);


juzhe.zh...@rivai.ai

From: pan2.li
Date: 2023-10-09 16:51
To: gcc-patches
CC: juzhe.zhong; 
pan2.li; 
yanzhang.wang; 
kito.cheng
Subject: [PATCH v1] RISC-V: Refine bswap16 auto vectorization code gen
From: Pan Li <pan2...@intel.com>

This patch would like to refine the code gen for the bswap16.

We will have VEC_PERM_EXPR after rtl expand when invoking
__builtin_bswap. It will generate about 9 instructions in
loop as below, no matter it is bswap16, bswap32 or bswap64.

  .L2:
1 vle16.v v4,0(a0)
2 vmv.v.x v2,a7
3 vand.vv v2,v6,v2
4 slli a2,a5,1
5 vrgatherei16.vv v1,v4,v2
6 sub a4,a4,a5
7 vse16.v v1,0(a3)
8 add a0,a0,a2
9 add a3,a3,a2
  bne a4,zero,.L2

But for bswap16 we may have an even simpler code gen, which
has only 7 instructions in the loop as below.

  .L5
1 vle8.v  v2,0(a5)
2 addi a5,a5,32
3 vsrl.vi v4,v2,8
4 vsll.vi v2,v2,8
5 vor.vv  v4,v4,v2
6 vse8.v  v4,0(a4)
7 addi a4,a4,32
  bne a5,a6,.L5

Unfortunately, this way will make the insn count in the loop grow up to
13 and 24 for bswap32 and bswap64. Thus, we will refine the code
gen for the bswap16 only, and leave both the bswap32 and bswap64
as is.

gcc/ChangeLog:

* config/riscv/riscv-v.cc (emit_vec_sll_scalar): New help func
impl for emit vsll.vi/vsll.vx
(emit_vec_srl_scalar): Likewise for vsrl.vi/vsrl.vx.
(emit_vec_or): Likewise for vor.vv.
(shuffle_bswap_pattern): New func impl for shuffle bswap.
(expand_vec_perm_const_1): Add shuffle bswap pattern.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vls/perm-4.c: Adjust checker.
* gcc.target/riscv/rvv/autovec/unop/bswap16-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/bswap16-0.c: New test.

Signed-off-by: Pan Li <pan2...@intel.com>
---
gcc/config/riscv/riscv-v.cc   | 117 ++
.../riscv/rvv/autovec/unop/bswap16-0.c|  17 +++
.../riscv/rvv/autovec/unop/bswap16-run-0.c|  44 +++
.../riscv/rvv/autovec/vls/bswap16-0.c |  34 +
.../gcc.target/riscv/rvv/autovec/vls/perm-4.c |   4 +-
5 files changed, 214 insertions(+), 2 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/bswap16-0.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 23633a2a74d..3e3b5f2e797 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -878,6 +878,33 @@ emit_vlmax_decompress_insn (rtx target, rtx op0, rtx op1, 
rtx mask)
   emit_vlmax_masked_gather_mu_insn (target, op1, sel, mask);
}
+static void
+emit_vec_sll_scalar (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx sll_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred_scalar (ASHIFT, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, sll_ops);
+}
+
+static void
+emit_vec_srl_scalar (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx srl_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred_scalar (LSHIFTRT, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, srl_ops);
+}
+
+static void
+emit_vec_or (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx or_ops[]

[PATCH] tree-optimization/111715 - improve TBAA for access paths with pun

2023-10-09 Thread Richard Biener
The following improves basic TBAA for access paths formed by
C++ abstraction where we are able to combine a path from an
address-taking operation with a path based on that access using
a pun to avoid memory access semantics on the address-taking part.

The trick is to identify the point the semantic memory access path
starts which allows us to use the alias set of the outermost access
instead of only that of the base of this path.

Bootstrapped and tested on x86_64-unknown-linux-gnu for all languages
with a slightly different variant, re-bootstrapping/testing now
(with doing the extra walk just for AGGREGATE_TYPE_P).

PR tree-optimization/111715
* alias.cc (reference_alias_ptr_type_1): When we have
a type-punning ref at the base search for the access
path part that's still semantically valid.

* gcc.dg/tree-ssa/ssa-fre-102.c: New testcase.
---
 gcc/alias.cc| 20 -
 gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c | 32 +
 2 files changed, 51 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c

diff --git a/gcc/alias.cc b/gcc/alias.cc
index 7c1af1fe96e..4060ff72949 100644
--- a/gcc/alias.cc
+++ b/gcc/alias.cc
@@ -774,7 +774,25 @@ reference_alias_ptr_type_1 (tree *t)
   && (TYPE_MAIN_VARIANT (TREE_TYPE (inner))
  != TYPE_MAIN_VARIANT
   (TREE_TYPE (TREE_TYPE (TREE_OPERAND (inner, 1))
-return TREE_TYPE (TREE_OPERAND (inner, 1));
+{
+  tree alias_ptrtype = TREE_TYPE (TREE_OPERAND (inner, 1));
+  /* Unless we have the (aggregate) effective type of the access
+somewhere on the access path.  If we have for example
+(&a->elts[i])->l.len exposed by abstraction we'd see
+MEM  [(B *)a].elts[i].l.len and we can use the alias set
+of 'len' when typeof (MEM  [(B *)a].elts[i]) == B for
+example.  See PR111715.  */
+  if (AGGREGATE_TYPE_P (TREE_TYPE (alias_ptrtype)))
+   {
+ tree inner = *t;
+ while (handled_component_p (inner)
+&& (TYPE_MAIN_VARIANT (TREE_TYPE (inner))
+!= TYPE_MAIN_VARIANT (TREE_TYPE (alias_ptrtype
+   inner = TREE_OPERAND (inner, 0);
+   }
+  if (TREE_CODE (inner) == MEM_REF)
+   return alias_ptrtype;
+}
 
   /* Otherwise, pick up the outermost object that we could have
  a pointer to.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c
new file mode 100644
index 000..afd48050819
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c
@@ -0,0 +1,32 @@
+/* PR/111715 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-fre1" } */
+
+struct B {
+   struct { int len; } l;
+   long n;
+};
+struct A {
+   struct B elts[8];
+};
+
+static void
+set_len (struct B *b, int len)
+{
+  b->l.len = len;
+}
+
+static int
+get_len (struct B *b)
+{
+  return b->l.len;
+}
+
+int foo (struct A *a, int i, long *q)
+{
+  set_len (&a->elts[i], 1);
+  *q = 2;
+  return get_len (&a->elts[i]);
+}
+
+/* { dg-final { scan-tree-dump "return 1;" "fre1" } } */
-- 
2.35.3


Re: [PATCH V2] TEST: Fix vect_cond_arith_* dump checks for RVV

2023-10-09 Thread Richard Biener
On Mon, 9 Oct 2023, Robin Dapp wrote:

> On 10/9/23 09:32, Andreas Schwab wrote:
> > On Okt 09 2023, juzhe.zh...@rivai.ai wrote:
> > 
> >> Turns out COND(_LEN)?_ADD can't work.
> > 
> > It should work though.  Tcl regexps are a superset of POSIX EREs.
> > 
> 
> The problem is that COND(_LEN)?_ADD matches two times against
> COND_LEN_ADD and a scan-tree-dump-times 1 will fail.  So for those
> checks in vect-cond-arith-6.c we either need to switch to
> scan-tree-dump or change the pattern to "\.(?:COND|COND_LEN)_ADD".
> 
> Juzhe, something like the attached works for me.

LGTM.

Richard.

> Regards
>  Robin
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c 
> b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
> index 1af0fe642a0..7d26dbedc5e 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
> @@ -52,8 +52,8 @@ main (void)
>return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump { = \.COND_ADD} "optimized" { target 
> vect_double_cond_arith } } } */
> -/* { dg-final { scan-tree-dump { = \.COND_SUB} "optimized" { target 
> vect_double_cond_arith } } } */
> -/* { dg-final { scan-tree-dump { = \.COND_MUL} "optimized" { target 
> vect_double_cond_arith } } } */
> -/* { dg-final { scan-tree-dump { = \.COND_RDIV} "optimized" { target 
> vect_double_cond_arith } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target 
> vect_double_cond_arith } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target 
> vect_double_cond_arith } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target 
> vect_double_cond_arith } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target 
> vect_double_cond_arith } } } */
>  /* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target 
> vect_double_cond_arith } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c 
> b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
> index ec3d9db4202..f7daa13685c 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
> @@ -54,8 +54,8 @@ main (void)
>return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump { = \.COND_ADD} "optimized" { target { 
> vect_double_cond_arith && vect_masked_store } } } } */
> -/* { dg-final { scan-tree-dump { = \.COND_SUB} "optimized" { target { 
> vect_double_cond_arith && vect_masked_store } } } } */
> -/* { dg-final { scan-tree-dump { = \.COND_MUL} "optimized" { target { 
> vect_double_cond_arith && vect_masked_store } } } } */
> -/* { dg-final { scan-tree-dump { = \.COND_RDIV} "optimized" { target { 
> vect_double_cond_arith && vect_masked_store } } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target { 
> vect_double_cond_arith && vect_masked_store } } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target { 
> vect_double_cond_arith && vect_masked_store } } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target { 
> vect_double_cond_arith && vect_masked_store } } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target 
> { vect_double_cond_arith && vect_masked_store } } } } */
>  /* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target { 
> vect_double_cond_arith && vect_masked_store } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c 
> b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
> index 2aeebd44f83..a80c30a50b2 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
> @@ -56,8 +56,8 @@ main (void)
>  }
>  
>  /* { dg-final { scan-tree-dump-times {vectorizing stmts using SLP} 4 "vect" 
> { target vect_double_cond_arith } } } */
> -/* { dg-final { scan-tree-dump-times { = \.COND_ADD} 1 "optimized" { target 
> vect_double_cond_arith } } } */
> -/* { dg-final { scan-tree-dump-times { = \.COND_SUB} 1 "optimized" { target 
> vect_double_cond_arith } } } */
> -/* { dg-final { scan-tree-dump-times { = \.COND_MUL} 1 "optimized" { target 
> vect_double_cond_arith } } } */
> -/* { dg-final { scan-tree-dump-times { = \.COND_RDIV} 1 "optimized" { target 
> vect_double_cond_arith } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target 
> vect_double_cond_arith } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target 
> vect_double_cond_arith } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target 
> vect_double_cond_arith } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target 
> vect_double_cond_arith } } } */
>  /* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target 
> vect_double_cond_arith } } } */
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany

Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.

2023-10-09 Thread Richard Biener
On Mon, Oct 9, 2023 at 12:17 PM Richard Sandiford
 wrote:
>
> Tamar Christina  writes:
> >> -Original Message-
> >> From: Richard Sandiford 
> >> Sent: Monday, October 9, 2023 10:56 AM
> >> To: Tamar Christina 
> >> Cc: Richard Biener ; gcc-patches@gcc.gnu.org;
> >> nd ; Richard Earnshaw ;
> >> Marcus Shawcroft ; Kyrylo Tkachov
> >> 
> >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> >>
> >> Tamar Christina  writes:
> >> >> -Original Message-
> >> >> From: Richard Sandiford 
> >> >> Sent: Saturday, October 7, 2023 10:58 AM
> >> >> To: Richard Biener 
> >> >> Cc: Tamar Christina ;
> >> >> gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> >> >> ; Marcus Shawcroft
> >> >> ; Kyrylo Tkachov
> >> 
> >> >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> >> >>
> >> >> Richard Biener  writes:
> >> >> > On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina
> >> >>  wrote:
> >> >> >>
> >> >> >> > -Original Message-
> >> >> >> > From: Richard Sandiford 
> >> >> >> > Sent: Thursday, October 5, 2023 9:26 PM
> >> >> >> > To: Tamar Christina 
> >> >> >> > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> >> >> >> > ; Marcus Shawcroft
> >> >> >> > ; Kyrylo Tkachov
> >> >> 
> >> >> >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for
> >> >> cond_copysign.
> >> >> >> >
> >> >> >> > Tamar Christina  writes:
> >> >> >> > >> -Original Message-
> >> >> >> > >> From: Richard Sandiford 
> >> >> >> > >> Sent: Thursday, October 5, 2023 8:29 PM
> >> >> >> > >> To: Tamar Christina 
> >> >> >> > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard
> >> >> >> > >> Earnshaw ; Marcus Shawcroft
> >> >> >> > >> ; Kyrylo Tkachov
> >> >> >> > 
> >> >> >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for
> >> >> cond_copysign.
> >> >> >> > >>
> >> >> >> > >> Tamar Christina  writes:
> >> >> >> > >> > Hi All,
> >> >> >> > >> >
> >> >> >> > >> > This adds an implementation for masked copysign along with
> >> >> >> > >> > an optimized pattern for masked copysign (x, -1).
> >> >> >> > >>
> >> >> >> > >> It feels like we're ending up with a lot of AArch64-specific
> >> >> >> > >> code that just hard- codes the observation that changing the
> >> >> >> > >> sign is equivalent to changing the top bit.  We then need to
> >> >> >> > >> make sure that we choose the best way of changing the top bit
> >> >> >> > >> for any
> >> >> given situation.
> >> >> >> > >>
> >> >> >> > >> Hard-coding the -1/negative case is one instance of that.
> >> >> >> > >> But it looks like we also fail to use the best sequence for 
> >> >> >> > >> SVE2.  E.g.
> >> >> >> > >> [https://godbolt.org/z/ajh3MM5jv]:
> >> >> >> > >>
> >> >> >> > >> #include 
> >> >> >> > >>
> >> >> >> > >> void f(double *restrict a, double *restrict b) {
> >> >> >> > >> for (int i = 0; i < 100; ++i)
> >> >> >> > >> a[i] = __builtin_copysign(a[i], b[i]); }
> >> >> >> > >>
> >> >> >> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
> >> >> >> > >> for (int i = 0; i < 100; ++i)
> >> >> >> > >> a[i] = (a[i] & ~c) | (b[i] & c); }
> >> >> >> > >>
> >> >> >> > >> gives:
> >> >> >> > >>
> >> >> >> > >> f:
> >> >> >> > >> mov x2, 0
> >> >> >> > >> mov w3, 100
> >> >> >> > >> whilelo p7.d, wzr, w3
> >> >> >> > >> .L2:
> >> >> >> > >> ld1d z30.d, p7/z, [x0, x2, lsl 3]
> >> >> >> > >> ld1d z31.d, p7/z, [x1, x2, lsl 3]
> >> >> >> > >> and z30.d, z30.d, #0x7fff
> >> >> >> > >> and z31.d, z31.d, #0x8000
> >> >> >> > >> orr z31.d, z31.d, z30.d
> >> >> >> > >> st1d z31.d, p7, [x0, x2, lsl 3]
> >> >> >> > >> incd x2
> >> >> >> > >> whilelo p7.d, w2, w3
> >> >> >> > >> b.any   .L2
> >> >> >> > >> ret
> >> >> >> > >> g:
> >> >> >> > >> mov x3, 0
> >> >> >> > >> mov w4, 100
> >> >> >> > >> mov z29.d, x2
> >> >> >> > >> whilelo p7.d, wzr, w4
> >> >> >> > >> .L6:
> >> >> >> > >> ld1d z30.d, p7/z, [x0, x3, lsl 3]
> >> >> >> > >> ld1d z31.d, p7/z, [x1, x3, lsl 3]
> >> >> >> > >> bsl z31.d, z31.d, z30.d, z29.d
> >> >> >> > >> st1d z31.d, p7, [x0, x3, lsl 3]
> >> >> >> > >> incd x3
> >> >> >> > >> whilelo p7.d, w3, w4
> >> >> >> > >> b.any   .L6
> >> >> >> > >> ret
> >> >> >> > >>
> >> >> >> > >> I saw that you originally tried to do this in match.pd and
> >> >> >> > >> that the decision was to fold to copysign instead.  But
> >> >> >> > >> perhaps there's a compromise where isel does something with
> >> >> >> > >> the (new) copysign canonical
> >> >> >> > form?
> >> >> >> > >> I.e. could we go with your new version of the match.pd patch,
> >> >> >> > >> and add some isel stuff as a follow-on?
> >>
> >> [A]
> >>
> >> >> >> > >>
> >> >> >> > >
> >> >> >> > > Sure if that's what's desired But..
> >> >> >> > >
> >> >>

Re: PR111648: Fix wrong code-gen due to incorrect VEC_PERM_EXPR folding

2023-10-09 Thread Richard Sandiford
Prathamesh Kulkarni  writes:
> Hi,
> The attached patch attempts to fix PR111648.
> As mentioned in PR, the issue is when a1 is a multiple of vector
> length, we end up creating following encoding in result: { base_elem,
> arg[0], arg[1], ... } (assuming S = 1),
> where arg is chosen input vector, which is incorrect, since the
> encoding originally in arg would be: { arg[0], arg[1], arg[2], ... }
>
> For the test-case mentioned in PR, vectorizer pass creates
> VEC_PERM_EXPR where:
> arg0: { -16, -9, -10, -11 }
> arg1: { -12, -5, -6, -7 }
> sel = { 3, 4, 5, 6 }
>
> arg0, arg1 and sel are encoded with npatterns = 1 and nelts_per_pattern = 3.
> Since a1 = 4 and arg_len = 4, it ended up creating the result with
> following encoding:
> res = { arg0[3], arg1[0], arg1[1] } // npatterns = 1, nelts_per_pattern = 3
>   = { -11, -12, -5 }
>
> So for res[3], it used S = (-5) - (-12) = 7
> And hence computed it as -5 + 7 = 2.
> instead of selecting arg1[2], ie, -6.
>
> The patch tweaks valid_mask_for_fold_vec_perm_cst_p to punt if a1 is a 
> multiple
> of vector length, so a1 ... ae select elements only from stepped part
> of the pattern
> from input vector and return false for this case.
>
> Since the vectors are VLS, fold_vec_perm_cst then sets:
> res_npatterns = res_nelts
> res_nelts_per_pattern  = 1
> which seems to fix the issue by encoding all the elements.
>
> The patch resulted in Case 4 and Case 5 failing from test_nunits_min_2 because
> they used sel = { 0, 0, 1, ... } and {len, 0, 1, ... } respectively,
> which used a1 = 0, and thus selected arg1[0].
>
> I removed Case 4 because it was already covered in test_nunits_min_4,
> and moved Case 5 to test_nunits_min_4, with sel = { len, 1, 2, ... }
> and added a new Case 9 to test for this issue.
>
> Passes bootstrap+test on aarch64-linux-gnu with and without SVE,
> and on x86_64-linux-gnu.
> Does the patch look OK ?
>
> Thanks,
> Prathamesh
>
> [PR111648] Fix wrong code-gen due to incorrect VEC_PERM_EXPR folding.
>
> gcc/ChangeLog:
>   PR tree-optimization/111648
>   * fold-const.cc (valid_mask_for_fold_vec_perm_cst_p): Punt if a1
>   is a multiple of vector length.
>   (test_nunits_min_2): Remove Case 4 and move Case 5 to ...
>   (test_nunits_min_4): ... here and rename case numbers. Also add
>   Case 9.
>
> gcc/testsuite/ChangeLog:
>   PR tree-optimization/111648
>   * gcc.dg/vect/pr111648.c: New test.
>
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 4f8561509ff..c5f421d6b76 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -10682,8 +10682,8 @@ valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree 
> arg1,
> return false;
>   }
>  
> -  /* Ensure that the stepped sequence always selects from the same
> -  input pattern.  */
> +  /* Ensure that the stepped sequence always selects from the stepped
> +  part of same input pattern.  */
>unsigned arg_npatterns
>   = ((q1 & 1) == 0) ? VECTOR_CST_NPATTERNS (arg0)
> : VECTOR_CST_NPATTERNS (arg1);
> @@ -10694,6 +10694,20 @@ valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree 
> arg1,
>   *reason = "step is not multiple of npatterns";
> return false;
>   }
> +
> +  /* If a1 is a multiple of len, it will select base element of input
> +  vector resulting in following encoding:
> +  { base_elem, arg[0], arg[1], ... } where arg is the chosen input
> +  vector. This encoding is not originally present in arg, since it's
> +  defined as:
> +  { arg[0], arg[1], arg[2], ... }.  */
> +
> +  if (multiple_p (a1, arg_len))
> + {
> +   if (reason)
> + *reason = "selecting base element of input vector";
> +   return false;
> + }

That wouldn't catch (for example) cases where a1 == arg_len + 1 and the
second argument has 2 stepped patterns.

The equivalent condition that handles multiple patterns would
probably be to reject q1 < arg_npatterns.  But that's only necessary if:

(1) the argument has three elements per pattern (i.e. has a stepped
sequence) and

(2) element 2 - element 1 != element 1 - element 0

I think we should check those to avoid pessimising VLA cases.
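In code, that could look roughly like the following (sketch only, untested;
arg_nelts_per_pattern and arg_elt () are hypothetical stand-ins for however
the chosen argument's pattern is queried):

  if (arg_nelts_per_pattern == 3
      && arg_elt (2) - arg_elt (1) != arg_elt (1) - arg_elt (0)
      && q1 < arg_npatterns)
    return false;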

Thanks,
Richard

>  }
>  
>return true;
> @@ -17425,47 +17439,6 @@ test_nunits_min_2 (machine_mode vmode)
>   tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
>   validate_res (2, 2, res, expected_res);
>}
> -
> -  /* Case 4: mask = {0, 0, 1, ...} // (1, 3)
> -  Test that the stepped sequence of the pattern selects from
> -  same input pattern. Since input vectors have npatterns = 2,
> -  and step (a2 - a1) = 1, step is not a multiple of npatterns
> -  in input vector. So return NULL_TREE.  */
> -  {
> - tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> - tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> - poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> -
> - vec_perm_builder buil

Re: [PATCH] ifcvt/vect: Emit COND_ADD for conditional scalar reduction.

2023-10-09 Thread Robin Dapp
> It'd be good to expand on this comment a bit.  What kind of COND are you
> anticipating?  A COND with the neutral op as the else value, so that the
> PLUS_EXPR (or whatever) can remain unconditional?  If so, it would be
> good to sketch briefly how that happens, and why it's better than using
> the conditional PLUS_EXPR.
> 
> If that's the reason, perhaps we want a single-use check as well.
> It's possible that OP1 is used elsewhere in the loop body, in a
> context that would prefer a different else value.

Would something like the following on top work?

-  /* If possible try to create an IFN_COND_ADD instead of a COND_EXPR and
- a PLUS_EXPR.  Don't do this if the reduction def operand itself is
+  /* If possible create a COND_OP instead of a COND_EXPR and an OP_EXPR.
+ The COND_OP will have a neutral_op else value.
+
+ This allows re-using the mask directly in a masked reduction instead
+ of creating a vector merge (or similar) and then an unmasked reduction.
+
+ Don't do this if the reduction def operand itself is
  a vectorizable call as we can create a COND version of it directly.  */

   if (ifn != IFN_LAST
   && vectorized_internal_fn_supported_p (ifn, TREE_TYPE (lhs))
-  && try_cond_op && !swap)
+  && use_cond_op && !swap && has_single_use (op1))

Regards
 Robin
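
For context, a minimal sketch (not taken from the patch or its testcases) of
the kind of conditional scalar reduction this transform targets:

double
cond_sum (double *restrict a, int *restrict c, int n)
{
  double res = 0.0;
  for (int i = 0; i < n; i++)
    if (c[i])
      res += a[i];  /* candidate for an if-converted COND_ADD */
  return res;
}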



[PATCH V2] RISC-V: Support movmisalign of RVV VLA modes

2023-10-09 Thread Juzhe-Zhong
This patch fixed these following FAILs in regressions:
FAIL: gcc.dg/vect/slp-perm-11.c -flto -ffat-lto-objects  scan-tree-dump-times 
vect "vectorizing stmts using SLP" 1
FAIL: gcc.dg/vect/slp-perm-11.c scan-tree-dump-times vect "vectorizing stmts 
using SLP" 1
FAIL: gcc.dg/vect/vect-bitfield-read-2.c -flto -ffat-lto-objects  
scan-tree-dump-not optimized "Invalid sum"
FAIL: gcc.dg/vect/vect-bitfield-read-2.c scan-tree-dump-not optimized "Invalid 
sum"
FAIL: gcc.dg/vect/vect-bitfield-read-4.c -flto -ffat-lto-objects  
scan-tree-dump-not optimized "Invalid sum"
FAIL: gcc.dg/vect/vect-bitfield-read-4.c scan-tree-dump-not optimized "Invalid 
sum"
FAIL: gcc.dg/vect/vect-bitfield-write-2.c -flto -ffat-lto-objects  
scan-tree-dump-not optimized "Invalid sum"
FAIL: gcc.dg/vect/vect-bitfield-write-2.c scan-tree-dump-not optimized "Invalid 
sum"
FAIL: gcc.dg/vect/vect-bitfield-write-3.c -flto -ffat-lto-objects  
scan-tree-dump-not optimized "Invalid sum"
FAIL: gcc.dg/vect/vect-bitfield-write-3.c scan-tree-dump-not optimized "Invalid 
sum"

Previously, I removed the movmisalign pattern to fix the execution FAILs in 
this commit:
https://github.com/gcc-mirror/gcc/commit/f7bff24905a6959f85f866390db2fff1d6f95520

I was thinking that RVV doesn't allow misaligned accesses, so at the beginning
I removed that pattern.
However, after deep investigation, reading the RVV ISA again, and experimenting
on SPIKE,
I realized I was wrong.

RVV ISA reference: 
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-memory-alignment-constraints

"If an element accessed by a vector memory instruction is not naturally aligned 
to the size of the element, 
 either the element is transferred successfully or an address misaligned 
exception is raised on that element."

It's obvious that RVV ISA does allow misaligned vector load/store.

And experiment and confirm on SPIKE:

[jzzhong@rios-cad122:/work/home/jzzhong/work/toolchain/riscv/gcc/gcc/testsuite/gcc.dg/vect]$~/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/install/bin/spike
 --isa=rv64gcv --varch=vlen:128,elen:64 
~/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/install/riscv64-unknown-elf/bin/pk64
  a.out
bbl loader
z   ra 00010158 sp 003ffb40 gp 00012c48
tp  t0 000110da t1 000f t2 
s0 00013460 s1  a0 00012ef5 a1 00012018
a2 00012a71 a3 000d a4 0004 a5 00012a71
a6 00012a71 a7 00012018 s2  s3 
s4  s5  s6  s7 
s8  s9  sA  sB 
t3  t4  t5  t6 
pc 00010258 va/inst 020660a7 sr 80026620
Store/AMO access fault!

[jzzhong@rios-cad122:/work/home/jzzhong/work/toolchain/riscv/gcc/gcc/testsuite/gcc.dg/vect]$~/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/install/bin/spike
 --misaligned --isa=rv64gcv --varch=vlen:128,elen:64 
~/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/install/riscv64-unknown-elf/bin/pk64
  a.out
bbl loader

We can see SPIKE can pass previous *FAILED* execution tests with specifying 
--misaligned to SPIKE.

So, to honor the RVV ISA spec, we should add the movmisalign pattern back based
on the investigations I have done, since
it can improve multiple vectorization tests and fix dump FAILs.

This patch adds TARGET_VECTOR_MISALIGN_SUPPORTED to decide whether we support
the misalign pattern for VLA modes (by default it is enabled).

Consider this following case:

struct s {
unsigned i : 31;
char a : 4;
};

#define N 32
#define ELT0 {0x7FFFUL, 0}
#define ELT1 {0x7FFFUL, 1}
#define ELT2 {0x7FFFUL, 2}
#define ELT3 {0x7FFFUL, 3}
#define RES 48
struct s A[N]
  = { ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
  ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
  ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
  ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3};

int __attribute__ ((noipa))
f(struct s *ptr, unsigned n) {
int res = 0;
for (int i = 0; i < n; ++i)
  res += ptr[i].a;
return res;
}

-O3 -S -fno-vect-cost-model (default strict-align):

f:
mv  a4,a0
beq a1,zero,.L9
addiw   a5,a1,-1
li  a3,14
vsetivli zero,16,e64,m8,ta,ma
bleu a5,a3,.L3
andi a5,a0,127
bne a5,zero,.L3
srliw   a3,a1,4
slli a3,a3,7
li  a0,15
slli a0,a0,32
add a3,a3,a4
mv  a5,a4
li  a2,32
vmv.v.x v16,a0
vsetvli zero,zero,e32,m4,ta,ma
vmv.v.i v4,0
.L4:
vsetvli zero,zero,e64,m8,ta,ma
vle64.v v8,0(a5)
addi a5,a5,128
 

[PATCH] RISC-V Regression test: Fix FAIL of fast-math-slp-38.c for RVV

2023-10-09 Thread Juzhe-Zhong
Reference: https://godbolt.org/z/G9jzf5Grh

RVV is able to vectorize this case using SLP. However, with 
-fno-vect-cost-model,
RVV vectorizes it using vec_load_lanes with stride 6.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/fast-math-slp-38.c: Add ! vect_strided6.

---
 gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c 
b/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
index 7c7acd5bab6..96751faae7f 100644
--- a/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
+++ b/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
@@ -18,4 +18,4 @@ foo (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } 
} */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target { ! vect_strided6 } } } } */
-- 
2.36.3



Re: [PATCH] RISC-V Regression test: Fix FAIL of fast-math-slp-38.c for RVV

2023-10-09 Thread Richard Biener
On Mon, 9 Oct 2023, Juzhe-Zhong wrote:

> Reference: https://godbolt.org/z/G9jzf5Grh
> 
> RVV is able to vectorize this case using SLP. However, with 
> -fno-vect-cost-model, RVV vectorize it by vec_load_lanes with stride 6.

OK.  Note load/store-lanes is specifically pre-empting SLP if all
loads/stores of a SLP instance can support that.  Not sure if this
heuristic is good for load/store lanes with high stride?

> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/fast-math-slp-38.c: Add ! vect_strided6.
> 
> ---
>  gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c 
> b/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
> index 7c7acd5bab6..96751faae7f 100644
> --- a/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
> +++ b/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
> @@ -18,4 +18,4 @@ foo (void)
>  }
>  
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" 
> } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" 
> { target { ! vect_strided6 } } } } */
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: Re: [PATCH] RISC-V Regression test: Fix FAIL of fast-math-slp-38.c for RVV

2023-10-09 Thread juzhe.zh...@rivai.ai
>> OK.  
Thanks.  Committed.

>> Note load/store-lanes is specifically pre-empting SLP if all
>> loads/stores of a SLP intance can support that.  Not sure if this
>> heuristic is good for load/store lanes with high stride?

Yeah, I understand your concern.
Hmm, I am not sure either.
But the RVV ISA defines lane load/store with 2 to 8 lanes, and LLVM already
supports them.
I think we can fully support them, then let the RISC-V cost model decide
whether it is profitable or not.

Also, I found RVV can vectorize a TSVC case with stride = 5 
lane_load/lane_store:

tsvc-s353.c:

-/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" { xfail *-*-* } } } 
*/
+/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" { xfail { ! riscv_v 
} } } } */

https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632213.html

So, I think overall it is beneficial that we support high-stride lane
load/store, which can help us vectorize more cases.
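
For illustration (my own example, not the actual TSVC kernel), the kind of
stride-5 grouped access that maps onto lane load/store looks roughly like:

void
f (float *restrict out, float *restrict in, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = in[5 * i] + in[5 * i + 1] + in[5 * i + 2]
             + in[5 * i + 3] + in[5 * i + 4];
}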



juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-10-09 20:41
To: Juzhe-Zhong
CC: gcc-patches; jeffreyalaw
Subject: Re: [PATCH] RISC-V Regression test: Fix FAIL of fast-math-slp-38.c for 
RVV
On Mon, 9 Oct 2023, Juzhe-Zhong wrote:
 
> Reference: https://godbolt.org/z/G9jzf5Grh
> 
> RVV is able to vectorize this case using SLP. However, with 
> -fno-vect-cost-model, RVV vectorize it by vec_load_lanes with stride 6.
 
OK.  Note load/store-lanes is specifically pre-empting SLP if all
loads/stores of a SLP instance can support that.  Not sure if this
heuristic is good for load/store lanes with high stride?
 
> gcc/testsuite/ChangeLog:
> 
> * gcc.dg/vect/fast-math-slp-38.c: Add ! vect_strided6.
> 
> ---
>  gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c 
> b/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
> index 7c7acd5bab6..96751faae7f 100644
> --- a/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
> +++ b/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
> @@ -18,4 +18,4 @@ foo (void)
>  }
>  
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" 
> } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" 
> { target { ! vect_strided6 } } } } */
> 
 
-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
 


Re: [PATCH 1/6] aarch64: Sync system register information with Binutils

2023-10-09 Thread Victor Do Nascimento




On 10/9/23 01:02, Ramana Radhakrishnan wrote:




On 5 Oct 2023, at 14:04, Victor Do Nascimento  
wrote:

External email: Use caution opening links or attachments


On 10/5/23 12:42, Richard Earnshaw wrote:



On 03/10/2023 16:18, Victor Do Nascimento wrote:

This patch adds the `aarch64-sys-regs.def' file to GCC, teaching
the compiler about system registers known to the assembler and how
these can be used.

The macros used to hold system register information reflect those in
use by binutils, a design choice made to facilitate the sharing of data
between different parts of the toolchain.

By aligning the representation of data common to different parts of
the toolchain we can greatly reduce the duplication of work,
facilitating the maintenance of the aarch64 back-end across different
parts of the toolchain; any `SYSREG (...)' that is added in one
project can just as easily be added to its counterpart.

GCC does not implement the full range of ISA flags present in
Binutils.  Where this is the case, aliases must be added to aarch64.h
with the unknown architectural extension being mapped to its
associated base architecture, such that any flag present in Binutils
and used in system register definitions is understood in GCC.  Again,
this is done such that flags can be used interchangeably between
projects making use of the aarch64-system-regs.def file.  This is done
in the next patch in the series.

`.arch' directives missing from the emitted assembly files as a
consequence of this aliasing are accounted for by the compiler using
the generic S<op0>_<op1>_<Cn>_<Cm>_<op2> encoding of system registers when
issuing mrs/msr instructions.  This design choice ensures the
assembler will accept anything that was deemed acceptable by the
compiler.

gcc/ChangeLog:

* gcc/config/aarch64/aarch64-system-regs.def: New.
---
  gcc/config/aarch64/aarch64-sys-regs.def | 1059 +++
  1 file changed, 1059 insertions(+)
  create mode 100644 gcc/config/aarch64/aarch64-sys-regs.def


This file is supposed to be /identical/ to the one in GNU Binutils,
right?


You're right Richard.

We want the same file to be compatible with both parts of the toolchain
and, consequently, there is no compelling reason as to why the copy of
the file found in GCC should in any way diverge from its Binutils
counterpart.


If so, I think it needs to continue to say that it is part of
GNU Binutils, not part of GCC.  Ramana, has this happened before?  If
not, does the SC have a position here?



I’ve not had the time to delve into the patch, apologies.


Is the intention here to keep a copy of the file with the main copy being in 
binutils i.e. modifications are made in binutils and then sync’d with GCC at 
the same time ?


In which case the comments in the file should make the mechanics of updates 
abundantly clear.


That is indeed correct.
I will make this clear in the comments for the file.  Thanks for picking 
up on this.



Is there any reason why if the 2 versions were different, you’d have problems 
between gcc and binutils ?

If so, what kinds of problems would they be ? i.e. would they be no more than 
gas not knowing about a system register that GCC claimed to know because 
binutils and gcc were built with different versions of the system register file.


There would be no problem, should the two versions be different for 
whatever reason.  Even the issue you mention of gas not knowing about a 
system register that GCC claimed to know is circumvented.


Gcc is configured to always emit generic register names in the resulting 
asm, decoupling the system register validation mechanisms of the two 
parts of the toolchain.  If gcc deems the requirements of a particular 
system register to be satisfied, it won't trigger the assembler's 
validation mechanism when the assembly stage is reached.  Consequently, 
a stale copy of `aarch64-sys-reg.def' in binutils will bear no impact on 
gcc's execution.


Conversely, a stale `aarch64-sys-reg.def' on gcc's end will result in 
some register names not being recognized by gcc but, as in the above 
scenario, no ill-behavior will be triggered as a consequence of 
mismatches in `aarch64-sys-reg.def' version between different parts of 
the toolchain.



Speaking for myself, I do not see this request being any different from the 
requests for imports from other repositories into the GCC repository.




R.


This does raise a very interesting question on the intellectual property
front and one that is well beyond my competence to opine about.

Nonetheless, this is a question which may arise again if we abstract
away more target description data into such .def files, as has been
discussed for architectural feature flags (for example).

So what might be nice (but not necessarily tenable) is if we had
appropriate provisions in place for where files were shared across
different parts of the toolchain.

Something like "This file is a shared resource of GCC and Binutils."




This model of an additional shared repository with a 

Re: [PATCH] ifcvt/vect: Emit COND_ADD for conditional scalar reduction.

2023-10-09 Thread Robin Dapp
> Hmm, the function is called at transform time so this shouldn't help
> avoiding the ICE.  I expected we refuse to vectorize _any_ reduction
> when sign dependent rounding is in effect?  OTOH maybe sign-dependent
> rounding is OK but only when we use a unconditional fold-left
> (so a loop mask from fully masking is OK but not an original COND_ADD?).

So we currently only disable the use of partial vectors

  else if (reduction_type == FOLD_LEFT_REDUCTION
   && reduc_fn == IFN_LAST
   && FLOAT_TYPE_P (vectype_in)
   && HONOR_SIGNED_ZEROS (vectype_in)
   && HONOR_SIGN_DEPENDENT_ROUNDING (vectype_in))
{
  if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 "can't operate on partial vectors because"
 " signed zeros cannot be preserved.\n");
  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;

which is inside a LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P block.

For the fully masked case we continue (and then fail the assertion
on aarch64 at transform time).

I didn't get why that case is ok, though?  We still merge the initial
definition with the identity/neutral op (i.e. possibly -0.0) based on
the loop mask.  Is that different to partial masking?

Regards
 Robin



Re: [PATCH] wide-int: Allow up to 16320 bits wide_int and change widest_int precision to 32640 bits [PR102989]

2023-10-09 Thread Richard Sandiford
Jakub Jelinek  writes:
> Hi!
>
> As mentioned in the _BitInt support thread, _BitInt(N) is currently limited
> by the wide_int/widest_int maximum precision limitation, which is depending
> on target 191, 319, 575 or 703 bits (one less than WIDE_INT_MAX_PRECISION).
> That is fairly low limit for _BitInt, especially on the targets with the 191
> bit limitation.
>
> The following patch bumps that limit to 16319 bits on all arches, which is
> the limit imposed by INTEGER_CST representation (unsigned char members
> holding number of HOST_WIDE_INT limbs).
>
> In order to achieve that, wide_int is changed from a trivially copyable type
> which contained just an inline array of WIDE_INT_MAX_ELTS (3, 5, 9 or
> 11 limbs depending on target) limbs into a non-trivially copy constructible,
> copy assignable and destructible type which for the usual small cases (up
> to WIDE_INT_MAX_INL_ELTS which is the former WIDE_INT_MAX_ELTS) still uses
> an inline array of limbs, but for larger precisions uses heap allocated
> limb array.  This makes wide_int unusable in GC structures, so for dwarf2out
> which was the only place which needed it there is a new rwide_int type
> (restricted wide_int) which supports only up to RWIDE_INT_MAX_ELTS limbs
> inline and is trivially copyable (dwarf2out should never deal with large
> _BitInt constants, those should have been lowered earlier).
>
> Similarly, widest_int has been changed from a trivially copyable type which
> contained also an inline array of WIDE_INT_MAX_ELTS limbs (but unlike
> wide_int didn't contain precision and assumed that to be
> WIDE_INT_MAX_PRECISION) into a non-trivially copy constructible, copy
> assignable and destructible type which has always WIDEST_INT_MAX_PRECISION
> precision (32640 bits currently, twice as much as INTEGER_CST limitation
> allows) and unlike wide_int decides depending on get_len () value whether
> it uses an inline array (again, up to WIDE_INT_MAX_INL_ELTS) or heap
> allocated one.  In wide-int.h this means we need to estimate an upper
> bound on how many limbs will wide-int.cc (usually, sometimes wide-int.h)
> need to write, heap allocate if needed based on that estimation and upon
> set_len which is done at the end if we guessed over WIDE_INT_MAX_INL_ELTS
> and allocated dynamically, while we actually need less than that
> copy/deallocate.  The unexact guesses are needed because the exact
> computation of the length in wide-int.cc is sometimes quite complex and
> especially canonicalize at the end can decrease it.  widest_int is again
> because of this not usable in GC structures, so cfgloop.h has been changed
> to use fixed_wide_int_storage  and punt if
> we'd have larger _BitInt based iterators, programs having more than 128-bit
> iterators will be hopefully rare and I think it is fine to treat loops with
> more than 2^127 iterations as effectively possibly infinite, omp-general.cc
> is changed to use fixed_wide_int_storage <1024>, as it better should support
> scores with the same precision on all arches.
>
> Code which used WIDE_INT_PRINT_BUFFER_SIZE sized buffers for printing
> wide_int/widest_int into buffer had to be changed to use XALLOCAVEC for
> larger lengths.
>
> On x86_64, the patch in --enable-checking=yes,rtl,extra configured
> bootstrapped cc1plus enlarges the .text section by 1.01% - from
> 0x25725a5 to 0x25e and similarly at least when compiling insn-recog.cc
> with the usual bootstrap option slows compilation down by 1.01%,
> user 4m22.046s and 4m22.384s on vanilla trunk vs.
> 4m25.947s and 4m25.581s on patched trunk.  I'm afraid some code size growth
> and compile time slowdown is unavoidable in this case, we use wide_int and
> widest_int everywhere, and while the rare cases are marked with UNLIKELY
> macros, it still means extra checks for it.

Yeah, it's unfortunate, but like you say, it's probably unavoidable.
Having effectively arbitrary-size integers breaks most of the simplifying
assumptions.
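
As a side note, the small-inline/heap-fallback scheme described above can be
sketched like this (illustrative only, not GCC's actual wide_int; the limb
count and field names are made up, the struct is assumed zero-initialized,
and growing an already heap-allocated buffer is omitted):

#include <stdlib.h>
#include <string.h>

#define INLINE_LIMBS 9  /* placeholder; the real limit is target-dependent */

struct limb_storage
{
  unsigned long long inline_buf[INLINE_LIMBS];  /* small precisions        */
  unsigned long long *heap_buf;                 /* NULL until a large len  */
  unsigned len;                                 /* limbs currently in use  */
};

static unsigned long long *
storage_limbs (struct limb_storage *s)
{
  return s->heap_buf ? s->heap_buf : s->inline_buf;
}

static void
storage_set_len (struct limb_storage *s, unsigned n)
{
  if (n > INLINE_LIMBS && !s->heap_buf)
    {
      s->heap_buf = malloc (n * sizeof (unsigned long long));
      memcpy (s->heap_buf, s->inline_buf, s->len * sizeof (unsigned long long));
    }
  s->len = n;
}

static void
storage_release (struct limb_storage *s)
{
  free (s->heap_buf);
  s->heap_buf = NULL;
}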

> The patch also regresses
> +FAIL: gm2/pim/fail/largeconst.mod,  -O  
> +FAIL: gm2/pim/fail/largeconst.mod,  -O -g  
> +FAIL: gm2/pim/fail/largeconst.mod,  -O3 -fomit-frame-pointer  
> +FAIL: gm2/pim/fail/largeconst.mod,  -O3 -fomit-frame-pointer 
> -finline-functions  
> +FAIL: gm2/pim/fail/largeconst.mod,  -Os  
> +FAIL: gm2/pim/fail/largeconst.mod,  -g  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -O  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -O -g  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -O3 -fomit-frame-pointer  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -O3 -fomit-frame-pointer 
> -finline-functions  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -Os  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -g  
> tests, which previously were rejected with
> error: constant literal 
> ‘12345678912345678912345679123456789123456789123456789123456789123456791234567891234567891234567891234567891234567912345678912345678912345678912345678912345679123456789123456789’
>  exceeds internal ZTYPE range
> kind of errors, but now are accepted.  Seems the F

[PATCH] RISC-V Regression test: Fix FAIL of pr45752.c for RVV

2023-10-09 Thread Juzhe-Zhong
RVV uses load_lanes with stride = 5 to vectorize this case with
-fno-vect-cost-model instead of SLP.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/pr45752.c: Adapt dump check for target supports 
load_lanes with stride = 5.

---
 gcc/testsuite/gcc.dg/vect/pr45752.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr45752.c 
b/gcc/testsuite/gcc.dg/vect/pr45752.c
index e8b364f29eb..3c87d9b04fc 100644
--- a/gcc/testsuite/gcc.dg/vect/pr45752.c
+++ b/gcc/testsuite/gcc.dg/vect/pr45752.c
@@ -159,4 +159,4 @@ int main (int argc, const char* argv[])
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
 /* { dg-final { scan-tree-dump-times "gaps requires scalar epilogue loop" 0 
"vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" } 
} */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" 
{target { ! { vect_load_lanes && vect_strided5 } } } } } */
-- 
2.36.3



Re: [PATCH] ifcvt/vect: Emit COND_ADD for conditional scalar reduction.

2023-10-09 Thread Richard Biener
On Mon, 9 Oct 2023, Robin Dapp wrote:

> > Hmm, the function is called at transform time so this shouldn't help
> > avoiding the ICE.  I expected we refuse to vectorize _any_ reduction
> > when sign dependent rounding is in effect?  OTOH maybe sign-dependent
> > rounding is OK but only when we use a unconditional fold-left
> > (so a loop mask from fully masking is OK but not an original COND_ADD?).
> 
> So we currently only disable the use of partial vectors
> 
>   else if (reduction_type == FOLD_LEFT_REDUCTION
>  && reduc_fn == IFN_LAST

aarch64 probably chokes because reduc_fn is not IFN_LAST.

>  && FLOAT_TYPE_P (vectype_in)
>  && HONOR_SIGNED_ZEROS (vectype_in)

so with your change we'd support signed zeros correctly.

>  && HONOR_SIGN_DEPENDENT_ROUNDING (vectype_in))
>   {
> if (dump_enabled_p ())
>   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>"can't operate on partial vectors because"
>" signed zeros cannot be preserved.\n");
> LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> 
> which is inside a LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P block.
> 
> For the fully masked case we continue (and then fail the assertion
> on aarch64 at transform time).
> 
> I didn't get why that case is ok, though?  We still merge the initial
> definition with the identity/neutral op (i.e. possibly -0.0) based on
> the loop mask.  Is that different to partial masking?

I think the main point with my earlier change is that without
native support for a fold-left reduction (like on x86) we get

 ops = mask ? ops : neutral;
 acc += ops[0];
 acc += ops[1];
 ...

so we wouldn't use a COND_ADD but add neutral elements for masked
elements.  That's OK for signed zeros after your change (great)
but not OK for sign dependent rounding (because we can't decide on
the sign of the neutral zero then).

For the case of using an internal function, thus direct target support,
it should be OK to have sign-dependent rounding if we can use
the masked-fold-left reduction op.  As we do

  /* On the first iteration the input is simply the scalar phi
 result, and for subsequent iterations it is the output of
 the preceding operation.  */
  if (reduc_fn != IFN_LAST || (mask && mask_reduc_fn != IFN_LAST))
{
  if (mask && len && mask_reduc_fn == IFN_MASK_LEN_FOLD_LEFT_PLUS)
new_stmt = gimple_build_call_internal (mask_reduc_fn, 5, 
reduc_var,
   def0, mask, len, bias);
  else if (mask && mask_reduc_fn == IFN_MASK_FOLD_LEFT_PLUS)
new_stmt = gimple_build_call_internal (mask_reduc_fn, 3, 
reduc_var,
   def0, mask);
  else
new_stmt = gimple_build_call_internal (reduc_fn, 2, reduc_var,
   def0);

the last case should be able to assert that 
!HONOR_SIGN_DEPENDENT_ROUNDING (also the reduc_fn == IFN_LAST case).

The quoted condition above should change to drop the HONOR_SIGNED_ZEROS
condition and the reduc_fn == IFN_LAST should change, maybe to
internal_fn_mask_index (reduc_fn) == -1?
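
Spelled out (sketch only, untested), that suggestion would make the guard look
roughly like:

  else if (reduction_type == FOLD_LEFT_REDUCTION
           && internal_fn_mask_index (reduc_fn) == -1
           && FLOAT_TYPE_P (vectype_in)
           && HONOR_SIGN_DEPENDENT_ROUNDING (vectype_in))
    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;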

Richard.


Re: [PATCH] RISC-V Regression test: Fix FAIL of pr45752.c for RVV

2023-10-09 Thread Richard Biener
On Mon, 9 Oct 2023, Juzhe-Zhong wrote:

> RVV use load_lanes with stride = 5 vectorize this case with 
> -fno-vect-cost-model
> instead of SLP.

OK

> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/pr45752.c: Adapt dump check for target supports 
> load_lanes with stride = 5.
> 
> ---
>  gcc/testsuite/gcc.dg/vect/pr45752.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/pr45752.c 
> b/gcc/testsuite/gcc.dg/vect/pr45752.c
> index e8b364f29eb..3c87d9b04fc 100644
> --- a/gcc/testsuite/gcc.dg/vect/pr45752.c
> +++ b/gcc/testsuite/gcc.dg/vect/pr45752.c
> @@ -159,4 +159,4 @@ int main (int argc, const char* argv[])
>  
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>  /* { dg-final { scan-tree-dump-times "gaps requires scalar epilogue loop" 0 
> "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" 
> } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" 
> {target { ! { vect_load_lanes && vect_strided5 } } } } } */
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


[PATCH v2] RISC-V: Refine bswap16 auto vectorization code gen

2023-10-09 Thread pan2 . li
From: Pan Li 

Update in v2

* Remove emit helper functions.
* Take expand_binop instead.

Original log:

This patch would like to refine the code gen for the bswap16.

We will have VEC_PERM_EXPR after rtl expand when invoking
__builtin_bswap. It will generate about 9 instructions in
loop as below, no matter it is bswap16, bswap32 or bswap64.

  .L2:
1 vle16.v v4,0(a0)
2 vmv.v.x v2,a7
3 vand.vv v2,v6,v2
4 slli a2,a5,1
5 vrgatherei16.vv v1,v4,v2
6 sub a4,a4,a5
7 vse16.v v1,0(a3)
8 add a0,a0,a2
9 add a3,a3,a2
  bne a4,zero,.L2

But for bswap16 we may have an even simpler code gen, which
has only 7 instructions in the loop as below.

  .L5
1 vle8.v  v2,0(a5)
2 addi a5,a5,32
3 vsrl.vi v4,v2,8
4 vsll.vi v2,v2,8
5 vor.vv  v4,v4,v2
6 vse8.v  v4,0(a4)
7 addi a4,a4,32
  bne a5,a6,.L5

Unfortunately, this way will make the insn count in the loop grow up to
13 and 24 for bswap32 and bswap64. Thus, we will refine the code
gen for the bswap16 only, and leave both the bswap32 and bswap64
as is.
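
As a usage sketch (my own example, not one of the added testcases), the kind
of loop that exercises this bswap16 pattern is:

#include <stdint.h>

void
bswap16_loop (uint16_t *restrict dst, uint16_t *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = __builtin_bswap16 (src[i]);
}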

gcc/ChangeLog:

* config/riscv/riscv-v.cc (shuffle_bswap_pattern): New func impl
for shuffle bswap.
(expand_vec_perm_const_1): Add handling for shuffle bswap pattern.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vls/perm-4.c: Adjust checker.
* gcc.target/riscv/rvv/autovec/unop/bswap16-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/bswap16-0.c: New test.

Signed-off-by: Pan Li 
---
 gcc/config/riscv/riscv-v.cc   | 91 +++
 .../riscv/rvv/autovec/unop/bswap16-0.c| 17 
 .../riscv/rvv/autovec/unop/bswap16-run-0.c| 44 +
 .../riscv/rvv/autovec/vls/bswap16-0.c | 34 +++
 .../gcc.target/riscv/rvv/autovec/vls/perm-4.c |  4 +-
 5 files changed, 188 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-0.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/bswap16-0.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 23633a2a74d..c72e411f125 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -3030,6 +3030,95 @@ shuffle_decompress_patterns (struct expand_vec_perm_d *d)
   return true;
 }
 
+static bool
+shuffle_bswap_pattern (struct expand_vec_perm_d *d)
+{
+  HOST_WIDE_INT diff;
+  unsigned i, size, step;
+
+  if (!d->one_vector_p || !d->perm[0].is_constant (&diff) || !diff)
+return false;
+
+  step = diff + 1;
+  size = step * GET_MODE_UNIT_BITSIZE (d->vmode);
+
+  switch (size)
+{
+case 16:
+  break;
+case 32:
+case 64:
+  /* We will have VEC_PERM_EXPR after rtl expand when invoking
+__builtin_bswap. It will generate about 9 instructions in
+loop as below, no matter it is bswap16, bswap32 or bswap64.
+  .L2:
+1 vle16.v v4,0(a0)
+2 vmv.v.x v2,a7
+3 vand.vv v2,v6,v2
+4 sllia2,a5,1
+5 vrgatherei16.vv v1,v4,v2
+6 sub a4,a4,a5
+7 vse16.v v1,0(a3)
+8 add a0,a0,a2
+9 add a3,a3,a2
+  bne a4,zero,.L2
+
+But for bswap16 we may have a even simple code gen, which
+has only 7 instructions in loop as below.
+  .L5
+1 vle8.v  v2,0(a5)
+2 addia5,a5,32
+3 vsrl.vi v4,v2,8
+4 vsll.vi v2,v2,8
+5 vor.vv  v4,v4,v2
+6 vse8.v  v4,0(a4)
+7 addia4,a4,32
+  bne a5,a6,.L5
+
+Unfortunately, the instructions in loop will grow to 13 and 24
+for bswap32 and bswap64. Thus, we will leverage vrgather (9 insn)
+for both the bswap64 and bswap32, but take shift and or (7 insn)
+for bswap16.
+   */
+default:
+  return false;
+}
+
+  for (i = 0; i < step; i++)
+if (!d->perm.series_p (i, step, diff - i, step))
+  return false;
+
+  if (d->testing_p)
+return true;
+
+  machine_mode vhi_mode;
+  poly_uint64 vhi_nunits = exact_div (GET_MODE_NUNITS (d->vmode), 2);
+
+  if (!get_vector_mode (HImode, vhi_nunits).exists (&vhi_mode))
+return false;
+
+  /* Step-1: Move op0 to src with VHI mode.  */
+  rtx src = gen_reg_rtx (vhi_mode);
+  emit_move_insn (src, gen_lowpart (vhi_mode, d->op0));
+
+  /* Step-2: Shift right 8 bits to dest.  */
+  rtx dest = expand_binop (vhi_mode, lshr_optab, src, gen_int_mode (8, Pmode),
+  NULL_RTX, 0, OPTAB_DIRECT);
+
+  /* Step-3: Shift left 8 bits to src.  */
+  src = expand_binop (vhi_mode, ashl_optab, src, gen_int_mode (8, Pmode),
+ NULL_RTX, 0, OPTAB_DIRECT);
+
+  /* Step-4: Logic Or dest and src to dest.  */
+  dest = expand_binop (vhi_mode, ior_optab, dest, src,
+  NULL_RTX, 0, OPTAB_DIRECT);
+
+  /* Step-5: Move src to target with VQI mode.  */
+  emit_move

Re: [PATCH v2] RISC-V: Refine bswap16 auto vectorization code gen

2023-10-09 Thread juzhe.zh...@rivai.ai
LGTM now.

Thanks.



juzhe.zh...@rivai.ai
 
From: pan2.li
Date: 2023-10-09 21:09
To: gcc-patches
CC: juzhe.zhong; pan2.li; yanzhang.wang; kito.cheng
Subject: [PATCH v2] RISC-V: Refine bswap16 auto vectorization code gen
From: Pan Li 
 
Update in v2
 
* Remove emit helper functions.
* Take expand_binop instead.
 
Original log:
 
This patch would like to refine the code gen for the bswap16.
 
We will have a VEC_PERM_EXPR after RTL expand when invoking
__builtin_bswap. It generates about 9 instructions in the
loop below, whether it is bswap16, bswap32 or bswap64.

  .L2:
1 vle16.v v4,0(a0)
2 vmv.v.x v2,a7
3 vand.vv v2,v6,v2
4 slli a2,a5,1
5 vrgatherei16.vv v1,v4,v2
6 sub a4,a4,a5
7 vse16.v v1,0(a3)
8 add a0,a0,a2
9 add a3,a3,a2
  bne a4,zero,.L2

But for bswap16 we can have even simpler code gen, with
only 7 instructions in the loop below.

  .L5
1 vle8.v  v2,0(a5)
2 addi a5,a5,32
3 vsrl.vi v4,v2,8
4 vsll.vi v2,v2,8
5 vor.vv  v4,v4,v2
6 vse8.v  v4,0(a4)
7 addi a4,a4,32
  bne a5,a6,.L5

Unfortunately, this approach would grow the loop to 13 and 24
instructions for bswap32 and bswap64 respectively. Thus, we refine
the code gen for bswap16 only, and leave bswap32 and bswap64
as they are.
 
gcc/ChangeLog:
 
* config/riscv/riscv-v.cc (shuffle_bswap_pattern): New func impl
for shuffle bswap.
(expand_vec_perm_const_1): Add handling for shuffle bswap pattern.
 
gcc/testsuite/ChangeLog:
 
* gcc.target/riscv/rvv/autovec/vls/perm-4.c: Adjust checker.
* gcc.target/riscv/rvv/autovec/unop/bswap16-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/bswap16-0.c: New test.
 
Signed-off-by: Pan Li 
---
gcc/config/riscv/riscv-v.cc   | 91 +++
.../riscv/rvv/autovec/unop/bswap16-0.c| 17 
.../riscv/rvv/autovec/unop/bswap16-run-0.c| 44 +
.../riscv/rvv/autovec/vls/bswap16-0.c | 34 +++
.../gcc.target/riscv/rvv/autovec/vls/perm-4.c |  4 +-
5 files changed, 188 insertions(+), 2 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/bswap16-0.c
 
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 23633a2a74d..c72e411f125 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -3030,6 +3030,95 @@ shuffle_decompress_patterns (struct expand_vec_perm_d *d)
   return true;
}
+static bool
+shuffle_bswap_pattern (struct expand_vec_perm_d *d)
+{
+  HOST_WIDE_INT diff;
+  unsigned i, size, step;
+
+  if (!d->one_vector_p || !d->perm[0].is_constant (&diff) || !diff)
+return false;
+
+  step = diff + 1;
+  size = step * GET_MODE_UNIT_BITSIZE (d->vmode);
+
+  switch (size)
+{
+case 16:
+  break;
+case 32:
+case 64:
+  /* We will have VEC_PERM_EXPR after rtl expand when invoking
+ __builtin_bswap. It will generate about 9 instructions in
+ loop as below, no matter it is bswap16, bswap32 or bswap64.
+.L2:
+ 1 vle16.v v4,0(a0)
+ 2 vmv.v.x v2,a7
+ 3 vand.vv v2,v6,v2
+ 4 sllia2,a5,1
+ 5 vrgatherei16.vv v1,v4,v2
+ 6 sub a4,a4,a5
+ 7 vse16.v v1,0(a3)
+ 8 add a0,a0,a2
+ 9 add a3,a3,a2
+bne a4,zero,.L2
+
+ But for bswap16 we may have a even simple code gen, which
+ has only 7 instructions in loop as below.
+.L5
+ 1 vle8.v  v2,0(a5)
+ 2 addia5,a5,32
+ 3 vsrl.vi v4,v2,8
+ 4 vsll.vi v2,v2,8
+ 5 vor.vv  v4,v4,v2
+ 6 vse8.v  v4,0(a4)
+ 7 addia4,a4,32
+bne a5,a6,.L5
+
+ Unfortunately, the instructions in loop will grow to 13 and 24
+ for bswap32 and bswap64. Thus, we will leverage vrgather (9 insn)
+ for both the bswap64 and bswap32, but take shift and or (7 insn)
+ for bswap16.
+   */
+default:
+  return false;
+}
+
+  for (i = 0; i < step; i++)
+if (!d->perm.series_p (i, step, diff - i, step))
+  return false;
+
+  if (d->testing_p)
+return true;
+
+  machine_mode vhi_mode;
+  poly_uint64 vhi_nunits = exact_div (GET_MODE_NUNITS (d->vmode), 2);
+
+  if (!get_vector_mode (HImode, vhi_nunits).exists (&vhi_mode))
+return false;
+
+  /* Step-1: Move op0 to src with VHI mode.  */
+  rtx src = gen_reg_rtx (vhi_mode);
+  emit_move_insn (src, gen_lowpart (vhi_mode, d->op0));
+
+  /* Step-2: Shift right 8 bits to dest.  */
+  rtx dest = expand_binop (vhi_mode, lshr_optab, src, gen_int_mode (8, Pmode),
+NULL_RTX, 0, OPTAB_DIRECT);
+
+  /* Step-3: Shift left 8 bits to src.  */
+  src = expand_binop (vhi_mode, ashl_optab, src, gen_int_mode (8, Pmode),
+   NULL_RTX, 0, OPTAB_DIRECT);
+
+  /* Step-4: Logic Or dest and src to dest.  */
+  dest = expand_binop (vhi_mode, ior_optab, dest, src,
+NULL_RTX, 0, OPTAB_DIRECT);
+
+  /* Step-5: Move src to target with VQI mode.  */
+  emit_move_insn (d->target, gen_lowpart (d->vmode, dest));
+
+  return true;
+}
+
/*

RE: [PATCH] RISC-V Regression test: Fix FAIL of pr45752.c for RVV

2023-10-09 Thread Li, Pan2
Committed, thanks Richard.

Pan

-Original Message-
From: Richard Biener  
Sent: Monday, October 9, 2023 9:07 PM
To: Juzhe-Zhong 
Cc: gcc-patches@gcc.gnu.org; jeffreya...@gmail.com
Subject: Re: [PATCH] RISC-V Regression test: Fix FAIL of pr45752.c for RVV

On Mon, 9 Oct 2023, Juzhe-Zhong wrote:

> RVV use load_lanes with stride = 5 vectorize this case with 
> -fno-vect-cost-model
> instead of SLP.

OK

> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/pr45752.c: Adapt dump check for target supports 
> load_lanes with stride = 5.
> 
> ---
>  gcc/testsuite/gcc.dg/vect/pr45752.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/pr45752.c 
> b/gcc/testsuite/gcc.dg/vect/pr45752.c
> index e8b364f29eb..3c87d9b04fc 100644
> --- a/gcc/testsuite/gcc.dg/vect/pr45752.c
> +++ b/gcc/testsuite/gcc.dg/vect/pr45752.c
> @@ -159,4 +159,4 @@ int main (int argc, const char* argv[])
>  
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>  /* { dg-final { scan-tree-dump-times "gaps requires scalar epilogue loop" 0 
> "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" 
> } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" 
> {target { ! { vect_load_lanes && vect_strided5 } } } } } */
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH 6/6] aarch64: Add front-end argument type checking for target builtins

2023-10-09 Thread Victor Do Nascimento




On 10/7/23 12:53, Richard Sandiford wrote:

Richard Earnshaw  writes:

On 03/10/2023 16:18, Victor Do Nascimento wrote:

In implementing the ACLE read/write system register builtins it was
observed that leaving argument type checking to be done at expand-time
meant that poorly-formed function calls were being "fixed" by certain
optimization passes, meaning bad code wasn't being properly picked up
in checking.

Example:

const char *regname = "amcgcr_el0";
long long a = __builtin_aarch64_rsr64 (regname);

is reduced by the ccp1 pass to

long long a = __builtin_aarch64_rsr64 ("amcgcr_el0");

As these functions require an argument of STRING_CST type, there needs
to be a check carried out by the front-end capable of picking this up.

The introduced `check_general_builtin_call' function will be called by
the TARGET_CHECK_BUILTIN_CALL hook whenever a call to a builtin
belonging to the AARCH64_BUILTIN_GENERAL category is encountered,
carrying out any appropriate checks associated with a particular
builtin function code.


Doesn't this prevent reasonable wrapping of the __builtin... names with
something more palatable?  Eg:

static inline __attribute__(("always_inline")) long long get_sysreg_ll
(const char *regname)
{
return __builtin_aarch64_rsr64 (regname);
}

...
long long x = get_sysreg_ll("amcgcr_el0");
...


I think it's a case of picking your poison.  If we didn't do this,
and only checked later, then it's unlikely that GCC and Clang would
be consistent about when a constant gets folded soon enough.

But yeah, it means that the above would need to be a macro in C.
Enlightened souls using C++ could instead do:

   template
   long long get_sysreg_ll()
   {
 return __builtin_aarch64_rsr64(regname);
   }

   ... get_sysreg_ll<"amcgcr_el0">() ...

Or at least I hope so.  Might be nice to have a test for this.

Thanks,
Richard
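
For the C side of the above, the wrapper would have to be a macro so that
the string literal reaches the builtin textually, before any folding is
needed.  A minimal sketch, with the macro name purely illustrative:

  #define GET_SYSREG_LL(regname) __builtin_aarch64_rsr64 (regname)

  long long x = GET_SYSREG_LL ("amcgcr_el0");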


As Richard Earnshaw mentioned, this does break the use of `static inline 
__attribute__(("always_inline"))', something I had found out in my 
testing.  My chosen implementation was indeed, to quote Richard 
Sandiford, a case of "picking your poison" so that things line up with 
Clang and behave consistently across optimization levels.


Relaxing the use of `TARGET_CHECK_BUILTIN_CALL' meant optimizations 
were letting too many things through. Example:


const char *regname = "amcgcr_el0";
long long a = __builtin_aarch64_rsr64 (regname);

gets folded to

long long a = __builtin_aarch64_rsr64 ("amcgcr_el0");

and compilation passes at -O1 even though it fails at -O0.

I had, however, not given any thought to the use of a template as a 
valid C++ alternative.


I will evaluate the use of templates and add tests accordingly.

Cheers,
Victor


RE: [PATCH v2] RISC-V: Refine bswap16 auto vectorization code gen

2023-10-09 Thread Li, Pan2
Committed, thanks Juzhe.

Pan

From: juzhe.zh...@rivai.ai 
Sent: Monday, October 9, 2023 9:11 PM
To: Li, Pan2 ; gcc-patches 
Cc: Li, Pan2 ; Wang, Yanzhang ; 
kito.cheng 
Subject: Re: [PATCH v2] RISC-V: Refine bswap16 auto vectorization code gen

LGTM now.

Thanks.


juzhe.zh...@rivai.ai

From: pan2.li
Date: 2023-10-09 21:09
To: gcc-patches
CC: juzhe.zhong; 
pan2.li; 
yanzhang.wang; 
kito.cheng
Subject: [PATCH v2] RISC-V: Refine bswap16 auto vectorization code gen
From: Pan Li <pan2...@intel.com>

Update in v2

* Remove emit helper functions.
* Take expand_binop instead.

Original log:

This patch would like to refine the code gen for the bswap16.

We will have a VEC_PERM_EXPR after RTL expand when invoking
__builtin_bswap. It generates about 9 instructions in the
loop below, whether it is bswap16, bswap32 or bswap64.

  .L2:
1 vle16.v v4,0(a0)
2 vmv.v.x v2,a7
3 vand.vv v2,v6,v2
4 slli a2,a5,1
5 vrgatherei16.vv v1,v4,v2
6 sub a4,a4,a5
7 vse16.v v1,0(a3)
8 add a0,a0,a2
9 add a3,a3,a2
  bne a4,zero,.L2

But for bswap16 we can have even simpler code gen, with
only 7 instructions in the loop below.

  .L5
1 vle8.v  v2,0(a5)
2 addi a5,a5,32
3 vsrl.vi v4,v2,8
4 vsll.vi v2,v2,8
5 vor.vv  v4,v4,v2
6 vse8.v  v4,0(a4)
7 addi a4,a4,32
  bne a5,a6,.L5

Unfortunately, this approach would grow the loop to 13 and 24
instructions for bswap32 and bswap64 respectively. Thus, we refine
the code gen for bswap16 only, and leave bswap32 and bswap64
as they are.

gcc/ChangeLog:

* config/riscv/riscv-v.cc (shuffle_bswap_pattern): New func impl
for shuffle bswap.
(expand_vec_perm_const_1): Add handling for shuffle bswap pattern.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vls/perm-4.c: Adjust checker.
* gcc.target/riscv/rvv/autovec/unop/bswap16-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/bswap16-0.c: New test.

Signed-off-by: Pan Li <pan2...@intel.com>
---
gcc/config/riscv/riscv-v.cc   | 91 +++
.../riscv/rvv/autovec/unop/bswap16-0.c| 17 
.../riscv/rvv/autovec/unop/bswap16-run-0.c| 44 +
.../riscv/rvv/autovec/vls/bswap16-0.c | 34 +++
.../gcc.target/riscv/rvv/autovec/vls/perm-4.c |  4 +-
5 files changed, 188 insertions(+), 2 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/bswap16-0.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 23633a2a74d..c72e411f125 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -3030,6 +3030,95 @@ shuffle_decompress_patterns (struct expand_vec_perm_d *d)
   return true;
}
+static bool
+shuffle_bswap_pattern (struct expand_vec_perm_d *d)
+{
+  HOST_WIDE_INT diff;
+  unsigned i, size, step;
+
+  if (!d->one_vector_p || !d->perm[0].is_constant (&diff) || !diff)
+return false;
+
+  step = diff + 1;
+  size = step * GET_MODE_UNIT_BITSIZE (d->vmode);
+
+  switch (size)
+{
+case 16:
+  break;
+case 32:
+case 64:
+  /* We will have VEC_PERM_EXPR after rtl expand when invoking
+ __builtin_bswap. It will generate about 9 instructions in
+ loop as below, no matter it is bswap16, bswap32 or bswap64.
+.L2:
+ 1 vle16.v v4,0(a0)
+ 2 vmv.v.x v2,a7
+ 3 vand.vv v2,v6,v2
+ 4 sllia2,a5,1
+ 5 vrgatherei16.vv v1,v4,v2
+ 6 sub a4,a4,a5
+ 7 vse16.v v1,0(a3)
+ 8 add a0,a0,a2
+ 9 add a3,a3,a2
+bne a4,zero,.L2
+
+ But for bswap16 we may have a even simple code gen, which
+ has only 7 instructions in loop as below.
+.L5
+ 1 vle8.v  v2,0(a5)
+ 2 addia5,a5,32
+ 3 vsrl.vi v4,v2,8
+ 4 vsll.vi v2,v2,8
+ 5 vor.vv  v4,v4,v2
+ 6 vse8.v  v4,0(a4)
+ 7 addia4,a4,32
+bne a5,a6,.L5
+
+ Unfortunately, the instructions in loop will grow to 13 and 24
+ for bswap32 and bswap64. Thus, we will leverage vrgather (9 insn)
+ for both the bswap64 and bswap32, but take shift and or (7 insn)
+ for bswap16.
+   */
+default:
+  return false;
+}
+
+  for (i = 0; i < step; i++)
+if (!d->perm.series_p (i, step, diff - i, step))
+  return false;
+
+  if (d->testing_p)
+return true;
+
+  machine_mode vhi_mode;
+  poly_uint64 vhi_nunits = exact_div (GET_MODE_NUNITS (d->vmode), 2);
+
+  if (!get_vector_mode (HImode, vhi_nunits).exists (&vhi_mode))
+return false;
+
+  /* Step-1: Move op0 to src with VHI mode.  */
+  rtx src = gen_reg_rtx (vhi_mode);
+  emit_move_insn (src, gen_lowpart (vhi_mode, d->op0));
+
+  /* Step-2: Shift right 8 bits to dest.  */
+  rtx dest = expand_binop (vhi_mod

[PATCH] RISC-V Regression tests: Fix FAIL of pr97832* for RVV

2023-10-09 Thread Juzhe-Zhong
These cases are vectorized by vec_load_lanes with strided = 8 instead of SLP
with -fno-vect-cost-model.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/pr97832-2.c: Adapt dump check for target supports 
load_lanes with stride = 8.
* gcc.dg/vect/pr97832-3.c: Ditto.
* gcc.dg/vect/pr97832-4.c: Ditto.

---
 gcc/testsuite/gcc.dg/vect/pr97832-2.c | 4 ++--
 gcc/testsuite/gcc.dg/vect/pr97832-3.c | 4 ++--
 gcc/testsuite/gcc.dg/vect/pr97832-4.c | 4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr97832-2.c 
b/gcc/testsuite/gcc.dg/vect/pr97832-2.c
index 4f0578120ee..7d8d2691432 100644
--- a/gcc/testsuite/gcc.dg/vect/pr97832-2.c
+++ b/gcc/testsuite/gcc.dg/vect/pr97832-2.c
@@ -25,5 +25,5 @@ void foo1x1(double* restrict y, const double* restrict x, int 
clen)
   }
 }
 
-/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
-/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { 
! { vect_load_lanes && vect_strided8 } } } } } */
+/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" { target 
{ ! { vect_load_lanes && vect_strided8 } } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr97832-3.c 
b/gcc/testsuite/gcc.dg/vect/pr97832-3.c
index ad1225ddbaa..c0603e1432e 100644
--- a/gcc/testsuite/gcc.dg/vect/pr97832-3.c
+++ b/gcc/testsuite/gcc.dg/vect/pr97832-3.c
@@ -46,5 +46,5 @@ void foo(double* restrict y, const double* restrict x0, const 
double* restrict x
   }
 }
 
-/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
-/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { 
! { vect_load_lanes && vect_strided8 } } } } } */
+/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" { target 
{ ! { vect_load_lanes && vect_strided8 } } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr97832-4.c 
b/gcc/testsuite/gcc.dg/vect/pr97832-4.c
index 74ae27ff873..c03442816a4 100644
--- a/gcc/testsuite/gcc.dg/vect/pr97832-4.c
+++ b/gcc/testsuite/gcc.dg/vect/pr97832-4.c
@@ -24,5 +24,5 @@ void foo1x1(double* restrict y, const double* restrict x, int 
clen)
   }
 }
 
-/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
-/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { 
! { vect_load_lanes && vect_strided8 } } } } } */
+/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" { target 
{ ! { vect_load_lanes && vect_strided8 } } } } } */
-- 
2.36.3



Re: [PATCH] tree-optimization/111715 - improve TBAA for access paths with pun

2023-10-09 Thread Richard Biener
On Mon, 9 Oct 2023, Richard Biener wrote:

> The following improves basic TBAA for access paths formed by
> C++ abstraction where we are able to combine a path from an
> address-taking operation with a path based on that access using
> a pun to avoid memory access semantics on the address-taking part.
> 
> The trick is to identify the point the semantic memory access path
> starts which allows us to use the alias set of the outermost access
> instead of only that of the base of this path.
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu for all languages
> with a slightly different variant, re-bootstrapping/testing now
> (with doing the extra walk just for AGGREGATE_TYPE_P).

I ended up pushing the original version below after botching the
AGGREGATE_TYPE_P variant by improperly hiding the local variable.  It's
a micro-optimization not worth the trouble, I think.

Bootstrapped and tested on x86_64-unknown-linux-gnu for all languages, 
pushed.

>From 9cf3fca604db73866d0dc69dc88f95155027b3d7 Mon Sep 17 00:00:00 2001
From: Richard Biener 
Date: Mon, 9 Oct 2023 13:05:10 +0200
Subject: [PATCH] tree-optimization/111715 - improve TBAA for access paths with
 pun
To: gcc-patches@gcc.gnu.org

The following improves basic TBAA for access paths formed by
C++ abstraction where we are able to combine a path from an
address-taking operation with a path based on that access using
a pun to avoid memory access semantics on the address-taking part.

The trick is to identify the point the semantic memory access path
starts which allows us to use the alias set of the outermost access
instead of only that of the base of this path.

PR tree-optimization/111715
* alias.cc (reference_alias_ptr_type_1): When we have
a type-punning ref at the base search for the access
path part that's still semantically valid.

* gcc.dg/tree-ssa/ssa-fre-102.c: New testcase.
---
 gcc/alias.cc| 17 ++-
 gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c | 32 +
 2 files changed, 48 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c

diff --git a/gcc/alias.cc b/gcc/alias.cc
index 7c1af1fe96e..86d8f7104ad 100644
--- a/gcc/alias.cc
+++ b/gcc/alias.cc
@@ -774,7 +774,22 @@ reference_alias_ptr_type_1 (tree *t)
   && (TYPE_MAIN_VARIANT (TREE_TYPE (inner))
  != TYPE_MAIN_VARIANT
   (TREE_TYPE (TREE_TYPE (TREE_OPERAND (inner, 1))
-return TREE_TYPE (TREE_OPERAND (inner, 1));
+{
+  tree alias_ptrtype = TREE_TYPE (TREE_OPERAND (inner, 1));
+  /* Unless we have the (aggregate) effective type of the access
+somewhere on the access path.  If we have for example
+(&a->elts[i])->l.len exposed by abstraction we'd see
+MEM  [(B *)a].elts[i].l.len and we can use the alias set
+of 'len' when typeof (MEM  [(B *)a].elts[i]) == B for
+example.  See PR111715.  */
+  tree inner = *t;
+  while (handled_component_p (inner)
+&& (TYPE_MAIN_VARIANT (TREE_TYPE (inner))
+!= TYPE_MAIN_VARIANT (TREE_TYPE (alias_ptrtype
+   inner = TREE_OPERAND (inner, 0);
+  if (TREE_CODE (inner) == MEM_REF)
+   return alias_ptrtype;
+}
 
   /* Otherwise, pick up the outermost object that we could have
  a pointer to.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c
new file mode 100644
index 000..afd48050819
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c
@@ -0,0 +1,32 @@
+/* PR/111715 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-fre1" } */
+
+struct B {
+   struct { int len; } l;
+   long n;
+};
+struct A {
+   struct B elts[8];
+};
+
+static void
+set_len (struct B *b, int len)
+{
+  b->l.len = len;
+}
+
+static int
+get_len (struct B *b)
+{
+  return b->l.len;
+}
+
+int foo (struct A *a, int i, long *q)
+{
+  set_len (&a->elts[i], 1);
+  *q = 2;
+  return get_len (&a->elts[i]);
+}
+
+/* { dg-final { scan-tree-dump "return 1;" "fre1" } } */
-- 
2.35.3



Re: [PATCH] RISC-V: THead: Fix missing CFI directives for th.sdd in prologue.

2023-10-09 Thread Jeff Law




On 10/4/23 01:49, Xianmiao Qu wrote:

From: quxm 

When generating CFI directives for the store-pair instruction,
if we add two parallel REG_FRAME_RELATED_EXPR expr_lists like
   (expr_list:REG_FRAME_RELATED_EXPR (set (mem/c:DI (plus:DI (reg/f:DI 2 sp)
 (const_int 8 [0x8])) [1  S8 A64])
 (reg:DI 1 ra))
   (expr_list:REG_FRAME_RELATED_EXPR (set (mem/c:DI (reg/f:DI 2 sp) [1  S8 A64])
 (reg:DI 8 s0))
only the first expr_list will be recognized by dwarf2out_frame_debug
function. So, here we generate a SEQUENCE expression of REG_FRAME_RELATED_EXPR,
which includes two sub-expressions of RTX_FRAME_RELATED_P. Then the
dwarf2out_frame_debug_expr function will iterate through all the sub-expressions
and generate the corresponding CFI directives.
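
A minimal sketch of that shape (set1/set2/insn are placeholders, not the
exact code in th_mempair_save_regs):

  /* Mark both sets frame-related and wrap them in a single SEQUENCE so
     that dwarf2out_frame_debug_expr visits each element and emits the
     CFI directive for both stores.  */
  RTX_FRAME_RELATED_P (set1) = 1;
  RTX_FRAME_RELATED_P (set2) = 1;
  rtx seq = gen_rtx_SEQUENCE (VOIDmode, gen_rtvec (2, set1, set2));
  add_reg_note (insn, REG_FRAME_RELATED_EXPR, seq);
  RTX_FRAME_RELATED_P (insn) = 1;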

gcc/
* config/riscv/thead.cc (th_mempair_save_regs): Fix missing CFI
directives for store-pair instruction.

gcc/testsuite/
* gcc.target/riscv/xtheadmempair-4.c: New test.

Thanks.  I pushed this to the trunk.
jeff



[PATCH] RISC-V Regression test: Fix FAIL of slp-12a.c

2023-10-09 Thread Juzhe-Zhong
This case is vectorized by stride8 load_lanes.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/slp-12a.c: Adapt for stride 8 load_lanes.

---
 gcc/testsuite/gcc.dg/vect/slp-12a.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/slp-12a.c 
b/gcc/testsuite/gcc.dg/vect/slp-12a.c
index f0dda55acae..973de6ada21 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-12a.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-12a.c
@@ -76,5 +76,5 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { 
vect_strided8 && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { 
! { vect_strided8 && vect_int_mult } } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target { vect_strided8 && vect_int_mult } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target { { vect_strided8 && {! vect_load_lanes } } && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { 
target { ! { vect_strided8 && vect_int_mult } } } } } */
-- 
2.36.3



Re: [PATCH 1/3]middle-end: Refactor vectorizer loop conditionals and separate out IV to new variables

2023-10-09 Thread Richard Biener
On Mon, 2 Oct 2023, Tamar Christina wrote:

> Hi All,
> 
> This is extracted out of the patch series to support early break vectorization
> in order to simplify the review of that patch series.
> 
> The goal of this one is to separate out the refactoring from the new
> functionality.
> 
> This first patch separates out the vectorizer's definition of an exit to their
> own values inside loop_vinfo.  During vectorization we can have three separate
> copies for each loop: scalar, vectorized, epilogue.  The scalar loop can also 
> be
> the versioned loop before peeling.
> 
> Because of this we track 3 different exits inside loop_vinfo corresponding to
> each of these loops.  Additionally each function that uses an exit, when not
> obviously clear which exit is needed will now take the exit explicitly as an
> argument.
> 
> This is because often times the callers switch the loops being passed around.
> While the caller knows which loops it is, the callee does not.
> 
> For now the loop exits are simply initialized to same value as before 
> determined
> by single_exit (..).
> 
> No change in functionality is expected throughout this patch series.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-linux-gnu, and
> no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>   * tree-loop-distribution.cc (copy_loop_before): Pass exit explicitly.
>   (loop_distribution::distribute_loop): Bail out of not single exit.
>   * tree-scalar-evolution.cc (get_loop_exit_condition): New.
>   * tree-scalar-evolution.h (get_loop_exit_condition): New.
>   * tree-vect-data-refs.cc (vect_enhance_data_refs_alignment): Pass exit
>   explicitly.
>   * tree-vect-loop-manip.cc (vect_set_loop_condition_partial_vectors,
>   vect_set_loop_condition_partial_vectors_avx512,
>   vect_set_loop_condition_normal, vect_set_loop_condition): Explicitly
>   take exit.
>   (slpeel_tree_duplicate_loop_to_edge_cfg): Explicitly take exit and
>   return new peeled corresponding peeled exit.
>   (slpeel_can_duplicate_loop_p): Explicitly take exit.
>   (find_loop_location): Handle not knowing an explicit exit.
>   (vect_update_ivs_after_vectorizer, vect_gen_vector_loop_niters_mult_vf,
>   find_guard_arg, slpeel_update_phi_nodes_for_loops,
>   slpeel_update_phi_nodes_for_guard2): Use new exits.
>   (vect_do_peeling): Update bookkeeping to keep track of exits.
>   * tree-vect-loop.cc (vect_get_loop_niters): Explicitly take exit to
>   analyze.
>   (vec_init_loop_exit_info): New.
>   (_loop_vec_info::_loop_vec_info): Initialize vec_loop_iv,
>   vec_epilogue_loop_iv, scalar_loop_iv.
>   (vect_analyze_loop_form): Initialize exits.
>   (vect_create_loop_vinfo): Set main exit.
>   (vect_create_epilog_for_reduction, vectorizable_live_operation,
>   vect_transform_loop): Use it.
>   (scale_profile_for_vect_loop): Explicitly take exit to scale.
>   * tree-vectorizer.cc (set_uid_loop_bbs): Initialize loop exit.
>   * tree-vectorizer.h (LOOP_VINFO_IV_EXIT, LOOP_VINFO_EPILOGUE_IV_EXIT,
>   LOOP_VINFO_SCALAR_IV_EXIT): New.
>   (struct loop_vec_info): Add vec_loop_iv, vec_epilogue_loop_iv,
>   scalar_loop_iv.
>   (vect_set_loop_condition, slpeel_can_duplicate_loop_p,
>   slpeel_tree_duplicate_loop_to_edge_cfg): Take explicit exits.
>   (vec_init_loop_exit_info): New.
>   (struct vect_loop_form_info): Add loop_exit.
> 
> --- inline copy of patch -- 
> diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
> index 
> a28470b66ea935741a61fb73961ed7c927543a3d..902edc49ab588152a5b845f2c8a42a7e2a1d6080
>  100644
> --- a/gcc/tree-loop-distribution.cc
> +++ b/gcc/tree-loop-distribution.cc
> @@ -949,7 +949,8 @@ copy_loop_before (class loop *loop, bool 
> redirect_lc_phi_defs)
>edge preheader = loop_preheader_edge (loop);
>  
>initialize_original_copy_tables ();
> -  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader);
> +  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, single_exit (loop), 
> NULL,
> + NULL, preheader, NULL);
>gcc_assert (res != NULL);
>  
>/* When a not last partition is supposed to keep the LC PHIs computed
> @@ -3043,6 +3044,24 @@ loop_distribution::distribute_loop (class loop *loop,
>return 0;
>  }
>  
> +  /* Loop distribution only does prologue peeling but we still need to
> + initialize loop exit information.  However we only support single exits 
> at
> + the moment.  As such, should exit information not have been provided 
> and we
> + have more than one exit, bail out.  */
> +  if (!single_exit (loop))
> +{
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file,
> +  "Loop %d not distributed: too many exits.\n",
> +  loop->num);
> +
> +  free_rdg (rdg);
> +  loop_nest.release ();
> +  free_d

[PATCH] RISC-V Regression test: Adapt SLP tests like ARM SVE

2023-10-09 Thread Juzhe-Zhong
Like ARM SVE, RVV is vectorizing these 2 cases in the same way.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/slp-23.c: Add RVV like ARM SVE.
* gcc.dg/vect/slp-perm-10.c: Ditto.

---
 gcc/testsuite/gcc.dg/vect/slp-23.c  | 2 +-
 gcc/testsuite/gcc.dg/vect/slp-perm-10.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/slp-23.c 
b/gcc/testsuite/gcc.dg/vect/slp-23.c
index d32ee5ba73b..8836acf0330 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-23.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-23.c
@@ -114,5 +114,5 @@ int main (void)
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target { ! vect_perm } } } } */
 /* SLP fails for the second loop with variable-length SVE because
the load size is greater than the minimum vector size.  */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { 
target vect_perm xfail { aarch64_sve && vect_variable_length } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { 
target vect_perm xfail { { aarch64_sve || riscv_v } && vect_variable_length } } 
} } */
   
diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-10.c 
b/gcc/testsuite/gcc.dg/vect/slp-perm-10.c
index 2cce30c2444..03de4c61b50 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-perm-10.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-perm-10.c
@@ -53,4 +53,4 @@ int main ()
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target 
vect_perm } } } */
 /* SLP fails for variable-length SVE because the load size is greater
than the minimum vector size.  */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target vect_perm xfail { aarch64_sve && vect_variable_length } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target vect_perm xfail { { aarch64_sve || riscv_v } && vect_variable_length } } 
} } */
-- 
2.36.3



[PATCH] RISC-V Regression test: Fix slp-perm-4.c FAIL for RVV

2023-10-09 Thread Juzhe-Zhong
RVV vectorizes it with stride5 load_lanes.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/slp-perm-4.c: Adapt test for stride5 load_lanes.

---
 gcc/testsuite/gcc.dg/vect/slp-perm-4.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-4.c 
b/gcc/testsuite/gcc.dg/vect/slp-perm-4.c
index 107968f1f7c..f4bda39c837 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-perm-4.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-perm-4.c
@@ -115,4 +115,4 @@ int main (int argc, const char* argv[])
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
 /* { dg-final { scan-tree-dump-times "gaps requires scalar epilogue loop" 0 
"vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } 
} */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target { ! { vect_load_lanes && vect_strided5 } } } } } */
-- 
2.36.3



[PATCH] RISC-V Regression test: Fix FAIL of slp-reduc-4.c for RVV

2023-10-09 Thread Juzhe-Zhong
RVV vectorizes this case with stride8 load_lanes.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/slp-reduc-4.c: Adapt test for stride8 load_lanes.

---
 gcc/testsuite/gcc.dg/vect/slp-reduc-4.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/slp-reduc-4.c 
b/gcc/testsuite/gcc.dg/vect/slp-reduc-4.c
index 15f5c259e98..e2fe01bb13d 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-reduc-4.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-reduc-4.c
@@ -60,6 +60,6 @@ int main (void)
 /* For variable-length SVE, the number of scalar statements in the
reduction exceeds the number of elements in a 128-bit granule.  */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target { ! vect_multiple_sizes } xfail { vect_no_int_min_max || { aarch64_sve 
&& vect_variable_length } } } } } */
-/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { 
vect_multiple_sizes } } } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { 
vect_multiple_sizes && { ! { vect_load_lanes && vect_strided8 } } } } } } */
 /* { dg-final { scan-tree-dump-times "VEC_PERM_EXPR" 0 "vect" { xfail { 
aarch64_sve && vect_variable_length } } } } */
 
-- 
2.36.3



Re: [PATCH] wide-int: Allow up to 16320 bits wide_int and change widest_int precision to 32640 bits [PR102989]

2023-10-09 Thread Jakub Jelinek
On Mon, Oct 09, 2023 at 01:54:19PM +0100, Richard Sandiford wrote:
> > I've additionally built it with the incremental attached patch and
> > on make -C gcc check-gcc check-g++ -j32 -k it didn't show any
> > wide_int/widest_int heap allocations unless a > 128-bit _BitInt or wb/uwb
> > constant needing > 128-bit _BitInt was used in a testcase.
> 
> Overall it looks really good to me FWIW.  Some comments about the
> wide-int.h changes below.  Will send a separate message about wide-int.cc.

Thanks, just quick answers; I will work on patch adjustments after trying to
get rid of rwide_int.  It seems dwarf2out has very limited needs from it: just
some routine to construct it in GC'ed memory (and never change it afterwards)
from a const wide_int_ref & or so, plus working operator ==,
get_precision, elt, get_len and get_val methods.  So I think we could just
have a struct dw_wide_int { unsigned int prec, len; HOST_WIDE_INT val[1]; };
and perform the methods on it after converting to a storage ref.
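
For illustration only, the shape would be roughly the following (GTY
details and exact semantics to be sorted out in the actual patch):

  struct GTY((variable_size)) dw_wide_int
  {
    unsigned int prec;
    unsigned int len;
    HOST_WIDE_INT val[1];   /* Trailing array, LEN elements in use.  */

    unsigned int get_precision () const { return prec; }
    unsigned int get_len () const { return len; }
    const HOST_WIDE_INT *get_val () const { return val; }
    /* Elements above LEN are the sign extension of the top element.  */
    HOST_WIDE_INT elt (unsigned int i) const
    { return i < len ? val[i] : (val[len - 1] < 0 ? -1 : 0); }
  };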

> > @@ -380,7 +406,11 @@ namespace wi
> >  
> >  /* The integer has a constant precision (known at GCC compile time)
> > and is signed.  */
> > -CONST_PRECISION
> > +CONST_PRECISION,
> > +
> > +/* Like CONST_PRECISION, but with WIDEST_INT_MAX_PRECISION or larger
> > +   precision where not all elements of arrays are always present.  */
> > +WIDEST_CONST_PRECISION
> >};
> 
> Sorry to bring this up so late, but how about using INL_CONST_PRECISION
> for the fully inline case and CONST_PRECISION for the general case?
> That seems more consistent with the other naming in the patch.

Ok.

> > @@ -482,6 +541,18 @@ namespace wi
> >};
> >  
> >template 
> > +  struct binary_traits  > WIDEST_CONST_PRECISION>
> > +  {
> > +STATIC_ASSERT (int_traits ::precision == int_traits 
> > ::precision);
> 
> Should this assert for equal inl_precision too?  Although it probably
> isn't necessary computationally, it seems a bit arbitrary to pick the
> first inl_precision...

inl_precision is only used for widest_int/widest2_int, so if precision is
equal, inl_precision is as well.

> > +inline wide_int_storage::wide_int_storage (const wide_int_storage &x)
> > +{
> > +  len = x.len;
> > +  precision = x.precision;
> > +  if (UNLIKELY (precision > WIDE_INT_MAX_INL_PRECISION))
> > +{
> > +  u.valp = XNEWVEC (HOST_WIDE_INT, CEIL (precision, 
> > HOST_BITS_PER_WIDE_INT));
> > +  memcpy (u.valp, x.u.valp, len * sizeof (HOST_WIDE_INT));
> > +}
> > +  else if (LIKELY (precision))
> > +memcpy (u.val, x.u.val, len * sizeof (HOST_WIDE_INT));
> > +}
> 
> Does the variable-length memcpy pay for itself?  If so, perhaps that's a
> sign that we should have a smaller inline buffer for this class (say 2 HWIs).

Guess I'll try to see what results in smaller .text size.

> > +namespace wi
> > +{
> > +  template 
> > +  struct int_traits < widest_int_storage  >
> > +  {
> > +static const enum precision_type precision_type = 
> > WIDEST_CONST_PRECISION;
> > +static const bool host_dependent_precision = false;
> > +static const bool is_sign_extended = true;
> > +static const bool needs_write_val_arg = true;
> > +static const unsigned int precision
> > +  = N / WIDE_INT_MAX_INL_PRECISION * WIDEST_INT_MAX_PRECISION;
> 
> What's the reasoning behind this calculation?  It would give 0 for
> N < WIDE_INT_MAX_INL_PRECISION, and the "MAX" suggests that N
> shouldn't be > WIDE_INT_MAX_INL_PRECISION either.
> 
> I wonder whether this should be a second template parameter, with an
> assert that precision > inl_precision.

Maybe.  Yet another option would be to always use WIDE_INT_MAX_INL_PRECISION
as the inline precision (and use N template parameter just to decide about
the overall precision), regardless of whether it is widest_int or
widest2_int.  The latter is very rare, and it is even rarer that something
wouldn't fit into the WIDE_INT_MAX_INL_PRECISION when not using _BitInt.
The reason for introducing inl_precision was to avoid the heap allocation
for widest2_int unless _BitInt is in use, but maybe that isn't worth it.

> Nit: might format more naturally with:
> 
>   using res_traits = wi::int_traits :
>   ...

Ok.

> > @@ -2203,6 +2781,9 @@ wi::sext (const T &x, unsigned int offse
> >unsigned int precision = get_precision (result);
> >WIDE_INT_REF_FOR (T) xi (x, precision);
> >  
> > +  if (result.needs_write_val_arg)
> > +val = result.write_val (MAX (xi.len,
> > +CEIL (offset, HOST_BITS_PER_WIDE_INT)));
> 
> Why MAX rather than MIN?

Because it needs to be an upper bound.
In this case, sext_large has
  unsigned int len = offset / HOST_BITS_PER_WIDE_INT;
  /* Extending beyond the precision is a no-op.  If we have only stored
 OFFSET bits or fewer, the rest are already signs.  */
  if (offset >= precision || len >= xlen)
{
  for (unsigned i = 0; i < xlen; ++i)
val[i] = xval[i];
  return xlen;
}
  unsigned int suboffset = of

Re: [PATCH] RISC-V Regression test: Adapt SLP tests like ARM SVE

2023-10-09 Thread Jeff Law




On 10/9/23 07:37, Juzhe-Zhong wrote:

Like ARM SVE, RVV is vectorizing these 2 cases in the same way.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/slp-23.c: Add RVV like ARM SVE.
* gcc.dg/vect/slp-perm-10.c: Ditto.

OK
jeff


Re: [PATCH] RISC-V Regression test: Fix slp-perm-4.c FAIL for RVV

2023-10-09 Thread Jeff Law




On 10/9/23 07:39, Juzhe-Zhong wrote:

RVV vectorize it with stride5 load_lanes.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/slp-perm-4.c: Adapt test for stride5 load_lanes.

OK.

As a follow-up, would it make sense to test the .vect dump for something 
else in the ! {vec_load_lanes && vect_strided5 } case to verify that it 
does and continues to be vectorized for that configuration?


jeff


Re: [PATCH] RISC-V Regression test: Fix FAIL of slp-reduc-4.c for RVV

2023-10-09 Thread Jeff Law




On 10/9/23 07:41, Juzhe-Zhong wrote:

RVV vectortizes this case with stride8 load_lanes.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/slp-reduc-4.c: Adapt test for stride8 load_lanes.
OK.  Similar question as my last ack.  Do we want a follow-up here which 
tests the .vect dump for the ! { vect_load_lanes && vec_strided8 } case?


jeff


Re: [PATCH] RISC-V Regression test: Fix FAIL of slp-12a.c

2023-10-09 Thread Jeff Law




On 10/9/23 07:35, Juzhe-Zhong wrote:

This case is vectorized by stride8 load_lanes.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/slp-12a.c: Adapt for stride 8 load_lanes.

OK.  Same question as last two ACKs.

jeff


Re: [PATCH] RISC-V Regression tests: Fix FAIL of pr97832* for RVV

2023-10-09 Thread Jeff Law




On 10/9/23 07:15, Juzhe-Zhong wrote:

These cases are vectorized by vec_load_lanes with strided = 8 instead of SLP
with -fno-vect-cost-model.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/pr97832-2.c: Adapt dump check for target supports 
load_lanes with stride = 8.
* gcc.dg/vect/pr97832-3.c: Ditto.
* gcc.dg/vect/pr97832-4.c: Ditto.

OK.  Same question as last 3 acks.

jeff


Re: [PATCH V2] RISC-V: Support movmisalign of RVV VLA modes

2023-10-09 Thread Robin Dapp
Thanks, for now this LGTM.

Regards
 Robin


RE: [PATCH V2] RISC-V: Support movmisalign of RVV VLA modes

2023-10-09 Thread Li, Pan2
Committed, thanks Robin.

Pan

-Original Message-
From: Robin Dapp  
Sent: Monday, October 9, 2023 9:54 PM
To: Juzhe-Zhong ; gcc-patches@gcc.gnu.org
Cc: rdapp@gmail.com; kito.ch...@gmail.com; kito.ch...@sifive.com; 
jeffreya...@gmail.com
Subject: Re: [PATCH V2] RISC-V: Support movmisalign of RVV VLA modes

Thanks, for now this LGTM.

Regards
 Robin


Re: [pushed] analyzer: improvements to out-of-bounds diagrams [PR111155]

2023-10-09 Thread David Malcolm
On Mon, 2023-10-09 at 12:09 +0200, Tobias Burnus wrote:
> Hi David,
> 
> your commit breaks compilation with GCC < 6, here with GCC 5.2:
> 
> gcc/analyzer/access-diagram.cc: In member function 'void
> ana::boundaries::add(const ana::access_range&,
> ana::boundaries::kind)':
> gcc/analyzer/access-diagram.cc:655:20: error: 'kind' is not a class,
> namespace, or enumeration
>     (kind == kind::HARD) ? "HARD" : "soft");
>  ^
> The problem is ...
> 
> On 09.10.23 00:58, David Malcolm wrote:
> 
> > Update out-of-bounds diagrams to show existing string values,
> > diff --git a/gcc/analyzer/access-diagram.cc b/gcc/analyzer/access-
> > diagram.cc
> > index a51d594b5b2..2197ec63f53 100644
> > --- a/gcc/analyzer/access-diagram.cc
> > +++ b/gcc/analyzer/access-diagram.cc
> > @@ -630,8 +630,8 @@ class boundaries
> >   public:
> >     enum class kind { HARD, SOFT};
> 
> ...
> 
> > @@ -646,6 +646,15 @@ public:
> 
> Just above the following diff is the line:
> 
>    void add (const access_range &range, enum kind kind)
> 
> >     {
> >   add (range.m_start, kind);
> >   add (range.m_next, kind);
> > +    if (m_logger)
> > +  {
> > + m_logger->start_log_line ();
> > + m_logger->log_partial ("added access_range: ");
> > + range.dump_to_pp (m_logger->get_printer (), true);
> > + m_logger->log_partial (" (%s)",
> > +    (kind == kind::HARD) ? "HARD" :
> > "soft");
> > + m_logger->end_log_line ();
> 
> Actual problem:
> 
> Playing around also with the compiler explorer shows that GCC 5.2 or
> likewise 5.5
> do not like the variable (PARAM_DECL) name "kind" combined with 
> "kind::HARD".
> 
> The following works:
> (A) Using "kind == boundaries::kind::HARD" - i.e. adding
> "boundaries::"
> (B) Renaming the parameter name "kind" to something else - like "k"
> as used
>  in the other functions.
> 
> Can you fix it?

Sorry about the breakage, and thanks for the investigation.

Does the following patch fix the build for you?
Thanks


gcc/analyzer/ChangeLog:
* access-diagram.cc (boundaries::add): Explicitly state
"boundaries::" scope for "kind" enum.
---
 gcc/analyzer/access-diagram.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/analyzer/access-diagram.cc b/gcc/analyzer/access-diagram.cc
index 2197ec63f53..c7d190e3188 100644
--- a/gcc/analyzer/access-diagram.cc
+++ b/gcc/analyzer/access-diagram.cc
@@ -652,7 +652,8 @@ public:
m_logger->log_partial ("added access_range: ");
range.dump_to_pp (m_logger->get_printer (), true);
m_logger->log_partial (" (%s)",
-  (kind == kind::HARD) ? "HARD" : "soft");
+  (kind == boundaries::kind::HARD)
+  ? "HARD" : "soft");
m_logger->end_log_line ();
   }
   }
-- 
2.26.3




Re: [PATCH v1 1/4] options: Define TARGET__P and TARGET__OPTS_P macro for Mask and InverseMask

2023-10-09 Thread Jeff Law




On 10/3/23 03:09, Kito Cheng wrote:

We use the TARGET__P macro to test a Mask or InverseMask against the
user-specified target_variable; however, we may want to test against a
specific gcc_options variable rather than the target_variable.

RISC-V, for example, has defined lots of Masks with TargetVariable, which is
not easy to use, because that means we need to know which Mask is associated
with which TargetVariable; taking a gcc_options variable is a better
interface for such a use case.
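
Illustrative usage only (FOO is a made-up mask name; the exact macro
shapes are whatever opth-gen.awk generates in this patch):

  /* Test a mask against an explicit gcc_options copy instead of the
     global target_variable.  */
  static bool
  foo_enabled_p (struct gcc_options *opts)
  {
    return TARGET_FOO_P (opts->x_target_flags);
  }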

gcc/ChangeLog:

* doc/options.texi (Mask): Document TARGET__P and
TARGET__OPTS_P.
(InverseMask): Ditto.
* opth-gen.awk (Mask): Generate TARGET__P and
TARGET__OPTS_P macro.
(InverseMask): Ditto.
Doesn't this need to be updated to avoid multi-dimensional arrays in awk 
and rebased?


Jeff


Re: [PATCH v1 2/4] RISC-V: Refactor riscv_option_override and riscv_convert_vector_bits. [NFC]

2023-10-09 Thread Jeff Law




On 10/3/23 03:09, Kito Cheng wrote:

Allow those functions to apply settings from a local gcc_options rather
than the global options.

Preparatory work for the target attribute; separate this change for easier
review since it's an NFC.

gcc/ChangeLog:

* config/riscv/riscv.cc (riscv_convert_vector_bits): Get the setting
from the argument rather than from the global setting.
(riscv_override_options_internal): New, split from
riscv_override_options; also takes a gcc_options argument.
(riscv_option_override): Split most parts out into
riscv_override_options_internal.

OK once prerequisites are approved and installed.

jeff


Re: [PATCH] RISC-V Regression test: Fix slp-perm-4.c FAIL for RVV

2023-10-09 Thread juzhe.zhong
Do you mean add a check whether it is vectorized or not? Sounds reasonable, I can add that in another patch.

 Replied Message 
From: Jeff Law
Date: 10/09/2023 21:51
To: Juzhe-Zhong, gcc-patches@gcc.gnu.org
Cc: rguent...@suse.de
Subject: Re: [PATCH] RISC-V Regression test: Fix slp-perm-4.c FAIL for RVV

On 10/9/23 07:39, Juzhe-Zhong wrote:
> RVV vectorize it with stride5 load_lanes.
>  
> gcc/testsuite/ChangeLog:
>  
>     * gcc.dg/vect/slp-perm-4.c: Adapt test for stride5 load_lanes.
OK.

As a follow-up, would it make sense to test the .vect dump for something  
else in the ! {vec_load_lanes && vect_strided5 } case to verify that it  
does and continues to be vectorized for that configuration?

jeff



RE: [PATCH] RISC-V Regression test: Adapt SLP tests like ARM SVE

2023-10-09 Thread Li, Pan2
Committed, thanks Jeff.

Pan

-Original Message-
From: Jeff Law  
Sent: Monday, October 9, 2023 9:49 PM
To: Juzhe-Zhong ; gcc-patches@gcc.gnu.org
Cc: rguent...@suse.de
Subject: Re: [PATCH] RISC-V Regression test: Adapt SLP tests like ARM SVE



On 10/9/23 07:37, Juzhe-Zhong wrote:
> Like ARM SVE, RVV is vectorizing these 2 cases in the same way.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/slp-23.c: Add RVV like ARM SVE.
>   * gcc.dg/vect/slp-perm-10.c: Ditto.
OK
jeff


Re: [PATCH] RISC-V Regression test: Fix slp-perm-4.c FAIL for RVV

2023-10-09 Thread Jeff Law




On 10/9/23 08:21, juzhe.zhong wrote:

Do you mean add a check whether it is vectorized or not?

Yes.



Sounds reasonable, I can add that in another patch.

Sounds good.  Thanks.

jeff


RE: [PATCH] RISC-V Regression test: Fix FAIL of slp-reduc-4.c for RVV

2023-10-09 Thread Li, Pan2
Committed, thanks Jeff.

Pan

-Original Message-
From: Jeff Law  
Sent: Monday, October 9, 2023 9:52 PM
To: Juzhe-Zhong ; gcc-patches@gcc.gnu.org
Cc: rguent...@suse.de
Subject: Re: [PATCH] RISC-V Regression test: Fix FAIL of slp-reduc-4.c for RVV



On 10/9/23 07:41, Juzhe-Zhong wrote:
> RVV vectortizes this case with stride8 load_lanes.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/slp-reduc-4.c: Adapt test for stride8 load_lanes.
OK.  Similar question as my last ack.  Do we want a follow-up here which 
tests the .vect dump for the ! { vect_load_lanes && vec_strided8 } case?

jeff


RE: [PATCH] RISC-V Regression test: Fix FAIL of slp-12a.c

2023-10-09 Thread Li, Pan2
Committed, thanks Jeff.

Pan

-Original Message-
From: Jeff Law  
Sent: Monday, October 9, 2023 9:53 PM
To: Juzhe-Zhong ; gcc-patches@gcc.gnu.org
Cc: rguent...@suse.de
Subject: Re: [PATCH] RISC-V Regression test: Fix FAIL of slp-12a.c



On 10/9/23 07:35, Juzhe-Zhong wrote:
> This case is vectorized by stride8 load_lanes.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/slp-12a.c: Adapt for stride 8 load_lanes.
OK.  Same question as last two ACKs.

jeff


RE: [PATCH] RISC-V Regression tests: Fix FAIL of pr97832* for RVV

2023-10-09 Thread Li, Pan2
Committed, thanks Jeff.

Pan

-Original Message-
From: Jeff Law  
Sent: Monday, October 9, 2023 9:53 PM
To: Juzhe-Zhong ; gcc-patches@gcc.gnu.org
Cc: rguent...@suse.de
Subject: Re: [PATCH] RISC-V Regression tests: Fix FAIL of pr97832* for RVV



On 10/9/23 07:15, Juzhe-Zhong wrote:
> These cases are vectorized by vec_load_lanes with strided = 8 instead of SLP
> with -fno-vect-cost-model.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/pr97832-2.c: Adapt dump check for target supports 
> load_lanes with stride = 8.
>   * gcc.dg/vect/pr97832-3.c: Ditto.
>   * gcc.dg/vect/pr97832-4.c: Ditto.
OK.  Same question as last 3 acks.

jeff


RE: [PATCH] RISC-V Regression test: Fix slp-perm-4.c FAIL for RVV

2023-10-09 Thread Li, Pan2
Committed, thanks Jeff.

Pan

-Original Message-
From: Jeff Law  
Sent: Monday, October 9, 2023 10:28 PM
To: juzhe.zhong 
Cc: gcc-patches@gcc.gnu.org; rguent...@suse.de
Subject: Re: [PATCH] RISC-V Regression test: Fix slp-perm-4.c FAIL for RVV



On 10/9/23 08:21, juzhe.zhong wrote:
> Do you mean add a check whether it is vectorized or not?
Yes.

> 
> Sounds reasonable, I can add that in another patch.
Sounds good.  Thanks.

jeff


Re: [PATCH] sso-string@gnu-versioned-namespace [PR83077]

2023-10-09 Thread Iain Sandoe
Hi François,

> On 7 Oct 2023, at 20:32, François Dumont  wrote:
> 
> I've been told that previous patch generated with 'git diff -b' was not 
> applying properly so here is the same patch again with a simple 'git diff'.

Thanks, that did fix it - there are some trailing whitespaces in the config 
files, but I suspect that they need to be there since those have values 
appended during the configuration.

Anyway, with this + the coroutines and contract v2 (weak def) fix, plus a local 
patch to enable versioned namespace on Darwin, I get results comparable with 
the non-versioned case - but one more patchlet is needed on yours (to allow 
for targets using emulated TLS):

diff --git a/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver 
b/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
index 9fab8bead15..b7167fc0c2f 100644
--- a/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
+++ b/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
@@ -78,6 +78,7 @@ GLIBCXX_8.0 {
 
 # thread/mutex/condition_variable/future
 __once_proxy;
+__emutls_v._ZNSt3__81?__once_call*;
 
 # std::__convert_to_v
 _ZNSt3__814__convert_to_v*;


thanks
Iain

> 
> 
> On 07/10/2023 14:25, François Dumont wrote:
>> Hi
>> 
>> Here is a rebased version of this patch.
>> 
>> There are few test failures when running 'make check-c++' but nothing new.
>> 
>> Still, there are 2 patches awaiting validation to fix some of them, PR 
>> c++/111524 to fix another bunch and I fear that we will have to live with 
>> the others.
>> 
>> libstdc++: [_GLIBCXX_INLINE_VERSION] Use cxx11 abi [PR83077]
>> 
>> Use cxx11 abi when activating versioned namespace mode. To do support
>> a new configuration mode where !_GLIBCXX_USE_DUAL_ABI and 
>> _GLIBCXX_USE_CXX11_ABI.
>> 
>> The main change is that std::__cow_string is now defined whenever 
>> _GLIBCXX_USE_DUAL_ABI
>> or _GLIBCXX_USE_CXX11_ABI is true. Implementation is using available 
>> std::string in
>> case of dual abi and a subset of it when it's not.
>> 
>> On the other side std::__sso_string is defined only when 
>> _GLIBCXX_USE_DUAL_ABI is true
>> and _GLIBCXX_USE_CXX11_ABI is false. Meaning that std::__sso_string is a 
>> typedef for the
>> cow std::string implementation when dual abi is disabled and cow string 
>> is being used.
>> 
>> libstdcxx-v3/ChangeLog:
>> 
>> PR libstdc++/83077
>> * acinclude.m4 [GLIBCXX_ENABLE_LIBSTDCXX_DUAL_ABI]: Default to 
>> "new" libstdcxx abi
>> when enable_symvers is gnu-versioned-namespace.
>> * config/locale/dragonfly/monetary_members.cc 
>> [!_GLIBCXX_USE_DUAL_ABI]: Define money_base
>> members.
>> * config/locale/generic/monetary_members.cc 
>> [!_GLIBCXX_USE_DUAL_ABI]: Likewise.
>> * config/locale/gnu/monetary_members.cc 
>> [!_GLIBCXX_USE_DUAL_ABI]: Likewise.
>> * config/locale/gnu/numeric_members.cc
>> [!_GLIBCXX_USE_DUAL_ABI](__narrow_multibyte_chars): Define.
>> * configure: Regenerate.
>> * include/bits/c++config
>> [_GLIBCXX_INLINE_VERSION](_GLIBCXX_NAMESPACE_CXX11, 
>> _GLIBCXX_BEGIN_NAMESPACE_CXX11):
>> Define empty.
>> [_GLIBCXX_INLINE_VERSION](_GLIBCXX_END_NAMESPACE_CXX11, 
>> _GLIBCXX_DEFAULT_ABI_TAG):
>> Likewise.
>> * include/bits/cow_string.h [!_GLIBCXX_USE_CXX11_ABI]: Define a 
>> light version of COW
>> basic_string as __std_cow_string for use in stdexcept.
>> * include/std/stdexcept [_GLIBCXX_USE_CXX11_ABI]: Define 
>> __cow_string.
>> (__cow_string(const char*)): New.
>> (__cow_string::c_str()): New.
>> * python/libstdcxx/v6/printers.py (StdStringPrinter::__init__): 
>> Set self.new_string to True
>> when std::__8::basic_string type is found.
>> * src/Makefile.am 
>> [ENABLE_SYMVERS_GNU_NAMESPACE](ldbl_alt128_compat_sources): Define empty.
>> * src/Makefile.in: Regenerate.
>> * src/c++11/Makefile.am (cxx11_abi_sources): Rename into...
>> (dual_abi_sources): ...this. Also move cow-local_init.cc, 
>> cxx11-hash_tr1.cc,
>> cxx11-ios_failure.cc entries to...
>> (sources): ...this.
>> (extra_string_inst_sources): Move cow-fstream-inst.cc, 
>> cow-sstream-inst.cc, cow-string-inst.cc,
>> cow-string-io-inst.cc, cow-wtring-inst.cc, 
>> cow-wstring-io-inst.cc, cxx11-locale-inst.cc,
>> cxx11-wlocale-inst.cc entries to...
>> (inst_sources): ...this.
>> * src/c++11/Makefile.in: Regenerate.
>> * src/c++11/cow-fstream-inst.cc [_GLIBCXX_USE_CXX11_ABI]: Skip 
>> definitions.
>> * src/c++11/cow-locale_init.cc [_GLIBCXX_USE_CXX11_ABI]: Skip 
>> definitions.
>> * src/c++11/cow-sstream-inst.cc [_GLIBCXX_USE_CXX11_ABI]: Skip 
>> definitions.
>> * src/c++11/cow-stdexce

Re: [PATCH] ifcvt/vect: Emit COND_ADD for conditional scalar reduction.

2023-10-09 Thread Richard Sandiford
Robin Dapp  writes:
>> It'd be good to expand on this comment a bit.  What kind of COND are you
>> anticipating?  A COND with the neutral op as the else value, so that the
>> PLUS_EXPR (or whatever) can remain unconditional?  If so, it would be
>> good to sketch briefly how that happens, and why it's better than using
>> the conditional PLUS_EXPR.
>> 
>> If that's the reason, perhaps we want a single-use check as well.
>> It's possible that OP1 is used elsewhere in the loop body, in a
>> context that would prefer a different else value.
>
> Would something like the following on top work?
>
> -  /* If possible try to create an IFN_COND_ADD instead of a COND_EXPR and
> - a PLUS_EXPR.  Don't do this if the reduction def operand itself is
> +  /* If possible create a COND_OP instead of a COND_EXPR and an OP_EXPR.
> + The COND_OP will have a neutral_op else value.
> +
> + This allows re-using the mask directly in a masked reduction instead
> + of creating a vector merge (or similar) and then an unmasked reduction.
> +
> + Don't do this if the reduction def operand itself is
>   a vectorizable call as we can create a COND version of it directly.  */

It wasn't very clear, sorry, but it was the last sentence I was asking
for clarification on, not the other bits.  Why do we want to avoid
generating a COND_ADD when the operand is a vectorisable call?

Thanks,
Richard

>
>if (ifn != IFN_LAST
>&& vectorized_internal_fn_supported_p (ifn, TREE_TYPE (lhs))
> -  && try_cond_op && !swap)
> +  && use_cond_op && !swap && has_single_use (op1))
>
> Regards
>  Robin


[PATCH] wide-int: Remove rwide_int, introduce dw_wide_int

2023-10-09 Thread Jakub Jelinek
On Mon, Oct 09, 2023 at 12:55:02PM +0200, Jakub Jelinek wrote:
> This makes wide_int unusable in GC structures, so for dwarf2out
> which was the only place which needed it there is a new rwide_int type
> (restricted wide_int) which supports only up to RWIDE_INT_MAX_ELTS limbs
> inline and is trivially copyable (dwarf2out should never deal with large
> _BitInt constants, those should have been lowered earlier).

As discussed on IRC, the dwarf2out.{h,cc} needs are actually quite limited:
it just needs to allocate the new GC structures val_wide points to (constructed
from some const wide_int_ref &) and to call the operator==,
get_precision, elt, get_len and get_val methods on them.
Even trailing_wide_int would be overkill for that; the following just adds
a new struct with precision/len and trailing val array members and
implements the needed methods (only 2 of them using a wide_int_ref constructed
from those).

Incremental patch, so far compile time tested only:

--- gcc/wide-int.h.jj   2023-10-09 14:37:45.878940132 +0200
+++ gcc/wide-int.h  2023-10-09 16:06:39.326805176 +0200
@@ -27,7 +27,7 @@ along with GCC; see the file COPYING3.
other longer storage GCC representations (rtl and tree).
 
The actual precision of a wide_int depends on the flavor.  There
-   are four predefined flavors:
+   are three predefined flavors:
 
  1) wide_int (the default).  This flavor does the math in the
  precision of its input arguments.  It is assumed (and checked)
@@ -80,12 +80,7 @@ along with GCC; see the file COPYING3.
wi::leu_p (a, b) as a more efficient short-hand for
"a >= 0 && a <= b". ]
 
- 3) rwide_int.  Restricted wide_int.  This is similar to
- wide_int, but maximum possible precision is RWIDE_INT_MAX_PRECISION
- and it always uses an inline buffer.  offset_int and rwide_int are
- GC-friendly, wide_int and widest_int are not.
-
- 4) widest_int.  This representation is an approximation of
+ 3) widest_int.  This representation is an approximation of
  infinite precision math.  However, it is not really infinite
  precision math as in the GMP library.  It is really finite
  precision math where the precision is WIDEST_INT_MAX_PRECISION.
@@ -257,9 +252,6 @@ along with GCC; see the file COPYING3.
 #define WIDE_INT_MAX_ELTS 255
 #define WIDE_INT_MAX_PRECISION (WIDE_INT_MAX_ELTS * HOST_BITS_PER_WIDE_INT)
 
-#define RWIDE_INT_MAX_ELTS WIDE_INT_MAX_INL_ELTS
-#define RWIDE_INT_MAX_PRECISION WIDE_INT_MAX_INL_PRECISION
-
 /* Precision of widest_int and largest _BitInt precision + 1 we can
support.  */
 #define WIDEST_INT_MAX_ELTS 510
@@ -343,7 +335,6 @@ STATIC_ASSERT (WIDE_INT_MAX_INL_ELTS < W
 template  class generic_wide_int;
 template  class fixed_wide_int_storage;
 class wide_int_storage;
-class rwide_int_storage;
 template  class widest_int_storage;
 
 /* An N-bit integer.  Until we can use typedef templates, use this instead.  */
@@ -352,7 +343,6 @@ template  class widest_int_storag
 
 typedef generic_wide_int  wide_int;
 typedef FIXED_WIDE_INT (ADDR_MAX_PRECISION) offset_int;
-typedef generic_wide_int  rwide_int;
 typedef generic_wide_int  > 
widest_int;
 typedef generic_wide_int  
> widest2_int;
 
@@ -1371,180 +1361,6 @@ wi::int_traits ::get_b
 return wi::get_precision (x);
 }
 
-/* The storage used by rwide_int.  */
-class GTY(()) rwide_int_storage
-{
-private:
-  HOST_WIDE_INT val[RWIDE_INT_MAX_ELTS];
-  unsigned int len;
-  unsigned int precision;
-
-public:
-  rwide_int_storage () = default;
-  template 
-  rwide_int_storage (const T &);
-
-  /* The standard generic_rwide_int storage methods.  */
-  unsigned int get_precision () const;
-  const HOST_WIDE_INT *get_val () const;
-  unsigned int get_len () const;
-  HOST_WIDE_INT *write_val (unsigned int);
-  void set_len (unsigned int, bool = false);
-
-  template 
-  rwide_int_storage &operator = (const T &);
-
-  static rwide_int from (const wide_int_ref &, unsigned int, signop);
-  static rwide_int from_array (const HOST_WIDE_INT *, unsigned int,
-  unsigned int, bool = true);
-  static rwide_int create (unsigned int);
-};
-
-namespace wi
-{
-  template <>
-  struct int_traits 
-  {
-static const enum precision_type precision_type = VAR_PRECISION;
-/* Guaranteed by a static assert in the rwide_int_storage constructor.  */
-static const bool host_dependent_precision = false;
-static const bool is_sign_extended = true;
-static const bool needs_write_val_arg = false;
-template 
-static rwide_int get_binary_result (const T1 &, const T2 &);
-template 
-static unsigned int get_binary_precision (const T1 &, const T2 &);
-  };
-}
-
-/* Initialize the storage from integer X, in its natural precision.
-   Note that we do not allow integers with host-dependent precision
-   to become rwide_ints; rwide_ints must always be logically independent
-   of the host.  */
-template 
-inline rwide_int_storage::rwide_int_storage (const T &x)
-{
-  ST

[PATCH] TEST: Add vectorization check

2023-10-09 Thread Juzhe-Zhong
These cases won't check for SLP on targets that support load_lanes.

Add a vectorization check for those situations.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/pr97832-2.c: Add vectorization check.
* gcc.dg/vect/pr97832-3.c: Ditto.
* gcc.dg/vect/pr97832-4.c: Ditto.

---
 gcc/testsuite/gcc.dg/vect/pr97832-2.c | 1 +
 gcc/testsuite/gcc.dg/vect/pr97832-3.c | 1 +
 gcc/testsuite/gcc.dg/vect/pr97832-4.c | 1 +
 3 files changed, 3 insertions(+)

diff --git a/gcc/testsuite/gcc.dg/vect/pr97832-2.c 
b/gcc/testsuite/gcc.dg/vect/pr97832-2.c
index 7d8d2691432..60e8e8516fc 100644
--- a/gcc/testsuite/gcc.dg/vect/pr97832-2.c
+++ b/gcc/testsuite/gcc.dg/vect/pr97832-2.c
@@ -27,3 +27,4 @@ void foo1x1(double* restrict y, const double* restrict x, int 
clen)
 
 /* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { 
! { vect_load_lanes && vect_strided8 } } } } } */
 /* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" { target 
{ ! { vect_load_lanes && vect_strided8 } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr97832-3.c 
b/gcc/testsuite/gcc.dg/vect/pr97832-3.c
index c0603e1432e..2dc76e5b565 100644
--- a/gcc/testsuite/gcc.dg/vect/pr97832-3.c
+++ b/gcc/testsuite/gcc.dg/vect/pr97832-3.c
@@ -48,3 +48,4 @@ void foo(double* restrict y, const double* restrict x0, const 
double* restrict x
 
 /* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { 
! { vect_load_lanes && vect_strided8 } } } } } */
 /* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" { target 
{ ! { vect_load_lanes && vect_strided8 } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr97832-4.c 
b/gcc/testsuite/gcc.dg/vect/pr97832-4.c
index c03442816a4..7e74c9313d5 100644
--- a/gcc/testsuite/gcc.dg/vect/pr97832-4.c
+++ b/gcc/testsuite/gcc.dg/vect/pr97832-4.c
@@ -26,3 +26,4 @@ void foo1x1(double* restrict y, const double* restrict x, int 
clen)
 
 /* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { 
! { vect_load_lanes && vect_strided8 } } } } } */
 /* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" { target 
{ ! { vect_load_lanes && vect_strided8 } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
-- 
2.36.3



Re: [pushed] analyzer: improvements to out-of-bounds diagrams [PR111155]

2023-10-09 Thread Tobias Burnus

Hi David,

On 09.10.23 16:08, David Malcolm wrote:

On Mon, 2023-10-09 at 12:09 +0200, Tobias Burnus wrote:

The following works:
(A) Using "kind == boundaries::kind::HARD" - i.e. adding
"boundaries::"
(B) Renaming the parameter name "kind" to something else - like "k"
as used
  in the other functions.

Can you fix it?

Sorry about the breakage, and thanks for the investigation.

Well, without an older compiler, one does not see it. It also worked
flawlessly on my laptop today.

Does the following patch fix the build for you?


Yes – as mentioned either of the variants above should work and (A) is
what you have in your patch.

And it is what I actually tried for the full build. Hence, yes, it works :-)

Thanks for the quick action!

Tobias


gcc/analyzer/ChangeLog:
  * access-diagram.cc (boundaries::add): Explicitly state
  "boundaries::" scope for "kind" enum.
---
  gcc/analyzer/access-diagram.cc | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/analyzer/access-diagram.cc b/gcc/analyzer/access-diagram.cc
index 2197ec63f53..c7d190e3188 100644
--- a/gcc/analyzer/access-diagram.cc
+++ b/gcc/analyzer/access-diagram.cc
@@ -652,7 +652,8 @@ public:
  m_logger->log_partial ("added access_range: ");
  range.dump_to_pp (m_logger->get_printer (), true);
  m_logger->log_partial (" (%s)",
-(kind == kind::HARD) ? "HARD" : "soft");
+(kind == boundaries::kind::HARD)
+? "HARD" : "soft");
  m_logger->end_log_line ();
}
}
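
(For context, the ambiguity is that the parameter and the nested enum share
the name "kind"; a reduced sketch of the issue, not taken from the analyzer
sources:)

enum class kind { HARD, SOFT };

void
add (kind kind)
{
  /* Some host compilers resolve the first "kind" below to the parameter
     rather than the enum type and reject it, hence the fix above; qualifying
     the scope or renaming the parameter both work.  */
  bool hard = (kind == ::kind::HARD);
  (void) hard;
}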



Re: [PATCH-2, rs6000] Enable vector mode for memory equality compare [PR111449]

2023-10-09 Thread David Edelsohn
On Sun, Oct 8, 2023 at 10:30 PM HAO CHEN GUI  wrote:

> Hi,
>   This patch enables vector mode for memory equality compare by adding
> a new expand cbranchv16qi4 and implementing it. Also the corresponding
> CC reg and compare code is set in rs6000_generate_compare. With the
> patch, a 16-byte equality compare can be implemented with a single vector
> compare instruction instead of two 8-byte compares with branches.
>
>   The test case is in the second patch which is rs6000 specific.
>
>   Bootstrapped and tested on powerpc64-linux BE and LE with no
> regressions.
>

Thanks for working on this.
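
(For context, the sort of source that benefits is a 16-byte equality-only
compare; an illustrative snippet, not the testcase from the rs6000-specific
patch:)

struct key { unsigned char bytes[16]; };

int
key_eq (const struct key *a, const struct key *b)
{
  /* With COMPARE_MAX_PIECES raised to 16, this equality-only memcmp can be
     expanded with a single V16QI compare instead of two 8-byte compares
     with branches.  */
  return __builtin_memcmp (a, b, sizeof (struct key)) == 0;
}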



>
> Thanks
> Gui Haochen
>
> ChangeLog
> rs6000: Enable vector compare for memory equality compare
>
> gcc/
> PR target/111449
> * config/rs6000/altivec.md (cbranchv16qi4): New expand pattern.
> * config/rs6000/rs6000.cc (rs6000_generate_compare): Generate insn
> sequence for V16QImode equality compare.
> * config/rs6000/rs6000.h (MOVE_MAX_PIECES): Define.
> (COMPARE_MAX_PIECES): Define.
>
> gcc/testsuite/
> PR target/111449
> * gcc.target/powerpc/pr111449.c: New.
>
> patch.diff
> diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
> index e8a596fb7e9..c69bf266402 100644
> --- a/gcc/config/rs6000/altivec.md
> +++ b/gcc/config/rs6000/altivec.md
> @@ -2605,6 +2605,39 @@ (define_insn "altivec_vupklpx"
>  }
>[(set_attr "type" "vecperm")])
>
> +(define_expand "cbranchv16qi4"
> +  [(use (match_operator 0 "equality_operator"
> +   [(match_operand:V16QI 1 "gpc_reg_operand")
> +(match_operand:V16QI 2 "gpc_reg_operand")]))
> +   (use (match_operand 3))]
> +  "VECTOR_UNIT_ALTIVEC_P (V16QImode)"
> +{
> +  if (!TARGET_P9_VECTOR
> +  && MEM_P (operands[1])
> +  && !altivec_indexed_or_indirect_operand (operands[1], V16QImode)
> +  && MEM_P (operands[2])
> +  && !altivec_indexed_or_indirect_operand (operands[2], V16QImode))
> +{
> +  /* Use direct move as the byte order doesn't matter for equality
> +compare.  */
> +  rtx reg_op1 = gen_reg_rtx (V16QImode);
> +  rtx reg_op2 = gen_reg_rtx (V16QImode);
> +  rs6000_emit_le_vsx_permute (reg_op1, operands[1], V16QImode);
> +  rs6000_emit_le_vsx_permute (reg_op2, operands[2], V16QImode);
> +  operands[1] = reg_op1;
> +  operands[2] = reg_op2;
> +}
> +  else
> +{
> +  operands[1] = force_reg (V16QImode, operands[1]);
> +  operands[2] = force_reg (V16QImode, operands[2]);
> +}
> +  rtx_code code = GET_CODE (operands[0]);
> +  operands[0] = gen_rtx_fmt_ee (code, V16QImode, operands[1],
> operands[2]);
> +  rs6000_emit_cbranch (V16QImode, operands);
> +  DONE;
> +})
> +
>  ;; Compare vectors producing a vector result and a predicate, setting CR6
> to
>  ;; indicate a combined status
>  (define_insn "altivec_vcmpequ_p"
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index efe9adce1f8..0087d786840 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -15264,6 +15264,15 @@ rs6000_generate_compare (rtx cmp, machine_mode
> mode)
>   else
> emit_insn (gen_stack_protect_testsi (compare_result, op0,
> op1b));
> }
> +  else if (mode == V16QImode)
> +   {
> + gcc_assert (code == EQ || code == NE);
> +
> + rtx result_vector = gen_reg_rtx (V16QImode);
> + compare_result = gen_rtx_REG (CCmode, CR6_REGNO);
> + emit_insn (gen_altivec_vcmpequb_p (result_vector, op0, op1));
> + code = (code == NE) ? GE : LT;
> +   }
>else
> emit_insn (gen_rtx_SET (compare_result,
> gen_rtx_COMPARE (comp_mode, op0, op1)));
> diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
> index 3503614efbd..dc33bca0802 100644
> --- a/gcc/config/rs6000/rs6000.h
> +++ b/gcc/config/rs6000/rs6000.h
> @@ -1730,6 +1730,8 @@ typedef struct rs6000_args
> in one reasonably fast instruction.  */
>  #define MOVE_MAX (! TARGET_POWERPC64 ? 4 : 8)
>  #define MAX_MOVE_MAX 8
> +#define MOVE_MAX_PIECES (!TARGET_POWERPC64 ? 4 : 16)
> +#define COMPARE_MAX_PIECES (!TARGET_POWERPC64 ? 4 : 16)
>

How are the definitions of MOVE_MAX_PIECES and COMPARE_MAX_PIECES
determined?  The email does not provide any explanation for the
implementation.  The rest of the patch is related to vector support, but
vector support is not dependent on TARGET_POWERPC64.

Thanks, David


>
>  /* Nonzero if access to memory by bytes is no faster than for words.
> Also nonzero if doing byte operations (specifically shifts) in
> registers
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr111449.c
> b/gcc/testsuite/gcc.target/powerpc/pr111449.c
> new file mode 100644
> index 000..a8c30b92a41
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr111449.c
> @@ -0,0 +1,19 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p8vector_ok } */
> +/* { dg-options "-maltivec -O2" } */
> +

Re: [PATCH] sso-string@gnu-versioned-namespace [PR83077]

2023-10-09 Thread Iain Sandoe



> On 9 Oct 2023, at 15:42, Iain Sandoe  wrote:

>> On 7 Oct 2023, at 20:32, François Dumont  wrote:
>> 
>> I've been told that previous patch generated with 'git diff -b' was not 
>> applying properly so here is the same patch again with a simple 'git diff'.
> 
> Thanks, that did fix it - There are some trailing whitespaces in the config 
> files, but I suspect that they need to be there since those have values 
> appended during the configuration.
> 
> Anyway, with this + the coroutines and contract v2 (weak def) fix, plus a 
> local patch to enable versioned namespace on Darwin, I get results comparable 
> with the non-versioned case - but one more patchlet is needed on yours (to 
> allow for targets using emulated TLS):
> 
> diff --git a/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver 
> b/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
> index 9fab8bead15..b7167fc0c2f 100644
> --- a/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
> +++ b/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
> @@ -78,6 +78,7 @@ GLIBCXX_8.0 {
> 
> # thread/mutex/condition_variable/future
> __once_proxy;
> +__emutls_v._ZNSt3__81?__once_call*;
> 
> # std::__convert_to_v
> _ZNSt3__814__convert_to_v*;

Having said this, since the versioned lib is an ABI-break, perhaps we should 
also take the opportunity
to fix the once_call impl. here too?

(at least the fix I made locally does not need the TLS var, so this would then 
be moot)

Iain

> 
> thanks
> Iain
> 
>> 
>> 
>> On 07/10/2023 14:25, François Dumont wrote:
>>> Hi
>>> 
>>> Here is a rebased version of this patch.
>>> 
>>> There are few test failures when running 'make check-c++' but nothing new.
>>> 
>>> Still, there are 2 patches awaiting validation to fix some of them, PR 
>>> c++/111524 to fix another bunch and I fear that we will have to live with 
>>> the others.
>>> 
>>>libstdc++: [_GLIBCXX_INLINE_VERSION] Use cxx11 abi [PR83077]
>>> 
>>>Use cxx11 abi when activating versioned namespace mode. To do so, support
>>>a new configuration mode where !_GLIBCXX_USE_DUAL_ABI and 
>>> _GLIBCXX_USE_CXX11_ABI.
>>> 
>>>The main change is that std::__cow_string is now defined whenever 
>>> _GLIBCXX_USE_DUAL_ABI
>>>or _GLIBCXX_USE_CXX11_ABI is true. Implementation is using available 
>>> std::string in
>>>case of dual abi and a subset of it when it's not.
>>> 
>>>On the other side std::__sso_string is defined only when 
>>> _GLIBCXX_USE_DUAL_ABI is true
>>>and _GLIBCXX_USE_CXX11_ABI is false. Meaning that std::__sso_string is a 
>>> typedef for the
>>>cow std::string implementation when dual abi is disabled and cow string 
>>> is being used.
>>> 
>>>libstdcxx-v3/ChangeLog:
>>> 
>>>PR libstdc++/83077
>>>* acinclude.m4 [GLIBCXX_ENABLE_LIBSTDCXX_DUAL_ABI]: Default to 
>>> "new" libstdcxx abi
>>>when enable_symvers is gnu-versioned-namespace.
>>>* config/locale/dragonfly/monetary_members.cc 
>>> [!_GLIBCXX_USE_DUAL_ABI]: Define money_base
>>>members.
>>>* config/locale/generic/monetary_members.cc 
>>> [!_GLIBCXX_USE_DUAL_ABI]: Likewise.
>>>* config/locale/gnu/monetary_members.cc 
>>> [!_GLIBCXX_USE_DUAL_ABI]: Likewise.
>>>* config/locale/gnu/numeric_members.cc
>>>[!_GLIBCXX_USE_DUAL_ABI](__narrow_multibyte_chars): Define.
>>>* configure: Regenerate.
>>>* include/bits/c++config
>>>[_GLIBCXX_INLINE_VERSION](_GLIBCXX_NAMESPACE_CXX11, 
>>> _GLIBCXX_BEGIN_NAMESPACE_CXX11):
>>>Define empty.
>>> [_GLIBCXX_INLINE_VERSION](_GLIBCXX_END_NAMESPACE_CXX11, 
>>> _GLIBCXX_DEFAULT_ABI_TAG):
>>>Likewise.
>>>* include/bits/cow_string.h [!_GLIBCXX_USE_CXX11_ABI]: Define a 
>>> light version of COW
>>>basic_string as __std_cow_string for use in stdexcept.
>>>* include/std/stdexcept [_GLIBCXX_USE_CXX11_ABI]: Define 
>>> __cow_string.
>>>(__cow_string(const char*)): New.
>>>(__cow_string::c_str()): New.
>>>* python/libstdcxx/v6/printers.py (StdStringPrinter::__init__): 
>>> Set self.new_string to True
>>>when std::__8::basic_string type is found.
>>>* src/Makefile.am 
>>> [ENABLE_SYMVERS_GNU_NAMESPACE](ldbl_alt128_compat_sources): Define empty.
>>>* src/Makefile.in: Regenerate.
>>>* src/c++11/Makefile.am (cxx11_abi_sources): Rename into...
>>>(dual_abi_sources): ...this. Also move cow-local_init.cc, 
>>> cxx11-hash_tr1.cc,
>>>cxx11-ios_failure.cc entries to...
>>>(sources): ...this.
>>>(extra_string_inst_sources): Move cow-fstream-inst.cc, 
>>> cow-sstream-inst.cc, cow-string-inst.cc,
>>>cow-string-io-inst.cc, cow-wstring-inst.cc, 
>>> cow-wstring-io-inst.cc, cxx11-locale-inst.cc,
>>>cxx11-wlocale-inst.cc entries to...
>>>(inst_sources): ...this.
>>>   

[COMMITTED] Remove unused get_identity_relation.

2023-10-09 Thread Andrew MacLeod
I added this routine for Aldy when he thought we were going to have to 
add explicit versions for unordered relations.


It seems that with accurate tracking of NANs, we do not need the 
explicit versions in the oracle, so we will not need this identity 
routine to pick the appropriate version of VREL_EQ... as there is only 
one.  As it stands, it always returns VREL_EQ, so simply use VREL_EQ in the 
2 calling locations.


Bootstrapped on x86_64-pc-linux-gnu with no regressions. Pushed.

Andrew
From 5ee51119d1345f3f13af784455a4ae466766912b Mon Sep 17 00:00:00 2001
From: Andrew MacLeod 
Date: Mon, 9 Oct 2023 10:01:11 -0400
Subject: [PATCH 1/2] Remove unused get_identity_relation.

Turns out we didn't need this as there are no unordered relations
managed by the oracle.

	* gimple-range-gori.cc (gori_compute::compute_operand1_range): Do
	not call get_identity_relation.
	(gori_compute::compute_operand2_range): Ditto.
	* value-relation.cc (get_identity_relation): Remove.
	* value-relation.h (get_identity_relation): Remove prototype.
---
 gcc/gimple-range-gori.cc | 10 ++
 gcc/value-relation.cc| 14 --
 gcc/value-relation.h |  3 ---
 3 files changed, 2 insertions(+), 25 deletions(-)

diff --git a/gcc/gimple-range-gori.cc b/gcc/gimple-range-gori.cc
index 1b5eda43390..887da0ff094 100644
--- a/gcc/gimple-range-gori.cc
+++ b/gcc/gimple-range-gori.cc
@@ -1146,10 +1146,7 @@ gori_compute::compute_operand1_range (vrange &r,
 
   // If op1 == op2, create a new trio for just this call.
   if (op1 == op2 && gimple_range_ssa_p (op1))
-	{
-	  relation_kind k = get_identity_relation (op1, op1_range);
-	  trio = relation_trio (trio.lhs_op1 (), trio.lhs_op2 (), k);
-	}
+	trio = relation_trio (trio.lhs_op1 (), trio.lhs_op2 (), VREL_EQ);
   if (!handler.calc_op1 (r, lhs, op2_range, trio))
 	return false;
 }
@@ -1225,10 +1222,7 @@ gori_compute::compute_operand2_range (vrange &r,
 
   // If op1 == op2, create a new trio for this stmt.
   if (op1 == op2 && gimple_range_ssa_p (op1))
-{
-  relation_kind k = get_identity_relation (op1, op1_range);
-  trio = relation_trio (trio.lhs_op1 (), trio.lhs_op2 (), k);
-}
+trio = relation_trio (trio.lhs_op1 (), trio.lhs_op2 (), VREL_EQ);
   // Intersect with range for op2 based on lhs and op1.
   if (!handler.calc_op2 (r, lhs, op1_range, trio))
 return false;
diff --git a/gcc/value-relation.cc b/gcc/value-relation.cc
index 8fea4aad345..a2ae39692a6 100644
--- a/gcc/value-relation.cc
+++ b/gcc/value-relation.cc
@@ -183,20 +183,6 @@ relation_transitive (relation_kind r1, relation_kind r2)
   return relation_kind (rr_transitive_table[r1][r2]);
 }
 
-// When operands of a statement are identical ssa_names, return the
-// approriate relation between operands for NAME == NAME, given RANGE.
-//
-relation_kind
-get_identity_relation (tree name, vrange &range ATTRIBUTE_UNUSED)
-{
-  // Return VREL_UNEQ when it is supported for floats as appropriate.
-  if (frange::supports_p (TREE_TYPE (name)))
-return VREL_EQ;
-
-  // Otherwise return VREL_EQ.
-  return VREL_EQ;
-}
-
 // This vector maps a relation to the equivalent tree code.
 
 static const tree_code relation_to_code [VREL_LAST] = {
diff --git a/gcc/value-relation.h b/gcc/value-relation.h
index f00f84f93b6..be6e277421b 100644
--- a/gcc/value-relation.h
+++ b/gcc/value-relation.h
@@ -91,9 +91,6 @@ inline bool relation_equiv_p (relation_kind r)
 
 void print_relation (FILE *f, relation_kind rel);
 
-// Return relation for NAME == NAME with RANGE.
-relation_kind get_identity_relation (tree name, vrange &range);
-
 class relation_oracle
 {
 public:
-- 
2.41.0



[COMMITTED] PR tree-optimization/111694 - Ensure float equivalences include + and - zero.

2023-10-09 Thread Andrew MacLeod
When ranger propagates ranges in the on-entry cache, it also checks for 
equivalences and incorporates the equivalence into the range for a name 
if it is known.


With floating point values, the equivalence that is generated by 
comparison must also take into account that if the equivalence contains 
zero, both positive and negative zeros could be in the range.


This PR demonstrates that once we establish an equivalence, even though 
we know one value may only have a positive zero, the equivalence may 
have been formed earlier and included a negative zero.  This patch 
pessimistically assumes that if the equivalence contains zero, we should 
include both + and - 0 in the equivalence that we utilize.


I audited the other places, and found no other place where this issue 
might arise.  Cache propagation is the only place where we augment the 
range with random equivalences.


Bootstrapped on x86_64-pc-linux-gnu with no regressions. Pushed.

Andrew
From b0892b1fc637fadf14d7016858983bc5776a1e69 Mon Sep 17 00:00:00 2001
From: Andrew MacLeod 
Date: Mon, 9 Oct 2023 10:15:07 -0400
Subject: [PATCH 2/2] Ensure float equivalences include + and - zero.

A floating point equivalence may not properly reflect both signs of
zero, so be pessimistic and ensure both signs are included.

	PR tree-optimization/111694
	gcc/
	* gimple-range-cache.cc (ranger_cache::fill_block_cache): Adjust
	equivalence range.
	* value-relation.cc (adjust_equivalence_range): New.
	* value-relation.h (adjust_equivalence_range): New prototype.

	gcc/testsuite/
	* gcc.dg/pr111694.c: New.
---
 gcc/gimple-range-cache.cc   |  3 +++
 gcc/testsuite/gcc.dg/pr111694.c | 19 +++
 gcc/value-relation.cc   | 19 +++
 gcc/value-relation.h|  3 +++
 4 files changed, 44 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/pr111694.c

diff --git a/gcc/gimple-range-cache.cc b/gcc/gimple-range-cache.cc
index 3c819933c4e..89c0845457d 100644
--- a/gcc/gimple-range-cache.cc
+++ b/gcc/gimple-range-cache.cc
@@ -1470,6 +1470,9 @@ ranger_cache::fill_block_cache (tree name, basic_block bb, basic_block def_bb)
 		{
 		  if (rel != VREL_EQ)
 		range_cast (equiv_range, type);
+		  else
+		adjust_equivalence_range (equiv_range);
+
 		  if (block_result.intersect (equiv_range))
 		{
 		  if (DEBUG_RANGE_CACHE)
diff --git a/gcc/testsuite/gcc.dg/pr111694.c b/gcc/testsuite/gcc.dg/pr111694.c
new file mode 100644
index 000..a70b03069dc
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr111694.c
@@ -0,0 +1,19 @@
+/* PR tree-optimization/111009 */
+/* { dg-do run } */
+/* { dg-options "-O2" } */
+
+#define signbit(x) __builtin_signbit(x)
+
+static void test(double l, double r)
+{
+  if (l == r && (signbit(l) || signbit(r)))
+;
+  else
+__builtin_abort();
+}
+
+int main()
+{
+  test(0.0, -0.0);
+}
+
diff --git a/gcc/value-relation.cc b/gcc/value-relation.cc
index a2ae39692a6..0326fe7cde6 100644
--- a/gcc/value-relation.cc
+++ b/gcc/value-relation.cc
@@ -183,6 +183,25 @@ relation_transitive (relation_kind r1, relation_kind r2)
   return relation_kind (rr_transitive_table[r1][r2]);
 }
 
+// When one name is an equivalence of another, ensure the equivalence
+// range is correct.  Specifically for floating point, a +0 is also
+// equivalent to a -0 which may not be reflected.  See PR 111694.
+
+void
+adjust_equivalence_range (vrange &range)
+{
+  if (range.undefined_p () || !is_a <frange> (range))
+return;
+
+  frange fr = as_a <frange> (range);
+  // If range includes 0 make sure both signs of zero are included.
+  if (fr.contains_p (dconst0) || fr.contains_p (dconstm0))
+{
+  frange zeros (range.type (), dconstm0, dconst0);
+  range.union_ (zeros);
+}
+ }
+
 // This vector maps a relation to the equivalent tree code.
 
 static const tree_code relation_to_code [VREL_LAST] = {
diff --git a/gcc/value-relation.h b/gcc/value-relation.h
index be6e277421b..31d48908678 100644
--- a/gcc/value-relation.h
+++ b/gcc/value-relation.h
@@ -91,6 +91,9 @@ inline bool relation_equiv_p (relation_kind r)
 
 void print_relation (FILE *f, relation_kind rel);
 
+// Adjust range as an equivalence.
+void adjust_equivalence_range (vrange &range);
+
 class relation_oracle
 {
 public:
-- 
2.41.0



Re: [PATCH] sso-string@gnu-versioned-namespace [PR83077]

2023-10-09 Thread François Dumont



On 09/10/2023 16:42, Iain Sandoe wrote:

Hi François,


On 7 Oct 2023, at 20:32, François Dumont  wrote:

I've been told that previous patch generated with 'git diff -b' was not 
applying properly so here is the same patch again with a simple 'git diff'.

Thanks, that did fix it - There are some trailing whitespaces in the config 
files, but I suspect that they need to be there since those have values 
appended during the configuration.


You're talking about the ones coming from regenerated Makefile.in and 
configure I guess. I prefer not to edit those, those trailing 
whitespaces are already in.





Anyway, with this + the coroutines and contract v2 (weak def) fix, plus a local 
patch to enable versioned namespace on Darwin, I get results comparable with 
the non-versioned case - but one more patchlet is needed on yours (to allow 
for targets using emulated TLS):

diff --git a/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver 
b/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
index 9fab8bead15..b7167fc0c2f 100644
--- a/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
+++ b/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
@@ -78,6 +78,7 @@ GLIBCXX_8.0 {
  
  # thread/mutex/condition_variable/future

  __once_proxy;
+__emutls_v._ZNSt3__81?__once_call*;


I can add this one, sure, even if it could be part of a dedicated patch. 
I'm surprised that we do not need the __once_callable emul symbol too, 
it would be more consistent with the non-versioned mode.


I'm pretty sure there are a bunch of other symbols missing, but this 
mode is seldom tested...


  
  # std::__convert_to_v

  _ZNSt3__814__convert_to_v*;


thanks
Iain



On 07/10/2023 14:25, François Dumont wrote:

Hi

Here is a rebased version of this patch.

There are few test failures when running 'make check-c++' but nothing new.

Still, there are 2 patches awaiting validation to fix some of them, PR 
c++/111524 to fix another bunch and I fear that we will have to live with the 
others.

 libstdc++: [_GLIBCXX_INLINE_VERSION] Use cxx11 abi [PR83077]

  Use cxx11 abi when activating versioned namespace mode. To do so, support
 a new configuration mode where !_GLIBCXX_USE_DUAL_ABI and 
_GLIBCXX_USE_CXX11_ABI.

 The main change is that std::__cow_string is now defined whenever 
_GLIBCXX_USE_DUAL_ABI
 or _GLIBCXX_USE_CXX11_ABI is true. Implementation is using available 
std::string in
 case of dual abi and a subset of it when it's not.

 On the other side std::__sso_string is defined only when 
_GLIBCXX_USE_DUAL_ABI is true
 and _GLIBCXX_USE_CXX11_ABI is false. Meaning that std::__sso_string is a 
typedef for the
 cow std::string implementation when dual abi is disabled and cow string is 
being used.

 libstdcxx-v3/ChangeLog:

 PR libstdc++/83077
 * acinclude.m4 [GLIBCXX_ENABLE_LIBSTDCXX_DUAL_ABI]: Default to 
"new" libstdcxx abi
 when enable_symvers is gnu-versioned-namespace.
 * config/locale/dragonfly/monetary_members.cc 
[!_GLIBCXX_USE_DUAL_ABI]: Define money_base
 members.
 * config/locale/generic/monetary_members.cc 
[!_GLIBCXX_USE_DUAL_ABI]: Likewise.
 * config/locale/gnu/monetary_members.cc [!_GLIBCXX_USE_DUAL_ABI]: 
Likewise.
 * config/locale/gnu/numeric_members.cc
 [!_GLIBCXX_USE_DUAL_ABI](__narrow_multibyte_chars): Define.
 * configure: Regenerate.
 * include/bits/c++config
 [_GLIBCXX_INLINE_VERSION](_GLIBCXX_NAMESPACE_CXX11, 
_GLIBCXX_BEGIN_NAMESPACE_CXX11):
 Define empty.
[_GLIBCXX_INLINE_VERSION](_GLIBCXX_END_NAMESPACE_CXX11, 
_GLIBCXX_DEFAULT_ABI_TAG):
 Likewise.
 * include/bits/cow_string.h [!_GLIBCXX_USE_CXX11_ABI]: Define a 
light version of COW
 basic_string as __std_cow_string for use in stdexcept.
 * include/std/stdexcept [_GLIBCXX_USE_CXX11_ABI]: Define 
__cow_string.
 (__cow_string(const char*)): New.
 (__cow_string::c_str()): New.
 * python/libstdcxx/v6/printers.py (StdStringPrinter::__init__): 
Set self.new_string to True
 when std::__8::basic_string type is found.
 * src/Makefile.am 
[ENABLE_SYMVERS_GNU_NAMESPACE](ldbl_alt128_compat_sources): Define empty.
 * src/Makefile.in: Regenerate.
 * src/c++11/Makefile.am (cxx11_abi_sources): Rename into...
 (dual_abi_sources): ...this. Also move cow-local_init.cc, 
cxx11-hash_tr1.cc,
 cxx11-ios_failure.cc entries to...
 (sources): ...this.
 (extra_string_inst_sources): Move cow-fstream-inst.cc, 
cow-sstream-inst.cc, cow-string-inst.cc,
  cow-string-io-inst.cc, cow-wstring-inst.cc, cow-wstring-io-inst.cc, 
cxx11-locale-inst.cc,
 cxx11-wlocale-inst.cc entries to...
 (inst_sources): ...this.
 * src/c++11/Makefile.in: Regenerat

Re: [PATCH] wide-int: Allow up to 16320 bits wide_int and change widest_int precision to 32640 bits [PR102989]

2023-10-09 Thread Jakub Jelinek
On Mon, Oct 09, 2023 at 03:44:10PM +0200, Jakub Jelinek wrote:
> Thanks, just quick answers, will work on patch adjustments after trying to
> get rid of rwide_int (seems dwarf2out has very limited needs from it, just
> some routine to construct it in GCed memory (and never change afterwards)
> from const wide_int_ref & or so, and then working operator ==,
> get_precision, elt, get_len and get_val methods, so I think we could just
> have a struct dw_wide_int { unsigned int prec, len; HOST_WIDE_INT val[1]; };
> and perform the methods on it after converting to a storage ref.

Now in patch form (again, incremental).

> > Does the variable-length memcpy pay for itself?  If so, perhaps that's a
> > sign that we should have a smaller inline buffer for this class (say 2 
> > HWIs).
> 
> Guess I'll try to see what results in smaller .text size.

I've left the memcpy changes in a separate patch (incremental, attached).
Seems that the second patch results in .text growth of 16256 bytes (0.04%),
though I'd bet it probably makes compile time a tiny bit faster because it
replaces an out-of-line memcpy (caused by the variable length) with an inlined one.

With even the third one it shrinks by 84544 bytes (0.21% down), but the
extra statistics patch then shows a massive number of allocations after
running make check-gcc check-g++ check-gfortran for just a minute or two.
On the widest_int side, I see (first number from sort | uniq -c | sort -nr,
second the estimated or final len)
7289034 4
 173586 5
  21819 6
i.e. there are tons of widest_ints which need len 4 (or perhaps just
have it as an upper estimate); maybe even 5 would be nice.
On the wide_int side, I see
 155291 576
(supposedly because of bound_wide_int, where we create wide_int_ref from
the 576-bit precision bound_wide_int and then create 576-bit wide_int when
using unary or binary operation on that).

So, perhaps we could get away with say WIDEST_INT_MAX_INL_ELTS of 5 or 6
instead of 9 but keep WIDE_INT_MAX_INL_ELTS at 9 (or whatever is computed
from MAX_BITSIZE_MODE_ANY_INT?).  Or keep it at 9 for both (i.e. without
the third patch).

--- gcc/poly-int.h.jj   2023-10-09 14:37:45.883940062 +0200
+++ gcc/poly-int.h  2023-10-09 17:05:26.629828329 +0200
@@ -96,7 +96,7 @@ struct poly_coeff_traits
-struct poly_coeff_traits
+struct poly_coeff_traits
 {
   typedef WI_UNARY_RESULT (T) result;
   typedef int int_type;
@@ -110,14 +110,13 @@ struct poly_coeff_traits
-struct poly_coeff_traits
+struct poly_coeff_traits
 {
   typedef WI_UNARY_RESULT (T) result;
   typedef int int_type;
   /* These types are always signed.  */
   static const int signedness = 1;
   static const int precision = wi::int_traits::precision;
-  static const int inl_precision = wi::int_traits::inl_precision;
   static const int rank = precision * 2 / CHAR_BIT;
 
   template
--- gcc/double-int.h.jj 2023-01-02 09:32:22.747280053 +0100
+++ gcc/double-int.h2023-10-09 17:06:03.446317336 +0200
@@ -440,7 +440,7 @@ namespace wi
   template <>
   struct int_traits 
   {
-static const enum precision_type precision_type = CONST_PRECISION;
+static const enum precision_type precision_type = INL_CONST_PRECISION;
 static const bool host_dependent_precision = true;
 static const unsigned int precision = HOST_BITS_PER_DOUBLE_INT;
 static unsigned int get_precision (const double_int &);
--- gcc/wide-int.h.jj   2023-10-09 16:06:39.326805176 +0200
+++ gcc/wide-int.h  2023-10-09 17:29:20.016951691 +0200
@@ -343,8 +343,8 @@ template  class widest_int_storag
 
 typedef generic_wide_int  wide_int;
 typedef FIXED_WIDE_INT (ADDR_MAX_PRECISION) offset_int;
-typedef generic_wide_int  > 
widest_int;
-typedef generic_wide_int  
> widest2_int;
+typedef generic_wide_int  > 
widest_int;
+typedef generic_wide_int  > 
widest2_int;
 
 /* wi::storage_ref can be a reference to a primitive type,
so this is the conservatively-correct setting.  */
@@ -394,13 +394,13 @@ namespace wi
 /* The integer has a variable precision but no defined signedness.  */
 VAR_PRECISION,
 
-/* The integer has a constant precision (known at GCC compile time)
-   and is signed.  */
-CONST_PRECISION,
-
-/* Like CONST_PRECISION, but with WIDEST_INT_MAX_PRECISION or larger
-   precision where not all elements of arrays are always present.  */
-WIDEST_CONST_PRECISION
+/* The integer has a constant precision (known at GCC compile time),
+   is signed and all elements are in inline buffer.  */
+INL_CONST_PRECISION,
+
+/* Like INL_CONST_PRECISION, but elements can be heap allocated for
+   larger lengths.  */
+CONST_PRECISION
   };
 
   /* This class, which has no default implementation, is expected to
@@ -410,15 +410,10 @@ namespace wi
Classifies the type of T.
 
  static const unsigned int precision;
-   Only defined if precision_type == CONST_PRECISION or
-   precision_type == WIDEST_CONST_PRECISION.  Specifies the
+   Only defined if precision_type == INL_CONST_PRE

Re: [pushed] analyzer: improvements to out-of-bounds diagrams [PR111155]

2023-10-09 Thread David Malcolm
On Mon, 2023-10-09 at 17:01 +0200, Tobias Burnus wrote:
> Hi David,
> 
> On 09.10.23 16:08, David Malcolm wrote:
> > On Mon, 2023-10-09 at 12:09 +0200, Tobias Burnus wrote:
> > > The following works:
> > > (A) Using "kind == boundaries::kind::HARD" - i.e. adding
> > > "boundaries::"
> > > (B) Renaming the parameter name "kind" to something else - like
> > > "k"
> > > as used
> > >   in the other functions.
> > > 
> > > Can you fix it?
> > Sorry about the breakage, and thanks for the investigation.
> Well, without an older compiler, one does not see it. It also worked
> flawlessly on my laptop today.
> > Does the following patch fix the build for you?
> 
> Yes – as mentioned either of the variants above should work and (A)
> is
> what you have in your patch.
> 
> And it is what I actually tried for the full build. Hence, yes, it
> works :-)

Thanks!

I've pushed this to trunk as r14-4521-g08d0f840dc7ad2.



Re: [RFC 1/2] RISC-V: Add support for _Bfloat16.

2023-10-09 Thread Jeff Law




On 10/9/23 00:18, Jin Ma wrote:


+;; The conversion of DF to BF needs to be done with SF if there is a
+;; chance to generate at least one instruction, otherwise just using
+;; libfunc __truncdfbf2.
+(define_expand "truncdfbf2"
+  [(set (match_operand:BF 0 "register_operand" "=f")
+   (float_truncate:BF
+   (match_operand:DF 1 "register_operand" " f")))]
+  "TARGET_DOUBLE_FLOAT || TARGET_ZDINX"
+  {
+convert_move (operands[0],
+ convert_modes (SFmode, DFmode, operands[1], 0), 0);
+DONE;
+  })

So for conversions to/from BFmode, doesn't generic code take care of
this for us?  Search for convert_mode_scalar in expr.cc. That code will
utilize SFmode as an intermediate step just like your expander.   Is
there some reason that generic code is insufficient?

Similarly for the other conversions.


As far as I can see, the function 'convert_mode_scalar' doesn't seem to be
perfect for dealing with the conversions to/from BFmode. It can only handle
BF to HF, SF, DF and SF to BF well; the rest of the conversions get no
special handling and directly use the libcall.

Maybe I should choose to enhance its functionality? This seems to be a
good choice, I'm not sure.

My recollection was that BF could be converted to/from SF trivially and
if we wanted BF->DF we'd first convert to SF, then to DF.

Direct BF<->DF conversions aren't actually important from a performance 
standpoint.  So it's OK if they have an extra step IMHO.
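
(As a rough source-level illustration of the two-step lowering, not the
expander itself, and assuming a __bf16 type is available:)

__bf16
df_to_bf (double x)
{
  float tmp = (float) x;	/* DF -> SF, a single instruction where supported.  */
  return (__bf16) tmp;		/* SF -> BF.  */
}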


jeff


Re: [PATCH v1 1/4] options: Define TARGET__P and TARGET__OPTS_P macro for Mask and InverseMask

2023-10-09 Thread Kito Cheng
> Doesn't this need to be updated to avoid multi-dimensional arrays in awk
> and rebased?

Oh, yeah, I should update that; it was posted before that issue was reported.
Let me send v2 soon :P


Re: [PATCH] c++: Improve diagnostics for constexpr cast from void*

2023-10-09 Thread Jason Merrill

On 10/9/23 06:03, Nathaniel Shead wrote:

Bootstrapped and regtested on x86_64-pc-linux-gnu with
GXX_TESTSUITE_STDS=98,11,14,17,20,23,26,impcx.

-- >8 --

This patch improves the errors given when casting from void* in C++26 to
include the expected type if the type of the pointed-to object was
not similar to the casted-to type.

It also ensures (for all standard modes) that void* casts are checked
even for DECL_ARTIFICIAL declarations, such as lifetime-extended
temporaries, and is only ignored for cases where we know it's OK (heap
identifiers and source_location::current). This provides more accurate
diagnostics when using the pointer and ensures that some other casts
from void* are now correctly rejected.
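
A minimal illustration of the kind of cast affected (illustrative only, not
the new testcase):

constexpr float f ()
{
  int i = 0;
  void *p = &i;
  return *static_cast<float *> (p);  // 'int' is not similar to 'float'
}
constexpr float x = f ();            // rejected; C++26 names the expected type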

gcc/cp/ChangeLog:

* constexpr.cc (is_std_source_location_current): New.
(cxx_eval_constant_expression): Only ignore cast from void* for
specific cases and improve other diagnostics.

gcc/testsuite/ChangeLog:

* g++.dg/cpp0x/constexpr-cast4.C: New test.

Signed-off-by: Nathaniel Shead 
---
  gcc/cp/constexpr.cc  | 83 +---
  gcc/testsuite/g++.dg/cpp0x/constexpr-cast4.C |  7 ++
  2 files changed, 78 insertions(+), 12 deletions(-)
  create mode 100644 gcc/testsuite/g++.dg/cpp0x/constexpr-cast4.C

diff --git a/gcc/cp/constexpr.cc b/gcc/cp/constexpr.cc
index 0f948db7c2d..f38d541a662 100644
--- a/gcc/cp/constexpr.cc
+++ b/gcc/cp/constexpr.cc
@@ -2301,6 +2301,36 @@ is_std_allocator_allocate (const constexpr_call *call)
  && is_std_allocator_allocate (call->fundef->decl));
  }
  
+/* Return true if FNDECL is std::source_location::current.  */

+
+static inline bool
+is_std_source_location_current (tree fndecl)
+{
+  if (!decl_in_std_namespace_p (fndecl))
+return false;
+
+  tree name = DECL_NAME (fndecl);
+  if (name == NULL_TREE || !id_equal (name, "current"))
+return false;
+
+  tree ctx = DECL_CONTEXT (fndecl);
+  if (ctx == NULL_TREE || !CLASS_TYPE_P (ctx) || !TYPE_MAIN_DECL (ctx))
+return false;
+
+  name = DECL_NAME (TYPE_MAIN_DECL (ctx));
+  return name && id_equal (name, "source_location");
+}
+
+/* Overload for the above taking constexpr_call*.  */
+
+static inline bool
+is_std_source_location_current (const constexpr_call *call)
+{
+  return (call
+ && call->fundef
+ && is_std_source_location_current (call->fundef->decl));
+}
+
  /* Return true if FNDECL is __dynamic_cast.  */
  
  static inline bool

@@ -7850,33 +7880,62 @@ cxx_eval_constant_expression (const constexpr_ctx *ctx, 
tree t,
if (TYPE_PTROB_P (type)
&& TYPE_PTR_P (TREE_TYPE (op))
&& VOID_TYPE_P (TREE_TYPE (TREE_TYPE (op)))
-   /* Inside a call to std::construct_at or to
-  std::allocator::{,de}allocate, we permit casting from void*
+   /* Inside a call to std::construct_at,
+  std::allocator::{,de}allocate, or
+  std::source_location::current, we permit casting from void*
   because that is compiler-generated code.  */
&& !is_std_construct_at (ctx->call)
-   && !is_std_allocator_allocate (ctx->call))
+   && !is_std_allocator_allocate (ctx->call)
+   && !is_std_source_location_current (ctx->call))
  {
/* Likewise, don't error when casting from void* when OP is
   &heap uninit and similar.  */
tree sop = tree_strip_nop_conversions (op);
-   if (TREE_CODE (sop) == ADDR_EXPR
-   && VAR_P (TREE_OPERAND (sop, 0))
-   && DECL_ARTIFICIAL (TREE_OPERAND (sop, 0)))
+   tree decl = NULL_TREE;
+   if (TREE_CODE (sop) == ADDR_EXPR)
+ decl = TREE_OPERAND (sop, 0);
+   if (decl
+   && VAR_P (decl)
+   && DECL_ARTIFICIAL (decl)
+   && (DECL_NAME (decl) == heap_identifier
+   || DECL_NAME (decl) == heap_uninit_identifier
+   || DECL_NAME (decl) == heap_vec_identifier
+   || DECL_NAME (decl) == heap_vec_uninit_identifier))
  /* OK */;
/* P2738 (C++26): a conversion from a prvalue P of type "pointer to
   cv void" to a pointer-to-object type T unless P points to an
   object whose type is similar to T.  */
-   else if (cxx_dialect > cxx23
-&& (sop = cxx_fold_indirect_ref (ctx, loc,
- TREE_TYPE (type), sop)))
+   else if (cxx_dialect > cxx23)
  {
-   r = build1 (ADDR_EXPR, type, sop);
-   break;
+   r = cxx_fold_indirect_ref (ctx, loc, TREE_TYPE (type), sop);
+   if (r)
+ {
+   r = build1 (ADDR_EXPR, type, r);
+   break;
+ }
+   if (!ctx->quiet)
+ {
+   if (TREE_CODE (sop) == ADDR_EXPR)
+ {
+ 

xthead regression with [COMMITTED] RISC-V: const: hide mvconst splitter from IRA

2023-10-09 Thread Vineet Gupta

Hi Christoph,

On 10/9/23 12:06, Patrick O'Neill wrote:


Hi Vineet,

We're seeing a regression on all riscv targets after this patch:

FAIL: gcc.target/riscv/xtheadcondmov-indirect.c -O2 
check-function-bodies ConNmv_imm_imm_reg
FAIL: gcc.target/riscv/xtheadcondmov-indirect.c -O3 -g 
check-function-bodies ConNmv_imm_imm_reg


Debug log output:
body: \taddi    a[0-9]+,a[0-9]+,-1000+
\tli    a[0-9]+,9998336+
\taddi    a[0-9]+,a[0-9]+,1664+
\tth.mveqz    a[0-9]+,a[0-9]+,a[0-9]+
\tret

against:     li    a5,9998336
    addi    a4,a0,-1000
    addi    a0,a5,1664
    th.mveqz    a0,a1,a4
    ret

https://github.com/patrick-rivos/gcc-postcommit-ci/issues/8
https://github.com/ewlu/riscv-gnu-toolchain/issues/286



It seems that with my patch the exact same instructions end up in a different 
order (for -O2/-O3), tripping up the test results, and differ from, say, -O1 
for the exact same build.


-O2 w/ patch
ConNmv_imm_imm_reg:
    li    a5,9998336
    addi    a4,a0,-1000
    addi    a0,a5,1664
    th.mveqz    a0,a1,a4
    ret

-O1 w/ patch
ConNmv_imm_imm_reg:
    addi    a4,a0,-1000
    li    a5,9998336
    addi    a0,a5,1664
    th.mveqz    a0,a1,a4
    ret

I'm not sure if there is an easy way to handle that.
Is there a real reason for testing the full sequences verbatim, or is 
testing the number of occurrences of th.mv{eqz,nez} enough?
It seems Jeff recently added -fno-sched-pressure to avoid similar issues 
but that apparently is no longer sufficient.


Thx,
-Vineet


Thanks,
Patrick

On 10/6/23 11:22, Vineet Gupta wrote:

Vlad recently introduced a new gate @ira_in_progress, similar to
counterparts @{reload,lra}_in_progress.

Use this to hide the constant synthesis splitter from being recog* ()
by IRA register equivalence logic which is eager to undo the splits,
generating worse code for constants (and sometimes no code at all).

See PR/109279 (large constant), PR/110748 (const -0.0) ...

Granted the IRA logic is subsided with -fsched-pressure which is now
enabled for RISC-V backend, the gate makes this future-proof in
addition to helping with -O1 etc.

This fixes 1 addition test

= Summary of gcc testsuite =
 | # of unexpected case / # of unique unexpected 
case
 |  gcc |  g++ | gfortran |

rv32imac/  ilp32/ medlow |  416 /   103 |   13 / 6 |   67 /12 |
  rv32imafdc/ ilp32d/ medlow |  416 /   103 |   13 / 6 |   24 / 4 |
rv64imac/   lp64/ medlow |  417 /   104 |9 / 3 |   67 /12 |
  rv64imafdc/  lp64d/ medlow |  416 /   103 |5 / 2 |6 / 1 |

Also similar to v1, this doesn't move RISC-V SPEC scores at all.

gcc/ChangeLog:
* config/riscv/riscv.md (mvconst_internal): Add !ira_in_progress.

Suggested-by: Jeff Law
Signed-off-by: Vineet Gupta
---
  gcc/config/riscv/riscv.md | 9 ++---
  1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index 1ebe8f92284d..da84b9357bd3 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -1997,13 +1997,16 @@
  
  ;; Pretend to have the ability to load complex const_int in order to get

  ;; better code generation around them.
-;;
  ;; But avoid constants that are special cased elsewhere.
+;;
+;; Hide it from IRA register equiv recog* () to elide potential undoing of 
split
+;;
  (define_insn_and_split "*mvconst_internal"
[(set (match_operand:GPR 0 "register_operand" "=r")
  (match_operand:GPR 1 "splittable_const_int_operand" "i"))]
-  "!(p2m1_shift_operand (operands[1], mode)
- || high_mask_shift_operand (operands[1], mode))"
+  "!ira_in_progress
+   && !(p2m1_shift_operand (operands[1], mode)
+|| high_mask_shift_operand (operands[1], mode))"
"#"
"&& 1"
[(const_int 0)]




Re: xthead regression with [COMMITTED] RISC-V: const: hide mvconst splitter from IRA

2023-10-09 Thread Jeff Law




On 10/9/23 14:36, Vineet Gupta wrote:

Hi Christoph,

On 10/9/23 12:06, Patrick O'Neill wrote:


Hi Vineet,

We're seeing a regression on all riscv targets after this patch:

FAIL: gcc.target/riscv/xtheadcondmov-indirect.c -O2 
check-function-bodies ConNmv_imm_imm_reg
FAIL: gcc.target/riscv/xtheadcondmov-indirect.c -O3 -g 
check-function-bodies ConNmv_imm_imm_reg


Debug log output:
body: \taddi    a[0-9]+,a[0-9]+,-1000+
\tli    a[0-9]+,9998336+
\taddi    a[0-9]+,a[0-9]+,1664+
\tth.mveqz    a[0-9]+,a[0-9]+,a[0-9]+
\tret

against:     li    a5,9998336
    addi    a4,a0,-1000
    addi    a0,a5,1664
    th.mveqz    a0,a1,a4
    ret

https://github.com/patrick-rivos/gcc-postcommit-ci/issues/8
https://github.com/ewlu/riscv-gnu-toolchain/issues/286



It seems with my patch, exactly same instructions get out of order (for 
-O2/-O3) tripping up the test results and differ from say O1 for exact 
same build.


-O2 w/ patch
ConNmv_imm_imm_reg:
     li    a5,9998336
     addi    a4,a0,-1000
     addi    a0,a5,1664
     th.mveqz    a0,a1,a4
     ret

-O1 w/ patch
ConNmv_imm_imm_reg:
     addi    a4,a0,-1000
     li    a5,9998336
     addi    a0,a5,1664
     th.mveqz    a0,a1,a4
     ret

I'm not sure if there is an easy way to handle that.
Is there a real reason for testing the full sequences verbatim, or is 
testing number of occurrences of th.mv{eqz,nez} enough.
It seems Jeff recently added -fno-sched-pressure to avoid similar issues 
but that apparently is no longer sufficient.

I'd suggest doing a count test rather than an exact match.

Verify you get a single li, two addis and one th.mveqz
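
(Hypothetically, something along these lines; the actual counts would need to
cover every function in the file, since scan-assembler-times matches the whole
assembly output:)

/* { dg-final { scan-assembler-times {\mli\M} 1 } } */
/* { dg-final { scan-assembler-times {\maddi\M} 2 } } */
/* { dg-final { scan-assembler-times {\mth\.mveqz\M} 1 } } */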

Jeff


Re: [PATCH v4] c++: Check for indirect change of active union member in constexpr [PR101631,PR102286]

2023-10-09 Thread Jason Merrill

On 10/8/23 21:03, Nathaniel Shead wrote:

Ping for https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631203.html

+ && (TREE_CODE (t) == MODIFY_EXPR
+ /* Also check if initializations have implicit change of active
+member earlier up the access chain.  */
+ || !refs->is_empty())


I'm not sure what the cumulative point of these two tests is.  TREE_CODE 
(t) will be either MODIFY_EXPR or INIT_EXPR, and either should be OK.


As I understand it, the problematic case is something like 
constexpr-union2.C, where we're also looking at a MODIFY_EXPR.  So what 
is this check doing?


Incidentally, I think constexpr-union6.C could use a test where we pass 
&u.s to a function other than construct_at, and then try (and fail) to 
assign to the b member from that function.


Jason



Re: xthead regression with [COMMITTED] RISC-V: const: hide mvconst splitter from IRA

2023-10-09 Thread Christoph Müllner
On Mon, Oct 9, 2023 at 10:36 PM Vineet Gupta  wrote:
>
> Hi Christoph,
>
> On 10/9/23 12:06, Patrick O'Neill wrote:
> >
> > Hi Vineet,
> >
> > We're seeing a regression on all riscv targets after this patch:
> >
> > FAIL: gcc.target/riscv/xtheadcondmov-indirect.c -O2
> > check-function-bodies ConNmv_imm_imm_reg
> > FAIL: gcc.target/riscv/xtheadcondmov-indirect.c -O3 -g
> > check-function-bodies ConNmv_imm_imm_reg
> >
> > Debug log output:
> > body: \taddi a[0-9]+,a[0-9]+,-1000+
> > \tli a[0-9]+,9998336+
> > \taddi a[0-9]+,a[0-9]+,1664+
> > \tth.mveqz a[0-9]+,a[0-9]+,a[0-9]+
> > \tret
> >
> > against: li a5,9998336
> > addi a4,a0,-1000
> > addi a0,a5,1664
> > th.mveqz a0,a1,a4
> > ret
> >
> > https://github.com/patrick-rivos/gcc-postcommit-ci/issues/8
> > https://github.com/ewlu/riscv-gnu-toolchain/issues/286
> >
>
> It seems with my patch, exactly same instructions get out of order (for
> -O2/-O3) tripping up the test results and differ from say O1 for exact
> same build.
>
> -O2 w/ patch
> ConNmv_imm_imm_reg:
>  li a5,9998336
>  addi a4,a0,-1000
>  addi a0,a5,1664
>  th.mveqz a0,a1,a4
>  ret
>
> -O1 w/ patch
> ConNmv_imm_imm_reg:
>  addi a4,a0,-1000
>  li a5,9998336
>  addi a0,a5,1664
>  th.mveqz a0,a1,a4
>  ret
>
> I'm not sure if there is an easy way to handle that.
> Is there a real reason for testing the full sequences verbatim, or is
> testing number of occurrences of th.mv{eqz,nez} enough.

I did not write the test cases, I just merged two non-functional test files
into one that works without changing the actual test approach.

Given that this causes repeated issues, I think that a fall-back to counting
occurrences is the right thing to do.

I can do that if that's ok.

BR
Christoph



> It seems Jeff recently added -fno-sched-pressure to avoid similar issues
> but that apparently is no longer sufficient.
>
> Thx,
> -Vineet
>
> > Thanks,
> > Patrick
> >
> > On 10/6/23 11:22, Vineet Gupta wrote:
> >> Vlad recently introduced a new gate @ira_in_progress, similar to
> >> counterparts @{reload,lra}_in_progress.
> >>
> >> Use this to hide the constant synthesis splitter from being recog* ()
> >> by IRA register equivalence logic which is eager to undo the splits,
> >> generating worse code for constants (and sometimes no code at all).
> >>
> >> See PR/109279 (large constant), PR/110748 (const -0.0) ...
> >>
> >> Granted the IRA logic is subsided with -fsched-pressure which is now
> >> enabled for RISC-V backend, the gate makes this future-proof in
> >> addition to helping with -O1 etc.
> >>
> >> This fixes 1 addition test
> >>
> >> = Summary of gcc testsuite =
> >>  | # of unexpected case / # of unique 
> >> unexpected case
> >>  |  gcc |  g++ | gfortran |
> >>
> >> rv32imac/  ilp32/ medlow |  416 /   103 |   13 / 6 |   67 /12 |
> >>   rv32imafdc/ ilp32d/ medlow |  416 /   103 |   13 / 6 |   24 / 4 |
> >> rv64imac/   lp64/ medlow |  417 /   104 |9 / 3 |   67 /12 |
> >>   rv64imafdc/  lp64d/ medlow |  416 /   103 |5 / 2 |6 / 1 |
> >>
> >> Also similar to v1, this doesn't move RISC-V SPEC scores at all.
> >>
> >> gcc/ChangeLog:
> >>  * config/riscv/riscv.md (mvconst_internal): Add !ira_in_progress.
> >>
> >> Suggested-by: Jeff Law
> >> Signed-off-by: Vineet Gupta
> >> ---
> >>   gcc/config/riscv/riscv.md | 9 ++---
> >>   1 file changed, 6 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
> >> index 1ebe8f92284d..da84b9357bd3 100644
> >> --- a/gcc/config/riscv/riscv.md
> >> +++ b/gcc/config/riscv/riscv.md
> >> @@ -1997,13 +1997,16 @@
> >>
> >>   ;; Pretend to have the ability to load complex const_int in order to get
> >>   ;; better code generation around them.
> >> -;;
> >>   ;; But avoid constants that are special cased elsewhere.
> >> +;;
> >> +;; Hide it from IRA register equiv recog* () to elide potential undoing 
> >> of split
> >> +;;
> >>   (define_insn_and_split "*mvconst_internal"
> >> [(set (match_operand:GPR 0 "register_operand" "=r")
> >>   (match_operand:GPR 1 "splittable_const_int_operand" "i"))]
> >> -  "!(p2m1_shift_operand (operands[1], mode)
> >> - || high_mask_shift_operand (operands[1], mode))"
> >> +  "!ira_in_progress
> >> +   && !(p2m1_shift_operand (operands[1], mode)
> >> +|| high_mask_shift_operand (operands[1], mode))"
> >> "#"
> >> "&& 1"
> >> [(const_int 0)]
>


Re: xthead regression with [COMMITTED] RISC-V: const: hide mvconst splitter from IRA

2023-10-09 Thread Vineet Gupta

On 10/9/23 13:46, Christoph Müllner wrote:
Given that this causes repeated issues, I think that a fall-back to 
counting occurrences is the right thing to do. I can do that if that's ok.


Thanks Christoph.

-Vineet

