RE: [PATCH 2/2] Add a new permute optimization step in SLP

2024-10-18 Thread Richard Biener
On Thu, 17 Oct 2024, Tamar Christina wrote:

> Hi Christoph,
> 
> > -----Original Message-----
> > From: Christoph Müllner 
> > Sent: Tuesday, October 15, 2024 3:57 PM
> > To: gcc-patches@gcc.gnu.org; Philipp Tomsich ; 
> > Tamar
> > Christina ; Richard Biener 
> > Cc: Jeff Law ; Robin Dapp ;
> > Christoph Müllner 
> > Subject: [PATCH 2/2] Add a new permute optimization step in SLP
> > 
> > This commit adds a new permute optimization step after running SLP
> > vectorization.
> > Although there are existing places where individual or nested permutes
> > can be optimized, there are cases where independent permutes can be
> > optimized, which cannot be expressed in the current pattern matching
> > framework.  The optimization step is run at the end so that permutes
> > from completely different SLP builds can be optimized.
> > 
> > The initial optimizations implemented can detect some cases where different
> > "select permutes" (permutes that only use some of the incoming vector lanes)
> > can be co-located in a single permute. This can optimize some cases where
> > two_operator SLP nodes have duplicate elements.
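
A minimal GIMPLE-style sketch of the co-location idea (names and lane
indices invented for illustration).  Suppose only lanes 0-1 of each of
these two select permutes are actually used:

  _1 = VEC_PERM_EXPR <x_2, y_3, { 0, 4, 0, 4 }>;
  _5 = VEC_PERM_EXPR <x_2, y_3, { 1, 5, 1, 5 }>;

Then both can be co-located in a single permute

  _7 = VEC_PERM_EXPR <x_2, y_3, { 0, 4, 1, 5 }>;

with the users of _1 reading lanes 0-1 of _7 and the users of _5 reading
lanes 2-3.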
> > 
> > Bootstrapped and reg-tested on AArch64 (C, C++, Fortran).
> > 
> > Manolis Tsamis was the patch's initial author before I took it over.
> > 
> > gcc/ChangeLog:
> > 
> > * tree-vect-slp.cc (get_tree_def): Return the definition of a name.
> > (recognise_perm_binop_perm_pattern): Helper function.
> > (vect_slp_optimize_permutes): New permute optimization step.
> > (vect_slp_function): Run the new permute optimization step.
> > 
> > gcc/testsuite/ChangeLog:
> > 
> > * gcc.dg/vect/slp-perm-14.c: New test.
> > * gcc.target/aarch64/sve/slp-perm-14.c: New test.
> > 
> > Signed-off-by: Christoph Müllner 
> > ---
> >  gcc/testsuite/gcc.dg/vect/slp-perm-14.c   |  42 +++
> >  .../gcc.target/aarch64/sve/slp-perm-14.c  |   3 +
> >  gcc/tree-vect-slp.cc  | 248 ++
> >  3 files changed, 293 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-perm-14.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp-perm-14.c
> > 
> > diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-14.c b/gcc/testsuite/gcc.dg/vect/slp-perm-14.c
> > new file mode 100644
> > index 000..f56e3982a62
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/slp-perm-14.c
> > @@ -0,0 +1,42 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-O3 -fdump-tree-slp1-details" } */
> > +
> > +#include <stdint.h>
> > +
> > +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
> > +int t0 = s0 + s1;\
> > +int t1 = s0 - s1;\
> > +int t2 = s2 + s3;\
> > +int t3 = s2 - s3;\
> > +d0 = t0 + t2;\
> > +d1 = t1 + t3;\
> > +d2 = t0 - t2;\
> > +d3 = t1 - t3;\
> > +}
> > +
> > +int
> > +x264_pixel_satd_8x4_simplified (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2)
> > +{
> > +  uint32_t tmp[4][4];
> > +  uint32_t a0, a1, a2, a3;
> > +  int sum = 0;
> > +
> > +  for (int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2)
> > +{
> > +  a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
> > +  a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
> > +  a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
> > +  a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
> > +  HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, a3);
> > +}
> > +
> > +  for (int i = 0; i < 4; i++)
> > +{
> > +  HADAMARD4(a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i]);
> > +  sum += a0 + a1 + a2 + a3;
> > +}
> > +
> > +  return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "slp1" } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-14.c b/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-14.c
> > new file mode 100644
> > index 000..4e0d5175be8
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-14.c
> > @@ -0,0 +1,3 @@
> > +#include "../../../gcc.dg/vect/slp-perm-14.c"
> > +
> > +/* { dg-final { scan-assembler-not {\ttbl\t} } } */
> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > index 8794c94ef90..4bf5ccb9cdf 100644
> > --- a/gcc/tree-vect-slp.cc
> > +++ b/gcc/tree-vect-slp.cc
> > @@ -9478,6 +9478,252 @@ vect_slp_if_converted_bb (basic_block bb, loop_p orig_loop)
> >return vect_slp_bbs (bbs, orig_loop);
> >  }
> > 
> > +/* If NAME is an SSA_NAME defined by an assignment, return that assignment.
> > +   If SINGLE_USE_ONLY is true and NAME has multiple uses, return NULL.  */
> > +
> > +static gassign *
> > +get_tree_def (tree name, bool single_use_only)
> > +{
> > +  if (TREE_CODE (name) != SSA_NAME)
> > +return NULL;
> > +
> > +  gimple *def_stmt = SSA_NAME_DEF_STMT (name);
> > +
> > +  if (single_use_only && !has_single_use (name))
> > +return NULL;
> > +
> > +  if (!is_g

Re: pair-fusion: Assume alias conflict if common address reg changes [PR116783]

2024-10-18 Thread Alex Coplan
On 11/10/2024 14:30, Richard Biener wrote:
> On Fri, 11 Oct 2024, Richard Sandiford wrote:
> 
> > Alex Coplan  writes:
> > > Hi,
> > >
> > > As the PR shows, pair-fusion was tricking memory_modified_in_insn_p into
> > > returning false when a common base register (in this case, x1) was
> > > modified between the mem and the store insn.  This led to wrong code,
> > > as the accesses really did alias.
> > >
> > > To avoid this sort of problem, this patch avoids invoking RTL alias
> > > analysis altogether (and assumes an alias conflict) if the two insns to
> > > be compared share a common address register R, and the insns see different
> > > definitions of R (i.e. it was modified in between).
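
A hypothetical instruction sequence of the kind the PR describes
(registers and offsets invented for illustration):

   str  w2, [x1, #4]   // store, address based on x1
   add  x1, x1, #4     // x1 redefined between the two accesses
   ldr  w0, [x1]       // same address, but a different definition of x1

Comparing the two MEMs textually suggests distinct addresses, x1+4
vs. x1, so once the shared base register has been redefined the RTL
alias machinery cannot be trusted and a conflict must be assumed.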
> > >
> > > Bootstrapped/regtested on aarch64-linux-gnu (all languages, both regular
> > > bootstrap and LTO+PGO bootstrap).  OK for trunk?
> > 
> > Sorry for the slow review.  The patch looks good to me, but...

Thanks for the review.  I'd missed that you'd sent this, sorry for not
responding sooner.

> > 
> > > @@ -2544,11 +2624,37 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> > >  && bitmap_bit_p (&m_tombstone_bitmap, insn->uid ());
> > >};
> > >  
> > > +  // Maximum number of distinct regnos we expect to appear in a single
> > > +  // MEM (and thus in a candidate insn).
> > > +  static constexpr int max_mem_regs = 2;
> > > +  auto_vec<use_info *, max_mem_regs> addr_use_vec[2];
> > > +  use_array addr_uses[2];
> > > +
> > > +  // Collect the lists of register uses that occur in the candidate MEMs.
> > > +  for (int i = 0; i < 2; i++)
> > > +{
> > > +  // N.B. it's safe for us to ignore uses that only occur in notes
> > > +  // here (e.g. in a REG_EQUIV expression) since we only pass the
> > > +  // MEM down to the alias machinery, so it can't see any insn-level
> > > +  // notes.
> > > +  for (auto use : insns[i]->uses ())
> > > + if (use->is_reg ()
> > > + && use->includes_address_uses ()
> > > + && !use->only_occurs_in_notes ())
> > > +   {
> > > + gcc_checking_assert (addr_use_vec[i].length () < max_mem_regs);
> > > + addr_use_vec[i].quick_push (use);
> > 
> > ...if possible, I think it would be better to just use safe_push here,
> > without the assert.  There'd then be no need to split max_mem_regs out;
> > it could just be hard-coded in the addr_use_vec declaration.

I hadn't realised at the time that quick_push () already does a
gcc_checking_assert to make sure that we don't overflow.  It does:

  template<typename T, typename A>
  inline T *
  vec<T, A, vl_embed>::quick_push (const T &obj)
  {
    gcc_checking_assert (space (1));
    T *slot = &address ()[m_vecpfx.m_num++];
    ::new (static_cast<void *> (slot)) T (obj);
    return slot;
  }

(I checked the behaviour by writing a quick selftest in vec.cc, and it
indeed aborts as expected with quick_push on overflow for a
stack-allocated auto_vec with N = 2.)

This means that the assert above is indeed redundant, so I agree that
we should be able to drop the assert and drop the max_mem_regs constant,
using a literal inside the auto_vec template instead (all while still
using quick_push).

Does that sound OK to you, or did you have another reason to prefer
safe_push?  AIUI the behaviour of safe_push on overflow would be to
allocate a new (heap-allocated) vector instead of asserting.
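
A minimal sketch of the difference under discussion, using the
GCC-internal vec API:

  auto_vec<int, 2> v;   // two elements of embedded, stack-allocated storage
  v.quick_push (1);     // ok: within the preallocated space
  v.quick_push (2);     // ok: gcc_checking_assert (space (1)) still holds
  v.safe_push (3);      // would reallocate onto the heap and succeed
  // v.quick_push (3);  // would instead trip the checking assert

So quick_push preserves the "fully on-stack" property by aborting on
overflow, while safe_push silently falls back to heap allocation.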

> > 
> > Or does that not work for some reason?  I'm getting a sense of deja vu...
> 
> safe_push should work but as I understand the desire is to rely
> on fully on-stack pre-allocated vectors?

Yes, that was indeed the original intent.

Thanks,
Alex

> 
> > If it doesn't work, an alternative would be to use access_array_builder.
> > 
> > OK for trunk and backports if using safe_push works.
> > 
> > Thanks,
> > Richard
> > 
> > > +   }
> > > +  addr_uses[i] = use_array (addr_use_vec[i]);
> > > +}
> > > +
> > 
> > 
> > >store_walker
> > > -forward_store_walker (mem_defs[0], cand_mems[0], insns[1], tombstone_p);
> > > +forward_store_walker (mem_defs[0], cand_mems[0], addr_uses[0], insns[1],
> > > +   tombstone_p);
> > >  
> > >store_walker
> > > -backward_store_walker (mem_defs[1], cand_mems[1], insns[0], tombstone_p);
> > > +backward_store_walker (mem_defs[1], cand_mems[1], addr_uses[1], insns[0],
> > > +tombstone_p);
> > >  
> > >alias_walker *walkers[4] = {};
> > >if (mem_defs[0])
> > > @@ -2562,8 +2668,10 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> > >  {
> > >// We want to find any loads hanging off the first store.
> > >mem_defs[0] = memory_access (insns[0]->defs ());
> > > -  load_walker forward_load_walker (mem_defs[0], insns[0], insns[1]);
> > > -  load_walker backward_load_walker (mem_defs[1], insns[1], insns[0]);
> > > +  load_walker forward_load_walker (mem_defs[0], insns[0],
> > > +   addr_uses[0], insns[1]);
> > > +  load_walker backwa

[committed] hppa: Add LRA support

2024-10-18 Thread John David Anglin
Tested on hppa-unknown-linux-gnu and hppa64-hp-hpux11.11.  Committed
to trunk.

Dave
---

hppa: Add LRA support

LRA is not enabled by default since there are some new test failures
remaining to resolve.

2024-10-18  John David Anglin  

gcc/ChangeLog:

PR target/113933
* config/pa/pa.cc (pa_use_lra_p): Declare.
(TARGET_LRA_P): Change define to pa_use_lra_p.
(pa_use_lra_p): New function.
(legitimize_pic_address): Also check lra_in_progress.
(pa_emit_move_sequence): Likewise.
(pa_legitimate_constant_p): Likewise.
(pa_legitimate_address_p): Likewise.
(pa_secondary_reload): For floating-point loads and stores,
return NO_REGS for REG and SUBREG operands.  Return
GENERAL_REGS for some shift register spills.
* config/pa/pa.opt: Add mlra option.
* config/pa/predicates.md (integer_store_memory_operand):
Also check lra_in_progress.
(floating_point_store_memory_operand): Likewise.
(reg_before_reload_operand): Likewise.

diff --git a/gcc/config/pa/pa.cc b/gcc/config/pa/pa.cc
index 84aa4f1b1f2..62f8764b7ca 100644
--- a/gcc/config/pa/pa.cc
+++ b/gcc/config/pa/pa.cc
@@ -209,6 +209,7 @@ static bool pa_can_change_mode_class (machine_mode, machine_mode, reg_class_t);
 static HOST_WIDE_INT pa_starting_frame_offset (void);
 static section* pa_elf_select_rtx_section(machine_mode, rtx, unsigned HOST_WIDE_INT) ATTRIBUTE_UNUSED;
 static void pa_atomic_assign_expand_fenv (tree *, tree *, tree *);
+static bool pa_use_lra_p (void);
 
 /* The following extra sections are only used for SOM.  */
 static GTY(()) section *som_readonly_data_section;
@@ -412,7 +413,7 @@ static size_t n_deferred_plabels = 0;
 #define TARGET_LEGITIMATE_ADDRESS_P pa_legitimate_address_p
 
 #undef TARGET_LRA_P
-#define TARGET_LRA_P hook_bool_void_false
+#define TARGET_LRA_P pa_use_lra_p
 
 #undef TARGET_HARD_REGNO_NREGS
 #define TARGET_HARD_REGNO_NREGS pa_hard_regno_nregs
@@ -973,7 +974,7 @@ legitimize_pic_address (rtx orig, machine_mode mode, rtx reg)
 
   /* During and after reload, we need to generate a REG_LABEL_OPERAND note
 and update LABEL_NUSES because this is not done automatically.  */
-  if (reload_in_progress || reload_completed)
+  if (lra_in_progress || reload_in_progress || reload_completed)
{
  /* Extract LABEL_REF.  */
  if (GET_CODE (orig) == CONST)
@@ -998,7 +999,7 @@ legitimize_pic_address (rtx orig, machine_mode mode, rtx reg)
   /* Before reload, allocate a temporary register for the intermediate
 result.  This allows the sequence to be deleted when the final
 result is unused and the insns are trivially dead.  */
-  tmp_reg = ((reload_in_progress || reload_completed)
+  tmp_reg = ((lra_in_progress || reload_in_progress || reload_completed)
 ? reg : gen_reg_rtx (Pmode));
 
   if (function_label_operand (orig, VOIDmode))
@@ -1959,11 +1960,13 @@ pa_emit_move_sequence (rtx *operands, machine_mode mode, rtx scratch_reg)
   copy_to_mode_reg (Pmode, XEXP (operand1, 0)));
 
   if (scratch_reg
-  && reload_in_progress && GET_CODE (operand0) == REG
+  && reload_in_progress
+  && GET_CODE (operand0) == REG
   && REGNO (operand0) >= FIRST_PSEUDO_REGISTER)
 operand0 = reg_equiv_mem (REGNO (operand0));
   else if (scratch_reg
-  && reload_in_progress && GET_CODE (operand0) == SUBREG
+  && reload_in_progress
+  && GET_CODE (operand0) == SUBREG
   && GET_CODE (SUBREG_REG (operand0)) == REG
   && REGNO (SUBREG_REG (operand0)) >= FIRST_PSEUDO_REGISTER)
 {
@@ -1976,11 +1979,13 @@ pa_emit_move_sequence (rtx *operands, machine_mode mode, rtx scratch_reg)
 }
 
   if (scratch_reg
-  && reload_in_progress && GET_CODE (operand1) == REG
+  && reload_in_progress
+  && GET_CODE (operand1) == REG
   && REGNO (operand1) >= FIRST_PSEUDO_REGISTER)
 operand1 = reg_equiv_mem (REGNO (operand1));
   else if (scratch_reg
-  && reload_in_progress && GET_CODE (operand1) == SUBREG
+  && reload_in_progress
+  && GET_CODE (operand1) == SUBREG
   && GET_CODE (SUBREG_REG (operand1)) == REG
   && REGNO (SUBREG_REG (operand1)) >= FIRST_PSEUDO_REGISTER)
 {
@@ -1992,12 +1997,16 @@ pa_emit_move_sequence (rtx *operands, machine_mode mode, rtx scratch_reg)
   operand1 = alter_subreg (&temp, true);
 }
 
-  if (scratch_reg && reload_in_progress && GET_CODE (operand0) == MEM
+  if (scratch_reg
+  && (lra_in_progress || reload_in_progress)
+  && GET_CODE (operand0) == MEM
   && ((tem = find_replacement (&XEXP (operand0, 0)))
  != XEXP (operand0, 0)))
 operand0 = replace_equiv_address (operand0, tem);
 
-  if (scratch_reg && reload_in_progress && GET_CODE (operand1) == MEM
+  if (scratch_reg
+  && (lra_in_progress || reload_in_progress)
+  && GET_CODE (operand1) == MEM
  

[committed] i386: Fix the order of operands in andn<mode>3 [PR117192]

2024-10-18 Thread Uros Bizjak
Fix the order of operands in the andn<mode>3 expander to comply
with the specification, where the bitwise complement applies to operand 2.
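
For reference, the semantics the expander must implement are, as a
scalar sketch (not part of the patch):

  /* andn: the bitwise complement applies to operand 2, not operand 1.  */
  op0 = op1 & ~op2;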

PR target/117192

gcc/ChangeLog:

* config/i386/mmx.md (andn<mode>3): Swap operand
indexes 1 and 2 to comply with andn specification.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr117192.c: New test.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index ef4ed8b501a..506f4cab6a8 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -4470,9 +4470,9 @@ (define_split
 (define_expand "andn3"
   [(set (match_operand:MMXMODEI 0 "register_operand")
 (and:MMXMODEI
-  (not:MMXMODEI (match_operand:MMXMODEI 1 "register_operand"))
-  (match_operand:MMXMODEI 2 "register_operand")))]
-  "TARGET_SSE2")
+  (not:MMXMODEI (match_operand:MMXMODEI 2 "register_operand"))
+  (match_operand:MMXMODEI 1 "register_operand")))]
+  "TARGET_MMX_WITH_SSE")
 
 (define_insn "mmx_andnot3"
   [(set (match_operand:MMXMODEI 0 "register_operand" "=y,x,x,v")
diff --git a/gcc/testsuite/gcc.target/i386/pr117192.c b/gcc/testsuite/gcc.target/i386/pr117192.c
new file mode 100644
index 000..8480c72dc0e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr117192.c
@@ -0,0 +1,16 @@
+/* PR target/117192 */
+/* { dg-do run } */
+/* { dg-options "-O3 -fno-unswitch-loops" } */
+
+int a, b, c, d;
+int main() {
+  int e[6];
+  for (d = 0; d < 6; d++)
+if (!c)
+  e[d] = 0;
+  for (; b < 6; b++)
+a = e[b];
+  if (a != 0)
+__builtin_abort();
+  return 0;
+}


Re: [PATCH 1/9] Make more places handle exact_div like trunc_div

2024-10-18 Thread Richard Biener
On Fri, 18 Oct 2024, Richard Sandiford wrote:

> I tried to look for places where we were handling TRUNC_DIV_EXPR
> more favourably than EXACT_DIV_EXPR.
> 
> Most of the places that I looked at but didn't change were handling
> div/mod pairs.  But there's bound to be others I missed...

OK, but I'd prefer trunc_or_exact_div_p to be explicit.
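
A sketch of the suggested spelling (signature assumed; the posted patch
names the helper trunc_div_p):

  /* True for division codes that truncate towards zero.  An exact
     division may be treated like a truncating one, since exactness
     only strengthens the guarantee.  */
  inline bool
  trunc_or_exact_div_p (enum tree_code code)
  {
    return code == TRUNC_DIV_EXPR || code == EXACT_DIV_EXPR;
  }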

Thanks,
Richard.

> gcc/
>   * match.pd: Extend some rules to handle exact_div like trunc_div.
>   * tree.h (trunc_div_p): New function.
>   * tree-ssa-loop-niter.cc (is_rshift_by_1): Use it.
>   * tree-ssa-loop-ivopts.cc (force_expr_to_var_cost): Handle
>   EXACT_DIV_EXPR.
> ---
>  gcc/match.pd| 60 +++--
>  gcc/tree-ssa-loop-ivopts.cc |  2 ++
>  gcc/tree-ssa-loop-niter.cc  |  2 +-
>  gcc/tree.h  | 13 
>  4 files changed, 47 insertions(+), 30 deletions(-)
> 
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 12d81fcac0d..4aea028a866 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -492,27 +492,28 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> of A starting from shift's type sign bit are zero, as
> (unsigned long long) (1 << 31) is -2147483648ULL, not 2147483648ULL,
> so it is valid only if A >> 31 is zero.  */
> -(simplify
> - (trunc_div (convert?@0 @3) (convert2? (lshift integer_onep@1 @2)))
> - (if ((TYPE_UNSIGNED (type) || tree_expr_nonnegative_p (@0))
> -  && (!VECTOR_TYPE_P (type)
> -   || target_supports_op_p (type, RSHIFT_EXPR, optab_vector)
> -   || target_supports_op_p (type, RSHIFT_EXPR, optab_scalar))
> -  && (useless_type_conversion_p (type, TREE_TYPE (@1))
> -   || (element_precision (type) >= element_precision (TREE_TYPE (@1))
> -   && (TYPE_UNSIGNED (TREE_TYPE (@1))
> -   || (element_precision (type)
> -   == element_precision (TREE_TYPE (@1)))
> -   || (INTEGRAL_TYPE_P (type)
> -   && (tree_nonzero_bits (@0)
> -   & wi::mask (element_precision (TREE_TYPE (@1)) - 1,
> -   true,
> -   element_precision (type))) == 0)
> -   (if (!VECTOR_TYPE_P (type)
> - && useless_type_conversion_p (TREE_TYPE (@3), TREE_TYPE (@1))
> - && element_precision (TREE_TYPE (@3)) < element_precision (type))
> -(convert (rshift @3 @2))
> -(rshift @0 @2
> +(for div (trunc_div exact_div)
> + (simplify
> +  (div (convert?@0 @3) (convert2? (lshift integer_onep@1 @2)))
> +  (if ((TYPE_UNSIGNED (type) || tree_expr_nonnegative_p (@0))
> +   && (!VECTOR_TYPE_P (type)
> +|| target_supports_op_p (type, RSHIFT_EXPR, optab_vector)
> +|| target_supports_op_p (type, RSHIFT_EXPR, optab_scalar))
> +   && (useless_type_conversion_p (type, TREE_TYPE (@1))
> +|| (element_precision (type) >= element_precision (TREE_TYPE (@1))
> +&& (TYPE_UNSIGNED (TREE_TYPE (@1))
> +|| (element_precision (type)
> +== element_precision (TREE_TYPE (@1)))
> +|| (INTEGRAL_TYPE_P (type)
> +&& (tree_nonzero_bits (@0)
> +& wi::mask (element_precision (TREE_TYPE (@1)) - 1,
> +true,
> +element_precision (type))) == 0)
> +(if (!VECTOR_TYPE_P (type)
> +  && useless_type_conversion_p (TREE_TYPE (@3), TREE_TYPE (@1))
> +  && element_precision (TREE_TYPE (@3)) < element_precision (type))
> + (convert (rshift @3 @2))
> + (rshift @0 @2)
>  
>  /* Preserve explicit divisions by 0: the C++ front-end wants to detect
> undefined behavior in constexpr evaluation, and assuming that the division
> @@ -947,13 +948,14 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>   { build_one_cst (utype); })))
>  
>  /* Simplify (unsigned t * 2)/2 -> unsigned t & 0x7FFF.  */
> -(simplify
> - (trunc_div (mult @0 integer_pow2p@1) @1)
> - (if (INTEGRAL_TYPE_P (TREE_TYPE (@0)) && TYPE_UNSIGNED (TREE_TYPE (@0)))
> -  (bit_and @0 { wide_int_to_tree
> - (type, wi::mask (TYPE_PRECISION (type)
> -  - wi::exact_log2 (wi::to_wide (@1)),
> -  false, TYPE_PRECISION (type))); })))
> +(for div (trunc_div exact_div)
> + (simplify
> +  (div (mult @0 integer_pow2p@1) @1)
> +  (if (INTEGRAL_TYPE_P (TREE_TYPE (@0)) && TYPE_UNSIGNED (TREE_TYPE (@0)))
> +   (bit_and @0 { wide_int_to_tree
> +  (type, wi::mask (TYPE_PRECISION (type)
> +   - wi::exact_log2 (wi::to_wide (@1)),
> +   false, TYPE_PRECISION (type))); }
>  
>  /* Simplify (unsigned t / 2) * 2 -> unsigned t & ~1.  */
>  (simplify
> @@ -5715,7 +5717,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  
>  /* Sink binary operation to branches, but only if we can fold it.  */
>  (for op (tcc_comparison plus minus mult bit_and bit_io

[PATCH 1/7] RISC-V: Fix indentation in riscv_vector::expand_block_move [NFC]

2024-10-18 Thread Craig Blackmore
gcc/ChangeLog:

* config/riscv/riscv-string.cc (expand_block_move): Fix
indentation.
---
 gcc/config/riscv/riscv-string.cc | 32 
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/gcc/config/riscv/riscv-string.cc b/gcc/config/riscv/riscv-string.cc
index 4bb8bcec4a5..0c5ffd7d861 100644
--- a/gcc/config/riscv/riscv-string.cc
+++ b/gcc/config/riscv/riscv-string.cc
@@ -1086,22 +1086,22 @@ expand_block_move (rtx dst_in, rtx src_in, rtx length_in)
 {
   HOST_WIDE_INT length = INTVAL (length_in);
 
-/* By using LMUL=8, we can copy as many bytes in one go as there
-   are bits in a vector register.  If the entire block thus fits,
-   we don't need a loop.  */
-if (length <= TARGET_MIN_VLEN)
-  {
-   need_loop = false;
-
-   /* If a single scalar load / store pair can do the job, leave it
-  to the scalar code to do that.  */
-   /* ??? If fast unaligned access is supported, the scalar code could
-  use suitably sized scalars irrespective of alignment.  If that
-  gets fixed, we have to adjust the test here.  */
-
-   if (pow2p_hwi (length) && length <= potential_ew)
- return false;
-  }
+  /* By using LMUL=8, we can copy as many bytes in one go as there
+are bits in a vector register.  If the entire block thus fits,
+we don't need a loop.  */
+  if (length <= TARGET_MIN_VLEN)
+   {
+ need_loop = false;
+
+ /* If a single scalar load / store pair can do the job, leave it
+to the scalar code to do that.  */
+ /* ??? If fast unaligned access is supported, the scalar code could
+use suitably sized scalars irrespective of alignment.  If that
+gets fixed, we have to adjust the test here.  */
+
+ if (pow2p_hwi (length) && length <= potential_ew)
+   return false;
+   }
 
   /* Find the vector mode to use.  Using the largest possible element
 size is likely to give smaller constants, and thus potentially
-- 
2.43.0



Re: [PATCH 9/9] Record nonzero bits in the irange_bitmask of POLY_INT_CSTs

2024-10-18 Thread Andrew MacLeod

That seems like a very reasonable place.

Andrew

On 10/18/24 08:11, Richard Biener wrote:

On Fri, 18 Oct 2024, Richard Sandiford wrote:


At the moment, ranger punts entirely on POLY_INT_CSTs.  Numerical
ranges are a bit difficult, unless we do start modelling bounds on
the indeterminates.  But we can at least track the nonzero bits.
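
A small worked example of what the bitmask buys us (SVE values assumed):
svcnth () is the POLY_INT_CST 8 + 8x, so every value it can take is a
multiple of 8 and get_nonzero_bits reports the low three bits as zero.
That lets ranger fold shift pairs even though the numeric range stays
unknown:

  uint64_t y = svcnth ();   /* nonzero bits: ...11111000 */
  y >>= 3;                  /* no set bits are discarded, */
  y <<= 3;                  /* so the pair folds away */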

OK unless Andrew knows a better proper place to do this.

Thanks,
Richard.


gcc/
* value-query.cc (range_query::get_tree_range): Use get_nonzero_bits
to populate the irange_bitmask of a POLY_INT_CST.

gcc/testsuite/
* gcc.target/aarch64/sve/cnt_fold_6.c: New test.
---
  .../gcc.target/aarch64/sve/cnt_fold_6.c   | 75 +++
  gcc/value-query.cc|  7 ++
  2 files changed, 82 insertions(+)
  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_6.c

diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_6.c b/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_6.c
new file mode 100644
index 000..9d9e1ca9330
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_6.c
@@ -0,0 +1,75 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include <arm_sve.h>
+
+/*
+** f1:
+** ...
+** cntb(x[0-9]+)
+** ...
+** add x[0-9]+, \1, #?16
+** ...
+** csel[^\n]+
+** ret
+*/
+uint64_t
+f1 (int x)
+{
+  uint64_t y = x ? svcnth () : svcnth () + 8;
+  y >>= 3;
+  y <<= 4;
+  return y;
+}
+
+/*
+** f2:
+** ...
+** (?:and|[al]sr)  [^\n]+
+** ...
+** ret
+*/
+uint64_t
+f2 (int x)
+{
+  uint64_t y = x ? svcnth () : svcnth () + 8;
+  y >>= 4;
+  y <<= 5;
+  return y;
+}
+
+/*
+** f3:
+** ...
+** cntw(x[0-9]+)
+** ...
+** add x[0-9]+, \1, #?16
+** ...
+** csel[^\n]+
+** ret
+*/
+uint64_t
+f3 (int x)
+{
+  uint64_t y = x ? svcntd () : svcntd () + 8;
+  y >>= 1;
+  y <<= 2;
+  return y;
+}
+
+/*
+** f4:
+** ...
+** (?:and|[al]sr)  [^\n]+
+** ...
+** ret
+*/
+uint64_t
+f4 (int x)
+{
+  uint64_t y = x ? svcntd () : svcntd () + 8;
+  y >>= 2;
+  y <<= 3;
+  return y;
+}
diff --git a/gcc/value-query.cc b/gcc/value-query.cc
index cac2cb5b2bc..34499da1a98 100644
--- a/gcc/value-query.cc
+++ b/gcc/value-query.cc
@@ -375,6 +375,13 @@ range_query::get_tree_range (vrange &r, tree expr, gimple *stmt,
}
  
  default:

+  if (POLY_INT_CST_P (expr))
+   {
+ unsigned int precision = TYPE_PRECISION (type);
+ r.set_varying (type);
+ r.update_bitmask ({ wi::zero (precision), get_nonzero_bits (expr) });
+ return true;
+   }
break;
  }
if (BINARY_CLASS_P (expr) || COMPARISON_CLASS_P (expr))





Re: [PATCH v6] Target-independent store forwarding avoidance.

2024-10-18 Thread Jeff Law



On 10/18/24 3:57 AM, Konstantinos Eleftheriou wrote:

From: kelefth 

This pass detects cases of expensive store forwarding and tries to avoid them
by reordering the stores and using suitable bit insertion sequences.
For example it can transform this:

  strb    w2, [x1, 1]
  ldr x0, [x1]  # Expensive store forwarding to larger load.

To:

  ldr x0, [x1]
  strb    w2, [x1]
  bfi x0, x2, 0, 8

Assembly like this can appear with bitfields or type punning / unions.
On stress-ng when running the cpu-union microbenchmark the following speedups
have been observed.

   Neoverse-N1:      +29.4%
   Intel Coffeelake: +13.1%
   AMD 5950X:        +17.5%
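
Source that produces this shape typically looks like the following
(editorial sketch, not the benchmark's exact code):

  union u { char bytes[8]; long long word; };

  long long
  f (union u *p, char v)
  {
    p->bytes[1] = v;   /* narrow store ...  */
    return p->word;    /* ... forwarded into a wider load */
  }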
So just fired this up on the crosses after enabling it by default.  It's 
still got several hours to go, but there's a pretty clear goof in here 
that's showing up on multiple targets.


Just taking mcore-elf as an example, we're mis-compiling muldi3 from libgcc.

We have this in the .asmcons dump:


(insn 37 36 40 4 (set (mem/j/c:SI (reg/f:SI 8 r8) [1 MEM[(union  *)_61].s.low+0 
S4 A64])
(reg:SI 77 [ _10 ])) "/home/jlaw/test/gcc/libgcc/libgcc2.c":532:649 
discrim 3 65 {*mcore.md:1196}
 (expr_list:REG_DEAD (reg:SI 77 [ _10 ])
(nil)))

[ ... ]


(insn 44 43 45 4 (set (mem/j/c:SI (plus:SI (reg/f:SI 8 r8)
(const_int 4 [0x4])) [1 MEM[(union  *)_61].s.high+0 S4 A32])
(reg:SI 81 [ _18 ])) "/home/jlaw/test/gcc/libgcc/libgcc2.c":534:12 65 
{*mcore.md:1196}
 (expr_list:REG_DEAD (reg:SI 81 [ _18 ])
(nil)))
(note 45 44 49 4 NOTE_INSN_DELETED)
(insn 49 45 50 4 (set (reg/i:DI 2 r2)
(mem/j/c:DI (reg/f:SI 8 r8) [1 MEM[(union  *)_61].ll+0 S8 A64])) 
"/home/jlaw/test/gcc/libgcc/libgcc2.c":538:1 68 {movdi_i}
 (nil))


So we've got two SImode stores which are then loaded in DImode a bit 
later to set the return value for the function.  A very clear 
opportunity to do store forwarding.



In the store-forwarding dump we have:

(insn 70 36 40 4 (set (reg:SI 95)
(reg:SI 77 [ _10 ])) "/home/jlaw/test/gcc/libgcc/libgcc2.c":532:649 
discrim 3 65 {*mcore.md:1196}
 (nil))

[ ... ]

(insn 67 43 45 4 (set (reg:SI 94)
(reg:SI 81 [ _18 ])) "/home/jlaw/test/gcc/libgcc/libgcc2.c":534:12 65 
{*mcore.md:1196}
 (nil))
(note 45 67 71 4 NOTE_INSN_DELETED)
(insn 71 45 69 4 (set (mem/j/c:SI (reg/f:SI 8 r8) [1 MEM[(union  *)_61].s.low+0 
S4 A64])
(reg:SI 95)) "/home/jlaw/test/gcc/libgcc/libgcc2.c":538:1 65 
{*mcore.md:1196}
 (nil))
(insn 69 71 68 4 (set (reg:DI 93)
(subreg:DI (reg:SI 95) 0)) "/home/jlaw/test/gcc/libgcc/libgcc2.c":538:1 
68 {movdi_i}
 (nil))
(insn 68 69 66 4 (set (mem/j/c:SI (plus:SI (reg/f:SI 8 r8)
(const_int 4 [0x4])) [1 MEM[(union  *)_61].s.high+0 S4 A32])
(reg:SI 94)) "/home/jlaw/test/gcc/libgcc/libgcc2.c":538:1 65 
{*mcore.md:1196}
 (nil))
(insn 66 68 50 4 (set (subreg:SI (reg:DI 93) 4)
(reg:SI 94)) "/home/jlaw/test/gcc/libgcc/libgcc2.c":538:1 65 
{*mcore.md:1196}
 (nil))


Note that we never put a value into (reg:DI 2), so the return value from 
this routine is garbage, naturally leading to testsuite failures.


It looks like we're missing a copy from (reg:DI 93) to (reg:DI 2) to me.

You should be able to see this with a cross compiler and don't need 
binutils/gas, newlib, etc.


Compile the attached testcase with an mcore-elf configured compiler with 
-O3 -favoid-store-forwarding



Related, but obviously not a requirement to go forward.  After the SFB
elimination, the two stores at insns 71, 68 are dead and could be
removed.  In theory DSE should have eliminated them, but doesn't, for
reasons I haven't investigated.


Jeff

# 0 "j.c"
# 0 ""
# 0 ""
# 1 "j.c"
# 0 "/home/jlaw/test/gcc/libgcc/libgcc2.c"
# 1 "/home/jlaw/test/obj/mcore/gcc/mcore-elf/libgcc//"
# 0 ""
# 0 ""
# 1 "/home/jlaw/test/gcc/libgcc/libgcc2.c"
# 26 "/home/jlaw/test/gcc/libgcc/libgcc2.c"
# 1 "../.././gcc/tconfig.h" 1





# 1 "../.././gcc/auto-host.h" 1
# 7 "../.././gcc/tconfig.h" 2

# 1 "/home/jlaw/test/gcc/libgcc/../include/ansidecl.h" 1
# 9 "../.././gcc/tconfig.h" 2
# 27 "/home/jlaw/test/gcc/libgcc/libgcc2.c" 2
# 1 "/home/jlaw/test/gcc/libgcc/../gcc/tsystem.h" 1
# 44 "/home/jlaw/test/gcc/libgcc/../gcc/tsystem.h"
# 1 "/home/jlaw/test/obj/mcore/gcc/gcc/include/stddef.h" 1 3 4
# 145 "/home/jlaw/test/obj/mcore/gcc/gcc/include/stddef.h" 3 4
# 145 "/home/jlaw/test/obj/mcore/gcc/gcc/include/stddef.h" 3 4

# 145 "/home/jlaw/test/obj/mcore/gcc/gcc/include/stddef.h" 3 4
typedef int ptrdiff_t;
# 214 "/home/jlaw/test/obj/mcore/gcc/gcc/include/stddef.h" 3 4
typedef unsigned int size_t;
# 329 "/home/jlaw/test/obj/mcore/gcc/gcc/include/stddef.h" 3 4
typedef long int wchar_t;
# 425 "/home/jlaw/test/obj/mcore/gcc/gcc/include/stddef.h" 3 4
typedef struct {
  long long __max_align_ll __attribute__((__aligned__(__alignof__(long long))));
  long double __max_align_ld __attribute__((__aligned__(__alignof__(long double))));
# 436 "/home/jlaw/test

[PATCH v2 4/8] vect: Add maskload else value support.

2024-10-18 Thread Robin Dapp
This patch adds an else operand to vectorized masked load calls.
The current implementation adds else-value arguments to the respective
target-querying functions that are used to supply the vectorizer with the
proper else value.

Right now, the only spot where a zero else value is actually enforced is
tree-ifcvt.  Loop masking and other instances of masked loads in the
vectorizer itself do not use vec_cond_exprs.
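
For illustration, the call shape after this change looks like (GIMPLE
sketch, with the argument order introduced by this series):

  _5 = .MASK_LOAD (addr_1, 32B, mask_2, els_3);

where the fourth argument supplies the value of the inactive lanes.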

gcc/ChangeLog:

* optabs-query.cc (supports_vec_convert_optab_p): Return icode.
(get_supported_else_val): Return supported else value for
optab's operand at index.
(supports_vec_gather_load_p): Add else argument.
(supports_vec_scatter_store_p): Ditto.
* optabs-query.h (supports_vec_gather_load_p): Ditto.
(get_supported_else_val): Ditto.
* optabs-tree.cc (target_supports_mask_load_store_p): Ditto.
(can_vec_mask_load_store_p): Ditto.
(target_supports_len_load_store_p): Ditto.
(get_len_load_store_mode): Ditto.
* optabs-tree.h (target_supports_mask_load_store_p): Ditto.
(can_vec_mask_load_store_p): Ditto.
* tree-vect-data-refs.cc (vect_lanes_optab_supported_p): Ditto.
(vect_gather_scatter_fn_p): Ditto.
(vect_check_gather_scatter): Ditto.
(vect_load_lanes_supported): Ditto.
* tree-vect-patterns.cc (vect_recog_gather_scatter_pattern):
Ditto.
* tree-vect-slp.cc (vect_get_operand_map): Adjust indices for
else operand.
(vect_slp_analyze_node_operations): Skip undefined else operand.
* tree-vect-stmts.cc (exist_non_indexing_operands_for_use_p):
Add else operand handling.
(vect_get_vec_defs_for_operand): Handle undefined else operand.
(check_load_store_for_partial_vectors): Add else argument.
(vect_truncate_gather_scatter_offset): Ditto.
(vect_use_strided_gather_scatters_p): Ditto.
(get_group_load_store_type): Ditto.
(get_load_store_type): Ditto.
(vect_get_mask_load_else): Ditto.
(vect_get_else_val_from_tree): Ditto.
(vect_build_one_gather_load_call): Add zero else operand.
(vectorizable_load): Use else operand.
* tree-vectorizer.h (vect_gather_scatter_fn_p): Add else
argument.
(vect_load_lanes_supported): Ditto.
(vect_get_mask_load_else): Ditto.
(vect_get_else_val_from_tree): Ditto.
---
 gcc/optabs-query.cc|  59 ++---
 gcc/optabs-query.h |   3 +-
 gcc/optabs-tree.cc |  62 ++---
 gcc/optabs-tree.h  |   8 +-
 gcc/tree-vect-data-refs.cc |  77 +++
 gcc/tree-vect-patterns.cc  |  18 ++-
 gcc/tree-vect-slp.cc   |  22 +++-
 gcc/tree-vect-stmts.cc | 257 +
 gcc/tree-vectorizer.h  |  11 +-
 9 files changed, 394 insertions(+), 123 deletions(-)

diff --git a/gcc/optabs-query.cc b/gcc/optabs-query.cc
index cc52bc0f5ea..347a1322479 100644
--- a/gcc/optabs-query.cc
+++ b/gcc/optabs-query.cc
@@ -29,6 +29,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "rtl.h"
 #include "recog.h"
 #include "vec-perm-indices.h"
+#include "internal-fn.h"
+#include "memmodel.h"
+#include "optabs.h"
 
 struct target_optabs default_target_optabs;
 struct target_optabs *this_fn_optabs = &default_target_optabs;
@@ -672,34 +675,48 @@ lshift_cheap_p (bool speed_p)
that mode, given that the second mode is always an integer vector.
If MODE is VOIDmode, return true if OP supports any vector mode.  */
 
-static bool
+static enum insn_code
 supports_vec_convert_optab_p (optab op, machine_mode mode)
 {
   int start = mode == VOIDmode ? 0 : mode;
   int end = mode == VOIDmode ? MAX_MACHINE_MODE - 1 : mode;
+  enum insn_code icode = CODE_FOR_nothing;
   for (int i = start; i <= end; ++i)
 if (VECTOR_MODE_P ((machine_mode) i))
   for (int j = MIN_MODE_VECTOR_INT; j < MAX_MODE_VECTOR_INT; ++j)
-   if (convert_optab_handler (op, (machine_mode) i,
-  (machine_mode) j) != CODE_FOR_nothing)
- return true;
+   {
+ if ((icode
+  = convert_optab_handler (op, (machine_mode) i,
+   (machine_mode) j)) != CODE_FOR_nothing)
+   return icode;
+   }
 
-  return false;
+  return icode;
 }
 
 /* If MODE is not VOIDmode, return true if vec_gather_load is available for
that mode.  If MODE is VOIDmode, return true if gather_load is available
-   for at least one vector mode.  */
+   for at least one vector mode.
+   In that case, and if ELSVALS is nonzero, store the supported else values
+   into the vector it points to.  */
 
 bool
-supports_vec_gather_load_p (machine_mode mode)
+supports_vec_gather_load_p (machine_mode mode, auto_vec<int> *elsvals)
 {
-  if (!this_fn_optabs->supports_vec_gather_load[mode])
-this_fn_optabs->supports_vec_gather_load[mode]
-  = (supports_vec_convert_optab_p (gather_load_optab, mode)
-

[PATCH v2 7/8] i386: Add else operand to masked loads.

2024-10-18 Thread Robin Dapp
This patch adds a zero else operand to masked loads, in particular the
masked gather load builtins that are used for gather vectorization.

gcc/ChangeLog:

* config/i386/i386-expand.cc (ix86_expand_special_args_builtin):
Add else-operand handling.
(ix86_expand_builtin): Ditto.
* config/i386/predicates.md (vcvtne2ps2bf_parallel): New
predicate.
(maskload_else_operand): Ditto.
* config/i386/sse.md: Use predicate.
---
 gcc/config/i386/i386-expand.cc |  26 +--
 gcc/config/i386/predicates.md  |   4 ++
 gcc/config/i386/sse.md | 124 -
 3 files changed, 101 insertions(+), 53 deletions(-)

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 63f5e348d64..f6a2c2d65b8 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -12994,10 +12994,11 @@ ix86_expand_special_args_builtin (const struct builtin_description *d,
 {
   tree arg;
   rtx pat, op;
-  unsigned int i, nargs, arg_adjust, memory;
+  unsigned int i, nargs, arg_adjust, memory = -1;
   unsigned int constant = 100;
   bool aligned_mem = false;
-  rtx xops[4];
+  rtx xops[4] = {};
+  bool add_els = false;
   enum insn_code icode = d->icode;
   const struct insn_data_d *insn_p = &insn_data[icode];
   machine_mode tmode = insn_p->operand[0].mode;
@@ -13124,6 +13125,9 @@ ix86_expand_special_args_builtin (const struct builtin_description *d,
 case V4DI_FTYPE_PCV4DI_V4DI:
 case V4SI_FTYPE_PCV4SI_V4SI:
 case V2DI_FTYPE_PCV2DI_V2DI:
+  /* Two actual args but an additional else operand.  */
+  add_els = true;
+  /* Fallthru.  */
 case VOID_FTYPE_INT_INT64:
   nargs = 2;
   klass = load;
@@ -13396,6 +13400,12 @@ ix86_expand_special_args_builtin (const struct builtin_description *d,
   xops[i]= op;
 }
 
+  if (add_els)
+{
+  xops[i] = CONST0_RTX (GET_MODE (xops[0]));
+  nargs++;
+}
+
   switch (nargs)
 {
 case 0:
@@ -13652,7 +13662,7 @@ ix86_expand_builtin (tree exp, rtx target, rtx subtarget,
   enum insn_code icode, icode2;
   tree fndecl = TREE_OPERAND (CALL_EXPR_FN (exp), 0);
   tree arg0, arg1, arg2, arg3, arg4;
-  rtx op0, op1, op2, op3, op4, pat, pat2, insn;
+  rtx op0, op1, op2, op3, op4, opels, pat, pat2, insn;
   machine_mode mode0, mode1, mode2, mode3, mode4;
   unsigned int fcode = DECL_MD_FUNCTION_CODE (fndecl);
   HOST_WIDE_INT bisa, bisa2;
@@ -15559,12 +15569,15 @@ rdseed_step:
  op3 = copy_to_reg (op3);
  op3 = lowpart_subreg (mode3, op3, GET_MODE (op3));
}
+
   if (!insn_data[icode].operand[5].predicate (op4, mode4))
{
-  error ("the last argument must be scale 1, 2, 4, 8");
-  return const0_rtx;
+ error ("the last argument must be scale 1, 2, 4, 8");
+ return const0_rtx;
}
 
+  opels = CONST0_RTX (GET_MODE (subtarget));
+
   /* Optimize.  If mask is known to have all high bits set,
 replace op0 with pc_rtx to signal that the instruction
 overwrites the whole destination and doesn't use its
@@ -15633,7 +15646,8 @@ rdseed_step:
}
}
 
-  pat = GEN_FCN (icode) (subtarget, op0, op1, op2, op3, op4);
+  pat = GEN_FCN (icode) (subtarget, op0, op1, op2, op3, op4, opels);
+
   if (! pat)
return const0_rtx;
   emit_insn (pat);
diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
index 053312bbe27..7c7d8f61f11 100644
--- a/gcc/config/i386/predicates.md
+++ b/gcc/config/i386/predicates.md
@@ -2346,3 +2346,7 @@ (define_predicate "apx_evex_add_memory_operand"
 
   return true;
 })
+
+(define_predicate "maskload_else_operand"
+  (and (match_code "const_int,const_vector")
+   (match_test "op == CONST0_RTX (GET_MODE (op))")))
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index a45b50ad732..83955eee5a0 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -1575,7 +1575,8 @@ (define_expand "<avx512>_load<mode>_mask"
 }
   else if (MEM_P (operands[1]))
 operands[1] = gen_rtx_UNSPEC (<MODE>mode,
-gen_rtvec(1, operands[1]),
+gen_rtvec(2, operands[1],
+  CONST0_RTX (<MODE>mode)),
 UNSPEC_MASKLOAD);
 })
 
@@ -1583,7 +1584,8 @@ (define_insn "*<avx512>_load<mode>_mask"
   [(set (match_operand:V48_AVX512VL 0 "register_operand" "=v")
(vec_merge:V48_AVX512VL
  (unspec:V48_AVX512VL
-   [(match_operand:V48_AVX512VL 1 "memory_operand" "m")]
+   [(match_operand:V48_AVX512VL 1 "memory_operand" "m")
+(match_operand:V48_AVX512VL 4 "maskload_else_operand")]
UNSPEC_MASKLOAD)
  (match_operand:V48_AVX512VL 2 "nonimm_or_0_operand" "0C")
  (match_operand:<avx512fmaskmode> 3 "register_operand" "Yk")))]
@@ -1611,7 +1613,8 @@ (define_insn "*_load_mask"
 (define_insn_and_split "*<avx512>_load<mode>"
   [(set (match_operand:V48_AVX512VL

[PATCH v2 1/8] docs: Document maskload else operand and behavior.

2024-10-18 Thread Robin Dapp
This patch amends the documentation for masked loads (maskload,
vec_mask_load_lanes, and mask_gather_load as well as their len
counterparts) with an else operand.

gcc/ChangeLog:

* doc/md.texi: Document masked load else operand.
---
 gcc/doc/md.texi | 63 -
 1 file changed, 41 insertions(+), 22 deletions(-)

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 603f74a78c0..632b036b36c 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5017,8 +5017,10 @@ This pattern is not allowed to @code{FAIL}.
 @item @samp{vec_mask_load_lanes@var{m}@var{n}}
 Like @samp{vec_load_lanes@var{m}@var{n}}, but takes an additional
 mask operand (operand 2) that specifies which elements of the destination
-vectors should be loaded.  Other elements of the destination
-vectors are set to zero.  The operation is equivalent to:
+vectors should be loaded.  Other elements of the destination vectors are
+taken from operand 3, which is an else operand similar to the one in
+@code{maskload}.
+The operation is equivalent to:
 
 @smallexample
 int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});
@@ -5028,7 +5030,7 @@ for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)
   operand0[i][j] = operand1[j * c + i];
   else
 for (i = 0; i < c; i++)
-  operand0[i][j] = 0;
+  operand0[i][j] = operand3[j];
 @end smallexample
 
 This pattern is not allowed to @code{FAIL}.
@@ -5036,16 +5038,20 @@ This pattern is not allowed to @code{FAIL}.
 @cindex @code{vec_mask_len_load_lanes@var{m}@var{n}} instruction pattern
 @item @samp{vec_mask_len_load_lanes@var{m}@var{n}}
 Like @samp{vec_load_lanes@var{m}@var{n}}, but takes an additional
-mask operand (operand 2), length operand (operand 3) as well as bias operand (operand 4)
-that specifies which elements of the destination vectors should be loaded.
-Other elements of the destination vectors are undefined.  The operation is equivalent to:
+mask operand (operand 2), length operand (operand 4) as well as bias operand
+(operand 5) that specifies which elements of the destination vectors should be
+loaded.  Other elements of the destination vectors are taken from operand 3,
+which is an else operand similar to the one in @code{maskload}.
+The operation is equivalent to:
 
 @smallexample
 int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});
-for (j = 0; j < operand3 + operand4; j++)
-  if (operand2[j])
-for (i = 0; i < c; i++)
+for (j = 0; j < operand4 + operand5; j++)
+  for (i = 0; i < c; i++)
+if (operand2[j])
   operand0[i][j] = operand1[j * c + i];
+else
+  operand0[i][j] = operand3[j];
 @end smallexample
 
 This pattern is not allowed to @code{FAIL}.
@@ -5125,18 +5131,25 @@ address width.
 @cindex @code{mask_gather_load@var{m}@var{n}} instruction pattern
 @item @samp{mask_gather_load@var{m}@var{n}}
 Like @samp{gather_load@var{m}@var{n}}, but takes an extra mask operand as
-operand 5.  Bit @var{i} of the mask is set if element @var{i}
+operand 5.
+Other elements of the destination vectors are taken from operand 6,
+which is an else operand similar to the one in @code{maskload}.
+Bit @var{i} of the mask is set if element @var{i}
 of the result should be loaded from memory and clear if element @var{i}
-of the result should be set to zero.
+of the result should be set to operand 6.
 
 @cindex @code{mask_len_gather_load@var{m}@var{n}} instruction pattern
 @item @samp{mask_len_gather_load@var{m}@var{n}}
-Like @samp{gather_load@var{m}@var{n}}, but takes an extra mask operand (operand 5),
-a len operand (operand 6) as well as a bias operand (operand 7).  Similar to mask_len_load,
-the instruction loads at most (operand 6 + operand 7) elements from memory.
+Like @samp{gather_load@var{m}@var{n}}, but takes an extra mask operand
+(operand 5) and an else operand (operand 6) as well as a len operand
+(operand 7) and a bias operand (operand 8).
+
+Similar to mask_len_load the instruction loads at
+most (operand 7 + operand 8) elements from memory.
 Bit @var{i} of the mask is set if element @var{i} of the result should
-be loaded from memory and clear if element @var{i} of the result should be undefined.
-Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
+be loaded from memory and clear if element @var{i} of the result should
+be set to element @var{i} of operand 6.
+Mask elements @var{i} with @var{i} > (operand 7 + operand 8) are ignored.
 
 @cindex @code{scatter_store@var{m}@var{n}} instruction pattern
 @item @samp{scatter_store@var{m}@var{n}}
@@ -5368,8 +5381,13 @@ Operands 4 and 5 have a target-dependent scalar integer mode.
 @cindex @code{maskload@var{m}@var{n}} instruction pattern
 @item @samp{maskload@var{m}@var{n}}
 Perform a masked load of vector from memory operand 1 of mode @var{m}
-into register operand 0.  Mask is provided in register operand 2 of
-mode @var{n}.
+into register operand 0.  The mask is provided in register operand 2 of
+mode @var{n}.  Operand 3 (the "else value") i

[PATCH v2 3/8] tree-ifcvt: Enforce zero else value after maskload.

2024-10-18 Thread Robin Dapp
When predicating a load we implicitly assume that the else value is
zero.  This matters in case the loaded value is padded (like e.g.
a Bool) and we must ensure that the padding bytes are zero on targets
that don't implicitly zero inactive elements.

In order to formalize this, this patch queries the target for
its supported else operand and uses that for the maskload call.
Subsequently, if the else operand is nonzero, a cond_expr enforcing
a zero else value is emitted.

gcc/ChangeLog:

* tree-if-conv.cc (predicate_load_or_store): Enforce zero else
value for padded types.
(predicate_statements): Use sequence instead of statement.
---
 gcc/tree-if-conv.cc | 112 +---
 1 file changed, 94 insertions(+), 18 deletions(-)

diff --git a/gcc/tree-if-conv.cc b/gcc/tree-if-conv.cc
index 90c754a4814..9623426e1e1 100644
--- a/gcc/tree-if-conv.cc
+++ b/gcc/tree-if-conv.cc
@@ -2531,12 +2531,15 @@ mask_exists (int size, const vec &vec)
 
 /* Helper function for predicate_statements.  STMT is a memory read or
write and it needs to be predicated by MASK.  Return a statement
-   that does so.  */
+   that does so.  SSA_NAMES is the set of SSA names defined earlier in
+   STMT's block. */
 
-static gimple *
-predicate_load_or_store (gimple_stmt_iterator *gsi, gassign *stmt, tree mask)
+static gimple_seq
+predicate_load_or_store (gimple_stmt_iterator *gsi, gassign *stmt, tree mask,
+hash_set *ssa_names)
 {
-  gcall *new_stmt;
+  gimple_seq stmts = NULL;
+  gcall *call_stmt;
 
   tree lhs = gimple_assign_lhs (stmt);
   tree rhs = gimple_assign_rhs1 (stmt);
@@ -2552,21 +2555,88 @@ predicate_load_or_store (gimple_stmt_iterator *gsi, gassign *stmt, tree mask)
   ref);
   if (TREE_CODE (lhs) == SSA_NAME)
 {
-  new_stmt
-   = gimple_build_call_internal (IFN_MASK_LOAD, 3, addr,
- ptr, mask);
-  gimple_call_set_lhs (new_stmt, lhs);
-  gimple_set_vuse (new_stmt, gimple_vuse (stmt));
+  /* Get the preferred vector mode and its corresponding mask for the
+masked load.  We need this to query the target's supported else
+operands.  */
+  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
+  scalar_mode smode = as_a <scalar_mode> (mode);
+
+  machine_mode vmode = targetm.vectorize.preferred_simd_mode (smode);
+  machine_mode mask_mode
+   = targetm.vectorize.get_mask_mode (vmode).require ();
+
+  auto_vec<int> elsvals;
+  internal_fn ifn;
+  bool have_masked_load
+   = target_supports_mask_load_store_p (vmode, mask_mode, true, &ifn,
+&elsvals);
+
+  /* We might need to explicitly zero inactive elements if there are
+padding bits in the type that might leak otherwise.
+Refer to PR115336.  */
+  bool need_zero
+   = TYPE_PRECISION (TREE_TYPE (lhs)) < GET_MODE_PRECISION (smode);
+
+  int elsval;
+  bool implicit_zero = false;
+  if (have_masked_load)
+   {
+ gcc_assert (elsvals.length ());
+
+ /* But not if the target already provide implicit zeroing of inactive
+elements.  */
+ implicit_zero = elsvals.contains (MASK_LOAD_ELSE_ZERO);
+
+ /* For now, just use the first else value if zero is unsupported.  */
+ elsval = implicit_zero ? MASK_LOAD_ELSE_ZERO : *elsvals.begin ();
+   }
+  else
+   {
+ /* We cannot vectorize this either way so just use a zero even
+if it is unsupported.  */
+ elsval = MASK_LOAD_ELSE_ZERO;
+   }
+
+  tree els = vect_get_mask_load_else (elsval, TREE_TYPE (lhs));
+
+  call_stmt
+   = gimple_build_call_internal (IFN_MASK_LOAD, 4, addr,
+ ptr, mask, els);
+
+  /* Build the load call and, if the else value is nonzero,
+a COND_EXPR that enforces it.  */
+  tree loadlhs;
+  if (!need_zero || implicit_zero)
+   gimple_call_set_lhs (call_stmt, gimple_get_lhs (stmt));
+  else
+   {
+ loadlhs = make_temp_ssa_name (TREE_TYPE (lhs), NULL, "_ifc_");
+ ssa_names->add (loadlhs);
+ gimple_call_set_lhs (call_stmt, loadlhs);
+   }
+  gimple_set_vuse (call_stmt, gimple_vuse (stmt));
+  gimple_seq_add_stmt (&stmts, call_stmt);
+
+  if (need_zero && !implicit_zero)
+   {
+ tree cond_rhs
+   = fold_build_cond_expr (TREE_TYPE (loadlhs), mask, loadlhs,
+   build_zero_cst (TREE_TYPE (loadlhs)));
+ gassign *cond_stmt
+   = gimple_build_assign (gimple_get_lhs (stmt), cond_rhs);
+ gimple_seq_add_stmt (&stmts, cond_stmt);
+   }
 }
   else
 {
-  new_stmt
+  call_stmt
= gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr,
  mask, rhs);
-  gimple_move_vops (new_stmt, stmt);
+  gimple_move_vops (call_stmt, s

Re: [PATCH 1/7] libstdc++: Refactor std::uninitialized_{copy, fill, fill_n} algos [PR68350]

2024-10-18 Thread Patrick Palka
On Fri, 18 Oct 2024, Jonathan Wakely wrote:

> On 16/10/24 21:39 -0400, Patrick Palka wrote:
> > On Tue, 15 Oct 2024, Jonathan Wakely wrote:
> > > +#if __cplusplus < 201103L
> > > +
> > > +  // True if we can unwrap _Iter to get a pointer by using std::__niter_base.
> > > +  template<typename _Iter>
> > > +struct __unwrappable_niter
> > > +{
> > > +  template<typename _Tp> struct __is_ptr { enum { __value = 0 }; };
> > > +  template<typename _Tp> struct __is_ptr<_Tp*> { enum { __value = 1 }; };
> > > +
> > > +  typedef __decltype(std::__niter_base(*(_Iter*)0)) _Base;
> > > +
> > > +  enum { __value = __is_ptr<_Base>::__value };
> > > +};
> > 
> > It might be slightly cheaper to define this without the nested class
> > template as:
> > 
> >  template<typename _Iter,
> >           typename _Base = __decltype(std::__niter_base(*(_Iter*)0))>
> >  struct __unwrappable_niter
> >  { enum { __value = false }; };
> > 
> >  template<typename _Iter, typename _Tp>
> >  struct __unwrappable_niter<_Iter, _Tp*>
> >  { enum { __value = true }; };

One minor nit, we might as well use 'value' since it's a reserved name
even in C++98?

> 
> Nice. I think after spending a while failing to make any C++98
> metaprogramming work for __memcpyable in cpp_type_traits.h I was not
> in the mood for fighting C++98 any more :-) But this works well.

> 
> > > +
> > > +  // Use template specialization for C++98 when 'if constexpr' can't be used.
> > > +  template<bool _TrivialValueTypes>
> > >  struct __uninitialized_copy
> > >  {
> > >template<typename _InputIterator, typename _ForwardIterator>
> > > @@ -186,53 +172,150 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > >template<>
> > >  struct __uninitialized_copy
> > >  {
> > > +  // Overload for generic iterators.
> > >template<typename _InputIterator, typename _ForwardIterator>
> > >  static _ForwardIterator
> > >  __uninit_copy(_InputIterator __first, _InputIterator __last,
> > > _ForwardIterator __result)
> > > -{ return std::copy(__first, __last, __result); }
> > > -};
> > > + {
> > > +   if (__unwrappable_niter<_InputIterator>::__value
> > > + && __unwrappable_niter<_ForwardIterator>::__value)
> > > + {
> > > +   __uninit_copy(std::__niter_base(__first),
> > > + std::__niter_base(__last),
> > > + std::__niter_base(__result));
> > > +   std::advance(__result, std::distance(__first, __last));
> > > +   return __result;
> > > + }
> > > +   else
> > > + return std::__do_uninit_copy(__first, __last, __result);
> > > + }
> > > 
> > > +  // Overload for pointers.
> > > +  template
> > > + static _Up*
> > > + __uninit_copy(_Tp* __first, _Tp* __last, _Up* __result)
> > > + {
> > > +   // Ensure that we don't successfully memcpy in cases that should be
> > > +   // ill-formed because is_constructible<_Up, _Tp&> is false.
> > > +   typedef __typeof__(static_cast<_Up>(*__first)) __check
> > > + __attribute__((__unused__));
> > > +
> > > +   if (const ptrdiff_t __n = __last - __first)
> > 
> > Do we have to worry about the __n == 1 case here like in the C++11 code
> > path?
> 
> Actually I think we don't need to worry about it in either case.
> 
> C++20 had a note in [specialized.algorithms.general]/3 that said:
> 
>   [Note 1: When invoked on ranges of potentially-overlapping subobjects
>   ([intro.object]), the algorithms specified in [specialized.algorithms]
>   result in undefined behavior. — end note]
> 
> The reason is that the uninitialized algos create new objects at the
> specified storage locations, and creating new objects reuses storage,
> which ends the lifetime of any other objects in that storage. That
> includes any objects that were in tail padding within that storage.
> 
> See Casey's Feb 2023 comment at
> https://github.com/cplusplus/draft/issues/6143
> 
> That note was removed for C++23 (which is unfortunate IMHO), but the
> algos still reuse storage by creating new objects, and so still end
> the lifetime of potentially-overlapping subobjects within that
> storage.
> 
> For std::copy there are no new objects created, and the effects are
> specified in terms of assignment, which does not reuse storage. A
> compiler-generated trivial copy assignment operator is careful to not
> overwrite tail padding, so we can't use memmove if it would produce
> different effects.
> 
> tl;dr I think I can remove the __n == 1 handling from the C++11 paths.

Nice, makes sense
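
For concreteness, the tail-padding hazard being referenced (editorial
example, layout assumed for a typical ABI):

  struct B { int i; char c; };   // sizeof (B) == 8, three bytes of padding
  struct D : B { char d; };      // d may be placed in B's tail padding

  // Copy-assigning the B subobject of a D must not touch d, so a trivial
  // operator= avoids the padding bytes and std::copy cannot blindly use
  // memmove.  The uninitialized algorithms instead create new objects,
  // reusing the whole storage, so the memcpy there stays valid.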

> 
> > > + {
> > > +   __builtin_memcpy(__result, __first, __n * sizeof(_Tp));
> > > +   __result += __n;
> > > + }
> > > +   return __result;
> > > + }
> > > +};
> > > +#endif
> > >/// @endcond
> > > 
> > > +#pragma GCC diagnostic push
> > > +#pragma GCC diagnostic ignored "-Wc++17-extensions"
> > >/**
> > > *  @brief Copies the range [first,last) into result.
> > > *  @param  __first  An input iterator.
> > > *  @param  __last   An input iterator.
> > > -   *  @param  __result An output iterator.
> > > -   *  @return   __result + (__first - __last)
> > > +   *  @param  __result A forward iterator.
> > > +   *  @return  

[PATCH v2 5/8] aarch64: Add masked-load else operands.

2024-10-18 Thread Robin Dapp
This adds zero else operands to masked loads and their intrinsics.
I needed to adjust more than initially thought because we rely on
combine for several instructions and a change in a "base" pattern
needs to propagate to all those.

For lack of a better idea, I used a function call property to specify
whether a builtin needs an else operand or not.  Somebody with better
knowledge of the aarch64 target can surely improve that.

gcc/ChangeLog:

* config/aarch64/aarch64-sve-builtins-base.cc: Add else
handling.
* config/aarch64/aarch64-sve-builtins.cc
(function_expander::use_contiguous_load_insn): Ditto.
* config/aarch64/aarch64-sve-builtins.h: Add "has else".
* config/aarch64/aarch64-sve.md (*aarch64_load
__mov):
Add else operands.
* config/aarch64/aarch64-sve2.md: Ditto.
* config/aarch64/predicates.md (aarch64_maskload_else_operand):
Add zero else operand.
---
 .../aarch64/aarch64-sve-builtins-base.cc  | 58 ++-
 gcc/config/aarch64/aarch64-sve-builtins.cc|  5 ++
 gcc/config/aarch64/aarch64-sve-builtins.h |  1 +
 gcc/config/aarch64/aarch64-sve.md | 47 +--
 gcc/config/aarch64/aarch64-sve2.md|  3 +-
 gcc/config/aarch64/predicates.md  |  4 ++
 6 files changed, 98 insertions(+), 20 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 1c17149e1f0..08d2fb796dd 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -1476,7 +1476,7 @@ public:
   unsigned int
   call_properties (const function_instance &) const override
   {
-return CP_READ_MEMORY;
+return CP_READ_MEMORY | CP_HAS_ELSE;
   }
 
   gimple *
@@ -1491,11 +1491,12 @@ public:
 gimple_seq stmts = NULL;
 tree pred = f.convert_pred (stmts, vectype, 0);
 tree base = f.fold_contiguous_base (stmts, vectype);
+tree els = build_zero_cst (vectype);
 gsi_insert_seq_before (f.gsi, stmts, GSI_SAME_STMT);
 
 tree cookie = f.load_store_cookie (TREE_TYPE (vectype));
-gcall *new_call = gimple_build_call_internal (IFN_MASK_LOAD, 3,
- base, cookie, pred);
+gcall *new_call = gimple_build_call_internal (IFN_MASK_LOAD, 4,
+ base, cookie, pred, els);
 gimple_call_set_lhs (new_call, f.lhs);
 return new_call;
   }
@@ -1505,10 +1506,16 @@ public:
   {
 insn_code icode;
 if (e.vectors_per_tuple () == 1)
-  icode = convert_optab_handler (maskload_optab,
-e.vector_mode (0), e.gp_mode (0));
+  {
+   icode = convert_optab_handler (maskload_optab,
+  e.vector_mode (0), e.gp_mode (0));
+   e.args.quick_push (CONST0_RTX (e.vector_mode (0)));
+  }
 else
-  icode = code_for_aarch64 (UNSPEC_LD1_COUNT, e.tuple_mode (0));
+  {
+   icode = code_for_aarch64 (UNSPEC_LD1_COUNT, e.tuple_mode (0));
+   e.args.quick_push (CONST0_RTX (e.tuple_mode (0)));
+  }
 return e.use_contiguous_load_insn (icode);
   }
 };
@@ -1519,12 +1526,20 @@ class svld1_extend_impl : public extending_load
 public:
   using extending_load::extending_load;
 
+  unsigned int
+  call_properties (const function_instance &) const override
+  {
+return CP_READ_MEMORY | CP_HAS_ELSE;
+  }
+
   rtx
   expand (function_expander &e) const override
   {
 insn_code icode = code_for_aarch64_load (UNSPEC_LD1_SVE, extend_rtx_code (),
 e.vector_mode (0),
 e.memory_vector_mode ());
+/* Add the else operand.  */
+e.args.quick_push (CONST0_RTX (e.vector_mode (1)));
 return e.use_contiguous_load_insn (icode);
   }
 };
@@ -1535,7 +1550,7 @@ public:
   unsigned int
   call_properties (const function_instance &) const override
   {
-return CP_READ_MEMORY;
+return CP_READ_MEMORY | CP_HAS_ELSE;
   }
 
   rtx
@@ -1544,6 +1559,8 @@ public:
 e.prepare_gather_address_operands (1);
 /* Put the predicate last, as required by mask_gather_load_optab.  */
 e.rotate_inputs_left (0, 5);
+/* Add the else operand.  */
+e.args.quick_push (CONST0_RTX (e.vector_mode (0)));
 machine_mode mem_mode = e.memory_vector_mode ();
 machine_mode int_mode = aarch64_sve_int_mode (mem_mode);
 insn_code icode = convert_optab_handler (mask_gather_load_optab,
@@ -1567,6 +1584,8 @@ public:
 e.rotate_inputs_left (0, 5);
 /* Add a constant predicate for the extension rtx.  */
 e.args.quick_push (CONSTM1_RTX (VNx16BImode));
+/* Add the else operand.  */
+e.args.quick_push (CONST0_RTX (e.vector_mode (1)));
 insn_code icode = code_for_aarch64_gather_load (extend_rtx_code (),
e.vector_mode (0),
  

[PATCH v2 8/8] RISC-V: Add else operand to masked loads [PR115336].

2024-10-18 Thread Robin Dapp
This patch adds else operands to masked loads.  Currently the default
else operand predicate accepts "undefined" (i.e. SCRATCH) as well as
all-ones values.

Note that this series introduces a large number of new RVV FAILs for
riscv.  All of them are due to us not being able to elide redundant
vec_cond_exprs.
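
A scalar sketch of the redundancy in question (illustration only; the
names here are invented): once the load's else value is zero, a
following select against the same mask and zero is a no-op, but it is
currently not elided.

```cpp
#include <cstddef>

void
redundant_select (const int *src, const bool *mask, int *dst,
                  std::size_t n)
{
  for (std::size_t i = 0; i < n; i++)
    {
      int v = mask[i] ? src[i] : 0;  // masked load with zero else value
      dst[i] = mask[i] ? v : 0;      // redundant: inactive lanes already zero
    }
}
```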

PR middle-end/115336
PR middle-end/116059

gcc/ChangeLog:

* config/riscv/autovec.md: Add else operand.
* config/riscv/predicates.md (maskload_else_operand): New
predicate.
* config/riscv/riscv-v.cc (get_else_operand): Remove static.
(expand_load_store): Use get_else_operand and adjust index.
(expand_gather_scatter): Ditto.
(expand_lanes_load_store): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr115336.c: New test.
* gcc.target/riscv/rvv/autovec/pr116059.c: New test.
---
 gcc/config/riscv/autovec.md   | 45 +++
 gcc/config/riscv/predicates.md|  3 ++
 gcc/config/riscv/riscv-v.cc   | 26 +++
 .../gcc.target/riscv/rvv/autovec/pr115336.c   | 20 +
 .../gcc.target/riscv/rvv/autovec/pr116059.c   | 13 ++
 5 files changed, 80 insertions(+), 27 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr115336.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr116059.c

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index 7dc78a48874..a09f94021ca 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -26,8 +26,9 @@ (define_expand "mask_len_load"
   [(match_operand:V 0 "register_operand")
(match_operand:V 1 "memory_operand")
(match_operand: 2 "vector_mask_operand")
-   (match_operand 3 "autovec_length_operand")
-   (match_operand 4 "const_0_operand")]
+   (match_operand:V 3 "maskload_else_operand")
+   (match_operand 4 "autovec_length_operand")
+   (match_operand 5 "const_0_operand")]
   "TARGET_VECTOR"
 {
   riscv_vector::expand_load_store (operands, true);
@@ -57,8 +58,9 @@ (define_expand "mask_len_gather_load"
(match_operand 3 "")
(match_operand 4 "")
(match_operand: 5 "vector_mask_operand")
-   (match_operand 6 "autovec_length_operand")
-   (match_operand 7 "const_0_operand")]
+   (match_operand 6 "maskload_else_operand")
+   (match_operand 7 "autovec_length_operand")
+   (match_operand 8 "const_0_operand")]
   "TARGET_VECTOR && riscv_vector::gather_scatter_valid_offset_p 
(mode)"
 {
   riscv_vector::expand_gather_scatter (operands, true);
@@ -72,8 +74,9 @@ (define_expand "mask_len_gather_load"
(match_operand 3 "")
(match_operand 4 "")
(match_operand: 5 "vector_mask_operand")
-   (match_operand 6 "autovec_length_operand")
-   (match_operand 7 "const_0_operand")]
+   (match_operand 6 "maskload_else_operand")
+   (match_operand 7 "autovec_length_operand")
+   (match_operand 8 "const_0_operand")]
   "TARGET_VECTOR && riscv_vector::gather_scatter_valid_offset_p 
(mode)"
 {
   riscv_vector::expand_gather_scatter (operands, true);
@@ -87,8 +90,9 @@ (define_expand "mask_len_gather_load"
(match_operand 3 "")
(match_operand 4 "")
(match_operand: 5 "vector_mask_operand")
-   (match_operand 6 "autovec_length_operand")
-   (match_operand 7 "const_0_operand")]
+   (match_operand 6 "maskload_else_operand")
+   (match_operand 7 "autovec_length_operand")
+   (match_operand 8 "const_0_operand")]
   "TARGET_VECTOR && riscv_vector::gather_scatter_valid_offset_p 
(mode)"
 {
   riscv_vector::expand_gather_scatter (operands, true);
@@ -102,8 +106,9 @@ (define_expand "mask_len_gather_load"
(match_operand 3 "")
(match_operand 4 "")
(match_operand: 5 "vector_mask_operand")
-   (match_operand 6 "autovec_length_operand")
-   (match_operand 7 "const_0_operand")]
+   (match_operand 6 "maskload_else_operand")
+   (match_operand 7 "autovec_length_operand")
+   (match_operand 8 "const_0_operand")]
   "TARGET_VECTOR && riscv_vector::gather_scatter_valid_offset_p 
(mode)"
 {
   riscv_vector::expand_gather_scatter (operands, true);
@@ -117,8 +122,9 @@ (define_expand "mask_len_gather_load"
(match_operand 3 "")
(match_operand 4 "")
(match_operand: 5 "vector_mask_operand")
-   (match_operand 6 "autovec_length_operand")
-   (match_operand 7 "const_0_operand")]
+   (match_operand 6 "maskload_else_operand")
+   (match_operand 7 "autovec_length_operand")
+   (match_operand 8 "const_0_operand")]
   "TARGET_VECTOR && riscv_vector::gather_scatter_valid_offset_p 
(mode)"
 {
   riscv_vector::expand_gather_scatter (operands, true);
@@ -132,8 +138,9 @@ (define_expand "mask_len_gather_load"
(match_operand 3 "")
(match_operand 4 "")
(match_operand: 5 "vector_mask_operand")
-   (match_operand 6 "autovec_length_operand")
-   (match_operand 7 "const_0_operand")]
+   (match_operand 6 "maskload_else_operand")
+   (match_operand 7 "autovec_length_operand")
+   (match_operand 8 "const_0_operand")]
   "TARGET_VECTO

[PATCH v2 0/8] Add maskload else operand.

2024-10-18 Thread Robin Dapp
Hi,

finally, after many distractions, v2 of this series.

Main changes from v1:
 - Restrict to types/modes with padding thanks to Richi's suggestion.
 - Return an array of supported else values and let the vectorizer choose.
 - Undefined else value for GCN.

Bootstrapped and regtested on Power10, x86 and aarch64.
Regtested on rv64gcv.

Testing on GCN would be much appreciated.

Robin Dapp (8):
  docs: Document maskload else operand and behavior.
  ifn: Add else-operand handling.
  tree-ifcvt: Enforce zero else value after maskload.
  vect: Add maskload else value support.
  aarch64: Add masked-load else operands.
  gcn: Add else operand to masked loads.
  i386: Add else operand to masked loads.
  RISC-V: Add else operand to masked loads [PR115336].

 .../aarch64/aarch64-sve-builtins-base.cc  |  58 +++-
 gcc/config/aarch64/aarch64-sve-builtins.cc|   5 +
 gcc/config/aarch64/aarch64-sve-builtins.h |   1 +
 gcc/config/aarch64/aarch64-sve.md |  47 +++-
 gcc/config/aarch64/aarch64-sve2.md|   3 +-
 gcc/config/aarch64/predicates.md  |   4 +
 gcc/config/gcn/gcn-valu.md|  12 +-
 gcc/config/gcn/predicates.md  |   2 +
 gcc/config/i386/i386-expand.cc|  26 +-
 gcc/config/i386/predicates.md |   4 +
 gcc/config/i386/sse.md| 124 +
 gcc/config/riscv/autovec.md   |  45 +--
 gcc/config/riscv/predicates.md|   3 +
 gcc/config/riscv/riscv-v.cc   |  26 +-
 gcc/doc/md.texi   |  63 +++--
 gcc/internal-fn.cc| 131 +++--
 gcc/internal-fn.h |  15 +-
 gcc/optabs-query.cc   |  59 ++--
 gcc/optabs-query.h|   3 +-
 gcc/optabs-tree.cc|  62 +++--
 gcc/optabs-tree.h |   8 +-
 .../gcc.target/riscv/rvv/autovec/pr115336.c   |  20 ++
 .../gcc.target/riscv/rvv/autovec/pr116059.c   |  13 +
 gcc/tree-if-conv.cc   | 112 ++--
 gcc/tree-vect-data-refs.cc|  77 --
 gcc/tree-vect-patterns.cc |  18 +-
 gcc/tree-vect-slp.cc  |  22 +-
 gcc/tree-vect-stmts.cc| 257 ++
 gcc/tree-vectorizer.h |  11 +-
 29 files changed, 943 insertions(+), 288 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr115336.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr116059.c

-- 
2.46.2



[PATCH v2 2/8] ifn: Add else-operand handling.

2024-10-18 Thread Robin Dapp
This patch adds else-operand handling to the internal functions.
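
As a rough sketch of how a consumer might use the new else-value query
(the constant names come from the ChangeLog below; the enum values and
the vector-based interface are assumptions made for this illustration
only):

```cpp
#include <vector>

// Assumed representation; the patch defines these as internal-fn
// constants rather than this enum.
enum mask_load_else
{
  MASK_LOAD_ELSE_ZERO,
  MASK_LOAD_ELSE_M1,
  MASK_LOAD_ELSE_UNDEFINED
};

// Hypothetical chooser: prefer a zero else value if the target supports
// one, otherwise leave the inactive lanes undefined.
static mask_load_else
choose_else (const std::vector<mask_load_else> &supported)
{
  for (mask_load_else v : supported)
    if (v == MASK_LOAD_ELSE_ZERO)
      return v;
  return MASK_LOAD_ELSE_UNDEFINED;
}
```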

gcc/ChangeLog:

* internal-fn.cc (add_mask_and_len_args): Rename...
(add_mask_else_and_len_args): ...to this and add else handling.
(expand_partial_load_optab_fn): Use adjusted function.
(expand_partial_store_optab_fn): Ditto.
(expand_scatter_store_optab_fn): Ditto.
(expand_gather_load_optab_fn): Ditto.
(internal_fn_len_index): Add else handling.
(internal_fn_else_index): Ditto.
(internal_fn_mask_index): Ditto.
(get_supported_else_vals): New function.
(supported_else_val_p): New function.
(internal_gather_scatter_fn_supported_p): Add else operand.
* internal-fn.h (internal_gather_scatter_fn_supported_p): Define
else constants.
(MASK_LOAD_ELSE_ZERO): Ditto.
(MASK_LOAD_ELSE_M1): Ditto.
(MASK_LOAD_ELSE_UNDEFINED): Ditto.
(get_supported_else_vals): Declare.
(supported_else_val_p): Ditto.
---
 gcc/internal-fn.cc | 131 +++--
 gcc/internal-fn.h  |  15 +-
 2 files changed, 129 insertions(+), 17 deletions(-)

diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index d89a04fe412..b6049cec91e 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -331,17 +331,18 @@ get_multi_vector_move (tree array_type, convert_optab optab)
   return convert_optab_handler (optab, imode, vmode);
 }
 
-/* Add mask and len arguments according to the STMT.  */
+/* Add mask, else, and len arguments according to the STMT.  */
 
 static unsigned int
-add_mask_and_len_args (expand_operand *ops, unsigned int opno, gcall *stmt)
+add_mask_else_and_len_args (expand_operand *ops, unsigned int opno, gcall *stmt)
 {
   internal_fn ifn = gimple_call_internal_fn (stmt);
   int len_index = internal_fn_len_index (ifn);
   /* BIAS is always consecutive next of LEN.  */
   int bias_index = len_index + 1;
   int mask_index = internal_fn_mask_index (ifn);
-  /* The order of arguments are always {len,bias,mask}.  */
+
+  /* The order of arguments is always {mask, else, len, bias}.  */
   if (mask_index >= 0)
 {
   tree mask = gimple_call_arg (stmt, mask_index);
@@ -362,6 +363,23 @@ add_mask_and_len_args (expand_operand *ops, unsigned int opno, gcall *stmt)
 
   create_input_operand (&ops[opno++], mask_rtx,
TYPE_MODE (TREE_TYPE (mask)));
+
+}
+
+  int els_index = internal_fn_else_index (ifn);
+  if (els_index >= 0)
+{
+  tree els = gimple_call_arg (stmt, els_index);
+  tree els_type = TREE_TYPE (els);
+  if (TREE_CODE (els) == SSA_NAME
+ && SSA_NAME_IS_DEFAULT_DEF (els)
+ && VAR_P (SSA_NAME_VAR (els)))
+   create_undefined_input_operand (&ops[opno++], TYPE_MODE (els_type));
+  else
+   {
+ rtx els_rtx = expand_normal (els);
+ create_input_operand (&ops[opno++], els_rtx, TYPE_MODE (els_type));
+   }
 }
   if (len_index >= 0)
 {
@@ -3014,7 +3032,7 @@ static void
 expand_partial_load_optab_fn (internal_fn ifn, gcall *stmt, convert_optab optab)
 {
   int i = 0;
-  class expand_operand ops[5];
+  class expand_operand ops[6];
   tree type, lhs, rhs, maskt;
   rtx mem, target;
   insn_code icode;
@@ -3044,7 +3062,7 @@ expand_partial_load_optab_fn (internal_fn ifn, gcall *stmt, convert_optab optab)
   target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   create_call_lhs_operand (&ops[i++], target, TYPE_MODE (type));
   create_fixed_operand (&ops[i++], mem);
-  i = add_mask_and_len_args (ops, i, stmt);
+  i = add_mask_else_and_len_args (ops, i, stmt);
   expand_insn (icode, i, ops);
 
   assign_call_lhs (lhs, target, &ops[0]);
@@ -3090,7 +3108,7 @@ expand_partial_store_optab_fn (internal_fn ifn, gcall *stmt, convert_optab optab)
   reg = expand_normal (rhs);
   create_fixed_operand (&ops[i++], mem);
   create_input_operand (&ops[i++], reg, TYPE_MODE (type));
-  i = add_mask_and_len_args (ops, i, stmt);
+  i = add_mask_else_and_len_args (ops, i, stmt);
   expand_insn (icode, i, ops);
 }
 
@@ -3676,7 +3694,7 @@ expand_scatter_store_optab_fn (internal_fn, gcall *stmt, direct_optab optab)
   create_integer_operand (&ops[i++], TYPE_UNSIGNED (TREE_TYPE (offset)));
   create_integer_operand (&ops[i++], scale_int);
   create_input_operand (&ops[i++], rhs_rtx, TYPE_MODE (TREE_TYPE (rhs)));
-  i = add_mask_and_len_args (ops, i, stmt);
+  i = add_mask_else_and_len_args (ops, i, stmt);
 
   insn_code icode = convert_optab_handler (optab, TYPE_MODE (TREE_TYPE (rhs)),
   TYPE_MODE (TREE_TYPE (offset)));
@@ -3705,7 +3723,7 @@ expand_gather_load_optab_fn (internal_fn, gcall *stmt, direct_optab optab)
   create_input_operand (&ops[i++], offset_rtx, TYPE_MODE (TREE_TYPE (offset)));
   create_integer_operand (&ops[i++], TYPE_UNSIGNED (TREE_TYPE (offset)));
   create_integer_operand (&ops[i++], scale_int);
-  i = add_mask_and_len_args (ops, i, stmt);
+

Re: [PATCH 1/7] libstdc++: Refactor std::uninitialized_{copy, fill, fill_n} algos [PR68350]

2024-10-18 Thread Patrick Palka
On Fri, 18 Oct 2024, Patrick Palka wrote:

> On Fri, 18 Oct 2024, Jonathan Wakely wrote:
> 
> > On 16/10/24 21:39 -0400, Patrick Palka wrote:
> > > On Tue, 15 Oct 2024, Jonathan Wakely wrote:
> > > > +#if __cplusplus < 201103L
> > > > +
> > > > +  // True if we can unwrap _Iter to get a pointer by using std::__niter_base.
> > > > +  template<typename _Iter>
> > > > +struct __unwrappable_niter
> > > > +{
> > > > +  template<typename _Tp> struct __is_ptr { enum { __value = 0 }; };
> > > > +  template<typename _Tp> struct __is_ptr<_Tp*> { enum { __value = 1 }; };
> > > > +
> > > > +  typedef __decltype(std::__niter_base(*(_Iter*)0)) _Base;
> > > > +
> > > > +  enum { __value = __is_ptr<_Base>::__value };
> > > > +};
> > > 
> > > It might be slightly cheaper to define this without the nested class
> > > template as:
> > > 
> > >  template<typename _Iter,
> > >           typename _Base = __decltype(std::__niter_base(*(_Iter*)0))>
> > >  struct __unwrappable_niter
> > >  { enum { __value = false }; };
> > > 
> > >  template<typename _Iter, typename _Tp>
> > >  struct __unwrappable_niter<_Iter, _Tp*>
> > >  { enum { __value = true }; };
> 
> One minor nit, we might as well use 'value' since it's a reserved name
> even in C++98?

Whoops just saw that you already pushed this, never mind this tiny nit
then :)

> 
> > 
> > Nice. I think after spending a while failing to make any C++98
> > metaprogramming work for __memcpyable in cpp_type_traits.h I was not
> > in the mood for fighting C++98 any more :-) But this works well.
> 
> > 
> > > > +
> > > > +  // Use template specialization for C++98 when 'if constexpr' can't be used.
> > > > +  template<bool _TrivialValueTypes>
> > > >  struct __uninitialized_copy
> > > >  {
> > > >    template<typename _InputIterator, typename _ForwardIterator>
> > > > @@ -186,53 +172,150 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > > >template<>
> > > >  struct __uninitialized_copy
> > > >  {
> > > > +  // Overload for generic iterators.
> > > >    template<typename _InputIterator, typename _ForwardIterator>
> > > >  static _ForwardIterator
> > > >  __uninit_copy(_InputIterator __first, _InputIterator __last,
> > > >   _ForwardIterator __result)
> > > > -{ return std::copy(__first, __last, __result); }
> > > > -};
> > > > +   {
> > > > + if (__unwrappable_niter<_InputIterator>::__value
> > > > +   && __unwrappable_niter<_ForwardIterator>::__value)
> > > > +   {
> > > > + __uninit_copy(std::__niter_base(__first),
> > > > +   std::__niter_base(__last),
> > > > +   std::__niter_base(__result));
> > > > + std::advance(__result, std::distance(__first, __last));
> > > > + return __result;
> > > > +   }
> > > > + else
> > > > +   return std::__do_uninit_copy(__first, __last, __result);
> > > > +   }
> > > > 
> > > > +  // Overload for pointers.
> > > > +  template<typename _Tp, typename _Up>
> > > > +   static _Up*
> > > > +   __uninit_copy(_Tp* __first, _Tp* __last, _Up* __result)
> > > > +   {
> > > > + // Ensure that we don't successfully memcpy in cases that 
> > > > should be
> > > > + // ill-formed because is_constructible<_Up, _Tp&> is false.
> > > > + typedef __typeof__(static_cast<_Up>(*__first)) __check
> > > > +   __attribute__((__unused__));
> > > > +
> > > > + if (const ptrdiff_t __n = __last - __first)
> > > 
> > > Do we have to worry about the __n == 1 case here like in the C++11 code
> > > path?
> > 
> > Actually I think we don't need to worry about it in either case.
> > 
> > C++20 had a note in [specialized.algorithms.general]/3 that said:
> > 
> >   [Note 1: When invoked on ranges of potentially-overlapping subobjects
> >   ([intro.object]), the algorithms specified in [specialized.algorithms]
> >   result in undefined behavior. — end note]
> > 
> > The reason is that the uninitialized algos create new objects at the
> > specified storage locations, and creating new objects reuses storage,
> > which ends the lifetime of any other objects in that storage. That
> > includes any objects that were in tail padding within that storage.
> > 
> > See Casey's Feb 2023 comment at
> > https://github.com/cplusplus/draft/issues/6143
> > 
> > That note was removed for C++23 (which is unfortunate IMHO), but the
> > algos still reuse storage by creating new objects, and so still end
> > the lifetime of potentially-overlapping subobjects within that
> > storage.
> > 
> > For std::copy there are no new objects created, and the effects are
> > specified in terms of assignment, which does not reuse storage. A
> > compiler-generated trivial copy assignment operator is careful to not
> > overwrite tail padding, so we can't use memmove if it would produce
> > different effects.
> > 
> > tl;dr I think I can remove the __n == 1 handling from the C++11 paths.
> 
> Nice, makes sense
> 
> > 
> > > > +   {
> > > > + __builtin_memcpy(__result, __first, __n * sizeof(_Tp));
> > > > + __result += __n;
> > > > +   }
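
To make the tail-padding hazard discussed above concrete, here is a
standalone illustration (not part of the patch; the layout shown is what
GCC's Itanium ABI typically produces, not a standard guarantee):

```cpp
#include <cstdio>
#include <cstring>

struct Base
{
  long long x;
  char c;
  Base () : x (0), c (0) { }  // user-provided ctor makes Base non-POD,
                              // so its tail padding may be reused
};

struct Derived : Base
{
  char tail;  // GCC typically places this inside Base's tail padding
};

int
main ()
{
  Derived d;
  d.tail = 42;
  Base b;
  // A byte-wise copy of the whole Base, as the uninitialized algos may
  // do, rewrites Base's padding bytes and can wipe out d.tail.  By
  // contrast, member-wise assignment (what std::copy is specified in
  // terms of) writes only x and c, leaving d.tail intact.
  std::memcpy (static_cast<Base *> (&d), &b, sizeof (Base));
  std::printf ("%d\n", d.tail);  // typically prints 0, not 42
}
```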

Re: [PATCH 4/9] Simplify (X /[ex] C1) * (C1 * C2) -> X * C2

2024-10-18 Thread Richard Biener
On Fri, 18 Oct 2024, Richard Sandiford wrote:

OK.

Thanks,
Richard.
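
For a concrete instance of the new rule (an illustration mirroring the
f1 case in the tests below, not part of the patch):

```cpp
#include <cassert>

int
f1 (int x)
{
  if (x & 7)
    __builtin_unreachable ();  // tells GCC x is a multiple of 8
  return x / 2 * 6;            // (x /[ex] 2) * 6 folds to x * 3
}

int
main ()
{
  for (int x = -64; x <= 64; x += 8)
    assert (f1 (x) == x * 3);
}
```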

> gcc/
>   * match.pd: Simplify (X /[ex] C1) * (C1 * C2) -> X * C2.
> 
> gcc/testsuite/
>   * gcc.dg/tree-ssa/mulexactdiv-1.c: New test.
>   * gcc.dg/tree-ssa/mulexactdiv-2.c: Likewise.
>   * gcc.dg/tree-ssa/mulexactdiv-3.c: Likewise.
>   * gcc.dg/tree-ssa/mulexactdiv-4.c: Likewise.
>   * gcc.target/aarch64/sve/cnt_fold_1.c: Likewise.
>   * gcc.target/aarch64/sve/cnt_fold_2.c: Likewise.
> ---
>  gcc/match.pd  |   8 ++
>  gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-1.c |  23 
>  gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-2.c |  19 +++
>  gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-3.c |  21 
>  gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-4.c |  14 +++
>  .../gcc.target/aarch64/sve/cnt_fold_1.c   | 110 ++
>  .../gcc.target/aarch64/sve/cnt_fold_2.c   |  55 +
>  7 files changed, 250 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-4.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_2.c
> 
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 1b1d38cf105..6677bc06d80 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -31,6 +31,7 @@ along with GCC; see the file COPYING3.  If not see
> zerop
> initializer_each_zero_or_onep
> CONSTANT_CLASS_P
> +   poly_int_tree_p
> tree_expr_nonnegative_p
> tree_expr_nonzero_p
> integer_valued_real_p
> @@ -5467,6 +5468,13 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>(mult (convert1? (exact_div @0 @@1)) (convert2? @1))
>(convert @0))
>  
> +/* (X /[ex] C1) * (C1 * C2) -> X * C2.  */
> +(simplify
> + (mult (convert? (exact_div @0 INTEGER_CST@1)) poly_int_tree_p@2)
> + (with { poly_widest_int factor; }
> +  (if (multiple_p (wi::to_poly_widest (@2), wi::to_widest (@1), &factor))
> +   (mult (convert @0) { wide_int_to_tree (type, factor); }
> +
>  /* Simplify (A / B) * B + (A % B) -> A.  */
>  (for div (trunc_div ceil_div floor_div round_div)
>   mod (trunc_mod ceil_mod floor_mod round_mod)
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-1.c b/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-1.c
> new file mode 100644
> index 000..fa853eb7dff
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-1.c
> @@ -0,0 +1,23 @@
> +/* { dg-options "-O2 -fdump-tree-optimized-raw" } */
> +
> +#define TEST_CMP(FN, DIV, MUL)   \
> +  int\
> +  FN (int x) \
> +  {  \
> +if (x & 7)   \
> +  __builtin_unreachable ();  \
> +x /= DIV;\
> +return x * MUL;  \
> +  }
> +
> +TEST_CMP (f1, 2, 6)
> +TEST_CMP (f2, 2, 10)
> +TEST_CMP (f3, 4, 80)
> +TEST_CMP (f4, 8, 200)
> +
> +/* { dg-final { scan-tree-dump-not {<[a-z]*_div_expr, } "optimized" } } */
> +/* { dg-final { scan-tree-dump-not { +/* { dg-final { scan-tree-dump { } */
> +/* { dg-final { scan-tree-dump { } */
> +/* { dg-final { scan-tree-dump { } */
> +/* { dg-final { scan-tree-dump { } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-2.c b/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-2.c
> new file mode 100644
> index 000..9df49690ab6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-2.c
> @@ -0,0 +1,19 @@
> +/* { dg-options "-O2 -fdump-tree-optimized-raw" } */
> +
> +#define TEST_CMP(FN, DIV, MUL)   \
> +  int\
> +  FN (int x) \
> +  {  \
> +if (x & 7)   \
> +  __builtin_unreachable ();  \
> +x /= DIV;\
> +return x * MUL;  \
> +  }
> +
> +TEST_CMP (f1, 2, 1)
> +TEST_CMP (f2, 2, 5)
> +TEST_CMP (f3, 4, 10)
> +TEST_CMP (f4, 8, 100)
> +TEST_CMP (f5, 16, 32)
> +
> +/* { dg-final { scan-tree-dump-times {<[a-z]*_div_expr, } 5 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-3.c b/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-3.c
> new file mode 100644
> index 000..38778a0d7a5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-3.c
> @@ -0,0 +1,21 @@
> +/* { dg-options "-O2 -fdump-tree-optimized-raw" } */
> +
> +#define TEST_CMP(FN, TYPE1, DIV, TYPE2, MUL) \
> +  TYPE2  \
> +  FN (TYPE1 x)   \
> +  { 

[PATCH 3/7] RISC-V: Fix vector memcpy smaller LMUL generation

2024-10-18 Thread Craig Blackmore
If riscv_vector::expand_block_move is generating a straight-line memcpy
using a predicated store, it tries to use a smaller LMUL to reduce
register pressure, provided that LMUL still allows the entire transfer.

This happens in the inner loop of riscv_vector::expand_block_move;
however, the vmode chosen by this loop got overwritten later in the
function, so I have added the missing break from the outer loop.

I have also addressed a couple of issues with the conditions of the if
statement within the inner loop.

The first condition did not make sense to me:
```
  TARGET_MIN_VLEN * lmul <= nunits * BITS_PER_UNIT
```
I think this was supposed to be checking that the length fits within the
given LMUL, so I have changed it to do that.
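
A worked example of the difference (the numbers are illustrative only):
take a 24-byte transfer with 1-byte elements (so nunits == length == 24)
and TARGET_MIN_VLEN == 128.

```cpp
#include <cstdio>

int
main ()
{
  const int target_min_vlen = 128, bits_per_unit = 8, nunits = 24;
  for (int lmul = 1; lmul <= 4; lmul <<= 1)
    {
      bool old_cond = target_min_vlen * lmul <= nunits * bits_per_unit;
      bool new_cond = nunits * bits_per_unit <= target_min_vlen * lmul;
      std::printf ("lmul=%d old=%d new=%d\n", lmul, old_cond, new_cond);
    }
  // The old test passes only at LMUL=1, whose vector cannot hold 24
  // bytes; the new test first passes at LMUL=2, the smallest LMUL that
  // fits the whole transfer.
}
```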

The second condition:
```
  /* Avoid loosing the option of using vsetivli .  */
  && (nunits <= 31 * lmul || nunits > 31 * 8)
```
seems to imply that LMUL affects the range of the AVL immediate that
vsetivli can take, but I don't think that is correct.  In any case, I
don't think this condition is necessary: once we find a suitable mode we
should stick with it, regardless of whether it allowed vsetivli, rather
than continuing to try a larger LMUL (which would increase register
pressure) or a smaller potential_ew (which would increase the AVL).  I
have removed this condition.

gcc/ChangeLog:

* config/riscv/riscv-string.cc (expand_block_move): Fix
condition for using smaller LMUL.  Break outer loop if a
suitable vmode has been found.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/vsetvl/pr112929-1.c: Expect smaller lmul.
* gcc.target/riscv/rvv/vsetvl/pr112988-1.c: Likewise.
* gcc.target/riscv/rvv/base/cpymem-3.c: New test.
---
 gcc/config/riscv/riscv-string.cc  |  8 +-
 .../gcc.target/riscv/rvv/base/cpymem-3.c  | 85 +++
 .../gcc.target/riscv/rvv/vsetvl/pr112929-1.c  |  2 +-
 .../gcc.target/riscv/rvv/vsetvl/pr112988-1.c  |  2 +-
 4 files changed, 92 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/cpymem-3.c

diff --git a/gcc/config/riscv/riscv-string.cc b/gcc/config/riscv/riscv-string.cc
index 0f1353baba3..b590c516354 100644
--- a/gcc/config/riscv/riscv-string.cc
+++ b/gcc/config/riscv/riscv-string.cc
@@ -1153,9 +1153,7 @@ expand_block_move (rtx dst_in, rtx src_in, rtx length_in)
 Still, by choosing a lower LMUL factor that still allows
 an entire transfer, we can reduce register pressure.  */
  for (unsigned lmul = 1; lmul <= 4; lmul <<= 1)
-   if (TARGET_MIN_VLEN * lmul <= nunits * BITS_PER_UNIT
-   /* Avoid loosing the option of using vsetivli .  */
-   && (nunits <= 31 * lmul || nunits > 31 * 8)
+   if (length * BITS_PER_UNIT <= TARGET_MIN_VLEN * lmul
&& multiple_p (BYTES_PER_RISCV_VECTOR * lmul, potential_ew)
&& (riscv_vector::get_vector_mode
 (elem_mode, exact_div (BYTES_PER_RISCV_VECTOR * lmul,
@@ -1163,6 +1161,10 @@ expand_block_move (rtx dst_in, rtx src_in, rtx length_in)
  break;
}
 
+ /* Stop searching if a suitable vmode has been found.  */
+ if (vmode != VOIDmode)
+   break;
+
  /* The RVVM8?I modes are notionally 8 * BYTES_PER_RISCV_VECTOR bytes
 wide.  BYTES_PER_RISCV_VECTOR can't be evenly divided by
 the sizes of larger element types; the LMUL factor of 8 can at
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/cpymem-3.c b/gcc/testsuite/gcc.target/riscv/rvv/base/cpymem-3.c
new file mode 100644
index 000..f07078ba6a7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/cpymem-3.c
@@ -0,0 +1,85 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O1 -fno-schedule-insns -fno-schedule-insns2 -mrvv-max-lmul=m8" } */
+/* { dg-add-options riscv_v } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#define MIN_VECTOR_BYTES (__riscv_v_min_vlen / 8)
+
+/* Check that vector memcpy with predicated store uses smaller LMUL where
+   possible.  */
+
+/* m1
+** f1:
+**  (
+**  vsetivli\s+zero,\d+,e8,m1,ta,ma
+**  |
+**  li\s+[ta][0-7],\d+
+**  vsetvli\s+zero,[ta][0-7],e8,m1,ta,ma
+**  )
+**  vle8.v\s+v\d+,0\(a1\)
+**  vse8.v\s+v\d+,0\(a0\)
+**  ret
+*/
+
+void f1 (char *d, char *s)
+{
+  __builtin_memcpy (d, s, MIN_VECTOR_BYTES - 1);
+}
+
+/* m2
+** f2:
+**  (
+**  vsetivli\s+zero,\d+,e8,m2,ta,ma
+**  |
+**  li\s+[ta][0-7],\d+
+**  vsetvli\s+zero,[ta][0-7],e8,m2,ta,ma
+**  )
+**  vle8.v\s+v\d+,0\(a1\)
+**  vse8.v\s+v\d+,0\(a0\)
+**  ret
+*/
+
+void f2 (char *d, char *s)
+{
+  __builtin_memcpy (d, s, 2 * MIN_VECTOR_BYTES - 1);
+}
+
+/* m4
+** f3:
+**  (
+**  vsetivli\s+zero,\d+,e8,m4,ta,ma
+**  |
+**  li\s+[ta][0-7],\d+
+**  vsetvli\s+zero,[ta][0-7],e8,m4,ta,ma
+**  )
+**  vle8.v\s+v\d+,0\(a1\)
+**  vse8.v\s+v\d+,0\(a0\)
+**  ret
+*/
+
+void f3 (char *d, char *s)
+{
+  __builtin_memcpy (d, s, 4 * MIN_VECTOR_BYTES 

[PATCH 0/7] RISC-V: Vector memcpy/memset fixes and improvements

2024-10-18 Thread Craig Blackmore
The main aim of this patch series is to make inline vector memcpy
respect -mrvv-max-lmul and to extend inline vector memset to be used
in more cases.  It includes some preparatory fixes and refactoring along
the way.

Craig Blackmore (7):
  RISC-V: Fix indentation in riscv_vector::expand_block_move [NFC]
  RISC-V: Fix uninitialized reg in memcpy
  RISC-V: Fix vector memcpy smaller LMUL generation
  RISC-V: Honour -mrvv-max-lmul in riscv_vector::expand_block_move
  RISC-V: Move vector memcpy decision making to separate function [NFC]
  RISC-V: Make vectorized memset handle more cases
  RISC-V: Disable by pieces for vector setmem length > UNITS_PER_WORD

 gcc/config/riscv/riscv-protos.h   |   3 +-
 gcc/config/riscv/riscv-string.cc  | 292 +++---
 gcc/config/riscv/riscv-v.cc   |  12 +
 gcc/config/riscv/riscv.cc |  19 ++
 gcc/config/riscv/riscv.md |  12 +-
 .../gcc.target/riscv/rvv/autovec/pr113206-1.c |   2 +-
 .../gcc.target/riscv/rvv/autovec/pr113206-2.c |   2 +-
 .../gcc.target/riscv/rvv/autovec/pr113469.c   |   3 +-
 .../rvv/autovec/vls/calling-convention-1.c|  11 +-
 .../rvv/autovec/vls/calling-convention-2.c|  11 +-
 .../rvv/autovec/vls/calling-convention-3.c|  11 +-
 .../rvv/autovec/vls/calling-convention-4.c|   8 +-
 .../rvv/autovec/vls/calling-convention-5.c|  11 +-
 .../rvv/autovec/vls/calling-convention-6.c|  11 +-
 .../rvv/autovec/vls/calling-convention-7.c|   8 +-
 .../riscv/rvv/autovec/vls/spill-4.c   |   2 +-
 .../riscv/rvv/autovec/vls/spill-7.c   |   2 +-
 .../gcc.target/riscv/rvv/base/cpymem-1.c  |   4 +-
 .../gcc.target/riscv/rvv/base/cpymem-2.c  |   2 +-
 .../gcc.target/riscv/rvv/base/cpymem-3.c  |  85 +
 .../gcc.target/riscv/rvv/base/movmem-1.c  |   7 +-
 .../gcc.target/riscv/rvv/base/pr111720-0.c|   2 +-
 .../gcc.target/riscv/rvv/base/pr111720-1.c|   2 +-
 .../gcc.target/riscv/rvv/base/pr111720-2.c|   2 +-
 .../gcc.target/riscv/rvv/base/pr111720-3.c|   2 +-
 .../gcc.target/riscv/rvv/base/pr111720-4.c|   2 +-
 .../gcc.target/riscv/rvv/base/pr111720-5.c|   2 +-
 .../gcc.target/riscv/rvv/base/pr111720-6.c|   2 +-
 .../gcc.target/riscv/rvv/base/pr111720-7.c|   2 +-
 .../gcc.target/riscv/rvv/base/pr111720-8.c|   2 +-
 .../gcc.target/riscv/rvv/base/pr111720-9.c|   2 +-
 .../gcc.target/riscv/rvv/base/setmem-1.c  |  37 ++-
 .../gcc.target/riscv/rvv/base/setmem-2.c  |  49 ++-
 .../gcc.target/riscv/rvv/base/setmem-3.c  |  53 +++-
 .../gcc.target/riscv/rvv/vsetvl/pr112929-1.c  |   6 +-
 .../gcc.target/riscv/rvv/vsetvl/pr112988-1.c  |   6 +-
 36 files changed, 463 insertions(+), 226 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/cpymem-3.c

-- 
2.43.0



[PATCH 4/7] RISC-V: Honour -mrvv-max-lmul in riscv_vector::expand_block_move

2024-10-18 Thread Craig Blackmore
Unlike the other vector string ops, expand_block_move was using max LMUL
m8 regardless of TARGET_MAX_LMUL.

The check for whether to generate inline vector code for movmem has been
moved from movmem to riscv_vector::expand_block_move to avoid
maintaining multiple versions of similar logic.  They already differed
on the minimum length for which they would generate vector code.  Now
that the expand_block_move value is used, movmem will be generated for
smaller lengths.

Limiting memcpy to m1 caused some memcpy loops to be generated in
the calling-convention tests, which made it awkward to add suitable
scan-assembler tests checking that the return value is set, so
-mrvv-max-lmul=m8 has been added to these tests.  Other tests have been
adjusted to expect the new memcpy m1 generation where reasonably
straightforward; otherwise -mrvv-max-lmul=m8 has been added.

pr111720-[0-9].c regressed because a memcpy loop is generated instead
of straight-line code.  This reveals an existing issue where a redundant
straight-line memcpy gets eliminated but a memcpy loop does not
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117205).

For example, on pr111720-0.c after this patch:

-mrvv-max-lmul=m8:

test:
lui     a5,%hi(.LANCHOR0)
li      a4,32
addi    sp,sp,-32
addi    a5,a5,%lo(.LANCHOR0)
vsetvli zero,a4,e8,m1,ta,ma
vle8.v  v8,0(a5)
addi    sp,sp,32
jr      ra

-mrvv-max-lmul=m1:

test:
addi    sp,sp,-32
lui     a5,%hi(.LANCHOR0)
addi    a5,a5,%lo(.LANCHOR0)
mv  a2,sp
li  a3,32
.L2:
vsetvli a4,a3,e8,m1,ta,ma
vle8.v  v8,0(a5)
sub a3,a3,a4
add a5,a5,a4
vse8.v  v8,0(a2)
add a2,a2,a4
bne a3,zero,.L2
li  a5,32
vsetvli zero,a5,e8,m1,ta,ma
vle8.v  v8,0(sp)
addi    sp,sp,32
jr  ra

I have added -mrvv-max-lmul=m8 to pr111720-[0-9].c so that we continue
to test the elimination of straight-line memcpy.

gcc/ChangeLog:

* config/riscv/riscv-protos.h (get_lmul_mode): New prototype.
(expand_block_move): Add bool parameter for movmem_p.
* config/riscv/riscv-string.cc (riscv_expand_block_move_scalar):
Pass movmem_p as false to riscv_vector::expand_block_move.
(expand_block_move): Add movmem_p parameter.  Return false if
loop needed and movmem_p is true.  Respect TARGET_MAX_LMUL.
* config/riscv/riscv-v.cc (get_lmul_mode): New function.
* config/riscv/riscv.md (movmem): Move checking for
whether to generate inline vector code to
riscv_vector::expand_block_move by passing movmem_p as true.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr113206-1.c: Add
-mrvv-max-lmul=m8.
* gcc.target/riscv/rvv/autovec/pr113206-2.c: Likewise.
* gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c: Add
-mrvv-max-lmul=m8 and adjust assembly scans.
* gcc.target/riscv/rvv/autovec/vls/calling-convention-2.c:
Likewise.
* gcc.target/riscv/rvv/autovec/vls/calling-convention-3.c:
Likewise.
* gcc.target/riscv/rvv/autovec/vls/calling-convention-4.c:
Likewise.
* gcc.target/riscv/rvv/autovec/vls/calling-convention-5.c:
Likewise.
* gcc.target/riscv/rvv/autovec/vls/calling-convention-6.c:
Likewise.
* gcc.target/riscv/rvv/autovec/vls/calling-convention-7.c:
Likewise.
* gcc.target/riscv/rvv/autovec/vls/spill-4.c: Add
-mrvv-max-lmul=m8.
* gcc.target/riscv/rvv/autovec/vls/spill-7.c: Likewise.
* gcc.target/riscv/rvv/base/cpymem-1.c: Expect m1 in f1 and f2.
* gcc.target/riscv/rvv/base/cpymem-2.c: Add -mrvv-max-lmul=m8.
* gcc.target/riscv/rvv/base/movmem-1.c: Adjust f1 to a length
that will not get vectorized.
* gcc.target/riscv/rvv/base/pr111720-0.c: Add -mrvv-max-lmul=m8.
* gcc.target/riscv/rvv/base/pr111720-1.c: Likewise.
* gcc.target/riscv/rvv/base/pr111720-2.c: Likewise.
* gcc.target/riscv/rvv/base/pr111720-3.c: Likewise.
* gcc.target/riscv/rvv/base/pr111720-4.c: Likewise.
* gcc.target/riscv/rvv/base/pr111720-5.c: Likewise.
* gcc.target/riscv/rvv/base/pr111720-6.c: Likewise.
* gcc.target/riscv/rvv/base/pr111720-7.c: Likewise.
* gcc.target/riscv/rvv/base/pr111720-8.c: Likewise.
* gcc.target/riscv/rvv/base/pr111720-9.c: Likewise.
* gcc.target/riscv/rvv/vsetvl/pr112929-1.c: Expect memcpy m1
loops.
* gcc.target/riscv/rvv/vsetvl/pr112988-1.c: Likewise.
---
 gcc/config/riscv/riscv-protos.h   |  3 +-
 gcc/config/riscv/riscv-string.cc  | 65 +++
 gcc/config/riscv/riscv-v.cc   | 12 
 gcc/config/riscv/riscv.md | 12 +---
 .../gcc.target/riscv/rvv/autovec/pr113206-1.c |  2 +-
 .../gcc.target/riscv/rvv/a

[PATCH 6/7] RISC-V: Make vectorized memset handle more cases

2024-10-18 Thread Craig Blackmore
`expand_vec_setmem` only generated vectorized memset if it fitted into a
single vector store.  Extend it to generate a loop for longer and
unknown lengths.

The test cases now use -O1 so that they are not sensitive to scheduling.
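
A rough C++ model of the strip-mined loop this generates (illustration
only, not the emitted RTL): each iteration sets vl = min (remaining,
vlmax) bytes, mirroring the vsetvli/vmv.v.x/vse8.v sequence.

```cpp
#include <cstddef>

void
vec_setmem_model (unsigned char *dst, std::size_t len, unsigned char fill,
                  std::size_t vlmax)  // bytes per vector at the chosen LMUL
{
  while (len)
    {
      std::size_t vl = len < vlmax ? len : vlmax;  // vsetvli
      for (std::size_t i = 0; i < vl; i++)         // vmv.v.x + vse8.v
        dst[i] = fill;
      dst += vl;
      len -= vl;
    }
}
```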

gcc/ChangeLog:

* config/riscv/riscv-string.cc
(use_vector_stringop_p): Add comment.
(expand_vec_setmem): Use use_vector_stringop_p instead of
check_vectorise_memory_operation.  Add loop generation.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/setmem-1.c: Use -O1.  Expect a loop
instead of a libcall.  Add test for unknown length.
* gcc.target/riscv/rvv/base/setmem-2.c: Likewise.
* gcc.target/riscv/rvv/base/setmem-3.c: Likewise and expect smaller
lmul.
---
 gcc/config/riscv/riscv-string.cc  | 83 ++-
 .../gcc.target/riscv/rvv/base/setmem-1.c  | 37 -
 .../gcc.target/riscv/rvv/base/setmem-2.c  | 37 -
 .../gcc.target/riscv/rvv/base/setmem-3.c  | 41 +++--
 4 files changed, 160 insertions(+), 38 deletions(-)

diff --git a/gcc/config/riscv/riscv-string.cc b/gcc/config/riscv/riscv-string.cc
index 118c02a4021..91b0ec03118 100644
--- a/gcc/config/riscv/riscv-string.cc
+++ b/gcc/config/riscv/riscv-string.cc
@@ -1062,6 +1062,9 @@ struct stringop_info {
 
MAX_EW is the maximum element width that the caller wants to use and
LENGTH_IN is the length of the stringop in bytes.
+
+   This is currently used for cpymem and setmem.  If expand_vec_cmpmem switches
+   to using it too then check_vectorise_memory_operation can be removed.
 */
 
 static bool
@@ -1600,41 +1603,75 @@ check_vectorise_memory_operation (rtx length_in, 
HOST_WIDE_INT &lmul_out)
 bool
 expand_vec_setmem (rtx dst_in, rtx length_in, rtx fill_value_in)
 {
-  HOST_WIDE_INT lmul;
+  stringop_info info;
+
   /* Check we are able and allowed to vectorise this operation;
  bail if not.  */
-  if (!check_vectorise_memory_operation (length_in, lmul))
+  if (!use_vector_stringop_p (info, 1, length_in))
 return false;
 
-  machine_mode vmode
-  = riscv_vector::get_vector_mode (QImode, BYTES_PER_RISCV_VECTOR * lmul)
-   .require ();
+  /* avl holds the (remaining) length of the required set.
+ cnt holds the length we set with the current store.  */
+  rtx cnt = info.avl;
   rtx dst_addr = copy_addr_to_reg (XEXP (dst_in, 0));
-  rtx dst = change_address (dst_in, vmode, dst_addr);
+  rtx dst = change_address (dst_in, info.vmode, dst_addr);
 
-  rtx fill_value = gen_reg_rtx (vmode);
+  rtx fill_value = gen_reg_rtx (info.vmode);
   rtx broadcast_ops[] = { fill_value, fill_value_in };
 
-  /* If the length is exactly vlmax for the selected mode, do that.
- Otherwise, use a predicated store.  */
-  if (known_eq (GET_MODE_SIZE (vmode), INTVAL (length_in)))
+  rtx label = NULL_RTX;
+  rtx mask = NULL_RTX;
+
+  /* If we don't need a loop and the length is exactly vlmax for the selected
+ mode do a broadcast and store, otherwise use a predicated store.  */
+  if (!info.need_loop
+  && known_eq (GET_MODE_SIZE (info.vmode), INTVAL (length_in)))
 {
-  emit_vlmax_insn (code_for_pred_broadcast (vmode), UNARY_OP,
- broadcast_ops);
+  emit_vlmax_insn (code_for_pred_broadcast (info.vmode), UNARY_OP,
+  broadcast_ops);
   emit_move_insn (dst, fill_value);
+  return true;
 }
-  else
+
+  machine_mode mask_mode
+= riscv_vector::get_vector_mode (BImode,
+GET_MODE_NUNITS (info.vmode)).require ();
+  mask =  CONSTM1_RTX (mask_mode);
+  if (!satisfies_constraint_K (cnt))
+cnt = force_reg (Pmode, cnt);
+
+  if (info.need_loop)
 {
-  if (!satisfies_constraint_K (length_in))
- length_in = force_reg (Pmode, length_in);
-  emit_nonvlmax_insn (code_for_pred_broadcast (vmode), UNARY_OP,
- broadcast_ops, length_in);
-  machine_mode mask_mode
- = riscv_vector::get_vector_mode (BImode, GET_MODE_NUNITS (vmode))
- .require ();
-  rtx mask = CONSTM1_RTX (mask_mode);
-  emit_insn (gen_pred_store (vmode, dst, mask, fill_value, length_in,
- get_avl_type_rtx (riscv_vector::NONVLMAX)));
+  info.avl = copy_to_mode_reg (Pmode, info.avl);
+  cnt = gen_reg_rtx (Pmode);
+  emit_insn (riscv_vector::gen_no_side_effects_vsetvl_rtx (info.vmode, cnt,
+  info.avl));
+}
+
+  emit_nonvlmax_insn (code_for_pred_broadcast (info.vmode),
+ riscv_vector::UNARY_OP, broadcast_ops, cnt);
+
+  if (info.need_loop)
+{
+  label = gen_label_rtx ();
+
+  emit_label (label);
+  emit_insn (riscv_vector::gen_no_side_effects_vsetvl_rtx (info.vmode, cnt,
+  info.avl));
+}
+
+  emit_insn (gen_pred_store (info.vmode, dst, mask, fill_value, c

[PATCH 2/7] RISC-V: Fix uninitialized reg in memcpy

2024-10-18 Thread Craig Blackmore
riscv_vector::expand_block_move contains a gen_rtx_NE that uses
uninitialized reg rtx `end`.  It looks like `length_rtx` was supposed to
be used here.

gcc/ChangeLog:

* config/riscv/riscv-string.cc (expand_block_move): Replace
`end` with `length_rtx` in gen_rtx_NE.
---
 gcc/config/riscv/riscv-string.cc | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/gcc/config/riscv/riscv-string.cc b/gcc/config/riscv/riscv-string.cc
index 0c5ffd7d861..0f1353baba3 100644
--- a/gcc/config/riscv/riscv-string.cc
+++ b/gcc/config/riscv/riscv-string.cc
@@ -1078,7 +1078,6 @@ expand_block_move (rtx dst_in, rtx src_in, rtx length_in)
   bool need_loop = true;
   bool size_p = optimize_function_for_size_p (cfun);
   rtx src, dst;
-  rtx end = gen_reg_rtx (Pmode);
   rtx vec;
   rtx length_rtx = length_in;
 
@@ -1245,7 +1244,7 @@ expand_block_move (rtx dst_in, rtx src_in, rtx length_in)
   emit_insn (gen_rtx_SET (length_rtx, gen_rtx_MINUS (Pmode, length_rtx, 
cnt)));
 
   /* Emit the loop condition.  */
-  rtx test = gen_rtx_NE (VOIDmode, end, const0_rtx);
+  rtx test = gen_rtx_NE (VOIDmode, length_rtx, const0_rtx);
   emit_jump_insn (gen_cbranch4 (Pmode, test, length_rtx, const0_rtx, 
label));
   emit_insn (gen_nop ());
 }
-- 
2.43.0



[PATCH 7/7] RISC-V: Disable by pieces for vector setmem length > UNITS_PER_WORD

2024-10-18 Thread Craig Blackmore
For targets with fast unaligned access, by-pieces uses pieces of up to
UNITS_PER_WORD in size, resulting in more store instructions than
needed.  For example, gcc.target/riscv/rvv/base/setmem-1.c:f1 built with
`-O3 -march=rv64gcv -mtune=thead-c906`:
```
f1:
vsetivli        zero,8,e8,mf2,ta,ma
vmv.v.x v1,a1
vsetivli        zero,0,e32,mf2,ta,ma
sb      a1,14(a0)
vmv.x.s a4,v1
vsetivli        zero,8,e16,m1,ta,ma
vmv.x.s a5,v1
vse8.v  v1,0(a0)
sw  a4,8(a0)
sh  a5,12(a0)
ret
```

The slow unaligned access version built with `-O3 -march=rv64gcv` used
15 sb instructions:
```
f1:
sb  a1,0(a0)
sb  a1,1(a0)
sb  a1,2(a0)
sb  a1,3(a0)
sb  a1,4(a0)
sb  a1,5(a0)
sb  a1,6(a0)
sb  a1,7(a0)
sb  a1,8(a0)
sb  a1,9(a0)
sb  a1,10(a0)
sb  a1,11(a0)
sb  a1,12(a0)
sb  a1,13(a0)
sb  a1,14(a0)
ret
```

After this patch, the following is generated in both cases:
```
f1:
vsetivli        zero,15,e8,m1,ta,ma
vmv.v.x v1,a1
vse8.v  v1,0(a0)
ret
```

gcc/ChangeLog:

* config/riscv/riscv.cc (riscv_use_by_pieces_infrastructure_p):
New function.
(TARGET_USE_BY_PIECES_INFRASTRUCTURE_P): Define.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr113469.c: Expect mf2 setmem.
* gcc.target/riscv/rvv/base/setmem-2.c: Update f1 to expect
straight-line vector memset.
* gcc.target/riscv/rvv/base/setmem-3.c: Likewise.
---
 gcc/config/riscv/riscv.cc | 19 +++
 .../gcc.target/riscv/rvv/autovec/pr113469.c   |  3 ++-
 .../gcc.target/riscv/rvv/base/setmem-2.c  | 12 +++-
 .../gcc.target/riscv/rvv/base/setmem-3.c  | 12 +++-
 4 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index e111cb07284..c008b2da3b7 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -12583,6 +12583,22 @@ riscv_stack_clash_protection_alloca_probe_range (void)
   return STACK_CLASH_CALLER_GUARD;
 }
 
+static bool
+riscv_use_by_pieces_infrastructure_p (unsigned HOST_WIDE_INT size,
+ unsigned alignment,
+ enum by_pieces_operation op, bool speed_p)
+{
+  /* For set/clear with size > UNITS_PER_WORD, by pieces uses vector broadcasts
+ with UNITS_PER_WORD size pieces.  Use setmem instead which can use
+ bigger chunks.  */
+  if (TARGET_VECTOR && stringop_strategy & STRATEGY_VECTOR
+  && (op == CLEAR_BY_PIECES || op == SET_BY_PIECES)
+  && speed_p && size > UNITS_PER_WORD)
+return false;
+
+  return default_use_by_pieces_infrastructure_p (size, alignment, op, speed_p);
+}
+
 /* Initialize the GCC target structure.  */
 #undef TARGET_ASM_ALIGNED_HI_OP
 #define TARGET_ASM_ALIGNED_HI_OP "\t.half\t"
@@ -12948,6 +12964,9 @@ riscv_stack_clash_protection_alloca_probe_range (void)
 #undef TARGET_C_MODE_FOR_FLOATING_TYPE
 #define TARGET_C_MODE_FOR_FLOATING_TYPE riscv_c_mode_for_floating_type
 
+#undef TARGET_USE_BY_PIECES_INFRASTRUCTURE_P
+#define TARGET_USE_BY_PIECES_INFRASTRUCTURE_P riscv_use_by_pieces_infrastructure_p
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-riscv.h"
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113469.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113469.c
index d1c118c02d6..f86084bdb40 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113469.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113469.c
@@ -51,4 +51,5 @@ void p(int buf, __builtin_va_list ab, int q) {
  } while (k);
 }
 
-/* { dg-final { scan-assembler-times {vsetivli\tzero,\s*4,\s*e8,\s*mf4,\s*t[au],\s*m[au]} 2 } } */
+/* { dg-final { scan-assembler-times {vsetivli\tzero,\s*4,\s*e8,\s*mf4,\s*t[au],\s*m[au]} 1 } } */
+/* { dg-final { scan-assembler-times {vsetivli\tzero,\s*8,\s*e8,\s*mf2,\s*t[au],\s*m[au]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-2.c b/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-2.c
index 9da1c9309d8..67d62f7193e 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-2.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-2.c
@@ -5,15 +5,17 @@
 
 #define MIN_VECTOR_BYTES (__riscv_v_min_vlen / 8)
 
-/* Small memsets shouldn't be vectorised.
+/* Vectorise with no loop.
 ** f1:
 **  (
-**  sb\s+a1,0\(a0\)
-**  ...
+**  vsetivli\s+zero,\d+,e8,m1,ta,ma
 **  |
-**  li\s+a2,\d+
-**  tail\s+memset
+**  li\s+a\d+,\d+
+**  vsetvli\s+zero,a\d+,e8,m1,ta,ma
 **  )
+**  vmv\.v\.x\s+v\d+,a1
+**  vse8\.v\s+v\d+,0\(a0\)
+**  ret
 */
 void *
 f1 (void *a, int const b)
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-3.c b/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-3.c
index 2111a139ad4..7ade7ef415b 100644
--- a/gcc

[PATCH 5/7] RISC-V: Move vector memcpy decision making to separate function [NFC]

2024-10-18 Thread Craig Blackmore
This moves the code for deciding whether to generate a vectorized
memcpy, what vector mode to use and whether a loop is needed out of
riscv_vector::expand_block_move and into a new function
riscv_vector::use_stringop_p so that it can be reused for other string
operations.

gcc/ChangeLog:

* config/riscv/riscv-string.cc (struct stringop_info): New.
(expand_block_move): Move decision making code to...
(use_vector_stringop_p): ...here.
---
 gcc/config/riscv/riscv-string.cc | 143 +++
 1 file changed, 87 insertions(+), 56 deletions(-)

diff --git a/gcc/config/riscv/riscv-string.cc b/gcc/config/riscv/riscv-string.cc
index 64fd6b29092..118c02a4021 100644
--- a/gcc/config/riscv/riscv-string.cc
+++ b/gcc/config/riscv/riscv-string.cc
@@ -1051,35 +1051,31 @@ riscv_expand_block_clear (rtx dest, rtx length)
 
 namespace riscv_vector {
 
-/* Used by cpymemsi in riscv.md .  */
+struct stringop_info {
+  rtx avl;
+  bool need_loop;
+  machine_mode vmode;
+};
 
-bool
-expand_block_move (rtx dst_in, rtx src_in, rtx length_in, bool movmem_p)
-{
-  /*
-memcpy:
-   mv a3, a0   # Copy destination
-loop:
-   vsetvli t0, a2, e8, m8, ta, ma  # Vectors of 8b
-   vle8.v v0, (a1) # Load bytes
-   add a1, a1, t0  # Bump pointer
-   sub a2, a2, t0  # Decrement count
-   vse8.v v0, (a3) # Store bytes
-   add a3, a3, t0  # Bump pointer
-   bnez a2, loop   # Any more?
-   ret # Return
-  */
-  gcc_assert (TARGET_VECTOR);
+/* If a vectorized stringop should be used populate INFO and return TRUE.
+   Otherwise return false and leave INFO unchanged.
 
-  HOST_WIDE_INT potential_ew
-= (MIN (MIN (MEM_ALIGN (src_in), MEM_ALIGN (dst_in)), BITS_PER_WORD)
-   / BITS_PER_UNIT);
-  machine_mode vmode = VOIDmode;
+   MAX_EW is the maximum element width that the caller wants to use and
+   LENGTH_IN is the length of the stringop in bytes.
+*/
+
+static bool
+use_vector_stringop_p (struct stringop_info &info, HOST_WIDE_INT max_ew,
+  rtx length_in)
+{
   bool need_loop = true;
-  bool size_p = optimize_function_for_size_p (cfun);
-  rtx src, dst;
-  rtx vec;
-  rtx length_rtx = length_in;
+  machine_mode vmode = VOIDmode;
+  /* The number of elements in the stringop.  */
+  rtx avl = length_in;
+  HOST_WIDE_INT potential_ew = max_ew;
+
+  if (!TARGET_VECTOR || !(stringop_strategy & STRATEGY_VECTOR))
+return false;
 
   if (CONST_INT_P (length_in))
 {
@@ -1113,17 +1109,7 @@ expand_block_move (rtx dst_in, rtx src_in, rtx 
length_in, bool movmem_p)
 for small element widths, we might allow larger element widths for
 loops too.  */
   if (need_loop)
-   {
- if (movmem_p)
-   /* Inlining general memmove is a pessimisation: we can't avoid
-  having to decide which direction to go at runtime, which is
-  costly in instruction count however for situations where the
-  entire move fits in one vector operation we can do all reads
-  before doing any writes so we don't have to worry so generate
-  the inline vector code in such situations.  */
-   return false;
- potential_ew = 1;
-   }
+   potential_ew = 1;
   for (; potential_ew; potential_ew >>= 1)
{
  scalar_int_mode elem_mode;
@@ -1193,7 +1179,7 @@ expand_block_move (rtx dst_in, rtx src_in, rtx length_in, 
bool movmem_p)
  gcc_assert (potential_ew > 1);
}
   if (potential_ew > 1)
-   length_rtx = GEN_INT (length / potential_ew);
+   avl = GEN_INT (length / potential_ew);
 }
   else
 {
@@ -1203,35 +1189,80 @@ expand_block_move (rtx dst_in, rtx src_in, rtx 
length_in, bool movmem_p)
   /* A memcpy libcall in the worst case takes 3 instructions to prepare the
  arguments + 1 for the call.  When RVV should take 7 instructions and
  we're optimizing for size a libcall may be preferable.  */
-  if (size_p && need_loop)
+  if (optimize_function_for_size_p (cfun) && need_loop)
 return false;
 
-  /* length_rtx holds the (remaining) length of the required copy.
+  info.need_loop = need_loop;
+  info.vmode = vmode;
+  info.avl = avl;
+  return true;
+}
+
+/* Used by cpymemsi in riscv.md .  */
+
+bool
+expand_block_move (rtx dst_in, rtx src_in, rtx length_in, bool movmem_p)
+{
+  /*
+memcpy:
+   mv a3, a0   # Copy destination
+loop:
+   vsetvli t0, a2, e8, m8, ta, ma  # Vectors of 8b
+   vle8.v v0, (a1) # Load bytes
+   add a1, a1, t0  # Bump pointer
+   sub a2, a2, t0  # Decrement count
+   vse8.v v0, (a3) # Store bytes
+   add a3, a3, t0  # Bump pointer
+   bnez a2, loop   # Any more?
+   ret 

Re: [PATCH] SVE intrinsics: Add fold_active_lanes_to method to refactor svmul and svdiv.

2024-10-18 Thread Jennifer Schmitz


> On 18 Oct 2024, at 10:46, Tamar Christina  wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Thursday, October 17, 2024 6:05 PM
>> To: Jennifer Schmitz 
>> Cc: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ; Tamar
>> Christina 
>> Subject: Re: [PATCH] SVE intrinsics: Add fold_active_lanes_to method to 
>> refactor
>> svmul and svdiv.
>> 
>> Jennifer Schmitz  writes:
 On 16 Oct 2024, at 21:16, Richard Sandiford 
>> wrote:
 
 External email: Use caution opening links or attachments
 
 
 Jennifer Schmitz  writes:
> As suggested in
> https://gcc.gnu.org/pipermail/gcc-patches/2024-September/663275.html,
> this patch adds the method gimple_folder::fold_active_lanes_to (tree X).
> This method folds active lanes to X and sets inactive lanes according to
> the predication, returning a new gimple statement. That makes folding of
> SVE intrinsics easier and reduces code duplication in the
> svxxx_impl::fold implementations.
> Using this new method, svdiv_impl::fold and svmul_impl::fold were 
> refactored.
> Additionally, the method was used for two optimizations:
> 1) Fold svdiv to the dividend, if the divisor is all ones and
> 2) for svmul, if one of the operands is all ones, fold to the other 
> operand.
> Both optimizations were previously applied to _x and _m predication on
> the RTL level, but not for _z, where svdiv/svmul were still being used.
> For both optimization, codegen was improved by this patch, for example by
> skipping sel instructions with all-same operands and replacing sel
> instructions by mov instructions.
> 
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no
>> regression.
> OK for mainline?
> 
> Signed-off-by: Jennifer Schmitz 
> 
> gcc/
> * config/aarch64/aarch64-sve-builtins-base.cc (svdiv_impl::fold):
> Refactor using fold_active_lanes_to and fold to dividend, is the
> divisor is all ones.
> (svmul_impl::fold): Refactor using fold_active_lanes_to and fold
> to the other operand, if one of the operands is all ones.
> * config/aarch64/aarch64-sve-builtins.h: Declare
> gimple_folder::fold_active_lanes_to (tree).
> * config/aarch64/aarch64-sve-builtins.cc
> (gimple_folder::fold_actives_lanes_to): Add new method to fold
> actives lanes to given argument and setting inactives lanes
> according to the predication.
> 
> gcc/testsuite/
> * gcc.target/aarch64/sve/acle/asm/div_s32.c: Adjust expected outcome.
> * gcc.target/aarch64/sve/acle/asm/div_s64.c: Likewise.
> * gcc.target/aarch64/sve/acle/asm/div_u32.c: Likewise.
> * gcc.target/aarch64/sve/acle/asm/div_u64.c: Likewise.
> * gcc.target/aarch64/sve/fold_div_zero.c: Likewise.
> * gcc.target/aarch64/sve/acle/asm/mul_s16.c: New test.
> * gcc.target/aarch64/sve/acle/asm/mul_s32.c: Likewise.
> * gcc.target/aarch64/sve/acle/asm/mul_s64.c: Likewise.
> * gcc.target/aarch64/sve/acle/asm/mul_s8.c: Likewise.
> * gcc.target/aarch64/sve/acle/asm/mul_u16.c: Likewise.
> * gcc.target/aarch64/sve/acle/asm/mul_u32.c: Likewise.
> * gcc.target/aarch64/sve/acle/asm/mul_u64.c: Likewise.
> * gcc.target/aarch64/sve/acle/asm/mul_u8.c: Likewise.
> * gcc.target/aarch64/sve/mul_const_run.c: Likewise.
 
 Thanks, this looks great.  Just one comment on the tests:
 
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_s32.c
>> b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_s32.c
> index d5a23bf0726..521f8bb4758 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_s32.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_s32.c
> @@ -57,7 +57,6 @@ TEST_UNIFORM_ZX (div_w0_s32_m_untied, svint32_t,
>> int32_t,
> 
> /*
> ** div_1_s32_m_tied1:
> -**   sel z0\.s, p0, z0\.s, z0\.s
> **   ret
> */
> TEST_UNIFORM_Z (div_1_s32_m_tied1, svint32_t,
> @@ -66,7 +65,7 @@ TEST_UNIFORM_Z (div_1_s32_m_tied1, svint32_t,
> 
> /*
> ** div_1_s32_m_untied:
> -**   sel z0\.s, p0, z1\.s, z1\.s
> +**   mov z0\.d, z1\.d
> **   ret
> */
> TEST_UNIFORM_Z (div_1_s32_m_untied, svint32_t,
> @@ -217,9 +216,8 @@ TEST_UNIFORM_ZX (div_w0_s32_z_untied,
>> svint32_t, int32_t,
> 
> /*
> ** div_1_s32_z_tied1:
> -**   mov (z[0-9]+\.s), #1
> -**   movprfx z0\.s, p0/z, z0\.s
> -**   sdiv    z0\.s, p0/m, z0\.s, \1
> +**   mov (z[0-9]+)\.b, #0
> +**   sel z0\.s, p0, z0\.s, \1\.s
> **   ret
> */
> TEST_UNIFORM_Z (div_1_s32_z_tied1, svint32_t,
 
 Tamar will soon push a patch to change how we generate zeros.
 Part of that will involve rewriting existing patterns to be more
 forgivin

Re: [PATCH 5/9] Generalise ((X /[ex] A) +- B) * A -> X +- A * B rule

2024-10-18 Thread Richard Biener
On Fri, 18 Oct 2024, Richard Sandiford wrote:

> match.pd had a rule to simplify ((X /[ex] A) +- B) * A -> X +- A * B
> when A and B are INTEGER_CSTs.  This patch extends it to handle the
> case where the outer multiplication is by a factor of A, not just
> A itself.  It also handles addition and multiplication of poly_ints.
> (Exact division by a poly_int seems unlikely.)
> 
> I'm not sure why minus is handled here.  Wouldn't minus of a constant be
> canonicalised to a plus?

All but A - INT_MIN, yes.  For A - INT_MIN we'd know A == INT_MIN.
For unsigned we canonicalize all constants IIRC.  So I agree the
minus case can go away.

OK unchanged or with the minus removed.

Thanks,
Richard.

> gcc/
>   * match.pd: Generalise ((X /[ex] A) +- B) * A -> X +- A * B rule
>   to ((X /[ex] C1) +- C2) * (C1 * C3) -> (X * C3) +- (C1 * C2 * C3).
> 
> gcc/testsuite/
>   * gcc.dg/tree-ssa/mulexactdiv-5.c: New test.
>   * gcc.dg/tree-ssa/mulexactdiv-6.c: Likewise.
>   * gcc.dg/tree-ssa/mulexactdiv-7.c: Likewise.
>   * gcc.dg/tree-ssa/mulexactdiv-8.c: Likewise.
>   * gcc.target/aarch64/sve/cnt_fold_3.c: Likewise.
> ---
>  gcc/match.pd  | 38 +++-
>  gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-5.c | 29 +
>  gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-6.c | 59 +++
>  gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-7.c | 22 +++
>  gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-8.c | 20 +++
>  .../gcc.target/aarch64/sve/cnt_fold_3.c   | 40 +
>  6 files changed, 194 insertions(+), 14 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-5.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-6.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-7.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-8.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_3.c
> 
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 6677bc06d80..268316456c3 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -5493,24 +5493,34 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>   optab_vector)))
> (eq (trunc_mod @0 @1) { build_zero_cst (TREE_TYPE (@0)); })))
>  
> -/* ((X /[ex] A) +- B) * A  -->  X +- A * B.  */
> +/* ((X /[ex] C1) +- C2) * (C1 * C3)  -->  (X * C3) +- (C1 * C2 * C3).  */
>  (for op (plus minus)
>   (simplify
> -  (mult (convert1? (op (convert2? (exact_div @0 INTEGER_CST@@1)) 
> INTEGER_CST@2)) @1)
> -  (if (tree_nop_conversion_p (type, TREE_TYPE (@2))
> -   && tree_nop_conversion_p (TREE_TYPE (@0), TREE_TYPE (@2)))
> -   (with
> - {
> -   wi::overflow_type overflow;
> -   wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2),
> -TYPE_SIGN (type), &overflow);
> - }
> +  (mult (convert1? (op (convert2? (exact_div @0 INTEGER_CST@1))
> +poly_int_tree_p@2))
> + poly_int_tree_p@3)
> +  (with { poly_widest_int factor; }
> +   (if (tree_nop_conversion_p (type, TREE_TYPE (@2))
> + && tree_nop_conversion_p (TREE_TYPE (@0), TREE_TYPE (@2))
> + && multiple_p (wi::to_poly_widest (@3), wi::to_widest (@1), &factor))
> +(with
> +  {
> + wi::overflow_type overflow;
> +wide_int mul;
> +  }
>   (if (types_match (type, TREE_TYPE (@2))
> -  && types_match (TREE_TYPE (@0), TREE_TYPE (@2)) && !overflow)
> -  (op @0 { wide_int_to_tree (type, mul); })
> +   && types_match (TREE_TYPE (@0), TREE_TYPE (@2))
> +   && TREE_CODE (@2) == INTEGER_CST
> +   && TREE_CODE (@3) == INTEGER_CST
> +   && (mul = wi::mul (wi::to_wide (@2), wi::to_wide (@3),
> +  TYPE_SIGN (type), &overflow),
> +   !overflow))
> +  (op (mult @0 { wide_int_to_tree (type, factor); })
> +   { wide_int_to_tree (type, mul); })
>(with { tree utype = unsigned_type_for (type); }
> -   (convert (op (convert:utype @0)
> - (mult (convert:utype @1) (convert:utype @2))
> +   (convert (op (mult (convert:utype @0)
> +   { wide_int_to_tree (utype, factor); })
> + (mult (convert:utype @3) (convert:utype @2)))
>  
>  /* Canonicalization of binary operations.  */
>  
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-5.c b/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-5.c
> new file mode 100644
> index 000..37cd676fff6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-5.c
> @@ -0,0 +1,29 @@
> +/* { dg-options "-O2 -fdump-tree-optimized-raw" } */
> +
> +#define TEST_CMP(FN, DIV, ADD, MUL)  \
> +  int\
> +  FN (int x) \
> +  {  \
> +if (x & 7)   \
> +  __builtin_unreachable ();  \
> +x /= DIV;   

Re: [PATCH 2/9] Use get_nonzero_bits to simplify trunc_div to exact_div

2024-10-18 Thread Richard Biener
On Fri, 18 Oct 2024, Richard Sandiford wrote:

> There are a limited number of existing rules that benefit from
> knowing that a division is exact.  Later patches will add more.

OK.

Thanks,
Richard.

> gcc/
>   * match.pd: Simplify X / (1 << C) to X /[ex] (1 << C) if the
>   low C bits of X are clear
> 
> gcc/testsuite/
>   * gcc.dg/tree-ssa/cmpexactdiv-6.c: New test.
> ---
>  gcc/match.pd  |  9 ++
>  gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-6.c | 29 +++
>  2 files changed, 38 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-6.c
> 
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 4aea028a866..b952225b08c 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -5431,6 +5431,15 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>   TYPE_PRECISION (type)), 0))
> (convert @0)))
>  
> +#if GIMPLE
> +/* X / (1 << C) -> X /[ex] (1 << C) if the low C bits of X are clear.  */
> +(simplify
> + (trunc_div (with_possible_nonzero_bits2 @0) integer_pow2p@1)
> + (if (INTEGRAL_TYPE_P (type)
> +  && !TYPE_UNSIGNED (type)
> +  && wi::multiple_of_p (get_nonzero_bits (@0), wi::to_wide (@1), SIGNED))
> +  (exact_div @0 @1)))
> +#endif
>  
>  /* (X /[ex] A) * A -> X.  */
>  (simplify
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-6.c b/gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-6.c
> new file mode 100644
> index 000..82d517b05ab
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-6.c
> @@ -0,0 +1,29 @@
> +/* { dg-options "-O2 -fdump-tree-optimized-raw" } */
> +
> +typedef __INTPTR_TYPE__ intptr_t;
> +
> +int
> +f1 (int x, int y)
> +{
> +  if ((x & 1) || (y & 1))
> +__builtin_unreachable ();
> +  x /= 2;
> +  y /= 2;
> +  return x < y;
> +}
> +
> +int
> +f2 (void *ptr1, void *ptr2, void *ptr3)
> +{
> +  ptr1 = __builtin_assume_aligned (ptr1, 4);
> +  ptr2 = __builtin_assume_aligned (ptr2, 4);
> +  ptr3 = __builtin_assume_aligned (ptr3, 4);
> +  intptr_t diff1 = (intptr_t) ptr1 - (intptr_t) ptr2;
> +  intptr_t diff2 = (intptr_t) ptr1 - (intptr_t) ptr3;
> +  diff1 /= 2;
> +  diff2 /= 2;
> +  return diff1 < diff2;
> +}
> +
> +/* { dg-final { scan-tree-dump-not {<[a-z]*_div_expr,} "optimized" } } */
> +/* { dg-final { scan-tree-dump-not { 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


[PATCH] diagnostics: libcpp: Improve locations for _Pragma lexing diagnostics [PR114423]

2024-10-18 Thread Lewis Hyatt
Hello-

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114423

The diagnostics we issue while lexing tokens from a _Pragma string have
always come out at invalid locations. I had tried a couple years ago to
fix this in a general way, but I think that ended up being too invasive a
change to fix a problem that's pretty minor in practice, and it never got
over the finish line. Here is a simple patch that improves the situation
and addresses the recently filed PR on that topic, hopefully this
incremental improvement is a better way to make some progress on this?

It just adds a way for libcpp to override the location for all diagnostics
temporarily, so that the diagnostics issued while lexing from a _Pragma
string are issued at a real location (the location of the _Pragma token)
and not a bogus one. That's a lot simpler than trying to arrange to
produce valid locations when lexing tokens from an internal
buffer. Bootstrap + regtest all languages on x86-64 Linux, tweaked a few
existing tests to adjust to the new locations. OK for trunk? Thanks!

-Lewis

-- >8 --

libcpp is not currently set up to be able to generate valid
locations for tokens lexed from a _Pragma string. Instead, after obtaining
the tokens, it sets their locations all to the location of the _Pragma
operator itself. This makes things like _Pragma("GCC diagnostic") work well
enough, but if any diagnostics are issued during lexing, prior to resetting
the token locations, those diagnostics get issued at the invalid
locations. Fix that up by adding a new field pfile->diagnostic_override_loc
that instructs libcpp to issue diagnostics at the alternate location.
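
As a minimal illustration (an input of the kind discussed in the review
further below, not a testcase from this patch): lexing a _Pragma string
containing a non-NFC character triggers -Wnormalized during lexing, and
the warning is now reported at the location of the _Pragma token itself:

  _Pragma("ோ")  /* warning: '\U0bc7\U0bbe' is not in NFC [-Wnormalized=] */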

libcpp/ChangeLog:

PR preprocessor/114423
* internal.h (struct cpp_reader): Add DIAGNOSTIC_OVERRIDE_LOC
field.
* directives.cc (destringize_and_run): Set the new field to the
location of the _Pragma operator.
* errors.cc (cpp_diagnostic_at): Support DIAGNOSTIC_OVERRIDE_LOC to
temporarily issue diagnostics at a different location.
(cpp_diagnostic_with_line): Likewise.

gcc/testsuite/ChangeLog:

PR preprocessor/114423
* c-c++-common/cpp/pragma-diagnostic-loc.c: New test.
* c-c++-common/cpp/diagnostic-pragma-1.c: Adjust expected output.
* g++.dg/pch/operator-1.C: Likewise.
---
 libcpp/directives.cc  |  7 +++
 libcpp/errors.cc  | 19 +--
 libcpp/internal.h |  4 
 .../c-c++-common/cpp/diagnostic-pragma-1.c|  9 -
 .../c-c++-common/cpp/pragma-diagnostic-loc.c  | 17 +
 gcc/testsuite/g++.dg/pch/operator-1.C |  8 +++-
 6 files changed, 52 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/c-c++-common/cpp/pragma-diagnostic-loc.c

diff --git a/libcpp/directives.cc b/libcpp/directives.cc
index 9d235fa1b05..5706c28b835 100644
--- a/libcpp/directives.cc
+++ b/libcpp/directives.cc
@@ -2430,6 +2430,12 @@ destringize_and_run (cpp_reader *pfile, const cpp_string 
*in,
   pfile->buffer->file = pfile->buffer->prev->file;
   pfile->buffer->sysp = pfile->buffer->prev->sysp;
 
+  /* See comment below regarding the use of expansion_loc as the location
+ for all tokens; arrange here that diagnostics issued during lexing
+ get the same treatment.  */
+  const auto prev_loc_override = pfile->diagnostic_override_loc;
+  pfile->diagnostic_override_loc = expansion_loc;
+
   start_directive (pfile);
   _cpp_clean_line (pfile);
   save_directive = pfile->directive;
@@ -2497,6 +2503,7 @@ destringize_and_run (cpp_reader *pfile, const cpp_string 
*in,
  make that applicable to the real buffer too.  */
   pfile->buffer->prev->sysp = pfile->buffer->sysp;
   _cpp_pop_buffer (pfile);
+  pfile->diagnostic_override_loc = prev_loc_override;
 
   /* Reset the old macro state before ...  */
   XDELETE (pfile->context);
diff --git a/libcpp/errors.cc b/libcpp/errors.cc
index ad45f61913c..96fc165c12a 100644
--- a/libcpp/errors.cc
+++ b/libcpp/errors.cc
@@ -60,13 +60,14 @@ cpp_diagnostic_at (cpp_reader * pfile, enum 
cpp_diagnostic_level level,
   enum cpp_warning_reason reason, rich_location *richloc,
   const char *msgid, va_list *ap)
 {
-  bool ret;
-
   if (!pfile->cb.diagnostic)
 abort ();
-  ret = pfile->cb.diagnostic (pfile, level, reason, richloc, _(msgid), ap);
-
-  return ret;
+  if (pfile->diagnostic_override_loc && level != CPP_DL_NOTE)
+{
+  rich_location rc2{pfile->line_table, pfile->diagnostic_override_loc};
+  return pfile->cb.diagnostic (pfile, level, reason, &rc2, _(msgid), ap);
+}
+  return pfile->cb.diagnostic (pfile, level, reason, richloc, _(msgid), ap);
 }
 
 /* Print a diagnostic at the location of the previously lexed token.  */
@@ -201,8 +202,14 @@ cpp_diagnostic_with_line (cpp_reader * pfile, enum 
cpp_diagnostic_level level,
   
   if (!pfile->cb.diagnostic)
 abort ();
+  /* Don't override note locat

[PATCH v2 6/8] gcn: Add else operand to masked loads.

2024-10-18 Thread Robin Dapp
This patch adds an undefined else operand to the masked loads.

gcc/ChangeLog:

* config/gcn/predicates.md (maskload_else_operand): New
predicate.
* config/gcn/gcn-valu.md: Use new predicate.
---
 gcc/config/gcn/gcn-valu.md   | 12 
 gcc/config/gcn/predicates.md |  2 ++
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
index cb2f4a78035..15e9fe8da40 100644
--- a/gcc/config/gcn/gcn-valu.md
+++ b/gcc/config/gcn/gcn-valu.md
@@ -3989,7 +3989,8 @@ (define_expand "while_ultsidi"
 (define_expand "maskloaddi"
   [(match_operand:V_MOV 0 "register_operand")
(match_operand:V_MOV 1 "memory_operand")
-   (match_operand 2 "")]
+   (match_operand 2 "")
+   (match_operand:V_MOV 3 "maskload_else_operand")]
   ""
   {
 rtx exec = force_reg (DImode, operands[2]);
@@ -3998,9 +3999,6 @@ (define_expand "maskloaddi"
 rtx as = gen_rtx_CONST_INT (VOIDmode, MEM_ADDR_SPACE (operands[1]));
 rtx v = gen_rtx_CONST_INT (VOIDmode, MEM_VOLATILE_P (operands[1]));
 
-/* Masked lanes are required to hold zero.  */
-emit_move_insn (operands[0], gcn_vec_constant (mode, 0));
-
 emit_insn (gen_gather_expr_exec (operands[0], addr, as, v,
   operands[0], exec));
 DONE;
@@ -4027,7 +4025,8 @@ (define_expand "mask_gather_load"
(match_operand: 2 "register_operand")
(match_operand 3 "immediate_operand")
(match_operand:SI 4 "gcn_alu_operand")
-   (match_operand:DI 5 "")]
+   (match_operand:DI 5 "")
+   (match_operand:V_MOV 6 "maskload_else_operand")]
   ""
   {
 rtx exec = force_reg (DImode, operands[5]);
@@ -4036,9 +4035,6 @@ (define_expand "mask_gather_load"
  operands[2], operands[4],
  INTVAL (operands[3]), exec);
 
-/* Masked lanes are required to hold zero.  */
-emit_move_insn (operands[0], gcn_vec_constant (mode, 0));
-
 if (GET_MODE (addr) == mode)
   emit_insn (gen_gather_insn_1offset_exec (operands[0], addr,
 const0_rtx, const0_rtx,
diff --git a/gcc/config/gcn/predicates.md b/gcc/config/gcn/predicates.md
index 3f59396a649..21beeb586a4 100644
--- a/gcc/config/gcn/predicates.md
+++ b/gcc/config/gcn/predicates.md
@@ -228,3 +228,5 @@ (define_predicate "ascending_zero_int_parallel"
   return gcn_stepped_zero_int_parallel_p (op, 1);
 })
 
+(define_predicate "maskload_else_operand"
+  (match_operand 0 "scratch_operand"))
-- 
2.46.2



Re: [PATCH v16b 3/4] gcc/: Merge definitions of array_type_nelts_top

2024-10-18 Thread Joseph Myers
On Wed, 16 Oct 2024, Alejandro Colomar wrote:

> There were two identical definitions, and none of them are available
> where they are needed for implementing __nelementsof__.  Merge them, and
> provide the single definition in gcc/tree.{h,cc}, where it's available
> for __nelementsof__, which will be added in the following commit.

Thanks, committed.

-- 
Joseph S. Myers
josmy...@redhat.com



[committed] c: Fix -std=gnu23 -Wtraditional for () in function definitions

2024-10-18 Thread Joseph Myers
We don't yet have clear agreement on removing -Wtraditional (although
it seems there is little to no use for most of the warnings therein),
so fix the bug in its interaction with -std=gnu23 to continue progress
on making -std=gnu23 the default while -Wtraditional remains under
discussion.

The warning for ISO C function definitions with -Wtraditional properly
covers (void), but also wrongly warned for () in C23 mode as that has
the same semantics as (void) in that case.  Keep track in c_arg_info
of when () was converted to (void) for C23 so that -Wtraditional can
avoid warning in that case (with an appropriate comment on the
definition of the new field to make clear it can be removed along with
-Wtraditional).

Bootstrapped with no regressions for x86_64-pc-linux-gnu.

gcc/c/
* c-tree.h (c_arg_info): Add c23_empty_parens.
* c-decl.cc (grokparms): Set c23_empty_parens.
(build_arg_info): Clear c23_empty_parens.
(store_parm_decls_newstyle): Do not give -Wtraditional warning for
ISO C function definition if c23_empty_parens.

gcc/testsuite/
* gcc.dg/wtr-gnu17-1.c, gcc.dg/wtr-gnu23-1.c: New tests.

diff --git a/gcc/c/c-decl.cc b/gcc/c/c-decl.cc
index 491c24b9fe7..3733ecfc13f 100644
--- a/gcc/c/c-decl.cc
+++ b/gcc/c/c-decl.cc
@@ -8519,7 +8519,10 @@ grokparms (struct c_arg_info *arg_info, bool 
funcdef_flag)
  && !arg_types
  && !arg_info->parms
  && !arg_info->no_named_args_stdarg_p)
-   arg_types = arg_info->types = void_list_node;
+   {
+ arg_types = arg_info->types = void_list_node;
+ arg_info->c23_empty_parens = 1;
+   }
 
   /* If there is a parameter of incomplete type in a definition,
 this is an error.  In a declaration this is valid, and a
@@ -8589,6 +8592,7 @@ build_arg_info (void)
   ret->pending_sizes = NULL;
   ret->had_vla_unspec = 0;
   ret->no_named_args_stdarg_p = 0;
+  ret->c23_empty_parens = 0;
   return ret;
 }
 
@@ -10923,7 +10927,8 @@ store_parm_decls_newstyle (tree fndecl, const struct 
c_arg_info *arg_info)
  its parameter list).  */
   else if (!in_system_header_at (input_location)
   && !current_function_scope
-  && arg_info->types != error_mark_node)
+  && arg_info->types != error_mark_node
+  && !arg_info->c23_empty_parens)
 warning_at (DECL_SOURCE_LOCATION (fndecl), OPT_Wtraditional,
"traditional C rejects ISO C style function definitions");
 
diff --git a/gcc/c/c-tree.h b/gcc/c/c-tree.h
index bfdcb78bbcc..a1435e7cb0c 100644
--- a/gcc/c/c-tree.h
+++ b/gcc/c/c-tree.h
@@ -525,6 +525,10 @@ struct c_arg_info {
   BOOL_BITFIELD had_vla_unspec : 1;
   /* True when the arguments are a (...) prototype.  */
   BOOL_BITFIELD no_named_args_stdarg_p : 1;
+  /* True when empty parentheses have been interpreted as (void) in C23 or
+ later.  This is only for use by -Wtraditional and is no longer needed if
+ -Wtraditional is removed.  */
+  BOOL_BITFIELD c23_empty_parens : 1;
 };
 
 /* A declarator.  */
diff --git a/gcc/testsuite/gcc.dg/wtr-gnu17-1.c 
b/gcc/testsuite/gcc.dg/wtr-gnu17-1.c
new file mode 100644
index 000..74c06e4aa4c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/wtr-gnu17-1.c
@@ -0,0 +1,9 @@
+/* Test -Wtraditional -std=gnu17 does not warn for empty parentheses in
+   function definition.  */
+/* { dg-do compile } */
+/* { dg-options "-Wtraditional -std=gnu17" } */
+
+void
+f ()
+{
+}
diff --git a/gcc/testsuite/gcc.dg/wtr-gnu23-1.c 
b/gcc/testsuite/gcc.dg/wtr-gnu23-1.c
new file mode 100644
index 000..207e7c59d27
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/wtr-gnu23-1.c
@@ -0,0 +1,9 @@
+/* Test -Wtraditional -std=gnu23 does not warn for empty parentheses in
+   function definition.  */
+/* { dg-do compile } */
+/* { dg-options "-Wtraditional -std=gnu23" } */
+
+void
+f ()
+{
+}

-- 
Joseph S. Myers
josmy...@redhat.com



[RFC/RFA][PATCH v5 00/12] CRC optimization.

2024-10-18 Thread Mariam Arutunian
Hello,

This patch series is a respin of the following:
https://gcc.gnu.org/pipermail/gcc-patches/2024-September/662961.html.
Although I sent [PATCH v4 00/12] to the mailing list, it didn’t appear in
the archives, so I've provided the link to the first patch ([PATCH v4
01/12]). The original patch set can be found here:
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652610.html.

I have addressed all the feedback from the previous series, except for the
emit_crc function in patch 01/12, which I am still working on.

Thanks,
Mariam


[RFC/RFA] [PATCH v5 08/12] Add a new pass for naive CRC loops detection.

2024-10-18 Thread Mariam Arutunian
This patch adds a new compiler pass aimed at identifying naive CRC
implementations,
characterized by the presence of a loop calculating a CRC (polynomial long
division).
Upon detection of a potential CRC, the pass prints an informational message.

Performs CRC optimization if the optimization level is >= 2, unless
-fno-optimize-crc is given.

This pass is added for the detection and optimization of naive CRC
implementations,
improving the efficiency of CRC-related computations.

This patch includes only the initial fast checks for filtering out non-CRCs;
the verification of detected CRC candidates and the optimization parts will
be provided in subsequent patches.
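
For reference, a typical naive loop of the kind this pass is meant to
detect (an illustrative bit-reversed CRC-16 step, assuming the <stdint.h>
types; the polynomial and shift direction vary between CRC variants):

  uint16_t
  crc16_update (uint16_t crc, uint8_t data)
  {
    crc ^= data;
    for (int i = 0; i < 8; i++)
      /* Polynomial long division over GF(2), one bit per iteration.  */
      crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : crc >> 1;
    return crc;
  }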

  gcc/

* Makefile.in (OBJS): Add gimple-crc-optimization.o.
* common.opt (foptimize-crc): New option.
* common.opt.urls: Regenerate to add foptimize-crc.
* doc/invoke.texi (-foptimize-crc): Add documentation.
* gimple-crc-optimization.cc: New file.
* opts.cc (default_options_table): Add OPT_foptimize_crc.
(enable_fdo_optimizations): Enable optimize_crc.
* passes.def (pass_crc_optimization): Add new pass.
* timevar.def (TV_GIMPLE_CRC_OPTIMIZATION): New timevar.
* tree-pass.h (make_pass_crc_optimization): New extern function
declaration.

Signed-off-by: Mariam Arutunian 
Mentored-by: Jeff Law 
---
 gcc/Makefile.in|1 +
 gcc/common.opt |   10 +
 gcc/common.opt.urls|3 +
 gcc/doc/invoke.texi|   16 +-
 gcc/gimple-crc-optimization.cc | 1000 
 gcc/opts.cc|2 +
 gcc/passes.def |1 +
 gcc/timevar.def|1 +
 gcc/tree-pass.h|1 +
 9 files changed, 1034 insertions(+), 1 deletion(-)
 create mode 100644 gcc/gimple-crc-optimization.cc

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 68fda1a7591..a7054eda9c5 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1717,6 +1717,7 @@ OBJS = \
 	tree-iterator.o \
 	tree-logical-location.o \
 	tree-loop-distribution.o \
+	gimple-crc-optimization.o \
 	tree-nested.o \
 	tree-nrv.o \
 	tree-object-size.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index ea39f87ae71..8395c100fe0 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2399,6 +2399,16 @@ fsave-optimization-record
 Common Var(flag_save_optimization_record) Optimization
 Write a SRCFILE.opt-record.json file detailing what optimizations were performed.
 
+foptimize-crc
+Common Var(flag_optimize_crc) Optimization
+Detect loops calculating CRC and replace with faster implementation.
+If the target supports CRC instruction and the CRC loop uses the same
+polynomial as the one used in the CRC instruction, directly replace with the
+corresponding CRC instruction.
+Otherwise, if the target supports carry-less-multiplication instruction,
+generate CRC using it.
+If neither case applies, generate table-based CRC.
+
 foptimize-register-move
 Common Ignore
 Does nothing. Preserved for backward compatibility.
diff --git a/gcc/common.opt.urls b/gcc/common.opt.urls
index b917f90b0ff..8b61a371c05 100644
--- a/gcc/common.opt.urls
+++ b/gcc/common.opt.urls
@@ -1007,6 +1007,9 @@ UrlSuffix(gcc/Developer-Options.html#index-fopt-info)
 fsave-optimization-record
 UrlSuffix(gcc/Developer-Options.html#index-fsave-optimization-record)
 
+foptimize-crc
+UrlSuffix(gcc/Optimize-Options.html#index-foptimize-crc)
+
 foptimize-sibling-calls
 UrlSuffix(gcc/Optimize-Options.html#index-foptimize-sibling-calls)
 
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 019e0a5ca80..2b10e7d4bdc 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -598,7 +598,7 @@ Objective-C and Objective-C++ Dialects}.
 -fno-peephole2  -fno-printf-return-value  -fno-sched-interblock
 -fno-sched-spec  -fno-signed-zeros
 -fno-toplevel-reorder  -fno-trapping-math  -fno-zero-initialized-in-bss
--fomit-frame-pointer  -foptimize-sibling-calls
+-fomit-frame-pointer  -foptimize-crc  -foptimize-sibling-calls
 -fpartial-inlining  -fpeel-loops  -fpredictive-commoning
 -fprefetch-loop-arrays
 -fprofile-correction
@@ -12664,6 +12664,7 @@ also turns on the following optimization flags:
 -fipa-ra  -fipa-sra  -fipa-vrp
 -fisolate-erroneous-paths-dereference
 -flra-remat
+-foptimize-crc
 -foptimize-sibling-calls
 -foptimize-strlen
 -fpartial-inlining
@@ -12828,6 +12829,19 @@ leaf functions.
 
 Enabled by default at @option{-O1} and higher.
 
+@opindex foptimize-crc
+@item -foptimize-crc
+Detect loops calculating CRC (performing polynomial long division) and
+replace them with a faster implementation.  Detect 8, 16, 32, and 64 bit CRC,
+with a constant polynomial without the leading 1 bit,
+for both bit-forward and bit-reversed cases.
+If the target supports a CRC instruction and the polynomial used in the source
+code matches the polynomial used in the CRC instruction, generate that CRC
+instruction.  Otherwise, if the target supports a carry-less-multiplication
+instruction, generate CRC using it; otherwise generate table-based CRC.

[RFC/RFA][PATCH v5 05/12] i386: Implement new expander for efficient CRC computation.

2024-10-18 Thread Mariam Arutunian
This patch introduces two new expanders for the i386 backend,
dedicated to generating optimized code for CRC computations.
The new expanders are designed to leverage specific hardware capabilities
to achieve faster CRC calculations,
particularly using the pclmulqdq or crc32 instructions when supported by
the target architecture.

Expander 1: Bit-Forward CRC (crc4)
For targets that both support the pclmulqdq instruction (TARGET_PCLMUL) and
are 64-bit (TARGET_64BIT),
the expander will generate code that uses the pclmulqdq instruction for CRC
computation.

Expander 2: Bit-Reversed CRC (crc_rev4)
The expander first checks if the target supports the CRC32 instruction set
(TARGET_CRC32)
and the polynomial in use is 0x1EDC6F41 (iSCSI). If the conditions are met,
it emits calls to the corresponding crc32 instruction (crc32b, crc32w, or
crc32l depending on the data size).
If the target does not support crc32 but supports pclmulqdq, it then uses
the pclmulqdq instruction for bit-reversed CRC computation.

Otherwise table-based CRC is generated.
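
As an illustrative sketch (not taken from this patch's testsuite), a
bit-reversed CRC32C step such as the following is expected to map onto the
crc32b instruction when TARGET_CRC32 holds, since 0x1EDC6F41 is the iSCSI
polynomial that instruction implements:

  uint32_t
  crc32c_step (uint32_t crc, uint8_t data)
  {
    /* Last argument is the polynomial without the leading 1 bit.  */
    return __builtin_rev_crc32_data8 (crc, data, 0x1EDC6F41);
  }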

  gcc/config/i386/

* i386-protos.h (ix86_expand_crc_using_pclmul): New extern function
declaration.
(ix86_expand_reversed_crc_using_pclmul):  Likewise.
* i386.cc (ix86_expand_crc_using_pclmul): New function.
(ix86_expand_reversed_crc_using_pclmul):  Likewise.
* i386.md (UNSPEC_CRC, UNSPEC_CRC_REV):  New unspecs.
(SWI124dup): New iterator.
(crc4): New expander for bit-forward CRC.
(crc_rev4): New expander for reversed CRC.

  gcc/testsuite/gcc.target/i386/

* crc-crc32-data16.c: New test.
* crc-crc32-data32.c: Likewise.
* crc-crc32-data8.c: Likewise.
* crc-1-pclmul.c: Likewise.
* crc-10-pclmul.c: Likewise.
* crc-12-pclmul.c: Likewise.
* crc-13-pclmul.c: Likewise.
* crc-14-pclmul.c: Likewise.
* crc-17-pclmul.c: Likewise.
* crc-18-pclmul.c: Likewise.
* crc-21-pclmul.c: Likewise.
* crc-22-pclmul.c: Likewise.
* crc-23-pclmul.c: Likewise.
* crc-4-pclmul.c: Likewise.
* crc-5-pclmul.c: Likewise.
* crc-6-pclmul.c: Likewise.
* crc-7-pclmul.c: Likewise.
* crc-8-pclmul.c: Likewise.
* crc-9-pclmul.c: Likewise.
* crc-CCIT-data16-pclmul.c: Likewise.
* crc-CCIT-data8-pclmul.c: Likewise.
* crc-coremark-16bitdata-pclmul.c: Likewise.

Signed-off-by: Mariam Arutunian 
Mentored-by: Jeff Law 
---
 gcc/config/i386/i386-protos.h |   2 +
 gcc/config/i386/i386.cc   | 129 ++
 gcc/config/i386/i386.md   |  59 
 gcc/testsuite/gcc.target/i386/crc-1-pclmul.c  |   8 ++
 gcc/testsuite/gcc.target/i386/crc-10-pclmul.c |   8 ++
 gcc/testsuite/gcc.target/i386/crc-12-pclmul.c |   9 ++
 gcc/testsuite/gcc.target/i386/crc-13-pclmul.c |   8 ++
 gcc/testsuite/gcc.target/i386/crc-14-pclmul.c |   8 ++
 gcc/testsuite/gcc.target/i386/crc-17-pclmul.c |   8 ++
 gcc/testsuite/gcc.target/i386/crc-18-pclmul.c |   8 ++
 gcc/testsuite/gcc.target/i386/crc-21-pclmul.c |   8 ++
 gcc/testsuite/gcc.target/i386/crc-22-pclmul.c |   8 ++
 gcc/testsuite/gcc.target/i386/crc-23-pclmul.c |   8 ++
 gcc/testsuite/gcc.target/i386/crc-4-pclmul.c  |   8 ++
 gcc/testsuite/gcc.target/i386/crc-5-pclmul.c  |   9 ++
 gcc/testsuite/gcc.target/i386/crc-6-pclmul.c  |   8 ++
 gcc/testsuite/gcc.target/i386/crc-7-pclmul.c  |   8 ++
 gcc/testsuite/gcc.target/i386/crc-8-pclmul.c  |   8 ++
 gcc/testsuite/gcc.target/i386/crc-9-pclmul.c  |   8 ++
 .../gcc.target/i386/crc-CCIT-data16-pclmul.c  |   9 ++
 .../gcc.target/i386/crc-CCIT-data8-pclmul.c   |   9 ++
 .../i386/crc-coremark-16bitdata-pclmul.c  |   9 ++
 .../gcc.target/i386/crc-crc32-data16.c|  53 +++
 .../gcc.target/i386/crc-crc32-data32.c|  53 +++
 .../gcc.target/i386/crc-crc32-data8.c |  53 +++
 25 files changed, 506 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/crc-1-pclmul.c
 create mode 100644 gcc/testsuite/gcc.target/i386/crc-10-pclmul.c
 create mode 100644 gcc/testsuite/gcc.target/i386/crc-12-pclmul.c
 create mode 100644 gcc/testsuite/gcc.target/i386/crc-13-pclmul.c
 create mode 100644 gcc/testsuite/gcc.target/i386/crc-14-pclmul.c
 create mode 100644 gcc/testsuite/gcc.target/i386/crc-17-pclmul.c
 create mode 100644 gcc/testsuite/gcc.target/i386/crc-18-pclmul.c
 create mode 100644 gcc/testsuite/gcc.target/i386/crc-21-pclmul.c
 create mode 100644 gcc/testsuite/gcc.target/i386/crc-22-pclmul.c
 create mode 100644 gcc/testsuite/gcc.target/i386/crc-23-pclmul.c
 create mode 100644 gcc/testsuite/gcc.target/i386/crc-4-pclmul.c
 create mode 100644 gcc/testsuite/gcc.target/i386/crc-5-pclmul.c
 create mode 100644 gcc/testsuite/gcc.target/i386/crc-6-pclmul.c
 create mode 100644 gcc/testsuite/gcc.target/i386/crc-7-pclmul.c
 create mode 100644 gcc/testsuite/gcc.target/i386/crc-8-pclmul.c
 create mode 100644 gcc/testsuite/gcc.target/i386/crc-9-pclmul.c
 create mode 100644 gcc/testsuite/gcc.target/i386/crc-CCIT-data16-pclmul.c
 create mode 100644 gcc/testsuite/gcc.t

[RFC/RFA] [PATCH v5 04/12] RISC-V: Add CRC built-ins tests for the target ZBC.

2024-10-18 Thread Mariam Arutunian
gcc/testsuite/gcc.target/riscv/

* crc-builtin-zbc32.c: New file.
* crc-builtin-zbc64.c: Likewise.

Signed-off-by: Mariam Arutunian 
Mentored-by: Jeff Law 
---
 .../gcc.target/riscv/crc-builtin-zbc32.c  | 21 ++
 .../gcc.target/riscv/crc-builtin-zbc64.c  | 66 +++
 2 files changed, 87 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/crc-builtin-zbc32.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/crc-builtin-zbc64.c

diff --git a/gcc/testsuite/gcc.target/riscv/crc-builtin-zbc32.c b/gcc/testsuite/gcc.target/riscv/crc-builtin-zbc32.c
new file mode 100644
index 000..20d7d25f60e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/crc-builtin-zbc32.c
@@ -0,0 +1,21 @@
+/* { dg-do compile { target { riscv32*-*-* } } } */
+/* { dg-options "-march=rv32gc_zbc" } */
+
+#include <stdint.h>
+
+int8_t crc8_data8 ()
+{
+  return __builtin_crc8_data8 (0x34, 'a', 0x12);
+}
+
+int16_t crc16_data8 ()
+{
+  return __builtin_crc16_data8 (0x1234, 'a', 0x1021);
+}
+
+int16_t crc16_data16 ()
+{
+  return __builtin_crc16_data16 (0x1234, 0x3214, 0x1021);
+}
+
+/* { dg-final { scan-assembler-times "clmul\t" 6 } } */
\ No newline at end of file
diff --git a/gcc/testsuite/gcc.target/riscv/crc-builtin-zbc64.c b/gcc/testsuite/gcc.target/riscv/crc-builtin-zbc64.c
new file mode 100644
index 000..c9509d56d01
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/crc-builtin-zbc64.c
@@ -0,0 +1,66 @@
+/* { dg-do compile { target { riscv64*-*-* } } } */
+/* { dg-options "-march=rv64gc_zbc" } */
+
+#include <stdint.h>
+
+int8_t crc8_data8 ()
+{
+  return __builtin_crc8_data8 (0x34, 'a', 0x12);
+}
+
+int16_t crc16_data8 ()
+{
+  return __builtin_crc16_data8 (0x1234, 'a', 0x1021);
+}
+
+int16_t crc16_data16 ()
+{
+  return __builtin_crc16_data16 (0x1234, 0x3214, 0x1021);
+}
+
+int32_t crc32_data8 ()
+{
+  return __builtin_crc32_data8 (0x, 0x32, 0x4002123);
+}
+
+int32_t crc32_data16 ()
+{
+  return __builtin_crc32_data16 (0x, 0x3232, 0x4002123);
+}
+
+int32_t crc32_data32 ()
+{
+  return __builtin_crc32_data32 (0x, 0x123546ff, 0x4002123);
+}
+
+int8_t rev_crc8_data8 ()
+{
+  return __builtin_rev_crc8_data8 (0x34, 'a', 0x12);
+}
+
+int16_t rev_crc16_data8 ()
+{
+  return __builtin_rev_crc16_data8 (0x1234, 'a', 0x1021);
+}
+
+int16_t rev_crc16_data16 ()
+{
+  return __builtin_rev_crc16_data16 (0x1234, 0x3214, 0x1021);
+}
+
+int32_t rev_crc32_data8 ()
+{
+  return __builtin_rev_crc32_data8 (0x, 0x32, 0x4002123);
+}
+
+int32_t rev_crc32_data16 ()
+{
+  return __builtin_rev_crc32_data16 (0x, 0x3232, 0x4002123);
+}
+
+int32_t rev_crc32_data32 ()
+{
+  return __builtin_rev_crc32_data32 (0x, 0x123546ff, 0x4002123);
+}
+/* { dg-final { scan-assembler-times "clmul\t" 18 } } */
+/* { dg-final { scan-assembler-times "clmulh" 6 } } */
\ No newline at end of file
-- 
2.25.1



[RFC/RFA][PATCH v5 03/12] RISC-V: Add CRC expander to generate faster CRC.

2024-10-18 Thread Mariam Arutunian
If the target has ZBC or ZBKC, the clmul instruction is used for the CRC
calculation.  Otherwise, if the target has ZBKB, table-based CRC is
generated, but the bswap and brev8 instructions are used to reverse the
inputs and the output.  Add new tests to check CRC generation for the
ZBC, ZBKC and ZBKB targets.

  gcc/

 * expr.cc (gf2n_poly_long_div_quotient): New function.
 * expr.h (gf2n_poly_long_div_quotient):  New function declaration.
 * hwint.cc (reflect_hwi): New function.
 * hwint.h (reflect_hwi): New function declaration.

  gcc/config/riscv/

 * bitmanip.md (crc_rev4): New expander for
reversed CRC.
 (crc4): New expander for bit-forward CRC.
 * iterators.md (SUBX1, ANYI1): New iterators.
 * riscv-protos.h (generate_reflecting_code_using_brev): New function
declaration.
 (expand_crc_using_clmul): Likewise.
 (expand_reversed_crc_using_clmul): Likewise.
 * riscv.cc (generate_reflecting_code_using_brev): New function.
 (expand_crc_using_clmul): Likewise.
 (expand_reversed_crc_using_clmul): Likewise.
 * riscv.md (UNSPEC_CRC, UNSPEC_CRC_REV):  New unspecs.

  gcc/testsuite/gcc.target/riscv/

* crc-1-zbc.c: New test.
* crc-1-zbkc.c: Likewise.
* crc-10-zbc.c: Likewise.
* crc-10-zbkc.c: Likewise.
* crc-12-zbc.c: Likewise.
* crc-12-zbkc.c: Likewise.
* crc-13-zbc.c: Likewise.
* crc-13-zbkc.c: Likewise.
* crc-14-zbc.c: Likewise.
* crc-14-zbkc.c: Likewise.
* crc-17-zbc.c: Likewise.
* crc-17-zbkc.c: Likewise.
* crc-18-zbc.c: Likewise.
* crc-18-zbkc.c: Likewise.
* crc-21-zbc.c: Likewise.
* crc-21-zbkc.c: Likewise.
* crc-22-rv64-zbc.c: Likewise.
* crc-22-rv64-zbkb.c: Likewise.
* crc-22-rv64-zbkc.c: Likewise.
* crc-23-zbc.c: Likewise.
* crc-23-zbkc.c: Likewise.
* crc-4-zbc.c: Likewise.
* crc-4-zbkb.c: Likewise.
* crc-4-zbkc.c: Likewise.
* crc-5-zbc.c: Likewise.
* crc-5-zbkb.c: Likewise.
* crc-5-zbkc.c: Likewise.
* crc-6-zbc.c: Likewise.
* crc-6-zbkc.c: Likewise.
* crc-7-zbc.c: Likewise.
* crc-7-zbkc.c: Likewise.
* crc-8-zbc.c: Likewise.
* crc-8-zbkb.c: Likewise.
* crc-8-zbkc.c: Likewise.
* crc-9-zbc.c: Likewise.
* crc-9-zbkc.c: Likewise.
* crc-CCIT-data16-zbc.c: Likewise.
* crc-CCIT-data16-zbkc.c: Likewise.
* crc-CCIT-data8-zbc.c: Likewise.
* crc-CCIT-data8-zbkc.c: Likewise.
* crc-coremark-16bitdata-zbc.c: Likewise.
* crc-coremark-16bitdata-zbkc.c: Likewise.

Signed-off-by: Mariam Arutunian 
Mentored-by: Jeff Law 
---
 gcc/config/riscv/bitmanip.md  |  63 +++
 gcc/config/riscv/iterators.md |   6 +
 gcc/config/riscv/riscv-protos.h   |   3 +
 gcc/config/riscv/riscv.cc | 155 ++
 gcc/config/riscv/riscv.md |   4 +
 gcc/expr.cc   |  27 +++
 gcc/expr.h|   5 +
 gcc/hwint.cc  |  18 ++
 gcc/hwint.h   |   1 +
 gcc/testsuite/gcc.target/riscv/crc-1-zbc.c|  11 ++
 gcc/testsuite/gcc.target/riscv/crc-1-zbkc.c   |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-10-zbc.c   |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-10-zbkc.c  |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-12-zbc.c   |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-12-zbkc.c  |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-13-zbc.c   |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-13-zbkc.c  |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-14-zbc.c   |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-14-zbkc.c  |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-17-zbc.c   |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-17-zbkc.c  |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-18-zbc.c   |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-18-zbkc.c  |  11 ++
 .../gcc.target/riscv/crc-21-rv64-zbc.c|   9 +
 .../gcc.target/riscv/crc-21-rv64-zbkc.c   |   9 +
 gcc/testsuite/gcc.target/riscv/crc-22-zbc.c   |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-22-zbkb.c  |  10 ++
 gcc/testsuite/gcc.target/riscv/crc-22-zbkc.c  |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-23-zbc.c   |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-23-zbkc.c  |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-4-zbc.c|  11 ++
 gcc/testsuite/gcc.target/riscv/crc-4-zbkb.c   |  10 ++
 gcc/testsuite/gcc.target/riscv/crc-4-zbkc.c   |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-5-zbc.c|  11 ++
 gcc/testsuite/gcc.target/riscv/crc-5-zbkb.c   |  10 ++
 gcc/testsuite/gcc.target/riscv/crc-5-zbkc.c   |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-6-zbc.c|  11 ++
 gcc/testsuite/gcc.target/riscv/crc-6-zbkc.c   |  11 ++
 gcc/testsuite/gcc.target/riscv/crc-7-zbc.c|  11 ++
 gcc/testsuite/gcc.target/riscv/crc-7-zbkc.c   |  11 ++
 gcc/testsuite/gcc.ta

[RFC/RFA][PATCH v5 06/12] aarch64: Implement new expander for efficient CRC computation.

2024-10-18 Thread Mariam Arutunian
This patch introduces two new expanders for the aarch64 backend,
dedicated to generating optimized code for CRC computations.
The new expanders are designed to leverage specific hardware capabilities
to achieve faster CRC calculations,
particularly using the crc32, crc32c and pmull instructions when supported
by the target architecture.

Expander 1: Bit-Forward CRC (crc4)
For targets that support the pmull instruction (TARGET_AES),
the expander will generate code that uses the pmull (crypto_pmulldi)
instruction for CRC computation.

Expander 2: Bit-Reversed CRC (crc_rev4)
The expander first checks if the target supports the CRC32* instruction set
(TARGET_CRC32)
and the polynomial in use is 0x1EDC6F41 (iSCSI) or 0x04C11DB7 (HDLC). If
the conditions are met,
it emits calls to the corresponding crc32* instruction (depending on the
data size and the polynomial).
If the target does not support crc32* but supports pmull, it then uses the
pmull (crypto_pmulldi) instruction for bit-reversed CRC computation.
Otherwise table-based CRC is generated.
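
As a similar illustrative sketch (not from this patch's testsuite),
0x04C11DB7 being the HDLC polynomial handled by the crc32* instructions:

  uint32_t
  crc32_step (uint32_t crc, uint8_t data)
  {
    /* Expected to use crc32b under TARGET_CRC32; otherwise pmull or the
       table-based fallback.  */
    return __builtin_rev_crc32_data8 (crc, data, 0x04C11DB7);
  }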

  gcc/config/aarch64/

* aarch64-protos.h (aarch64_expand_crc_using_pmull): New extern
function declaration.
(aarch64_expand_reversed_crc_using_pmull):  Likewise.
* aarch64.cc (aarch64_expand_crc_using_pmull): New function.
(aarch64_expand_reversed_crc_using_pmull):  Likewise.
* aarch64.md (crc_rev4): New expander for
reversed CRC.
(crc4): New expander for bit-forward CRC.
* iterators.md (crc_data_type): New mode attribute.

  gcc/testsuite/gcc.target/aarch64/

* crc-1-pmul.c: New test.
* crc-10-pmul.c: Likewise.
* crc-12-pmul.c: Likewise.
* crc-13-pmul.c: Likewise.
* crc-14-pmul.c: Likewise.
* crc-17-pmul.c: Likewise.
* crc-18-pmul.c: Likewise.
* crc-21-pmul.c: Likewise.
* crc-22-pmul.c: Likewise.
* crc-23-pmul.c: Likewise.
* crc-4-pmul.c: Likewise.
* crc-5-pmul.c: Likewise.
* crc-6-pmul.c: Likewise.
* crc-7-pmul.c: Likewise.
* crc-8-pmul.c: Likewise.
* crc-9-pmul.c: Likewise.
* crc-CCIT-data16-pmul.c: Likewise.
* crc-CCIT-data8-pmul.c: Likewise.
* crc-coremark-16bitdata-pmul.c: Likewise.
* crc-crc32-data16.c: Likewise.
* crc-crc32-data32.c: Likewise.
* crc-crc32-data8.c: Likewise.
* crc-crc32c-data16.c: Likewise.
* crc-crc32c-data32.c: Likewise.
* crc-crc32c-data8.c: Likewise.

Signed-off-by: Mariam Arutunian 
Co-authored-by: Richard Sandiford 
---
 gcc/config/aarch64/aarch64-protos.h   |   3 +
 gcc/config/aarch64/aarch64.cc | 131 ++
 gcc/config/aarch64/aarch64.md |  57 
 gcc/config/aarch64/iterators.md   |   4 +
 gcc/testsuite/gcc.target/aarch64/crc-1-pmul.c |   8 ++
 .../gcc.target/aarch64/crc-10-pmul.c  |   8 ++
 .../gcc.target/aarch64/crc-12-pmul.c  |   9 ++
 .../gcc.target/aarch64/crc-13-pmul.c  |   8 ++
 .../gcc.target/aarch64/crc-14-pmul.c  |   8 ++
 .../gcc.target/aarch64/crc-17-pmul.c  |   8 ++
 .../gcc.target/aarch64/crc-18-pmul.c  |   8 ++
 .../gcc.target/aarch64/crc-21-pmul.c  |   8 ++
 .../gcc.target/aarch64/crc-22-pmul.c  |   8 ++
 .../gcc.target/aarch64/crc-23-pmul.c  |   8 ++
 gcc/testsuite/gcc.target/aarch64/crc-4-pmul.c |   8 ++
 gcc/testsuite/gcc.target/aarch64/crc-5-pmul.c |   8 ++
 gcc/testsuite/gcc.target/aarch64/crc-6-pmul.c |   8 ++
 gcc/testsuite/gcc.target/aarch64/crc-7-pmul.c |   8 ++
 gcc/testsuite/gcc.target/aarch64/crc-8-pmul.c |   8 ++
 gcc/testsuite/gcc.target/aarch64/crc-9-pmul.c |   8 ++
 .../gcc.target/aarch64/crc-CCIT-data16-pmul.c |   9 ++
 .../gcc.target/aarch64/crc-CCIT-data8-pmul.c  |   9 ++
 .../aarch64/crc-coremark-16bitdata-pmul.c |   9 ++
 .../gcc.target/aarch64/crc-crc32-data16.c |  53 +++
 .../gcc.target/aarch64/crc-crc32-data32.c |  52 +++
 .../gcc.target/aarch64/crc-crc32-data8.c  |  53 +++
 .../gcc.target/aarch64/crc-crc32c-data16.c|  53 +++
 .../gcc.target/aarch64/crc-crc32c-data32.c|  52 +++
 .../gcc.target/aarch64/crc-crc32c-data8.c |  53 +++
 29 files changed, 667 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/crc-1-pmul.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/crc-10-pmul.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/crc-12-pmul.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/crc-13-pmul.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/crc-14-pmul.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/crc-17-pmul.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/crc-18-pmul.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/crc-21-pmul.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/crc-22-pmul.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/crc-23-pmul.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/crc-4-pmul.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/crc-5-pmul.c
 create mode 100644 gcc/testsuite/gcc.t

[RFC/RFA][PATCH v5 02/12] Add built-ins and tests for bit-forward and bit-reversed CRCs.

2024-10-18 Thread Mariam Arutunian
This patch introduces new built-in functions to GCC for computing
bit-forward and bit-reversed CRCs.
These builtins aim to provide efficient CRC calculation capabilities.
When the target architecture supports CRC operations (as indicated by the
presence of a CRC optab),
the builtins will utilize the expander to generate CRC code.
In the absence of hardware support, the builtins default to generating code
for a table-based CRC calculation.

The built-ins are defined as follows:
__builtin_rev_crc16_data8,
__builtin_rev_crc32_data8, __builtin_rev_crc32_data16,
__builtin_rev_crc32_data32
__builtin_rev_crc64_data8, __builtin_rev_crc64_data16,
 __builtin_rev_crc64_data32, __builtin_rev_crc64_data64,
__builtin_crc8_data8,
__builtin_crc16_data16, __builtin_crc16_data8,
__builtin_crc32_data8, __builtin_crc32_data16, __builtin_crc32_data32,
__builtin_crc64_data8, __builtin_crc64_data16,  __builtin_crc64_data32,
__builtin_crc64_data64

Each built-in takes three parameters:
crc: The initial CRC value.
data: The data to be processed.
polynomial: The CRC polynomial without the leading 1.
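
A minimal usage sketch (the constants are illustrative; 0x1021 is the
CCITT polynomial without the leading 1 bit):

  #include <stdint.h>

  uint16_t
  crc_next (uint16_t crc, uint8_t byte)
  {
    return __builtin_crc16_data8 (crc, byte, 0x1021);
  }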

To validate the correctness of these built-ins, this patch also includes
additions to the GCC testsuite.
This enhancement allows GCC to offer developers high-performance CRC
computation options
that automatically adapt to the capabilities of the target hardware.

Not complete. May continue the work if these built-ins are needed.

gcc/

 * builtin-types.def (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE): Define.
 (BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Likewise.
  (BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
  (BT_FN_UINT32_UINT32_UINT8_CONST_SIZE): Likewise.
  (BT_FN_UINT32_UINT32_UINT16_CONST_SIZE): Likewise.
  (BT_FN_UINT32_UINT32_UINT32_CONST_SIZE): Likewise.
  (BT_FN_UINT64_UINT64_UINT8_CONST_SIZE): Likewise.
  (BT_FN_UINT64_UINT64_UINT16_CONST_SIZE): Likewise.
  (BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
  (BT_FN_UINT64_UINT64_UINT64_CONST_SIZE): Likewise.
  * builtins.cc (associated_internal_fn): Handle
BUILT_IN_CRC8_DATA8,
  BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16,
  BUILT_IN_CRC32_DATA8, BUILT_IN_CRC32_DATA16,
BUILT_IN_CRC32_DATA32,
  BUILT_IN_CRC64_DATA8, BUILT_IN_CRC64_DATA16,
BUILT_IN_CRC64_DATA32,
  BUILT_IN_CRC64_DATA64,
  BUILT_IN_REV_CRC8_DATA8,
  BUILT_IN_REV_CRC16_DATA8, BUILT_IN_REV_CRC16_DATA16,
  BUILT_IN_REV_CRC32_DATA8, BUILT_IN_REV_CRC32_DATA16,
BUILT_IN_REV_CRC32_DATA32.
  (expand_builtin_crc_table_based): New function.
  (expand_builtin): Handle BUILT_IN_CRC8_DATA8,
  BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16,
  BUILT_IN_CRC32_DATA8, BUILT_IN_CRC32_DATA16,
BUILT_IN_CRC32_DATA32,
  BUILT_IN_CRC64_DATA8, BUILT_IN_CRC64_DATA16,
BUILT_IN_CRC64_DATA32,
  BUILT_IN_CRC64_DATA64,
  BUILT_IN_REV_CRC8_DATA8,
  BUILT_IN_REV_CRC16_DATA8, BUILT_IN_REV_CRC16_DATA16,
  BUILT_IN_REV_CRC32_DATA8, BUILT_IN_REV_CRC32_DATA16,
BUILT_IN_REV_CRC32_DATA32,
  BUILT_IN_REV_CRC64_DATA8, BUILT_IN_REV_CRC64_DATA16,
BUILT_IN_REV_CRC64_DATA32,
  BUILT_IN_REV_CRC64_DATA64.
  * builtins.def (BUILT_IN_CRC8_DATA8): New builtin.
  (BUILT_IN_CRC16_DATA8): Likewise.
  (BUILT_IN_CRC16_DATA16): Likewise.
  (BUILT_IN_CRC32_DATA8): Likewise.
  (BUILT_IN_CRC32_DATA16): Likewise.
  (BUILT_IN_CRC32_DATA32): Likewise.
  (BUILT_IN_CRC64_DATA8): Likewise.
  (BUILT_IN_CRC64_DATA16): Likewise.
  (BUILT_IN_CRC64_DATA32): Likewise.
  (BUILT_IN_CRC64_DATA64): Likewise.
  (BUILT_IN_REV_CRC8_DATA8): New builtin.
  (BUILT_IN_REV_CRC16_DATA8): Likewise.
  (BUILT_IN_REV_CRC16_DATA16): Likewise.
  (BUILT_IN_REV_CRC32_DATA8): Likewise.
  (BUILT_IN_REV_CRC32_DATA16): Likewise.
  (BUILT_IN_REV_CRC32_DATA32): Likewise.
  (BUILT_IN_REV_CRC64_DATA8): Likewise.
  (BUILT_IN_REV_CRC64_DATA16): Likewise.
  (BUILT_IN_REV_CRC64_DATA32): Likewise.
  (BUILT_IN_REV_CRC64_DATA64): Likewise.
  * builtins.h (expand_builtin_crc_table_based): New function
declaration.
  * doc/extend.texti (__builtin_rev_crc8_data8,
 __builtin_rev_crc16_data16, __builtin_rev_crc16_data8,
  __builtin_rev_crc32_data32, __builtin_rev_crc32_data8,
  __builtin_rev_crc32_data16, __builtin_rev_crc64_data64,
  __builtin_rev_crc64_data8, __builtin_rev_crc64_data16,
  __builtin_rev_crc64_data32, __builtin_crc8_data8,
  __builtin_crc16_data16, __builtin_crc16_data8,
  __builtin_crc32_data32, __builtin_crc32_data8,
  __builtin_crc32_data16, __builtin_crc64_data64,
  __builtin_crc64_data8, __builtin_crc64_data16,
  __builtin_crc64_data32): Document.

  gcc/testsuite/

 * gcc.dg/crc-builtin-rev-target32.c
 * gcc.dg/crc-builtin-rev-targ

[RFC/RFA] [PATCH v5 09/12] Add symbolic execution support.

2024-10-18 Thread Mariam Arutunian
Adds the ability to execute code at the bit level,
assigning symbolic values to the variables which don't have initial values.
Only CRC-specific operations are supported.

Example:

uint8_t crc;
uint8_t pol = 1;
crc = crc ^ pol;

during symbolic execution crc's value will be:
crc(7), crc(6), ..., crc(1), crc(0) ^ 1

  gcc/

* Makefile.in (OBJS): Add sym-exec/sym-exec-expression.o,
sym-exec/sym-exec-state.o, sym-exec/sym-exec-condition.o.
* configure (sym-exec): New subdir.

  gcc/sym-exec/

* sym-exec-condition.cc: New file.
* sym-exec-condition.h: New file.
* sym-exec-expression-is-a-helper.h: New file.
* sym-exec-expression.cc: New file.
* sym-exec-expression.h: New file.
* sym-exec-state.cc: New file.
* sym-exec-state.h: New file.

Signed-off-by: Mariam Arutunian 
Author: Matevos Mehrabyan 
Co-authored-by: Mariam Arutunian 
Mentored-by: Jeff Law 
---
 gcc/Makefile.in  |3 +
 gcc/configure|2 +-
 gcc/sym-exec/sym-exec-condition.cc   |   59 +
 gcc/sym-exec/sym-exec-condition.h|   33 +
 gcc/sym-exec/sym-exec-expr-is-a-helper.h |  204 ++
 gcc/sym-exec/sym-exec-expression.cc  |  426 +
 gcc/sym-exec/sym-exec-expression.h   |  260 +++
 gcc/sym-exec/sym-exec-state.cc   | 2148 ++
 gcc/sym-exec/sym-exec-state.h|  436 +
 9 files changed, 3570 insertions(+), 1 deletion(-)
 create mode 100644 gcc/sym-exec/sym-exec-condition.cc
 create mode 100644 gcc/sym-exec/sym-exec-condition.h
 create mode 100644 gcc/sym-exec/sym-exec-expr-is-a-helper.h
 create mode 100644 gcc/sym-exec/sym-exec-expression.cc
 create mode 100644 gcc/sym-exec/sym-exec-expression.h
 create mode 100644 gcc/sym-exec/sym-exec-state.cc
 create mode 100644 gcc/sym-exec/sym-exec-state.h

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index a7054eda9c5..6eab34d62bb 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1718,6 +1718,9 @@ OBJS = \
 	tree-logical-location.o \
 	tree-loop-distribution.o \
 	gimple-crc-optimization.o \
+	sym-exec/sym-exec-expression.o \
+	sym-exec/sym-exec-state.o \
+	sym-exec/sym-exec-condition.o \
 	tree-nested.o \
 	tree-nrv.o \
 	tree-object-size.o \
diff --git a/gcc/configure b/gcc/configure
index 3d301b6ecd3..c781f4c24b6 100755
--- a/gcc/configure
+++ b/gcc/configure
@@ -36232,7 +36232,7 @@ $as_echo "$as_me: executing $ac_file commands" >&6;}
 "depdir":C) $SHELL $ac_aux_dir/mkinstalldirs $DEPDIR ;;
 "gccdepdir":C)
   ${CONFIG_SHELL-/bin/sh} $ac_aux_dir/mkinstalldirs build/$DEPDIR
-  for lang in $subdirs c-family common analyzer text-art rtl-ssa
+  for lang in $subdirs c-family common analyzer text-art rtl-ssa sym-exec
   do
   ${CONFIG_SHELL-/bin/sh} $ac_aux_dir/mkinstalldirs $lang/$DEPDIR
   done ;;
diff --git a/gcc/sym-exec/sym-exec-condition.cc b/gcc/sym-exec/sym-exec-condition.cc
new file mode 100644
index 000..ef3f1e3fda5
--- /dev/null
+++ b/gcc/sym-exec/sym-exec-condition.cc
@@ -0,0 +1,59 @@
+#include "sym-exec-condition.h"
+
+bit_condition::bit_condition (value_bit *left, value_bit *right, tree_code code)
+{
+  this->m_left = left;
+  this->m_right = right;
+  this->m_code = code;
+  m_type = BIT_CONDITION;
+}
+
+
+bit_condition::bit_condition (const bit_condition &expr)
+{
+  bit_expression::copy (&expr);
+  m_code = expr.get_code ();
+}
+
+
+/* Returns the condition's code.  */
+
+tree_code
+bit_condition::get_code () const
+{
+  return m_code;
+}
+
+
+/* Returns a copy of the condition.  */
+
+value_bit *
+bit_condition::copy () const
+{
+  return new bit_condition (*this);
+}
+
+
+/* Prints the condition's sign.  */
+
+void
+bit_condition::print_expr_sign ()
+{
+  switch (m_code)
+{
+  case GT_EXPR:
+	fprintf (dump_file, " > ");
+	break;
+  case LT_EXPR:
+	fprintf (dump_file, " < ");
+	break;
+  case EQ_EXPR:
+	fprintf (dump_file, " == ");
+	break;
+  case NE_EXPR:
+	fprintf (dump_file, " != ");
+	break;
+  default:
+	fprintf (dump_file, " ? ");
+}
+}
\ No newline at end of file
diff --git a/gcc/sym-exec/sym-exec-condition.h b/gcc/sym-exec/sym-exec-condition.h
new file mode 100644
index 000..1d9d59512bb
--- /dev/null
+++ b/gcc/sym-exec/sym-exec-condition.h
@@ -0,0 +1,33 @@
+#ifndef SYM_EXEC_CONDITION_H
+#define SYM_EXEC_CONDITION_H
+
+#include "sym-exec-expression.h"
+
+enum condition_status {
+  CS_NO_COND,
+  CS_TRUE,
+  CS_FALSE,
+  CS_SYM
+};
+
+
+class bit_condition : public bit_expression {
+ private:
+  /* Condition's code.  */
+  tree_code m_code;
+
+  /* Prints the condition's sign.  */
+  void print_expr_sign ();
+
+ public:
+  bit_condition (value_bit *left, value_bit *right, tree_code type);
+  bit_condition (const bit_condition &expr);
+
+  /* Returns the condition's code.  */
+  tree_code get_code () const;
+
+  /* Returns a copy of the condition.  */
+  value_bit *copy () const;
+};
+
+#endif /* SYM_EXEC_CONDITION_H.  */
\ No newline at end of file
diff --git a/gcc/sym-exec/sym-ex

[RFC/RFA] [PATCH v5 11/12] Replace the original CRC loops with a faster CRC calculation.

2024-10-18 Thread Mariam Arutunian
After the loop exit, an internal function call (CRC or CRC_REV) is added,
and its result is assigned to the output CRC variable (the variable where
the calculated CRC is stored after the loop execution).
The removal of the loop itself is left to CFG cleanup and DCE.

  gcc/

* gimple-crc-optimization.cc (optimize_crc_loop): New function.
(execute): Add optimize_crc_loop function call.

Signed-off-by: Mariam Arutunian 
Mentored-by: Jeff Law 
---
 gcc/gimple-crc-optimization.cc | 78 ++
 1 file changed, 78 insertions(+)

diff --git a/gcc/gimple-crc-optimization.cc b/gcc/gimple-crc-optimization.cc
index a05aaf9f217..9dee9be85d0 100644
--- a/gcc/gimple-crc-optimization.cc
+++ b/gcc/gimple-crc-optimization.cc
@@ -225,6 +225,11 @@ class crc_optimization {
   /* Returns phi statement which may hold the calculated CRC.  */
   gphi *get_output_phi ();
 
+  /* Attempts to optimize a CRC calculation loop by replacing it with a call to
+ an internal function (IFN_CRC or IFN_CRC_REV).
+ Returns true if replacement is succeeded, otherwise false.  */
+  bool optimize_crc_loop (gphi *output_crc);
+
  public:
   crc_optimization () : m_visited_stmts (BITMAP_ALLOC (NULL)),
 			m_crc_loop (nullptr), m_polynomial (0)
@@ -1215,6 +1220,73 @@ crc_optimization::get_output_phi ()
   return nullptr;
 }
 
+/* Attempts to optimize a CRC calculation loop by replacing it with a call to
+   an internal function (IFN_CRC or IFN_CRC_REV).
+   Returns true if replacement is succeeded, otherwise false.  */
+
+bool
+crc_optimization::optimize_crc_loop (gphi *output_crc)
+{
+  if (!output_crc)
+{
+  if (dump_file)
+	fprintf (dump_file, "Couldn't determine output CRC.\n");
+  return false;
+}
+
+  if (!m_data_arg)
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "Data and CRC are xor-ed before for loop.  Initializing data "
+		 "with 0.\n");
+  /* Create a new variable for the data.
+   Determine the data's size with the loop iteration count.  */
+  unsigned HOST_WIDE_INT
+	data_size = tree_to_uhwi (m_crc_loop->nb_iterations) + 1;
+  tree type = build_nonstandard_integer_type (data_size, 1);
+ /* For the CRC calculation, it doesn't matter CRC is calculated for the
+	(CRC^data, 0) or (CRC, data).  */
+  m_data_arg = build_int_cstu (type, 0);
+}
+
+  /* Build tree node for the polynomial from its constant value.  */
+  tree polynomial_arg = build_int_cstu (TREE_TYPE (m_crc_arg), m_polynomial);
+  gcc_assert (polynomial_arg);
+
+  internal_fn ifn;
+  if (m_is_bit_forward)
+ifn = IFN_CRC;
+  else
+ifn = IFN_CRC_REV;
+
+  tree phi_result = gimple_phi_result (output_crc);
+  location_t loc;
+  loc = EXPR_LOCATION (phi_result);
+
+  /* Add IFN call and write the return value in the phi_result.  */
+  gcall *call
+  = gimple_build_call_internal (ifn, 3,
+m_crc_arg,
+m_data_arg,
+polynomial_arg);
+
+  gimple_call_set_lhs (call, phi_result);
+  gimple_set_location (call, loc);
+  gimple_stmt_iterator si = gsi_start_bb (output_crc->bb);
+  gsi_insert_before (&si, call, GSI_SAME_STMT);
+
+  /* Remove phi statement, which was holding CRC result.  */
+  gimple_stmt_iterator tmp_gsi = gsi_for_stmt (output_crc);
+  remove_phi_node (&tmp_gsi, false);
+
+  /* Alter the exit condition of the loop to always exit.  */
+  gcond* loop_exit_cond = get_loop_exit_condition (m_crc_loop);
+  gimple_cond_make_false (loop_exit_cond);
+  update_stmt (loop_exit_cond);
+  return true;
+}
+
 unsigned int
 crc_optimization::execute (function *fun)
 {
@@ -1271,6 +1343,12 @@ crc_optimization::execute (function *fun)
 	  if (dump_file)
 	fprintf (dump_file, "The loop with %d header BB index "
 "calculates CRC!\n", m_crc_loop->header->index);
+
+	  if (!optimize_crc_loop (output_crc))
+	{
+	  if (dump_file)
+		fprintf (dump_file, "Couldn't generate faster CRC code.\n");
+	}
 	}
 }
   return 0;
-- 
2.25.1



[RFC/RFA] [PATCH v5 07/12] aarch64: Add CRC built-ins test for the target AES.

2024-10-18 Thread Mariam Arutunian
gcc/testsuite/gcc.target/aarch64/

* crc-builtin-pmul64.c: New test.

Signed-off-by: Mariam Arutunian 
---
 .../gcc.target/aarch64/crc-builtin-pmul64.c   | 61 +++
 1 file changed, 61 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/crc-builtin-pmul64.c

diff --git a/gcc/testsuite/gcc.target/aarch64/crc-builtin-pmul64.c b/gcc/testsuite/gcc.target/aarch64/crc-builtin-pmul64.c
new file mode 100644
index 000..d8bb1724a65
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/crc-builtin-pmul64.c
@@ -0,0 +1,61 @@
+/* { dg-options "-march=armv8-a+crypto" } */
+
+#include <stdint.h>
+int8_t crc8_data8 ()
+{
+  return __builtin_crc8_data8 ('a', 0xff, 0x12);
+}
+int16_t crc16_data8 ()
+{
+  return __builtin_crc16_data8 (0x1234, 'a', 0x1021);
+}
+
+int16_t crc16_data16 ()
+{
+  return __builtin_crc16_data16 (0x1234, 0x3214, 0x1021);
+}
+
+int32_t crc32_data8 ()
+{
+  return __builtin_crc32_data8 (0x, 0x32, 0x4002123);
+}
+int32_t crc32_data16 ()
+{
+  return __builtin_crc32_data16 (0x, 0x3232, 0x4002123);
+}
+
+int32_t crc32_data32 ()
+{
+  return __builtin_crc32_data32 (0x, 0x123546ff, 0x4002123);
+}
+
+int8_t rev_crc8_data8 ()
+{
+  return __builtin_rev_crc8_data8 (0x34, 'a', 0x12);
+}
+
+int16_t rev_crc16_data8 ()
+{
+  return __builtin_rev_crc16_data8 (0x1234, 'a', 0x1021);
+}
+
+int16_t rev_crc16_data16 ()
+{
+  return __builtin_rev_crc16_data16 (0x1234, 0x3214, 0x1021);
+}
+
+int32_t rev_crc32_data8 ()
+{
+  return __builtin_rev_crc32_data8 (0x, 0x32, 0x4002123);
+}
+
+int32_t rev_crc32_data16 ()
+{
+  return __builtin_rev_crc32_data16 (0x, 0x3232, 0x4002123);
+}
+
+int32_t rev_crc32_data32 ()
+{
+  return __builtin_rev_crc32_data32 (0x, 0x123546ff, 0x4002123);
+} 
+/* { dg-final { scan-assembler-times "pmull" 24 } } */
\ No newline at end of file
-- 
2.25.1



arm: Improvements to arm_noce_conversion_profitable_p call [PR 116444]

2024-10-18 Thread Andre Vieira (lists)
Sorry for the delay, some other work popped up in between and this had 
some latent issues. They should all be addressed now in this new patch.



When not dealing with the special armv8.1-m.main conditional
instructions case, make sure the hook uses the
default_noce_conversion_profitable_p call to determine whether the
sequence is cost-effective.


Also make sure arm_noce_conversion_profitable_p accepts vsel 
patterns for Armv8.1-M Mainline targets.


gcc/ChangeLog:

* config/arm/arm.cc (arm_noce_conversion_profitable_p): Call
default_noce_conversion_profitable_p when not dealing with the
armv8.1-m.main special case.
(arm_is_vsel_fp_insn): New function.


Regression tested on arm-none-eabi with -mcpu=cortex-m3 and 
-mcpu=cortex-m55/-mfloat-abi=hard.


OK for trunk?

diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 5c11621327e15b7212b2290769cc0a922347ce2d..41f70154381bcfee3489841c05e4233310f2acee 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -36092,6 +36092,51 @@ arm_get_mask_mode (machine_mode mode)
   return default_get_mask_mode (mode);
 }
 
+/* Helper function to determine whether SEQ represents a sequence of
+   instructions representing the vsel floating point instructions.  */
+
+static bool
+arm_is_vsel_fp_insn (rtx_insn *seq)
+{
+  rtx_insn *curr_insn = seq;
+  rtx set = NULL_RTX;
+  /* The pattern may start with a simple set with register operands.  Skip
+ through any of those.  */
+  while (curr_insn)
+{
+  set = single_set (curr_insn);
+  if (!set
+ || !REG_P (SET_DEST (set)))
+   return false;
+
+  if (!REG_P (SET_SRC (set)))
+   break;
+  curr_insn = NEXT_INSN (curr_insn);
+}
+
+  if (!set)
+return false;
+
+  /* The next instruction should be a compare.  */
+  if (!REG_P (SET_DEST (set))
+  || GET_CODE (SET_SRC (set)) != COMPARE)
+return false;
+
+  curr_insn = NEXT_INSN (curr_insn);
+  if (!curr_insn)
+return false;
+
+  /* And the last instruction should be an IF_THEN_ELSE.  */
+  set = single_set (curr_insn);
+  if (!set
+  || !REG_P (SET_DEST (set))
+  || GET_CODE (SET_SRC (set)) != IF_THEN_ELSE)
+return false;
+
+  return !NEXT_INSN (curr_insn);
+}
+
+
 /* Helper function to determine whether SEQ represents a sequence of
instructions representing the Armv8.1-M Mainline conditional arithmetic
instructions: csinc, csneg and csinv. The cinc instruction is generated
@@ -36164,15 +36209,20 @@ arm_is_v81m_cond_insn (rtx_insn *seq)
hook to only allow "noce" to generate the patterns that are profitable.  */
 
 bool
-arm_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *)
+arm_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *if_info)
 {
   if (!TARGET_COND_ARITH
   || reload_completed)
-return true;
+return default_noce_conversion_profitable_p (seq, if_info);
 
   if (arm_is_v81m_cond_insn (seq))
 return true;
 
+  /* Look for vsel opportunities as we still want to codegen these for
+ Armv8.1-M Mainline targets.  */
+  if (arm_is_vsel_fp_insn (seq))
+return true;
+
   return false;
 }
 


Re: [Patch] Fortran: Fix translatability of diagnostic strings

2024-10-18 Thread Tobias Burnus

*Patch ping*

Tobias Burnus wrote:

I noticed that several diagnostic strings were not tagged as translatable.
I fixed them by adding _ or G_ as a prefix (see gcc/ABOUT-GCC-NLS) and moved
a single-use string into the message to make it more readable.  One error
message did not quite fit the pattern, hence I modified it slightly, and a
few '...' should have used proper %<...%> quotes.  OK for mainline?

Tobias

PS: Currently, 'make gcc.pot' fails because of PR117109.


This issue has been fixed. Thanks David!

Tobias


Re: [PATCH] diagnostics: libcpp: Improve locations for _Pragma lexing diagnostics [PR114423]

2024-10-18 Thread David Malcolm
On Fri, 2024-10-18 at 13:58 -0400, Lewis Hyatt wrote:
> On Fri, Oct 18, 2024 at 11:25 AM David Malcolm 
> wrote:
> > >    if (!pfile->cb.diagnostic)
> > >  abort ();
> > > -  ret = pfile->cb.diagnostic (pfile, level, reason, richloc,
> > > _(msgid), ap);
> > > -
> > > -  return ret;
> > > +  if (pfile->diagnostic_override_loc && level != CPP_DL_NOTE)
> > > +    {
> > > +  rich_location rc2{pfile->line_table, pfile-
> > > > diagnostic_override_loc};
> > > +  return pfile->cb.diagnostic (pfile, level, reason, &rc2,
> > > _(msgid), ap);
> > > +    }
> > 
> > This will effectively override the primary location in the
> > rich_location, but by using a second rich_location instance it will
> > also ignore any secondary locations and fix-it hints.
> > 
> > This might well be what we want here, but did you consider
> >   richloc.set_range (0, pfile->diagnostic_override_loc,
> >  SHOW_RANGE_WITH_CARET);
> > to reset the primary location?
> > 
> > Otherwise, looks good to me.
> > 
> > [...snip...]
> > 
> > Thanks
> > Dave
> > 
> 
> Thanks for taking a look! My thinking was, when libcpp produces
> tokens from a _Pragma string, basically every location_t that it
> generates is wrong and shouldn't be used. Because it doesn't actually
> set up the line_map, it gets something random that's just sorta close
> to reasonable. So I think it makes sense to discard fixits and
> secondary locations too.

Fair enough.  I think the patch is OK, then, assuming usual testing.

> 
> libcpp does use rich_location pretty sparingly, but I suppose the
> goal is to use it more over time. We use one fixit hint for invalid
> directive names currently; that one can't show up in a _Pragma though.
> Right now we do define an encoding_rich_location subclass for escaping
> unprintable bytes, which inherits rich_location and just adds a new
> constructor to set the escape flag when it is created. You are
> definitely right that this patch as of now loses that information.
> 
> Here's a source that uses an improperly normalized character:
> 
> _Pragma("ோ")
> 
> Currently we output:
> 
> t.cpp:1:1: warning: ‘\U0bc7\U0bbe’ is not in NFC [-
> Wnormalized=]
>     1 | _Pragma("")
>   | ^~
> 
> With the patch we output:
> t.cpp:1:1: warning: ‘\U0bc7\U0bbe’ is not in NFC [-
> Wnormalized=]
>     1 | _Pragma("ோ")
>   | ^~~
> 
> So the main location range is improved (it underlines all of _Pragma
> instead of most of it), but we have indeed lost the intended feature
> that the incorrect bytes are escaped on the source line.
> 
> For this particular case I could improve it with a one line addition
> like
> 
> rc2.set_escape_on_output (richloc->escape_on_output_p ());
> 
> and that would actually handle all currently needed cases since we
> don't use a lot of rich_locations in libcpp. It would just become
> stale if some other option gets added to rich_location in the future
> that we also should preserve. 
> I think to fix it in a fully general
> way, it would be necessary to add a new interface to class
> rich_location. It already has a method to delete all the fixit hints,
> it would also need a method to delete all the ranges. Then we could
> make a copy of the richloc and just delete everything we don't want
> to preserve. Do you have a preference one way or the other?

Do we know which diagnostics are likely to be emitted when we override
the location?  I suspect that anywhere that's overriding the location
is an awkward case where we're unlikely to have fix-it hints and
secondary ranges.

I don't have a strong preference here.

> 
> By the way, your suggestion was to directly modify richloc... these
> functions do take the richloc by non-const pointer, but is it OK to
> modify it or is a copy needed? The functions like cpp_warning_at() are
> exposed in the public header file, although at the moment, all call
> sites are within libcpp and don't look like they would notice if the
> argument was modified. I wasn't sure what is the expected interface
> here.

The diagnostic-reporting functions take non-const rich_location because
the Fortran frontend doesn't set the locations on its diagnostics, but
instead uses formatting codes %C and %L which write back locations into
the rich_location when pp_format is called.  So there's some precedent
for non-const rich_location, FWIW

Thanks again for the patch.
Dave



Re: [Patch] Fortran: Fix translatability of diagnostic strings

2024-10-18 Thread Jerry D

On 10/18/24 3:20 PM, Tobias Burnus wrote:

*Patch ping*



OK for trunk.

Jerry


Tobias Burnus wrote:

I noticed that several diagnostic strings were not tagged as translatable.
I fixed them by adding _ or G_ as a prefix (→ gcc/ABOUT-GCC-NLS) and moved a
single-use string into the message to make it more readable. One error
message did not quite fit the pattern, hence I modified it slightly, and a
few '...' should have used proper %<...%> quotes.

OK for mainline?

Tobias

PS: Currently, 'make gcc.pot' fails because of PR117109.


This issue has been fixed. Thanks David!

Tobias





[pushed: r15-4492] diagnostics: remove forward decl of json::value from diagnostic.h

2024-10-18 Thread David Malcolm
I believe this hasn't been necessary since r15-1413-gd3878c85f331c7.

Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
Pushed to trunk as r15-4492-g83abdb041426b7.

gcc/ChangeLog:
* diagnostic.h (json::value): Remove forward decl.

Signed-off-by: David Malcolm 
---
 gcc/diagnostic.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/gcc/diagnostic.h b/gcc/diagnostic.h
index edd221f1a8ce..423e07230a65 100644
--- a/gcc/diagnostic.h
+++ b/gcc/diagnostic.h
@@ -220,7 +220,6 @@ public:
 };
 
 class edit_context;
-namespace json { class value; }
 class diagnostic_client_data_hooks;
 class logical_location;
 class diagnostic_diagram;
-- 
2.26.3



[pushed: r15-4491] diagnostics: add debug dump functions

2024-10-18 Thread David Malcolm
This commit expands on r15-3973-g4c7a58ac2617e2, which added
debug "dump" member functiosn to pretty_printer and output_buffer.

This followup adds "dump" member functions to diagnostic_context and
diagnostic_format, extends the existing dump functions and adds
indentation to make it much easier to see the various relationships
between context, format, printer, etc.

Hence you can now do:

(gdb) call global_dc->dump ()

and get a useful summary of what the diagnostic subsystem is doing;
for example:

(gdb) call global_dc->dump()
diagnostic_context:
  counts:
  output format:
sarif_output_format
  printer:
m_show_color: false
m_url_format: bel
m_buffer:
  m_formatted_obstack current object: length 0:
  m_chunk_obstack current object: length 0:
  pp_formatted_chunks: depth 0
0: [TEXT("Function ")]
1: [BEGIN_QUOTE, TEXT("program"), END_QUOTE]
2: [TEXT(" requires an argument list at ")]
3: [TEXT("(1)")]

showing the counts of all diagnostic kinds that are non-zero (none yet),
that we have a sarif output format, and the printer is part-way through
formatting a string.

Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
Pushed to trunk as r15-4491-g2ca19d43fb5d75.

gcc/ChangeLog:
* diagnostic-format-json.cc (json_output_format::dump): New.
* diagnostic-format-sarif.cc (sarif_output_format::dump): New.
(sarif_file_output_format::dump): New.
* diagnostic-format-text.cc (diagnostic_text_output_format::dump):
New.
* diagnostic-format-text.h (diagnostic_text_output_format::dump):
New decl.
* diagnostic-format.h (diagnostic_output_format::dump): New decls.
* diagnostic.cc (diagnostic_context::dump): New.
(diagnostic_output_format::dump): New.
* diagnostic.h (diagnostic_context::dump): New decls.
* pretty-print-format-impl.h (pp_formatted_chunks::dump): Add
"indent" param.
* pretty-print.cc (bytes_per_hexdump_line): New constant.
(print_hexdump_line): New.
(print_hexdump): New.
(output_buffer::dump): Add "indent" param and use it.  Add
hexdump of current object in m_formatted_obstack and
m_chunk_obstack.
(pp_formatted_chunks::dump): Add "indent" param and use it.
(pretty_printer::dump): Likewise.  Add dumping of m_show_color
and m_url_format.
* pretty-print.h (output_buffer::dump): Add "indent" param.
(pretty_printer::dump): Likewise.

gcc/testsuite/ChangeLog:
* gcc.dg/plugin/diagnostic_plugin_xhtml_format.c
(xhtml_output_format::dump): New.

Signed-off-by: David Malcolm 
---
 gcc/diagnostic-format-json.cc |  6 ++
 gcc/diagnostic-format-sarif.cc| 13 +++
 gcc/diagnostic-format-text.cc |  7 ++
 gcc/diagnostic-format-text.h  |  3 +
 gcc/diagnostic-format.h   |  4 +
 gcc/diagnostic.cc | 24 +
 gcc/diagnostic.h  |  3 +
 gcc/pretty-print-format-impl.h|  4 +-
 gcc/pretty-print.cc   | 96 +--
 gcc/pretty-print.h|  8 +-
 .../plugin/diagnostic_plugin_xhtml_format.c   |  6 ++
 11 files changed, 160 insertions(+), 14 deletions(-)

diff --git a/gcc/diagnostic-format-json.cc b/gcc/diagnostic-format-json.cc
index b4c1f13ee671..4f035dd2fae3 100644
--- a/gcc/diagnostic-format-json.cc
+++ b/gcc/diagnostic-format-json.cc
@@ -38,6 +38,12 @@ along with GCC; see the file COPYING3.  If not see
 class json_output_format : public diagnostic_output_format
 {
 public:
+  void dump (FILE *out, int indent) const override
+  {
+fprintf (out, "%*sjson_output_format\n", indent, "");
+diagnostic_output_format::dump (out, indent);
+  }
+
   void on_begin_group () final override
   {
 /* No-op.  */
diff --git a/gcc/diagnostic-format-sarif.cc b/gcc/diagnostic-format-sarif.cc
index 89ac9a5424c9..f64c83ad6e14 100644
--- a/gcc/diagnostic-format-sarif.cc
+++ b/gcc/diagnostic-format-sarif.cc
@@ -3302,6 +3302,12 @@ public:
 gcc_assert (!pending_result);
   }
 
+  void dump (FILE *out, int indent) const override
+  {
+fprintf (out, "%*ssarif_output_format\n", indent, "");
+diagnostic_output_format::dump (out, indent);
+  }
+
   void on_begin_group () final override
   {
 /* No-op,  */
@@ -3386,6 +3392,13 @@ public:
   {
 m_builder.flush_to_file (m_output_file.get_open_file ());
   }
+  void dump (FILE *out, int indent) const override
+  {
+fprintf (out, "%*ssarif_file_output_format: %s\n",
+indent, "",
+m_output_file.get_filename ());
+diagnostic_output_format::dump (out, indent);
+  }
   bool machine_readable_stderr_p () const final override
   {
 return false;
diff --git a/gcc/diagnostic-format-text.cc b/gcc/diagnostic-format-text.cc
index 0d58d5fb082d..f6ec88155c7f 100644
--- a/gcc/diagnostic-format-

Fortran: Add range-based diagnostic

2024-10-18 Thread Tobias Burnus

This patch was motivated by David's talk at Cauldron – and by
getting rather bad locations for some diagnostics, where I wanted
to use the column number to ensure that all items are found.

The main problem was a missing gobbling of spaces, but still
ranges are way nicer. As gfortran uses the common system, that
was trivial - except that during parsing something else is used,
which therefore needs to support two formats; that required
a few changes. Still, it is rather non-invasive, and I think for
trans*.cc it also does a nice cleanup.

Unsurprisingly, there are more opportunities, both for fixing
the location issues due to treating whitespace and ranges instead
of a single locus; however, a single location is also fine. Note
that for 'a + b' the locus could be '~~1~~', i.e. pointing at '+'
but spanning the whole expression. Talking about unused features:
besides 'inform' we could also use fixit hints, providing patches.
And I note that gfc_error also supports %qD or %qE or ... to print
trees.

But back to this patch, an example is the following:

   27 |   deallocate(ALLOCS(1))
  |  1
Error: Allocate-object at (1) must be ALLOCATABLE or a POINTER

Tested on x86_64-gnu-linux.
Comments, suggestions, remarks?
OK for mainline?

Tobias

PS: Andre remarked that there was some issue, logged somewhere
in Bugzilla, due to the old/current handling of locations. I
have not searched Bugzilla and, thus, have no idea whether it
helps or not. Presumably not.
Fortran: Add range-based diagnostic

GCC's diagnostic engine gained a while ago support for ranges, i.e. instead
of pointing at a single character '^', it can also have a '^~' range.

This patch adds support for this and adds 9 users for it, which covers the
most common cases. A single '^' can be still useful. Some location data in
gfortran is rather bad - often the matching pattern includes whitespace such
that the before or after location points to the beginning/end of the
whitespace, which can be far off, especially when comments and/or continuation
lines are involved. Otherwise, a '^' is often still sufficient, albeit wrong
location data only becomes obvious once starting to use ranges.

The 'locus' is extended to support two ways to store the data; hereby
gfc_current_locus always contains the old format (at least during parsing)
and gfc_current_locus shall not be used in trans*.cc. The latter permits
a nice cleanup to just use input_location. Otherwise, the new format is
only used when switching to ranges.
The only reason to convert from location_t to locus occurs in trans*.cc
for the gfc_error (etc.) diagnostic and for gfc_trans_runtime_check; there
are currently 5 such cases.  For gfc_* diagnostics, we could think of
another letter besides %L or a modifier like '%lL', if deemed useful.

In any case, the new format is just:
  locus->u.location = linemap_position_for_loc_and_offset (line_table,
 loc->u.lb->location, loc->nextc - loc->u.lb->line);
  locus->nextc = (gfc_char_t *) -1;  /* Marker for new format. */
i.e. using the existing location_t location in the line buffer (which
points to column 0) and add as offset the actually used column number.

As location_t handles ranges, we just use it also to store them via:
  location = make_location (caret, begin, end)
There are a few convenience macros/functions but that's all.
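
As a concrete (hypothetical) illustration of the plumbing described above,
a caret-plus-range diagnostic can be built along these lines; the helper
name here is made up, but make_location and error_at are the existing
line-map/diagnostic entry points:

  static void
  report_range (location_t caret, location_t begin, location_t end)
  {
    /* Fold the three positions into one location_t; line-maps encodes
       the caret and the [begin, end] range in a single value.  */
    location_t loc = make_location (caret, begin, end);
    error_at (loc, "invalid operands");  /* underlines begin..end */
  }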

Alongside, a few minor fixes were done: linemap_location_before_p replaces
a line-number based comparison, which does not handle multiple statements
in the same line that ';' allows for.

gcc/fortran/ChangeLog:

	* data.cc (gfc_assign_data_value): Use linemap_location_before_p
	and GFC_LOCUS_IS_SET.
	* decl.cc (gfc_verify_c_interop_param): Make better translatable.
	(build_sym, variable_decl, gfc_match_formal_arglist,
	gfc_match_subroutine): Add range-based locations, use it in
	diagnostic and gobble whitespace for better locations.
	* error.cc (gfc_get_location_with_offset): Handle new format.
	(gfc_get_location_range): New.
	* expr.cc (gfc_check_assign): Use GFC_LOCUS_IS_SET.
	* frontend-passes.cc (check_locus_code, check_locus_expr):
	Likewise.
	(runtime_error_ne): Use GFC_LOCUS_IS_SET.
	* gfortran.h (locus): Change lb to union with lb and location.
	(GFC_LOCUS_IS_SET): Define.
	(gfc_get_location_range): New prototype.
	(gfc_new_symbol, gfc_get_symbol, gfc_get_sym_tree,
	gfc_get_ha_symbol, gfc_get_ha_sym_tree): Take optional locus
	argument.
	* io.cc (io_constraint): Use GFC_LOCUS_IS_SET.
	* match.cc (gfc_match_sym_tree): Use range locus.
	* openmp.cc (gfc_match_omp_variable_list,
	gfc_match_omp_doacross_sink): Likewise.
	* parse.cc (next_free): Update for locus struct change.
	* primary.cc (gfc_match_varspec): Likewise.
	(match_variable): Use range locus.
	* resolve.cc (find_array_spec): Use GFC_LOCUS_IS_SET.
	* scanner.cc (gfc_at_eof, gfc_at_bol, gfc_start_source_files,
	gfc_advance_line, gfc_define_undef_line, skip_fixed_comments,
	gfc_gobble_whitespace, include_stmt, gfc_new_fil

Re: Fortran: Add range-based diagnostic

2024-10-18 Thread Jerry D

On 10/18/24 3:35 PM, Tobias Burnus wrote:

This patch was motivated by David's talk at Cauldron – and by
getting rather bad locations for some diagnostics, where I wanted
to use the column number to ensure that all items are found.

The main problem was a missing gobbling of spaces, but still
ranges are way nicer. As gfortran uses the common system, that
was trivial - except that during parsing something else is used,
which therefore needs to support two formats; that required
a few changes. Still, it is rather non-invasive, and I think for
trans*.cc it also does a nice cleanup.

Unsurprisingly, there are more opportunities, both for fixing
the location issues due to treating whitespace and ranges instead
of a single locus; however, a single location is also fine. Note
that for 'a + b' the locus could be '~~1~~', i.e. pointing at '+'
but spanning the whole expression. Talking about unused features:
besides 'inform' we could also use fixit hints, providing patches.
And I note that gfc_error also supports %qD or %qE or ... to print
trees.

But back to this patch, an example is the following:

    27 |   deallocate(ALLOCS(1))
   |  1
Error: Allocate-object at (1) must be ALLOCATABLE or a POINTER

Tested on x86_64-gnu-linux.
Comments, suggestions, remarks?
OK for mainline?


After scanning through the whole long patch I think I get the 
pattern of it.


Looks Good To Me,  OK.

Jerry




Tobias

PS: Andre remarked that there was some issue, logged somewhere
in Bugzilla, due to the old/current handling of locations. I
have not searched Bugzilla and, thus, have no idea whether it
helps or not. Presumably not.




[PATCH] doc/cpp: Document __has_include_next

2024-10-18 Thread Arsen Arsenović
OK for trunk?  Seems to build and render fine with makeinfo --info and
--html.  Typesetting it, I see overfull and underfull hboxes, but I
suspect these were here for a while..
-- >8 --
While hacking on an unrelated change, I noticed that __has_include_next
hasn't been documented at all.  This patch adds it to the __has_include
manual node.

gcc/ChangeLog:

* doc/cpp.texi (__has_include): Document __has_include_next
also.
(Conditional Syntax): Mention __has_include_next in the
description for the __has_include menu entry.
---
 gcc/doc/cpp.texi | 36 ++--
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/gcc/doc/cpp.texi b/gcc/doc/cpp.texi
index db3a075c5a96..03f8d059681d 100644
--- a/gcc/doc/cpp.texi
+++ b/gcc/doc/cpp.texi
@@ -3204,7 +3204,8 @@ directive}: @samp{#if}, @samp{#ifdef} or @samp{#ifndef}.
 * @code{__has_builtin}::
 * @code{__has_feature}::
 * @code{__has_extension}::
-* @code{__has_include}::
+* @code{__has_include}::@code{__has_include} and
+ @code{__has_include_next}
 * @code{__has_embed}::
 @end menu
 
@@ -3607,22 +3608,27 @@ details of which identifiers are accepted by these 
function-like macros, see
 the Clang documentation}}.
 
 @node @code{__has_include}
-@subsection @code{__has_include}
+@subsection @code{__has_include}, @code{__has_include_next}
 @cindex @code{__has_include}
+@cindex @code{__has_include_next}
 
-The special operator @code{__has_include (@var{operand})} may be used in
-@samp{#if} and @samp{#elif} expressions to test whether the header referenced
-by its @var{operand} can be included using the @samp{#include} directive.  
Using
-the operator in other contexts is not valid.  The @var{operand} takes
-the same form as the file in the @samp{#include} directive (@pxref{Include
-Syntax}) and evaluates to a nonzero value if the header can be included and
-to zero otherwise.  Note that that the ability to include a header doesn't
-imply that the header doesn't contain invalid constructs or @samp{#error}
-directives that would cause the preprocessor to fail.
+The special operators @code{__has_include (@var{operand})} and
+@code{__has_include_next (@var{operand})} may be used in @samp{#if} and
+@samp{#elif} expressions to test whether the header referenced by their
+@var{operand} can be included using the @samp{#include} and
+@samp{#include_next} directive, respectively.  Using the operators in
+other contexts is not valid.  The @var{operand} takes the same form as
+the file in the @samp{#include} and @samp{#include_next} directives
+respectively (@pxref{Include Syntax}) and the operators evaluate to a
+nonzero value if the header can be included and to zero otherwise.  Note
+that the ability to include a header doesn't imply that the header
+doesn't contain invalid constructs or @samp{#error} directives that
+would cause the preprocessor to fail.
 
-The @code{__has_include} operator by itself, without any @var{operand} or
-parentheses, acts as a predefined macro so that support for it can be tested
-in portable code.  Thus, the recommended use of the operator is as follows:
+The @code{__has_include} and @code{__has_include_next} operators by
+themselves, without any @var{operand} or parentheses, act as
+predefined macros so that support for them can be tested in portable code.
+Thus, the recommended use of the operators is as follows:
 
 @smallexample
 #if defined __has_include
@@ -3645,6 +3651,8 @@ but not with others that don't.
 #endif
 @end smallexample
 
+The same holds for @code{__has_include_next}.
+
 @node @code{__has_embed}
 @subsection @code{__has_embed}
 @cindex @code{__has_embed}
-- 
2.47.0
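
For completeness, the recommended guarded-use pattern shown in the patch
carries over to __has_include_next; a minimal sketch (the header name is
illustrative, and #include_next is only meaningful inside an included
header, e.g. a fixincludes-style wrapper):

  #if defined __has_include_next
  # if __has_include_next (<limits.h>)
  #  include_next <limits.h>
  # endif
  #endif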



[PATCH] phiopt: do factor_out_conditional_operation for all phis [PR112418]

2024-10-18 Thread Andrew Pinski
Sometimes factor_out_conditional_operation can factor out
an operation that causes a phi node to become the same element.
Other times, we want to factor out a binary operator because
it can improve code generation; an example is PR 110015 (openjpeg).

Note this includes a heuristic to decide if factoring out the operation
is profitable or not. It can be expanded to include a better live-range
extension detector. Right now it has a simple one: if the name is live on a
dominating path, it is considered live; otherwise, if there are only a small
number of assign statements (defaults to 5) between the defining statement
and the end, the factoring is considered not to extend the live range too
much.
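
A rough before/after sketch of the transformation on (hypothetical) GIMPLE,
for the ABS case exercised by the new testcases:

  /* Before: both predecessors compute the same unary operation.  */
  <bb then>: _1 = ABS_EXPR <d_3>;
  <bb else>: _2 = ABS_EXPR <d_3>;
  <bb join>: c_4 = PHI <_1(then), _2(else)>

  /* After factoring the operation out through the PHI: the new PHI has
     the same argument on every edge, so it collapses and one ABS remains.  */
  <bb join>: c_4 = ABS_EXPR <d_3>;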

Bootstrapped and tested on x86_64-linux-gnu.

PR tree-optimization/112418

gcc/ChangeLog:

* tree-ssa-phiopt.cc (is_factor_profitable): New function.
(factor_out_conditional_operation): Add merge argument. Remove
arg0/arg1 arguments. Return bool instead of the new phi.
Early return for virtual ops. Call is_factor_profitable to
check if the factoring would be profitable.
(pass_phiopt::execute): Call factor_out_conditional_operation
on all phis instead of just singleton phi.
* doc/invoke.texi (--param phiopt-factor-max-stmts-live=): Document.
* params.opt (--param=phiopt-factor-max-stmts-live=): New opt.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/factor_op_phi-1.c: New test.
* gcc.dg/tree-ssa/factor_op_phi-2.c: New test.
* gcc.dg/tree-ssa/factor_op_phi-3.c: New test.
* gcc.dg/tree-ssa/factor_op_phi-4.c: New test.

Signed-off-by: Andrew Pinski 
---
 gcc/doc/invoke.texi   |   4 +
 gcc/params.opt|   4 +
 .../gcc.dg/tree-ssa/factor_op_phi-1.c |  27 +++
 .../gcc.dg/tree-ssa/factor_op_phi-2.c |  29 +++
 .../gcc.dg/tree-ssa/factor_op_phi-3.c |  33 +++
 .../gcc.dg/tree-ssa/factor_op_phi-4.c |  29 +++
 gcc/tree-ssa-phiopt.cc| 209 --
 7 files changed, 272 insertions(+), 63 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/factor_op_phi-1.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/factor_op_phi-2.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/factor_op_phi-3.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/factor_op_phi-4.c

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 8437b2029ed..aebcc9082ff 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -15473,6 +15473,10 @@ In each case, the @var{value} is an integer.  The 
following choices
 of @var{name} are recognized for all targets:
 
 @table @gcctabopt
+@item phiopt-factor-max-stmts-live
+When factoring statements out of if/then/else, this is the maximum number of
+statements after the defining statement allowed to extend the lifetime of a name.
+
 @item predictable-branch-outcome
 When branch is predicted to be taken with probability lower than this threshold
 (in percent), then it is considered well predictable.
diff --git a/gcc/params.opt b/gcc/params.opt
index a08e4c1042d..24f440bbe71 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -861,6 +861,10 @@ Enum(parloops_schedule_type) String(runtime) 
Value(PARLOOPS_SCHEDULE_RUNTIME)
 Common Joined UInteger Var(param_partial_inlining_entry_probability) Init(70) 
Optimization IntegerRange(0, 100) Param
 Maximum probability of the entry BB of split region (in percent relative to 
entry BB of the function) to make partial inlining happen.
 
+-param=phiopt-factor-max-stmts-live=
+Common Joined UInteger Var(param_phiopt_factor_max_stmts_live) Init(5) 
Optimization IntegerRange(0, 100) Param
+Maximum number of statements allowed in between the statement and the end to 
be considered as not extending the live range.
+
 -param=predictable-branch-outcome=
 Common Joined UInteger Var(param_predictable_branch_outcome) Init(2) 
IntegerRange(0, 50) Param Optimization
 Maximal estimated outcome of branch considered predictable.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/factor_op_phi-1.c 
b/gcc/testsuite/gcc.dg/tree-ssa/factor_op_phi-1.c
new file mode 100644
index 000..6c0971ff801
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/factor_op_phi-1.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-phiopt1-details -fdump-tree-optimized" } */
+
+/* PR tree-optimization/112418 */
+
+int f(int a, int b, int d)
+{
+  int c;
+  if (a < 0)
+  {
+a = -a;
+c = d > 0 ? d : -d;
+  }
+  else
+  {
+a = a;
+c = d > 0 ? d : -d;
+  }
+  return a + c;
+}
+
+/* ABS  should be able to pull out of the if statement early on in phiopt1. 
*/
+/* { dg-final { scan-tree-dump "changed to factor operation out from " 
"phiopt1" } } */
+/* { dg-final { scan-tree-dump-not "if " "phiopt1" } } */
+/* { dg-final { scan-tree-dump-times "ABS_EXPR " 2 "phiopt1" } } */
+/* { dg-final { scan-tree-dump-not "if " "optimized" } } */
+/* { dg-final { scan-tree-dump-times "ABS_EXPR " 2 "optimized" } } */
diff --git a/gcc/t

Re: [PATCH 4/7] RISC-V: Honour -mrvv-max-lmul in riscv_vector::expand_block_move

2024-10-18 Thread Robin Dapp
Hi Craig,

thanks for working on this, it has been on my TODO list for a while.

In general this looks reasonable to me.

> +   poly_uint64 mode_units;
> /* Find the mode to use for the copy inside the loop - or the
>sole copy, if there is no loop.  */
> if (!need_loop)
> @@ -1152,12 +1166,12 @@ expand_block_move (rtx dst_in, rtx src_in, rtx 
> length_in)
>pointless.
>Still, by choosing a lower LMUL factor that still allows
>an entire transfer, we can reduce register pressure.  */
> -   for (unsigned lmul = 1; lmul <= 4; lmul <<= 1)
> - if (length * BITS_PER_UNIT <= TARGET_MIN_VLEN * lmul
> - && multiple_p (BYTES_PER_RISCV_VECTOR * lmul, potential_ew)
> +   for (unsigned lmul = 1; lmul < TARGET_MAX_LMUL; lmul <<= 1)
> + if (known_le (length * BITS_PER_UNIT, TARGET_MIN_VLEN * lmul)
> + && multiple_p (BYTES_PER_RISCV_VECTOR * lmul, potential_ew,
> +&mode_units)
>   && (riscv_vector::get_vector_mode
> -  (elem_mode, exact_div (BYTES_PER_RISCV_VECTOR * lmul,
> -  potential_ew)).exists (&vmode)))
> +  (elem_mode, mode_units).exists (&vmode)))
> break;

> +/* Return the appropriate LMUL mode for MODE.  */
> +
> +opt_machine_mode
> +get_lmul_mode (scalar_mode mode, int lmul)
> +{
> +  poly_uint64 lmul_nunits;
> +  unsigned int bytes = GET_MODE_SIZE (mode);
> +  if (multiple_p (BYTES_PER_RISCV_VECTOR * lmul, bytes, &lmul_nunits))
> +return get_vector_mode (mode, lmul_nunits);
> +  return E_VOIDmode;
> +}

I don't fully see the need for this function just for the single caller.
The ask for "largest vector mode with inner mode MODE" is
common to other "string" functions as well and what we do there is

  poly_int64 nunits = exact_div
  (BYTES_PER_RISCV_VECTOR * TARGET_MAX_LMUL, GET_MODE_SIZE (mode));

  machine_mode vmode;
  if (!riscv_vector::get_vector_mode (GET_MODE_INNER (mode), nunits)
 .exists (&vmode))
gcc_unreachable ();

The natural generalization to "largest vector mode up to LMUL" is useful
in instances where we know the AVL.
So maybe you'd want to slightly enhance your function and use it for
the other instances we have?  You could probably also use it inside your
loop just above.
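
Something along these lines, perhaps (an untested sketch of the suggested
generalization; the function name is made up, and it mirrors the multiple_p
idiom from the quoted code):

  /* Return the widest vector mode with inner mode MODE that fits in
     MAX_LMUL vector registers, if any.  */
  static opt_machine_mode
  get_vector_mode_up_to_lmul (scalar_mode mode, int max_lmul)
  {
    poly_uint64 nunits;
    if (multiple_p (BYTES_PER_RISCV_VECTOR * max_lmul,
                    GET_MODE_SIZE (mode), &nunits))
      return riscv_vector::get_vector_mode (mode, nunits);
    return E_VOIDmode;
  }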

-- 
Regards
 Robin



Re: libgo: fix for C23 nullptr keyword

2024-10-18 Thread Ian Lance Taylor
On Thu, Oct 17, 2024 at 1:19 PM Joseph Myers  wrote:
>
> Making GCC default to -std=gnu23 for C code produces Go test failures
> because of C code used by Go that uses a variable called nullptr, which is
> a keyword in C23.
>
> I've submitted this fix upstream at
> https://github.com/golang/go/pull/69927 using the GitHub mirror workflow.
> Ian, once some form of such a fix is upstream, could you backport it to
> GCC's libgo?

Thanks.  Committed upstream and to libgo.

Ian


Re: [PATCH 6/9] Try to simplify (X >> C1) << (C1 + C2) -> X << C2

2024-10-18 Thread Richard Biener
On Fri, 18 Oct 2024, Richard Sandiford wrote:

> This patch adds a rule to simplify (X >> C1) << (C1 + C2) -> X << C2
> when the low C1 bits of X are known to be zero.
> 
> Any single conversion can take place between the shifts.  E.g. for
> a truncating conversion, any extra bits of X that are preserved by
> truncating after the shift are immediately lost by the shift left.
> And the sign bits used for an extending conversion are the same as
> the sign bits used for the rshift.  (A double conversion of say
> int->unsigned->uint64_t would be wrong though.)

OK.

Thanks,
Richard.

> gcc/
>   * match.pd: Simplify (X >> C1) << (C1 + C2) -> X << C2 if the
>   low C1 bits of X are zero.
> 
> gcc/testsuite/
>   * gcc.dg/tree-ssa/shifts-1.c: New test.
>   * gcc.dg/tree-ssa/shifts-2.c: Likewise.
> ---
>  gcc/match.pd | 13 +
>  gcc/testsuite/gcc.dg/tree-ssa/shifts-1.c | 61 
>  gcc/testsuite/gcc.dg/tree-ssa/shifts-2.c | 21 
>  3 files changed, 95 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/shifts-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/shifts-2.c
> 
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 268316456c3..540582dc984 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -4902,6 +4902,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>   - TYPE_PRECISION (TREE_TYPE (@2)
>(bit_and (convert @0) (lshift { build_minus_one_cst (type); } @1
>  
> +#if GIMPLE
> +/* (X >> C1) << (C1 + C2) -> X << C2 if the low C1 bits of X are zero.  */
> +(simplify
> + (lshift (convert? (rshift (with_possible_nonzero_bits2 @0) INTEGER_CST@1))
> + INTEGER_CST@2)
> + (if (INTEGRAL_TYPE_P (type)
> +  && wi::ltu_p (wi::to_wide (@1), element_precision (type))
> +  && wi::ltu_p (wi::to_wide (@2), element_precision (type))
> +  && wi::to_widest (@2) >= wi::to_widest (@1)
> +  && wi::to_widest (@1) <= wi::ctz (get_nonzero_bits (@0)))
> +  (lshift (convert @0) (minus @2 @1
> +#endif
> +
>  /* For (x << c) >> c, optimize into x & ((unsigned)-1 >> c) for
> unsigned x OR truncate into the precision(type) - c lowest bits
> of signed x (if they have mode precision or a precision of 1).  */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/shifts-1.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/shifts-1.c
> new file mode 100644
> index 000..d88500ca8dd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/shifts-1.c
> @@ -0,0 +1,61 @@
> +/* { dg-options "-O2 -fdump-tree-optimized-raw" } */
> +
> +unsigned int
> +f1 (unsigned int x)
> +{
> +  if (x & 3)
> +__builtin_unreachable ();
> +  x >>= 2;
> +  return x << 3;
> +}
> +
> +unsigned int
> +f2 (unsigned int x)
> +{
> +  if (x & 3)
> +__builtin_unreachable ();
> +  unsigned char y = x;
> +  y >>= 2;
> +  return y << 3;
> +}
> +
> +unsigned long
> +f3 (unsigned int x)
> +{
> +  if (x & 3)
> +__builtin_unreachable ();
> +  x >>= 2;
> +  return (unsigned long) x << 3;
> +}
> +
> +int
> +f4 (int x)
> +{
> +  if (x & 15)
> +__builtin_unreachable ();
> +  x >>= 4;
> +  return x << 5;
> +}
> +
> +unsigned int
> +f5 (int x)
> +{
> +  if (x & 31)
> +__builtin_unreachable ();
> +  x >>= 5;
> +  return x << 6;
> +}
> +
> +unsigned int
> +f6 (unsigned int x)
> +{
> +  if (x & 1)
> +__builtin_unreachable ();
> +  x >>= 1;
> +  return x << (sizeof (int) * __CHAR_BIT__ - 1);
> +}
> +
> +/* { dg-final { scan-tree-dump-not {<[a-z]*_div_expr,} "optimized" } } */
> +/* { dg-final { scan-tree-dump-not { +/* { dg-final { scan-tree-dump-times { "optimized" } } */
> +/* { dg-final { scan-tree-dump { { target int32 } } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/shifts-2.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/shifts-2.c
> new file mode 100644
> index 000..67ba4a75aec
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/shifts-2.c
> @@ -0,0 +1,21 @@
> +/* { dg-options "-O2 -fdump-tree-optimized-raw" } */
> +
> +unsigned int
> +f1 (unsigned int x)
> +{
> +  if (x & 3)
> +__builtin_unreachable ();
> +  x >>= 3;
> +  return x << 4;
> +}
> +
> +unsigned int
> +f2 (unsigned int x)
> +{
> +  if (x & 3)
> +__builtin_unreachable ();
> +  x >>= 2;
> +  return x << 1;
> +}
> +
> +/* { dg-final { scan-tree-dump-times { 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


[PATCH] match.pd: Add std::pow folding optimizations.

2024-10-18 Thread Jennifer Schmitz
This patch adds the following two simplifications in match.pd:
- pow (1.0/x, y) to pow (x, -y), avoiding the division
- pow (0.0, x) to 0.0, avoiding the call to pow.
The patterns are guarded by flag_unsafe_math_optimizations,
!flag_trapping_math, !flag_errno_math, !HONOR_SIGNED_ZEROS,
and !HONOR_INFINITIES.
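
To make the transforms concrete, under -ffast-math (so all of the guards
above hold) the folds behave as follows (illustrative only):

  double f (double x, double y)
  {
    return __builtin_pow (1.0 / x, y);  /* folded to __builtin_pow (x, -y) */
  }

  double g (double x)
  {
    return __builtin_pow (0.0, x);      /* folded to 0.0 */
  }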

Tests were added to confirm the application of the transform for float,
double, and long double.

The patch was bootstrapped and regtested on aarch64-linux-gnu and
x86_64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Jennifer Schmitz 

gcc/
* match.pd: Fold pow (1.0/x, y) -> pow (x, -y) and
pow (0.0, x) -> 0.0.

gcc/testsuite/
* gcc.dg/tree-ssa/pow_fold_1.c: New test.
---
 gcc/match.pd   | 14 +
 gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c | 34 ++
 2 files changed, 48 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 12d81fcac0d..ba100b117e7 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -8203,6 +8203,20 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(rdiv @0 (exps:s @1))
 (mult @0 (exps (negate @1)
 
+ /* Simplify pow(1.0/x, y) into pow(x, -y).  */
+ (if (! HONOR_INFINITIES (type)
+  && ! HONOR_SIGNED_ZEROS (type)
+  && ! flag_trapping_math
+  && ! flag_errno_math)
+  (simplify
+   (POW (rdiv:s real_onep@0 @1) @2)
+(POW @1 (negate @2)))
+
+  /* Simplify pow(0.0, x) into 0.0.  */
+  (simplify
+   (POW real_zerop@0 @1)
+@0))
+
  (if (! HONOR_SIGN_DEPENDENT_ROUNDING (type)
   && ! HONOR_NANS (type) && ! HONOR_INFINITIES (type)
   && ! flag_trapping_math
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c 
b/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c
new file mode 100644
index 000..113df572661
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ffast-math" } */
+/* { dg-require-effective-target c99_runtime } */
+
+extern void link_error (void);
+
+#define POW1OVER(TYPE, C_TY, TY)   \
+  void \
+  pow1over_##TY (TYPE x, TYPE y)   \
+  {\
+TYPE t1 = 1.0##C_TY / x;   \
+TYPE t2 = __builtin_pow##TY (t1, y);   \
+TYPE t3 = -y;  \
+TYPE t4 = __builtin_pow##TY (x, t3);   \
+if (t2 != t4)  \
+  link_error ();   \
+  }\
+
+#define POW0(TYPE, C_TY, TY)   \
+  void \
+  pow0_##TY (TYPE x)   \
+  {\
+TYPE t1 = __builtin_pow##TY (0.0##C_TY, x);\
+if (t1 != 0.0##C_TY)   \
+  link_error ();   \
+  }\
+
+#define TEST_ALL(TYPE, C_TY, TY)   \
+  POW1OVER (TYPE, C_TY, TY)\
+  POW0 (TYPE, C_TY, TY)
+
+TEST_ALL (double, , )
+TEST_ALL (float, f, f)
+TEST_ALL (long double, L, l)
-- 
2.34.1

smime.p7s
Description: S/MIME cryptographic signature


[PATCH] [5/n] remove trapv-*.c special-casing of gcc.dg/vect/ files

2024-10-18 Thread Richard Biener
The following makes -ftrapv explicit.

* gcc.dg/vect/vect.exp: Remove special-casing of tests
named trapv-*
* gcc.dg/vect/trapv-vect-reduc-4.c: Add dg-additional-options -ftrapv.
---
 gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c |  2 +-
 gcc/testsuite/gcc.dg/vect/vect.exp | 10 +++---
 2 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c 
b/gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c
index 24cf1f793c7..e59fbba824f 100644
--- a/gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c
+++ b/gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c
@@ -1,5 +1,5 @@
 /* Disabling epilogues until we find a better way to deal with scans.  */
-/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-additional-options "-ftrapv --param vect-epilogues-nomask=0" } */
 /* { dg-do compile } */
 /* { dg-require-effective-target vect_int } */
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect.exp 
b/gcc/testsuite/gcc.dg/vect/vect.exp
index 14c6168f6ee..37e7bc424f8 100644
--- a/gcc/testsuite/gcc.dg/vect/vect.exp
+++ b/gcc/testsuite/gcc.dg/vect/vect.exp
@@ -115,6 +115,9 @@ foreach flags $VECT_ADDITIONAL_FLAGS {
 et-dg-runtest dg-runtest [lsort \
[glob -nocomplain $srcdir/$subdir/wrapv-*.\[cS\]]] \
$flags $DEFAULT_VECTCFLAGS
+et-dg-runtest dg-runtest [lsort \
+   [glob -nocomplain $srcdir/$subdir/trapv-*.\[cS\]]] \
+   $flags $DEFAULT_VECTCFLAGS
 
 et-dg-runtest dg-runtest [lsort \
[glob -nocomplain $srcdir/$subdir/fast-math-bb-slp-*.\[cS\]]] \
@@ -129,13 +132,6 @@ global SAVED_DEFAULT_VECTCFLAGS
 set SAVED_DEFAULT_VECTCFLAGS $DEFAULT_VECTCFLAGS
 set SAVED_VECT_SLP_CFLAGS $VECT_SLP_CFLAGS
 
-# -ftrapv tests
-set DEFAULT_VECTCFLAGS $SAVED_DEFAULT_VECTCFLAGS
-lappend DEFAULT_VECTCFLAGS "-ftrapv"
-et-dg-runtest dg-runtest [lsort \
-   [glob -nocomplain $srcdir/$subdir/trapv-*.\[cS\]]] \
-   "" $DEFAULT_VECTCFLAGS
-
 # -fno-tree-dce tests
 set DEFAULT_VECTCFLAGS $SAVED_DEFAULT_VECTCFLAGS
 lappend DEFAULT_VECTCFLAGS "-fno-tree-dce"
-- 
2.43.0


Re: [PATCH 7/9] Handle POLY_INT_CSTs in get_nonzero_bits

2024-10-18 Thread Richard Biener
On Fri, 18 Oct 2024, Richard Sandiford wrote:

> This patch extends get_nonzero_bits to handle POLY_INT_CSTs,
> The easiest (but also most useful) case is that the number
> of trailing zeros in the runtime value is at least the number
> of trailing zeros in each individual component.
> 
> In principle, we could do this for coeffs 1 and above only,
> and then OR in coeff 0.  This would give ~0x11 (i.e. -32 | 14) for [14, 32], say.
> But that's future work.
> 
> gcc/
>   * tree-ssanames.cc (get_nonzero_bits): Handle POLY_INT_CSTs.
>   * match.pd (with_possible_nonzero_bits): Likewise.
> 
> gcc/testsuite/
>   * gcc.target/aarch64/sve/cnt_fold_4.c: New test.
> ---
>  gcc/match.pd  |  2 +
>  .../gcc.target/aarch64/sve/cnt_fold_4.c   | 61 +++
>  gcc/tree-ssanames.cc  |  3 +
>  3 files changed, 66 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_4.c
> 
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 540582dc984..41903554478 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -2893,6 +2893,8 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> possibly set.  */
>  (match with_possible_nonzero_bits
>   INTEGER_CST@0)
> +(match with_possible_nonzero_bits
> + POLY_INT_CST@0)
>  (match with_possible_nonzero_bits
>   SSA_NAME@0
>   (if (INTEGRAL_TYPE_P (TREE_TYPE (@0)) || POINTER_TYPE_P (TREE_TYPE (@0)
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_4.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_4.c
> new file mode 100644
> index 000..b7a53701993
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_4.c
> @@ -0,0 +1,61 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +
> +#include 
> +
> +/*
> +** f1:
> +**   cnthx0
> +**   ret
> +*/
> +uint64_t
> +f1 ()
> +{
> +  uint64_t x = svcntw ();
> +  x >>= 2;
> +  return x << 3;
> +}
> +
> +/*
> +** f2:
> +**   [^\n]+
> +**   [^\n]+
> +**   ...
> +**   ret
> +*/
> +uint64_t
> +f2 ()
> +{
> +  uint64_t x = svcntd ();
> +  x >>= 2;
> +  return x << 3;
> +}
> +
> +/*
> +** f3:
> +**   cntbx0, all, mul #4
> +**   ret
> +*/
> +uint64_t
> +f3 ()
> +{
> +  uint64_t x = svcntd ();
> +  x >>= 1;
> +  return x << 6;
> +}
> +
> +/*
> +** f4:
> +**   [^\n]+
> +**   [^\n]+
> +**   ...
> +**   ret
> +*/
> +uint64_t
> +f4 ()
> +{
> +  uint64_t x = svcntd ();
> +  x >>= 2;
> +  return x << 2;
> +}
> diff --git a/gcc/tree-ssanames.cc b/gcc/tree-ssanames.cc
> index 4f83fcbb517..d2d1ec18797 100644
> --- a/gcc/tree-ssanames.cc
> +++ b/gcc/tree-ssanames.cc
> @@ -505,6 +505,9 @@ get_nonzero_bits (const_tree name)
>/* Use element_precision instead of TYPE_PRECISION so complex and
>   vector types get a non-zero precision.  */
>unsigned int precision = element_precision (TREE_TYPE (name));
> +  if (POLY_INT_CST_P (name))
> +return -known_alignment (wi::to_poly_wide (name));
> +

Since you don't need precision can you move this right after the
INTEGER_CST handling?

OK with that change.

Thanks,
Richard.

>if (POINTER_TYPE_P (TREE_TYPE (name)))
>  {
>struct ptr_info_def *pi = SSA_NAME_PTR_INFO (name);
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH 9/9] Record nonzero bits in the irange_bitmask of POLY_INT_CSTs

2024-10-18 Thread Richard Biener
On Fri, 18 Oct 2024, Richard Sandiford wrote:

> At the moment, ranger punts entirely on POLY_INT_CSTs.  Numerical
> ranges are a bit difficult, unless we do start modelling bounds on
> the indeterminates.  But we can at least track the nonzero bits.

OK unless Andrew knows a better proper place to do this.

Thanks,
Richard.

> gcc/
>   * value-query.cc (range_query::get_tree_range): Use get_nonzero_bits
>   to populate the irange_bitmask of a POLY_INT_CST.
> 
> gcc/testsuite/
>   * gcc.target/aarch64/sve/cnt_fold_6.c: New test.
> ---
>  .../gcc.target/aarch64/sve/cnt_fold_6.c   | 75 +++
>  gcc/value-query.cc|  7 ++
>  2 files changed, 82 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_6.c
> 
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_6.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_6.c
> new file mode 100644
> index 000..9d9e1ca9330
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_6.c
> @@ -0,0 +1,75 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +
> +#include 
> +
> +/*
> +** f1:
> +**   ...
> +**   cntb(x[0-9]+)
> +**   ...
> +**   add x[0-9]+, \1, #?16
> +**   ...
> +**   csel[^\n]+
> +**   ret
> +*/
> +uint64_t
> +f1 (int x)
> +{
> +  uint64_t y = x ? svcnth () : svcnth () + 8;
> +  y >>= 3;
> +  y <<= 4;
> +  return y;
> +}
> +
> +/*
> +** f2:
> +**   ...
> +**   (?:and|[al]sr)  [^\n]+
> +**   ...
> +**   ret
> +*/
> +uint64_t
> +f2 (int x)
> +{
> +  uint64_t y = x ? svcnth () : svcnth () + 8;
> +  y >>= 4;
> +  y <<= 5;
> +  return y;
> +}
> +
> +/*
> +** f3:
> +**   ...
> +**   cntw(x[0-9]+)
> +**   ...
> +**   add x[0-9]+, \1, #?16
> +**   ...
> +**   csel[^\n]+
> +**   ret
> +*/
> +uint64_t
> +f3 (int x)
> +{
> +  uint64_t y = x ? svcntd () : svcntd () + 8;
> +  y >>= 1;
> +  y <<= 2;
> +  return y;
> +}
> +
> +/*
> +** f4:
> +**   ...
> +**   (?:and|[al]sr)  [^\n]+
> +**   ...
> +**   ret
> +*/
> +uint64_t
> +f4 (int x)
> +{
> +  uint64_t y = x ? svcntd () : svcntd () + 8;
> +  y >>= 2;
> +  y <<= 3;
> +  return y;
> +}
> diff --git a/gcc/value-query.cc b/gcc/value-query.cc
> index cac2cb5b2bc..34499da1a98 100644
> --- a/gcc/value-query.cc
> +++ b/gcc/value-query.cc
> @@ -375,6 +375,13 @@ range_query::get_tree_range (vrange &r, tree expr, 
> gimple *stmt,
>}
>  
>  default:
> +  if (POLY_INT_CST_P (expr))
> + {
> +   unsigned int precision = TYPE_PRECISION (type);
> +   r.set_varying (type);
> +   r.update_bitmask ({ wi::zero (precision), get_nonzero_bits (expr) });
> +   return true;
> + }
>break;
>  }
>if (BINARY_CLASS_P (expr) || COMPARISON_CLASS_P (expr))
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH 8/9] Try to simplify (X >> C1) * (C2 << C1) -> X * C2

2024-10-18 Thread Richard Biener
On Fri, 18 Oct 2024, Richard Sandiford wrote:

> This patch adds a rule to simplify (X >> C1) * (C2 << C1) -> X * C2
> when the low C1 bits of X are known to be zero.  As with the earlier
> X >> C1 << (C2 + C1) patch, any single conversion is allowed between
> the shift and the multiplication.

OK.

Thanks,
Richard.

> gcc/
>   * match.pd: Simplify (X >> C1) * (C2 << C1) -> X * C2 if the
>   low C1 bits of X are zero.
> 
> gcc/testsuite/
>   * gcc.dg/tree-ssa/shifts-3.c: New test.
>   * gcc.dg/tree-ssa/shifts-4.c: Likewise.
>   * gcc.target/aarch64/sve/cnt_fold_5.c: Likewise.
> ---
>  gcc/match.pd  | 13 
>  gcc/testsuite/gcc.dg/tree-ssa/shifts-3.c  | 65 +++
>  gcc/testsuite/gcc.dg/tree-ssa/shifts-4.c  | 23 +++
>  .../gcc.target/aarch64/sve/cnt_fold_5.c   | 38 +++
>  4 files changed, 139 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/shifts-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/shifts-4.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_5.c
> 
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 41903554478..85f5eeefa08 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -4915,6 +4915,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>&& wi::to_widest (@2) >= wi::to_widest (@1)
>&& wi::to_widest (@1) <= wi::ctz (get_nonzero_bits (@0)))
>(lshift (convert @0) (minus @2 @1
> +
> +/* (X >> C1) * (C2 << C1) -> X * C2 if the low C1 bits of X are zero.  */
> +(simplify
> + (mult (convert? (rshift (with_possible_nonzero_bits2 @0) INTEGER_CST@1))
> +   poly_int_tree_p@2)
> + (with { poly_widest_int factor; }
> +  (if (INTEGRAL_TYPE_P (type)
> +   && wi::ltu_p (wi::to_wide (@1), element_precision (type))
> +   && wi::to_widest (@1) <= wi::ctz (get_nonzero_bits (@0))
> +   && multiple_p (wi::to_poly_widest (@2),
> +   widest_int (1) << tree_to_uhwi (@1),
> +   &factor))
> +   (mult (convert @0) { wide_int_to_tree (type, factor); }
>  #endif
>  
>  /* For (x << c) >> c, optimize into x & ((unsigned)-1 >> c) for
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/shifts-3.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/shifts-3.c
> new file mode 100644
> index 000..dcff518e630
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/shifts-3.c
> @@ -0,0 +1,65 @@
> +/* { dg-options "-O2 -fdump-tree-optimized-raw" } */
> +
> +unsigned int
> +f1 (unsigned int x)
> +{
> +  if (x & 3)
> +__builtin_unreachable ();
> +  x >>= 2;
> +  return x * 20;
> +}
> +
> +unsigned int
> +f2 (unsigned int x)
> +{
> +  if (x & 3)
> +__builtin_unreachable ();
> +  unsigned char y = x;
> +  y >>= 2;
> +  return y * 36;
> +}
> +
> +unsigned long
> +f3 (unsigned int x)
> +{
> +  if (x & 3)
> +__builtin_unreachable ();
> +  x >>= 2;
> +  return (unsigned long) x * 88;
> +}
> +
> +int
> +f4 (int x)
> +{
> +  if (x & 15)
> +__builtin_unreachable ();
> +  x >>= 4;
> +  return x * 48;
> +}
> +
> +unsigned int
> +f5 (int x)
> +{
> +  if (x & 31)
> +__builtin_unreachable ();
> +  x >>= 5;
> +  return x * 3200;
> +}
> +
> +unsigned int
> +f6 (unsigned int x)
> +{
> +  if (x & 1)
> +__builtin_unreachable ();
> +  x >>= 1;
> +  return x * (~0U / 3 & -2);
> +}
> +
> +/* { dg-final { scan-tree-dump-not {<[a-z]*_div_expr,} "optimized" } } */
> +/* { dg-final { scan-tree-dump-not { +/* { dg-final { scan-tree-dump { } */
> +/* { dg-final { scan-tree-dump { } */
> +/* { dg-final { scan-tree-dump {<(?:widen_)?mult_expr, [^,]*, [^,]*, 22,} 
> "optimized" } } */
> +/* { dg-final { scan-tree-dump { } */
> +/* { dg-final { scan-tree-dump { } } */
> +/* { dg-final { scan-tree-dump { "optimized" { target int32 } } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/shifts-4.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/shifts-4.c
> new file mode 100644
> index 000..5638653d0c2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/shifts-4.c
> @@ -0,0 +1,23 @@
> +/* { dg-options "-O2 -fdump-tree-optimized-raw" } */
> +
> +unsigned int
> +f1 (unsigned int x)
> +{
> +  if (x & 3)
> +__builtin_unreachable ();
> +  x >>= 2;
> +  return x * 10;
> +}
> +
> +unsigned int
> +f2 (unsigned int x)
> +{
> +  if (x & 3)
> +__builtin_unreachable ();
> +  x >>= 3;
> +  return x * 24;
> +}
> +
> +/* { dg-final { scan-tree-dump-times { +/* { dg-final { scan-tree-dump { } */
> +/* { dg-final { scan-tree-dump { } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_5.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_5.c
> new file mode 100644
> index 000..3f60e9b4941
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_5.c
> @@ -0,0 +1,38 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +
> +#include 
> +
> +/*
> +** f1:
> +**   ...
> +**   cntd[^\n]+
> +**   ...
> +**   mul [^\n]+
> +**   ret
> +*/
> +uint64_t
> +f1 (int x)
> +{
> +  if (x

[PATCH] [3/n] remove fast-math-*.c special-casing of gcc.dg/vect/ files

2024-10-18 Thread Richard Biener
The following makes -ffast-math explicit.

* gcc.dg/vect/vect.exp: Remove special-casing of tests
named fast-math-*
* gcc.dg/vect/fast-math-bb-slp-call-1.c: Add dg-additional-options
-ffast-math.
* gcc.dg/vect/fast-math-bb-slp-call-2.c: Likewise.
* gcc.dg/vect/fast-math-bb-slp-call-3.c: Likewise.
* gcc.dg/vect/fast-math-ifcvt-1.c: Likewise.
* gcc.dg/vect/fast-math-pr35982.c: Likewise.
* gcc.dg/vect/fast-math-pr43074.c: Likewise.
* gcc.dg/vect/fast-math-pr44152.c: Likewise.
* gcc.dg/vect/fast-math-pr55281.c: Likewise.
* gcc.dg/vect/fast-math-slp-27.c: Likewise.
* gcc.dg/vect/fast-math-slp-38.c: Likewise.
* gcc.dg/vect/fast-math-vect-call-1.c: Likewise.
* gcc.dg/vect/fast-math-vect-call-2.c: Likewise.
* gcc.dg/vect/fast-math-vect-complex-3.c: Likewise.
* gcc.dg/vect/fast-math-vect-outer-7.c: Likewise.
* gcc.dg/vect/fast-math-vect-pow-1.c: Likewise.
* gcc.dg/vect/fast-math-vect-pow-2.c: Likewise.
* gcc.dg/vect/fast-math-vect-pr25911.c: Likewise.
* gcc.dg/vect/fast-math-vect-pr29925.c: Likewise.
* gcc.dg/vect/fast-math-vect-reduc-5.c: Likewise.
* gcc.dg/vect/fast-math-vect-reduc-7.c: Likewise.
* gcc.dg/vect/fast-math-vect-reduc-8.c: Likewise.
* gcc.dg/vect/fast-math-vect-reduc-9.c: Likewise.
---
 gcc/testsuite/gcc.dg/vect/fast-math-bb-slp-call-1.c  |  2 ++
 gcc/testsuite/gcc.dg/vect/fast-math-bb-slp-call-2.c  |  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-bb-slp-call-3.c  |  2 ++
 gcc/testsuite/gcc.dg/vect/fast-math-ifcvt-1.c|  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-pr35982.c|  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-pr43074.c|  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-pr44152.c|  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-pr55281.c|  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-slp-27.c |  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c |  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-vect-call-1.c|  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-vect-call-2.c|  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-vect-complex-3.c |  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-vect-outer-7.c   |  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-vect-pow-1.c |  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-vect-pow-2.c |  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-vect-pr25911.c   |  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-vect-pr29925.c   |  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-vect-reduc-5.c   |  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-vect-reduc-7.c   |  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-vect-reduc-8.c   |  1 +
 gcc/testsuite/gcc.dg/vect/fast-math-vect-reduc-9.c   |  1 +
 gcc/testsuite/gcc.dg/vect/vect.exp   | 10 +++---
 23 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/fast-math-bb-slp-call-1.c 
b/gcc/testsuite/gcc.dg/vect/fast-math-bb-slp-call-1.c
index d9f19d90431..a10e8d677a6 100644
--- a/gcc/testsuite/gcc.dg/vect/fast-math-bb-slp-call-1.c
+++ b/gcc/testsuite/gcc.dg/vect/fast-math-bb-slp-call-1.c
@@ -1,3 +1,5 @@
+/* { dg-additional-options "-ffast-math" } */
+
 #include "tree-vect.h"
 
 extern float copysignf (float, float);
diff --git a/gcc/testsuite/gcc.dg/vect/fast-math-bb-slp-call-2.c 
b/gcc/testsuite/gcc.dg/vect/fast-math-bb-slp-call-2.c
index 76bb044914f..234b31b6e44 100644
--- a/gcc/testsuite/gcc.dg/vect/fast-math-bb-slp-call-2.c
+++ b/gcc/testsuite/gcc.dg/vect/fast-math-bb-slp-call-2.c
@@ -1,4 +1,5 @@
 /* { dg-require-effective-target vect_double } */
+/* { dg-additional-options "-ffast-math" } */
 
 #include "tree-vect.h"
 
diff --git a/gcc/testsuite/gcc.dg/vect/fast-math-bb-slp-call-3.c 
b/gcc/testsuite/gcc.dg/vect/fast-math-bb-slp-call-3.c
index fd2c8be695a..a57de47f77c 100644
--- a/gcc/testsuite/gcc.dg/vect/fast-math-bb-slp-call-3.c
+++ b/gcc/testsuite/gcc.dg/vect/fast-math-bb-slp-call-3.c
@@ -1,3 +1,5 @@
+/* { dg-additional-options "-ffast-math" } */
+
 #include "tree-vect.h"
 
 extern double sqrt (double);
diff --git a/gcc/testsuite/gcc.dg/vect/fast-math-ifcvt-1.c 
b/gcc/testsuite/gcc.dg/vect/fast-math-ifcvt-1.c
index f51202e0855..a8d94440cd9 100644
--- a/gcc/testsuite/gcc.dg/vect/fast-math-ifcvt-1.c
+++ b/gcc/testsuite/gcc.dg/vect/fast-math-ifcvt-1.c
@@ -1,5 +1,6 @@
 /* PR 47892 */
 /* { dg-do compile } */
+/* { dg-additional-options "-ffast-math" } */
 /* { dg-require-effective-target vect_float } */
 /* { dg-require-effective-target vect_condition } */
 
diff --git a/gcc/testsuite/gcc.dg/vect/fast-math-pr35982.c 
b/gcc/testsuite/gcc.dg/vect/fast-math-pr35982.c
index 50ea7ffc1b9..dc1a174b6e7 100644
--- a/gcc/testsuite/gcc.dg/vect/fast-math-pr35982.c
+++ b/gcc/testsuite/gcc.dg/vect/fast-math-pr35982.c
@@ -1,4 +1,5 @@
 /* { dg-do compile } */
+/* { dg-additional-options "-ffast-math" } */
 /* { dg-require-effective-target vect_float } */
 /* { dg-require-effective-target vect_int } */
 /* { dg-

[PATCH] [2/n] remove no-vfa-*.c special-casing of gcc.dg/vect/ files

2024-10-18 Thread Richard Biener
The following makes --param vect-max-version-for-alias-checks=0
explicit.

* gcc.dg/vect/vect.exp: Remove special-casing of tests
named no-vfa-*
* gcc.dg/vect/no-vfa-pr29145.c: Add dg-additional-options
--param vect-max-version-for-alias-checks=0.
* gcc.dg/vect/no-vfa-vect-101.c: Likewise.
* gcc.dg/vect/no-vfa-vect-102.c: Likewise.
* gcc.dg/vect/no-vfa-vect-102a.c: Likewise.
* gcc.dg/vect/no-vfa-vect-37.c: Likewise.
* gcc.dg/vect/no-vfa-vect-43.c: Likewise.
* gcc.dg/vect/no-vfa-vect-45.c: Likewise.
* gcc.dg/vect/no-vfa-vect-49.c: Likewise.
* gcc.dg/vect/no-vfa-vect-51.c: Likewise.
* gcc.dg/vect/no-vfa-vect-53.c: Likewise.
* gcc.dg/vect/no-vfa-vect-57.c: Likewise.
* gcc.dg/vect/no-vfa-vect-61.c: Likewise.
* gcc.dg/vect/no-vfa-vect-79.c: Likewise.
* gcc.dg/vect/no-vfa-vect-depend-1.c: Likewise.
* gcc.dg/vect/no-vfa-vect-depend-2.c: Likewise.
* gcc.dg/vect/no-vfa-vect-depend-3.c: Likewise.
* gcc.dg/vect/no-vfa-vect-dv-2.c: Likewise.
---
 gcc/testsuite/gcc.dg/vect/no-vfa-pr29145.c   |  2 +-
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-101.c  |  1 +
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-102.c  |  1 +
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-102a.c |  1 +
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-37.c   |  1 +
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-43.c   |  1 +
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-45.c   |  1 +
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-49.c   |  1 +
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-51.c   |  1 +
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-53.c   |  1 +
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-57.c   |  1 +
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-61.c   |  1 +
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-79.c   |  1 +
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-depend-1.c |  1 +
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-depend-2.c |  1 +
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-depend-3.c |  1 +
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-dv-2.c |  2 +-
 gcc/testsuite/gcc.dg/vect/vect.exp   | 10 +++---
 18 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/no-vfa-pr29145.c 
b/gcc/testsuite/gcc.dg/vect/no-vfa-pr29145.c
index 45cca1d1991..cb8c72bdea3 100644
--- a/gcc/testsuite/gcc.dg/vect/no-vfa-pr29145.c
+++ b/gcc/testsuite/gcc.dg/vect/no-vfa-pr29145.c
@@ -1,5 +1,5 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-additional-options "-fno-ipa-icf" } */
+/* { dg-additional-options "--param vect-max-version-for-alias-checks=0 
-fno-ipa-icf" } */
 
 #include 
 #include "tree-vect.h"
diff --git a/gcc/testsuite/gcc.dg/vect/no-vfa-vect-101.c 
b/gcc/testsuite/gcc.dg/vect/no-vfa-vect-101.c
index 73b92177dab..4b2b0f60b4c 100644
--- a/gcc/testsuite/gcc.dg/vect/no-vfa-vect-101.c
+++ b/gcc/testsuite/gcc.dg/vect/no-vfa-vect-101.c
@@ -1,4 +1,5 @@
 /* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "--param vect-max-version-for-alias-checks=0" } */
 
 #include 
 #include 
diff --git a/gcc/testsuite/gcc.dg/vect/no-vfa-vect-102.c 
b/gcc/testsuite/gcc.dg/vect/no-vfa-vect-102.c
index 9a3fdab128a..26b9cd1c427 100644
--- a/gcc/testsuite/gcc.dg/vect/no-vfa-vect-102.c
+++ b/gcc/testsuite/gcc.dg/vect/no-vfa-vect-102.c
@@ -1,4 +1,5 @@
 /* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "--param vect-max-version-for-alias-checks=0" } */
 
 #include 
 #include 
diff --git a/gcc/testsuite/gcc.dg/vect/no-vfa-vect-102a.c 
b/gcc/testsuite/gcc.dg/vect/no-vfa-vect-102a.c
index 439347c3bb1..5b9905a04ee 100644
--- a/gcc/testsuite/gcc.dg/vect/no-vfa-vect-102a.c
+++ b/gcc/testsuite/gcc.dg/vect/no-vfa-vect-102a.c
@@ -1,4 +1,5 @@
 /* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "--param vect-max-version-for-alias-checks=0" } */
 
 #include 
 #include 
diff --git a/gcc/testsuite/gcc.dg/vect/no-vfa-vect-37.c 
b/gcc/testsuite/gcc.dg/vect/no-vfa-vect-37.c
index f59eb69d99f..347af57b7c6 100644
--- a/gcc/testsuite/gcc.dg/vect/no-vfa-vect-37.c
+++ b/gcc/testsuite/gcc.dg/vect/no-vfa-vect-37.c
@@ -1,4 +1,5 @@
 /* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "--param vect-max-version-for-alias-checks=0" } */
 
 #include 
 #include "tree-vect.h"
diff --git a/gcc/testsuite/gcc.dg/vect/no-vfa-vect-43.c 
b/gcc/testsuite/gcc.dg/vect/no-vfa-vect-43.c
index 6b4542f5948..d06079e3d72 100644
--- a/gcc/testsuite/gcc.dg/vect/no-vfa-vect-43.c
+++ b/gcc/testsuite/gcc.dg/vect/no-vfa-vect-43.c
@@ -1,4 +1,5 @@
 /* { dg-require-effective-target vect_float } */
+/* { dg-additional-options "--param vect-max-version-for-alias-checks=0" } */
 
 #include 
 #include "tree-vect.h"
diff --git a/gcc/testsuite/gcc.dg/vect/no-vfa-vect-45.c 
b/gcc/testsuite/gcc.dg/vect/no-vfa-vect-45.c
index 5db05288c81..9981a459e92 100644
--- a/gcc/testsuite/gcc.dg/vect/no-vfa-vect-45.c
+++ b/gcc/testsuite/gcc.dg/vect/no-vfa-vect-45.c
@@ -1,4 +1,5 @@

Re: [PATCH v2] arm: [MVE intrinsics] Fix support for predicate constants [PR target/114801]

2024-10-18 Thread Andre Vieira (lists)

Hi,

This looks like an acceptable workaround. We special-case behavior that 
I'm not sure we can express in ways GCC can understand or make use of, 
whilst at the same time we keep expressing behavior it does understand 
and can optimize.


Nice idea!

LGTM, needs maintainer approval though.

Kind regards,
Andre

On 07/05/2024 17:19, Christophe Lyon wrote:

In this PR, we have to handle a case where MVE predicates are supplied
as a const_int, where individual predicates have illegal boolean
values (such as 0xc for a 4-bit boolean predicate).  To avoid the ICE,
we hide the constant behind an unspec.

On MVE V8BI and V4BI multi-bit masks are interpreted byte-by-byte at
instruction level, see
https://developer.arm.com/documentation/101028/0012/14--M-profile-Vector-Extension--MVE--intrinsics.

This is a workaround until we change such predicates representation to
V16BImode.
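
The shape of the lane-uniformity test is easy to see in isolation (a
minimal sketch, not from the patch; the function name and interface are
made up for illustration):

  /* Each V8BI lane owns 2 bits and each V4BI lane 4 bits of the 16-bit
     mask; gen_lowpart is only safe when all bits of a lane agree.
     0xc fails this for a 4-bit lane (lane bits 1100).  */
  static int
  mve_pred_lanes_uniform_p (unsigned int xi, int is_v4bi)
  {
    if ((xi & 0x5555) != ((xi >> 1) & 0x5555))
      return 0;
    return !is_v4bi || (xi & 0x3333) == ((xi >> 2) & 0x3333);
  }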

2024-05-06  Christophe Lyon  
Jakub Jelinek  

PR target/114801
gcc/
* config/arm/arm-mve-builtins.cc
(function_expander::add_input_operand): Handle CONST_INT
predicates.
* mve.md (set_mve_const_pred): New pattern.
* unspec.md (MVE_PRED): New unspec.

gcc/testsuite/
* gcc.target/arm/mve/pr114801.c: New test.
---
  gcc/config/arm/arm-mve-builtins.cc  | 27 ++-
  gcc/config/arm/mve.md   | 12 +++
  gcc/config/arm/unspecs.md   |  1 +
  gcc/testsuite/gcc.target/arm/mve/pr114801.c | 37 +
  4 files changed, 76 insertions(+), 1 deletion(-)
  create mode 100644 gcc/testsuite/gcc.target/arm/mve/pr114801.c

diff --git a/gcc/config/arm/arm-mve-builtins.cc 
b/gcc/config/arm/arm-mve-builtins.cc
index 6a5775c67e5..7d5af649857 100644
--- a/gcc/config/arm/arm-mve-builtins.cc
+++ b/gcc/config/arm/arm-mve-builtins.cc
@@ -2205,7 +2205,32 @@ function_expander::add_input_operand (insn_code icode, 
rtx x)
mode = GET_MODE (x);
  }
else if (VALID_MVE_PRED_MODE (mode))
-x = gen_lowpart (mode, x);
+{
+  if (CONST_INT_P (x) && (mode == V8BImode || mode == V4BImode))
+   {
+ /* In V8BI or V4BI each element has 2 or 4 bits, if those
+bits aren't all the same, gen_lowpart might ICE.  Hide
+the move behind an unspec to avoid this.
+V8BI and V4BI multi-bit masks are interpreted
+byte-by-byte at instruction level, see
+
https://developer.arm.com/documentation/101028/0012/14--M-profile-Vector-Extension--MVE--intrinsics.
  */
+ unsigned HOST_WIDE_INT xi = UINTVAL (x);
+ if ((xi & 0x5555) != ((xi >> 1) & 0x5555)
+ || (mode == V4BImode
+ && (xi & 0x3333) != ((xi >> 2) & 0x3333)))
+   {
+ rtx unspec_x;
+ unspec_x = gen_rtx_UNSPEC (HImode, gen_rtvec (1, x), MVE_PRED);
+ x = force_reg (HImode, unspec_x);
+   }
+
+   }
+  else if (SUBREG_P (x))
+   /* gen_lowpart on a SUBREG can ICE.  */
+   x = force_reg (GET_MODE (x), x);
+
+  x = gen_lowpart (mode, x);
+}
  
m_ops.safe_grow (m_ops.length () + 1, true);

create_input_operand (&m_ops.last (), x, mode);
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 35916f62604..d337422d695 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -6621,3 +6621,15 @@ (define_expand "@arm_mve_reinterpret"
}
}
  )
+
+;; Hide predicate constants from optimizers
+(define_insn "set_mve_const_pred"
+ [(set
+   (match_operand:HI 0 "s_register_operand" "=r")
+   (unspec:HI [(match_operand:HI 1 "general_operand" "n")] MVE_PRED))]
+  "TARGET_HAVE_MVE"
+{
+return "movw%?\t%0, %L1\t%@ set_mve_const_pred";
+}
+  [(set_attr "type" "mov_imm")]
+)
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 4713ec840ab..336f2fe08e6 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -1256,4 +1256,5 @@ (define_c_enum "unspec" [
SQRSHRL_48
VSHLCQ_M_
REINTERPRET
+  MVE_PRED
  ])
diff --git a/gcc/testsuite/gcc.target/arm/mve/pr114801.c 
b/gcc/testsuite/gcc.target/arm/mve/pr114801.c
new file mode 100644
index 000..fb3e4d855f9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/pr114801.c
@@ -0,0 +1,37 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2" } */
+/* { dg-add-options arm_v8_1m_mve } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#include 
+
+/*
+** test_32:
+**...
** movw	r[0-9]+, 52428	@ set_mve_const_pred
+**...
+*/
+uint32x4_t test_32() {
+  return vdupq_m_n_u32(vdupq_n_u32(0x), 0, 0x);
+}
+
+/*
+** test_16:
+**...
** movw	r[0-9]+, 6927	@ set_mve_const_pred
+**...
+*/
+uint16x8_t test_16() {
+  return vdupq_m_n_u16(vdupq_n_u16(0x), 0, 0x1b0f);
+}
+
+/*
+** test_8:
+**...
+** mov r[0-9]+, #23055 @ movhi
+**...
+*/
+uint8x16_t test_8() {
+  return vdupq_m_n_

[RFC/RFA] [PATCH v5 10/12] Verify detected CRC loop with symbolic execution and LFSR matching.

2024-10-18 Thread Mariam Arutunian
Symbolically execute potential CRC loops and check whether each loop
actually calculates a CRC (using LFSR matching).
The calculated CRC and the constructed LFSR are compared on each
iteration of the potential CRC loop.
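
For reference, the kind of loop this verification targets is the classic
bit-at-a-time CRC update (an illustrative sketch; the polynomial and the
function name are examples, not taken from the patch):

  #include <stdint.h>

  /* Bit-reversed CRC-16 update; 0xA001 is the reflected CRC-16/MODBUS
     polynomial, used here only as an example.  */
  uint16_t
  crc16_update (uint16_t crc, uint8_t data)
  {
    crc ^= data;
    for (int i = 0; i < 8; i++)
      crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : crc >> 1;
    return crc;
  }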

  gcc/

* Makefile.in (OBJS): Add crc-verification.o.
* crc-verification.cc: New file.
* crc-verification.h: New file.
* gimple-crc-optimization.cc (loop_calculates_crc): New function.
(is_output_crc): Likewise.
(swap_crc_and_data_if_needed): Likewise.
(validate_crc_and_data): Likewise.
(optimize_crc_loop): Likewise.
(get_output_phi): Likewise.
(execute): Add check whether potential CRC loop calculates CRC.

  gcc/sym-exec/

* sym-exec-state.cc (create_reversed_lfsr): New function.
(create_forward_lfsr): Likewise.
(last_set_bit): Likewise.
(create_lfsr): Likewise.
* sym-exec-state.h (is_bit_vector): Reorder, make the function public
and static.
(create_reversed_lfsr): New static function declaration.
(create_forward_lfsr): New static function declaration.

Signed-off-by: Mariam Arutunian 
Mentored-by: Jeff Law 
---
 gcc/Makefile.in|1 +
 gcc/crc-verification.cc| 1298 
 gcc/crc-verification.h |  161 
 gcc/gimple-crc-optimization.cc |  327 +++-
 gcc/sym-exec/sym-exec-state.cc |  101 +++
 gcc/sym-exec/sym-exec-state.h  |   11 +
 6 files changed, 1897 insertions(+), 2 deletions(-)
 create mode 100644 gcc/crc-verification.cc
 create mode 100644 gcc/crc-verification.h

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 6eab34d62bb..6b8a37a180c 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1717,6 +1717,7 @@ OBJS = \
 	tree-iterator.o \
 	tree-logical-location.o \
 	tree-loop-distribution.o \
+	crc-verification.o \
 	gimple-crc-optimization.o \
 	sym-exec/sym-exec-expression.o \
 	sym-exec/sym-exec-state.o \
diff --git a/gcc/crc-verification.cc b/gcc/crc-verification.cc
new file mode 100644
index 000..a556bc92467
--- /dev/null
+++ b/gcc/crc-verification.cc
@@ -0,0 +1,1298 @@
+/* Execute all paths of the loop symbolically.
+   Calculate the value of the polynomial by executing the loop with real
+   values to create the LFSR state.
+   After each iteration, check that the final states of the calculated CRC
+   values match the determined LFSR.
+   Copyright (C) 2022-2024 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+.   */
+
+#include "crc-verification.h"
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "tree.h"
+#include "gimple.h"
+#include "ssa.h"
+#include "gimple-iterator.h"
+#include "tree-cfg.h"
+#include "cfganal.h"
+#include "tree-ssa-loop.h"
+
+/* Check whether the defined variable is used outside the loop; only
+   the CRC's definition is allowed to be used outside the loop.  */
+
+bool
+crc_symbolic_execution::is_used_outside_the_loop (tree def)
+{
+  imm_use_iterator imm_iter;
+  gimple *use_stmt;
+  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, def)
+{
+  if (!flow_bb_inside_loop_p (m_crc_loop, use_stmt->bb))
+	{
+	  if (is_a <gphi *> (use_stmt)
+	      && as_a <gphi *> (use_stmt) == m_output_crc)
+	return false;
+	  if (dump_file)
+	fprintf (dump_file, "Defined variable is used outside the loop.\n");
+	  return true;
+	}
+}
+  return false;
+}
+
+/* Calculate the value of the rhs operation of the GS assignment statement
+   and assign it to the lhs variable.  */
+
+bool
+crc_symbolic_execution::execute_assign_statement (const gassign *gs)
+{
+  enum tree_code rhs_code = gimple_assign_rhs_code (gs);
+  tree lhs = gimple_assign_lhs (gs);
+  if (dump_file && (dump_flags & TDF_DETAILS))
+fprintf (dump_file, "lhs type : %s \n",
+	 get_tree_code_name (TREE_CODE (lhs)));
+
+  /* This will filter out some normal cases too, e.g. the use of an array.  */
+  if (TREE_CODE (lhs) != SSA_NAME)
+return false;
+
+  /* Check uses only when m_output_crc is known.  */
+  if (m_output_crc)
+if (is_used_outside_the_loop (lhs))
+  return false;
+
+  if (gimple_num_ops (gs) != 2 && gimple_num_ops (gs) != 3)
+{
+  if (dump_file)
+	fprintf (dump_file,
+		 "Warning, encountered unsupported operation, "
+		 "with %s code while executing assign statement!\n",
+		 get_tree_code_name (rhs_code));
+  return false;
+}
+
+  tree op1 = gimple_assign_rhs1 (gs);
+  tree op2 = nullptr;
+
+  if (gimple_num_ops (gs) == 3)
    op2 = gimple_assign_rhs2 (gs);

Re: [WIP RFC] libstdc++: add module std

2024-10-18 Thread Iain Sandoe



> On 18 Oct 2024, at 14:38, Jason Merrill  wrote:
> 
> This patch is not ready for integration, but I'd like to get feedback on the
> approach (and various specific questions below).
> 
> -- 8< --
> 
> This patch introduces an installed source form of module std and std.compat.
> To find them, we install a libstdc++.modules.json file alongside
> libstdc++.so, which tells the build system where the files are and any
> special flags it should use when compiling them (none, in our case).  The
> format is from a proposal in SG15.
> 
> The build system can find this file with
> gcc -print-file-name=libstdc++.modules.json
> 
> It seems preferable to use a relative path from this file to the sources so
> that moving the installation doesn't break the reference, but I didn't see
> any obvious way to compute that without relying on coreutils, perl, or
> python, so I wrote a POSIX shell script for it.
> 
> Currently this installs the sources under $(pkgdata), i.e.
> /usr/share/libstdc++/modules.  It could also make sense to install them
> under $(gxx_include_dir), i.e. /usr/include/c++/15/modules.  And/or the
> subdirectory could be "miu" (module interface unit) instead of "modules".

I’d think that $(gxx_include_dir)/modules would be a good place, since that
allows us to tie it to the GCC version in side-by-side installs.  Presumably 
that
might also simplify finding the relevant sources?

I guess it could be just “module”, but my vote would be for an obvious name
like that or “std-module”.

> 
> The sources currently have the extension .cc, like other source files.
> Alternatively, they could use one of the module interface unit extensions,
> perhaps .ccm.
> 
> std.cc started with m.cencora's implementation in PR114600.  I've made some
> adjustments, but more is probably desirable, e.g. of the 
> handling of namespace ranges, and to remove exports of templates that are
> only specialized in a particular header.
> 
> The std module is missing exports for some newer headers, including some
> that are already implemented (, , ).  I've
> added some FIXMEs where I noticed missing bits.
> 
> Since bits/stdc++.h also intends to include the whole standard library, I
> include it rather than duplicate it.  But stdc++.h comments out ,
> so I include it separately.  Alternatively, we could uncomment it in
> stdc++.h.
> 
> stdc++.h also doesn't include the eternally deprecated .  There
> are some other deprecated facilities that I notice are included: 
> and float_denorm_style, at least.  It would be nice for L{E,}WG to clarify
> whether module std is intended to include interfaces that were deprecated in
> C++23, since ancient code isn't going to be relying on module std.
> 
> If they are supposed to included, do we also want to keep exporting them in
> C++26, where they are removed from the standard?
> 
> It seemed most convenient for the two files to be monolithic so we don't
> need to worry about include paths.  So the C library names that module
> std.compat exports in both namespace std and :: in module are a block of
> code that is identical in both files, adjusted based on whether the macro
> STD_COMPAT is defined before the block.
> 
> In this implementation std.compat imports std; it would also be valid for it
> to duplicate everything in std.  I see the libc++ std.compat also imports
> std.
> 
> Is it useful for std.cc to live in a subdirectory of c++23 as in this patch, 
> or
> should it be in c++23 itself?  Or elsewhere?

maybe I’m missing a point here ...

…  is this not the source code that the end-user needs to build to generate
the ’std’ module for their set of flags etc?

So it and the installed headers need to be available in the installation to the
end-user - and maybe it could live alongside the json?

Iain

> 
> This patch doesn't yet provide a convenient way for a user to find std.cc.
> 
> libstdc++-v3/ChangeLog:
> 
>   * src/c++23/Makefile.am: Add module std/std.compat.
>   * src/c++23/Makefile.in: Regenerate.
>   * src/c++23/modules/std.cc: New file.
>   * src/c++23/modules/std.compat.cc: New file.
>   * src/c++23/libstdc++.modules.json.in: New file.
> 
> contrib/ChangeLog:
> 
>   * relpath.sh: New file.
> ---
> libstdc++-v3/src/c++23/modules/std.cc | 3575 +
> libstdc++-v3/src/c++23/modules/std.compat.cc  |  640 +++
> contrib/relpath.sh|   70 +
> libstdc++-v3/src/c++23/Makefile.am|   18 +
> libstdc++-v3/src/c++23/Makefile.in|  133 +-
> .../src/c++23/libstdc++.modules.json.in   |   17 +
> 6 files changed, 4436 insertions(+), 17 deletions(-)
> create mode 100644 libstdc++-v3/src/c++23/modules/std.cc
> create mode 100644 libstdc++-v3/src/c++23/modules/std.compat.cc
> create mode 100755 contrib/relpath.sh
> create mode 100644 libstdc++-v3/src/c++23/libstdc++.modules.json.in
> 
> diff --git a/libstdc++-v3/src/c++23/modules/std.cc 
> b/libstdc++-v3/src/c++23/modules/std.cc
> new file mode 100644
> inde

Re: [PATCH] diagnostics: libcpp: Improve locations for _Pragma lexing diagnostics [PR114423]

2024-10-18 Thread Lewis Hyatt
On Fri, Oct 18, 2024 at 11:25 AM David Malcolm  wrote:
> >if (!pfile->cb.diagnostic)
> >  abort ();
> > -  ret = pfile->cb.diagnostic (pfile, level, reason, richloc,
> > _(msgid), ap);
> > -
> > -  return ret;
> > +  if (pfile->diagnostic_override_loc && level != CPP_DL_NOTE)
> > +{
> > +  rich_location rc2{pfile->line_table, pfile-
> > >diagnostic_override_loc};
> > +  return pfile->cb.diagnostic (pfile, level, reason, &rc2,
> > _(msgid), ap);
> > +}
>
> This will effectively override the primary location in the
> rich_location, but by using a second rich_location instance it will
> also ignore any secondary locations and fix-it hints.
>
> This might will be what we want here, but did you consider
>   richloc.set_range (0, pfile->diagnostic_override_loc,
>  SHOW_RANGE_WITH_CARET);
> to reset the primary location?
>
> Otherwise, looks good to me.
>
> [...snip...]
>
> Thanks
> Dave
>

Thanks for taking a look! My thinking was, when libcpp produces tokens
from a _Pragma string, basically every location_t that it generates is
wrong and shouldn't be used. Because it doesn't actually set up the
line_map, it gets something random that's just sorta close to
reasonable. So I think it makes sense to discard fixits and secondary
locations too.

libcpp does use rich_location pretty sparingly, but I suppose the goal
is to use it more over time. We use one fixit hint for invalid
directive names currently, that one can't show up in a _Pragma though.
Right now we do define an encoding_rich_location subclass for escaping
unprintable bytes, which inherits rich_location and just adds a new
constructor to set the escape flag when it is created. You are
definitely right that this patch as of now loses that information.

Here's a source that uses an improperly normalized character:

_Pragma("ோ")

Currently we output:

t.cpp:1:1: warning: ‘\U0bc7\U0bbe’ is not in NFC [-Wnormalized=]
1 | _Pragma("")
  | ^~

With the patch we output:
t.cpp:1:1: warning: ‘\U0bc7\U0bbe’ is not in NFC [-Wnormalized=]
1 | _Pragma("ோ")
  | ^~~

So the main location range is improved (it underlines all of _Pragma
instead of most of it), but we have indeed lost the intended feature
that the incorrect bytes are escaped on the source line.

For this particular case I could improve it with a one line addition like

rc2.set_escape_on_output (richloc->escape_on_output_p ());

and that would actually handle all currently needed cases since we
don't use a lot of rich_locations in libcpp. It would just become
stale if some other option gets added to rich_location in the future
that we also should preserve. I think to fix it in a fully general
way, it would be necessary to add a new interface to class
rich_location. It already has a method to delete all the fixit hints,
it would also need a method to delete all the ranges. Then we could
make a copy of the richloc and just delete everything we don't want to
preserve. Do you have a preference one way or the other?
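
For concreteness, the narrow version of that fix would look like this (a
sketch against the hunk quoted above, not a tested patch):

  rich_location rc2 (pfile->line_table, pfile->diagnostic_override_loc);
  /* Carry over the escape flag so unprintable bytes are still escaped
     on the quoted source line.  */
  rc2.set_escape_on_output (richloc->escape_on_output_p ());
  return pfile->cb.diagnostic (pfile, level, reason, &rc2, _(msgid), ap);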

By the way, your suggestion was to directly modify richloc... these
functions do take the richloc by non-const pointer, but is it OK to
modify it or is a copy needed? The functions like cpp_warning_at() are
exposed in the public header file, although at the moment, all call
sites are within libcpp and don't look like they would notice if the
argument was modified. I wasn't sure what is the expected interface
here.

Thanks again...

-Lewis


Re: [PATCH] target: Fix asm codegen for vfpclasss* and vcvtph2* instructions

2024-10-18 Thread Antoni Boucher

Thanks for the review.
Here's the updated patch.

Le 2024-10-17 à 21 h 50, Hongtao Liu a écrit :

On Fri, Oct 18, 2024 at 9:08 AM Antoni Boucher  wrote:


Hi.
This is a patch for bug 116725.
I'm not sure if it is a good fix, but it seems to do the job.
If you can think of better comments than the ones I wrote to explain
what's happening, I'm open to suggestions.



@@ -7548,7 +7548,8 @@ (define_insn 
"avx512fp16_vcvtph2_<
 [(match_operand: 1 "" 
"")]
 UNSPEC_US_FIX_NOTRUNC))]
   "TARGET_AVX512FP16 && "
-  "vcvtph2\t{%1, 
%0|%0, %1}"
+;; %X1 so that we don't emit any *WORD PTR for -masm=intel.
+  "vcvtph2\t{%1, 
%0|%0, %X1}"

Could you define something like

  ;; Pointer size override for 16-bit upper-convert modes (Intel asm dialect)
  (define_mode_attr iptrh
   [(V32HI "") (V16SI "") (V8DI "")
(V16HI "") (V8SI "") (V4DI "q")
(V8HI "") (V4SI "q") (V2DI "k")])


For my own understanding, was my usage of %X equivalent to a mode_attr 
with an empty string for all cases?

How did you know which one needed an empty string?



And use
+  "vcvtph2\t{%1,
%0|%0, %1}"


   [(set_attr "type" "ssecvt")
(set_attr "prefix" "evex")
(set_attr "mode" "")])
@@ -29854,7 +29855,8 @@ (define_insn 
"avx512dq_vmfpclass"
  UNSPEC_FPCLASS)
(const_int 1)))]
"TARGET_AVX512DQ || VALID_AVX512FP16_REG_MODE(mode)"
-   "vfpclass\t{%2, %1, 
%0|%0, %1, %2}";
+;; %X1 so that we don't emit any *WORD PTR for -masm=intel.
+   "vfpclass\t{%2, %1, 
%0|%0, %X1, %2}";


For scalar memory operand rewrite, we usually use <iptr>, so
"vfpclass\t{%2, %1,
%0|%0,
%1, %2}";




From 358273337bafa62d669064d83c8805cb6e4b4523 Mon Sep 17 00:00:00 2001
From: Antoni Boucher 
Date: Mon, 23 Sep 2024 18:58:47 -0400
Subject: [PATCH] target: Fix asm codegen for vfpclasss* and vcvtph2*
 instructions

This only happens when using -masm=intel.

gcc/ChangeLog:
PR target/116725
* config/i386/sse.md: Fix asm generation.

gcc/testsuite/ChangeLog:
PR target/116725
* gcc.target/i386/pr116725.c: Add test using those AVX builtins.
---
 gcc/config/i386/sse.md   | 10 --
 gcc/testsuite/gcc.target/i386/pr116725.c | 40 
 2 files changed, 48 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr116725.c

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 685bce3094a..9c1fe8bcfce 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -1303,6 +1303,12 @@ (define_mode_attr iptr
(V8HF "w") (V8BF "w") (V4SF "k") (V2DF "q")
(HF "w") (BF "w") (SF "k") (DF "q")])
 
+;; Pointer size override for 16-bit upper-convert modes (Intel asm dialect)
+(define_mode_attr iptrh
+ [(V32HI "") (V16SI "") (V8DI "")
+  (V16HI "") (V8SI "") (V4DI "q")
+  (V8HI "") (V4SI "q") (V2DI "k")])
+
 ;; Mapping of vector modes to VPTERNLOG suffix
 (define_mode_attr ternlogsuffix
   [(V8DI "q") (V4DI "q") (V2DI "q")
@@ -7548,7 +7554,7 @@ (define_insn "avx512fp16_vcvtph2_<
 	   [(match_operand: 1 "" "")]
 	   UNSPEC_US_FIX_NOTRUNC))]
   "TARGET_AVX512FP16 && "
-  "vcvtph2\t{%1, %0|%0, %1}"
+  "vcvtph2\t{%1, %0|%0, %1}"
   [(set_attr "type" "ssecvt")
(set_attr "prefix" "evex")
(set_attr "mode" "")])
@@ -29854,7 +29860,7 @@ (define_insn "avx512dq_vmfpclass"
 	UNSPEC_FPCLASS)
 	  (const_int 1)))]
"TARGET_AVX512DQ || VALID_AVX512FP16_REG_MODE(mode)"
-   "vfpclass\t{%2, %1, %0|%0, %1, %2}";
+   "vfpclass\t{%2, %1, %0|%0, %1, %2}";
   [(set_attr "type" "sse")
(set_attr "length_immediate" "1")
(set_attr "prefix" "evex")
diff --git a/gcc/testsuite/gcc.target/i386/pr116725.c b/gcc/testsuite/gcc.target/i386/pr116725.c
new file mode 100644
index 000..9e5070e16e7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr116725.c
@@ -0,0 +1,40 @@
+/* PR gcc/116725 */
+/* { dg-do assemble } */
+/* { dg-options "-masm=intel -mavx512dq -mavx512fp16 -mavx512vl" } */
+/* { dg-require-effective-target masm_intel } */
+
+#include 
+
+typedef double __m128d __attribute__ ((__vector_size__ (16)));
+typedef float __m128f __attribute__ ((__vector_size__ (16)));
+typedef int __v16si __attribute__ ((__vector_size__ (64)));
+typedef _Float16 __m256h __attribute__ ((__vector_size__ (32)));
+typedef long long __m512i __attribute__((__vector_size__(64)));
+typedef _Float16 __m128h __attribute__ ((__vector_size__ (16), __may_alias__));
+typedef int __v4si __attribute__ ((__vector_size__ (16)));
+typedef long long __m128i __attribute__ ((__vector_size__ (16)));
+
+int main(void) {
+__m128d vec = {1.0, 2.0};
+char res = __builtin_ia32_fpclasssd_mask(vec, 1, 1);
+printf("%d\n", res);
+
+__m128f vec2 = {1.0, 2.0, 3.0, 4.0};
+char res2 = __builtin_ia32_fpclassss_mask(vec2, 1, 1);
+printf("%d\n", res2);
+
+__m128h vec3 = {2.0, 1.0, 3.0};
+__v4si vec4 = {};
+__v4si res3 = __builtin_ia32_vcvtph2dq128_mask(vec3, vec4, -1);
+printf("%d\n", res3[0]);
+
+__v4si res4 = __builtin_ia32

[PATCH 0/9] Add more folds related to exact division

2024-10-18 Thread Richard Sandiford
This series adds some more rules for identifying and folding
exact divisions, including shifts right by C when the low C
bits are known to be zero.  It also extends some existing rules
to handle poly_ints.

The original motivation was to improve address arithmetic for
some upcoming SVE testcases.

Bootstrapped & regression-tested on aarch64-linux-gnu.  OK to install?

Richard


Richard Sandiford (9):
  Make more places handle exact_div like trunc_div
  Use get_nonzero_bits to simplify trunc_div to exact_div
  Simplify X /[ex] Y cmp Z -> X cmp (Y * Z)
  Simplify (X /[ex] C1) * (C1 * C2) -> X * C2
  Generalise ((X /[ex] A) +- B) * A -> X +- A * B rule
  Try to simplify (X >> C1) << (C1 + C2) -> X << C2
  Handle POLY_INT_CSTs in get_nonzero_bits
  Try to simplify (X >> C1) * (C2 << C1) -> X * C2
  Record nonzero bits in the irange_bitmask of POLY_INT_CSTs

 gcc/match.pd  | 164 +-
 gcc/range-op-mixed.h  |   9 +-
 gcc/range-op.cc   |  19 +-
 gcc/range-op.h|  31 +++-
 gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-6.c |  29 
 gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-7.c |  21 +++
 gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-8.c |  20 +++
 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-1.c |  23 +++
 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-2.c |  19 ++
 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-3.c |  21 +++
 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-4.c |  14 ++
 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-5.c |  29 
 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-6.c |  59 +++
 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-7.c |  22 +++
 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-8.c |  20 +++
 gcc/testsuite/gcc.dg/tree-ssa/shifts-1.c  |  61 +++
 gcc/testsuite/gcc.dg/tree-ssa/shifts-2.c  |  21 +++
 gcc/testsuite/gcc.dg/tree-ssa/shifts-3.c  |  65 +++
 gcc/testsuite/gcc.dg/tree-ssa/shifts-4.c  |  23 +++
 .../gcc.target/aarch64/sve/cnt_fold_1.c   | 110 
 .../gcc.target/aarch64/sve/cnt_fold_2.c   |  55 ++
 .../gcc.target/aarch64/sve/cnt_fold_3.c   |  40 +
 .../gcc.target/aarch64/sve/cnt_fold_4.c   |  61 +++
 .../gcc.target/aarch64/sve/cnt_fold_5.c   |  38 
 .../gcc.target/aarch64/sve/cnt_fold_6.c   |  75 
 gcc/tree-ssa-loop-ivopts.cc   |   2 +
 gcc/tree-ssa-loop-niter.cc|   2 +-
 gcc/tree-ssanames.cc  |   3 +
 gcc/tree.h|  13 ++
 gcc/value-query.cc|   7 +
 30 files changed, 1018 insertions(+), 58 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-6.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-7.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-8.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-1.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-2.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-3.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-4.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-5.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-6.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-7.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-8.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/shifts-1.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/shifts-2.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/shifts-3.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/shifts-4.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_4.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_5.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_6.c

-- 
2.25.1



[PATCH 8/9] Try to simplify (X >> C1) * (C2 << C1) -> X * C2

2024-10-18 Thread Richard Sandiford
This patch adds a rule to simplify (X >> C1) * (C2 << C1) -> X * C2
when the low C1 bits of X are known to be zero.  As with the earlier
X >> C1 << (C2 + C1) patch, any single conversion is allowed between
the shift and the multiplication.
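
As a worked example (illustrative; it mirrors f1 in the new shifts-3.c
test below): if the low 2 bits of x are known to be zero, then

  unsigned int
  f (unsigned int x)       /* x known to be a multiple of 4 */
  {
    return (x >> 2) * 20;  /* 20 == 5 << 2, so this folds to x * 5 */
  }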

gcc/
* match.pd: Simplify (X >> C1) * (C2 << C1) -> X * C2 if the
low C1 bits of X are zero.

gcc/testsuite/
* gcc.dg/tree-ssa/shifts-3.c: New test.
* gcc.dg/tree-ssa/shifts-4.c: Likewise.
* gcc.target/aarch64/sve/cnt_fold_5.c: Likewise.
---
 gcc/match.pd  | 13 
 gcc/testsuite/gcc.dg/tree-ssa/shifts-3.c  | 65 +++
 gcc/testsuite/gcc.dg/tree-ssa/shifts-4.c  | 23 +++
 .../gcc.target/aarch64/sve/cnt_fold_5.c   | 38 +++
 4 files changed, 139 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/shifts-3.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/shifts-4.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_5.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 41903554478..85f5eeefa08 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -4915,6 +4915,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   && wi::to_widest (@2) >= wi::to_widest (@1)
   && wi::to_widest (@1) <= wi::ctz (get_nonzero_bits (@0)))
   (lshift (convert @0) (minus @2 @1
+
+/* (X >> C1) * (C2 << C1) -> X * C2 if the low C1 bits of X are zero.  */
+(simplify
+ (mult (convert? (rshift (with_possible_nonzero_bits2 @0) INTEGER_CST@1))
+   poly_int_tree_p@2)
+ (with { poly_widest_int factor; }
+  (if (INTEGRAL_TYPE_P (type)
+   && wi::ltu_p (wi::to_wide (@1), element_precision (type))
+   && wi::to_widest (@1) <= wi::ctz (get_nonzero_bits (@0))
+   && multiple_p (wi::to_poly_widest (@2),
+ widest_int (1) << tree_to_uhwi (@1),
+ &factor))
+   (mult (convert @0) { wide_int_to_tree (type, factor); }
 #endif
 
 /* For (x << c) >> c, optimize into x & ((unsigned)-1 >> c) for
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/shifts-3.c 
b/gcc/testsuite/gcc.dg/tree-ssa/shifts-3.c
new file mode 100644
index 000..dcff518e630
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/shifts-3.c
@@ -0,0 +1,65 @@
+/* { dg-options "-O2 -fdump-tree-optimized-raw" } */
+
+unsigned int
+f1 (unsigned int x)
+{
+  if (x & 3)
+__builtin_unreachable ();
+  x >>= 2;
+  return x * 20;
+}
+
+unsigned int
+f2 (unsigned int x)
+{
+  if (x & 3)
+__builtin_unreachable ();
+  unsigned char y = x;
+  y >>= 2;
+  return y * 36;
+}
+
+unsigned long
+f3 (unsigned int x)
+{
+  if (x & 3)
+__builtin_unreachable ();
+  x >>= 2;
+  return (unsigned long) x * 88;
+}
+
+int
+f4 (int x)
+{
+  if (x & 15)
+__builtin_unreachable ();
+  x >>= 4;
+  return x * 48;
+}
+
+unsigned int
+f5 (int x)
+{
+  if (x & 31)
+__builtin_unreachable ();
+  x >>= 5;
+  return x * 3200;
+}
+
+unsigned int
+f6 (unsigned int x)
+{
+  if (x & 1)
+__builtin_unreachable ();
+  x >>= 1;
+  return x * (~0U / 3 & -2);
+}
+
+/* { dg-final { scan-tree-dump-not {<[a-z]*_div_expr,} "optimized" } } */
+/* { dg-final { scan-tree-dump-not {>= 2;
+  return x * 10;
+}
+
+unsigned int
+f2 (unsigned int x)
+{
+  if (x & 3)
+__builtin_unreachable ();
+  x >>= 3;
+  return x * 24;
+}
+
+/* { dg-final { scan-tree-dump-times {
+
+/*
+** f1:
+** ...
+** cntd[^\n]+
+** ...
+** mul [^\n]+
+** ret
+*/
+uint64_t
+f1 (int x)
+{
+  if (x & 3)
+__builtin_unreachable ();
+  x >>= 2;
+  return (uint64_t) x * svcnth ();
+}
+
+/*
+** f2:
+** ...
+** asr [^\n]+
+** ...
+** ret
+*/
+uint64_t
+f2 (int x)
+{
+  if (x & 3)
+__builtin_unreachable ();
+  x >>= 2;
+  return (uint64_t) x * svcntw ();
+}
-- 
2.25.1



[PATCH 6/9] Try to simplify (X >> C1) << (C1 + C2) -> X << C2

2024-10-18 Thread Richard Sandiford
This patch adds a rule to simplify (X >> C1) << (C1 + C2) -> X << C2
when the low C1 bits of X are known to be zero.

Any single conversion can take place between the shifts.  E.g. for
a truncating conversion, any extra bits of X that are preserved by
truncating after the shift are immediately lost by the shift left.
And the sign bits used for an extending conversion are the same as
the sign bits used for the rshift.  (A double conversion of say
int->unsigned->uint64_t would be wrong though.)
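
For instance (an illustrative sketch of the double-conversion hazard,
not code from the patch): with int x whose low bit is known zero,
(x >> 1) << 3 may fold to x << 2, but

  uint64_t bad = (uint64_t) (unsigned) (x >> 1) << 3;

zero-extends the intermediate value, so for negative x it differs from
((uint64_t) x) << 2 and must not be folded that way.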

gcc/
* match.pd: Simplify (X >> C1) << (C1 + C2) -> X << C2 if the
low C1 bits of X are zero.

gcc/testsuite/
* gcc.dg/tree-ssa/shifts-1.c: New test.
* gcc.dg/tree-ssa/shifts-2.c: Likewise.
---
 gcc/match.pd | 13 +
 gcc/testsuite/gcc.dg/tree-ssa/shifts-1.c | 61 
 gcc/testsuite/gcc.dg/tree-ssa/shifts-2.c | 21 
 3 files changed, 95 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/shifts-1.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/shifts-2.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 268316456c3..540582dc984 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -4902,6 +4902,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
- TYPE_PRECISION (TREE_TYPE (@2)
   (bit_and (convert @0) (lshift { build_minus_one_cst (type); } @1
 
+#if GIMPLE
+/* (X >> C1) << (C1 + C2) -> X << C2 if the low C1 bits of X are zero.  */
+(simplify
+ (lshift (convert? (rshift (with_possible_nonzero_bits2 @0) INTEGER_CST@1))
+ INTEGER_CST@2)
+ (if (INTEGRAL_TYPE_P (type)
+  && wi::ltu_p (wi::to_wide (@1), element_precision (type))
+  && wi::ltu_p (wi::to_wide (@2), element_precision (type))
+  && wi::to_widest (@2) >= wi::to_widest (@1)
+  && wi::to_widest (@1) <= wi::ctz (get_nonzero_bits (@0)))
+  (lshift (convert @0) (minus @2 @1
+#endif
+
 /* For (x << c) >> c, optimize into x & ((unsigned)-1 >> c) for
unsigned x OR truncate into the precision(type) - c lowest bits
of signed x (if they have mode precision or a precision of 1).  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/shifts-1.c 
b/gcc/testsuite/gcc.dg/tree-ssa/shifts-1.c
new file mode 100644
index 000..d88500ca8dd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/shifts-1.c
@@ -0,0 +1,61 @@
+/* { dg-options "-O2 -fdump-tree-optimized-raw" } */
+
+unsigned int
+f1 (unsigned int x)
+{
+  if (x & 3)
+__builtin_unreachable ();
+  x >>= 2;
+  return x << 3;
+}
+
+unsigned int
+f2 (unsigned int x)
+{
+  if (x & 3)
+__builtin_unreachable ();
+  unsigned char y = x;
+  y >>= 2;
+  return y << 3;
+}
+
+unsigned long
+f3 (unsigned int x)
+{
+  if (x & 3)
+__builtin_unreachable ();
+  x >>= 2;
+  return (unsigned long) x << 3;
+}
+
+int
+f4 (int x)
+{
+  if (x & 15)
+__builtin_unreachable ();
+  x >>= 4;
+  return x << 5;
+}
+
+unsigned int
+f5 (int x)
+{
+  if (x & 31)
+__builtin_unreachable ();
+  x >>= 5;
+  return x << 6;
+}
+
+unsigned int
+f6 (unsigned int x)
+{
+  if (x & 1)
+__builtin_unreachable ();
+  x >>= 1;
+  return x << (sizeof (int) * __CHAR_BIT__ - 1);
+}
+
+/* { dg-final { scan-tree-dump-not {<[a-z]*_div_expr,} "optimized" } } */
+/* { dg-final { scan-tree-dump-not {>= 3;
+  return x << 4;
+}
+
+unsigned int
+f2 (unsigned int x)
+{
+  if (x & 3)
+__builtin_unreachable ();
+  x >>= 2;
+  return x << 1;
+}
+
+/* { dg-final { scan-tree-dump-times {

[PATCH 4/9] Simplify (X /[ex] C1) * (C1 * C2) -> X * C2

2024-10-18 Thread Richard Sandiford
gcc/
* match.pd: Simplify (X /[ex] C1) * (C1 * C2) -> X * C2.

gcc/testsuite/
* gcc.dg/tree-ssa/mulexactdiv-1.c: New test.
* gcc.dg/tree-ssa/mulexactdiv-2.c: Likewise.
* gcc.dg/tree-ssa/mulexactdiv-3.c: Likewise.
* gcc.dg/tree-ssa/mulexactdiv-4.c: Likewise.
* gcc.target/aarch64/sve/cnt_fold_1.c: Likewise.
* gcc.target/aarch64/sve/cnt_fold_2.c: Likewise.
---
 gcc/match.pd  |   8 ++
 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-1.c |  23 
 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-2.c |  19 +++
 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-3.c |  21 
 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-4.c |  14 +++
 .../gcc.target/aarch64/sve/cnt_fold_1.c   | 110 ++
 .../gcc.target/aarch64/sve/cnt_fold_2.c   |  55 +
 7 files changed, 250 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-1.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-2.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-3.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-4.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_2.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 1b1d38cf105..6677bc06d80 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -31,6 +31,7 @@ along with GCC; see the file COPYING3.  If not see
zerop
initializer_each_zero_or_onep
CONSTANT_CLASS_P
+   poly_int_tree_p
tree_expr_nonnegative_p
tree_expr_nonzero_p
integer_valued_real_p
@@ -5467,6 +5468,13 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (mult (convert1? (exact_div @0 @@1)) (convert2? @1))
   (convert @0))
 
+/* (X /[ex] C1) * (C1 * C2) -> X * C2.  */
+(simplify
+ (mult (convert? (exact_div @0 INTEGER_CST@1)) poly_int_tree_p@2)
+ (with { poly_widest_int factor; }
+  (if (multiple_p (wi::to_poly_widest (@2), wi::to_widest (@1), &factor))
+   (mult (convert @0) { wide_int_to_tree (type, factor); }
+
 /* Simplify (A / B) * B + (A % B) -> A.  */
 (for div (trunc_div ceil_div floor_div round_div)
  mod (trunc_mod ceil_mod floor_mod round_mod)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-1.c 
b/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-1.c
new file mode 100644
index 000..fa853eb7dff
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-1.c
@@ -0,0 +1,23 @@
+/* { dg-options "-O2 -fdump-tree-optimized-raw" } */
+
+#define TEST_CMP(FN, DIV, MUL) \
+  int  \
+  FN (int x)   \
+  {\
+if (x & 7) \
+  __builtin_unreachable ();\
+x /= DIV;  \
+return x * MUL;\
+  }
+
+TEST_CMP (f1, 2, 6)
+TEST_CMP (f2, 2, 10)
+TEST_CMP (f3, 4, 80)
+TEST_CMP (f4, 8, 200)
+
+/* { dg-final { scan-tree-dump-not {<[a-z]*_div_expr, } "optimized" } } */
+/* { dg-final { scan-tree-dump-not {> 1) & -2)
+TEST_CMP (f2, int, 4, unsigned long, -8)
+TEST_CMP (f3, int, 8, unsigned int, -24)
+TEST_CMP (f4, long, 2, int, (~0U >> 1) & -2)
+TEST_CMP (f5, long, 4, unsigned int, 100)
+TEST_CMP (f6, long, 8, unsigned long, 200)
+
+/* { dg-final { scan-tree-dump-not {<[a-z]*_div_expr, } "optimized" } } */
+/* { dg-final { scan-tree-dump-not {
+
+/*
+** f1:
+** cntdx([0-9]+)
+** mul w0, (w0, w\1|w\1, w0)
+** ret
+*/
+int
+f1 (int x)
+{
+  if (x & 1)
+__builtin_unreachable ();
+  x /= 2;
+  return x * svcntw();
+}
+
+/*
+** f2:
+** cntdx([0-9]+)
+** mul w0, (w0, w\1|w\1, w0)
+** ret
+*/
+int
+f2 (int x)
+{
+  if (x & 3)
+__builtin_unreachable ();
+  x /= 4;
+  return x * svcnth();
+}
+
+/*
+** f3:
+** cntdx([0-9]+)
+** mul w0, (w0, w\1|w\1, w0)
+** ret
+*/
+int
+f3 (int x)
+{
+  if (x & 7)
+__builtin_unreachable ();
+  x /= 8;
+  return x * svcntb();
+}
+
+/*
+** f4:
+** cntwx([0-9]+)
+** mul w0, (w0, w\1|w\1, w0)
+** ret
+*/
+int
+f4 (int x)
+{
+  if (x & 1)
+__builtin_unreachable ();
+  x /= 2;
+  return x * svcnth();
+}
+
+/*
+** f5:
+** cntwx([0-9]+)
+** mul w0, (w0, w\1|w\1, w0)
+** ret
+*/
+int
+f5 (int x)
+{
+  if (x & 3)
+__builtin_unreachable ();
+  x /= 4;
+  return x * svcntb();
+}
+
+/*
+** f6:
+** cnthx([0-9]+)
+** mul w0, (w0, w\1|w\1, w0)
+** ret
+*/
+int
+f6 (int x)
+{
+  if (x & 1)
+__builtin_unreachable ();
+  x /= 2;
+  return x * svcntb();
+}
+
+/*
+** f7:
+** cntbx([0-9]+)
+** mul w0, (w0, w\1|w\1, w0)
+** ret
+*/
+int
+f7 (int x)
+{
+  if (x & 15)
+__builtin_unreachable ();
+  x /= 16;
+  return x * svcntb() * 16;
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_2.c 
b/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_2.c
new file mode 100644

[PATCH 2/9] Use get_nonzero_bits to simplify trunc_div to exact_div

2024-10-18 Thread Richard Sandiford
There are a limited number of existing rules that benefit from
knowing that a division is exact.  Later patches will add more.
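
The payoff is easiest to see for signed types (an illustrative example
mirroring the new test below): once the division is known exact, x / 2
is simply x >> 1, with no round-toward-zero fixup:

  int
  f (int x)
  {
    if (x & 1)
      __builtin_unreachable ();  /* low bit known to be zero */
    return x / 2;                /* x /[ex] 2, i.e. x >> 1 */
  }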

gcc/
* match.pd: Simplify X / (1 << C) to X /[ex] (1 << C) if the
low C bits of X are clear

gcc/testsuite/
* gcc.dg/tree-ssa/cmpexactdiv-6.c: New test.
---
 gcc/match.pd  |  9 ++
 gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-6.c | 29 +++
 2 files changed, 38 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-6.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 4aea028a866..b952225b08c 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -5431,6 +5431,15 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
TYPE_PRECISION (type)), 0))
(convert @0)))
 
+#if GIMPLE
+/* X / (1 << C) -> X /[ex] (1 << C) if the low C bits of X are clear.  */
+(simplify
+ (trunc_div (with_possible_nonzero_bits2 @0) integer_pow2p@1)
+ (if (INTEGRAL_TYPE_P (type)
+  && !TYPE_UNSIGNED (type)
+  && wi::multiple_of_p (get_nonzero_bits (@0), wi::to_wide (@1), SIGNED))
+  (exact_div @0 @1)))
+#endif
 
 /* (X /[ex] A) * A -> X.  */
 (simplify
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-6.c 
b/gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-6.c
new file mode 100644
index 000..82d517b05ab
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-6.c
@@ -0,0 +1,29 @@
+/* { dg-options "-O2 -fdump-tree-optimized-raw" } */
+
+typedef __INTPTR_TYPE__ intptr_t;
+
+int
+f1 (int x, int y)
+{
+  if ((x & 1) || (y & 1))
+__builtin_unreachable ();
+  x /= 2;
+  y /= 2;
+  return x < y;
+}
+
+int
+f2 (void *ptr1, void *ptr2, void *ptr3)
+{
+  ptr1 = __builtin_assume_aligned (ptr1, 4);
+  ptr2 = __builtin_assume_aligned (ptr2, 4);
+  ptr3 = __builtin_assume_aligned (ptr3, 4);
+  intptr_t diff1 = (intptr_t) ptr1 - (intptr_t) ptr2;
+  intptr_t diff2 = (intptr_t) ptr1 - (intptr_t) ptr3;
+  diff1 /= 2;
+  diff2 /= 2;
+  return diff1 < diff2;
+}
+
+/* { dg-final { scan-tree-dump-not {<[a-z]*_div_expr,} "optimized" } } */
+/* { dg-final { scan-tree-dump-not {

Re: [PATCH v16b 2/4] gcc/: Rename array_type_nelts => array_type_nelts_minus_one

2024-10-18 Thread Alejandro Colomar
On Fri, Oct 18, 2024 at 10:25:59AM GMT, Alejandro Colomar wrote:
> Hi Joseph,
> 
> On Wed, Oct 16, 2024 at 08:02:05PM GMT, Alejandro Colomar wrote:
> > Hi Joseph,
> > 
> > On Wed, Oct 16, 2024 at 05:21:39PM GMT, Joseph Myers wrote:
> > > On Wed, 16 Oct 2024, Alejandro Colomar wrote:
> > > 
> > > > The old name was misleading.
> > > > 
> > > > While at it, also rename some temporary variables that are used with
> > > > this function, for consistency.
> > > 
> > > This patch is OK and should be committed (assuming it has passed 
> > > bootstrap 
> > > and regression testing).

This patch (2/4) needs patch 1/4 for the commit message.  Otherwise
`git gcc-verify` reports a problem (Cc: tags).  Please revise patch 1/4
too.

Cheers,
Alex

> > 
> > Thanks!
> > 
> > I did bootstrap + regression testing in v15.  In v16 I got lazy and
> > since the changes were small, I only checked that the new tests all
> > pass.
> > 
> > Since it looks like it's close to merging, I'll run bootstrap +
> > regression testing again.
> 
> I've finished with regression testing.  It's all good for the entire
> patch set.  See below.
> 
> Have a lovely day!
> Alex
> 
> 
> $ find | grep sum$ | while read f; do diff -u ../len16b_b4/$f $f; done
> --- ../len16b_b4/./gcc/testsuite/gcc/gcc.sum  2024-10-16 22:35:00.073117065 
> +0200
> +++ ./gcc/testsuite/gcc/gcc.sum   2024-10-18 01:03:29.717522166 +0200
> @@ -1,4 +1,4 @@
> -Test run by alx on Wed Oct 16 20:49:15 2024
> +Test run by alx on Thu Oct 17 23:17:38 2024
>  Native configuration is x86_64-pc-linux-gnu
>  
>   === gcc tests ===
> @@ -77826,6 +77826,29 @@
>  PASS: gcc.dg/conv-2.c (test for excess errors)
>  PASS: gcc.dg/conv-3.c (test for excess errors)
>  PASS: gcc.dg/conv-3.c execution test
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 20)
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 24)
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 35)
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 50)
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 53)
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 56)
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 68)
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 69)
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 70)
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 71)
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 72)
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 73)
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 74)
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 75)
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 76)
> +PASS: gcc.dg/countof-compile.c  (test for warnings, line 80)
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 100)
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 109)
> +PASS: gcc.dg/countof-compile.c  (test for errors, line 114)
> +PASS: gcc.dg/countof-compile.c (test for excess errors)
> +PASS: gcc.dg/countof-vla.c (test for excess errors)
> +PASS: gcc.dg/countof.c (test for excess errors)
> +PASS: gcc.dg/countof.c execution test
>  PASS: gcc.dg/cr-decimal-dig-1.c (test for excess errors)
>  PASS: gcc.dg/cr-decimal-dig-2.c (test for excess errors)
>  PASS: gcc.dg/cr-decimal-dig-3.c (test for excess errors)
> @@ -208273,7 +208296,7 @@
>  
>   === gcc Summary ===
>  
> -# of expected passes 203197
> +# of expected passes 203220
>  # of unexpected failures 47
>  # of unexpected successes2
>  # of expected failures   1467
> --- ../len16b_b4/./gcc/testsuite/gfortran/gfortran.sum2024-10-17 
> 00:13:36.027361156 +0200
> +++ ./gcc/testsuite/gfortran/gfortran.sum 2024-10-18 02:42:26.303591286 
> +0200
> @@ -1,4 +1,4 @@
> -Test run by alx on Wed Oct 16 23:38:34 2024
> +Test run by alx on Fri Oct 18 02:07:15 2024
>  Native configuration is x86_64-pc-linux-gnu
>  
>   === gfortran tests ===
> --- ../len16b_b4/./gcc/testsuite/objc/objc.sum2024-10-17 
> 00:14:42.583978154 +0200
> +++ ./gcc/testsuite/objc/objc.sum 2024-10-18 02:43:33.504187051 +0200
> @@ -1,4 +1,4 @@
> -Test run by alx on Thu Oct 17 00:13:36 2024
> +Test run by alx on Fri Oct 18 02:42:26 2024
>  Native configuration is x86_64-pc-linux-gnu
>  
>   === objc tests ===
> --- ../len16b_b4/./gcc/testsuite/g++/g++.sum  2024-10-16 23:38:33.908171142 
> +0200
> +++ ./gcc/testsuite/g++/g++.sum   2024-10-18 02:07:14.745543661 +0200
> @@ -1,4 +1,4 @@
> -Test run by alx on Wed Oct 16 22:35:00 2024
> +Test run by alx on Fri Oct 18 01:03:30 2024
>  Native configuration is x86_64-pc-linux-gnu
>  
>   === g++ tests ===
> --- ../len16b_b4/./x86_64-pc-linux-gnu/libitm/testsuite/libitm.sum
> 2024-10-17 03:41:54.781371501 +0200
> +++ ./x86_64-pc-linux-gnu/libitm/testsuite/libitm.sum 2024-10-18 
> 06:08:27.040788653 +0200
> @@ -1,4 +1,4 @@
> -Test run by alx on Thu Oct 17 03:41:52 2024
> +Test run by al

[PATCH 7/9] Handle POLY_INT_CSTs in get_nonzero_bits

2024-10-18 Thread Richard Sandiford
This patch extends get_nonzero_bits to handle POLY_INT_CSTs,
The easiest (but also most useful) case is that the number
of trailing zeros in the runtime value is at least the number
of trailing zeros in each individual component.

In principle, we could do this for coeffs 1 and above only,
and then OR in coeff 0.  This would give ~0x11 for [14, 32], say.
But that's future work.
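
As a worked example (illustrative): the POLY_INT_CST 16 + 32n is a
multiple of 16 for every runtime value of n, so its trailing-zero count
is at least 4, and get_nonzero_bits can return -16 (all bits set except
the low four), which is exactly minus the known alignment of the two
coefficients.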

gcc/
* tree-ssanames.cc (get_nonzero_bits): Handle POLY_INT_CSTs.
* match.pd (with_possible_nonzero_bits): Likewise.

gcc/testsuite/
* gcc.target/aarch64/sve/cnt_fold_4.c: New test.
---
 gcc/match.pd  |  2 +
 .../gcc.target/aarch64/sve/cnt_fold_4.c   | 61 +++
 gcc/tree-ssanames.cc  |  3 +
 3 files changed, 66 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_4.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 540582dc984..41903554478 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -2893,6 +2893,8 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
possibly set.  */
 (match with_possible_nonzero_bits
  INTEGER_CST@0)
+(match with_possible_nonzero_bits
+ POLY_INT_CST@0)
 (match with_possible_nonzero_bits
  SSA_NAME@0
  (if (INTEGRAL_TYPE_P (TREE_TYPE (@0)) || POINTER_TYPE_P (TREE_TYPE (@0)
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_4.c 
b/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_4.c
new file mode 100644
index 000..b7a53701993
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_4.c
@@ -0,0 +1,61 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include 
+
+/*
+** f1:
+** cnthx0
+** ret
+*/
+uint64_t
+f1 ()
+{
+  uint64_t x = svcntw ();
+  x >>= 2;
+  return x << 3;
+}
+
+/*
+** f2:
+** [^\n]+
+** [^\n]+
+** ...
+** ret
+*/
+uint64_t
+f2 ()
+{
+  uint64_t x = svcntd ();
+  x >>= 2;
+  return x << 3;
+}
+
+/*
+** f3:
+** cntbx0, all, mul #4
+** ret
+*/
+uint64_t
+f3 ()
+{
+  uint64_t x = svcntd ();
+  x >>= 1;
+  return x << 6;
+}
+
+/*
+** f4:
+** [^\n]+
+** [^\n]+
+** ...
+** ret
+*/
+uint64_t
+f4 ()
+{
+  uint64_t x = svcntd ();
+  x >>= 2;
+  return x << 2;
+}
diff --git a/gcc/tree-ssanames.cc b/gcc/tree-ssanames.cc
index 4f83fcbb517..d2d1ec18797 100644
--- a/gcc/tree-ssanames.cc
+++ b/gcc/tree-ssanames.cc
@@ -505,6 +505,9 @@ get_nonzero_bits (const_tree name)
   /* Use element_precision instead of TYPE_PRECISION so complex and
  vector types get a non-zero precision.  */
   unsigned int precision = element_precision (TREE_TYPE (name));
+  if (POLY_INT_CST_P (name))
+return -known_alignment (wi::to_poly_wide (name));
+
   if (POINTER_TYPE_P (TREE_TYPE (name)))
 {
   struct ptr_info_def *pi = SSA_NAME_PTR_INFO (name);
-- 
2.25.1



[PATCH 9/9] Record nonzero bits in the irange_bitmask of POLY_INT_CSTs

2024-10-18 Thread Richard Sandiford
At the moment, ranger punts entirely on POLY_INT_CSTs.  Numerical
ranges are a bit difficult, unless we do start modelling bounds on
the indeterminates.  But we can at least track the nonzero bits.
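
As an illustration (matching f1 in the new test below): on SVE the
vector length is a multiple of 128 bits, so svcnth () is always a
multiple of 8.  Both arms of x ? svcnth () : svcnth () + 8 therefore
have the low three bits clear, and recording that bitmask lets
(y >> 3) << 4 fold to y * 2.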

gcc/
* value-query.cc (range_query::get_tree_range): Use get_nonzero_bits
to populate the irange_bitmask of a POLY_INT_CST.

gcc/testsuite/
* gcc.target/aarch64/sve/cnt_fold_6.c: New test.
---
 .../gcc.target/aarch64/sve/cnt_fold_6.c   | 75 +++
 gcc/value-query.cc|  7 ++
 2 files changed, 82 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_6.c

diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_6.c 
b/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_6.c
new file mode 100644
index 000..9d9e1ca9330
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_6.c
@@ -0,0 +1,75 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include 
+
+/*
+** f1:
+** ...
+** cntb(x[0-9]+)
+** ...
+** add x[0-9]+, \1, #?16
+** ...
+** csel[^\n]+
+** ret
+*/
+uint64_t
+f1 (int x)
+{
+  uint64_t y = x ? svcnth () : svcnth () + 8;
+  y >>= 3;
+  y <<= 4;
+  return y;
+}
+
+/*
+** f2:
+** ...
+** (?:and|[al]sr)  [^\n]+
+** ...
+** ret
+*/
+uint64_t
+f2 (int x)
+{
+  uint64_t y = x ? svcnth () : svcnth () + 8;
+  y >>= 4;
+  y <<= 5;
+  return y;
+}
+
+/*
+** f3:
+** ...
+** cntw(x[0-9]+)
+** ...
+** add x[0-9]+, \1, #?16
+** ...
+** csel[^\n]+
+** ret
+*/
+uint64_t
+f3 (int x)
+{
+  uint64_t y = x ? svcntd () : svcntd () + 8;
+  y >>= 1;
+  y <<= 2;
+  return y;
+}
+
+/*
+** f4:
+** ...
+** (?:and|[al]sr)  [^\n]+
+** ...
+** ret
+*/
+uint64_t
+f4 (int x)
+{
+  uint64_t y = x ? svcntd () : svcntd () + 8;
+  y >>= 2;
+  y <<= 3;
+  return y;
+}
diff --git a/gcc/value-query.cc b/gcc/value-query.cc
index cac2cb5b2bc..34499da1a98 100644
--- a/gcc/value-query.cc
+++ b/gcc/value-query.cc
@@ -375,6 +375,13 @@ range_query::get_tree_range (vrange &r, tree expr, gimple 
*stmt,
   }
 
 default:
+  if (POLY_INT_CST_P (expr))
+   {
+ unsigned int precision = TYPE_PRECISION (type);
+ r.set_varying (type);
+ r.update_bitmask ({ wi::zero (precision), get_nonzero_bits (expr) });
+ return true;
+   }
   break;
 }
   if (BINARY_CLASS_P (expr) || COMPARISON_CLASS_P (expr))
-- 
2.25.1



[PATCH 5/9] Generalise ((X /[ex] A) +- B) * A -> X +- A * B rule

2024-10-18 Thread Richard Sandiford
match.pd had a rule to simplify ((X /[ex] A) +- B) * A -> X +- A * B
when A and B are INTEGER_CSTs.  This patch extends it to handle the
case where the outer multiplication is by a factor of A, not just
A itself.  It also handles addition and multiplication of poly_ints.
(Exact division by a poly_int seems unlikely.)
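
A worked instance of the generalised rule (illustrative; it matches f1
in the new mulexactdiv-5.c test): with the low 3 bits of x known to be
zero,

  (x /[ex] 2 + 1) * 6  -->  x * 3 + 6

since 6 is 2 * 3, i.e. C1 = 2, C2 = 1, C3 = 3, giving
(X * C3) + (C1 * C2 * C3).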

I'm not sure why minus is handled here.  Wouldn't minus of a constant be
canonicalised to a plus?

gcc/
* match.pd: Generalise ((X /[ex] A) +- B) * A -> X +- A * B rule
to ((X /[ex] C1) +- C2) * (C1 * C3) -> (X * C3) +- (C1 * C2 * C3).

gcc/testsuite/
* gcc.dg/tree-ssa/mulexactdiv-5.c: New test.
* gcc.dg/tree-ssa/mulexactdiv-6.c: Likewise.
* gcc.dg/tree-ssa/mulexactdiv-7.c: Likewise.
* gcc.dg/tree-ssa/mulexactdiv-8.c: Likewise.
* gcc.target/aarch64/sve/cnt_fold_3.c: Likewise.
---
 gcc/match.pd  | 38 +++-
 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-5.c | 29 +
 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-6.c | 59 +++
 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-7.c | 22 +++
 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-8.c | 20 +++
 .../gcc.target/aarch64/sve/cnt_fold_3.c   | 40 +
 6 files changed, 194 insertions(+), 14 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-5.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-6.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-7.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_3.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 6677bc06d80..268316456c3 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -5493,24 +5493,34 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
optab_vector)))
(eq (trunc_mod @0 @1) { build_zero_cst (TREE_TYPE (@0)); })))
 
-/* ((X /[ex] A) +- B) * A  -->  X +- A * B.  */
+/* ((X /[ex] C1) +- C2) * (C1 * C3)  -->  (X * C3) +- (C1 * C2 * C3).  */
 (for op (plus minus)
  (simplify
-  (mult (convert1? (op (convert2? (exact_div @0 INTEGER_CST@@1)) 
INTEGER_CST@2)) @1)
-  (if (tree_nop_conversion_p (type, TREE_TYPE (@2))
-   && tree_nop_conversion_p (TREE_TYPE (@0), TREE_TYPE (@2)))
-   (with
- {
-   wi::overflow_type overflow;
-   wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2),
-  TYPE_SIGN (type), &overflow);
- }
+  (mult (convert1? (op (convert2? (exact_div @0 INTEGER_CST@1))
+  poly_int_tree_p@2))
+   poly_int_tree_p@3)
+  (with { poly_widest_int factor; }
+   (if (tree_nop_conversion_p (type, TREE_TYPE (@2))
+   && tree_nop_conversion_p (TREE_TYPE (@0), TREE_TYPE (@2))
+   && multiple_p (wi::to_poly_widest (@3), wi::to_widest (@1), &factor))
+(with
+  {
+   wi::overflow_type overflow;
+wide_int mul;
+  }
  (if (types_match (type, TREE_TYPE (@2))
-&& types_match (TREE_TYPE (@0), TREE_TYPE (@2)) && !overflow)
-  (op @0 { wide_int_to_tree (type, mul); })
+ && types_match (TREE_TYPE (@0), TREE_TYPE (@2))
+ && TREE_CODE (@2) == INTEGER_CST
+ && TREE_CODE (@3) == INTEGER_CST
+ && (mul = wi::mul (wi::to_wide (@2), wi::to_wide (@3),
+TYPE_SIGN (type), &overflow),
+ !overflow))
+  (op (mult @0 { wide_int_to_tree (type, factor); })
+ { wide_int_to_tree (type, mul); })
   (with { tree utype = unsigned_type_for (type); }
-   (convert (op (convert:utype @0)
-   (mult (convert:utype @1) (convert:utype @2))
+   (convert (op (mult (convert:utype @0)
+ { wide_int_to_tree (utype, factor); })
+   (mult (convert:utype @3) (convert:utype @2)))
 
 /* Canonicalization of binary operations.  */
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-5.c 
b/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-5.c
new file mode 100644
index 000..37cd676fff6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-5.c
@@ -0,0 +1,29 @@
+/* { dg-options "-O2 -fdump-tree-optimized-raw" } */
+
+#define TEST_CMP(FN, DIV, ADD, MUL)\
+  int  \
+  FN (int x)   \
+  {\
+if (x & 7) \
+  __builtin_unreachable ();\
+x /= DIV;  \
+x += ADD;  \
+return x * MUL;\
+  }
+
+TEST_CMP (f1, 2, 1, 6)
+TEST_CMP (f2, 2, 2, 10)
+TEST_CMP (f3, 4, 3, 80)
+TEST_CMP (f4, 8, 4, 200)
+
+/* { dg-final { scan-tree-dump-not {<[a-z]*_div_expr,} "optimized" } } */
+/* { dg-final { scan-tree-dump-not {> 1, 6)
+TEST_CMP (f2, 2, ~(~0U >> 2), 10)
+
+void
+cmp1 (int x)
+{
+  if (x & 3)
+__builtin_un

[PATCH 1/9] Make more places handle exact_div like trunc_div

2024-10-18 Thread Richard Sandiford
I tried to look for places where we were handling TRUNC_DIV_EXPR
more favourably than EXACT_DIV_EXPR.

Most of the places that I looked at but didn't change were handling
div/mod pairs.  But there's bound to be others I missed...

gcc/
* match.pd: Extend some rules to handle exact_div like trunc_div.
* tree.h (trunc_div_p): New function.
* tree-ssa-loop-niter.cc (is_rshift_by_1): Use it.
* tree-ssa-loop-ivopts.cc (force_expr_to_var_cost): Handle
EXACT_DIV_EXPR.
---
 gcc/match.pd| 60 +++--
 gcc/tree-ssa-loop-ivopts.cc |  2 ++
 gcc/tree-ssa-loop-niter.cc  |  2 +-
 gcc/tree.h  | 13 
 4 files changed, 47 insertions(+), 30 deletions(-)

diff --git a/gcc/match.pd b/gcc/match.pd
index 12d81fcac0d..4aea028a866 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -492,27 +492,28 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
of A starting from shift's type sign bit are zero, as
(unsigned long long) (1 << 31) is -2147483648ULL, not 2147483648ULL,
so it is valid only if A >> 31 is zero.  */
-(simplify
- (trunc_div (convert?@0 @3) (convert2? (lshift integer_onep@1 @2)))
- (if ((TYPE_UNSIGNED (type) || tree_expr_nonnegative_p (@0))
-  && (!VECTOR_TYPE_P (type)
- || target_supports_op_p (type, RSHIFT_EXPR, optab_vector)
- || target_supports_op_p (type, RSHIFT_EXPR, optab_scalar))
-  && (useless_type_conversion_p (type, TREE_TYPE (@1))
- || (element_precision (type) >= element_precision (TREE_TYPE (@1))
- && (TYPE_UNSIGNED (TREE_TYPE (@1))
- || (element_precision (type)
- == element_precision (TREE_TYPE (@1)))
- || (INTEGRAL_TYPE_P (type)
- && (tree_nonzero_bits (@0)
- & wi::mask (element_precision (TREE_TYPE (@1)) - 1,
- true,
- element_precision (type))) == 0)
-   (if (!VECTOR_TYPE_P (type)
-   && useless_type_conversion_p (TREE_TYPE (@3), TREE_TYPE (@1))
-   && element_precision (TREE_TYPE (@3)) < element_precision (type))
-(convert (rshift @3 @2))
-(rshift @0 @2
+(for div (trunc_div exact_div)
+ (simplify
+  (div (convert?@0 @3) (convert2? (lshift integer_onep@1 @2)))
+  (if ((TYPE_UNSIGNED (type) || tree_expr_nonnegative_p (@0))
+   && (!VECTOR_TYPE_P (type)
+  || target_supports_op_p (type, RSHIFT_EXPR, optab_vector)
+  || target_supports_op_p (type, RSHIFT_EXPR, optab_scalar))
+   && (useless_type_conversion_p (type, TREE_TYPE (@1))
+  || (element_precision (type) >= element_precision (TREE_TYPE (@1))
+  && (TYPE_UNSIGNED (TREE_TYPE (@1))
+  || (element_precision (type)
+  == element_precision (TREE_TYPE (@1)))
+  || (INTEGRAL_TYPE_P (type)
+  && (tree_nonzero_bits (@0)
+  & wi::mask (element_precision (TREE_TYPE (@1)) - 1,
+  true,
+  element_precision (type))) == 0)
+(if (!VECTOR_TYPE_P (type)
+&& useless_type_conversion_p (TREE_TYPE (@3), TREE_TYPE (@1))
+&& element_precision (TREE_TYPE (@3)) < element_precision (type))
+ (convert (rshift @3 @2))
+ (rshift @0 @2)
 
 /* Preserve explicit divisions by 0: the C++ front-end wants to detect
undefined behavior in constexpr evaluation, and assuming that the division
@@ -947,13 +948,14 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
{ build_one_cst (utype); })))
 
 /* Simplify (unsigned t * 2)/2 -> unsigned t & 0x7FFFFFFF.  */
-(simplify
- (trunc_div (mult @0 integer_pow2p@1) @1)
- (if (INTEGRAL_TYPE_P (TREE_TYPE (@0)) && TYPE_UNSIGNED (TREE_TYPE (@0)))
-  (bit_and @0 { wide_int_to_tree
-   (type, wi::mask (TYPE_PRECISION (type)
-- wi::exact_log2 (wi::to_wide (@1)),
-false, TYPE_PRECISION (type))); })))
+(for div (trunc_div exact_div)
+ (simplify
+  (div (mult @0 integer_pow2p@1) @1)
+  (if (INTEGRAL_TYPE_P (TREE_TYPE (@0)) && TYPE_UNSIGNED (TREE_TYPE (@0)))
+   (bit_and @0 { wide_int_to_tree
+(type, wi::mask (TYPE_PRECISION (type)
+ - wi::exact_log2 (wi::to_wide (@1)),
+ false, TYPE_PRECISION (type))); }
 
 /* Simplify (unsigned t / 2) * 2 -> unsigned t & ~1.  */
 (simplify
@@ -5715,7 +5717,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 
 /* Sink binary operation to branches, but only if we can fold it.  */
 (for op (tcc_comparison plus minus mult bit_and bit_ior bit_xor
-lshift rshift rdiv trunc_div ceil_div floor_div round_div
+lshift rshift rdiv trunc_div ceil_div floor_div round_div exact_div
 trunc_mod ceil_mod floor_mod round_mod min max)
 /* (c ? a : b) op (c ? d : e) 

[PATCH] libstdc++: Improve 26_numerics/headers/cmath/types_std_c++0x_neg.cc

2024-10-18 Thread Jonathan Wakely
This test checks that the special functions in <cmath> are not declared
prior to C++17. But we can remove the target selector and allow it to be
tested for C++17 and later, and add target selectors to the individual
dg-error directives instead.

Also rename the test to match what it actually tests.

libstdc++-v3/ChangeLog:

* testsuite/26_numerics/headers/cmath/types_std_c++0x_neg.cc:
Move to ...
* testsuite/26_numerics/headers/cmath/specfun_c++17.cc: here and
adjust test to be valid for all -std dialects.
---

Tested x86_64-linux.

 ...ypes_std_c++0x_neg.cc => specfun_c++17.cc} | 47 ++-
 1 file changed, 24 insertions(+), 23 deletions(-)
 rename 
libstdc++-v3/testsuite/26_numerics/headers/cmath/{types_std_c++0x_neg.cc => 
specfun_c++17.cc} (57%)

diff --git 
a/libstdc++-v3/testsuite/26_numerics/headers/cmath/types_std_c++0x_neg.cc 
b/libstdc++-v3/testsuite/26_numerics/headers/cmath/specfun_c++17.cc
similarity index 57%
rename from 
libstdc++-v3/testsuite/26_numerics/headers/cmath/types_std_c++0x_neg.cc
rename to libstdc++-v3/testsuite/26_numerics/headers/cmath/specfun_c++17.cc
index 977f800a4b0..efb60ea1fbb 100644
--- a/libstdc++-v3/testsuite/26_numerics/headers/cmath/types_std_c++0x_neg.cc
+++ b/libstdc++-v3/testsuite/26_numerics/headers/cmath/specfun_c++17.cc
@@ -1,4 +1,4 @@
-// { dg-do compile { target { ! c++17 } } }
+// { dg-do compile }
 
 // Copyright (C) 2007-2024 Free Software Foundation, Inc.
 //
@@ -21,28 +21,29 @@
 
 namespace gnu
 {
-  // C++11 changes from TR1.
-  using std::assoc_laguerre;   // { dg-error "has not been declared" }
-  using std::assoc_legendre;   // { dg-error "has not been declared" }
-  using std::beta; // { dg-error "has not been declared" }
-  using std::comp_ellint_1;// { dg-error "has not been declared" }
-  using std::comp_ellint_2;// { dg-error "has not been declared" }
-  using std::comp_ellint_3;// { dg-error "has not been declared" }
+  // C++17 additions from TR1.
+  using std::assoc_laguerre;   // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::assoc_legendre;   // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::beta; // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::comp_ellint_1;// { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::comp_ellint_2;// { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::comp_ellint_3;// { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::cyl_bessel_i; // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::cyl_bessel_j; // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::cyl_bessel_k; // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::cyl_neumann;  // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::ellint_1; // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::ellint_2; // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::ellint_3; // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::expint;   // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::hermite;  // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::laguerre; // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::legendre; // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::riemann_zeta; // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::sph_bessel;   // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::sph_legendre; // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  using std::sph_neumann;  // { dg-error "has not been declared" "" { 
target { ! c++17 } } }
+  // These two were in TR1 but not added to C++17.
   using std::conf_hyperg;  // { dg-error "has not been declared" }
-  using std::cyl_bessel_i; // { dg-error "has not been declared" }
-  using std::cyl_bessel_j; // { dg-error "has not been declared" }
-  using std::cyl_bessel_k; // { dg-error "has not been declared" }
-  using std::cyl_neumann;  // { dg-error "has not been declared" }
-  using std::ellint_1; // { dg-error "has not been declared" }
-  using std::ellint_2; // { dg-error "has not been declared" }
-  using std::ellint_3; // { dg-error "has not been declared" }
-  using std::expint;   // { dg-error "has not been declared" }
-  using std::hermite;  // { dg-error "has not been declared" }
   using std::hyperg;   // { dg-error "has not been declared" }
-  using std::laguerre; // { dg-error "has not been declared" }

[PATCH] libstdc++: Simplify C++98 std::vector::_M_data_ptr overload set

2024-10-18 Thread Jonathan Wakely
We don't need separate overloads for returning a const or non-const
pointer. We can make the member function const and return a non-const
pointer, and let `vector::data() const` convert it to const as needed.

libstdc++-v3/ChangeLog:

* include/bits/stl_vector.h (vector::_M_data_ptr): Remove
non-const overloads. Always return non-const pointer.
---

Tested x86_64-linux.

 libstdc++-v3/include/bits/stl_vector.h | 12 +---
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/libstdc++-v3/include/bits/stl_vector.h 
b/libstdc++-v3/include/bits/stl_vector.h
index e284536ad31..8982ca2b9ee 100644
--- a/libstdc++-v3/include/bits/stl_vector.h
+++ b/libstdc++-v3/include/bits/stl_vector.h
@@ -2034,20 +2034,10 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
_M_data_ptr(_Ptr __ptr) const
{ return empty() ? nullptr : std::__to_address(__ptr); }
 #else
-  template<typename _Up>
-   _Up*
-   _M_data_ptr(_Up* __ptr) _GLIBCXX_NOEXCEPT
-   { return __ptr; }
-
   template<typename _Ptr>
value_type*
-   _M_data_ptr(_Ptr __ptr)
-   { return empty() ? (value_type*)0 : __ptr.operator->(); }
-
-  template<typename _Ptr>
-   const value_type*
_M_data_ptr(_Ptr __ptr) const
-   { return empty() ? (const value_type*)0 : __ptr.operator->(); }
+   { return empty() ? (value_type*)0 : __ptr.operator->(); }
 #endif
 };
 
-- 
2.46.2



[committed] hppa: Fix up pa.opt.urls

2024-10-18 Thread John David Anglin
Regenerated pa.opt.urls.

Dave
---

hppa: Fix up pa.opt.urls

2024-10-18  John David Anglin  

gcc/ChangeLog:

* config/pa/pa.opt.urls: Fix for -mlra.

diff --git a/gcc/config/pa/pa.opt.urls b/gcc/config/pa/pa.opt.urls
index 5b8bcebdd0d..5516332ead1 100644
--- a/gcc/config/pa/pa.opt.urls
+++ b/gcc/config/pa/pa.opt.urls
@@ -36,6 +36,8 @@ UrlSuffix(gcc/HPPA-Options.html#index-mlinker-opt)
 mlong-calls
 UrlSuffix(gcc/HPPA-Options.html#index-mlong-calls-5)
 
+; skipping UrlSuffix for 'mlra' due to finding no URLs
+
 mlong-load-store
 UrlSuffix(gcc/HPPA-Options.html#index-mlong-load-store)
 


signature.asc
Description: PGP signature


Re: [PATCH 3/9] Simplify X /[ex] Y cmp Z -> X cmp (Y * Z)

2024-10-18 Thread Richard Sandiford
[+ranger folks, who I forgot to CC originally, sorry!]

This patch applies X /[ex] Y cmp Z -> X cmp (Y * Z) when Y * Z is
representable.  The closest check for "is representable" on range
operations seemed to be overflow_free_p.  However, that is designed
for testing existing operations and so takes the definedness of
signed overflow into account.  Here, the question is whether we
can create an entirely new value.

The patch adds a new optional argument to overflow_free_p to
distinguish these cases.  It also adds a wrapper, so that it isn't
necessary to specify TRIO_VARYING.

I couldn't find a good way of removing the duplication between
the two operand orders.  The rules are (in a loose sense) symmetric,
but they're not based on commutativity.
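For illustration, a C-level example of the shape this targets (mine,
not from the patch; the actual folding depends on the ranges ranger
computes, and assumes 4-byte int):

  int
  f (int *p, int *q)
  {
    /* (p - q) is byte_diff /[ex] 4, so this can fold to
       byte_diff < 40 once ranger proves that the multiplication
       4 * 10 is representable.  */
    return (p - q) < 10;
  }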

gcc/
* range-op.h (range_query_type): New enum.
(range_op_handler::fits_type_p): New function.
(range_operator::overflow_free_p): Add an argument to specify the
type of query.
(range_op_handler::overflow_free_p): Likewise.
* range-op-mixed.h (operator_plus::overflow_free_p): Likewise.
(operator_minus::overflow_free_p): Likewise.
(operator_mult::overflow_free_p): Likewise.
* range-op.cc (range_op_handler::overflow_free_p): Likewise.
(range_operator::overflow_free_p): Likewise.
(operator_plus::overflow_free_p): Likewise.
(operator_minus::overflow_free_p): Likewise.
(operator_mult::overflow_free_p): Likewise.
* match.pd: Simplify X /[ex] Y cmp Z -> X cmp (Y * Z) when
Y * Z is representable.

gcc/testsuite/
* gcc.dg/tree-ssa/cmpexactdiv-7.c: New test.
* gcc.dg/tree-ssa/cmpexactdiv-8.c: Likewise.
---
 gcc/match.pd  | 21 +
 gcc/range-op-mixed.h  |  9 --
 gcc/range-op.cc   | 19 ++--
 gcc/range-op.h| 31 +--
 gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-7.c | 21 +
 gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-8.c | 20 
 6 files changed, 107 insertions(+), 14 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-7.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-8.c

diff --git a/gcc/match.pd b/gcc/match.pd
index b952225b08c..1b1d38cf105 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -2679,6 +2679,27 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(le (minus (convert:etype @0) { lo; }) { hi; })
(gt (minus (convert:etype @0) { lo; }) { hi; })
 
+#if GIMPLE
+/* X /[ex] Y cmp Z -> X cmp (Y * Z), if Y * Z is representable.  */
+(for cmp (simple_comparison)
+ (simplify
+  (cmp (exact_div:s @0 @1) @2)
+  (with { int_range_max r1, r2; }
+   (if (INTEGRAL_TYPE_P (type)
+   && get_range_query (cfun)->range_of_expr (r1, @1)
+   && get_range_query (cfun)->range_of_expr (r2, @2)
+   && range_op_handler (MULT_EXPR).fits_type_p (r1, r2))
+(cmp @0 (mult @1 @2)
+ (simplify
+  (cmp @2 (exact_div:s @0 @1))
+  (with { int_range_max r1, r2; }
+   (if (INTEGRAL_TYPE_P (type)
+   && get_range_query (cfun)->range_of_expr (r1, @1)
+   && get_range_query (cfun)->range_of_expr (r2, @2)
+   && range_op_handler (MULT_EXPR).fits_type_p (r1, r2))
+(cmp (mult @1 @2) @0)
+#endif
+
 /* X + Z < Y + Z is the same as X < Y when there is no overflow.  */
 (for op (lt le ge gt)
  (simplify
diff --git a/gcc/range-op-mixed.h b/gcc/range-op-mixed.h
index cc1db2f6775..402cb87c2b2 100644
--- a/gcc/range-op-mixed.h
+++ b/gcc/range-op-mixed.h
@@ -539,7 +539,8 @@ public:
   const irange &rh) const final override;
 
   virtual bool overflow_free_p (const irange &lh, const irange &rh,
-   relation_trio = TRIO_VARYING) const;
+   relation_trio = TRIO_VARYING,
+   range_query_type = QUERY_WITH_GIMPLE_UB) const;
   // Check compatibility of all operands.
   bool operand_check_p (tree t1, tree t2, tree t3) const final override
 { return range_compatible_p (t1, t2) && range_compatible_p (t1, t3); }
@@ -615,7 +616,8 @@ public:
   const irange &rh) const final override;
 
   virtual bool overflow_free_p (const irange &lh, const irange &rh,
-   relation_trio = TRIO_VARYING) const;
+   relation_trio = TRIO_VARYING,
+   range_query_type = QUERY_WITH_GIMPLE_UB) const;
   // Check compatibility of all operands.
   bool operand_check_p (tree t1, tree t2, tree t3) const final override
 { return range_compatible_p (t1, t2) && range_compatible_p (t1, t3); }
@@ -701,7 +703,8 @@ public:
const REAL_VALUE_TYPE &rh_lb, const REAL_VALUE_TYPE &rh_ub,
relation_kind kind) const final override;
   virtual bool overflow_free_p (const irange &lh, const irange &rh,
-   relation_trio =

Re: pair-fusion: Assume alias conflict if common address reg changes [PR116783]

2024-10-18 Thread Alex Coplan
On 18/10/2024 17:45, Richard Sandiford wrote:
> Alex Coplan  writes:
> > On 11/10/2024 14:30, Richard Biener wrote:
> >> On Fri, 11 Oct 2024, Richard Sandiford wrote:
> >> 
> >> > Alex Coplan  writes:
> >> > > Hi,
> >> > >
> >> > > As the PR shows, pair-fusion was tricking memory_modified_in_insn_p 
> >> > > into
> >> > > returning false when a common base register (in this case, x1) was
> >> > > modified between the mem and the store insn.  This led to wrong code 
> >> > > as
> >> > > the accesses really did alias.
> >> > >
> >> > > To avoid this sort of problem, this patch avoids invoking RTL alias
> >> > > analysis altogether (and assumes an alias conflict) if the two insns to
> >> > > be compared share a common address register R, and the insns see 
> >> > > different
> >> > > definitions of R (i.e. it was modified in between).
> >> > >
> >> > > Bootstrapped/regtested on aarch64-linux-gnu (all languages, both 
> >> > > regular
> >> > > bootstrap and LTO+PGO bootstrap).  OK for trunk?
> >> > 
> >> > Sorry for the slow review.  The patch looks good to me, but...
> >
> > Thanks for the review.  I'd missed that you'd sent this, sorry for not
> > responding sooner.
> >
> >> > 
> >> > > @@ -2544,11 +2624,37 @@ pair_fusion_bb_info::try_fuse_pair (bool 
> >> > > load_p, unsigned access_size,
> >> > >   && bitmap_bit_p (&m_tombstone_bitmap, insn->uid ());
> >> > >};
> >> > >  
> >> > > +  // Maximum number of distinct regnos we expect to appear in a single
> >> > > +  // MEM (and thus in a candidate insn).
> >> > > +  static constexpr int max_mem_regs = 2;
> >> > > +  auto_vec addr_use_vec[2];
> >> > > +  use_array addr_uses[2];
> >> > > +
> >> > > +  // Collect the lists of register uses that occur in the candidate 
> >> > > MEMs.
> >> > > +  for (int i = 0; i < 2; i++)
> >> > > +{
> >> > > +  // N.B. it's safe for us to ignore uses that only occur in notes
> >> > > +  // here (e.g. in a REG_EQUIV expression) since we only pass the
> >> > > +  // MEM down to the alias machinery, so it can't see any 
> >> > > insn-level
> >> > > +  // notes.
> >> > > +  for (auto use : insns[i]->uses ())
> >> > > +  if (use->is_reg ()
> >> > > +  && use->includes_address_uses ()
> >> > > +  && !use->only_occurs_in_notes ())
> >> > > +{
> >> > > +  gcc_checking_assert (addr_use_vec[i].length () < 
> >> > > max_mem_regs);
> >> > > +  addr_use_vec[i].quick_push (use);
> >> > 
> >> > ...if possible, I think it would be better to just use safe_push here,
> >> > without the assert.  There'd then be no need to split max_mem_regs out;
> >> > it could just be hard-coded in the addr_use_vec declaration.
> >
> > I hadn't realised at the time that quick_push () already does a
> > gcc_checking_assert to make sure that we don't overflow.  It does:
> >
>   template<typename T, typename A>
> >   inline T *
>   vec<T, A, vl_embed>::quick_push (const T &obj)
> >   {
> > gcc_checking_assert (space (1));
> > T *slot = &address ()[m_vecpfx.m_num++];
> ::new (static_cast<void*>(slot)) T (obj);
> > return slot;
> >   }
> >
> > (I checked the behaviour by writing a quick selftest in vec.cc, and it
> > indeed aborts as expected with quick_push on overflow for a
> > stack-allocated auto_vec with N = 2.)
> >
> > This means that the assert above is indeed redundant, so I agree that
> > we should be able to drop the assert and drop the max_mem_regs constant,
> > using a literal inside the auto_vec template instead (all while still
> > using quick_push).
> >
> > Does that sound OK to you, or did you have another reason to prefer
> > safe_push?  AIUI the behaviour of safe_push on overflow would be to
> > allocate a new (heap-allocated) vector instead of asserting.
> 
> I just thought it looked odd/unexpected.  Normally the intent of:
> 
>   auto_vec<T, N> bar;
> 
> is to reserve a sensible amount of stack space for the common case,
> but still support the general case of arbitrarily many elements.
> The common on-stack case will be fast with both quick_push and
> safe_push[*].  The difference is just whether growing beyond the
> static space would abort the compiler or work as expected.
> 
> quick_push makes sense if an earlier loop has calculated the runtime
> length of the vector and if we've already reserved that amount, or if
> there is a static property that guarantees a static limit.  But the limit
> of 2 looked more like a general assumption, rather than something that
> had been definitively checked by earlier code.
> 
> I was also wondering whether using safe_push on an array of auto_vecs
> caused issues, and so you were having to work around that.  (I remember
> sometimes hitting a warning about attempts to delete an on-stack buffer,
> presumably due to code duplication creating contradictory paths that
> jump threading couldn't optimise away as dead.)
> 
> No real objection though.  Just wanted to clarify what I meant. :)

I see.  For development I wanted to validate empirically what this limit
was 

Re: pair-fusion: Assume alias conflict if common address reg changes [PR116783]

2024-10-18 Thread Richard Sandiford
Alex Coplan  writes:
> On 11/10/2024 14:30, Richard Biener wrote:
>> On Fri, 11 Oct 2024, Richard Sandiford wrote:
>> 
>> > Alex Coplan  writes:
>> > > Hi,
>> > >
>> > > As the PR shows, pair-fusion was tricking memory_modified_in_insn_p into
>> > > returning false when a common base register (in this case, x1) was
>> > > modified between the mem and the store insn.  This led to wrong code as
>> > > the accesses really did alias.
>> > >
>> > > To avoid this sort of problem, this patch avoids invoking RTL alias
>> > > analysis altogether (and assumes an alias conflict) if the two insns to
>> > > be compared share a common address register R, and the insns see 
>> > > different
>> > > definitions of R (i.e. it was modified in between).
>> > >
>> > > Bootstrapped/regtested on aarch64-linux-gnu (all languages, both regular
>> > > bootstrap and LTO+PGO bootstrap).  OK for trunk?
>> > 
>> > Sorry for the slow review.  The patch looks good to me, but...
>
> Thanks for the review.  I'd missed that you'd sent this, sorry for not
> responding sooner.
>
>> > 
>> > > @@ -2544,11 +2624,37 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, 
>> > > unsigned access_size,
>> > > && bitmap_bit_p (&m_tombstone_bitmap, insn->uid ());
>> > >};
>> > >  
>> > > +  // Maximum number of distinct regnos we expect to appear in a single
>> > > +  // MEM (and thus in a candidate insn).
>> > > +  static constexpr int max_mem_regs = 2;
>> > > +  auto_vec addr_use_vec[2];
>> > > +  use_array addr_uses[2];
>> > > +
>> > > +  // Collect the lists of register uses that occur in the candidate 
>> > > MEMs.
>> > > +  for (int i = 0; i < 2; i++)
>> > > +{
>> > > +  // N.B. it's safe for us to ignore uses that only occur in notes
>> > > +  // here (e.g. in a REG_EQUIV expression) since we only pass the
>> > > +  // MEM down to the alias machinery, so it can't see any insn-level
>> > > +  // notes.
>> > > +  for (auto use : insns[i]->uses ())
>> > > +if (use->is_reg ()
>> > > +&& use->includes_address_uses ()
>> > > +&& !use->only_occurs_in_notes ())
>> > > +  {
>> > > +gcc_checking_assert (addr_use_vec[i].length () < 
>> > > max_mem_regs);
>> > > +addr_use_vec[i].quick_push (use);
>> > 
>> > ...if possible, I think it would be better to just use safe_push here,
>> > without the assert.  There'd then be no need to split max_mem_regs out;
>> > it could just be hard-coded in the addr_use_vec declaration.
>
> I hadn't realised at the time that quick_push () already does a
> gcc_checking_assert to make sure that we don't overflow.  It does:
>
>   template<typename T, typename A>
>   inline T *
>   vec<T, A, vl_embed>::quick_push (const T &obj)
>   {
> gcc_checking_assert (space (1));
> T *slot = &address ()[m_vecpfx.m_num++];
> ::new (static_cast<void*>(slot)) T (obj);
> return slot;
>   }
>
> (I checked the behaviour by writing a quick selftest in vec.cc, and it
> indeed aborts as expected with quick_push on overflow for a
> stack-allocated auto_vec with N = 2.)
>
> This means that the assert above is indeed redundant, so I agree that
> we should be able to drop the assert and drop the max_mem_regs constant,
> using a literal inside the auto_vec template instead (all while still
> using quick_push).
>
> Does that sound OK to you, or did you have another reason to prefer
> safe_push?  AIUI the behaviour of safe_push on overflow would be to
> allocate a new (heap-allocated) vector instead of asserting.

I just thought it looked odd/unexpected.  Normally the intent of:

  auto_vec<T, N> bar;

is to reserve a sensible amount of stack space for the common case,
but still support the general case of arbitrarily many elements.
The common on-stack case will be fast with both quick_push and
safe_push[*].  The difference is just whether growing beyond the
static space would abort the compiler or work as expected.

quick_push makes sense if an earlier loop has calculated the runtime
length of the vector and if we've already reserved that amount, or if
there is a static property that guarantees a static limit.  But the limit
of 2 looked more like a general assumption, rather than something that
had been definitively checked by earlier code.
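For illustration (my sketch, using GCC's vec.h API; the element type
and pushed values are hypothetical):

  auto_vec<use_info *, 2> uses;   // two elements of inline stack storage
  uses.quick_push (a);            // OK: fits in the reserved inline space
  uses.quick_push (b);            // OK: second and last inline slot
  uses.safe_push (c);             // grows onto the heap; a third quick_push
                                  // would trip gcc_checking_assert instead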

I was also wondering whether using safe_push on an array of auto_vecs
caused issues, and so you were having to work around that.  (I remember
sometimes hitting a warning about attempts to delete an on-stack buffer,
presumably due to code duplication creating contradictory paths that
jump threading couldn't optimise away as dead.)

No real objection though.  Just wanted to clarify what I meant. :)

Thanks,
Richard

[*] well, ok, quick_push will be slightly faster in release builds,
since quick_push won't do a bounds check in that case.  But the
check in safe_push would be highly predictable.


[PATCH 3/9] Simplify X /[ex] Y cmp Z -> X cmp (Y * Z)

2024-10-18 Thread Richard Sandiford
This patch applies X /[ex] Y cmp Z -> X cmp (Y * Z) when Y * Z is
representable.  The closest check for "is representable" on range
operations seemed to be overflow_free_p.  However, that is designed
for testing existing operations and so takes the definedness of
signed overflow into account.  Here, the question is whether we
can create an entirely new value.

The patch adds a new optional argument to overflow_free_p to
distinguish these cases.  It also adds a wrapper, so that it isn't
necessary to specify TRIO_VARYING.

I couldn't find a good way of removing the duplication between
the two operand orders.  The rules are (in a loose sense) symmetric,
but they're not based on commutativity.

gcc/
* range-op.h (range_query_type): New enum.
(range_op_handler::fits_type_p): New function.
(range_operator::overflow_free_p): Add an argument to specify the
type of query.
(range_op_handler::overflow_free_p): Likewise.
* range-op-mixed.h (operator_plus::overflow_free_p): Likewise.
(operator_minus::overflow_free_p): Likewise.
(operator_mult::overflow_free_p): Likewise.
* range-op.cc (range_op_handler::overflow_free_p): Likewise.
(range_operator::overflow_free_p): Likewise.
(operator_plus::overflow_free_p): Likewise.
(operator_minus::overflow_free_p): Likewise.
(operator_mult::overflow_free_p): Likewise.
* match.pd: Simplify X /[ex] Y cmp Z -> X cmp (Y * Z) when
Y * Z is representable.

gcc/testsuite/
* gcc.dg/tree-ssa/cmpexactdiv-7.c: New test.
* gcc.dg/tree-ssa/cmpexactdiv-8.c: Likewise.
---
 gcc/match.pd  | 21 +
 gcc/range-op-mixed.h  |  9 --
 gcc/range-op.cc   | 19 ++--
 gcc/range-op.h| 31 +--
 gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-7.c | 21 +
 gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-8.c | 20 
 6 files changed, 107 insertions(+), 14 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-7.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/cmpexactdiv-8.c

diff --git a/gcc/match.pd b/gcc/match.pd
index b952225b08c..1b1d38cf105 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -2679,6 +2679,27 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(le (minus (convert:etype @0) { lo; }) { hi; })
(gt (minus (convert:etype @0) { lo; }) { hi; })
 
+#if GIMPLE
+/* X /[ex] Y cmp Z -> X cmp (Y * Z), if Y * Z is representable.  */
+(for cmp (simple_comparison)
+ (simplify
+  (cmp (exact_div:s @0 @1) @2)
+  (with { int_range_max r1, r2; }
+   (if (INTEGRAL_TYPE_P (type)
+   && get_range_query (cfun)->range_of_expr (r1, @1)
+   && get_range_query (cfun)->range_of_expr (r2, @2)
+   && range_op_handler (MULT_EXPR).fits_type_p (r1, r2))
+(cmp @0 (mult @1 @2)
+ (simplify
+  (cmp @2 (exact_div:s @0 @1))
+  (with { int_range_max r1, r2; }
+   (if (INTEGRAL_TYPE_P (type)
+   && get_range_query (cfun)->range_of_expr (r1, @1)
+   && get_range_query (cfun)->range_of_expr (r2, @2)
+   && range_op_handler (MULT_EXPR).fits_type_p (r1, r2))
+(cmp (mult @1 @2) @0)
+#endif
+
 /* X + Z < Y + Z is the same as X < Y when there is no overflow.  */
 (for op (lt le ge gt)
  (simplify
diff --git a/gcc/range-op-mixed.h b/gcc/range-op-mixed.h
index cc1db2f6775..402cb87c2b2 100644
--- a/gcc/range-op-mixed.h
+++ b/gcc/range-op-mixed.h
@@ -539,7 +539,8 @@ public:
   const irange &rh) const final override;
 
   virtual bool overflow_free_p (const irange &lh, const irange &rh,
-   relation_trio = TRIO_VARYING) const;
+   relation_trio = TRIO_VARYING,
+   range_query_type = QUERY_WITH_GIMPLE_UB) const;
   // Check compatibility of all operands.
   bool operand_check_p (tree t1, tree t2, tree t3) const final override
 { return range_compatible_p (t1, t2) && range_compatible_p (t1, t3); }
@@ -615,7 +616,8 @@ public:
   const irange &rh) const final override;
 
   virtual bool overflow_free_p (const irange &lh, const irange &rh,
-   relation_trio = TRIO_VARYING) const;
+   relation_trio = TRIO_VARYING,
+   range_query_type = QUERY_WITH_GIMPLE_UB) const;
   // Check compatibility of all operands.
   bool operand_check_p (tree t1, tree t2, tree t3) const final override
 { return range_compatible_p (t1, t2) && range_compatible_p (t1, t3); }
@@ -701,7 +703,8 @@ public:
const REAL_VALUE_TYPE &rh_lb, const REAL_VALUE_TYPE &rh_ub,
relation_kind kind) const final override;
   virtual bool overflow_free_p (const irange &lh, const irange &rh,
-   relation_trio = TRIO_VARYING) const;
+   re

[PATCH] [4/n] remove wrapv-*.c special-casing of gcc.dg/vect/ files

2024-10-18 Thread Richard Biener
The following makes -fwrapv explicit.

* gcc.dg/vect/vect.exp: Remove special-casing of tests
named wrapv-*
* gcc.dg/vect/wrapv-vect-7.c: Add dg-additional-options -fwrapv.
* gcc.dg/vect/wrapv-vect-reduc-2char.c: Likewise.
* gcc.dg/vect/wrapv-vect-reduc-2short.c: Likewise.
* gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c: Likewise.
* gcc.dg/vect/wrapv-vect-reduc-pattern-2c.c: Likewise.
---
 gcc/testsuite/gcc.dg/vect/vect.exp| 21 +++
 gcc/testsuite/gcc.dg/vect/wrapv-vect-7.c  |  1 +
 .../gcc.dg/vect/wrapv-vect-reduc-2char.c  |  1 +
 .../gcc.dg/vect/wrapv-vect-reduc-2short.c |  1 +
 .../gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c|  1 +
 .../gcc.dg/vect/wrapv-vect-reduc-pattern-2c.c |  1 +
 6 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/vect.exp 
b/gcc/testsuite/gcc.dg/vect/vect.exp
index eddebf53c7f..14c6168f6ee 100644
--- a/gcc/testsuite/gcc.dg/vect/vect.exp
+++ b/gcc/testsuite/gcc.dg/vect/vect.exp
@@ -112,6 +112,13 @@ foreach flags $VECT_ADDITIONAL_FLAGS {
 et-dg-runtest dg-runtest [lsort \
[glob -nocomplain $srcdir/$subdir/fast-math-\[ipsvc\]*.\[cS\]]] \
$flags $DEFAULT_VECTCFLAGS
+et-dg-runtest dg-runtest [lsort \
+   [glob -nocomplain $srcdir/$subdir/wrapv-*.\[cS\]]] \
+   $flags $DEFAULT_VECTCFLAGS
+
+et-dg-runtest dg-runtest [lsort \
+   [glob -nocomplain $srcdir/$subdir/fast-math-bb-slp-*.\[cS\]]] \
+   $flags $VECT_SLP_CFLAGS
 et-dg-runtest dg-runtest [lsort \
[glob -nocomplain $srcdir/$subdir/bb-slp*.\[cS\]]] \
$flags $VECT_SLP_CFLAGS
@@ -122,20 +129,6 @@ global SAVED_DEFAULT_VECTCFLAGS
 set SAVED_DEFAULT_VECTCFLAGS $DEFAULT_VECTCFLAGS
 set SAVED_VECT_SLP_CFLAGS $VECT_SLP_CFLAGS
 
-# -ffast-math SLP tests
-set VECT_SLP_CFLAGS $SAVED_VECT_SLP_CFLAGS
-lappend VECT_SLP_CFLAGS "-ffast-math"
-et-dg-runtest dg-runtest [lsort \
-   [glob -nocomplain $srcdir/$subdir/fast-math-bb-slp-*.\[cS\]]] \
-   "" $VECT_SLP_CFLAGS
-
-# -fwrapv tests
-set DEFAULT_VECTCFLAGS $SAVED_DEFAULT_VECTCFLAGS
-lappend DEFAULT_VECTCFLAGS "-fwrapv"
-et-dg-runtest dg-runtest [lsort \
-   [glob -nocomplain $srcdir/$subdir/wrapv-*.\[cS\]]] \
-   "" $DEFAULT_VECTCFLAGS
-
 # -ftrapv tests
 set DEFAULT_VECTCFLAGS $SAVED_DEFAULT_VECTCFLAGS
 lappend DEFAULT_VECTCFLAGS "-ftrapv"
diff --git a/gcc/testsuite/gcc.dg/vect/wrapv-vect-7.c 
b/gcc/testsuite/gcc.dg/vect/wrapv-vect-7.c
index 414bd9d3e12..2a557f697e1 100644
--- a/gcc/testsuite/gcc.dg/vect/wrapv-vect-7.c
+++ b/gcc/testsuite/gcc.dg/vect/wrapv-vect-7.c
@@ -1,3 +1,4 @@
+/* { dg-additional-options "-fwrapv" } */
 /* { dg-require-effective-target vect_int } */
 /* { dg-add-options bind_pic_locally } */
 
diff --git a/gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-2char.c 
b/gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-2char.c
index 556c2a06dc5..0ee9178025e 100644
--- a/gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-2char.c
+++ b/gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-2char.c
@@ -1,3 +1,4 @@
+/* { dg-additional-options "-fwrapv" } */
 /* { dg-require-effective-target vect_int } */
 
 #include 
diff --git a/gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-2short.c 
b/gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-2short.c
index f9142173b25..aadc9c37da3 100644
--- a/gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-2short.c
+++ b/gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-2short.c
@@ -1,3 +1,4 @@
+/* { dg-additional-options "-fwrapv" } */
 /* { dg-require-effective-target vect_int } */
 
 #include 
diff --git a/gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c 
b/gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c
index 72080af5923..920374d4263 100644
--- a/gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c
+++ b/gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c
@@ -1,3 +1,4 @@
+/* { dg-additional-options "-fwrapv" } */
 /* Disabling epilogues until we find a better way to deal with scans.  */
 /* { dg-additional-options "--param vect-epilogues-nomask=0" } */
 /* { dg-require-effective-target vect_int } */
diff --git a/gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-pattern-2c.c 
b/gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-pattern-2c.c
index e3c33cff7e1..be0447c7b10 100644
--- a/gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-pattern-2c.c
+++ b/gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-pattern-2c.c
@@ -1,3 +1,4 @@
+/* { dg-additional-options "-fwrapv" } */
 /* { dg-require-effective-target vect_int } */
 
 #include 
-- 
2.43.0


Re: [WIP RFC] libstdc++: add module std

2024-10-18 Thread Maciej Cencora
Hi,

Thanks for working on this!

> stdc++.h also doesn't include the eternally deprecated <strstream>.  There
are some other deprecated facilities that I notice are included: <codecvt>
and float_denorm_style, at least.  It would be nice for L{E,}WG to clarify
whether module std is intended to include interfaces that were deprecated in
C++23, since ancient code isn't going to be relying on module std.

Per P2465r3 Standard Library Modules std and std.compat:
Are deprecated features provided by the Standard Library modules?
Yes. This is implied by the normative wording.

While doing some light testing, one thing that immediately popped up
is that we need to export the __normal_iterator-related operators from
the __gnu_cxx namespace.
Otherwise it is impossible to even use std::vector in range-based for loops.

But I think a better solution (than exporting such impl details) is to
make these operators hidden friends.
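
A sketch of what the hidden-friend approach could look like
(hypothetical code, not the current libstdc++ definition):

  namespace __gnu_cxx
  {
    template<typename _Iterator, typename _Container>
      class __normal_iterator
      {
        _Iterator _M_current;

      public:
        const _Iterator&
        base() const { return _M_current; }

        // Hidden friend: found only via ADL, so module std would not
        // have to export an operator from namespace __gnu_cxx.
        friend bool
        operator==(const __normal_iterator& __x,
                   const __normal_iterator& __y)
        { return __x.base() == __y.base(); }
      };
  }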


Another thing: P2465r3 mentions only lerp, byte, and related ops as
special w.r.t. skipping export from the global namespace in std.compat,
but for some reason Microsoft's impl treats the 3-arg hypot as special
as well.


Regards,
Maciej Cencora


Re: [PATCH] match.pd: Add std::pow folding optimizations.

2024-10-18 Thread Andrew Pinski
On Fri, Oct 18, 2024 at 5:09 AM Jennifer Schmitz  wrote:
>
> This patch adds the following two simplifications in match.pd:
> - pow (1.0/x, y) to pow (x, -y), avoiding the division
> - pow (0.0, x) to 0.0, avoiding the call to pow.
> The patterns are guarded by flag_unsafe_math_optimizations,
> !flag_trapping_math, !flag_errno_math, !HONOR_SIGNED_ZEROS,
> and !HONOR_INFINITIES.
>
> Tests were added to confirm the application of the transform for float,
> double, and long double.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu and
> x86_64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz 
>
> gcc/
> * match.pd: Fold pow (1.0/x, y) -> pow (x, -y) and
> pow (0.0, x) -> 0.0.

Note this is not to block this patch; it looks good to me.
We have __builtin_powi too and these seem like these simplifications
should apply for that builtin also.

Also I do think you should add a testcase for powf16 too.

Thanks,
Andrew Pinski

>
> gcc/testsuite/
> * gcc.dg/tree-ssa/pow_fold_1.c: New test.
> ---
>  gcc/match.pd   | 14 +
>  gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c | 34 ++
>  2 files changed, 48 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 12d81fcac0d..ba100b117e7 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -8203,6 +8203,20 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> (rdiv @0 (exps:s @1))
>  (mult @0 (exps (negate @1)
>
> + /* Simplify pow(1.0/x, y) into pow(x, -y).  */
> + (if (! HONOR_INFINITIES (type)
> +  && ! HONOR_SIGNED_ZEROS (type)
> +  && ! flag_trapping_math
> +  && ! flag_errno_math)
> +  (simplify
> +   (POW (rdiv:s real_onep@0 @1) @2)
> +(POW @1 (negate @2)))
> +
> +  /* Simplify pow(0.0, x) into 0.0.  */
> +  (simplify
> +   (POW real_zerop@0 @1)
> +@0))
> +
>   (if (! HONOR_SIGN_DEPENDENT_ROUNDING (type)
>&& ! HONOR_NANS (type) && ! HONOR_INFINITIES (type)
>&& ! flag_trapping_math
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c
> new file mode 100644
> index 000..113df572661
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c
> @@ -0,0 +1,34 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ffast-math" } */
> +/* { dg-require-effective-target c99_runtime } */
> +
> +extern void link_error (void);
> +
> +#define POW1OVER(TYPE, C_TY, TY)   \
> +  void \
> +  pow1over_##TY (TYPE x, TYPE y)   \
> +  {\
> +TYPE t1 = 1.0##C_TY / x;   \
> +TYPE t2 = __builtin_pow##TY (t1, y);   \
> +TYPE t3 = -y;  \
> +TYPE t4 = __builtin_pow##TY (x, t3);   \
> +if (t2 != t4)  \
> +  link_error ();   \
> +  }\
> +
> +#define POW0(TYPE, C_TY, TY)   \
> +  void \
> +  pow0_##TY (TYPE x)   \
> +  {\
> +TYPE t1 = __builtin_pow##TY (0.0##C_TY, x);\
> +if (t1 != 0.0##C_TY)   \
> +  link_error ();   \
> +  }\
> +
> +#define TEST_ALL(TYPE, C_TY, TY)   \
> +  POW1OVER (TYPE, C_TY, TY)\
> +  POW0 (TYPE, C_TY, TY)
> +
> +TEST_ALL (double, , )
> +TEST_ALL (float, f, f)
> +TEST_ALL (long double, L, l)
> --
> 2.34.1


libbacktrace patch committed

2024-10-18 Thread Ian Lance Taylor
Because libbacktrace merges adjacent address ranges where possible,
and because the GNU linker can deduplicate functions leaving debuginfo
that refers to address ranges in other compilation units, it is
possible for libbacktrace to have overlapping address ranges, in
particular to overlap ranges with gaps between the overlapped ranges.
This can cause libbacktrace to fail to find the right, most specific,
address range for a given address.

This patch fixes the problem by noticing overlapping ranges, and
filling in the gaps with references to the overlapping range.  This
somewhat complicates the computation of address ranges, but that
happens only once.  It does not affect the relatively simple code that
looks up an address range.
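
As an illustration (hypothetical numbers), suppose the sorted list
contains

  [10, 100) -> unit A
  [20, 30)  -> unit B

After the walk, a synthesized entry [30, 100) -> unit A fills the gap
behind B, so a lookup of, say, address 50 still finds the most specific
range that covers it.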

There is more discussion of this at
https://github.com/ianlancetaylor/libbacktrace/issues/137.

I've tested this patch with the libbacktrace and Go testsuites on
x86_64-pc-linux-gnu.  I've verified that it fixes the test case in the
github issue.  I haven't tried to add a test case because it's very
fiddly, depending on the exact compiler and linker behavior.

Ian

* dwarf.c (resolve_unit_addrs_overlap_walk): New static function.
(resolve_unit_addrs_overlap): New static function.
(build_dwarf_data): Call resolve_unit_addrs_overlap.
bf4bedf3abb9a6dd84a8bf572ac9c2960b264f1b
diff --git a/libbacktrace/dwarf.c b/libbacktrace/dwarf.c
index 96ffc4cc481..cc5cad70333 100644
--- a/libbacktrace/dwarf.c
+++ b/libbacktrace/dwarf.c
@@ -1276,6 +1276,194 @@ unit_addrs_search (const void *vkey, const void *ventry)
 return 0;
 }
 
+/* Fill in overlapping ranges as needed.  This is a subroutine of
+   resolve_unit_addrs_overlap.  */
+
+static int
+resolve_unit_addrs_overlap_walk (struct backtrace_state *state,
+size_t *pfrom, size_t *pto,
+struct unit_addrs *enclosing,
+struct unit_addrs_vector *old_vec,
+backtrace_error_callback error_callback,
+void *data,
+struct unit_addrs_vector *new_vec)
+{
+  struct unit_addrs *old_addrs;
+  size_t old_count;
+  struct unit_addrs *new_addrs;
+  size_t from;
+  size_t to;
+
+  old_addrs = (struct unit_addrs *) old_vec->vec.base;
+  old_count = old_vec->count;
+  new_addrs = (struct unit_addrs *) new_vec->vec.base;
+
+  for (from = *pfrom, to = *pto; from < old_count; from++, to++)
+{
+  /* If we are in the scope of a larger range that can no longer
+cover any further ranges, return back to the caller.  */
+
+  if (enclosing != NULL
+ && enclosing->high <= old_addrs[from].low)
+   {
+ *pfrom = from;
+ *pto = to;
+ return 1;
+   }
+
+  new_addrs[to] = old_addrs[from];
+
+  /* If we are in scope of a larger range, fill in any gaps
+between this entry and the next one.
+
+There is an extra entry at the end of the vector, so it's
+always OK to refer to from + 1.  */
+
+  if (enclosing != NULL
+ && enclosing->high > old_addrs[from].high
+ && old_addrs[from].high < old_addrs[from + 1].low)
+   {
+ void *grew;
+ size_t new_high;
+
+ grew = backtrace_vector_grow (state, sizeof (struct unit_addrs),
+   error_callback, data, &new_vec->vec);
+ if (grew == NULL)
+   return 0;
+ new_addrs = (struct unit_addrs *) new_vec->vec.base;
+ to++;
+ new_addrs[to].low = old_addrs[from].high;
+ new_high = old_addrs[from + 1].low;
+ if (enclosing->high < new_high)
+   new_high = enclosing->high;
+ new_addrs[to].high = new_high;
+ new_addrs[to].u = enclosing->u;
+   }
+
+  /* If this range has a larger scope than the next one, use it to
+fill in any gaps.  */
+
+  if (old_addrs[from].high > old_addrs[from + 1].high)
+   {
+ *pfrom = from + 1;
+ *pto = to + 1;
+ if (!resolve_unit_addrs_overlap_walk (state, pfrom, pto,
+   &old_addrs[from], old_vec,
+   error_callback, data, new_vec))
+   return 0;
+ from = *pfrom;
+ to = *pto;
+
+ /* Undo the increment the loop is about to do.  */
+ from--;
+ to--;
+   }
+}
+
+  if (enclosing == NULL)
+{
+  struct unit_addrs *pa;
+
+  /* Add trailing entry.  */
+
+  pa = ((struct unit_addrs *)
+   backtrace_vector_grow (state, sizeof (struct unit_addrs),
+  error_callback, data, &new_vec->vec));
+  if (pa == NULL)
+   return 0;
+  pa->low = 0;
+  --pa->low;
+  pa->high = pa->low;
+  pa->u = NULL;
+
+  new_vec->count = to;
+}
+
+  return 1;
+}
+
+/* It is possible for the unit_addrs list to contain overlaps, as in
+
+   10: low == 1

[WIP RFC] libstdc++: add module std

2024-10-18 Thread Jason Merrill
This patch is not ready for integration, but I'd like to get feedback on the
approach (and various specific questions below).

-- 8< --

This patch introduces an installed source form of module std and std.compat.
To find them, we install a libstdc++.modules.json file alongside
libstdc++.so, which tells the build system where the files are and any
special flags it should use when compiling them (none, in our case).  The
format is from a proposal in SG15.

The build system can find this file with
gcc -print-file-name=libstdc++.modules.json

It seems preferable to use a relative path from this file to the sources so
that moving the installation doesn't break the reference, but I didn't see
any obvious way to compute that without relying on coreutils, perl, or
python, so I wrote a POSIX shell script for it.

Currently this installs the sources under $(pkgdata), i.e.
/usr/share/libstdc++/modules.  It could also make sense to install them
under $(gxx_include_dir), i.e. /usr/include/c++/15/modules.  And/or the
subdirectory could be "miu" (module interface unit) instead of "modules".

The sources currently have the extension .cc, like other source files.
Alternatively, they could use one of the module interface unit extensions,
perhaps .ccm.

std.cc started with m.cencora's implementation in PR114600.  I've made some
adjustments, but more is probably desirable, e.g. of the 
handling of namespace ranges, and to remove exports of templates that are
only specialized in a particular header.

The std module is missing exports for some newer headers, including some
that are already implemented (, , ).  I've
added some FIXMEs where I noticed missing bits.

Since bits/stdc++.h also intends to include the whole standard library, I
include it rather than duplicate it.  But stdc++.h comments out <execution>,
so I include it separately.  Alternatively, we could uncomment it in
stdc++.h.

stdc++.h also doesn't include the eternally deprecated <strstream>.  There
are some other deprecated facilities that I notice are included: <codecvt>
and float_denorm_style, at least.  It would be nice for L{E,}WG to clarify
whether module std is intended to include interfaces that were deprecated in
C++23, since ancient code isn't going to be relying on module std.

If they are supposed to be included, do we also want to keep exporting them in
C++26, where they are removed from the standard?

It seemed most convenient for the two files to be monolithic so we don't
need to worry about include paths.  So the C library names that module
std.compat exports in both namespace std and :: are a block of
code that is identical in both files, adjusted based on whether the macro
STD_COMPAT is defined before the block.
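
A sketch of how the shared block could work (illustrative syntax only,
not the actual file contents; this assumes <cstring> was included in
the global module fragment):

  export namespace std { using std::strlen; }
  #ifdef STD_COMPAT
  export { using ::strlen; }
  #endif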

In this implementation std.compat imports std; it would also be valid for it
to duplicate everything in std.  I see the libc++ std.compat also imports
std.

Is it useful for std.cc to live in a subdirectory of c++23 as in this patch, or
should it be in c++23 itself?  Or elsewhere?

This patch doesn't yet provide a convenient way for a user to find std.cc.

libstdc++-v3/ChangeLog:

* src/c++23/Makefile.am: Add module std/std.compat.
* src/c++23/Makefile.in: Regenerate.
* src/c++23/modules/std.cc: New file.
* src/c++23/modules/std.compat.cc: New file.
* src/c++23/libstdc++.modules.json.in: New file.

contrib/ChangeLog:

* relpath.sh: New file.
---
 libstdc++-v3/src/c++23/modules/std.cc | 3575 +
 libstdc++-v3/src/c++23/modules/std.compat.cc  |  640 +++
 contrib/relpath.sh|   70 +
 libstdc++-v3/src/c++23/Makefile.am|   18 +
 libstdc++-v3/src/c++23/Makefile.in|  133 +-
 .../src/c++23/libstdc++.modules.json.in   |   17 +
 6 files changed, 4436 insertions(+), 17 deletions(-)
 create mode 100644 libstdc++-v3/src/c++23/modules/std.cc
 create mode 100644 libstdc++-v3/src/c++23/modules/std.compat.cc
 create mode 100755 contrib/relpath.sh
 create mode 100644 libstdc++-v3/src/c++23/libstdc++.modules.json.in

diff --git a/libstdc++-v3/src/c++23/modules/std.cc 
b/libstdc++-v3/src/c++23/modules/std.cc
new file mode 100644
index 000..d823b915b9c
--- /dev/null
+++ b/libstdc++-v3/src/c++23/modules/std.cc
@@ -0,0 +1,3575 @@
+// -*- C++ -*- [std.modules] module std
+
+// Copyright The GNU Toolchain Authors.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+
+// Under Section 7 of GPL version 3, you are granted additional
+// permissions described in the GCC Runtime Library Exception, version
+// 3.1,

[PATCH 1/2] aarch64: Use standard names for saturating arithmetic

2024-10-18 Thread Akram Ahmad
This renames the existing {s,u}q{add,sub} instructions to use the
standard names {s,u}s{add,sub}<mode>3 which are used by IFN_SAT_ADD and
IFN_SAT_SUB.

The NEON intrinsics for saturating arithmetic and their corresponding
builtins are changed to use these standard names too.

Using the standard names for the instructions causes 32 and 64-bit
unsigned scalar saturating arithmetic to use the NEON instructions,
resulting in an additional (and inefficient) FMOV to be generated when
the original operands are in GP registers. This patch therefore also
restores the original behaviour of using the adds/subs instructions
in this circumstance.

Additional tests are written for the scalar and Adv. SIMD cases to
ensure that the correct instructions are used. The NEON intrinsics are
already tested elsewhere.
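
For reference, this is the kind of scalar source idiom that IFN_SAT_ADD
recognizes (a sketch based on the tests in patch 2/2, not part of this
patch text):

  #include <limits.h>

  unsigned int
  usadd (unsigned int a, unsigned int b)
  {
    unsigned int sum = a + b;
    return sum < a ? UINT_MAX : sum;   /* saturates instead of wrapping */
  }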

gcc/ChangeLog:

* config/aarch64/aarch64-builtins.cc: Expand iterators.
* config/aarch64/aarch64-simd-builtins.def: Use standard names
* config/aarch64/aarch64-simd.md: Use standard names, split insn
definitions on signedness of operator and type of operands.
* config/aarch64/arm_neon.h: Use standard builtin names.
* config/aarch64/iterators.md: Add VSDQ_I_QI_HI iterator to
simplify splitting of insn for unsigned scalar arithmetic.

gcc/testsuite/ChangeLog:

* 
gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc:
Template file for unsigned vector saturating arithmetic tests.
* 
gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c:
8-bit vector type tests.
* 
gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c:
16-bit vector type tests.
* 
gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c:
32-bit vector type tests.
* 
gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_4.c:
64-bit vector type tests.
* gcc.target/aarch64/saturating_arithmetic.inc: Template file
for scalar saturating arithmetic tests.
* gcc.target/aarch64/saturating_arithmetic_1.c: 8-bit tests.
* gcc.target/aarch64/saturating_arithmetic_2.c: 16-bit tests.
* gcc.target/aarch64/saturating_arithmetic_3.c: 32-bit tests.
* gcc.target/aarch64/saturating_arithmetic_4.c: 64-bit tests.
---
 gcc/config/aarch64/aarch64-builtins.cc| 13 +++
 gcc/config/aarch64/aarch64-simd-builtins.def  |  8 +-
 gcc/config/aarch64/aarch64-simd.md| 93 +-
 gcc/config/aarch64/arm_neon.h | 96 +--
 gcc/config/aarch64/iterators.md   |  4 +
 .../saturating_arithmetic_autovect.inc| 58 +++
 .../saturating_arithmetic_autovect_1.c| 79 +++
 .../saturating_arithmetic_autovect_2.c| 79 +++
 .../saturating_arithmetic_autovect_3.c| 75 +++
 .../saturating_arithmetic_autovect_4.c| 77 +++
 .../aarch64/saturating_arithmetic.inc | 39 
 .../aarch64/saturating_arithmetic_1.c | 41 
 .../aarch64/saturating_arithmetic_2.c | 41 
 .../aarch64/saturating_arithmetic_3.c | 30 ++
 .../aarch64/saturating_arithmetic_4.c | 30 ++
 15 files changed, 707 insertions(+), 56 deletions(-)
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_4.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/saturating_arithmetic.inc
 create mode 100644 gcc/testsuite/gcc.target/aarch64/saturating_arithmetic_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/saturating_arithmetic_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/saturating_arithmetic_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/saturating_arithmetic_4.c

diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
b/gcc/config/aarch64/aarch64-builtins.cc
index 7d737877e0b..f2a1b6ddbf6 100644
--- a/gcc/config/aarch64/aarch64-builtins.cc
+++ b/gcc/config/aarch64/aarch64-builtins.cc
@@ -3849,6 +3849,19 @@ aarch64_general_gimple_fold_builtin (unsigned int fcode, 
gcall *stmt,
  new_stmt = gimple_build_assign (gimple_call_lhs (stmt),
  LSHIFT_EXPR, args[0], args[1]);
break;
+
+  /* lower saturating add/sub neon builtins to gimple.  */
+  BUILTIN_VSDQ_I (BINOP, ssadd, 3, NONE)
+  BUILTIN_VSDQ_I (BINOPU, usadd, 3, NONE)
+   new_stmt = gimple_build_call_internal (IFN_SAT_ADD, 2, args[0], 
args[1])

Re: [PATCH 1/7] libstdc++: Refactor std::uninitialized_{copy, fill, fill_n} algos [PR68350]

2024-10-18 Thread Jonathan Wakely
On Fri, 18 Oct 2024 at 15:24, Patrick Palka  wrote:
>
> On Fri, 18 Oct 2024, Jonathan Wakely wrote:
>
> > On 16/10/24 21:39 -0400, Patrick Palka wrote:
> > > On Tue, 15 Oct 2024, Jonathan Wakely wrote:
> > > > +#if __cplusplus < 201103L
> > > > +
> > > > +  // True if we can unwrap _Iter to get a pointer by using
> > > > std::__niter_base.
> > > > +  template<typename _Iter>
> > > > +struct __unwrappable_niter
> > > > +{
> > > > +  template<typename> struct __is_ptr { enum { __value = 0 }; };
> > > > +  template<typename _Tp> struct __is_ptr<_Tp*> { enum { __value = 1
> > > > }; };
> > > > +
> > > > +  typedef __decltype(std::__niter_base(*(_Iter*)0)) _Base;
> > > > +
> > > > +  enum { __value = __is_ptr<_Base>::__value };
> > > > +};
> > >
> > > It might be slightly cheaper to define this without the nested class
> > > template as:
> > >
> > >  template<typename _Iter,
> > >           typename _Base = __decltype(std::__niter_base(*(_Iter*)0))>
> > >  struct __unwrappable_niter
> > >  { enum { __value = false }; };
> > >
> > >  template<typename _Iter, typename _Tp>
> > >  struct __unwrappable_niter<_Iter, _Tp*>
> > >  { enum { __value = true }; };
>
> One minor nit, we might as well use 'value' since it's a reserved name
> even in C++98?

Huh, I was under the impression it wasn't, but indeed we have
binder1st::value and binder2nd::value!

Why have we been using __value throughout cpp_type_traits.h then!?!

I have indeed pushed it now, but could tweak that, as I'm still
working in that file.



Re: [PATCH 2/7] RISC-V: Fix uninitialized reg in memcpy

2024-10-18 Thread Jeff Law




On 10/18/24 7:12 AM, Craig Blackmore wrote:

gcc/ChangeLog:

* config/riscv/riscv-string.cc (expand_block_move): Replace
`end` with `length_rtx` in gen_rtx_NE.

Thanks.  I've pushed this to the trunk.
jeff



Re: [PATCH 1/7] RISC-V: Fix indentation in riscv_vector::expand_block_move [NFC]

2024-10-18 Thread Jeff Law




On 10/18/24 7:12 AM, Craig Blackmore wrote:

gcc/ChangeLog:

* config/riscv/riscv-string.cc (expand_block_move): Fix
indentation.

Thanks.  Pushed to the trunk.

Jeff



[PATCH 2/2] aarch64: Use standard names for SVE saturating arithmetic

2024-10-18 Thread Akram Ahmad
Rename the existing SVE unpredicated saturating arithmetic instructions
to use standard names which are used by IFN_SAT_ADD and IFN_SAT_SUB.

gcc/ChangeLog:

* config/aarch64/aarch64-sve.md: Rename insns.

gcc/testsuite/ChangeLog:

* gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc:
Template file for auto-vectorizer tests.
* gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c:
Instantiate 8-bit vector tests.
* gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_2.c:
Instantiate 16-bit vector tests.
* gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_3.c:
Instantiate 32-bit vector tests.
* gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_4.c:
Instantiate 64-bit vector tests.
---
 gcc/config/aarch64/aarch64-sve.md |  4 +-
 .../aarch64/sve/saturating_arithmetic.inc | 68 +++
 .../aarch64/sve/saturating_arithmetic_1.c | 60 
 .../aarch64/sve/saturating_arithmetic_2.c | 60 
 .../aarch64/sve/saturating_arithmetic_3.c | 62 +
 .../aarch64/sve/saturating_arithmetic_4.c | 62 +
 6 files changed, 314 insertions(+), 2 deletions(-)
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_2.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_3.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_4.c

diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index 06bd3e4bb2c..b987b292b20 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -4379,7 +4379,7 @@
 ;; -
 
 ;; Unpredicated saturating signed addition and subtraction.
-(define_insn "@aarch64_sve_"
+(define_insn "s3"
   [(set (match_operand:SVE_FULL_I 0 "register_operand")
(SBINQOPS:SVE_FULL_I
  (match_operand:SVE_FULL_I 1 "register_operand")
@@ -4395,7 +4395,7 @@
 )
 
 ;; Unpredicated saturating unsigned addition and subtraction.
-(define_insn "@aarch64_sve_"
+(define_insn "s3"
   [(set (match_operand:SVE_FULL_I 0 "register_operand")
(UBINQOPS:SVE_FULL_I
  (match_operand:SVE_FULL_I 1 "register_operand")
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc 
b/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc
new file mode 100644
index 000..0b3ebbcb0d6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc
@@ -0,0 +1,68 @@
+/* Template file for vector saturating arithmetic validation.
+
+   This file defines saturating addition and subtraction functions for a given
+   scalar type, testing the auto-vectorization of these two operators. This
+   type, along with the corresponding minimum and maximum values for that type,
+   must be defined by any test file which includes this template file.  */
+
+#ifndef SAT_ARIT_AUTOVEC_INC
+#define SAT_ARIT_AUTOVEC_INC
+
+#include 
+#include 
+
+#ifndef UT
+#define UT uint32_t
+#define UMAX UINT_MAX
+#define UMIN 0
+#endif
+
+void uaddq (UT *out, UT *a, UT *b, int n)
+{
+  for (int i = 0; i < n; i++)
+{
+  UT sum = a[i] + b[i];
+  out[i] = sum < a[i] ? UMAX : sum;
+}
+}
+
+void uaddq2 (UT *out, UT *a, UT *b, int n)
+{
+  for (int i = 0; i < n; i++)
+{
+  UT sum;
+  if (!__builtin_add_overflow(a[i], b[i], &sum))
+   out[i] = sum;
+  else
+   out[i] = UMAX;
+}
+}
+
+void uaddq_imm (UT *out, UT *a, int n)
+{
+  for (int i = 0; i < n; i++)
+{
+  UT sum = a[i] + 50;
+  out[i] = sum < a[i] ? UMAX : sum;
+}
+}
+
+void usubq (UT *out, UT *a, UT *b, int n)
+{
+  for (int i = 0; i < n; i++)
+{
+  UT sum = a[i] - b[i];
+  out[i] = sum > a[i] ? UMIN : sum;
+}
+}
+
+void usubq_imm (UT *out, UT *a, int n)
+{
+  for (int i = 0; i < n; i++)
+{
+  UT sum = a[i] - 50;
+  out[i] = sum > a[i] ? UMIN : sum;
+}
+}
+
+#endif
\ No newline at end of file
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c 
b/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c
new file mode 100644
index 000..6936e9a2704
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c
@@ -0,0 +1,60 @@
+/* { dg-do compile { target { aarch64*-*-* } } } */
+/* { dg-options "-O2 --save-temps -ftree-vectorize" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+/*
+** uaddq:
+** ...
+** ld1b\tz([0-9]+)\.b, .*
+** ld1b\tz([0-9]+)\.b, .*
** uqadd\tz\2\.b, z\1\.b, z\2\.b
+** ...
+** ldr\tb([0-9]+), .*
+** ldr\tb([0-9]+), .*
+** uqadd\tb\4, b\3, b\4
+** ...
+*/
+/*
+** ua

[PATCH] libstdc++: Avoid using std::__to_address with iterators

2024-10-18 Thread Jonathan Wakely
Do others agree with my reasoning below?

The changes to implement the rule "use std::__niter_base before C++20
and use std::to_address after C++20" were easier than I expected. There
weren't many places that were doing it "wrong" and needed to change.

Tested x86_64-linux.

-- >8 --

In r12-3935-g82626be2d633a9 I added the partial specialization
std::pointer_traits<__normal_iterator> so that __to_address
would work with __normal_iterator objects. Soon after that, François
replaced it in r12-6004-g807ad4bc854cae with an overload of __to_address
that served the same purpose, but was less complicated and less wrong.

I now think that both commits were mistakes, and that instead of adding
hacks to make __normal_iterator work with __to_address, we should not be
using __to_address with iterators at all before C++20.

The pre-C++20 std::__to_address function should only be used with
pointer-like types, specifically allocator_traits::pointer types.
Those pointer-like types are guaranteed to be contiguous iterators, so
that getting a raw memory address from them is OK.

For arbitrary iterators, even random access iterators, we don't know
that it's safe to lower the iterator to a pointer e.g. for std::deque
iterators it's not, because (it + n) == (std::to_address(it) + n) only
holds within the same block of the deque's storage.
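
For illustration, here is a small self-contained sketch (mine, not from the
patch) of the deque hazard; the container size and index are arbitrary:

```
#include <deque>
#include <memory>

int main ()
{
  std::deque<int> d (1024, 0);      // storage spans several internal blocks
  auto it = d.begin ();
  int *p = std::addressof (*it);    // raw address of the first element

  int ok = *(it + 600) + *p;        // fine: operator+ hops between blocks
  // int bad = *(p + 600);          // undefined: index 600 lies outside the
                                    // block that holds d[0]
  return ok;
}
```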

For C++20, std::to_address does work correctly for contiguous iterators,
including __normal_iterator, and __to_address just calls std::to_address
so also works. But we have to be sure we have an iterator that satisfies
the std::contiguous_iterator concept for it to be safe, and we can't
check that before C++20.

So for pre-C++20 code the correct way to handle iterators that might be
pointers or might be __normal_iterator is to call __niter_base, and if
necessary use is_pointer to check whether __niter_base returned a real
pointer.
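
Concretely, that idiom looks something like the following minimal sketch
(std::__niter_base is a libstdc++-internal helper pulled in by <algorithm>;
fast_path and generic_path are hypothetical stand-ins, and if constexpr is
used here for brevity where pre-C++17 code would tag-dispatch):

```
#include <algorithm>    // pulls in the internal std::__niter_base helper
#include <type_traits>
#include <vector>

template<typename T> void fast_path (T *, T *) { }      // hypothetical
template<typename It> void generic_path (It, It) { }    // hypothetical

template<typename It>
void process (It first, It last)
{
  // __niter_base unwraps a __normal_iterator to its underlying pointer
  // and returns any other iterator unchanged.
  auto f = std::__niter_base (first);
  if constexpr (std::is_pointer<decltype (f)>::value)
    fast_path (f, std::__niter_base (last));
  else
    generic_path (first, last);
}

int main ()
{
  std::vector<int> v{1, 2, 3};
  process (v.begin (), v.end ());  // vector iterators unwrap to int*
}
```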

We currently have some uses of std::__to_address with iterators where
we've checked that they're either pointers, or __normal_iterator
wrappers around pointers, or satisfy std::contiguous_iterator. But this
seems a little fragile, and it would be better to just use
std::__niter_base for the pointers and __normal_iterator cases, and use
C++20 std::to_address when the C++20 std::contiguous_iterator concept is
satisfied. This patch does that.

libstdc++-v3/ChangeLog:

* include/bits/basic_string.h (basic_string::assign): Replace
use of __to_address with __niter_base or std::to_address as
appropriate.
* include/bits/ptr_traits.h (__to_address): Add comment.
* include/bits/shared_ptr_base.h (__shared_ptr): Qualify calls
to __to_address.
* include/bits/stl_algo.h (find): Replace use of __to_address
with __niter_base or std::to_address as appropriate. Only use
either of them when the range is not empty.
* include/bits/stl_iterator.h (__to_address): Remove overload
for __normal_iterator.
* include/debug/safe_iterator.h (__to_address): Remove overload
for _Safe_iterator.
* include/std/ranges (views::counted): Replace use of
__to_address with std::to_address.
* testsuite/24_iterators/normal_iterator/to_address.cc: Removed.
---
 libstdc++-v3/include/bits/basic_string.h  | 19 +--
 libstdc++-v3/include/bits/ptr_traits.h|  4 
 libstdc++-v3/include/bits/shared_ptr_base.h   |  4 ++--
 libstdc++-v3/include/bits/stl_algo.h  | 16 +++-
 libstdc++-v3/include/bits/stl_iterator.h  | 12 
 libstdc++-v3/include/debug/safe_iterator.h| 17 -
 libstdc++-v3/include/std/ranges   |  2 +-
 .../normal_iterator/to_address.cc | 19 ---
 8 files changed, 31 insertions(+), 62 deletions(-)
 delete mode 100644 
libstdc++-v3/testsuite/24_iterators/normal_iterator/to_address.cc

diff --git a/libstdc++-v3/include/bits/basic_string.h 
b/libstdc++-v3/include/bits/basic_string.h
index e9b17ea48b5..16e356e0678 100644
--- a/libstdc++-v3/include/bits/basic_string.h
+++ b/libstdc++-v3/include/bits/basic_string.h
@@ -1732,18 +1732,25 @@ _GLIBCXX_BEGIN_NAMESPACE_CXX11
basic_string&
assign(_InputIterator __first, _InputIterator __last)
{
+	  using _IterTraits = iterator_traits<_InputIterator>;
+	  if constexpr (is_pointer<decltype(std::__niter_base(__first))>::value
+			&& is_same<typename _IterTraits::value_type,
+				   _CharT>::value)
+	    {
+	      __glibcxx_requires_valid_range(__first, __last);
+	      return _M_replace(size_type(0), size(),
+				std::__niter_base(__first), __last - __first);
+	    }
 #if __cplusplus >= 202002L
-	  if constexpr (contiguous_iterator<_InputIterator>
-			&& is_same_v<iter_value_t<_InputIterator>, _CharT>)
-#else
-	  if constexpr (__is_one_of<_InputIterator, const_iterator, iterator,
-				    const _CharT*, _CharT*>::value)
-#endif
+	  else if constexpr (contiguous_iterator<_InputIterator>

[RFC/RFA] [PATCH v5 01/12] Implement internal functions for efficient CRC computation.

2024-10-18 Thread Mariam Arutunian
Add two new internal functions (IFN_CRC, IFN_CRC_REV), to provide faster
CRC generation.
One performs bit-forward and the other bit-reversed CRC computation.
If CRC optabs are supported, they are used for the CRC computation.
Otherwise, table-based CRC is generated.
The supported data and CRC sizes are 8, 16, 32, and 64 bits.
The polynomial is given without the leading 1.
A table with 256 elements is used to store precomputed CRCs.
For the reflection of inputs and the output, a simple algorithm involving
SHIFT, AND, and OR operations is used.
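
As a standalone illustration of that scheme (not the patch's RTL-emitting
code), here is the same table-based computation and shift/AND/OR reflection
written directly in C++, using CRC-8 with polynomial 0x07 as an arbitrary
example:

```
#include <array>
#include <cstddef>
#include <cstdint>

// Entry i of the 256-element table is the CRC of the single byte i.
static std::array<uint8_t, 256> make_table (uint8_t poly)
{
  std::array<uint8_t, 256> t{};
  for (unsigned i = 0; i < 256; ++i)
    {
      uint8_t crc = static_cast<uint8_t> (i);
      for (int bit = 0; bit < 8; ++bit)
	crc = (crc & 0x80) ? static_cast<uint8_t> ((crc << 1) ^ poly)
			   : static_cast<uint8_t> (crc << 1);
      t[i] = crc;
    }
  return t;
}

// Bit-forward CRC-8: one table lookup per data byte.
static uint8_t crc8 (const uint8_t *data, std::size_t n, uint8_t crc = 0)
{
  static const std::array<uint8_t, 256> table = make_table (0x07);
  for (std::size_t i = 0; i < n; ++i)
    crc = table[crc ^ data[i]];
  return crc;
}

// Reflection of a byte using only SHIFT, AND and OR operations, as used
// for the input/output reflection of the bit-reversed (IFN_CRC_REV) case.
static uint8_t reflect8 (uint8_t v)
{
  v = static_cast<uint8_t> (((v & 0xF0) >> 4) | ((v & 0x0F) << 4));
  v = static_cast<uint8_t> (((v & 0xCC) >> 2) | ((v & 0x33) << 2));
  v = static_cast<uint8_t> (((v & 0xAA) >> 1) | ((v & 0x55) << 1));
  return v;
}
```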

gcc/

* doc/md.texi (crc@var{m}@var{n}4,
crc_rev@var{m}@var{n}4): Document.
* expr.cc (calculate_crc): New function.
(assemble_crc_table): Likewise.
(generate_crc_table): Likewise.
(calculate_table_based_CRC): Likewise.
(emit_crc): Likewise.
(expand_crc_table_based): Likewise.
(gen_common_operation_to_reflect): Likewise.
(reflect_64_bit_value): Likewise.
(reflect_32_bit_value): Likewise.
(reflect_16_bit_value): Likewise.
(reflect_8_bit_value): Likewise.
(generate_reflecting_code_standard): Likewise.
(expand_reversed_crc_table_based): Likewise.
* expr.h (generate_reflecting_code_standard): New function declaration.
(expand_crc_table_based): Likewise.
(expand_reversed_crc_table_based): Likewise.
* internal-fn.cc (crc_direct): Define.
(direct_crc_optab_supported_p): Likewise.
(expand_crc_optab_fn): New function.
* internal-fn.def (CRC, CRC_REV): New internal functions.
* optabs.def (crc_optab, crc_rev_optab): New optabs.

Signed-off-by: Mariam Arutunian 
Co-authored-by: Joern Rennecke 
Mentored-by: Jeff Law 
---
 gcc/doc/md.texi |  14 ++
 gcc/expr.cc | 371 
 gcc/expr.h  |   6 +
 gcc/internal-fn.cc  |  54 +++
 gcc/internal-fn.def |   2 +
 gcc/optabs.def  |   2 +
 6 files changed, 449 insertions(+)

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index a9259112251..913d1f96373 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -8591,6 +8591,20 @@ Return 1 if operand 1 is a normal floating point number and 0
 otherwise.  @var{m} is a scalar floating point mode.  Operand 0
 has mode @code{SImode}, and operand 1 has mode @var{m}.
 
+@cindex @code{crc@var{m}@var{n}4} instruction pattern
+@item @samp{crc@var{m}@var{n}4}
+Calculate a bit-forward CRC using operands 1, 2 and 3,
+then store the result in operand 0.
+Operand 1 is the initial CRC, operand 2 is the data and operand 3 is the
+polynomial without the leading 1.
+Operands 0, 1 and 3 have mode @var{n} and operand 2 has mode @var{m}, where
+both modes are integers.  The size of CRC to be calculated is determined by the
+mode; for example, if @var{n} is 'hi', a CRC16 is calculated.
+
+@cindex @code{crc_rev@var{m}@var{n}4} instruction pattern
+@item @samp{crc_rev@var{m}@var{n}4}
+Similar to @samp{crc@var{m}@var{n}4}, but calculates a bit-reversed CRC.
+
 @end table
 
 @end ifset
diff --git a/gcc/expr.cc b/gcc/expr.cc
index 7a471f20e79..3a0ddfaf132 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -14124,3 +14124,374 @@ int_expr_size (const_tree exp)
 
   return tree_to_shwi (size);
 }
+
+/* Calculate CRC for the initial CRC and given POLYNOMIAL.
+   CRC_BITS is CRC size.  */
+
+static unsigned HOST_WIDE_INT
+calculate_crc (unsigned HOST_WIDE_INT crc,
+	   unsigned HOST_WIDE_INT polynomial,
+	   unsigned short crc_bits)
+{
+  unsigned HOST_WIDE_INT msb = HOST_WIDE_INT_1U << (crc_bits - 1);
+  crc = crc << (crc_bits - 8);
+  for (short i = 8; i > 0; --i)
+{
+  if (crc & msb)
+	crc = (crc << 1) ^ polynomial;
+  else
+	crc <<= 1;
+}
+  /* Zero out bits in crc beyond the specified number of crc_bits.  */
+  if (crc_bits < sizeof (crc) * CHAR_BIT)
+crc &= (HOST_WIDE_INT_1U << crc_bits) - 1;
+  return crc;
+}
+
+/* Assemble CRC table with 256 elements for the given POLYNOM and CRC_BITS with
+   given ID.
+   ID is the identifier of the table, the name of the table is unique,
+   contains CRC size and the polynomial.
+   POLYNOM is the polynomial used to calculate the CRC table's elements.
+   CRC_BITS is the size of the CRC, which may be 8, 16, ....  */
+
+rtx
+assemble_crc_table (tree id, unsigned HOST_WIDE_INT polynom,
+		unsigned short crc_bits)
+{
+  unsigned table_el_n = 0x100;
+  tree ar = build_array_type (make_unsigned_type (crc_bits),
+			  build_index_type (size_int (table_el_n - 1)));
+  tree decl = build_decl (UNKNOWN_LOCATION, VAR_DECL, id, ar);
+  SET_DECL_ASSEMBLER_NAME (decl, id);
+  DECL_ARTIFICIAL (decl) = 1;
+  rtx tab = gen_rtx_SYMBOL_REF (Pmode, IDENTIFIER_POINTER (id));
+  TREE_ASM_WRITTEN (decl) = 0;
+
+  /* Initialize the table.  */
+  vec<tree, va_gc> *initial_values;
+  vec_alloc (initial_values, table_el_n);
+  for (size_t i = 0; i < table_el_n; ++i)
+{
+  unsigned HOST_WIDE_INT crc = calculate_crc (i, polynom, crc_bits);
+  tree element = build_int_cstu (make_unsigned_type (crc_bits), crc);
+  vec_safe_push (initial_values, element);

[PATCH 0/2] aarch64: Use standard names for saturating arithmetic

2024-10-18 Thread Akram Ahmad
Hi all,

This patch series introduces standard names for scalar, Adv. SIMD, and
SVE saturating arithmetic instructions in the aarch64 backend.

Additional tests are added for unsigned saturating arithmetic, as well
as to test that the auto-vectorizer correctly inserts NEON instructions
or scalar instructions where necessary, such as in 32 and 64-bit scalar
unsigned arithmetic. There are also tests for the auto-vectorized SVE
code.

An important discussion point: this patch causes scalar 32 and 64-bit
unsigned saturating arithmetic to now use the adds + csinv / subs + csel
sequences expected elsewhere in the backend. This affects the NEON
intrinsics for these two modes as well. This is the cause of a few test
failures; otherwise, there are no regressions on aarch64-none-linux-gnu.
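
For reference, the branchless source pattern behind those sequences looks
roughly like this (my own sketch of the expected mapping, not taken from
the patch's tests; exact register allocation may differ):

```
#include <cstdint>

uint64_t usadd64 (uint64_t a, uint64_t b)
{
  uint64_t sum = a + b;               // adds x0, x0, x1  (carry set on overflow)
  return sum < a ? UINT64_MAX : sum;  // csinv x0, x0, xzr, cc  (~xzr = all ones)
}

uint64_t ussub64 (uint64_t a, uint64_t b)
{
  uint64_t diff = a - b;              // subs x0, x0, x1  (carry clear on borrow)
  return diff > a ? 0 : diff;         // csel x0, x0, xzr, cs
}
```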

SVE currently uses the unpredicated version of the instruction in the
backend.

Many thanks,

Akram

---

Akram Ahmad (2):
  aarch64: Use standard names for saturating arithmetic
  aarch64: Use standard names for SVE saturating arithmetic

 gcc/config/aarch64/aarch64-builtins.cc| 13 +++
 gcc/config/aarch64/aarch64-simd-builtins.def  |  8 +-
 gcc/config/aarch64/aarch64-simd.md| 93 +-
 gcc/config/aarch64/aarch64-sve.md |  4 +-
 gcc/config/aarch64/arm_neon.h | 96 +--
 gcc/config/aarch64/iterators.md   |  4 +
 .../saturating_arithmetic_autovect.inc| 58 +++
 .../saturating_arithmetic_autovect_1.c| 79 +++
 .../saturating_arithmetic_autovect_2.c| 79 +++
 .../saturating_arithmetic_autovect_3.c| 75 +++
 .../saturating_arithmetic_autovect_4.c| 77 +++
 .../aarch64/saturating_arithmetic.inc | 39 
 .../aarch64/saturating_arithmetic_1.c | 41 
 .../aarch64/saturating_arithmetic_2.c | 41 
 .../aarch64/saturating_arithmetic_3.c | 30 ++
 .../aarch64/saturating_arithmetic_4.c | 30 ++
 .../aarch64/sve/saturating_arithmetic.inc | 68 +
 .../aarch64/sve/saturating_arithmetic_1.c | 60 
 .../aarch64/sve/saturating_arithmetic_2.c | 60 
 .../aarch64/sve/saturating_arithmetic_3.c | 62 
 .../aarch64/sve/saturating_arithmetic_4.c | 62 
 21 files changed, 1021 insertions(+), 58 deletions(-)
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_4.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/saturating_arithmetic.inc
 create mode 100644 gcc/testsuite/gcc.target/aarch64/saturating_arithmetic_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/saturating_arithmetic_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/saturating_arithmetic_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/saturating_arithmetic_4.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_2.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_3.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_4.c

-- 
2.34.1



Re: [PATCH 3/7] RISC-V: Fix vector memcpy smaller LMUL generation

2024-10-18 Thread Jeff Law




On 10/18/24 7:12 AM, Craig Blackmore wrote:

If riscv_vector::expand_block_move is generating a straight-line memcpy
using a predicated store, it tries to use a smaller LMUL to reduce
register pressure if it still allows an entire transfer.

This happens in the inner loop of riscv_vector::expand_block_move;
however, the vmode chosen by this loop gets overwritten later in the
function, so I have added the missing break from the outer loop.

I have also addressed a couple of issues with the conditions of the if
statement within the inner loop.

The first condition did not make sense to me:
```
   TARGET_MIN_VLEN * lmul <= nunits * BITS_PER_UNIT
```
I think this was supposed to be checking that the length fits within the
given LMUL, so I have changed it to do that.
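
In other words (my paraphrase of the intended check, with hypothetical
names):

```
// A transfer of `nunits` bytes fits in one register group of the given
// LMUL when its size in bits does not exceed VLEN * LMUL.
constexpr bool fits_in_lmul (unsigned nunits, unsigned vlen, unsigned lmul)
{
  return nunits * 8 <= vlen * lmul;
}

static_assert (fits_in_lmul (16, 128, 1), "16 bytes fit in LMUL=1, VLEN=128");
static_assert (!fits_in_lmul (32, 128, 1), "32 bytes need LMUL=2, VLEN=128");
```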

Yea, this just looks broken.



The second condition:
```
   /* Avoid loosing the option of using vsetivli .  */
   && (nunits <= 31 * lmul || nunits > 31 * 8)
```
seems to imply that lmul affects the range of AVL immediate that
vsetivli can take but I don't think that is correct.  Anyway, I don't
think this condition is necessary because if we find a suitable mode we
should stick with it, regardless of whether it allowed vsetivli, rather
than continuing to try larger lmul which would increase register
pressure or smaller potential_ew which would increase AVL.  I have
removed this condition.
I think it's just trying to micro-optimize, but it may not be a 
particularly good tradeoff.  That load immediate should be incredibly 
cheap on a modern design.  Generating a smaller LMUL seems like the 
better tradeoff.  Simplifies the code as well.


Pushed to the trunk.

Thanks!

jeff



Re: [WIP RFC] libstdc++: add module std

2024-10-18 Thread Patrick Palka
On Fri, 18 Oct 2024, Jason Merrill wrote:

> This patch is not ready for integration, but I'd like to get feedback on the
> approach (and various specific questions below).
> 
> -- 8< --
> 
> This patch introduces an installed source form of module std and std.compat.
> To find them, we install a libstdc++.modules.json file alongside
> libstdc++.so, which tells the build system where the files are and any
> special flags it should use when compiling them (none, in our case).  The
> format is from a proposal in SG15.
> 
> The build system can find this file with
> gcc -print-file-name=libstdc++.modules.json
> 
> It seems preferable to use a relative path from this file to the sources so
> that moving the installation doesn't break the reference, but I didn't see
> any obvious way to compute that without relying on coreutils, perl, or
> python, so I wrote a POSIX shell script for it.
> 
> Currently this installs the sources under $(pkgdata), i.e.
> /usr/share/libstdc++/modules.  It could also make sense to install them
> under $(gxx_include_dir), i.e. /usr/include/c++/15/modules.  And/or the
> subdirectory could be "miu" (module interface unit) instead of "modules".
> 
> The sources currently have the extension .cc, like other source files.
> Alternatively, they could use one of the module interface unit extensions,
> perhaps .ccm.
> 
> std.cc started with m.cencora's implementation in PR114600.  I've made some
> adjustments, but more is probably desirable, e.g. in the
> handling of namespace ranges, and removing exports of templates that are
> only specialized in a particular header.
> 
> The std module is missing exports for some newer headers, including some
> that are already implemented.  I've added some FIXMEs where I noticed
> missing bits.
> 
> Since bits/stdc++.h also intends to include the whole standard library, I
> include it rather than duplicate it.  But stdc++.h comments out <cassert>,
> so I include it separately.  Alternatively, we could uncomment it in
> stdc++.h.
> 
> stdc++.h also doesn't include the eternally deprecated <strstream>.  There
> are some other deprecated facilities that I notice are included: <codecvt>
> and float_denorm_style, at least.  It would be nice for L{E,}WG to clarify
> whether module std is intended to include interfaces that were deprecated in
> C++23, since ancient code isn't going to be relying on module std.
> 
> If they are supposed to included, do we also want to keep exporting them in
> C++26, where they are removed from the standard?
> 
> It seemed most convenient for the two files to be monolithic so we don't
> need to worry about include paths.  So the C library names that module
> std.compat exports in both namespace std and :: are a block of
> code that is identical in both files, adjusted based on whether the macro
> STD_COMPAT is defined before the block.
> 
> In this implementation std.compat imports std; it would also be valid for it
> to duplicate everything in std.  I see the libc++ std.compat also imports
> std.
> 
> Is it useful for std.cc to live in a subdirectory of c++23 as in this patch, 
> or
> should it be in c++23 itself?  Or elsewhere?

IIUC the src/ subdirectory is for stuff that gets compiled into the .so
which isn't the case here.  And if we want to support the std module in
C++20 mode as an extension, a c++23 subdirectory might not be ideal either.
Maybe libstdc++-v3/modules/ then?

> 
> This patch doesn't yet provide a convenient way for a user to find std.cc.
> 
> libstdc++-v3/ChangeLog:
> 
>   * src/c++23/Makefile.am: Add module std/std.compat.
>   * src/c++23/Makefile.in: Regenerate.
>   * src/c++23/modules/std.cc: New file.
>   * src/c++23/modules/std.compat.cc: New file.
>   * src/c++23/libstdc++.modules.json.in: New file.
> 
> contrib/ChangeLog:
> 
>   * relpath.sh: New file.
> ---
>  libstdc++-v3/src/c++23/modules/std.cc | 3575 +
>  libstdc++-v3/src/c++23/modules/std.compat.cc  |  640 +++
>  contrib/relpath.sh|   70 +
>  libstdc++-v3/src/c++23/Makefile.am|   18 +
>  libstdc++-v3/src/c++23/Makefile.in|  133 +-
>  .../src/c++23/libstdc++.modules.json.in   |   17 +
>  6 files changed, 4436 insertions(+), 17 deletions(-)
>  create mode 100644 libstdc++-v3/src/c++23/modules/std.cc
>  create mode 100644 libstdc++-v3/src/c++23/modules/std.compat.cc
>  create mode 100755 contrib/relpath.sh
>  create mode 100644 libstdc++-v3/src/c++23/libstdc++.modules.json.in
> 
> diff --git a/libstdc++-v3/src/c++23/modules/std.cc 
> b/libstdc++-v3/src/c++23/modules/std.cc
> new file mode 100644
> index 000..d823b915b9c
> --- /dev/null
> +++ b/libstdc++-v3/src/c++23/modules/std.cc
> @@ -0,0 +1,3575 @@
> +// -*- C++ -*- [std.modules] module std
> +
> +// Copyright The GNU Toolchain Authors.
> +//
> +// This file is part of the GNU ISO C++ Library.  This library is free
> +// software; you can redistribute it and/or modify it under the
> +// terms of 
