GCC 14.1.1 Status Report (2024-07-11)

2024-07-11 Thread Jakub Jelinek via Gcc
Status
======

The gcc-14 branch is open for regression and documentation fixes.

It's time to plan for a GCC 14.2 release which should follow
roughly two to three months after the .1 release.  The plan is
to do a release candidate for GCC 14.2 on Tuesday, Jul 23rd
with the release following a week after that on Tuesday, Jul 30th.

That leaves some (but not plenty of) time to backport regression fixes
from trunk.  Please make sure to go over the list of your assigned
bug reports and consider backporting where that seems safe.


Quality Data
============

Priority          #   Change from last report
--------        ---   -----------------------
P1                0
P2              603   -   2
P3              125   +  53
P4              211   -   6
P5               24   -   1
--------        ---   -----------------------
Total P1-P3     728   +  51
Total           963   +  44


Previous Report
===============

https://gcc.gnu.org/pipermail/gcc/2024-May/243920.html



insn attributes: Support blocks of C-code?

2024-07-11 Thread Georg-Johann Lay

So I had that situation where in an insn attribute, providing
a block of code (rather than just an expression) would be
useful.

Expressions can be provided by means of symbol_ref, like in

(set (attr "length")
 (symbol_ref ("1 + GET_MODE_SIZE (<MODE>mode)")))

However, providing a block of code gives a syntax error from
the compiler, *NOT* from md_reader:

(set (attr "length")
 (symbol_ref
  {
int len = 1;
return len;
  }))

This means such syntax is already supported to some degree,
there's just no semantics assigned to such code.

Blocks of code are already supported in insn predicates,
like in

(define_predicate "my_operand"
  (match_code "code_label,label_ref,symbol_ref,plus,const")
{
  some code...
  return true-or-false;
})

In the insn attribute case, I hacked a bit and supported
blocks of code like in the example above.  The biggest change
is that class attr_desc has to be moved from genattrtab.cc to
read-md.h so that it is a complete type as required by
md_reader::fprint_c_condition().

That method prints the code for symbol_ref and some others, and
it has to know the type of the attribute, like "int" for the
"length" attribute.  The implementation in fprint_c_condition()
is straightforward:

When cond (which is the payload string of symbol_ref, including the
'{}'s) starts with '{', then print a lambda that's called in place,
like in

   print "( [&]() -> <attr-type> <cond> () )"

The "&" capture is required so that variables like "insn" are
accessible. "operands[]" and "which_alternative" are global,
thus also accessible.
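
To illustrate the mechanism outside of genattrtab (just a stand-alone
sketch of the technique, not generated code): wrapping a block in an
immediately-invoked lambda turns it into an expression, and the "[&]"
capture keeps the surrounding variables visible, the same way "insn"
stays visible in the generated attribute functions.

#include <cstdio>

int main ()
{
  int mode_size = 4;             // stands in for something like GET_MODE_SIZE (...)
  int length = ( [&]() -> int
    {
      int len = 1 + mode_size;   // an arbitrary block of code
      return len;
    } () );
  std::printf ("%d\n", length);  // prints 5
  return 0;
}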

Attached is the code I have so far (which is by no means a
proposed patch, so I am posting here on gcc@).

As far as I can tell, there is no performance penalty, e.g.
in build times, when the feature is not used.  Of course instead
of such syntax, a custom function could be used, or the
braces-brackets-parentheses-gibberish could be written out
in the symbol_ref as an expression.  Though I think this
could be a nice addition, in particular because the scanning
side in md_reader already supports the syntax.

Johann

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 7f4335e0aac..3e46693e8c2 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -10265,6 +10265,56 @@ so there is no need to explicitly convert the expression into a boolean
 (match_test "(x & 2) != 0")
 @end smallexample
 
+@cindex @code{symbol_ref} and attributes
+@item (symbol_ref "@var{quoted-c-expr}")
+
+Specifies the value of the attribute sub-expression as a C expression,
+where the surrounding quotes are not part of the expression.
+Similar to @code{match_test}, variables @var{insn}, @var{operands[]}
+and @var{which_alternative} are available.  Moreover, code and mode
+attributes can be used to compose the resulting C expression, like in
+
+@smallexample
+(set (attr "length")
+ (symbol_ref ("1 + GET_MODE_SIZE (<MODE>mode)")))
+@end smallexample
+
+where the corresponding insn has exactly one mode iterator.
+See @ref{Mode Iterators} and @ref{Code Iterators}.
+
+@item  (symbol_ref "@{ @var{quoted-c-code} @}")
+@itemx (symbol_ref @{ @var{c-code} @})
+
+The value of this subexpression is determined by running a block
+of C code which returns the desired value.
+The braces are part of the code, whereas the quotes in the quoted form are not.
+
+This variant of @code{symbol_ref} allows for more complex code than
+just a single C expression, like for example:
+
+@smallexample
+(set (attr "length")
+ (symbol_ref
+  @{
+int len;
+some_function (insn, <CODE>, <MODE>mode, & len);
+return len;
+  @}))
+@end smallexample
+
+for an insn that has one code iterator and one mode iterator.
+Again, variables @var{insn}, @var{operands[]} and @var{which_alternative}
+can be used.  The unquoted form only supports a subset of C,
+for example no C comments are supported, and strings that contain
+characters like @samp{@}} are problematic and may need to be escaped
+as @samp{\@}}.
+
+The return type is @code{int} for the @var{length} attribute, and
+@code{enum attr_@var{name}} for an insn attribute named @var{name}.
+The types and available enum values can be looked up in
+@file{$builddir/gcc/insn-attr-common.h}.
+
+
 @cindex @code{le} and attributes
 @cindex @code{leu} and attributes
 @cindex @code{lt} and attributes
diff --git a/gcc/genattrtab.cc b/gcc/genattrtab.cc
index 03c7d6c74a3..20d45ba88af 100644
--- a/gcc/genattrtab.cc
+++ b/gcc/genattrtab.cc
@@ -168,22 +168,6 @@ struct attr_value
   int has_asm_insn;		/* True if this value used for `asm' insns */
 };
 
-/* Structure for each attribute.  */
-
-class attr_desc
-{
-public:
-  char *name;			/* Name of attribute.  */
-  const char *enum_name;	/* Enum name for DEFINE_ENUM_NAME.  */
-  class attr_desc *next;	/* Next attribute.  */
-  struct attr_value *first_value; /* First value of this attribute.  */
-  struct attr_value *default_val; /* Default value for this attribute.  */
-  file_location loc;		/* Where in the .md files it occurs.  */
-  unsigned is_num

Re: Help with vector cost model

2024-07-11 Thread Richard Sandiford via Gcc
Andrew Pinski  writes:
> I need some help with the vector cost model for aarch64.
> I am adding V2HI and V4QI mode support by emulating it using the
> native V4HI/V8QI instructions (similarly to mmx as SSE is done). The
> problem is I am running into a cost model issue with
> gcc.target/aarch64/pr98772.c (wminus is similar to
> gcc.dg/vect/slp-gap-1.c, just slightly different offsets for the
> address).
> It seems like the cost model is overestimating the number of loads for
> the V8QI case.
> With the new cost model usage (-march=armv9-a+nosve), I get:
> ```
> t.c:7:21: note:  * Analysis succeeded with vector mode V4QI
> t.c:7:21: note:  Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
> t.c:7:21: note:  Issue info for V4QI loop:
> t.c:7:21: note:load operations = 2
> t.c:7:21: note:store operations = 1
> t.c:7:21: note:general operations = 4
> t.c:7:21: note:reduction latency = 0
> t.c:7:21: note:estimated min cycles per iteration = 2.00
> t.c:7:21: note:  Issue info for V8QI loop:
> t.c:7:21: note:load operations = 12
> t.c:7:21: note:store operations = 1
> t.c:7:21: note:general operations = 6
> t.c:7:21: note:reduction latency = 0
> t.c:7:21: note:estimated min cycles per iteration = 4.33
> t.c:7:21: note:  Weighted cycles per iteration of V4QI loop ~= 4.00
> t.c:7:21: note:  Weighted cycles per iteration of V8QI loop ~= 4.33
> t.c:7:21: note:  Preferring loop with lower cycles per iteration
> t.c:7:21: note:  * Preferring vector mode V4QI to vector mode V8QI
> ```
>
> That is totally wrong and instead of vectorizing using V8QI we
> vectorize using V4QI and the resulting code is worse.
>
> Attached is my current patch for adding V4QI/V2HI to the aarch64
> backend (Note I have not finished up the changelog nor the testcases;
> I have secondary patches that add the testcases already).
> Is there something I am missing here or are we just over estimating
> V8QI cost and is something easy to fix?

Trying it locally, I get:

foo.c:15:23: note:  * Analysis succeeded with vector mode V4QI
foo.c:15:23: note:  Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
foo.c:15:23: note:  Issue info for V4QI loop:
foo.c:15:23: note:load operations = 2
foo.c:15:23: note:store operations = 1
foo.c:15:23: note:general operations = 4
foo.c:15:23: note:reduction latency = 0
foo.c:15:23: note:estimated min cycles per iteration = 2.00
foo.c:15:23: note:  Issue info for V8QI loop:
foo.c:15:23: note:load operations = 8
foo.c:15:23: note:store operations = 1
foo.c:15:23: note:general operations = 6
foo.c:15:23: note:reduction latency = 0
foo.c:15:23: note:estimated min cycles per iteration = 3.00
foo.c:15:23: note:  Weighted cycles per iteration of V4QI loop ~= 4.00
foo.c:15:23: note:  Weighted cycles per iteration of V8QI loop ~= 3.00
foo.c:15:23: note:  Preferring loop with lower cycles per iteration

The function is:

extern void
wplus (uint16_t *d, uint8_t *restrict pix1, uint8_t *restrict pix2)
{
  for (int y = 0; y < 4; y++)
    {
      for (int x = 0; x < 4; x++)
        d[x + y*4] = pix1[x] + pix2[x];
      pix1 += 16;
      pix2 += 16;
    }
}

For V8QI we need a VF of 2, so that there are 8 elements to store to d.
Conceptually, we handle those two iterations by loading 4 V8QIs from
pix1 and pix2 (32 bytes each), with mitigations against overrun,
and then permute the result to single V8QIs.

vectorize_load doesn't seem to be smart enough to realise that only 2
of those 4 loads are actually used in the permutation, and so only 2
loads should be costed for each of pix1 and pix2.
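
To put that concretely (a hand-written sketch of the access pattern,
not vectorizer output): with a VF of 2 and an outer stride of 16, all
eight pix1 lanes of one V8QI iteration come from the chunks at offsets
0 and 16, so the loads at offsets 8 and 24 contribute nothing and
should not be costed.

#include <stdint.h>

/* Sketch only: gather the pix1 bytes consumed by one V8QI vector
   iteration of wplus.  Everything comes from the V8QI chunks at
   pix1 + 0 and pix1 + 16.  */
static void
used_pix1_lanes (const uint8_t *pix1, uint8_t out[8])
{
  for (int i = 0; i < 4; i++)
    {
      out[i] = pix1[i];            /* first scalar iteration, x = 0..3  */
      out[i + 4] = pix1[16 + i];   /* second scalar iteration, x = 0..3 */
    }
}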

Thanks,
Richard


Re: Help with vector cost model

2024-07-11 Thread Richard Biener via Gcc
On Thu, Jul 11, 2024 at 10:58 AM Richard Sandiford
 wrote:
>
> Andrew Pinski  writes:
> > I need some help with the vector cost model for aarch64.
> > I am adding V2HI and V4QI mode support by emulating it using the
> > native V4HI/V8QI instructions (similarly to mmx as SSE is done). The
> > problem is I am running into a cost model issue with
> > gcc.target/aarch64/pr98772.c (wminus is similar to
> > gcc.dg/vect/slp-gap-1.c, just slightly different offsets for the
> > address).
> > It seems like the cost mode is overestimating the number of loads for
> > V8QI case .
> > With the new cost model usage (-march=armv9-a+nosve), I get:
> > ```
> > t.c:7:21: note:  * Analysis succeeded with vector mode V4QI
> > t.c:7:21: note:  Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
> > t.c:7:21: note:  Issue info for V4QI loop:
> > t.c:7:21: note:load operations = 2
> > t.c:7:21: note:store operations = 1
> > t.c:7:21: note:general operations = 4
> > t.c:7:21: note:reduction latency = 0
> > t.c:7:21: note:estimated min cycles per iteration = 2.00
> > t.c:7:21: note:  Issue info for V8QI loop:
> > t.c:7:21: note:load operations = 12
> > t.c:7:21: note:store operations = 1
> > t.c:7:21: note:general operations = 6
> > t.c:7:21: note:reduction latency = 0
> > t.c:7:21: note:estimated min cycles per iteration = 4.33
> > t.c:7:21: note:  Weighted cycles per iteration of V4QI loop ~= 4.00
> > t.c:7:21: note:  Weighted cycles per iteration of V8QI loop ~= 4.33
> > t.c:7:21: note:  Preferring loop with lower cycles per iteration
> > t.c:7:21: note:  * Preferring vector mode V4QI to vector mode V8QI
> > ```
> >
> > That is totally wrong and instead of vectorizing using V8QI we
> > vectorize using V4QI and the resulting code is worse.
> >
> > Attached is my current patch for adding V4QI/V2HI to the aarch64
> > backend (Note I have not finished up the changelog nor the testcases;
> > I have secondary patches that add the testcases already).
> > Is there something I am missing here or are we just over estimating
> > V8QI cost and is something easy to fix?
>
> Trying it locally, I get:
>
> foo.c:15:23: note:  * Analysis succeeded with vector mode V4QI
> foo.c:15:23: note:  Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
> foo.c:15:23: note:  Issue info for V4QI loop:
> foo.c:15:23: note:load operations = 2
> foo.c:15:23: note:store operations = 1
> foo.c:15:23: note:general operations = 4
> foo.c:15:23: note:reduction latency = 0
> foo.c:15:23: note:estimated min cycles per iteration = 2.00
> foo.c:15:23: note:  Issue info for V8QI loop:
> foo.c:15:23: note:load operations = 8
> foo.c:15:23: note:store operations = 1
> foo.c:15:23: note:general operations = 6
> foo.c:15:23: note:reduction latency = 0
> foo.c:15:23: note:estimated min cycles per iteration = 3.00
> foo.c:15:23: note:  Weighted cycles per iteration of V4QI loop ~= 4.00
> foo.c:15:23: note:  Weighted cycles per iteration of V8QI loop ~= 3.00
> foo.c:15:23: note:  Preferring loop with lower cycles per iteration
>
> The function is:
>
> extern void
> wplus (uint16_t *d, uint8_t *restrict pix1, uint8_t *restrict pix2 )
> {
> for (int y = 0; y < 4; y++ )
> {
> for (int x = 0; x < 4; x++ )
> d[x + y*4] = pix1[x] + pix2[x];
> pix1 += 16;
> pix2 += 16;
> }
> }
>
> For V8QI we need a VF of 2, so that there are 8 elements to store to d.
> Conceptually, we handle those two iterations by loading 4 V8QIs from
> pix1 and pix2 (32 bytes each), with mitigations against overrun,
> and then permute the result to single V8QIs.
>
> vectorize_load doesn't seem to be smart enough to realise that only 2
> of those 4 loads are actually used in the permuation, and so only 2
> loads should be costed for each of pix1 and pix2.

Though it has code to do that.

Richard.

> Thanks,
> Richard


Re: Apply function attributes (e.g., [[gnu::access()]]) to pointees too

2024-07-11 Thread Martin Uecker via Gcc


On Thursday, 2024-07-11, at 11:35 +0200, Alejandro Colomar via Gcc wrote:
> Hi,
> 
> I was wondering how we could extend attributes such as gnu::access() to
> apply it to pointees too.  Currently, there's no way to specify the
> access mode of a pointee.
> 
> Let's take for example strsep(3):
> 
> With current syntax, this is what we can specify:
> 
>   [[gnu::access(read_write, 1)]]
>   [[gnu::access(read_only, 2)]]
>   [[gnu::nonnull(1, 2)]]
>   [[gnu::null_terminated_string_arg(2)]]
>   char *
>   strsep(char **restrict sp, const char *delim);

The main problem from a user perspective is that
these are attributes on the function declaration
and not on the argument (type).

> 
> I was thinking that with floating numbers, one could specify the number
> of dereferences with a number after the decimal point.  It's a bit
> weird, since the floating point is interpreted as two separate integer
> numbers separated by a '.', but could work.  In this case:
> 
>   [[gnu::access(read_write, 1)]]
>   [[gnu::access(read_write, 1.1)]]
>   [[gnu::access(read_only, 2)]]
>   [[gnu::nonnull(1, 2)]]
>   [[gnu::null_terminated_string_arg(1.1)]]
>   [[gnu::null_terminated_string_arg(2)]]
>   char *
>   strsep(char **restrict sp, const char *delim);
> 
> Which would mark the pointer *sp as read_write and a string.  What do
> you think about it?

If the attributes could be applied to the type, then
one could attach them directly at an intermediate
pointer level, which would be more intuitive and
less fragile.


Martin






Nonbootstrap build with Apple clang broken in gm2

2024-07-11 Thread FX Coudert via Gcc
Hi,

I am unable to perform a nonbootstrap build when gm2 is included, with Apple
clang 15 as compiler.  The error is due to incorrect inclusion of C++ standard
headers, which are included after GCC’s system.h has been included, and macros
like abort() are redefined or poisoned.

I think the correct idiom in GCC is to #define INCLUDE_STRING (for example)
before “system.h” is included, rather than #include <string> directly.  The
attached patch fixes the issue.
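
For reference, the usual shape of that idiom looks something like this
(a sketch, not the actual patch; which INCLUDE_* macros are needed
depends on the headers each file wants):

/* Request C++ headers through system.h instead of including them
   directly, so that system.h's macro handling is not bypassed.  */
#define INCLUDE_STRING
#define INCLUDE_VECTOR
#include "config.h"
#include "system.h"
#include "coretypes.h"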

Best,
FX



gm2.diff
Description: Binary data


Re: Apply function attributes (e.g., [[gnu::access()]]) to pointees too

2024-07-11 Thread David Brown via Gcc

On 11/07/2024 11:58, Martin Uecker via Gcc wrote:


On Thursday, 2024-07-11, at 11:35 +0200, Alejandro Colomar via Gcc wrote:

Hi,

I was wondering how we could extend attributes such as gnu::access() to
apply it to pointees too.  Currently, there's no way to specify the
access mode of a pointee.

Let's take for example strsep(3):

With current syntax, this is what we can specify:

[[gnu::access(read_write, 1)]]
[[gnu::access(read_only, 2)]]
[[gnu::nonnull(1, 2)]]
[[gnu::null_terminated_string_arg(2)]]
char *
strsep(char **restrict sp, const char *delim);


The main problem from a user perspective is that
these are attributes on the function declaration
and not on the argument (type).



I was thinking that with floating numbers, one could specify the number
of dereferences with a number after the decimal point.  It's a bit
weird, since the floating point is interpreted as two separate integer
numbers separated by a '.', but could work.  In this case:

[[gnu::access(read_write, 1)]]
[[gnu::access(read_write, 1.1)]]
[[gnu::access(read_only, 2)]]
[[gnu::nonnull(1, 2)]]
[[gnu::null_terminated_string_arg(1.1)]]
[[gnu::null_terminated_string_arg(2)]]
char *
strsep(char **restrict sp, const char *delim);

Which would mark the pointer *sp as read_write and a string.  What do
you think about it?


If the attributes could be applied to the type, then
one could attach them directly at an intermediate
pointer level, which would be more intuitive and
less fragile.



That would be a huge improvement (IMHO).  Then you could write:

#define RW [[gnu::access(read_write)]]
#define RO [[gnu::access(read_only)]]
#define NONNULL [[gnu::nonnull]]
#define CSTRING [[gnu::null_terminated_string_arg]]

char * strsep(char * RW * RW NONNULL CSTRING restrict sp,
  const char * RO NONNULL CSTRING delim);

It would be even better if the characteristics could be tied into a typedef.

typedef const char * [[gnu::access(read_only)]] [[gnu::nonnull]] 
[[gnu::null_terminated_string_arg]] const_cstring;
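
A prototype could then be written as (sketch; the function name is
made up):

char *find_token (const_cstring haystack, const_cstring needle);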


David



Re: Apply function attributes (e.g., [[gnu::access()]]) to pointees too

2024-07-11 Thread Alejandro Colomar via Gcc
Hi Martin, David,

On Thu, Jul 11, 2024 at 06:08:38PM GMT, David Brown wrote:
> On 11/07/2024 11:58, Martin Uecker via Gcc wrote:
> > >   [[gnu::access(read_write, 1)]]
> > >   [[gnu::access(read_only, 2)]]
> > >   [[gnu::nonnull(1, 2)]]
> > >   [[gnu::null_terminated_string_arg(2)]]
> > >   char *
> > >   strsep(char **restrict sp, const char *delim);
> > 
> > The main problem from a user perspective is that
> > these are attributes on the function declaration
> > and not on the argument (type).
> > 
> > > 
> > > I was thinking that with floating numbers, one could specify the number
> > > of dereferences with a number after the decimal point.  It's a bit
> > > weird, since the floating point is interpreted as two separate integer
> > > numbers separated by a '.', but could work.  In this case:
> > > 
> > >   [[gnu::access(read_write, 1)]]
> > >   [[gnu::access(read_write, 1.1)]]
> > >   [[gnu::access(read_only, 2)]]
> > >   [[gnu::nonnull(1, 2)]]
> > >   [[gnu::null_terminated_string_arg(1.1)]]
> > >   [[gnu::null_terminated_string_arg(2)]]
> > >   char *
> > >   strsep(char **restrict sp, const char *delim);
> > > 
> > > Which would mark the pointer *sp as read_write and a string.  What do
> > > you think about it?
> > 
> > If the attributes could be applied to the type, then
> > one could attach them directly at an intermediate
> > pointer level, which would be more intuitive and
> > less fragile.
> > 
> 
> That would be a huge improvement (IMHO).  Then you could write :
> 
> #define RW [[gnu::access(read_write)]]
> #define RO [[gnu::access(read_only)]]
> #define NONNULL [[gnu::nonnull]]
> #define CSTRING [[gnu::null_terminated_string_arg]]
> 
> char * strsep(char * RW * RW NONNULL CSTRING restrict sp,
>   const char * RO NONNULL CSTRING delim);

Yup; if that could be done, it would be interesting.  Martin, can it be
done?  I'm worried that it might get ambiguous in some cases.  Is there
any summary of positions where C23 attributes can go and their meanings?
I always have a hard time finding all the possible combinations.

Should such a new attribute go to the left of the '*', or to the right?
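
For what it's worth, my reading of the C23 declarator grammar is that
an attribute written after a '*' appertains to the pointer introduced
by that '*' (the attribute names below are made up, purely to show the
positions):

char * [[vendor::attr_inner]] * [[vendor::attr_outer]] sp;
/* attr_inner appertains to the inner "char *" (the type of *sp);
   attr_outer appertains to sp's own "char **" type.  */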

> It would be even better if the characteristics could be tied into a typedef.
> 
> typedef const char * [[gnu::access(read_only)]] [[gnu::nonnull]]
> [[gnu::null_terminated_string_arg]] const_cstring;

H.

> David

Cheers,
Alex



gcc-12-20240711 is now available

2024-07-11 Thread GCC Administrator via Gcc
Snapshot gcc-12-20240711 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/12-20240711/
and on various mirrors, see https://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 12 git branch
with the following options: git://gcc.gnu.org/git/gcc.git branch 
releases/gcc-12 revision a655c8d2098aff5235934263b065a389a9fcbbca

You'll find:

 gcc-12-20240711.tar.xz   Complete GCC

  SHA256=770f9b50c012ff62bdf07c46a54e7fee96729fc0efce56b8bd78476e03f24c44
  SHA1=ff2f9720ab37f8b17a5d4050cc4ee2ccc6419a3f

Diffs from 12-20240704 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-12
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


g++12 broke my system of overloaded operator<

2024-07-11 Thread Dalbey, Keith via Gcc
So I'm on Red Hat 7 and just got devtoolset-12, and code (a system of
overloaded << operators) that was working with devtoolset-10 now breaks
(because of ordering).

To not bury the lead...

My code relies on the version 11 or older behavior (see below), and I don't see
how anyone could call the new behavior an improvement or correction, because it
neuters/cancels out the power/flexibility of the STL.  Yes, one could
technically work around it by forward declaring templated operator<<, but that
makes the system not extensible: a common package/GitLab project that handles
this for the STL and then gets pulled into another library that overloads
operator<< for concrete classes just doesn't work any more... and that was my
exact use case.

Please reverse this change in future editions of gcc, it is absolutely awful.

From this link
https://developers.redhat.com/articles/2022/04/25/new-c-features-gcc-12#corrections_and_internal_improvements
Corrections and internal improvements
The changes described in this section bring GCC more in line with recent 
changes to the standard, and permit behavior that previously did not work 
correctly.
Dependent operator lookup changes
GCC 12 corrected a problem where the compiler performed an unqualified lookup 
for a dependent operator expression at template definition time instead of at 
instantiation time. The fix matches the existing behavior for dependent call 
expressions. Consider the following test case demonstrating this change:
#include <iostream>

namespace N {
  struct A { };
}

void operator+(N::A, double) {
  std::cout << "#1 ";
}

template <typename T>
void f(T t) {
  operator+(t, 0);
  t + 0;
}

// Since it's not visible from the template definition, this later-declared
// operator overload should not be considered when instantiating f(N::A),
// for either the call or operator expression.
void operator+(N::A, int) {
  std::cout << "#2 ";
}

int main() {
  N::A a;
  f(a);
  std::cout << std::endl;
}
Copy snippet
This program will print #1 #2 when compiled with versions 11 or older of GCC, 
but GCC 12 correctly prints #1 #1. That's because previously only the call 
expression resolved to the #1 overload, but with GCC 12 the operator expression 
does too.


Re: g++12 broke my system of overloaded operator<

2024-07-11 Thread Andrew Pinski via Gcc
On Thu, Jul 11, 2024 at 5:04 PM Dalbey, Keith via Gcc  wrote:
>
> So I'm on Red Hat 7 and just got devtoolset-12, and code (a system of
> overloaded << operators) that was working with devtoolset-10 now breaks
> (because of ordering).
>
> To not bury the lead...
>
> My code relies on the version 11 or older behavior (see below), and I don't
> see how anyone could call the new behavior an improvement or correction,
> because it neuters/cancels out the power/flexibility of the STL.  Yes, one
> could technically work around it by forward declaring templated operator<<,
> but that makes the system not extensible: a common package/GitLab project
> that handles this for the STL and then gets pulled into another library that
> overloads operator<< for concrete classes just doesn't work any more...
> and that was my exact use case.

So your code depends on non-standard behavior that GCC accidentally
got wrong until GCC 12.
I feel for you but ...

>
> Please reverse this change in future editions of gcc, it is absolutely awful.

This is not going to happen since your code depended on incorrect
behavior. And code that depends on behavior defined by the standard
would now break.
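
For what it's worth, the usual portable pattern (a sketch, not code
from this thread) is to declare the overload in the namespace of its
class argument, so that argument-dependent lookup finds it at
instantiation time:

#include <iostream>

namespace N {
  struct A { };
  // Lives in A's namespace, so ADL finds it from the dependent
  // expression at instantiation time.
  void operator+(A, int) { std::cout << "#2 "; }
}

template <typename T>
void f(T t)
{
  t + 0;   // dependent operator expression, resolved via ADL
}

int main()
{
  f(N::A{});
  std::cout << std::endl;
}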

Thanks,
Andrew

>
> From this link
> https://developers.redhat.com/articles/2022/04/25/new-c-features-gcc-12#corrections_and_internal_improvements
> Corrections and internal improvements
> The changes described in this section bring GCC more in line with recent 
> changes to the standard, and permit behavior that previously did not work 
> correctly.
> Dependent operator lookup changes
> GCC 12 corrected a problem where the compiler performed an unqualified lookup 
> for a dependent operator expression at template definition time instead of at 
> instantiation time. The fix matches the existing behavior for dependent call 
> expressions. Consider the following test case demonstrating this change:
> #include <iostream>
>
> namespace N {
>   struct A { };
> }
>
> void operator+(N::A, double) {
>   std::cout << "#1 ";
> }
>
> template <typename T>
> void f(T t) {
>   operator+(t, 0);
>   t + 0;
> }
>
> // Since it's not visible from the template definition, this later-declared
> // operator overload should not be considered when instantiating 
> f(N::A),
> // for either the call or operator expression.
> void operator+(N::A, int) {
>   std::cout << "#2 ";
> }
>
> int main() {
>   N::A a;
>   f(a);
>   std::cout << std::endl;
> }
> Copy snippet
> This program will print #1 #2 when compiled with versions 11 or older of GCC, 
> but GCC 12 correctly prints #1 #1. That's because previously only the call 
> expression resolved to the #1 overload, but with GCC 12 the operator 
> expression does too.


Re: Help with vector cost model

2024-07-11 Thread Andrew Pinski via Gcc
On Thu, Jul 11, 2024 at 2:14 AM Richard Biener
 wrote:
>
> On Thu, Jul 11, 2024 at 10:58 AM Richard Sandiford
>  wrote:
> >
> > Andrew Pinski  writes:
> > > I need some help with the vector cost model for aarch64.
> > > I am adding V2HI and V4QI mode support by emulating it using the
> > > native V4HI/V8QI instructions (similarly to mmx as SSE is done). The
> > > problem is I am running into a cost model issue with
> > > gcc.target/aarch64/pr98772.c (wminus is similar to
> > > gcc.dg/vect/slp-gap-1.c, just slightly different offsets for the
> > > address).
> > > It seems like the cost mode is overestimating the number of loads for
> > > V8QI case .
> > > With the new cost model usage (-march=armv9-a+nosve), I get:
> > > ```
> > > t.c:7:21: note:  * Analysis succeeded with vector mode V4QI
> > > t.c:7:21: note:  Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
> > > t.c:7:21: note:  Issue info for V4QI loop:
> > > t.c:7:21: note:load operations = 2
> > > t.c:7:21: note:store operations = 1
> > > t.c:7:21: note:general operations = 4
> > > t.c:7:21: note:reduction latency = 0
> > > t.c:7:21: note:estimated min cycles per iteration = 2.00
> > > t.c:7:21: note:  Issue info for V8QI loop:
> > > t.c:7:21: note:load operations = 12
> > > t.c:7:21: note:store operations = 1
> > > t.c:7:21: note:general operations = 6
> > > t.c:7:21: note:reduction latency = 0
> > > t.c:7:21: note:estimated min cycles per iteration = 4.33
> > > t.c:7:21: note:  Weighted cycles per iteration of V4QI loop ~= 4.00
> > > t.c:7:21: note:  Weighted cycles per iteration of V8QI loop ~= 4.33
> > > t.c:7:21: note:  Preferring loop with lower cycles per iteration
> > > t.c:7:21: note:  * Preferring vector mode V4QI to vector mode V8QI
> > > ```
> > >
> > > That is totally wrong and instead of vectorizing using V8QI we
> > > vectorize using V4QI and the resulting code is worse.
> > >
> > > Attached is my current patch for adding V4QI/V2HI to the aarch64
> > > backend (Note I have not finished up the changelog nor the testcases;
> > > I have secondary patches that add the testcases already).
> > > Is there something I am missing here or are we just over estimating
> > > V8QI cost and is something easy to fix?
> >
> > Trying it locally, I get:
> >
> > foo.c:15:23: note:  * Analysis succeeded with vector mode V4QI
> > foo.c:15:23: note:  Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
> > foo.c:15:23: note:  Issue info for V4QI loop:
> > foo.c:15:23: note:load operations = 2
> > foo.c:15:23: note:store operations = 1
> > foo.c:15:23: note:general operations = 4
> > foo.c:15:23: note:reduction latency = 0
> > foo.c:15:23: note:estimated min cycles per iteration = 2.00
> > foo.c:15:23: note:  Issue info for V8QI loop:
> > foo.c:15:23: note:load operations = 8
> > foo.c:15:23: note:store operations = 1
> > foo.c:15:23: note:general operations = 6
> > foo.c:15:23: note:reduction latency = 0
> > foo.c:15:23: note:estimated min cycles per iteration = 3.00
> > foo.c:15:23: note:  Weighted cycles per iteration of V4QI loop ~= 4.00
> > foo.c:15:23: note:  Weighted cycles per iteration of V8QI loop ~= 3.00
> > foo.c:15:23: note:  Preferring loop with lower cycles per iteration
> >
> > The function is:
> >
> > extern void
> > wplus (uint16_t *d, uint8_t *restrict pix1, uint8_t *restrict pix2 )
> > {
> > for (int y = 0; y < 4; y++ )
> > {
> > for (int x = 0; x < 4; x++ )
> > d[x + y*4] = pix1[x] + pix2[x];
> > pix1 += 16;
> > pix2 += 16;
> > }
> > }
> >
> > For V8QI we need a VF of 2, so that there are 8 elements to store to d.
> > Conceptually, we handle those two iterations by loading 4 V8QIs from
> > pix1 and pix2 (32 bytes each), with mitigations against overrun,
> > and then permute the result to single V8QIs.
> >
> > vectorize_load doesn't seem to be smart enough to realise that only 2
> > of those 4 loads are actually used in the permuation, and so only 2
> > loads should be costed for each of pix1 and pix2.
>
> Though it has code to do that.

So looking into this a little further: yes, there is code that does it,
but it still adds the extra loads and then removes them, and the
costing part is done before the removal of the extra loads.

From (a non-modified trunk):
```
/app/example.cpp:2:21: note:   add new stmt: vect__34.7_15 = MEM
 [(unsigned char *)vectp_pix1.5_17];
/app/example.cpp:2:21: note:   add new stmt: vectp_pix1.5_14 =
vectp_pix1.5_17 + 8;
/app/example.cpp:2:21: note:   add new stmt: vect__34.8_13 = MEM
 [(unsigned char *)vectp_pix1.5_14];
/app/example.cpp:2:21: note:   add new stmt: vectp_pix1.5_12 =
vectp_pix1.5_14 + 8;
Applying pattern match.pd:2922, gimple-match-2.cc:7467
gimple_simplified to vectp_pix1.5_12 = vectp_pix1.5_17 + 16;
/app/example.cpp:2:21: note:   add new stmt: _11 = MEM 
[(unsigned char *)vectp_pix1.5_12];
/app/example.cpp:2:21: note:  

Re: Help with vector cost model

2024-07-11 Thread Richard Biener via Gcc
On Fri, Jul 12, 2024 at 4:42 AM Andrew Pinski  wrote:
>
> On Thu, Jul 11, 2024 at 2:14 AM Richard Biener
>  wrote:
> >
> > On Thu, Jul 11, 2024 at 10:58 AM Richard Sandiford
> >  wrote:
> > >
> > > Andrew Pinski  writes:
> > > > I need some help with the vector cost model for aarch64.
> > > > I am adding V2HI and V4QI mode support by emulating it using the
> > > > native V4HI/V8QI instructions (similarly to mmx as SSE is done). The
> > > > problem is I am running into a cost model issue with
> > > > gcc.target/aarch64/pr98772.c (wminus is similar to
> > > > gcc.dg/vect/slp-gap-1.c, just slightly different offsets for the
> > > > address).
> > > > It seems like the cost mode is overestimating the number of loads for
> > > > V8QI case .
> > > > With the new cost model usage (-march=armv9-a+nosve), I get:
> > > > ```
> > > > t.c:7:21: note:  * Analysis succeeded with vector mode V4QI
> > > > t.c:7:21: note:  Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
> > > > t.c:7:21: note:  Issue info for V4QI loop:
> > > > t.c:7:21: note:load operations = 2
> > > > t.c:7:21: note:store operations = 1
> > > > t.c:7:21: note:general operations = 4
> > > > t.c:7:21: note:reduction latency = 0
> > > > t.c:7:21: note:estimated min cycles per iteration = 2.00
> > > > t.c:7:21: note:  Issue info for V8QI loop:
> > > > t.c:7:21: note:load operations = 12
> > > > t.c:7:21: note:store operations = 1
> > > > t.c:7:21: note:general operations = 6
> > > > t.c:7:21: note:reduction latency = 0
> > > > t.c:7:21: note:estimated min cycles per iteration = 4.33
> > > > t.c:7:21: note:  Weighted cycles per iteration of V4QI loop ~= 4.00
> > > > t.c:7:21: note:  Weighted cycles per iteration of V8QI loop ~= 4.33
> > > > t.c:7:21: note:  Preferring loop with lower cycles per iteration
> > > > t.c:7:21: note:  * Preferring vector mode V4QI to vector mode V8QI
> > > > ```
> > > >
> > > > That is totally wrong and instead of vectorizing using V8QI we
> > > > vectorize using V4QI and the resulting code is worse.
> > > >
> > > > Attached is my current patch for adding V4QI/V2HI to the aarch64
> > > > backend (Note I have not finished up the changelog nor the testcases;
> > > > I have secondary patches that add the testcases already).
> > > > Is there something I am missing here or are we just over estimating
> > > > V8QI cost and is something easy to fix?
> > >
> > > Trying it locally, I get:
> > >
> > > foo.c:15:23: note:  * Analysis succeeded with vector mode V4QI
> > > foo.c:15:23: note:  Comparing two main loops (V4QI at VF 1 vs V8QI at VF 
> > > 2)
> > > foo.c:15:23: note:  Issue info for V4QI loop:
> > > foo.c:15:23: note:load operations = 2
> > > foo.c:15:23: note:store operations = 1
> > > foo.c:15:23: note:general operations = 4
> > > foo.c:15:23: note:reduction latency = 0
> > > foo.c:15:23: note:estimated min cycles per iteration = 2.00
> > > foo.c:15:23: note:  Issue info for V8QI loop:
> > > foo.c:15:23: note:load operations = 8
> > > foo.c:15:23: note:store operations = 1
> > > foo.c:15:23: note:general operations = 6
> > > foo.c:15:23: note:reduction latency = 0
> > > foo.c:15:23: note:estimated min cycles per iteration = 3.00
> > > foo.c:15:23: note:  Weighted cycles per iteration of V4QI loop ~= 4.00
> > > foo.c:15:23: note:  Weighted cycles per iteration of V8QI loop ~= 3.00
> > > foo.c:15:23: note:  Preferring loop with lower cycles per iteration
> > >
> > > The function is:
> > >
> > > extern void
> > > wplus (uint16_t *d, uint8_t *restrict pix1, uint8_t *restrict pix2 )
> > > {
> > > for (int y = 0; y < 4; y++ )
> > > {
> > > for (int x = 0; x < 4; x++ )
> > > d[x + y*4] = pix1[x] + pix2[x];
> > > pix1 += 16;
> > > pix2 += 16;
> > > }
> > > }
> > >
> > > For V8QI we need a VF of 2, so that there are 8 elements to store to d.
> > > Conceptually, we handle those two iterations by loading 4 V8QIs from
> > > pix1 and pix2 (32 bytes each), with mitigations against overrun,
> > > and then permute the result to single V8QIs.
> > >
> > > vectorize_load doesn't seem to be smart enough to realise that only 2
> > > of those 4 loads are actually used in the permuation, and so only 2
> > > loads should be costed for each of pix1 and pix2.
> >
> > Though it has code to do that.
>
> So looking into this a little further. Yes there is code that does it
> but it still adds the extra loads and then removes them. And the
> costing part is done before the removal of the extra loads.
>
> From (a non-modified trunk):
> ```
> /app/example.cpp:2:21: note:   add new stmt: vect__34.7_15 = MEM
>  [(unsigned char *)vectp_pix1.5_17];
> /app/example.cpp:2:21: note:   add new stmt: vectp_pix1.5_14 =
> vectp_pix1.5_17 + 8;
> /app/example.cpp:2:21: note:   add new stmt: vect__34.8_13 = MEM
>  [(unsigned char *)vectp_pix1.5_14];
> /app/example.cpp:2:21: note:   add new stmt: vectp_p