Re: [PATCH v2] rs6000: Add load density heuristic
Hi Segher,

Thanks for the comments!

on 2021/9/7 7:43 AM, Segher Boessenkool wrote:
> Hi!
>
> On Wed, Jul 28, 2021 at 10:59:50AM +0800, Kewen.Lin wrote:
+/* As a visitor function for each statement cost entry handled in
+   function add_stmt_cost, gather some information and update its
+   relevant fields in target cost accordingly.  */
>>>
>>> I got lost trying to parse that.. (could be just me :-)
>>>
>>> Possibly instead something like
>>> /* Helper function for add_stmt_cost; gather information and update
>>>    the target_cost fields accordingly.  */
>>
>> OK, will update.  I was thinking that for each entry handled in function
>> add_stmt_cost, this helper acts like a visitor, trying to visit each
>> entry and take some actions if some conditions are satisfied.
>
> It (thankfully!) has nothing to do with the "visitor pattern", so some
> other name might be better :-)
>
>>> Maybe clearer to read if you rearrange slightly and flatten it?  I
>>> defer to others on that..
>>>
>>>   if ((kind == vec_to_scalar
>>>        || kind == vec_perm
>>>        || kind == vec_promote_demote
>>>        || kind == vec_construct
>>>        || kind == scalar_to_vec)
>>>       || (kind == vector_stmt && where == vect_body)
>>
>> This hunk is factored out from function rs6000_add_stmt_cost; maybe I
>> can keep the original formatting?  The formatting tool isn't so smart,
>> and sometimes rearranges things to become unexpected (although it meets
>> the basic rule, not so elegant), sigh.
>
> It has too many parens, making grouping where there is none, that is the
> core issue.
>
>   if (kind == vec_to_scalar
>       || kind == vec_perm
>       || kind == vec_promote_demote
>       || kind == vec_construct
>       || kind == scalar_to_vec
>       || (kind == vector_stmt && where == vect_body))

Good catch, I've updated it in V4.

BR,
Kewen
PING^3 [PATCH v2] combine: Tweak the condition of last_set invalidation
Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572555.html

BR,
Kewen

on 2021/7/15 10:00 AM, Kewen.Lin via Gcc-patches wrote:
> Hi,
>
> Gentle ping this:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572555.html
>
> BR,
> Kewen
>
> on 2021/6/28 3:00 PM, Kewen.Lin via Gcc-patches wrote:
>> Hi!
>>
>> I'd like to gently ping this:
>>
>> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572555.html
>>
>> BR,
>> Kewen
>>
>> on 2021/6/11 9:16 PM, Kewen.Lin via Gcc-patches wrote:
>>> Hi Segher,
>>>
>>> Thanks for the review!
>>>
>>> on 2021/6/10 4:17 AM, Segher Boessenkool wrote:

Hi!

On Wed, Dec 16, 2020 at 04:49:49PM +0800, Kewen.Lin wrote:

> Currently we have the check:
>
>   if (!insn
>       || (value && rsp->last_set_table_tick >= label_tick_ebb_start))
>     rsp->last_set_invalid = 1;
>
> which means if we want to record some value for some reg and
> this reg got referred to before in a valid scope,

If we already know it is *set* in this same extended basic block.
Possibly by the same instruction btw.

> we invalidate the set of reg (last_set_invalid to 1).  It avoids
> finding the wrong set for one reg reference, such as the case like:
>
>   ... op regX  // this regX could find wrong last_set below
>   regX = ...   // if we think this set is valid
>   ... op regX

Yup, exactly.

> But because of retry's existence, the last_set_table_tick could
> be set by some later reference insns, but we see it's set due
> to retry on the set (for that reg) insn again, such as:
>
>   insn 1
>   insn 2
>
>   regX = ...   --> (a)
>   ... op regX  --> (b)
>
>   insn 3
>
>   // assume all in the same BB.
>
> Assuming we combine 1, 2 -> 3 successfully and replace them as two
> (3 insns -> 2 insns),

This will delete insn 1 and write the combined result to insns 2 and 3.

> retrying from insn1 or insn2 again:

Always 2, but your point remains valid.
> it will scan insn (a) again, the below condition holds for regX:
>
>   (value && rsp->last_set_table_tick >= label_tick_ebb_start)
>
> it will mark this set as an invalid set.  But actually the
> last_set_table_tick here is set by insn (b) before retrying, so it
> should be safe to be taken as a valid set.

Yup.

> This proposal is to check whether the recorded last_set_table_tick
> safely happens after the current set, and keep the set valid if so.
>
> Full SPEC2017 building shows this patch gets more successful combines,
> from 1902208 to 1902243 (trivial though).

Do you have some example, or maybe even a testcase?  :-)

>>> Sorry for the late reply, it took some time to get one reduced case.
>>>
>>> typedef struct SA *pa_t;
>>>
>>> struct SC {
>>>   int h;
>>>   pa_t elem[];
>>> };
>>>
>>> struct SD {
>>>   struct SC *e;
>>> };
>>>
>>> struct SA {
>>>   struct {
>>>     struct SD f[1];
>>>   } g;
>>> };
>>>
>>> void foo(pa_t *k, char **m) {
>>>   int l, i;
>>>   pa_t a;
>>>   l = (int)a->g.f[5].e;
>>>   i = 0;
>>>   for (; i < l; i++) {
>>>     k[i] = a->g.f[5].e->elem[i];
>>>     m[i] = "";
>>>   }
>>> }
>>>
>>> Baseline is r12-0 and the option is "-O3 -mcpu=power9
>>> -fno-strict-aliasing"; with this patch, the generated assembly can
>>> save two rlwinm s.

> + /* Record the luid of the insn whose expression involving register n.  */
> +
> + int last_set_table_luid;

"Record the luid of the insn for which last_set_table_tick was set",
right?

>>> But it can be updated later to one smaller luid; how about wording like:
>>>
>>> + /* Record the luid of the insn which uses register n, the insn should
>>> +    be the first one using register n in that block of the insn which
>>> +    last_set_table_tick was set for.  */

> -static void update_table_tick (rtx);
> +static void update_table_tick (rtx, int);

Please remove this declaration instead, the function is not used until
after its actual definition :-)

>>> Done.
>>>
> @@ -13243,7 +13247,21 @@ update_table_tick (rtx x)
>    for (r = regno; r < endregno; r++)
>      {
>        reg_stat_type *rsp = &reg_stat[r];
> -      rsp->last_set_table_tick = label_tick;
> +      if (rsp->last_set_table_tick >= label_tick_ebb_start)
> +        {
> +          /* Later references should not have lower ticks.  */
> +          gcc_assert (label_tick >= rsp->last_set_table_tick);

This should be obvious, but checking it won't hurt, okay.

> +          /* Should pick up the lowest luid if the references
> +             are in the same block.  */
> +          if (label_tick == rsp->last_set_table_tick
> +              && rsp->last_set_table_luid > insn_luid)
> +            rsp->last_
Re: [Committed] Fix fatal typo in gcc.dg/no_profile_instrument_function-attr-2.c
On 9/7/21 22:41, Hans-Peter Nilsson wrote:
> With r12-3379, the testsuite got such a fatal syntax error, causing
> the gcc test-run to abort at (e.g.):

Thank you for the fix!  I hadn't noticed the problem as I only grep for
'^FAIL' lines in the corresponding log files.

Cheers,
Martin
[PATCH] i386: Fix up @xorsign<mode>3_1 [PR102224]
Hi!

As the testcase shows, we miscompile @xorsign<mode>3_1 if both input
operands are in the same register, because the splitter overwrites op1
with op1 & mask before using op0.

For dest = xorsign op0, op0 we can actually simplify it from
dest = (op0 & mask) ^ op0 to dest = op0 & ~mask (aka abs).

The expander change is an optimization improvement: if we at expansion
time know it is xorsign op0, op0, we can emit abs right away and get
better code through that.

The @xorsign<mode>3_1 change is a fix for the case where xorsign wouldn't
be known to have the same operands during expansion, but during RTL
optimizations they would appear.  For non-AVX we need to use an
earlyclobber: we require dest and op1 to be the same, but op0 must be
different because we overwrite op1 first.  For AVX the constraints ensure
that at most 2 of the 3 operands may be the same register, and the new
alternative handles the case where both inputs are the same.  This case
can be easily tested with the xorsign<mode>3 expander change reverted.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

Thinking about it more this morning, while this patch fixes the problems
revealed in the testcase, the recent PR89984 change was buggy too, but
perhaps that can be fixed incrementally.  Because for AVX the new code
destructively modifies op1.
If that is different from dest, say on:

float
foo (float x, float y)
{
  return x * __builtin_copysignf (1.0f, y) + y;
}

then we get after RA:

(insn 8 7 9 2 (set (reg:SF 20 xmm0 [orig:82 _2 ] [82])
        (unspec:SF [
                (reg:SF 20 xmm0 [88])
                (reg:SF 21 xmm1 [89])
                (mem/u/c:V4SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0 S16 A128])
            ] UNSPEC_XORSIGN)) "hohoho.c":4:12 649 {xorsignsf3_1}
     (nil))
(insn 9 8 15 2 (set (reg:SF 20 xmm0 [87])
        (plus:SF (reg:SF 20 xmm0 [orig:82 _2 ] [82])
            (reg:SF 21 xmm1 [89]))) "hohoho.c":4:44 1021 {*fop_sf_comm}
     (nil))

but split the xorsign into:

        vandps  .LC0(%rip), %xmm1, %xmm1
        vxorps  %xmm0, %xmm1, %xmm0

and then the addition:

        vaddss  %xmm1, %xmm0, %xmm0

which means we miscompile it - instead of adding y in the end we add
__builtin_copysignf (0.0f, y).

So, wonder if we don't want instead in addition to the &Yv <- Yv, 0
alternative (enabled for both pre-AVX and AVX as in this patch) the
&Yv <- Yv, Yv where destination must be different from inputs and another
Yv <- Yv, Yv where it can be the same but then need a match_scratch
(with X for the other alternatives and =Yv for the last one).
That way we'd always have a safe register we can store the op1 & mask
value into: either the destination (in the first alternative known to be
equal to op1, which is needed for non-AVX but ok for AVX too), in the
second alternative known to be different from both inputs, and in the
third, which could be used for those

float bar (float x, float y) { return x * __builtin_copysignf (1.0f, y); }

cases where op1 is naturally xmm1 and dest == op0 naturally xmm0, we'd
use some other register like xmm2.

2021-09-07  Jakub Jelinek

	PR target/102224
	* config/i386/i386.md (xorsign<mode>3): If operands[1] is equal to
	operands[2], emit abs<mode>2 instead.
	(@xorsign<mode>3_1): Add early-clobbers for output operand, enable
	first alternative even for avx, add another alternative with
	=&Yv <- 0, Yv, Yvm constraints.
	* config/i386/i386-expand.c (ix86_split_xorsign): If op0 is equal
	to op1, emit vpandn instead.
	* gcc.dg/pr102224.c: New test.
	* gcc.target/i386/avx-pr102224.c: New test.

--- gcc/config/i386/i386.md.jj	2021-09-06 14:47:43.199049975 +0200
+++ gcc/config/i386/i386.md	2021-09-07 13:13:50.825603852 +0200
@@ -10803,21 +10803,27 @@ (define_expand "xorsign<mode>3"
    (match_operand:MODEF 1 "register_operand")
    (match_operand:MODEF 2 "register_operand")]
   "SSE_FLOAT_MODE_P (<MODE>mode) && TARGET_SSE_MATH"
-  "ix86_expand_xorsign (operands); DONE;")
+{
+  if (rtx_equal_p (operands[1], operands[2]))
+    emit_insn (gen_abs<mode>2 (operands[0], operands[1]));
+  else
+    ix86_expand_xorsign (operands);
+  DONE;
+})

 (define_insn_and_split "@xorsign<mode>3_1"
-  [(set (match_operand:MODEF 0 "register_operand" "=Yv,Yv")
+  [(set (match_operand:MODEF 0 "register_operand" "=&Yv,&Yv,&Yv")
	(unspec:MODEF
-	  [(match_operand:MODEF 1 "register_operand" "Yv,Yv")
-	   (match_operand:MODEF 2 "register_operand" "0,Yv")
-	   (match_operand:<ssevecmode> 3 "nonimmediate_operand" "Yvm,Yvm")]
+	  [(match_operand:MODEF 1 "register_operand" "Yv,0,Yv")
+	   (match_operand:MODEF 2 "register_operand" "0,Yv,Yv")
+	   (match_operand:<ssevecmode> 3 "nonimmediate_operand" "Yvm,Yvm,Yvm")]
	  UNSPEC_XORSIGN))]
   "SSE_FLOAT_MODE_P (<MODE>mode) && TARGET_SSE_MATH"
   "#"
   "&& reload_completed"
   [(const_int 0)]
   "ix86_split_xorsign (operands); DONE;"
-  [(set_attr "isa" "noavx,avx")])
+  [(set_attr "isa" "*,avx,avx")])

 ;; One complement instructions

--- gcc/config/i386/i386-expan
[PATCH v2] ipa-inline: Add target info into fn summary [PR102059]
Hi!

Power ISA 2.07 (Power8) introduces the transactional memory feature but
ISA 3.1 (Power10) removes it.  It exposes one troublesome issue, as
PR102059 shows: users define some function with target pragma
cpu=power10, which then calls one function with attribute always_inline
that inherits the command line option cpu=power8, which enables HTM
implicitly.  The current isa_flags check doesn't allow this inlining due
to "target specific option mismatch" and an error message is emitted.
Normally, the callee function isn't intended to exploit the HTM feature,
but the default flag setting makes it look like it does.

As Richi raised in the PR, we have the fp_expressions flag in the
function summary, which allows us to check whether the function actually
contains any floating point expressions, to avoid overkill.  So this
patch follows the similar idea but is more target specific: for this
rs6000 port specific requirement on the HTM feature check, we would like
to check rs6000 specific HTM built-in functions and inline assembly.  It
allows targets to do their own customized checks and updates.

It introduces two target hooks, need_ipa_fn_target_info and
update_ipa_fn_target_info.  The former allows a target to do some initial
checks and decide whether to collect target specific information for this
function or not.  For some special cases, it can predict the analysis
result and push it early without any scanning.  The latter allows
analyze_function_body to pass gimple stmts down, just like the
fp_expressions handling; the target can do its own tricks.  I initially
put them as one hook with one boolean to indicate whether it's the
initial time, but the code looked a bit ugly; separating them seems to
have better readability.

To make it simple, this patch uses HOST_WIDE_INT to record the flags,
just like what we use for isa_flags.  For rs6000's HTM need, one
HOST_WIDE_INT variable is quite enough, but it seems good to have one
auto_vec for scalability, as I noticed some targets have more than one
HOST_WIDE_INT flag.
Against v1 [1], this v2 addressed Richi's and Segher's review comments,
and mainly consists of:

  - Extend it to cover non always_inline.
  - Exclude the case for offload streaming.
  - Some function naming and formatting issues.
  - Adjust rs6000_can_inline_p.
  - Add new test cases.

The patch has been bootstrapped and regress-tested on
powerpc64le-linux-gnu Power9.

Any comments are highly appreciated!

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578555.html

BR,
Kewen
-----

gcc/ChangeLog:

	PR ipa/102059
	* config/rs6000/rs6000-call.c (rs6000_fn_has_any_of_these_mask_bits):
	New function.
	* config/rs6000/rs6000-internal.h (rs6000_fn_has_any_of_these_mask_bits):
	New declare.
	* config/rs6000/rs6000.c (TARGET_NEED_IPA_FN_TARGET_INFO): New
	macro.
	(TARGET_UPDATE_IPA_FN_TARGET_INFO): Likewise.
	(rs6000_need_ipa_fn_target_info): New function.
	(rs6000_update_ipa_fn_target_info): Likewise.
	(rs6000_can_inline_p): Adjust for ipa function summary target
	info.
	* ipa-fnsummary.c (ipa_dump_fn_summary): Adjust for ipa function
	summary target info.
	(analyze_function_body): Adjust for ipa function summary target
	info and call hook rs6000_need_ipa_fn_target_info and
	rs6000_update_ipa_fn_target_info.
	(ipa_merge_fn_summary_after_inlining): Adjust for ipa function
	summary target info.
	(inline_read_section): Likewise.
	(ipa_fn_summary_write): Likewise.
	* ipa-fnsummary.h (ipa_fn_summary::target_info): New member.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in (TARGET_UPDATE_IPA_FN_TARGET_INFO): Document new
	hook.
	(TARGET_NEED_IPA_FN_TARGET_INFO): Likewise.
	* target.def (update_ipa_fn_target_info): New hook.
	(need_ipa_fn_target_info): Likewise.
	* targhooks.c (default_need_ipa_fn_target_info): New function.
	(default_update_ipa_fn_target_info): Likewise.
	* targhooks.h (default_update_ipa_fn_target_info): New declare.
	(default_need_ipa_fn_target_info): Likewise.

gcc/testsuite/ChangeLog:

	PR ipa/102059
	* gcc.dg/lto/pr102059-1_0.c: New test.
	* gcc.dg/lto/pr102059-1_1.c: New test.
	* gcc.dg/lto/pr102059-1_2.c: New test.
	* gcc.dg/lto/pr102059-2_0.c: New test.
	* gcc.dg/lto/pr102059-2_1.c: New test.
	* gcc.dg/lto/pr102059-2_2.c: New test.
	* gcc.target/powerpc/pr102059-5.c: New test.
	* gcc.target/powerpc/pr102059-6.c: New test.
	* gcc.target/powerpc/pr102059-7.c: New test.

diff --git a/gcc/config/rs6000/rs6000-call.c b/gcc/config/rs6000/rs6000-call.c
index fd7f24da818..6af0d15ed87 100644
--- a/gcc/config/rs6000/rs6000-call.c
+++ b/gcc/config/rs6000/rs6000-call.c
@@ -13795,6 +13795,18 @@ rs6000_builtin_decl (unsigned code, bool initialize_p ATTRIBUTE_UNUSED)
   return rs6000_builtin_decls[code];
 }

+/* Return true if the builtin with CODE has any mask bits set
+   which are specified by
Re: [PATCH 04/13] arm: Add GENERAL_AND_VPR_REGS regclass
On 07/09/2021 15:35, Richard Earnshaw wrote:

On 07/09/2021 13:05, Christophe LYON wrote:

On 07/09/2021 11:42, Richard Earnshaw wrote:

On 07/09/2021 10:15, Christophe Lyon via Gcc-patches wrote:

At some point during the development of this patch series, it appeared
that in some cases the register allocator wants "VPR or general" rather
than "VPR or general or FP" (which is the same thing as ALL_REGS).
The series does not seem to require this anymore, but it seems to be a
good thing to do anyway, to give the register allocator more freedom.

2021-09-01  Christophe Lyon

gcc/
	* config/arm/arm.h (reg_class): Add GENERAL_AND_VPR_REGS.
	(REG_CLASS_NAMES): Likewise.
	(REG_CLASS_CONTENTS): Likewise.  Add VPR_REG to ALL_REGS.

diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index 015299c1534..fab39d05916 100644
--- a/gcc/config/arm/arm.h
+++ b/gcc/config/arm/arm.h
@@ -1286,6 +1286,7 @@ enum reg_class
   SFP_REG,
   AFP_REG,
   VPR_REG,
+  GENERAL_AND_VPR_REGS,
   ALL_REGS,
   LIM_REG_CLASSES
 };
@@ -1315,6 +1316,7 @@ enum reg_class
   "SFP_REG", \
   "AFP_REG", \
   "VPR_REG", \
+  "GENERAL_AND_VPR_REGS", \
   "ALL_REGS" \
 }
@@ -1343,7 +1345,8 @@ enum reg_class
   { 0x, 0x, 0x, 0x0040 }, /* SFP_REG */ \
   { 0x, 0x, 0x, 0x0080 }, /* AFP_REG */ \
   { 0x, 0x, 0x, 0x0400 }, /* VPR_REG. */ \
-  { 0x7FFF, 0x, 0x, 0x000F } /* ALL_REGS. */ \
+  { 0x5FFF, 0x, 0x, 0x0400 }, /* GENERAL_AND_VPR_REGS. */ \
+  { 0x7FFF, 0x, 0x, 0x040F } /* ALL_REGS. */ \
 }

You've changed the definition of ALL_REGS here (to include VPR_REG), but
not really explained why.  Is that the source of the underlying issue
with the 'appeared' you mention?

I first added VPR_REG to ALL_REGS, but Richard Sandiford suggested I
create a new GENERAL_AND_VPR_REGS that would be more restrictive.

I did not remove VPR_REG from ALL_REGS because I thought it was an
omission: shouldn't ALL_REGS contain all registers?

Surely that should be a separate patch then.

OK, I can remove that line from this patch and make a separate one-liner
for ALL_REGS.
Thanks,

Christophe

R.

R.

#define FP_SYSREGS \
Re: [PATCH v4] rs6000: Add load density heuristic
on 2021/9/8 2:57 PM, Kewen.Lin via Gcc-patches wrote:
> Hi Bill,
>
> Thanks for the review comments!
>
> on 2021/9/3 11:57 PM, Bill Schmidt wrote:
>> Hi Kewen,
>>
>> Sorry that we lost track of this patch!  The heuristic approach looks
>> good.  It is limited in scope and won't kick in often, and the case
>> you're trying to account for is important.
>>
>> At the time you submitted this, I think reliable P10 testing wasn't
>> possible.  Now that it is, could you please do a quick sniff test to
>> make sure there aren't any adjustments that need to be made for P10?
>> I doubt it, but worth checking.
>>
>
> Good point, thanks for the reminder!  I did one SPEC2017 full run on
> Power10 with Ofast unroll; this patch is neutral.  I also did one
> SPEC2017 run at O2 vectorization (cheap cost) to verify whether the
> bwaves_r degradation existed and whether it can be fixed by this
> patch.  The result shows the degradation did exist and got fixed by
> this patch; besides, it got an extra 3.93% speedup against O2, and
> another benchmark, 554.roms_r, got a 3.24% speedup.
>

hmm, sorry, this improvement on 554.roms_r looks not reliable: I just
ran it with 10 iterations both with and without the patch, both suffer
from jitter, and their best scores are close.  But note that the
bwaves_r scores are quite stable, so that result is reliable.

BR,
Kewen

> In short, the Power10 evaluation result shows this patch is positive.
>
>> Otherwise I have one comment below...
>>
>> On 7/28/21 12:22 AM, Kewen.Lin wrote:
>>> Hi,
>>>
>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
>>>
>>> This v3 addressed William's review comments in
>>> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576154.html
>>>
>>> It's mainly to deal with the bwaves_r degradation due to vector
>>> construction fed by strided loads.
>>>
>>> As Richi's comments [1] suggest, this follows the similar idea to
>>> over-price the vector construction fed by VMAT_ELEMENTWISE or
>>> VMAT_STRIDED_SLP.
>>> Instead of adding the extra cost on vector
>>> construction costing immediately, it firstly records how many loads
>>> and vectorized statements are in the given loop; later in
>>> rs6000_density_test (called by finish_cost) it computes the load
>>> density ratio against all vectorized stmts, checks it against the
>>> corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD and
>>> DENSITY_LOAD_PCT_THRESHOLD, and does the actual extra pricing if
>>> both thresholds are exceeded.
>>>
>>> Note that this new load density heuristic check is based on some
>>> fields in target cost which are updated as needed when scanning
>>> each add_stmt_cost entry; it's independent of the current function
>>> rs6000_density_test, which requires scanning non_vect stmts.  Since
>>> it's checking the load stmts count vs. all vectorized stmts, it's a
>>> kind of density, so I put it in function rs6000_density_test.  With
>>> the same reason to keep it independent, I didn't put it as an else
>>> arm of the current existing density threshold check hunk or before
>>> this hunk.
>>>
>>> In the investigation of the -1.04% degradation from 526.blender_r
>>> on Power8, I noticed that the extra penalized cost 320 on one
>>> single vector construction with type V16QI is much exaggerated,
>>> which makes the final body cost unreliable, so this patch adds one
>>> maximum bound for the extra penalized cost for each vector
>>> construction statement.
>>>
>>> Bootstrapped & regtested *again* on powerpc64le-linux-gnu P9.
>>>
>>> Full SPEC2017 performance evaluation on Power8/Power9 with option
>>> combinations (with v2, as v3 is NFC against v2):
>>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>>   * {-O3, -Ofast} {,-funroll-loops}
>>>
>>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>>> remarkable was observed.
>>>

> ...

>>> + /* Gather some information when we are costing the vectorized
>>> +    instruction for the statements located in a loop body.
>>> */
>>> + if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>>> +   {
>>> +     data->nstmts += orig_count;
>>> +
>>> +     if (kind == scalar_load || kind == vector_load
>>> +         || kind == unaligned_load || kind == vector_gather_load)
>>> +       data->nloads += orig_count;
>>> +
>>> +     /* If we have strided or elementwise loads into a vector, it's
>>> +        possible to be bounded by latency and execution resources for
>>> +        many scalar loads.  Try to account for this by scaling the
>>> +        construction cost by the number of elements involved, when
>>> +        handling each matching statement we record the possible extra
>>> +        penalized cost into target cost, in the end of costing for
>>> +        the whole loop, we do the actual penalization once some load
>>> +        density heuristics are satisfied.  */
>>
>> The above comment is quite hard to read.  Can you please break up the
>> last sentence into at least two sentences?
>>
>
> How about the below:
>
> + /* If we have strided or elementwise loads into a vector,
[PATCH] testsuite: Remove .exe suffix in prune_gcc_output
When running the testsuite under Windows, we noticed failures in
testcases which attempt to match compiler error messages containing the
name of the executable.  For instance, gcc.dg/analyzer/signal-4a.c tries
to match 'cc1:', which obviously fails when the executable is called
cc1.exe.

This patch removes the .exe suffix from various toolchain executables to
avoid this problem.

2021-09-08  Christophe Lyon
	    Torbjörn SVENSSON

gcc/testsuite/
	* lib/prune.exp (prune_gcc_output): Remove .exe suffix from
	toolchain executables names.
---
 gcc/testsuite/lib/prune.exp | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gcc/testsuite/lib/prune.exp b/gcc/testsuite/lib/prune.exp
index 91f165bec38..fac212ecf60 100644
--- a/gcc/testsuite/lib/prune.exp
+++ b/gcc/testsuite/lib/prune.exp
@@ -37,6 +37,9 @@ proc prune_gcc_output { text } {
     # Handle any freeform regexps.
     set text [handle-dg-regexps $text]

+    # Remove Windows .exe suffix
+    regsub -all "(as|cc1|cc1plus|collect2|f951|ld|lto-wrapper)\.exe?:" $text {\1:} text
+
     regsub -all "(^|\n)(\[^\n\]*: \[iI\]|I)n ((static member |lambda )?function|member|method|(copy )?constructor|destructor|instantiation|substitution|program|subroutine|block-data)\[^\n\]*" $text "" text
     regsub -all "(^|\n)\[^\n\]*(: )?At (top level|global scope):\[^\n\]*" $text "" text
     regsub -all "(^|\n)\[^\n\]*: (recursively )?required \[^\n\]*" $text "" text
-- 
2.25.1
Re: [PATCH] libgcc, i386: Export *hf* and *hc* from libgcc_s.so.1
On Wed, Sep 08, 2021 at 10:37:17AM +0800, Hongtao Liu wrote:
> Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> Ok for trunk?
>
> libgcc/ChangeLog:
>
> 	* config/i386/t-softfp: Compile __{mul,div}hc3 into
> 	libgcc_s.so.1.

I think this is ok, but not really useful until the *.ver change is
acked, because what this patch alone does is add extra non-exported
entrypoints to the shared library (aka wasted .text).

> libgcc/config/i386/t-softfp | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/libgcc/config/i386/t-softfp b/libgcc/config/i386/t-softfp
> index 2363ea17194..7620cc0cec5 100644
> --- a/libgcc/config/i386/t-softfp
> +++ b/libgcc/config/i386/t-softfp
> @@ -2,9 +2,8 @@ LIB2ADD += $(srcdir)/config/i386/sfp-exceptions.c
>
> # Replace _divhc3 and _mulhc3.
> libgcc2-hf-functions = _divhc3 _mulhc3
> -LIB2FUNCS_EXCLUDE += $(libgcc2-hf-functions)
> libgcc2-hf-extras = $(addsuffix .c, $(libgcc2-hf-functions))
> -LIB2ADD_ST += $(addprefix $(srcdir)/config/i386/, $(libgcc2-hf-extras))
> +LIB2ADD += $(addprefix $(srcdir)/config/i386/, $(libgcc2-hf-extras))
>
> softfp_extensions := hfsf hfdf hftf hfxf sfdf sftf dftf xftf
> softfp_truncations := tfhf xfhf dfhf sfhf tfsf dfsf tfdf tfxf

	Jakub
[PATCH] c++/102228 - make lookup_anon_field O(1)
For the testcase in PR101555, lookup_anon_field takes the majority of
parsing time, followed by get_class_binding_direct/fields_linear_search,
which is PR83309.  The situation with anon aggregates is particularly
dire when we need to build accesses to their members and the anon
aggregates are nested.  There, for each such access, we recursively
build sub-accesses to the anon aggregate FIELD_DECLs bottom-up, DFS
searching for them.  That's inefficient since, as I believe, there's a
1:1 relationship between anon aggregate types and the FIELD_DECL used
to place them.

The patch below does away with the search in lookup_anon_field and
instead records the single FIELD_DECL in the anon aggregate type's
lang-specific data, re-using the RTTI typeinfo_var field.  That speeds
up the compile of the testcase with -fsyntax-only from about 4.5s to
slightly less than 1s.

I tried to poke holes into the 1:1 relationship idea with my C++
knowledge but failed (which might not say much).  It also leaves a hole
for the case when the C++ FE itself duplicates such a type and places it
at a semantically different position.  I've tried to poke holes into it
with the duplication mechanism I understand (templates) but failed.

Bootstrapped and tested on x86_64-unknown-linux-gnu.  OK for trunk?

Thanks,
Richard.

2021-09-08  Richard Biener

	PR c++/102228
	* cp-tree.h (ANON_AGGR_TYPE_FIELD): New define.
	* decl.c (fixup_anonymous_aggr): Wipe RTTI info put in place
	on invalid code.
	* decl2.c (reset_type_linkage): Guard CLASSTYPE_TYPEINFO_VAR
	access.
	* module.cc (trees_in::read_class_def): Likewise.  Reconstruct
	ANON_AGGR_TYPE_FIELD.
	* semantics.c (finish_member_declaration): Populate
	ANON_AGGR_TYPE_FIELD for anon aggregate typed members.
	* typeck.c (lookup_anon_field): Remove DFS search and return
	ANON_AGGR_TYPE_FIELD directly.
---
 gcc/cp/cp-tree.h   |  4
 gcc/cp/decl.c      |  3 +++
 gcc/cp/decl2.c     | 17 +
 gcc/cp/module.cc   | 10 --
 gcc/cp/semantics.c |  8
 gcc/cp/typeck.c    | 32 +---
 6 files changed, 37 insertions(+), 37 deletions(-)

diff --git a/gcc/cp/cp-tree.h b/gcc/cp/cp-tree.h
index 435f32bf909..ceb53591926 100644
--- a/gcc/cp/cp-tree.h
+++ b/gcc/cp/cp-tree.h
@@ -4800,6 +4800,10 @@ more_aggr_init_expr_args_p (const aggr_init_expr_arg_iterator *iter)
 #define ANON_UNION_TYPE_P(NODE) \
   (TREE_CODE (NODE) == UNION_TYPE && ANON_AGGR_TYPE_P (NODE))

+/* For an ANON_AGGR_TYPE_P the single FIELD_DECL it is used with.  */
+#define ANON_AGGR_TYPE_FIELD(NODE) \
+  (LANG_TYPE_CLASS_CHECK (NODE)->typeinfo_var)
+
 /* Define fields and accessors for nodes representing declared names.  */

 /* True if TYPE is an unnamed structured type with a typedef for

diff --git a/gcc/cp/decl.c b/gcc/cp/decl.c
index 146979ba96b..bce62ad202a 100644
--- a/gcc/cp/decl.c
+++ b/gcc/cp/decl.c
@@ -5096,6 +5096,9 @@ fixup_anonymous_aggr (tree t)
       (*vec)[store++] = elt;
   vec_safe_truncate (vec, store);

+  /* Wipe RTTI info.  */
+  CLASSTYPE_TYPEINFO_VAR (t) = NULL_TREE;
+
   /* Anonymous aggregates cannot have fields with ctors, dtors or complex
      assignment operators (because they cannot have these methods themselves).
     For anonymous unions this is already checked because they are not allowed

diff --git a/gcc/cp/decl2.c b/gcc/cp/decl2.c
index 107edcaaccf..a79a70ba9c2 100644
--- a/gcc/cp/decl2.c
+++ b/gcc/cp/decl2.c
@@ -2977,14 +2977,15 @@ reset_type_linkage (tree type)
	  SET_DECL_ASSEMBLER_NAME (vt, name);
	  reset_decl_linkage (vt);
	}
-      if (tree ti = CLASSTYPE_TYPEINFO_VAR (type))
-	{
-	  tree name = mangle_typeinfo_for_type (type);
-	  DECL_NAME (ti) = name;
-	  SET_DECL_ASSEMBLER_NAME (ti, name);
-	  TREE_TYPE (name) = type;
-	  reset_decl_linkage (ti);
-	}
+      if (!ANON_AGGR_TYPE_P (type))
+	if (tree ti = CLASSTYPE_TYPEINFO_VAR (type))
+	  {
+	    tree name = mangle_typeinfo_for_type (type);
+	    DECL_NAME (ti) = name;
+	    SET_DECL_ASSEMBLER_NAME (ti, name);
+	    TREE_TYPE (name) = type;
+	    reset_decl_linkage (ti);
+	  }
       for (tree m = TYPE_FIELDS (type); m; m = DECL_CHAIN (m))
	{
	  tree mem = STRIP_TEMPLATE (m);

diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc
index 4b2ad6f3db8..71d0fab411f 100644
--- a/gcc/cp/module.cc
+++ b/gcc/cp/module.cc
@@ -11859,8 +11859,9 @@ trees_in::read_class_def (tree defn, tree maybe_template)
	{
	  CLASSTYPE_BEFRIENDING_CLASSES (type_dup)
	    = CLASSTYPE_BEFRIENDING_CLASSES (type);
-	  CLASSTYPE_TYPEINFO_VAR (type_dup)
-	    = CLASSTYPE_TYPEINFO_VAR (type);
+	  if (!ANON_AGGR_TYPE_P (type))
+	    CLASSTYPE_TYPEINFO_VAR (type_dup)
+	      = CLASSTYPE_TYPEINFO_VAR (type);
Re: [PATCH] i386: Fix up @xorsign<mode>3_1 [PR102224]
On Wed, Sep 8, 2021 at 3:43 PM Jakub Jelinek via Gcc-patches wrote:
>
> Hi!
>
> As the testcase shows, we miscompile @xorsign<mode>3_1 if both input
> operands are in the same register, because the splitter overwrites op1
> with op1 & mask before using op0.
>
> For dest = xorsign op0, op0 we can actually simplify it from
> dest = (op0 & mask) ^ op0 to dest = op0 & ~mask (aka abs).
>
> The expander change is an optimization improvement, if we at expansion
> time know it is xorsign op0, op0, we can emit abs right away and get better
> code through that.
>
> The @xorsign<mode>3_1 is a fix for the case where xorsign wouldn't be known
> to have same operands during expansion, but during RTL optimizations they
> would appear.  For non-AVX we need to use earlyclobber, we require
> dest and op1 to be the same but op0 must be different because we overwrite
> op1 first.  For AVX the constraints ensure that at most 2 of the 3 operands
> may be the same register and if both inputs are the same, handles that case.
> This case can be easily tested with the xorsign<mode>3 expander change
> reverted.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>

Patch LGTM.

PS: I'm curious why we need the post_reload splitter @xorsign<mode>3_1
for scalar modes; can't we just expand them into and/xor operations in
the expander, just like the vector modes do?  Let me do some experiments
to see whether it is ok to remove the splitter.

> Thinking about it more this morning, while this patch fixes the problems
> revealed in the testcase, the recent PR89984 change was buggy too, but
> perhaps that can be fixed incrementally.  Because for AVX the new code
> destructively modifies op1.
> If that is different from dest, say on:
> float
> foo (float x, float y)
> {
>   return x * __builtin_copysignf (1.0f, y) + y;
> }
> then we get after RA:
> (insn 8 7 9 2 (set (reg:SF 20 xmm0 [orig:82 _2 ] [82])
>         (unspec:SF [
>                 (reg:SF 20 xmm0 [88])
>                 (reg:SF 21 xmm1 [89])
>                 (mem/u/c:V4SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0 S16 A128])
>             ] UNSPEC_XORSIGN)) "hohoho.c":4:12 649 {xorsignsf3_1}
>      (nil))
> (insn 9 8 15 2 (set (reg:SF 20 xmm0 [87])
>         (plus:SF (reg:SF 20 xmm0 [orig:82 _2 ] [82])
>             (reg:SF 21 xmm1 [89]))) "hohoho.c":4:44 1021 {*fop_sf_comm}
>      (nil))
> but split the xorsign into:
>         vandps  .LC0(%rip), %xmm1, %xmm1
>         vxorps  %xmm0, %xmm1, %xmm0
> and then the addition:
>         vaddss  %xmm1, %xmm0, %xmm0
> which means we miscompile it - instead of adding y in the end we add
> __builtin_copysignf (0.0f, y).
> So, wonder if we don't want instead in addition to the &Yv <- Yv, 0
> alternative (enabled for both pre-AVX and AVX as in this patch) the
> &Yv <- Yv, Yv where destination must be different from inputs and another
> Yv <- Yv, Yv where it can be the same but then need a match_scratch
> (with X for the other alternatives and =Yv for the last one).
> That way we'd always have a safe register we can store the op1 & mask
> value into, either the destination (in the first alternative known to
> be equal to op1 which is needed for non-AVX but ok for AVX too), in the
> second alternative known to be different from both inputs and in the third
> which could be used for those
> float bar (float x, float y) { return x * __builtin_copysignf (1.0f, y); }
> cases where op1 is naturally xmm1 and dest == op0 naturally xmm0 we'd use
> some other register like xmm2.
>
> 2021-09-07  Jakub Jelinek
>
> 	PR target/102224
> 	* config/i386/i386.md (xorsign<mode>3): If operands[1] is equal to
> 	operands[2], emit abs<mode>2 instead.
> 	(@xorsign<mode>3_1): Add early-clobbers for output operand, enable
> 	first alternative even for avx, add another alternative with
> 	=&Yv <- 0, Yv, Yvm constraints.
> * config/i386/i386-expand.c (ix86_split_xorsign): If op0 is equal > to op1, emit vpandn instead. > > * gcc.dg/pr102224.c: New test. > * gcc.target/i386/avx-pr102224.c: New test. > > --- gcc/config/i386/i386.md.jj 2021-09-06 14:47:43.199049975 +0200 > +++ gcc/config/i386/i386.md 2021-09-07 13:13:50.825603852 +0200 > @@ -10803,21 +10803,27 @@ (define_expand "xorsign3" > (match_operand:MODEF 1 "register_operand") > (match_operand:MODEF 2 "register_operand")] >"SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH" > - "ix86_expand_xorsign (operands); DONE;") > +{ > + if (rtx_equal_p (operands[1], operands[2])) > +emit_insn (gen_abs2 (operands[0], operands[1])); > + else > +ix86_expand_xorsign (operands); > + DONE; > +}) > > (define_insn_and_split "@xorsign3_1" > - [(set (match_operand:MODEF 0 "register_operand" "=Yv,Yv") > + [(set (match_operand:MODEF 0 "register_operand" "=&Yv,&Yv,&Yv") > (unspec:MODEF > - [(match_operand:MODEF 1 "register_operand" "Yv,Yv") > - (match_operand:MODEF 2 "register_operand" "0,Yv") > - (match_operand: 3
Re: [PATCH] libgcc, i386: Export *hf* and *hc* from libgcc_s.so.1
On Wed, Sep 8, 2021 at 5:09 PM Jakub Jelinek wrote: > > On Wed, Sep 08, 2021 at 10:37:17AM +0800, Hongtao Liu wrote: > > Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. > > Ok for trunk? > > > > libgcc/ChangeLog: > > > > * config/i386/t-softfp: Compile __{mul,div}hc3 into > > libgcc_s.so.1. > > I think this is ok, but not really useful until the *.ver change is acked, > because what this patch alone does is add extra non-exported entrypoints to > the shared library (aka wasted .text). I think your patch is ok (with divhc3/mulhc3), we can check in the respective patches. :) > > > libgcc/config/i386/t-softfp | 3 +-- > > 1 file changed, 1 insertion(+), 2 deletions(-) > > > > diff --git a/libgcc/config/i386/t-softfp b/libgcc/config/i386/t-softfp > > index 2363ea17194..7620cc0cec5 100644 > > --- a/libgcc/config/i386/t-softfp > > +++ b/libgcc/config/i386/t-softfp > > @@ -2,9 +2,8 @@ LIB2ADD += $(srcdir)/config/i386/sfp-exceptions.c > > > > # Replace _divhc3 and _mulhc3. > > libgcc2-hf-functions = _divhc3 _mulhc3 > > -LIB2FUNCS_EXCLUDE += $(libgcc2-hf-functions) > > libgcc2-hf-extras = $(addsuffix .c, $(libgcc2-hf-functions)) > > -LIB2ADD_ST += $(addprefix $(srcdir)/config/i386/, $(libgcc2-hf-extras)) > > +LIB2ADD += $(addprefix $(srcdir)/config/i386/, $(libgcc2-hf-extras)) > > > > softfp_extensions := hfsf hfdf hftf hfxf sfdf sftf dftf xftf > > softfp_truncations := tfhf xfhf dfhf sfhf tfsf dfsf tfdf tfxf > > Jakub > -- BR, Hongtao
Re: [PATCH] i386: Fix up @xorsign<mode>3_1 [PR102224]
On Wed, Sep 08, 2021 at 05:23:40PM +0800, Hongtao Liu wrote: > > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? > > > Patch LGTM. Thanks, committed. > PS: > I'm curious why we need the post_reload splitter @xorsign3_1 > for scalar mode, can't we just expand them into and/xor operations in > the expander, just like vector modes did. > Let me do some experiments to see whether it is ok to remove the splitter. I bet it is the question how should the code look like before RA. stv is somewhat related, but as that replaces whole chains, it can do: (insn 14 5 6 2 (set (subreg:V2DI (reg:DI 92) 0) (vec_concat:V2DI (mem/c:DI (symbol_ref:SI ("c") [flags 0x2] ) [1 c+0 S8 A64]) (const_int 0 [0]))) "hohohou.c":6:9 -1 (nil)) on loads of memory. But it stv still does use paradoxical subregs: (insn 10 16 11 2 (set (subreg:V2DI (reg:DI 91) 0) (minus:V2DI (subreg:V2DI (reg:DI 87) 0) (subreg:V2DI (reg:DI 94) 0))) "hohohou.c":6:13 5003 {*subv2di3} (expr_list:REG_DEAD (reg:DI 87) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil (insn 11 10 0 2 (set (mem/c:DI (symbol_ref:SI ("a") [flags 0x2] ) [1 a+0 S8 A64]) (reg:DI 91)) "hohohou.c":6:5 76 {*movdi_internal} (expr_list:REG_DEAD (reg:DI 91) (nil))) so perhaps just using paradoxical subregs everywhere would be ok? Jakub
Re: [PATCH v2] c-family: Add __builtin_assoc_barrier
On Tuesday, 7 September 2021 19:36:22 CEST Jason Merrill wrote: > > case PAREN_EXPR: > > - RETURN (finish_parenthesized_expr (RECUR (TREE_OPERAND (t, 0; > > + if (REF_PARENTHESIZED_P (t)) > > + RETURN (finish_parenthesized_expr (RECUR (TREE_OPERAND (t, 0; > > + else > > + RETURN (RECUR (TREE_OPERAND (t, 0))); > > I think you need to build a new PAREN_EXPR in the assoc barrier case as > well, for it to have any effect in templates. My intent was to ignore __builtin_assoc_barrier in templates / constexpr evaluation since it's not affected by -fassociative-math anyway. Or do you mean something else? > Please also add a comment mentioning __builtin_assoc_barrier. I added a comment to that effect to both the cp/pt.c and cp/constexpr.c changes. New patch attached. -- ── Dr. Matthias Kretz https://mattkretz.github.io GSI Helmholtz Centre for Heavy Ion Research https://gsi.de stdₓ::simd ── diff --git a/gcc/c-family/c-common.c b/gcc/c-family/c-common.c index 681fcc972f4..c62a6398a47 100644 --- a/gcc/c-family/c-common.c +++ b/gcc/c-family/c-common.c @@ -384,6 +384,7 @@ const struct c_common_resword c_common_reswords[] = { "__builtin_convertvector", RID_BUILTIN_CONVERTVECTOR, 0 }, { "__builtin_has_attribute", RID_BUILTIN_HAS_ATTRIBUTE, 0 }, { "__builtin_launder", RID_BUILTIN_LAUNDER, D_CXXONLY }, + { "__builtin_assoc_barrier", RID_BUILTIN_ASSOC_BARRIER, 0 }, { "__builtin_shuffle", RID_BUILTIN_SHUFFLE, 0 }, { "__builtin_shufflevector", RID_BUILTIN_SHUFFLEVECTOR, 0 }, { "__builtin_tgmath", RID_BUILTIN_TGMATH, D_CONLY }, diff --git a/gcc/c-family/c-common.h b/gcc/c-family/c-common.h index 50ca8fb6ebd..f34dc47c2ba 100644 --- a/gcc/c-family/c-common.h +++ b/gcc/c-family/c-common.h @@ -108,7 +108,7 @@ enum rid RID_EXTENSION, RID_IMAGPART, RID_REALPART, RID_LABEL, RID_CHOOSE_EXPR, RID_TYPES_COMPATIBLE_P, RID_BUILTIN_COMPLEX, RID_BUILTIN_SHUFFLE, RID_BUILTIN_SHUFFLEVECTOR, RID_BUILTIN_CONVERTVECTOR, RID_BUILTIN_TGMATH, - RID_BUILTIN_HAS_ATTRIBUTE, + RID_BUILTIN_HAS_ATTRIBUTE, 
RID_BUILTIN_ASSOC_BARRIER, RID_DFLOAT32, RID_DFLOAT64, RID_DFLOAT128, /* TS 18661-3 keywords, in the same sequence as the TI_* values. */ diff --git a/gcc/c/c-decl.c b/gcc/c/c-decl.c index 983d65e930c..dcf4a2d7c32 100644 --- a/gcc/c/c-decl.c +++ b/gcc/c/c-decl.c @@ -10557,6 +10557,7 @@ names_builtin_p (const char *name) case RID_BUILTIN_HAS_ATTRIBUTE: case RID_BUILTIN_SHUFFLE: case RID_BUILTIN_SHUFFLEVECTOR: +case RID_BUILTIN_ASSOC_BARRIER: case RID_CHOOSE_EXPR: case RID_OFFSETOF: case RID_TYPES_COMPATIBLE_P: diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c index 9a56e0c04c6..fffd81f4e5b 100644 --- a/gcc/c/c-parser.c +++ b/gcc/c/c-parser.c @@ -8931,6 +8931,7 @@ c_parser_predefined_identifier (c_parser *parser) assignment-expression , assignment-expression, ) __builtin_convertvector ( assignment-expression , type-name ) + __builtin_assoc_barrier ( assignment-expression ) offsetof-member-designator: identifier @@ -10076,6 +10077,25 @@ c_parser_postfix_expression (c_parser *parser) } } break; + case RID_BUILTIN_ASSOC_BARRIER: + { + location_t start_loc = loc; + c_parser_consume_token (parser); + matching_parens parens; + if (!parens.require_open (parser)) + { + expr.set_error (); + break; + } + e1 = c_parser_expr_no_commas (parser, NULL); + mark_exp_read (e1.value); + location_t end_loc = c_parser_peek_token (parser)->get_finish (); + parens.skip_until_found_close (parser); + expr.value = build1_loc (loc, PAREN_EXPR, TREE_TYPE (e1.value), + e1.value); + set_c_expr_source_range (&expr, start_loc, end_loc); + } + break; case RID_AT_SELECTOR: { gcc_assert (c_dialect_objc ()); diff --git a/gcc/cp/constexpr.c b/gcc/cp/constexpr.c index 31fa5b66865..6e964837d24 100644 --- a/gcc/cp/constexpr.c +++ b/gcc/cp/constexpr.c @@ -6730,6 +6730,14 @@ cxx_eval_constant_expression (const constexpr_ctx *ctx, tree t, non_constant_p, overflow_p); break; +case PAREN_EXPR: + gcc_assert (!REF_PARENTHESIZED_P (t)); + /* A PAREN_EXPR resulting from __builtin_assoc_barrier has no effect in + 
constant expressions since it's unaffected by -fassociative-math. */ + r = cxx_eval_constant_expression (ctx, TREE_OPERAND (t, 0), lval, + non_constant_p, overflow_p); + break; + case NOP_EXPR: if (REINTERPRET_CAST_P (t)) { diff --git a/gcc/cp/cp-objcp-common.c b/gcc/cp/cp-objcp-common.c index ee255732d5a..04522a23eda 100644 --- a/gcc/cp/cp-objcp-common.c +++ b/gcc/cp/cp-objcp-common.c @@ -395,6 +395,7 @@ names_builtin_p (const char *name) case RID_BUILTIN_SHUFFLE: case
Re: [PATCH] i386: Fix up @xorsign<mode>3_1 [PR102224]
On Wed, Sep 8, 2021 at 5:33 PM Jakub Jelinek wrote: > > On Wed, Sep 08, 2021 at 05:23:40PM +0800, Hongtao Liu wrote: > > > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? > > > > > Patch LGTM. > > Thanks, committed. > > > PS: > > I'm curious why we need the post_reload splitter @xorsign3_1 > > for scalar mode, can't we just expand them into and/xor operations in > > the expander, just like vector modes did. > > Let me do some experiments to see whether it is ok to remove the splitter. > > I bet it is the question how should the code look like before RA. > stv is somewhat related, but as that replaces whole chains, it can do: > (insn 14 5 6 2 (set (subreg:V2DI (reg:DI 92) 0) > (vec_concat:V2DI (mem/c:DI (symbol_ref:SI ("c") [flags 0x2] > ) [1 c+0 S8 A64]) > (const_int 0 [0]))) "hohohou.c":6:9 -1 > (nil)) > on loads of memory. > But it stv still does use paradoxical subregs: > (insn 10 16 11 2 (set (subreg:V2DI (reg:DI 91) 0) > (minus:V2DI (subreg:V2DI (reg:DI 87) 0) > (subreg:V2DI (reg:DI 94) 0))) "hohohou.c":6:13 5003 {*subv2di3} > (expr_list:REG_DEAD (reg:DI 87) > (expr_list:REG_UNUSED (reg:CC 17 flags) > (nil > (insn 11 10 0 2 (set (mem/c:DI (symbol_ref:SI ("a") [flags 0x2] 0x7f65a131fc60 a>) [1 a+0 S8 A64]) > (reg:DI 91)) "hohohou.c":6:5 76 {*movdi_internal} > (expr_list:REG_DEAD (reg:DI 91) > (nil))) > so perhaps just using paradoxical subregs everywhere would be ok? Yes, I think so. And I find paradoxical subreg like (subreg:V4SF (reg:SF)) are not allowed by validate_subreg until r11-621. That's why post_reload splitter is needed here. > Jakub > -- BR, Hongtao
[PATCH] Optimize vec_extract for 256/512-bit vector when index exceeds the lower 128 bits.
Hi: As described in the PR, valign{d,q} can be used to extract one element from a vector. For elements located in the lower 128 bits, only one instruction is needed, so this patch only optimizes elements located above 128 bits. The optimization is like: - vextracti32x8 $0x1, %zmm0, %ymm0 - vmovd %xmm0, %eax + valignd $8, %zmm0, %zmm0, %zmm1 + vmovd %xmm1, %eax - vextracti32x8 $0x1, %zmm0, %ymm0 - vextracti128 $0x1, %ymm0, %xmm0 - vpextrd $3, %xmm0, %eax + valignd $15, %zmm0, %zmm0, %zmm1 + vmovd %xmm1, %eax - vextractf64x2 $0x1, %ymm0, %xmm0 + valignq $2, %ymm0, %ymm0, %ymm0 - vextractf64x4 $0x1, %zmm0, %ymm0 - vextractf64x2 $0x1, %ymm0, %xmm0 - vunpckhpd %xmm0, %xmm0, %xmm0 + valignq $7, %zmm0, %zmm0, %zmm0 Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. gcc/ChangeLog: PR target/91103 * config/i386/sse.md (*vec_extract_valign): New define_insn. gcc/testsuite/ChangeLog: PR target/91103 * gcc.target/i386/pr91103-1.c: New test. * gcc.target/i386/pr91103-2.c: New test. --- gcc/config/i386/sse.md| 32 + gcc/testsuite/gcc.target/i386/pr91103-1.c | 37 +++ gcc/testsuite/gcc.target/i386/pr91103-2.c | 81 +++ 3 files changed, 150 insertions(+) create mode 100644 gcc/testsuite/gcc.target/i386/pr91103-1.c create mode 100644 gcc/testsuite/gcc.target/i386/pr91103-2.c diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index 5785e73241c..57c736ff44a 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -232,6 +232,12 @@ (define_mode_iterator V48_AVX512VL V16SF (V8SF "TARGET_AVX512VL") (V4SF "TARGET_AVX512VL") V8DF (V4DF "TARGET_AVX512VL") (V2DF "TARGET_AVX512VL")]) +(define_mode_iterator V48_256_512_AVX512VL + [V16SI (V8SI "TARGET_AVX512VL") + V8DI (V4DI "TARGET_AVX512VL") + V16SF (V8SF "TARGET_AVX512VL") + V8DF (V4DF "TARGET_AVX512VL")]) + ;; 1,2 byte AVX-512{BW,VL} vector modes. Supposed TARGET_AVX512BW baseline. 
(define_mode_iterator VI12_AVX512VL [V64QI (V16QI "TARGET_AVX512VL") (V32QI "TARGET_AVX512VL") @@ -786,6 +792,15 @@ (define_mode_attr sseinsnmode (V4SF "V4SF") (V2DF "V2DF") (TI "TI")]) +(define_mode_attr sseintvecinsnmode + [(V64QI "XI") (V32HI "XI") (V16SI "XI") (V8DI "XI") (V4TI "XI") + (V32QI "OI") (V16HI "OI") (V8SI "OI") (V4DI "OI") (V2TI "OI") + (V16QI "TI") (V8HI "TI") (V4SI "TI") (V2DI "TI") (V1TI "TI") + (V16SF "XI") (V8DF "XI") + (V8SF "OI") (V4DF "OI") + (V4SF "TI") (V2DF "TI") + (TI "TI")]) + ;; SSE constant -1 constraint (define_mode_attr sseconstm1 [(V64QI "BC") (V32HI "BC") (V16SI "BC") (V8DI "BC") (V4TI "BC") @@ -10326,6 +10341,23 @@ (define_insn "_align" [(set_attr "prefix" "evex") (set_attr "mode" "")]) +(define_mode_attr vec_extract_imm_predicate + [(V16SF "const_0_to_15_operand") (V8SF "const_0_to_7_operand") + (V16SI "const_0_to_15_operand") (V8SI "const_0_to_7_operand") + (V8DF "const_0_to_7_operand") (V4DF "const_0_to_3_operand") + (V8DI "const_0_to_7_operand") (V4DI "const_0_to_3_operand")]) + +(define_insn "*vec_extract_valign" + [(set (match_operand: 0 "register_operand" "=v") + (vec_select: + (match_operand:V48_256_512_AVX512VL 1 "register_operand" "v") + (parallel [(match_operand 2 "")])))] + "TARGET_AVX512F + && INTVAL(operands[2]) >= 16 / GET_MODE_SIZE (mode)" + "valign\t{%2, %1, %1, %0|%0, %1, %1, %2}"; + [(set_attr "prefix" "evex") + (set_attr "mode" "")]) + (define_expand "avx512f_shufps512_mask" [(match_operand:V16SF 0 "register_operand") (match_operand:V16SF 1 "register_operand") diff --git a/gcc/testsuite/gcc.target/i386/pr91103-1.c b/gcc/testsuite/gcc.target/i386/pr91103-1.c new file mode 100644 index 000..11caaa8bd1b --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr91103-1.c @@ -0,0 +1,37 @@ +/* { dg-do compile } */ +/* { dg-options "-mavx512vl -O2" } */ +/* { dg-final { scan-assembler-times "valign\[dq\]" 16 } } */ + +typedef float v8sf __attribute__((vector_size(32))); +typedef float v16sf 
__attribute__((vector_size(64))); +typedef int v8si __attribute__((vector_size(32))); +typedef int v16si __attribute__((vector_size(64))); +typedef double v4df __attribute__((vector_size(32))); +typedef double v8df __attribute__((vector_size(64))); +typedef long long v4di __attribute__((vector_size(32))); +typedef long long v8di __attribute__((vector_size(64))); + +#define EXTRACT(V,S,IDX) \ + S\ + __attribute__((noipa)) \ + foo_##V##_##IDX (V v)\ + {\ +return v[IDX]; \ + }\ + +EXTRACT (v8sf, float, 4); +EXTRACT (v8sf, float, 7); +EXTRACT (v8si, int, 4); +EXTRACT (v8si, int
Re: [PATCH] i386: Fix up @xorsign<mode>3_1 [PR102224]
On Wed, Sep 08, 2021 at 06:00:50PM +0800, Hongtao Liu wrote: > Yes, I think so. > And I find paradoxical subreg like (subreg:V4SF (reg:SF)) are not > allowed by validate_subreg until r11-621. > That's why post_reload splitter is needed here. Following seems to work for all the testcases I've find (and in some generates better code than the post-reload splitter): 2021-09-08 Jakub Jelinek liuhongt PR target/89984 * config/i386/i386.md (@xorsign3_1): Remove. * config/i386/i386-expand.c (ix86_expand_xorsign): Expand right away into AND with mask and XOR, using paradoxical subregs. (ix86_split_xorsign): Remove. * gcc.target/i386/avx-pr102224.c: Fix up PR number. * gcc.dg/pr89984.c: New test. * gcc.target/i386/avx-pr89984.c: New test. --- gcc/config/i386/i386.md.jj 2021-09-08 11:40:55.826534981 +0200 +++ gcc/config/i386/i386.md 2021-09-08 11:44:08.394828674 +0200 @@ -10918,20 +10918,6 @@ (define_expand "xorsign3" DONE; }) -(define_insn_and_split "@xorsign3_1" - [(set (match_operand:MODEF 0 "register_operand" "=&Yv,&Yv,&Yv") - (unspec:MODEF - [(match_operand:MODEF 1 "register_operand" "Yv,0,Yv") - (match_operand:MODEF 2 "register_operand" "0,Yv,Yv") - (match_operand: 3 "nonimmediate_operand" "Yvm,Yvm,Yvm")] - UNSPEC_XORSIGN))] - "SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH" - "#" - "&& reload_completed" - [(const_int 0)] - "ix86_split_xorsign (operands); DONE;" - [(set_attr "isa" "*,avx,avx")]) - ;; One complement instructions (define_expand "one_cmpl2" --- gcc/config/i386/i386-expand.c.jj2021-09-08 11:40:55.824535010 +0200 +++ gcc/config/i386/i386-expand.c 2021-09-08 11:51:15.969819626 +0200 @@ -2270,7 +2270,7 @@ void ix86_expand_xorsign (rtx operands[]) { machine_mode mode, vmode; - rtx dest, op0, op1, mask; + rtx dest, op0, op1, mask, x, temp; dest = operands[0]; op0 = operands[1]; @@ -2285,60 +2285,15 @@ ix86_expand_xorsign (rtx operands[]) else gcc_unreachable (); + temp = gen_reg_rtx (vmode); mask = ix86_build_signbit_mask (vmode, 0, 0); - emit_insn (gen_xorsign3_1 
(mode, dest, op0, op1, mask)); -} + op1 = lowpart_subreg (vmode, op1, mode); + x = gen_rtx_AND (vmode, op1, mask); + emit_insn (gen_rtx_SET (temp, x)); -/* Deconstruct an xorsign operation into bit masks. */ - -void -ix86_split_xorsign (rtx operands[]) -{ - machine_mode mode, vmode; - rtx dest, op0, op1, mask, x; - - dest = operands[0]; - op0 = operands[1]; - op1 = operands[2]; - mask = operands[3]; - - mode = GET_MODE (dest); - vmode = GET_MODE (mask); - - /* The constraints ensure that for non-AVX dest == op1 is - different from op0, and for AVX that at most two of - dest, op0 and op1 are the same register but the third one - is different. */ - if (rtx_equal_p (op0, op1)) -{ - gcc_assert (TARGET_AVX && !rtx_equal_p (op0, dest)); - if (vmode == V4SFmode) - vmode = V4SImode; - else - { - gcc_assert (vmode == V2DFmode); - vmode = V2DImode; - } - mask = lowpart_subreg (vmode, mask, GET_MODE (mask)); - if (MEM_P (mask)) - { - rtx msk = lowpart_subreg (vmode, dest, mode); - emit_insn (gen_rtx_SET (msk, mask)); - mask = msk; - } - op0 = lowpart_subreg (vmode, op0, mode); - x = gen_rtx_AND (vmode, gen_rtx_NOT (vmode, mask), op0); -} - else -{ - op1 = lowpart_subreg (vmode, op1, mode); - x = gen_rtx_AND (vmode, op1, mask); - emit_insn (gen_rtx_SET (op1, x)); - - op0 = lowpart_subreg (vmode, op0, mode); - x = gen_rtx_XOR (vmode, op1, op0); -} + op0 = lowpart_subreg (vmode, op0, mode); + x = gen_rtx_XOR (vmode, temp, op0); dest = lowpart_subreg (vmode, dest, mode); emit_insn (gen_rtx_SET (dest, x)); --- gcc/testsuite/gcc.target/i386/avx-pr102224.c.jj 2021-09-08 11:40:55.826534981 +0200 +++ gcc/testsuite/gcc.target/i386/avx-pr102224.c2021-09-08 11:57:41.741386062 +0200 @@ -1,4 +1,4 @@ -/* PR tree-optimization/51581 */ +/* PR target/102224 */ /* { dg-do run } */ /* { dg-options "-O2 -mavx" } */ /* { dg-require-effective-target avx } */ --- gcc/testsuite/gcc.dg/pr89984.c.jj 2021-09-08 11:56:33.799343240 +0200 +++ gcc/testsuite/gcc.dg/pr89984.c 2021-09-08 11:54:36.070001821 
+0200 @@ -0,0 +1,20 @@ +/* PR target/89984 */ +/* { dg-do run } */ +/* { dg-options "-O2" } */ + +__attribute__((noipa)) float +foo (float x, float y) +{ + return x * __builtin_copysignf (1.0f, y) + y; +} + +int +main () +{ + if (foo (1.25f, 7.25f) != 1.25f + 7.25f + || foo (1.75f, -3.25f) != -1.75f + -3.25f + || foo (-2.25f, 7.5f) != -2.25f + 7.5f + || foo (-3.0f, -4.0f) != 3.0f + -4.0f) +__builtin_abort (); + return 0; +} --- gcc/testsuite/gcc.target/i386/avx-pr89984.c.jj 2021-09-08 11:57:12.297800869 +0200 +++ gcc/testsuite/gcc.target/i386/avx-pr89984.c 2021-09-08 11
Re: [PATCH] i386: Fix up @xorsign<mode>3_1 [PR102224]
On Wed, Sep 08, 2021 at 06:00:50PM +0800, Hongtao Liu wrote: > And I find paradoxical subreg like (subreg:V4SF (reg:SF)) are not > allowed by validate_subreg until r11-621. My reading of the r11-621 change is that it allowed (subreg:V4SF (reg:V2SF) 0) but that (subreg:V4SF (reg:SF) 0) has been valid even before that. In any case, the PR89984 fix was a missed-optimization fix, so we don't need to backport that and thus don't need to backport the follow-up patch either. Jakub
Re: [PATCH] i386: Fix up @xorsign<mode>3_1 [PR102224]
On Wed, Sep 8, 2021 at 6:02 PM Jakub Jelinek wrote: > > On Wed, Sep 08, 2021 at 06:00:50PM +0800, Hongtao Liu wrote: > > Yes, I think so. > > And I find paradoxical subreg like (subreg:V4SF (reg:SF)) are not > > allowed by validate_subreg until r11-621. > > That's why post_reload splitter is needed here. > > Following seems to work for all the testcases I've find (and in some > generates better code than the post-reload splitter): > > 2021-09-08 Jakub Jelinek > liuhongt > > PR target/89984 > * config/i386/i386.md (@xorsign3_1): Remove. > * config/i386/i386-expand.c (ix86_expand_xorsign): Expand right away > into AND with mask and XOR, using paradoxical subregs. > (ix86_split_xorsign): Remove. Also remove this from i386-protos.h. > > * gcc.target/i386/avx-pr102224.c: Fix up PR number. > * gcc.dg/pr89984.c: New test. > * gcc.target/i386/avx-pr89984.c: New test. > Other LGTM. > --- gcc/config/i386/i386.md.jj 2021-09-08 11:40:55.826534981 +0200 > +++ gcc/config/i386/i386.md 2021-09-08 11:44:08.394828674 +0200 > @@ -10918,20 +10918,6 @@ (define_expand "xorsign3" >DONE; > }) > > -(define_insn_and_split "@xorsign3_1" > - [(set (match_operand:MODEF 0 "register_operand" "=&Yv,&Yv,&Yv") > - (unspec:MODEF > - [(match_operand:MODEF 1 "register_operand" "Yv,0,Yv") > - (match_operand:MODEF 2 "register_operand" "0,Yv,Yv") > - (match_operand: 3 "nonimmediate_operand" > "Yvm,Yvm,Yvm")] > - UNSPEC_XORSIGN))] > - "SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH" > - "#" > - "&& reload_completed" > - [(const_int 0)] > - "ix86_split_xorsign (operands); DONE;" > - [(set_attr "isa" "*,avx,avx")]) > - > ;; One complement instructions > > (define_expand "one_cmpl2" > --- gcc/config/i386/i386-expand.c.jj2021-09-08 11:40:55.824535010 +0200 > +++ gcc/config/i386/i386-expand.c 2021-09-08 11:51:15.969819626 +0200 > @@ -2270,7 +2270,7 @@ void > ix86_expand_xorsign (rtx operands[]) > { >machine_mode mode, vmode; > - rtx dest, op0, op1, mask; > + rtx dest, op0, op1, mask, x, temp; > >dest = 
operands[0]; >op0 = operands[1]; > @@ -2285,60 +2285,15 @@ ix86_expand_xorsign (rtx operands[]) >else > gcc_unreachable (); > > + temp = gen_reg_rtx (vmode); >mask = ix86_build_signbit_mask (vmode, 0, 0); > > - emit_insn (gen_xorsign3_1 (mode, dest, op0, op1, mask)); > -} > + op1 = lowpart_subreg (vmode, op1, mode); > + x = gen_rtx_AND (vmode, op1, mask); > + emit_insn (gen_rtx_SET (temp, x)); > > -/* Deconstruct an xorsign operation into bit masks. */ > - > -void > -ix86_split_xorsign (rtx operands[]) > -{ > - machine_mode mode, vmode; > - rtx dest, op0, op1, mask, x; > - > - dest = operands[0]; > - op0 = operands[1]; > - op1 = operands[2]; > - mask = operands[3]; > - > - mode = GET_MODE (dest); > - vmode = GET_MODE (mask); > - > - /* The constraints ensure that for non-AVX dest == op1 is > - different from op0, and for AVX that at most two of > - dest, op0 and op1 are the same register but the third one > - is different. */ > - if (rtx_equal_p (op0, op1)) > -{ > - gcc_assert (TARGET_AVX && !rtx_equal_p (op0, dest)); > - if (vmode == V4SFmode) > - vmode = V4SImode; > - else > - { > - gcc_assert (vmode == V2DFmode); > - vmode = V2DImode; > - } > - mask = lowpart_subreg (vmode, mask, GET_MODE (mask)); > - if (MEM_P (mask)) > - { > - rtx msk = lowpart_subreg (vmode, dest, mode); > - emit_insn (gen_rtx_SET (msk, mask)); > - mask = msk; > - } > - op0 = lowpart_subreg (vmode, op0, mode); > - x = gen_rtx_AND (vmode, gen_rtx_NOT (vmode, mask), op0); > -} > - else > -{ > - op1 = lowpart_subreg (vmode, op1, mode); > - x = gen_rtx_AND (vmode, op1, mask); > - emit_insn (gen_rtx_SET (op1, x)); > - > - op0 = lowpart_subreg (vmode, op0, mode); > - x = gen_rtx_XOR (vmode, op1, op0); > -} > + op0 = lowpart_subreg (vmode, op0, mode); > + x = gen_rtx_XOR (vmode, temp, op0); > >dest = lowpart_subreg (vmode, dest, mode); >emit_insn (gen_rtx_SET (dest, x)); > --- gcc/testsuite/gcc.target/i386/avx-pr102224.c.jj 2021-09-08 > 11:40:55.826534981 +0200 > +++ 
gcc/testsuite/gcc.target/i386/avx-pr102224.c2021-09-08 > 11:57:41.741386062 +0200 > @@ -1,4 +1,4 @@ > -/* PR tree-optimization/51581 */ > +/* PR target/102224 */ > /* { dg-do run } */ > /* { dg-options "-O2 -mavx" } */ > /* { dg-require-effective-target avx } */ > --- gcc/testsuite/gcc.dg/pr89984.c.jj 2021-09-08 11:56:33.799343240 +0200 > +++ gcc/testsuite/gcc.dg/pr89984.c 2021-09-08 11:54:36.070001821 +0200 > @@ -0,0 +1,20 @@ > +/* PR target/89984 */ > +/* { dg-do run } */ > +/* { dg-options "-O2" } */ > + > +__attribute__((noipa)) float > +foo (float x, float y) > +{ > + return x * __builtin_copysignf (1.0f, y) + y; > +} > + > +int > +main ()
Re: [Committed] Fix fatal typo in gcc.dg/no_profile_instrument_function-attr-2.c
> From: Martin Liška > Date: Wed, 8 Sep 2021 09:29:52 +0200 > On 9/7/21 22:41, Hans-Peter Nilsson wrote: > > With r12-3379, the testsuite got such a fatal syntax error, > > causing the gcc test-run to abort at (e.g.): > > Thank you for the fix! I haven't noticed the problem as I only grep for > '^FAIL' > lines in the corresponding log files. You're welcome but *please* also add (conceptually) "grep -q ^ERROR: makelog && fail" to your workflow. brgds, H-P PS. yeah, but worth repeating
Re: [Committed] Fix fatal typo in gcc.dg/no_profile_instrument_function-attr-2.c
Hi Hans-Peter, >> From: Martin Liška >> Date: Wed, 8 Sep 2021 09:29:52 +0200 > >> On 9/7/21 22:41, Hans-Peter Nilsson wrote: >> > With r12-3379, the testsuite got such a fatal syntax error, >> > causing the gcc test-run to abort at (e.g.): >> >> Thank you for the fix! I haven't noticed the problem as I only grep for >> '^FAIL' >> lines in the corresponding log files. > > You're welcome but *please* also add (conceptually) > "grep -q ^ERROR: makelog && fail" to your workflow. even that is not enough, though: one also needs to look for UNRESOLVED and XPASS. An easy way of doing all this is running make mail-report.log after the build and comparing the output to an unmodified build. Saves everyone loads of trouble. Rainer -- - Rainer Orth, Center for Biotechnology, Bielefeld University
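The workflow advice above can be folded into a small wrapper. A minimal sketch, assuming a DejaGnu-style log or summary file; the function name and paths are placeholders, not part of any GCC script:

```shell
#!/bin/sh
# Scan a DejaGnu log/summary for the result classes that grepping
# only for '^FAIL' misses: ERROR, UNRESOLVED and XPASS.
check_log () {
  if grep -Eq '^(FAIL|ERROR|UNRESOLVED|XPASS):' "$1"; then
    echo "problems found in $1"
    return 1
  fi
  echo "clean: $1"
}

# Example usage with placeholder files:
printf 'PASS: t1\nERROR: tcl failure in foo.exp\n' > /tmp/gr_bad.log
printf 'PASS: t1\nPASS: t2\n' > /tmp/gr_good.log
check_log /tmp/gr_bad.log || echo "would fail the build here"
check_log /tmp/gr_good.log
```

As noted above, comparing `make mail-report.log` output against an unmodified build catches the same classes of problems with less scripting.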
Re: [PATCH 1/2]middle-end Teach CSE to be able to do vector extracts.
Hi Jeff & Richard, > If you can turn that example into a test, even if it's just in the > aarch64 directory, that would be helpful The second patch 2/2 has various tests for this as the cost model had to be made more accurate for it to work. > > As mentioned in the 2/2 thread, I think we should use subregs for > the case where they're canonical. It'd probably be worth adding a > simplify-rtx.c helper to extract one element from a vector, e.g.: > > rtx simplify_gen_vec_select (rtx op, unsigned int index); > > so that this is easier to do. > > Does making the loop above per-element mean that, for 128-bit Advanced > SIMD, the optimisation “only” kicks in for 64-bit element sizes? > Perhaps for other element sizes we could do “top” and “bottom” halves. > (There's obviously no need to do that as part of this work, was just > wondering.) > It should handle extraction of any element size, so it's able to use a value in any arbitrary location. CSE already handles low/hi re-use optimally. So e.g. #include <arm_neon.h> extern int16x8_t bar (int16x8_t, int16x8_t); int16x8_t foo () { int16_t s[4] = {1,2,3,4}; int16_t d[8] = {1,2,3,4,5,6,7,8}; int16x4_t r1 = vld1_s16 (s); int16x8_t r2 = vcombine_s16 (r1, r1); int16x8_t r3 = vld1q_s16 (d); return bar (r2, r3); } but our cost model is currently blocking it because we never costed vec_consts. Without the 2/2 patch we generate: foo: stp x29, x30, [sp, -48]! adrp x0, .LC0 mov x29, sp ldr q1, [x0, #:lo12:.LC0] adrp x0, .LC1 ldr q0, [x0, #:lo12:.LC1] adrp x0, .LC2 str q1, [sp, 32] ldr d2, [x0, #:lo12:.LC2] str d2, [sp, 24] bl bar ldp x29, x30, [sp], 48 ret .LC0: .hword 1 .hword 2 .hword 3 .hword 4 .hword 5 .hword 6 .hword 7 .hword 8 .LC1: .hword 1 .hword 2 .hword 3 .hword 4 .hword 1 .hword 2 .hword 3 .hword 4 but with the 2/2 patch: foo: stp x29, x30, [sp, -48]! 
adrp x0, .LC0 mov x29, sp ldr d2, [x0, #:lo12:.LC0] adrp x0, .LC1 ldr q1, [x0, #:lo12:.LC1] str d2, [sp, 24] dup d0, v2.d[0] str q1, [sp, 32] ins v0.d[1], v2.d[0] bl bar ldp x29, x30, [sp], 48 ret .LC1: .hword 1 .hword 2 .hword 3 .hword 4 .hword 5 .hword 6 .hword 7 .hword 8 It's not entirely optimal of course, but is a step forward. I think when we fix the vld's this should then become optimal as currently the MEMs are causing it to not re-use those values. > >else > > sets[n_sets++].rtl = x; > > } > > @@ -4513,7 +4533,14 @@ cse_insn (rtx_insn *insn) > >struct set *sets = (struct set *) 0; > > > >if (GET_CODE (x) == SET) > > -sets = XALLOCA (struct set); > > +{ > > + /* For CONST_VECTOR we wants to be able to CSE the vector itself > > along with > > +elements inside the vector if the target says it's cheap. */ > > + if (GET_CODE (SET_SRC (x)) == CONST_VECTOR) > > + sets = XALLOCAVEC (struct set, const_vector_encoded_nelts (SET_SRC (x)) > > + 1); > > + else > > + sets = XALLOCA (struct set); > > +} > >else if (GET_CODE (x) == PARALLEL) > > sets = XALLOCAVEC (struct set, XVECLEN (x, 0)); > > I think this would be easier if “sets” was first converted to an > auto_vec, say auto_vec. We then wouldn't need to > predict in advance how many elements are needed. > Done. 
> > + machine_mode const_mode = GET_MODE_INNER (GET_MODE (src)); > > + struct table_elt *related_elt > > + = lookup (const_rtx, HASH (const_rtx, const_mode), const_mode); > > + if (related_elt) > > + { > > + for (related_elt = related_elt->first_same_value; > > + related_elt; related_elt = related_elt->next_same_value) > > + if (REG_P (related_elt->exp)) > > + { > > + src_eqv_here > > + = gen_rtx_VEC_DUPLICATE (GET_MODE (src), > > +
RE: [PATCH]AArch64 Make use of FADDP in simple reductions.
Slightly updated patch correcting some modes that were giving warnings. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/ChangeLog: * config/aarch64/aarch64-simd.md (*aarch64_faddp_scalar, *aarch64_addp_scalarv2di, *aarch64_faddp_scalar2, *aarch64_addp_scalar2v2di): New. gcc/testsuite/ChangeLog: * gcc.target/aarch64/simd/scalar_faddp.c: New test. * gcc.target/aarch64/simd/scalar_faddp2.c: New test. * gcc.target/aarch64/simd/scalar_addp.c: New test. Co-authored-by: Tamar Christina --- inline copy of patch --- diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 371990fbe2cfb72d22f22ed582bb7ebdebb3edc0..408500f74c06b63368136a49087558d40972873e 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -3411,6 +3411,70 @@ (define_insn "aarch64_faddp" [(set_attr "type" "neon_fp_reduc_add_")] ) +;; For the case where both operands are a subreg we need to use a +;; match_dup since reload cannot enforce that the registers are +;; the same with a constraint in this case. 
+(define_insn "*aarch64_faddp_scalar2" + [(set (match_operand: 0 "register_operand" "=w") + (plus: + (vec_select: + (match_operator:VHSDF 1 "subreg_lowpart_operator" + [(match_operand:VHSDF 2 "register_operand" "w")]) + (parallel [(match_operand 3 "const_int_operand" "n")])) + (match_dup: 2)))] + "TARGET_SIMD + && ENDIAN_LANE_N (, INTVAL (operands[3])) == 1" + "faddp\t%0, %2.2" + [(set_attr "type" "neon_fp_reduc_add_")] +) + +(define_insn "*aarch64_faddp_scalar" + [(set (match_operand: 0 "register_operand" "=w") + (plus: + (vec_select: + (match_operand:VHSDF 1 "register_operand" "w") + (parallel [(match_operand 2 "const_int_operand" "n")])) + (match_operand: 3 "register_operand" "1")))] + "TARGET_SIMD + && ENDIAN_LANE_N (, INTVAL (operands[2])) == 1 + && SUBREG_P (operands[3]) && !SUBREG_P (operands[1]) + && subreg_lowpart_p (operands[3])" + "faddp\t%0, %1.2" + [(set_attr "type" "neon_fp_reduc_add_")] +) + +;; For the case where both operands are a subreg we need to use a +;; match_dup since reload cannot enforce that the registers are +;; the same with a constraint in this case. 
+(define_insn "*aarch64_addp_scalar2v2di"
+  [(set (match_operand:DI 0 "register_operand" "=w")
+	(plus:DI
+	  (vec_select:DI
+	    (match_operator:V2DI 1 "subreg_lowpart_operator"
+	      [(match_operand:V2DI 2 "register_operand" "w")])
+	    (parallel [(match_operand 3 "const_int_operand" "n")]))
+	  (match_dup:DI 2)))]
+  "TARGET_SIMD
+   && ENDIAN_LANE_N (2, INTVAL (operands[3])) == 1"
+  "addp\t%d0, %2.2d"
+  [(set_attr "type" "neon_reduc_add_long")]
+)
+
+(define_insn "*aarch64_addp_scalarv2di"
+  [(set (match_operand:DI 0 "register_operand" "=w")
+	(plus:DI
+	  (vec_select:DI
+	    (match_operand:V2DI 1 "register_operand" "w")
+	    (parallel [(match_operand 2 "const_int_operand" "n")]))
+	  (match_operand:DI 3 "register_operand" "1")))]
+  "TARGET_SIMD
+   && ENDIAN_LANE_N (2, INTVAL (operands[2])) == 1
+   && SUBREG_P (operands[3]) && !SUBREG_P (operands[1])
+   && subreg_lowpart_p (operands[3])"
+  "addp\t%d0, %1.2d"
+  [(set_attr "type" "neon_reduc_add_long")]
+)
+
 (define_insn "aarch64_reduc_plus_internal<mode>"
   [(set (match_operand:VDQV 0 "register_operand" "=w")
	(unspec:VDQV [(match_operand:VDQV 1 "register_operand" "w")]
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c
new file mode 100644
index ..ab904ca6a6392a3a068615f68e6b76c0716344ae
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c
@@ -0,0 +1,11 @@
+/* { dg-do assemble } */
+/* { dg-additional-options "-save-temps -O1 -std=c99" } */
+
+typedef long long v2di __attribute__((vector_size (16)));
+
+long long
+foo (v2di x)
+{
+  return x[1] + x[0];
+}
+/* { dg-final { scan-assembler-times {addp\td[0-9]+, v[0-9]+.2d} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c
new file mode 100644
index ..2c8a05b46d8b4f7a1634bc04cc61426ba7b9ef91
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c
@@ -0,0 +1,44 @@
+/* { dg-do assemble } */
+/* { dg-require-effective-target
arm_v8_2a_fp16_scalar_ok } */
+/* { dg-add-options arm_v8_2a_fp16_scalar } */
+/* { dg-additional-options "-save-temps -O1" } */
+/* { dg-final { scan-assembler-times "dup" 4 } } */
+
+
+typedef double v2df __attribute__((vector_size (16)));
+typedef float v4sf __attribute__((vector_size (16)));
+typedef __fp16 v8hf __attribute__((vector_size (16)));
+
+double
+foo (v2df x)
+{
+  return x[1] + x[0];
+}
+/* { dg-fina
Re: Sv: [PATCH 1/2 v2] jit : Generate debug info for variables
On Tue, 2021-09-07 at 16:18 +, Petter Tomner via Gcc-patches wrote: > I realized I still managed to mess up some WS. I have attached a > patch that is the same, except fixes the WS issue > underneath. > > Regards, Petter > > + FOR_EACH_VEC_ELT (m_globals, i, global) > + rest_of_decl_compilation (global, true, true); Thanks for the updated patch. Looks good to me, assuming it's been tested successfully. It should be combined with the testcase into a single commit when pushed. Do you have push rights to the GCC git repo? Dave
RE: [PATCH 2/2]AArch64: Add better costing for vector constants and operations
> -Original Message- > From: Richard Sandiford > Sent: Tuesday, August 31, 2021 7:38 PM > To: Tamar Christina > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw > ; Marcus Shawcroft > ; Kyrylo Tkachov > Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector constants > and operations > > Tamar Christina writes: > >> -Original Message- > >> From: Richard Sandiford > >> Sent: Tuesday, August 31, 2021 5:07 PM > >> To: Tamar Christina > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw > >> ; Marcus Shawcroft > >> ; Kyrylo Tkachov > > >> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector > >> constants and operations > >> > >> Tamar Christina writes: > >> >> -Original Message- > >> >> From: Richard Sandiford > >> >> Sent: Tuesday, August 31, 2021 4:14 PM > >> >> To: Tamar Christina > >> >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw > >> >> ; Marcus Shawcroft > >> >> ; Kyrylo Tkachov > >> > >> >> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector > >> >> constants and operations > >> >> > >> >> Tamar Christina writes: > >> >> > @@ -13936,8 +13937,65 @@ cost_plus: > >> >> >mode, MULT, 1, speed); > >> >> >return true; > >> >> > } > >> >> > + break; > >> >> > +case PARALLEL: > >> >> > + /* Fall through */ > >> >> > >> >> Which code paths lead to getting a PARALLEL here? > >> > > >> > Hi, > >> > > >> > Thanks for the review! > >> > > >> > I added it for completeness because CSE treats a parallel and > >> > CONST_VECTOR as equivalent when they each entry in the parallel > >> > defines > >> a constant. > >> > >> Could you test whether it ever triggers in practice though? > >> The code would be much simpler without it. > > > > Will check 😊 Looks like for AArch64 there's no real way for this to happen so I've removed this case. > > > >> > >> >> > +case CONST_VECTOR: > >> >> > + { > >> >> > + rtx gen_insn = aarch64_simd_make_constant (x, true); > >> >> > + /* Not a valid const vector. 
*/ > >> >> > + if (!gen_insn) > >> >> > + break; > >> >> > > >> >> > - /* Fall through. */ > >> >> > + switch (GET_CODE (gen_insn)) > >> >> > + { > >> >> > + case CONST_VECTOR: > >> >> > + /* Load using MOVI/MVNI. */ > >> >> > + if (aarch64_simd_valid_immediate (x, NULL)) > >> >> > + *cost += extra_cost->vect.movi; > >> >> > + else /* Load using constant pool. */ > >> >> > + *cost += extra_cost->ldst.load; > >> >> > + break; > >> >> > + /* Load using a DUP. */ > >> >> > + case VEC_DUPLICATE: > >> >> > + *cost += extra_cost->vect.dup; > >> >> > + break; > >> >> > >> >> Does this trigger in practice? The new check==true path (rightly) > >> >> stops the duplicated element from being forced into a register, > >> >> but then I would have > >> >> expected: > >> >> > >> >> rtx > >> >> gen_vec_duplicate (machine_mode mode, rtx x) { > >> >> if (valid_for_const_vector_p (mode, x)) > >> >> return gen_const_vec_duplicate (mode, x); > >> >> return gen_rtx_VEC_DUPLICATE (mode, x); } > >> >> > >> >> to generate the original CONST_VECTOR again. > >> > > >> > Yes, but CSE is trying to see whether using a DUP is cheaper than > >> > another > >> instruction. > >> > Normal code won't hit this but CSE is just costing all the > >> > different ways one can semantically construct a vector, which RTL > >> > actually comes out > >> of it depends on how it's folded as you say. > >> > >> But what I mean is, you call: > >> > >> rtx gen_insn = aarch64_simd_make_constant (x, true); > >> /* Not a valid const vector. */ > >> if (!gen_insn) > >>break; > >> > >> where aarch64_simd_make_constant does: > >> > >> if (CONST_VECTOR_P (vals)) > >> const_vec = vals; > >> else if (GET_CODE (vals) == PARALLEL) > >> { > >> /* A CONST_VECTOR must contain only CONST_INTs and > >> CONST_DOUBLEs, but CONSTANT_P allows more (e.g. SYMBOL_REF). > >> Only store valid constants in a CONST_VECTOR. 
*/ > >> int n_elts = XVECLEN (vals, 0); > >> for (i = 0; i < n_elts; ++i) > >>{ > >> rtx x = XVECEXP (vals, 0, i); > >> if (CONST_INT_P (x) || CONST_DOUBLE_P (x)) > >>n_const++; > >>} > >> if (n_const == n_elts) > >>const_vec = gen_rtx_CONST_VECTOR (mode, XVEC (vals, 0)); > >> } > >> else > >> gcc_unreachable (); > >> > >> if (const_vec != NULL_RTX > >> && aarch64_simd_valid_immediate (const_vec, NULL)) > >> /* Load using MOVI/MVNI. */ > >> return const_vec; > >> else if ((const_dup = aarch64_simd_dup_constant (vals, check)) != > >> NULL_RTX) > >> /* Loaded using DUP. */ > >> return const_dup; > >> > >> and aarch64_simd_dup_constant does: > >> > >> machine_mode mode = GET_MODE (vals); > >> machine_mode inner_mode = GET_MODE_INNER (mode); > >> rtx x; > >> > >> if
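The CONST_VECTOR shapes being costed in this thread can be sketched in plain C with the GNU vector extension. This is illustrative only; the typedef and function names below are mine, not part of the patch. A vector whose lanes are all equal can be materialized with a single `movi`/`dup` on AArch64, while a vector of distinct constants generally needs a literal-pool load — the two cases the costing code tries to price differently.

```c
#include <assert.h>

/* Illustrative only: names are mine, not from the patch.  */
typedef int v4si __attribute__ ((vector_size (16)));

v4si
splat_five (void)
{
  /* All lanes equal: a movi/dup candidate.  */
  return (v4si) { 5, 5, 5, 5 };
}

v4si
mixed_consts (void)
{
  /* Distinct lanes: typically loaded from the literal pool.  */
  return (v4si) { 1, 2, 3, 4 };
}
```

Compiling both at -O2 for aarch64 and comparing the generated bodies is a quick way to see the difference the cost model is meant to capture.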
Re: [PATCH v2] c-family: Add __builtin_assoc_barrier
On 9/8/21 5:37 AM, Matthias Kretz wrote:
> On Tuesday, 7 September 2021 19:36:22 CEST Jason Merrill wrote:
>>>     case PAREN_EXPR:
>>> -      RETURN (finish_parenthesized_expr (RECUR (TREE_OPERAND (t, 0))));
>>> +      if (REF_PARENTHESIZED_P (t))
>>> +	RETURN (finish_parenthesized_expr (RECUR (TREE_OPERAND (t, 0))));
>>> +      else
>>> +	RETURN (RECUR (TREE_OPERAND (t, 0)));
>>
>> I think you need to build a new PAREN_EXPR in the assoc barrier case as
>> well, for it to have any effect in templates.
>
> My intent was to ignore __builtin_assoc_barrier in templates / constexpr
> evaluation since it's not affected by -fassociative-math anyway. Or do
> you mean something else?

I agree about constexpr, but why wouldn't template instantiations be
affected by -fassociative-math like any other function?

Jason
Re: [PATCH] c++/102228 - make lookup_anon_field O(1)
On 9/8/21 5:13 AM, Richard Biener wrote: For the testcase in PR101555 lookup_anon_field takes the majority of parsing time followed by get_class_binding_direct/fields_linear_search which is PR83309. The situation with anon aggregates is particularly dire when we need to build accesses to their members and the anon aggregates are nested. There for each such access we recursively build sub-accesses to the anon aggregate FIELD_DECLs bottom-up, DFS searching for them. That's inefficient since as I believe there's a 1:1 relationship between anon aggregate types and the FIELD_DECL used to place them. The patch below does away with the search in lookup_anon_field and instead records the single FIELD_DECL in the anon aggregate types lang-specific data, re-using the RTTI typeinfo_var field. That speeds up the compile of the testcase with -fsyntax-only from about 4.5s to slightly less than 1s. I tried to poke holes into the 1:1 relationship idea with my C++ knowledge but failed (which might not say much). It also leaves a hole for the case when the C++ FE itself duplicates such type and places it at a semantically different position. I've tried to poke holes into it with the duplication mechanism I understand (templates) but failed. Bootstrapped and tested on x86_64-unknown-linux-gnu. OK for trunk? OK, thanks. Thanks, Richard. 2021-09-08 Richard Biener PR c++/102228 * cp-tree.h (ANON_AGGR_TYPE_FIELD): New define. * decl.c (fixup_anonymous_aggr): Wipe RTTI info put in place on invalid code. * decl2.c (reset_type_linkage): Guard CLASSTYPE_TYPEINFO_VAR access. * module.cc (trees_in::read_class_def): Likewise. Reconstruct ANON_AGGR_TYPE_FIELD. * semantics.c (finish_member_declaration): Populate ANON_AGGR_TYPE_FIELD for anon aggregate typed members. * typeck.c (lookup_anon_field): Remove DFS search and return ANON_AGGR_TYPE_FIELD directly. 
--- gcc/cp/cp-tree.h | 4 gcc/cp/decl.c | 3 +++ gcc/cp/decl2.c | 17 + gcc/cp/module.cc | 10 -- gcc/cp/semantics.c | 8 gcc/cp/typeck.c| 32 +--- 6 files changed, 37 insertions(+), 37 deletions(-) diff --git a/gcc/cp/cp-tree.h b/gcc/cp/cp-tree.h index 435f32bf909..ceb53591926 100644 --- a/gcc/cp/cp-tree.h +++ b/gcc/cp/cp-tree.h @@ -4800,6 +4800,10 @@ more_aggr_init_expr_args_p (const aggr_init_expr_arg_iterator *iter) #define ANON_UNION_TYPE_P(NODE) \ (TREE_CODE (NODE) == UNION_TYPE && ANON_AGGR_TYPE_P (NODE)) +/* For an ANON_AGGR_TYPE_P the single FIELD_DECL it is used with. */ +#define ANON_AGGR_TYPE_FIELD(NODE) \ + (LANG_TYPE_CLASS_CHECK (NODE)->typeinfo_var) + /* Define fields and accessors for nodes representing declared names. */ /* True if TYPE is an unnamed structured type with a typedef for diff --git a/gcc/cp/decl.c b/gcc/cp/decl.c index 146979ba96b..bce62ad202a 100644 --- a/gcc/cp/decl.c +++ b/gcc/cp/decl.c @@ -5096,6 +5096,9 @@ fixup_anonymous_aggr (tree t) (*vec)[store++] = elt; vec_safe_truncate (vec, store); + /* Wipe RTTI info. */ + CLASSTYPE_TYPEINFO_VAR (t) = NULL_TREE; + /* Anonymous aggregates cannot have fields with ctors, dtors or complex assignment operators (because they cannot have these methods themselves). 
For anonymous unions this is already checked because they are not allowed diff --git a/gcc/cp/decl2.c b/gcc/cp/decl2.c index 107edcaaccf..a79a70ba9c2 100644 --- a/gcc/cp/decl2.c +++ b/gcc/cp/decl2.c @@ -2977,14 +2977,15 @@ reset_type_linkage (tree type) SET_DECL_ASSEMBLER_NAME (vt, name); reset_decl_linkage (vt); } - if (tree ti = CLASSTYPE_TYPEINFO_VAR (type)) - { - tree name = mangle_typeinfo_for_type (type); - DECL_NAME (ti) = name; - SET_DECL_ASSEMBLER_NAME (ti, name); - TREE_TYPE (name) = type; - reset_decl_linkage (ti); - } + if (!ANON_AGGR_TYPE_P (type)) + if (tree ti = CLASSTYPE_TYPEINFO_VAR (type)) + { + tree name = mangle_typeinfo_for_type (type); + DECL_NAME (ti) = name; + SET_DECL_ASSEMBLER_NAME (ti, name); + TREE_TYPE (name) = type; + reset_decl_linkage (ti); + } for (tree m = TYPE_FIELDS (type); m; m = DECL_CHAIN (m)) { tree mem = STRIP_TEMPLATE (m); diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc index 4b2ad6f3db8..71d0fab411f 100644 --- a/gcc/cp/module.cc +++ b/gcc/cp/module.cc @@ -11859,8 +11859,9 @@ trees_in::read_class_def (tree defn, tree maybe_template) { CLASSTYPE_BEFRIENDING_CLASSES (type_dup) = CLASSTYPE_BEFRIENDING_CLASSES (type); - CLASSTYPE_TYPEINFO_VAR (type_dup) - = CLASSTYPE_TYPEINFO_VAR (type); + if (!ANON_AGGR_TYPE_P (type)) + CLASSTYPE_TYPEINFO_VAR (
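A minimal C++ sketch (mine, not from the patch) of the anonymous-aggregate member access the patch speeds up. The anonymous union inside `S` has no name of its own; the front end places it with a single FIELD_DECL, and an access such as `s.i` is internally built as `s.<anonymous-field>.i`. `lookup_anon_field` maps the anonymous aggregate type back to that FIELD_DECL — after the patch in O(1) via `ANON_AGGR_TYPE_FIELD` instead of a DFS over `TYPE_FIELDS`.

```cpp
#include <cassert>

// Illustrative only; this type and helper are mine, not from the patch.
struct S
{
  union          // anonymous union: placed by exactly one FIELD_DECL in S
  {
    int i;
    float f;
  };
  long tail;     // an ordinary named member, for contrast
};

// Reading through the anonymous member exercises the lookup path.
inline int
read_i (S &s)
{
  return s.i;    // built internally as s.<anon-field>.i
}
```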
Re: [PATCH v2] c-family: Add __builtin_assoc_barrier
On Wednesday, 8 September 2021 15:44:28 CEST Jason Merrill wrote:
> On 9/8/21 5:37 AM, Matthias Kretz wrote:
> > On Tuesday, 7 September 2021 19:36:22 CEST Jason Merrill wrote:
> >>>	  case PAREN_EXPR:
> >>> -	    RETURN (finish_parenthesized_expr (RECUR (TREE_OPERAND (t, 0))));
> >>> +	    if (REF_PARENTHESIZED_P (t))
> >>> +	      RETURN (finish_parenthesized_expr (RECUR (TREE_OPERAND (t, 0))));
> >>> +	    else
> >>> +	      RETURN (RECUR (TREE_OPERAND (t, 0)));
> >>
> >> I think you need to build a new PAREN_EXPR in the assoc barrier case as
> >> well, for it to have any effect in templates.
> >
> > My intent was to ignore __builtin_assoc_barrier in templates / constexpr
> > evaluation since it's not affected by -fassociative-math anyway. Or do you
> > mean something else?
>
> I agree about constexpr, but why wouldn't template instantiations be
> affected by -fassociative-math like any other function?

Oh, that seems like a major misunderstanding on my part. I assumed
tsubst_copy_and_build would evaluate the expressions in template arguments 🤦.
I'll expand the test and will fix.

-- 
──────────────────────────────────────────────────────────────────────────
 Dr. Matthias Kretz                           https://mattkretz.github.io
 GSI Helmholtz Centre for Heavy Ion Research               https://gsi.de
 stdₓ::simd
──────────────────────────────────────────────────────────────────────────
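Why reassociation matters here can be seen from a plain-C sketch; the function names are mine, and the `__builtin_assoc_barrier` use shown in the comment assumes the builtin as added by this patch series. The two functions really compute different values, which is exactly what `-fassociative-math` is allowed to change and what the barrier is meant to prevent.

```c
#include <assert.h>

/* Floating-point addition is not associative.  With the builtin from
   this patch one could write, e.g.,
       float t = __builtin_assoc_barrier (a + b) + c;
   to keep a + b from being reassociated with c even under -ffast-math.
   The two functions below, compiled without -fassociative-math, show
   that the association orders really differ.  */

float
sum_left (float a, float b, float c)
{
  return (a + b) + c;
}

float
sum_right (float a, float b, float c)
{
  return a + (b + c);
}
```

With a = 1e20f, b = -1e20f, c = 1.0f: 1.0f is far below one ulp of 1e20f, so it vanishes when added to -1e20f first, giving 0.0f from `sum_right` but 1.0f from `sum_left`.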
[PATCH] rs6000: Fix ELFv2 r12 use in epilogue
We cannot use r12 here, it is already in use as the GEP (for sibling calls). Committed to trunk. Will backport later, too. Segher 2021-09-08 Segher Boessenkool PR target/102107 * config/rs6000/rs6000-logue.c (rs6000_emit_epilogue): For ELFv2 use r11 instead of r12 for restoring CR. --- gcc/config/rs6000/rs6000-logue.c | 4 1 file changed, 4 insertions(+) diff --git a/gcc/config/rs6000/rs6000-logue.c b/gcc/config/rs6000/rs6000-logue.c index e363d56ecec0..9965a8aa6910 100644 --- a/gcc/config/rs6000/rs6000-logue.c +++ b/gcc/config/rs6000/rs6000-logue.c @@ -4815,6 +4815,10 @@ rs6000_emit_epilogue (enum epilogue_type epilogue_type) else if (REGNO (frame_reg_rtx) == 12) cr_save_regno = 11; + /* For ELFv2 r12 is already in use as the GEP. */ + if (DEFAULT_ABI == ABI_ELFv2) + cr_save_regno = 11; + cr_save_reg = load_cr_save (cr_save_regno, frame_reg_rtx, info->cr_save_offset + frame_off, exit_func); -- 1.8.3.1
testsuite: Allow .sdata in more cases in gcc.dg/array-quals-1.c
When testing for Nios II (gcc-testresults shows this for MIPS as well), failures of gcc.dg/array-quals-1.c appear where a symbol was found in .sdata rather than one of the expected sections. FAIL: gcc.dg/array-quals-1.c scan-assembler-symbol-section symbol ^_?a$ (found a) has section ^\\.(const|rodata|srodata)|\\[RO\\] (found .sdata) FAIL: gcc.dg/array-quals-1.c scan-assembler-symbol-section symbol ^_?b$ (found b) has section ^\\.(const|rodata|srodata)|\\[RO\\] (found .sdata) FAIL: gcc.dg/array-quals-1.c scan-assembler-symbol-section symbol ^_?c$ (found c) has section ^\\.(const|rodata|srodata)|\\[RO\\] (found .sdata) FAIL: gcc.dg/array-quals-1.c scan-assembler-symbol-section symbol ^_?d$ (found d) has section ^\\.(const|rodata|srodata)|\\[RO\\] (found .sdata) Jakub's commit 0b34dbc0a24864b1674bff7a92fa3cf0f1cbcea1 allowed .sdata for many variables in that test where use of .sdata caused a failure on powerpc-linux. I'm presuming the choice of which variables had .sdata allowed was based only on the code generated for powerpc-linux, not on any reason it would be wrong to allow it for the other variables; thus, this patch adjusts the test to allow .sdata for some more variables where that is needed on Nios II (and in one case where it's not needed on Nios II, but the test results on gcc-testresults suggest that it is needed on MIPS). Tested with no regressions with cross to nios2-elf. OK to commit? * gcc.dg/array-quals-1.c: Allow .sdata section in more cases. diff --git a/gcc/testsuite/gcc.dg/array-quals-1.c b/gcc/testsuite/gcc.dg/array-quals-1.c index 2c041649279..b9b55f774bc 100644 --- a/gcc/testsuite/gcc.dg/array-quals-1.c +++ b/gcc/testsuite/gcc.dg/array-quals-1.c @@ -7,26 +7,26 @@ /* { dg-additional-options "-fno-pie" { target pie } } */ /* The MMIX port always switches to the .data section at the end of a file. 
*/ /* { dg-final { scan-assembler-not "\\.data(?!\\.rel\\.ro)" { xfail powerpc*-*-aix* mmix-*-* x86_64-*-mingw* } } } */ -/* { dg-final { scan-assembler-symbol-section {^_?a$} {^\.(const|rodata|srodata)|\[RO\]} } } */ +/* { dg-final { scan-assembler-symbol-section {^_?a$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ static const int a[2] = { 1, 2 }; /* { dg-final { scan-assembler-symbol-section {^_?a1$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ const int a1[2] = { 1, 2 }; typedef const int ci; -/* { dg-final { scan-assembler-symbol-section {^_?b$} {^\.(const|rodata|srodata)|\[RO\]} } } */ +/* { dg-final { scan-assembler-symbol-section {^_?b$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ static ci b[2] = { 3, 4 }; /* { dg-final { scan-assembler-symbol-section {^_?b1$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ ci b1[2] = { 3, 4 }; typedef int ia[2]; -/* { dg-final { scan-assembler-symbol-section {^_?c$} {^\.(const|rodata|srodata)|\[RO\]} } } */ +/* { dg-final { scan-assembler-symbol-section {^_?c$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ static const ia c = { 5, 6 }; /* { dg-final { scan-assembler-symbol-section {^_?c1$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ const ia c1 = { 5, 6 }; typedef const int cia[2]; -/* { dg-final { scan-assembler-symbol-section {^_?d$} {^\.(const|rodata|srodata)|\[RO\]} } } */ +/* { dg-final { scan-assembler-symbol-section {^_?d$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ static cia d = { 7, 8 }; /* { dg-final { scan-assembler-symbol-section {^_?d1$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ cia d1 = { 7, 8 }; -/* { dg-final { scan-assembler-symbol-section {^_?e$} {^\.(const|rodata|srodata)|\[RO\]} } } */ +/* { dg-final { scan-assembler-symbol-section {^_?e$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ static cia e[2] = { { 1, 2 }, { 3, 4 } }; /* { dg-final { scan-assembler-symbol-section {^_?e1$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ cia e1[2] = { { 1, 2 }, { 3, 4 } }; -- Joseph 
S. Myers jos...@codesourcery.com
Re: testsuite: Allow .sdata in more cases in gcc.dg/array-quals-1.c
On Wed, Sep 08, 2021 at 03:25:24PM +, Joseph Myers wrote: > When testing for Nios II (gcc-testresults shows this for MIPS as > well), failures of gcc.dg/array-quals-1.c appear where a symbol was > found in .sdata rather than one of the expected sections. > > FAIL: gcc.dg/array-quals-1.c scan-assembler-symbol-section symbol ^_?a$ > (found a) has section ^\\.(const|rodata|srodata)|\\[RO\\] (found .sdata) > FAIL: gcc.dg/array-quals-1.c scan-assembler-symbol-section symbol ^_?b$ > (found b) has section ^\\.(const|rodata|srodata)|\\[RO\\] (found .sdata) > FAIL: gcc.dg/array-quals-1.c scan-assembler-symbol-section symbol ^_?c$ > (found c) has section ^\\.(const|rodata|srodata)|\\[RO\\] (found .sdata) > FAIL: gcc.dg/array-quals-1.c scan-assembler-symbol-section symbol ^_?d$ > (found d) has section ^\\.(const|rodata|srodata)|\\[RO\\] (found .sdata) > > Jakub's commit 0b34dbc0a24864b1674bff7a92fa3cf0f1cbcea1 allowed .sdata > for many variables in that test where use of .sdata caused a failure > on powerpc-linux. I'm presuming the choice of which variables had > .sdata allowed was based only on the code generated for powerpc-linux, > not on any reason it would be wrong to allow it for the other Yeah. > variables; thus, this patch adjusts the test to allow .sdata for some > more variables where that is needed on Nios II (and in one case where > it's not needed on Nios II, but the test results on gcc-testresults > suggest that it is needed on MIPS). > > Tested with no regressions with cross to nios2-elf. OK to commit? > > * gcc.dg/array-quals-1.c: Allow .sdata section in more cases. LGTM. Jakub
Re: [PATCH] Fix SFmode subreg of DImode and TImode
On Tue, Sep 07, 2021 at 06:07:30PM -0500, Segher Boessenkool wrote: > Hi! > > On Tue, Sep 07, 2021 at 03:12:36AM -0400, Michael Meissner wrote: > > [PATCH] Fix SFmode subreg of DImode and TImode > > > > This patch fixes the breakage in the PowerPC due to a recent change in > > SUBREG > > behavior. > > But what was that change? And was that intentional? If so, why wasn't > it documented, was the existing behaviour considered buggy? But the > documentation agrees with the previous behaviour afaics. I haven't investigated what the machine independent change was that suddenly started allowing: (subreg:SF (reg:TI ...)) This patch just allows the subreg's to exist instead of getting an insn not found message. An alternative approach would be to allow wider subregs in the insn matching and deal with getting the right physical register when splitting. > > While it is arguable that the patch that caused the breakage should > > be reverted, this patch should be a bandage to prevent these changes from > > happening again. > > NAK. This patch will likely cause us to generate worse code. If that > is not the case it will need a long, in-depth explanation of why not. Trying to replicate the problem, I get exactly the same code with an older compiler. I will attach the test program as an attachment. > Sorry. > > > I first noticed it in building the Spec 2017 wrf_r and blender_r > > benchmarks. Once I applied this patch, I also noticed several of the > > tests now pass. > > > > The core of the problem is we need to treat SUBREG's of SFmode and SImode > > specially on the PowerPC. This is due to the fact that SFmode values that > > are > > in the vector and floating point registers are represented as DFmode. When > > we > > want to do a direct move between the GPR registers and the vector > > registers, we > > have to convert the value from the DFmode representation to/from the SFmode > > representation. 
> > The core of the problem is that subreg of pseudos has three meanings: > -- Paradoxical subregs; > -- Actual subregs; > -- "bit_cast" thingies: treat the same bits as something else. Like > looking at the bits of a float as its memory image. > > Ignoring paradoxical subregs (as well as subregs of mem, which should > have disappeared by now), and subregs of hard registers as well (those > have *different* semantics after all), the other two kinds can be mixed, > and *have to* be mixed, because subregs of subregs are non-canonical. This isn't subregs of subregs. This is a subreg of a larger integer type, for example having a subreg representing an __int128_t or structure, and accessing the bottom/top element. Instead of doing an extract, the compiler generates a subreg. Before the machine independent part of the compiler would not generate these subregs, and now it does. > > Is there any reason why not to allow this kind of subreg? > > If we want to not allow mixing bit_cast with subregs, we should make it > its own RTL code. > > > + /* In case we are given a SUBREG for a larger type, reduce it to > > +SImode. */ > > + if (mode == SFmode && GET_MODE_SIZE (inner_mode) > 4) > > + { > > + rtx tmp = gen_reg_rtx (SImode); > > + emit_move_insn (tmp, gen_lowpart (SImode, source)); > > + emit_insn (gen_movsf_from_si (dest, tmp)); > > + return true; > > + } > > This makes it two separate insns. Is that always optimised to code that > is at least as good as before? It is likely even if it generates an extra move, it will be better, because the default would be to do the transfer via store on one register bank and load of the other. 
-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.ibm.com, phone: +1 (978) 899-4797

union u_128 {
  __int128_t i128;
  long l[2];
  int i[4];
  double d[2];
  float f[4];
};

union u_64 {
  long l;
  int i[2];
  double d;
  float f[2];
};

float
ret_u128_f0 (long a, long b, long x, long y)
{
  union u_128 u;
  u.l[0] = a ^ x;
  u.l[1] = b ^ y;
  return u.f[0];
}

float
ret_u128_f1 (long a, long b, long x, long y)
{
  union u_128 u;
  u.l[0] = a ^ x;
  u.l[1] = b ^ y;
  return u.f[1];
}

float
ret_u128_f2 (long a, long b, long x, long y)
{
  union u_128 u;
  u.l[0] = a ^ x;
  u.l[1] = b ^ x;
  return u.f[2];
}

float
ret_u128_f3 (long a, long b, long x, long y)
{
  union u_128 u;
  u.l[0] = a ^ x;
  u.l[1] = b ^ x;
  return u.f[3];
}

float
ret_u64_f0 (long a, long x)
{
  union u_64 u;
  u.l = a ^ x;
  return u.f[0];
}

float
ret_u64_f1 (long a, long x)
{
  union u_64 u;
  u.l = a ^ x;
  return u.f[1];
}
Re: [PATCH] Fix SFmode subreg of DImode and TImode
Hi! On Wed, Sep 08, 2021 at 08:42:44AM +0200, Richard Biener wrote: > On Wed, Sep 8, 2021 at 1:08 AM Segher Boessenkool > wrote: > > The core of the problem is that subreg of pseudos has three meanings: > > -- Paradoxical subregs; > > -- Actual subregs; > > -- "bit_cast" thingies: treat the same bits as something else. Like > > looking at the bits of a float as its memory image. > > > > Ignoring paradoxical subregs (as well as subregs of mem, which should > > have disappeared by now), and subregs of hard registers as well (those > > have *different* semantics after all), the other two kinds can be mixed, > > and *have to* be mixed, because subregs of subregs are non-canonical. > > > > Is there any reason why not to allow this kind of subreg? > > In fact the causing rev. in question > (d2874d905647a1d146dafa60199d440e837adc4d) > made all those subregs "valid" in terms of what validate_subreg is verifying > and thus now the few places using validate_subreg to check whether some > subreg is valid will now happily do float<->int converting subregs. Like in PR102154: error: unrecognizable insn: (insn 11 10 12 2 (set (reg:SF 118 [ <retval> ]) (subreg:SF (reg:DI 122) 0)) This generated valid code before, and now it is not anymore. If the patch was meant to allow *more*, it failed that goal: it allows *less*. The issue is that we do not allow SF subregs in all cases. We do not store SFmode quantities in IEEE SP format in registers normally, so it would require expensive extra code to allow such subregs. Whatever pass made these subregs is wrong, because they do not pass recog(). > I do agree that those subregs should be allowed and that the above rev. is > a strict improvement (given it removes a lot of "but allow special case X > because target Y wants it" cases by simply allowing all of them). But the > previous code seems to have papered over quite some backend issues. It is not a good idea to do allow all those things. 
Most backends can only support a few combinations of them, and everything else results in *worse* machine code, best case, and more and more complicated (and more buggy!) backend code. But that is a code quality issue. The current problem is that we have at least PR102211 and PR102154 (as well as reports elsewhere of bugs on other targets). Code that used to work before doesn't anymore, and we have no clear way out, no recommendation how to fix this and a) keep the same functionality without huge changes, and b) keep the same machine code quality. I do not think something like that can be done. That is why I am asking for the patch to be reverted until all of the groundwork for it has been done. That includes making generic testcases that show how such subregs behave, so that we can see in testresults what changes do to existing targets. Segher
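A C-level analogue (my sketch, not from the thread) of the "bit_cast" kind of subreg under discussion: `(subreg:SF (reg:DI 122) 0)` views the low 32 bits of a 64-bit pseudo as a single-precision float, which in source form is a `memcpy`-style reinterpretation of bits.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative only.  On a little-endian target the lowpart of a
   DImode value is its low 32 bits; viewing those bits as SFmode is
   what (subreg:SF (reg:DI) 0) expresses at the RTL level.  */
float
low_bits_as_float (uint64_t x)
{
  uint32_t lo = (uint32_t) x;	/* lowpart of the DImode value */
  float f;
  memcpy (&f, &lo, sizeof f);	/* the "bit_cast": same bits, new mode */
  return f;
}
```

This is the operation rs6000 cannot do for free, since SFmode values live in registers in DFmode format there — hence the extra conversion code the thread is arguing about.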
Re: [PATCH] Fix SFmode subreg of DImode and TImode
On Wed, Sep 8, 2021 at 1:10 PM Segher Boessenkool wrote: > > Hi! > > On Wed, Sep 08, 2021 at 08:42:44AM +0200, Richard Biener wrote: > > On Wed, Sep 8, 2021 at 1:08 AM Segher Boessenkool > > wrote: > > > The core of the problem is that subreg of pseudos has three meanings: > > > -- Paradoxical subregs; > > > -- Actual subregs; > > > -- "bit_cast" thingies: treat the same bits as something else. Like > > > looking at the bits of a float as its memory image. > > > > > > Ignoring paradoxical subregs (as well as subregs of mem, which should > > > have disappeared by now), and subregs of hard registers as well (those > > > have *different* semantics after all), the other two kinds can be mixed, > > > and *have to* be mixed, because subregs of subregs are non-canonical. > > > > > > Is there any reason why not to allow this kind of subreg? > > > > In fact the causing rev. in question > > (d2874d905647a1d146dafa60199d440e837adc4d) > > made all those subregs "valid" in terms of what validate_subreg is verifying > > and thus now the few places using validate_subreg to check whether some > > subreg is valid will now happily do float<->int converting subregs. > > Like in PR102154: > > error: unrecognizable insn: > (insn 11 10 12 2 (set (reg:SF 118 [ ]) > (subreg:SF (reg:DI 122) 0)) > > This generated valid code before, and now it is not anymore. If the > patch was meant to allow *more*, it failed that goal: it allows *less*. > > The issue is that we do not allow SF subregs in all cases. We do not > store SFmode quantities in IEEE SP format in registers normally, so it > would require expensive extra code to allow such subregs. > > Whatever pass made these subregs is wrong, because they do not pass > recog(). > > > I do agree that those subregs should be allowed and that the above rev. is > > a strict improvement (given it removes a lot of "but allow special case X > > because target Y wants it" cases by simply allowing all of them). 
> > But the previous code seems to have papered over quite some backend
> > issues.
>
> It is not a good idea to allow all those things.  Most backends can
> only support a few combinations of them, and everything else results in
> *worse* machine code, best case, and more and more complicated (and
> more buggy!) backend code.
>
> But that is a code quality issue.  The current problem is that we have
> at least PR102211 and PR102154 (as well as reports elsewhere of bugs on
> other targets).  Code that used to work before doesn't work anymore,
> and we have no clear way out, no recommendation how to fix this and
> a) keep the same functionality without huge changes, and b) keep the
> same machine code quality.
>
> I do not think something like that can be done.  That is why I am
> asking for the patch to be reverted until all of the groundwork for it
> has been done.  That includes making generic testcases that show how
> such subregs behave, so that we can see in testresults what changes do
> to existing targets.

Richi,

As was mentioned in an earlier reply, the patch removed comments in the
code itself that explained why the code was necessary:

  /* ??? This should not be here.  Temporarily continue to allow
     word_mode subregs of anything.  The most common offender is
     (subreg:SI (reg:DF)).  Generally, backends are doing something
     sketchy but it'll take time to fix them all.  */

  /* ??? Similarly, e.g. with (subreg:DF (reg:TI)).  Though
     store_bit_field is the culprit here, and not the backends.  */

The sources of the problem are issues in backends and issues in the
common part of the compiler.  Removal of this functionality should start
with a preemptive analysis of the compiler to determine how this ugly
behavior is being utilized, and a plan to fix all backends and the
common code generation paths in the compiler, not a sweeping change that
flushes out problems.

The GCC regression policy places the burden on the developer whose patch
introduced the problem.  Liuhong did not test any other targets although
the code comments clearly stated that the behavior was required by many
targets, and he is not stepping forward to resolve the problems.

This patch should be reverted until the problems exposed by it can be
addressed in a careful and deliberate manner.

Thanks, David
Re: [PATCH] Fix SFmode subreg of DImode and TImode
On September 8, 2021 7:08:09 PM GMT+02:00, Segher Boessenkool wrote: >Hi! > >On Wed, Sep 08, 2021 at 08:42:44AM +0200, Richard Biener wrote: >> On Wed, Sep 8, 2021 at 1:08 AM Segher Boessenkool >> wrote: >> > The core of the problem is that subreg of pseudos has three meanings: >> > -- Paradoxical subregs; >> > -- Actual subregs; >> > -- "bit_cast" thingies: treat the same bits as something else. Like >> > looking at the bits of a float as its memory image. >> > >> > Ignoring paradoxical subregs (as well as subregs of mem, which should >> > have disappeared by now), and subregs of hard registers as well (those >> > have *different* semantics after all), the other two kinds can be mixed, >> > and *have to* be mixed, because subregs of subregs are non-canonical. >> > >> > Is there any reason why not to allow this kind of subreg? >> >> In fact the causing rev. in question >> (d2874d905647a1d146dafa60199d440e837adc4d) >> made all those subregs "valid" in terms of what validate_subreg is verifying >> and thus now the few places using validate_subreg to check whether some >> subreg is valid will now happily do float<->int converting subregs. > >Like in PR102154: > >error: unrecognizable insn: >(insn 11 10 12 2 (set (reg:SF 118 [ ]) >(subreg:SF (reg:DI 122) 0)) > >This generated valid code before, and now it is not anymore. If the >patch was meant to allow *more*, it failed that goal: it allows *less*. > >The issue is that we do not allow SF subregs in all cases. We do not >store SFmode quantities in IEEE SP format in registers normally, so it >would require expensive extra code to allow such subregs. > >Whatever pass made these subregs is wrong, because they do not pass >recog(). > >> I do agree that those subregs should be allowed and that the above rev. is >> a strict improvement (given it removes a lot of "but allow special case X >> because target Y wants it" cases by simply allowing all of them). 
> > But the previous code seems to have papered over quite some backend
> > issues.
>
> It is not a good idea to allow all those things.  Most backends can
> only support a few combinations of them, and everything else results in
> *worse* machine code, best case, and more and more complicated (and
> more buggy!) backend code.
>
> But that is a code quality issue.  The current problem is that we have
> at least PR102211 and PR102154 (as well as reports elsewhere of bugs on
> other targets).  Code that used to work before doesn't work anymore,
> and we have no clear way out, no recommendation how to fix this and
> a) keep the same functionality without huge changes, and b) keep the
> same machine code quality.
>
> I do not think something like that can be done.  That is why I am
> asking for the patch to be reverted until all of the groundwork for it
> has been done.  That includes making generic testcases that show how
> such subregs behave, so that we can see in testresults what changes do
> to existing targets.

Heh, I understood your earlier reply as saying that you supported the
change in principle, based on the fact that nested subregs are invalid.

Now, I don't think that validate_subreg is supposed to be the decision
maker on what a target allows.  For subregs of hardregs we seem to have
a good way of validating, but what do we have for subregs of pseudos?
Is it the passes generating the new unsupported subregs that should do
different things?  Should validate_subreg use a target hook to allow
those special casings we removed, which were all necessary just for
specific targets but apparently did not do any harm for other targets?

Can you give advice as to how to address the needs of the HFmode subregs
on x86, if not by adding another (generic) narrow exception in
validate_subreg?

That said, I fail to see a good way forward after now two apparently
failed attempts.

Btw, I'm fine with reverting the patch, but then what's the solution
here?

Richard.

> Segher
[committed] analyzer: fix ICE when discarding result of realloc [PR102225]
Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
Pushed to trunk as r12-3422-ge66b9f6779f46433b0e2c093b58403604ed131cc.

gcc/analyzer/ChangeLog:
	PR analyzer/102225
	* analyzer.h (compat_types_p): New decl.
	* constraint-manager.cc
	(constraint_manager::get_or_add_equiv_class): Guard against NULL
	type when checking for pointer types.
	* region-model-impl-calls.cc (region_model::impl_call_realloc):
	Guard against NULL lhs type/region.  Guard against the size value
	not being of a compatible type for dynamic extents.
	* region-model.cc (compat_types_p): Make non-static.

gcc/testsuite/ChangeLog:
	PR analyzer/102225
	* gcc.dg/analyzer/realloc-1.c (test_10): New.
	* gcc.dg/analyzer/torture/pr102225.c: New test.
---
 gcc/analyzer/analyzer.h                   |  2 +
 gcc/analyzer/constraint-manager.cc        |  9 ++---
 gcc/analyzer/region-model-impl-calls.cc   | 42 ++++++++++++++---------
 gcc/analyzer/region-model.cc              |  2 +-
 gcc/testsuite/gcc.dg/analyzer/realloc-1.c |  5 +++
 .../gcc.dg/analyzer/torture/pr102225.c    |  6 +++
 6 files changed, 47 insertions(+), 19 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/analyzer/torture/pr102225.c

diff --git a/gcc/analyzer/analyzer.h b/gcc/analyzer/analyzer.h
index 7ad1081ca6c..3ba4e21b015 100644
--- a/gcc/analyzer/analyzer.h
+++ b/gcc/analyzer/analyzer.h
@@ -203,6 +203,8 @@ private:
 
 extern location_t get_stmt_location (const gimple *stmt, function *fun);
 
+extern bool compat_types_p (tree src_type, tree dst_type);
+
 /* Passed by pointer to PLUGIN_ANALYZER_INIT callbacks.  */
 
 class plugin_analyzer_init_iface
diff --git a/gcc/analyzer/constraint-manager.cc b/gcc/analyzer/constraint-manager.cc
index dc65c8dd92b..6df23fb477e 100644
--- a/gcc/analyzer/constraint-manager.cc
+++ b/gcc/analyzer/constraint-manager.cc
@@ -2088,10 +2088,11 @@ constraint_manager::get_or_add_equiv_class (const svalue *sval)
 
   /* Convert all NULL pointers to (void *) to avoid state explosions
      involving all of the various (foo *)NULL vs (bar *)NULL.  */
-  if (POINTER_TYPE_P (sval->get_type ()))
-    if (tree cst = sval->maybe_get_constant ())
-      if (zerop (cst))
-        sval = m_mgr->get_or_create_constant_svalue (null_pointer_node);
+  if (sval->get_type ())
+    if (POINTER_TYPE_P (sval->get_type ()))
+      if (tree cst = sval->maybe_get_constant ())
+        if (zerop (cst))
+          sval = m_mgr->get_or_create_constant_svalue (null_pointer_node);
 
   /* Try svalue match.  */
   if (get_equiv_class_by_svalue (sval, &result))
diff --git a/gcc/analyzer/region-model-impl-calls.cc b/gcc/analyzer/region-model-impl-calls.cc
index 875719f9989..ff2ae9ca77d 100644
--- a/gcc/analyzer/region-model-impl-calls.cc
+++ b/gcc/analyzer/region-model-impl-calls.cc
@@ -540,11 +540,14 @@ region_model::impl_call_realloc (const call_details &cd)
     {
       /* Return NULL; everything else is unchanged.  */
       const call_details cd (get_call_details (model, ctxt));
-      const svalue *zero
-        = model->m_mgr->get_or_create_int_cst (cd.get_lhs_type (), 0);
-      model->set_value (cd.get_lhs_region (),
-                        zero,
-                        cd.get_ctxt ());
+      if (cd.get_lhs_type ())
+        {
+          const svalue *zero
+            = model->m_mgr->get_or_create_int_cst (cd.get_lhs_type (), 0);
+          model->set_value (cd.get_lhs_region (),
+                            zero,
+                            cd.get_ctxt ());
+        }
       return true;
     }
   };
@@ -575,11 +578,17 @@ region_model::impl_call_realloc (const call_details &cd)
       const svalue *ptr_sval = cd.get_arg_svalue (0);
       const svalue *size_sval = cd.get_arg_svalue (1);
       if (const region *buffer_reg = ptr_sval->maybe_get_region ())
-        model->set_dynamic_extents (buffer_reg, size_sval);
-      model->set_value (cd.get_lhs_region (), ptr_sval, cd.get_ctxt ());
-      const svalue *zero
-        = model->m_mgr->get_or_create_int_cst (cd.get_lhs_type (), 0);
-      return model->add_constraint (ptr_sval, NE_EXPR, zero, cd.get_ctxt ());
+        if (compat_types_p (size_sval->get_type (), size_type_node))
+          model->set_dynamic_extents (buffer_reg, size_sval);
+      if (cd.get_lhs_region ())
+        {
+          model->set_value (cd.get_lhs_region (), ptr_sval, cd.get_ctxt ());
+          const svalue *zero
+            = model->m_mgr->get_or_create_int_cst (cd.get_lhs_type (), 0);
+          return model->add_constraint (ptr_sval, NE_EXPR, zero, cd.get_ctxt ());
+        }
+      else
+        return true;
     }
   };
@@ -643,10 +652,15 @@ region_model::impl_call_realloc (const call_details &cd)
          and the new_ptr_sval as "nonnull".  */
       model->on_realloc_with_move (cd, old_ptr_sval, new_ptr_sval);
-      const svalue *zero
-        = model->m_mgr->get_or_create_int_cst (cd.get_lhs_type (), 0);
-      return model->add_constr
Re: [committed] analyzer: support "bifurcation"; reimplement realloc [PR99260]
On Fri, 3 Sep 2021, Gerald Pfeifer wrote:
> On Mon, 30 Aug 2021, David Malcolm via Gcc-patches wrote:
>> gcc/analyzer/ChangeLog:
>> 	PR analyzer/99260
>> 	* analyzer.h (class custom_edge_info): New class, adapted from
>> 	exploded_edge::custom_info_t.  Make member functions const.
>> 	Make update_model return bool, converting edge param from
>> 	reference to a pointer, and adding a ctxt param.
>> 	(class path_context): New class.
>> 	* call-info.cc: New file.
>> 	* call-info.h: New file.
>> 	* engine.cc: Include "analyzer/call-info.h" and .
>
> I believe it is this change that is causing bootstrap to fail with
>
> In file included from
> /scratch/tmp/gerald/GCC-HEAD/gcc/analyzer/engine.cc:69:
> In file included from /usr/include/c++/v1/memory:653:
> /usr/include/c++/v1/typeinfo:346:5: error: no member named
> 'fancy_abort' in namespace 'std::__1'; did you mean simply
> 'fancy_abort'?
> _VSTD::abort();
> ^~~
> /usr/include/c++/v1/__config:782:15: note: expanded from macro '_VSTD'
> #define _VSTD std::_LIBCPP_ABI_NAMESPACE

I'm now 99% certain this patch is the trigger and have also filed a bug
report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102242

Good news is that I have a prototype patch in testing, so if you have
not started yet I suggest waiting; I should have an update by tomorrow
morning.

Gerald
Re: [PATCH] Fix SFmode subreg of DImode and TImode
On Wed, Sep 08, 2021 at 08:39:31PM +0200, Richard Biener wrote:
> On September 8, 2021 7:08:09 PM GMT+02:00, Segher Boessenkool wrote:
> > It is not a good idea to allow all those things.  Most backends can
> > only support a few combinations of them, and everything else results
> > in *worse* machine code, best case, and more and more complicated
> > (and more buggy!) backend code.
> >
> > But that is a code quality issue.  The current problem is that we
> > have at least PR102211 and PR102154 (as well as reports elsewhere of
> > bugs on other targets).  Code that used to work before doesn't work
> > anymore, and we have no clear way out, no recommendation how to fix
> > this and a) keep the same functionality without huge changes, and
> > b) keep the same machine code quality.
> >
> > I do not think something like that can be done.  That is why I am
> > asking for the patch to be reverted until all of the groundwork for
> > it has been done.  That includes making generic testcases that show
> > how such subregs behave, so that we can see in testresults what
> > changes do to existing targets.
>
> Heh, I understood your earlier reply that you supported the change in
> principle based on the fact that nested subregs are invalid.

Ah.  No.  I meant to lament the fact that we use subregs for multiple
things, so that doing a bit_cast of a real subreg has to be expressed as
just one subreg, which is clearly sub-optimal for most backends.

I say *has to* because a subreg of a subreg is not valid RTL; it has to
be written as just one subreg.  Which makes things more confusing and
confused than this already non-trivial thing has to be.

> Now, I don't think that validate_subreg is supposed to be the decision
> maker on what a target allows.

Right, some target hook or macro or whatnot should.

> For subregs of hardregs we seem to have a good way of validating, but
> what do we have for subregs of pseudos?  Is it the passes generating
> the new unsupported subregs that should do different things?  Should
> validate_subreg use a target hook to allow those special casings we
> removed, which were all necessary just for specific targets but
> apparently did not do any harm for other targets?

Currently, we can disallow things in predicates and/or the instruction
conditions.

> Can you give advice as to how to address the needs of the HFmode
> subregs on x86 if not by adding another (generic) narrow exception in
> validate_subreg?

If I knew what the problem was, perhaps?  Is this explained in some mail
somewhere?

> That said, I fail to see a good way forward after now two apparently
> failed attempts.
>
> Btw, I'm fine reverting the patch but then what's the solution here?

I think we should (longer term) get rid of the overloaded meanings and
uses of subregs.  One fairly simple thing is to make a new rtx code
"bit_cast" (or is there a nice short more traditional name for it?)

But that is not the core problem we had here.  The behaviour of the
generic parts of the compiler was changed, without testing if that works
on other targets than x86.  That is an understandable mistake, it takes
some experience to know where the morasses are.  But this change should
have been accompanied by testcases exercising the changed code.  We
would have clearly seen there are issues then, simply by watching
gcc-testresults@ (and/or maintainers are on top of the test results
anyway).  Also, if there were testcases for this, we could have some
confidence that a change in this area is robust.


Segher


p.s.  Very unrelated...  Should we have __builtin_bit_cast for C as
well?  Is there any reason this could not work?
Re: [PATCH v5] Fix for powerpc64 long double complex divide failure
Ping

Note: I don't have commit privileges for gcc.

On 8/27/2021 2:59 PM, Patrick McGehearty via Gcc-patches wrote:
This revision (v5) adds a test in libgcc/libgcc2.c for when
"__LIBGCC_TF_MANT_DIG__ == 106" to use __LIBGCC_DF_EPSILON__ instead
of __LIBGCC_TF_EPSILON__.  That is specific to IBM 128-bit format long
doubles where EPSILON is very, very small and 1/EPSILON overflows to
infinity.  This change avoids the overflow without affecting any other
platform.  Discussion in the patch is adjusted to reflect this
limitation.

It does not make any changes to .../rs6000/_divkc3.c, leaving it to use
__LIBGCC_KF__*.  That means the upstream gcc will not build in older
IBM environments that do not recognize the KF floating point mode
properly.  Environments that do not need IBM long double support do
build cleanly.

- - - -

This patch addresses the failure of powerpc64 long double complex
divide in native IBM long double format after the patch "Practical
improvement to libgcc complex divide".

The new code uses the following macros, which are intended to be mapped
to appropriate values according to the underlying hardware
representation.  See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101104

  RBIG      a value near the maximum representation
  RMIN      a value near the minimum representation
            (but not in the subnormal range)
  RMIN2     a value moderately less than 1
  RMINSCAL  the inverse of RMIN2
  RMAX2     RBIG * RMIN2 - a value to limit scaling to not overflow

When "long double" values were not using the IEEE 128-bit format but
the traditional IBM 128-bit, the previous code used the LDBL values,
which caused overflow for RMINSCAL.  The new code uses the DBL values.

  RBIG      LDBL_MAX = 0x1.f800p+1022    DBL_MAX = 0x1.f000p+1022
  RMIN      LDBL_MIN = 0x1.p-969         DBL_MIN = 0x1.p-1022
  RMIN2     LDBL_EPSILON = 0x0.1000p-1022 = 0x1.0p-1074
            DBL_EPSILON = 0x1.p-52
  RMINSCAL  1/LDBL_EPSILON = inf (1.0p+1074 does not fit in IBM 128-bit)
            1/DBL_EPSILON = 0x1.p+52
  RMAX2     RBIG * RMIN2 = 0x1.f800p-52
            RBIG * RMIN2 = 0x1.f000p+970

The MAX and MIN values have only modest changes since the maximum and
minimum values are about the same as for double precision.  The EPSILON
field is considerably different.  Due to how very small values can be
represented in the lower 64 bits of the IBM 128-bit floating point,
EPSILON is extremely small, so far beyond the desired value that
inversion of the value overflows, and even without the overflow, the
RMAX2 is so small as to eliminate most usage of the test.

The change has been tested on gcc135.fsffrance.org and gains the
expected improvements in accuracy for long double complex divide.

libgcc/
	PR target/101104
	* libgcc2.c (RMIN2, RMINSCAL, RMAX2): Use more correct values for
	native IBM 128-bit.
---
 libgcc/libgcc2.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
index 38f935e..bf45576 100644
--- a/libgcc/libgcc2.c
+++ b/libgcc/libgcc2.c
@@ -1906,8 +1906,13 @@ NAME (TYPE x, int m)
 # define NOTRUNC (!__LIBGCC_TF_EXCESS_PRECISION__)
 # define RBIG (__LIBGCC_TF_MAX__ / 2)
 # define RMIN (__LIBGCC_TF_MIN__)
-# define RMIN2 (__LIBGCC_TF_EPSILON__)
-# define RMINSCAL (1 / __LIBGCC_TF_EPSILON__)
+# if __LIBGCC_TF_MANT_DIG__ == 106
+#  define RMIN2 (__LIBGCC_DF_EPSILON__)
+#  define RMINSCAL (1 / __LIBGCC_DF_EPSILON__)
+# else
+#  define RMIN2 (__LIBGCC_TF_EPSILON__)
+#  define RMINSCAL (1 / __LIBGCC_TF_EPSILON__)
+# endif
 # define RMAX2 (RBIG * RMIN2)
 #else
 # error
Re: [PATCH] c++: Fix docs on assignment of virtual bases [PR60318]
Ping (and remember to CC a maintainer this time).

On 31/08/21 09:53 +0100, Jonathan Wakely wrote:
The description of behaviour is incorrect, the virtual base gets
assigned before entering the bodies of A::operator= and B::operator=,
not after.  The example is also ill-formed (passing a string literal
to char*) and undefined (missing return from Base::operator=).

Signed-off-by: Jonathan Wakely

gcc/ChangeLog:
	PR c++/60318
	* doc/trouble.texi (Copy Assignment): Fix description of
	behaviour and fix code in example.

OK for trunk and all active branches?

commit f0fa91971e35c1df7381ac289ece9bd5f07f8535
Author: Jonathan Wakely
Date:   Tue Aug 31 09:46:41 2021

    c++: Fix docs on assignment of virtual bases [PR60318]

    The description of behaviour is incorrect, the virtual base gets
    assigned before entering the bodies of A::operator= and
    B::operator=, not after.  The example is also ill-formed (passing
    a string literal to char*) and undefined (missing return from
    Base::operator=).

    Signed-off-by: Jonathan Wakely

    gcc/ChangeLog:
	PR c++/60318
	* doc/trouble.texi (Copy Assignment): Fix description of
	behaviour and fix code in example.

diff --git a/gcc/doc/trouble.texi b/gcc/doc/trouble.texi
index 40c51ae21cb..8b34be4aa63 100644
--- a/gcc/doc/trouble.texi
+++ b/gcc/doc/trouble.texi
@@ -865,10 +865,11 @@ objects behave unspecified when being assigned.  For example:
 @smallexample
 struct Base@{
   char *name;
-  Base(char *n) : name(strdup(n))@{@}
+  Base(const char *n) : name(strdup(n))@{@}
   Base& operator= (const Base& other)@{
     free (name);
     name = strdup (other.name);
+    return *this;
   @}
 @};
@@ -901,8 +902,8 @@ inside @samp{func} in the example).  G++ implements the ``intuitive''
 algorithm for copy-assignment: assign all direct bases, then assign all
 members.  In that algorithm, the virtual base subobject can be
 encountered more than once.  In the example, copying
-proceeds in the following order: @samp{val}, @samp{name} (via
-@code{strdup}), @samp{bval}, and @samp{name} again.
+proceeds in the following order: @samp{name} (via @code{strdup}),
+@samp{val}, @samp{name} again, and @samp{bval}.
 
 If application code relies on copy-assignment, a user-defined
 copy-assignment operator removes any uncertainties.  With such an
Re: [PATCH] c++: Fix docs on assignment of virtual bases [PR60318]
On 9/8/21 3:52 PM, Jonathan Wakely via Gcc-patches wrote:
Ping (and remember to CC a maintainer this time).

OK, thanks.

On 31/08/21 09:53 +0100, Jonathan Wakely wrote:
The description of behaviour is incorrect, the virtual base gets
assigned before entering the bodies of A::operator= and B::operator=,
not after.  The example is also ill-formed (passing a string literal
to char*) and undefined (missing return from Base::operator=).

Signed-off-by: Jonathan Wakely

gcc/ChangeLog:
	PR c++/60318
	* doc/trouble.texi (Copy Assignment): Fix description of
	behaviour and fix code in example.

OK for trunk and all active branches?

commit f0fa91971e35c1df7381ac289ece9bd5f07f8535
Author: Jonathan Wakely
Date:   Tue Aug 31 09:46:41 2021

    c++: Fix docs on assignment of virtual bases [PR60318]

    The description of behaviour is incorrect, the virtual base gets
    assigned before entering the bodies of A::operator= and
    B::operator=, not after.  The example is also ill-formed (passing
    a string literal to char*) and undefined (missing return from
    Base::operator=).

    Signed-off-by: Jonathan Wakely

    gcc/ChangeLog:
	PR c++/60318
	* doc/trouble.texi (Copy Assignment): Fix description of
	behaviour and fix code in example.

diff --git a/gcc/doc/trouble.texi b/gcc/doc/trouble.texi
index 40c51ae21cb..8b34be4aa63 100644
--- a/gcc/doc/trouble.texi
+++ b/gcc/doc/trouble.texi
@@ -865,10 +865,11 @@ objects behave unspecified when being assigned.  For example:
 @smallexample
 struct Base@{
   char *name;
-  Base(char *n) : name(strdup(n))@{@}
+  Base(const char *n) : name(strdup(n))@{@}
   Base& operator= (const Base& other)@{
     free (name);
     name = strdup (other.name);
+    return *this;
   @}
 @};
@@ -901,8 +902,8 @@ inside @samp{func} in the example).  G++ implements the ``intuitive''
 algorithm for copy-assignment: assign all direct bases, then assign all
 members.  In that algorithm, the virtual base subobject can be
 encountered more than once.  In the example, copying
-proceeds in the following order: @samp{val}, @samp{name} (via
-@code{strdup}), @samp{bval}, and @samp{name} again.
+proceeds in the following order: @samp{name} (via @code{strdup}),
+@samp{val}, @samp{name} again, and @samp{bval}.
 
 If application code relies on copy-assignment, a user-defined
 copy-assignment operator removes any uncertainties.  With such an
Re: [PATCH] warn for more impossible null pointer tests [PR102103]
On 9/2/21 7:53 PM, Martin Sebor wrote:

@@ -4622,28 +4622,94 @@ warn_for_null_address (location_t location, tree op, tsubst_flags_t complain)
   if (!warn_address
       || (complain & tf_warning) == 0
       || c_inhibit_evaluation_warnings != 0
-      || warning_suppressed_p (op, OPT_Waddress))
+      || warning_suppressed_p (op, OPT_Waddress)
+      || processing_template_decl != 0)

Completely suppressing this warning in templates seems like a
regression; I'd think we could recognize many relevant cases before
instantiation.  You just can't assume that ADDR_EXPR has the default
meaning if it has unknown type (i.e. because op0 is type-dependent).

Jason
[PING][PATCH] rs6000: Disable optimizing multiple xxsetaccz instructions into one xxsetaccz
I'd like to ping the following MMA patch:

https://gcc.gnu.org/pipermail/gcc-patches/2021-August/578288.html
Message-ID: <8393a33f-50ab-6720-0017-3f012803b...@linux.ibm.com>

Peter

Fwprop will happily optimize two xxsetaccz instructions into one
xxsetaccz by propagating the results of the first to the uses of the
second.  We really don't want that to happen given the late
priming/depriming of accumulators.  I fixed this by making the
xxsetaccz source operand an unspec volatile.  I also removed the
mma_xxsetaccz define_expand and define_insn_and_split and replaced them
with a simple define_insn.  The expand and splitter patterns were
leftovers from the pre-opaque-mode code, when the xxsetaccz code was
part of the movpxi pattern, and we don't need them now.

Rather than a new test case, I was able to just modify the current test
case to add another __builtin_mma_xxsetaccz call, which shows the bad
code gen with unpatched compilers.

This passed bootstrap on powerpc64le-linux with no regressions.  Ok for
trunk?  We'll need this for sure in GCC 11.  Ok there too after some
trunk burn-in time?  GCC 10 suffers from the same issue, but since the
code is different, I'll have to determine a different solution, which
I'll post as a separate patch.

gcc/
	* config/rs6000/mma.md (unspec): Delete UNSPEC_MMA_XXSETACCZ.
	(unspecv): Add UNSPECV_MMA_XXSETACCZ.
	(*mma_xxsetaccz): Delete.
	(mma_xxsetaccz): Change to define_insn.  Remove match_operand.
	Use UNSPECV_MMA_XXSETACCZ.
	* config/rs6000/rs6000.c (rs6000_rtx_costs): Use
	UNSPECV_MMA_XXSETACCZ.

gcc/testsuite/
	* gcc.target/powerpc/mma-builtin-6.c: Add second call to
	xxsetacc built-in.  Update instruction counts.
Re: [PATCH] Fix SFmode subreg of DImode and TImode
On Thu, Sep 9, 2021 at 3:17 AM Segher Boessenkool wrote:
>
> On Wed, Sep 08, 2021 at 08:39:31PM +0200, Richard Biener wrote:
> > On September 8, 2021 7:08:09 PM GMT+02:00, Segher Boessenkool wrote:
> > > It is not a good idea to allow all those things.  Most backends
> > > can only support a few combinations of them, and everything else
> > > results in *worse* machine code, best case, and more and more
> > > complicated (and more buggy!) backend code.
> > >
> > > But that is a code quality issue.  The current problem is that we
> > > have at least PR102211 and PR102154 (as well as reports elsewhere
> > > of bugs on other targets).  Code that used to work before doesn't
> > > work anymore, and we have no clear way out, no recommendation how
> > > to fix this and a) keep the same functionality without huge
> > > changes, and b) keep the same machine code quality.
> > >
> > > I do not think something like that can be done.  That is why I am
> > > asking for the patch to be reverted until all of the groundwork
> > > for it has been done.  That includes making generic testcases that
> > > show how such subregs behave, so that we can see in testresults
> > > what changes do to existing targets.
> >
> > Heh, I understood your earlier reply that you supported the change
> > in principle based on the fact that nested subregs are invalid.
>
> Ah.  No.  I meant to lament the fact that we use subregs for multiple
> things, so that doing a bit_cast of a real subreg has to be expressed
> as just one subreg, which is clearly sub-optimal for most backends.
>
> I say *has to* because a subreg of a subreg is not valid RTL; it has
> to be written as just one subreg.  Which makes things more confusing
> and confused than this already non-trivial thing has to be.
>
> > Now, I don't think that validate_subreg is supposed to be the
> > decision maker on what a target allows.
Since there is no good solution yet, I think we can add a target hook
(targetm.validate_subreg) to recover the original removed code as the
default behavior, and then x86 itself defines the hook as empty
(meaning that the backend allows all kinds of subregs).  Something
like:

@@ -922,6 +922,9 @@ validate_subreg (machine_mode omode, machine_mode imode,
 
   poly_uint64 regsize = REGMODE_NATURAL_SIZE (imode);
 
+  if (!targetm.validate_subreg (omode, imode, reg, offset))
+    return false;
+
   /* Paradoxical subregs must have offset zero.  */
   if (maybe_gt (osize, isize))
     return known_eq (offset, 0U);

> Right, some target hook or macro or whatnot should.
>
> > For subregs of hardregs we seem to have a good way of validating,
> > but what do we have for subregs of pseudos?  Is it the passes
> > generating the new unsupported subregs that should do different
> > things?  Should validate_subreg use a target hook to allow those
> > special casings we removed, which were all necessary just for
> > specific targets but apparently did not do any harm for other
> > targets?
>
> Currently, we can disallow things in predicates and/or the instruction
> conditions.
>
> > Can you give advice as to how to address the needs of the HFmode
> > subregs on x86 if not by adding another (generic) narrow exception
> > in validate_subreg?
>
> If I knew what the problem was, perhaps?  Is this explained in some
> mail somewhere?
>
> > That said, I fail to see a good way forward after now two apparently
> > failed attempts.
> >
> > Btw, I'm fine reverting the patch but then what's the solution here?
>
> I think we should (longer term) get rid of the overloaded meanings and
> uses of subregs.  One fairly simple thing is to make a new rtx code
> "bit_cast" (or is there a nice short more traditional name for it?)
>
> But that is not the core problem we had here.  The behaviour of the
> generic parts of the compiler was changed, without testing if that
> works on other targets than x86.
> That is an understandable mistake, it takes some experience to know
> where the morasses are.  But this change should have been accompanied
> by testcases exercising the changed code.  We would have clearly seen
> there are issues then, simply by watching gcc-testresults@ (and/or
> maintainers are on top of the test results anyway).  Also, if there
> were testcases for this, we could have some confidence that a change
> in this area is robust.
>
> Segher
>
> p.s.  Very unrelated...  Should we have __builtin_bit_cast for C as
> well?  Is there any reason this could not work?

--
BR,
Hongtao
Re: [PATCH] Fix SFmode subreg of DImode and TImode
On Thu, Sep 9, 2021 at 3:17 AM Segher Boessenkool wrote:
>
> On Wed, Sep 08, 2021 at 08:39:31PM +0200, Richard Biener wrote:
> > On September 8, 2021 7:08:09 PM GMT+02:00, Segher Boessenkool wrote:
> > > It is not a good idea to allow all those things.  Most backends
> > > can only support a few combinations of them, and everything else
> > > results in *worse* machine code, best case, and more and more
> > > complicated (and more buggy!) backend code.
> > >
> > > But that is a code quality issue.  The current problem is that we
> > > have at least PR102211 and PR102154 (as well as reports elsewhere
> > > of bugs on other targets).  Code that used to work before doesn't
> > > work anymore, and we have no clear way out, no recommendation how
> > > to fix this and a) keep the same functionality without huge
> > > changes, and b) keep the same machine code quality.
> > >
> > > I do not think something like that can be done.  That is why I am
> > > asking for the patch to be reverted until all of the groundwork
> > > for it has been done.  That includes making generic testcases that
> > > show how such subregs behave, so that we can see in testresults
> > > what changes do to existing targets.
> >
> > Heh, I understood your earlier reply that you supported the change
> > in principle based on the fact that nested subregs are invalid.
>
> Ah.  No.  I meant to lament the fact that we use subregs for multiple
> things, so that doing a bit_cast of a real subreg has to be expressed
> as just one subreg, which is clearly sub-optimal for most backends.
>
> I say *has to* because a subreg of a subreg is not valid RTL; it has
> to be written as just one subreg.  Which makes things more confusing
> and confused than this already non-trivial thing has to be.
>
> > Now, I don't think that validate_subreg is supposed to be the
> > decision maker on what a target allows.
>
> Right, some target hook or macro or whatnot should.
> > For subregs of hardregs we seem to have a good way of validating,
> > but what do we have for subregs of pseudos?  Is it the passes
> > generating the new unsupported subregs that should do different
> > things?  Should validate_subreg use a target hook to allow those
> > special casings we removed, which were all necessary just for
> > specific targets but apparently did not do any harm for other
> > targets?
>
> Currently, we can disallow things in predicates and/or the instruction
> conditions.
>
> > Can you give advice as to how to address the needs of the HFmode
> > subregs on x86 if not by adding another (generic) narrow exception
> > in validate_subreg?
>
> If I knew what the problem was, perhaps?  Is this explained in some
> mail somewhere?

It's at [1].  A target hook may not be the best solution: if other
targets support HFmode, they may encounter the same problem and have to
revisit it again.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576827.html

> > That said, I fail to see a good way forward after now two apparently
> > failed attempts.
> >
> > Btw, I'm fine reverting the patch but then what's the solution here?
>
> I think we should (longer term) get rid of the overloaded meanings and
> uses of subregs.  One fairly simple thing is to make a new rtx code
> "bit_cast" (or is there a nice short more traditional name for it?)
>
> But that is not the core problem we had here.  The behaviour of the
> generic parts of the compiler was changed, without testing if that
> works on other targets than x86.
> > > Segher > > > p.s. Very unrelated... Should we have __builtin_bit_cast for C as well? > Is there any reason this could not work? -- BR, Hongtao
Re: [RFC] Don't move cold code out of loop by checking bb count
On 2021/8/26 19:33, Richard Biener wrote: On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo wrote: Hi, On 2021/8/6 20:15, Richard Biener wrote: On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo wrote: There was a patch trying to avoid moving cold blocks out of loops: https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html Richard suggested to "never hoist anything from a bb with lower execution frequency to a bb with higher one in LIM invariantness_dom_walker before_dom_children". This patch does this profile count check in both gimple LIM move_computations_worker and RTL loop-invariant.c find_invariants_bb: if the loop bb is colder than the loop preheader, don't hoist it out of the loop. Also, the profile count in the loop split pass should be corrected to avoid lim2 and lim4 mismatch behavior; currently, the new loop preheader generated by loop_version is set to "[count: 0]:", then lim4 after the lsplt pass will move statements out of the loop unexpectedly when lim2 didn't move them. This change could fix the regression on 544.nab_r from -1.55% to +0.46%. SPEC2017 performance evaluation shows 1% performance improvement for intrate GEOMEAN and no obvious regression for others. Especially, 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is largely improved.), and 548.exchange2_r +1.98%, 526.blender_r +1.00% on P8LE. Regression and bootstrap tests pass on P8LE, any comments? Thanks. While I'm not familiar with the RTL invariant motion pass the patch there looks reasonable. Note that we should assess the profile quality somehow - I'm not sure how to do that, CCed Honza for that. Thanks. For the GIMPLE part the patch looks quite complicated - but note it probably has to be since LIM performs kind of a "CSE" on loads (and stores for store-motion), so when there are multiple stmts affected by a hoisting decision the biggest block count has to be accounted. 
Likewise when there are dependent stmts involved that might include conditional stmts (a "PHI"), but the overall cost should be looked at. Currently, the GIMPLE code checks two situations with the patch: 1) The statement or PHI's BB is *colder* than the preheader: don't move it out of the loop; 2) The statement or PHI's BB is *hotter* than the preheader, but some of its RHS operands couldn't be moved out of the loop: also don't move it out of the loop, to avoid a definition-does-not-dominate-use error. But part 2) is obviously already done. What I tried to say is your heuristic doesn't integrate nicely with the pass but I admitted that it might be a bit difficult to find a place to add this heuristic. There is lim_data->cost which we could bias negatively but then this is a cost that is independent of the hoisting distance. But doing this would work at least for the case where the immediately enclosing loop preheader is hotter than the stmt and with this it would be a patch that's similarly simple as the RTL one. Another possibility is to simply only adjust PHI processing in compute_invariantness, capping movement according to the hotness heuristic. The same could be done for regular stmts there but I'm not sure that will do good in the end since this function is supposed to compute "correctness" (well, it also has the cost stuff), and it's not the place to do overall cost considerations. Thanks. I found that adding a function find_coldest_out_loop and checking it in outermost_invariant_loop to find the coldest invariant loop between the outermost loop and itself could also achieve the purpose. Then the gimple code check is redundant and could be removed. Maybe I could collect the number of instructions not hoisted with the patch on regression tests and SPEC2017 to do an estimation of how "multiple stmts affected" and "overall cost" need to be considered? But it seems move_computations_worker couldn't roll back if we still want to hoist multiple stmts out during the iterations? 
Now - GIMPLE LIM "costing" is somewhat backward right now and it isn't set up to consider those multiple involved stmts. Plus the store-motion part does not have any cost part (but it depends on previously decided invariant motions). I think the way you implemented the check will cause no hoisting to be performed instead of, say, hoisting to a different loop level only. Possibly shown when you consider a loop nest like for (;;) if (unlikely_cond) for (;;) invariant; we want to hoist 'invariant' but only from the inner loop even if it is invariant also in the outer loop. For this case, theoretically I think master GCC will optimize it to: invariant; for (;;) if (unlikely_cond) for (;;) ; 'invariant' is moved out of the outer loop, but with the patch, it will get: for (;;) if (unlikely_cond) { invariant; for (;;) ; } 'invariant' is *cold* for the outer loop, but it is still *hot* for the inner loop, so hoisting it out of the inner loop is exactly what we want, right? Yes. I had doubts your p
[PATCH] x86: Add TARGET_AVX256_[MOVE|STORE]_BY_PIECES
1. Add TARGET_AVX256_MOVE_BY_PIECES to perform move by-pieces operation with 256-bit AVX instructions. 2. Add TARGET_AVX256_STORE_BY_PIECES to perform move and store by-pieces operations with 256-bit AVX instructions. They are enabled only for Intel Alder Lake and Intel processors with AVX512. gcc/ PR target/101935 * config/i386/i386.h (TARGET_AVX256_MOVE_BY_PIECES): New. (TARGET_AVX256_STORE_BY_PIECES): Likewise. (MOVE_MAX): Check TARGET_AVX256_MOVE_BY_PIECES and TARGET_AVX256_STORE_BY_PIECES instead of TARGET_AVX256_SPLIT_UNALIGNED_LOAD and TARGET_AVX256_SPLIT_UNALIGNED_STORE. (STORE_MAX_PIECES): Check TARGET_AVX256_STORE_BY_PIECES instead of TARGET_AVX256_SPLIT_UNALIGNED_STORE. * config/i386/x86-tune.def (X86_TUNE_AVX256_MOVE_BY_PIECES): New. (X86_TUNE_AVX256_STORE_BY_PIECES): Likewise. gcc/testsuite/ PR target/101935 * g++.target/i386/pr80566-1.C: Add -mtune-ctrl=avx256_store_by_pieces. * gcc.target/i386/pr100865-4a.c: Likewise. * gcc.target/i386/pr100865-10a.c: Likewise. * gcc.target/i386/pr90773-20.c: Likewise. * gcc.target/i386/pr90773-21.c: Likewise. * gcc.target/i386/pr90773-22.c: Likewise. * gcc.target/i386/pr90773-23.c: Likewise. * g++.target/i386/pr80566-2.C: Add -mtune-ctrl=avx256_move_by_pieces. * gcc.target/i386/eh_return-1.c: Likewise. * gcc.target/i386/pr90773-26.c: Likewise. * gcc.target/i386/pieces-memcpy-12.c: Replace -mtune=haswell with -mtune-ctrl=avx256_move_by_pieces. * gcc.target/i386/pieces-memcpy-15.c: Likewise. * gcc.target/i386/pieces-memset-2.c: Replace -mtune=haswell with -mtune-ctrl=avx256_store_by_pieces. * gcc.target/i386/pieces-memset-5.c: Likewise. * gcc.target/i386/pieces-memset-11.c: Likewise. * gcc.target/i386/pieces-memset-14.c: Likewise. * gcc.target/i386/pieces-memset-20.c: Likewise. * gcc.target/i386/pieces-memset-23.c: Likewise. * gcc.target/i386/pieces-memset-29.c: Likewise. * gcc.target/i386/pieces-memset-30.c: Likewise. * gcc.target/i386/pieces-memset-33.c: Likewise. * gcc.target/i386/pieces-memset-34.c: Likewise. 
* gcc.target/i386/pieces-memset-44.c: Likewise. * gcc.target/i386/pieces-memset-37.c: Replace -mtune=generic with -mtune-ctrl=avx256_store_by_pieces. --- gcc/config/i386/i386.h | 10 +++--- gcc/config/i386/x86-tune.def | 11 +++ gcc/testsuite/g++.target/i386/pr80566-1.C| 2 +- gcc/testsuite/g++.target/i386/pr80566-2.C| 2 +- gcc/testsuite/gcc.target/i386/eh_return-1.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memcpy-12.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memcpy-15.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-11.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-14.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-2.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-20.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-23.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-29.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-30.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-33.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-34.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-37.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-44.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-5.c | 2 +- gcc/testsuite/gcc.target/i386/pr100865-10a.c | 2 +- gcc/testsuite/gcc.target/i386/pr100865-4a.c | 2 +- gcc/testsuite/gcc.target/i386/pr90773-20.c | 2 +- gcc/testsuite/gcc.target/i386/pr90773-21.c | 2 +- gcc/testsuite/gcc.target/i386/pr90773-22.c | 2 +- gcc/testsuite/gcc.target/i386/pr90773-23.c | 2 +- gcc/testsuite/gcc.target/i386/pr90773-26.c | 2 +- 26 files changed, 42 insertions(+), 27 deletions(-) diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h index f671dae9236..9d45fe9611b 100644 --- a/gcc/config/i386/i386.h +++ b/gcc/config/i386/i386.h @@ -403,6 +403,10 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST]; ix86_tune_features[X86_TUNE_AVOID_LEA_FOR_ADDR] #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \ ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL] +#define TARGET_AVX256_MOVE_BY_PIECES \ + 
ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES] +#define TARGET_AVX256_STORE_BY_PIECES \ + ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES] #define TARGET_AVX256_SPLIT_REGS \ ix86_tune_features[X86_TUNE_AVX256_SPLIT_REGS] #define TARGET_GENERAL_REGS_SSE_SPILL \ @@ -1781,8 +1785,8 @@ typedef struct ix86_args { ? 64 \ : ((TARGET_AVX \ && !TARGET_PREFER_AVX128 \ - && !TARGET_AVX256_SPLIT_UNALIGNED_LOAD \ - && !TARGET_AVX256_SPLIT_
Re: [PATCH] x86: Add TARGET_AVX256_[MOVE|STORE]_BY_PIECES
On Thu, Sep 9, 2021 at 11:21 AM H.J. Lu via Gcc-patches wrote: > > 1. Add TARGET_AVX256_MOVE_BY_PIECES to perform move by-pieces operation > with 256-bit AVX instructions. > 2. Add TARGET_AVX256_STORE_BY_PIECES to perform move and store by-pieces > operations with 256-bit AVX instructions. > > They are enabled only for Intel Alder Lake and Intel processors with > AVX512. This patch fixes the regression on znver2 and kabylake. Although we don't know why moves with ymm would behave differently on non-avx512 (except for Alder Lake) and avx512 machines, the microarchitecture tuning seems like a reasonable choice. I'll install this patch if there are no objections in the next 72 hours. > > gcc/ > > PR target/101935 > * config/i386/i386.h (TARGET_AVX256_MOVE_BY_PIECES): New. > (TARGET_AVX256_STORE_BY_PIECES): Likewise. > (MOVE_MAX): Check TARGET_AVX256_MOVE_BY_PIECES and > TARGET_AVX256_STORE_BY_PIECES instead of > TARGET_AVX256_SPLIT_UNALIGNED_LOAD and > TARGET_AVX256_SPLIT_UNALIGNED_STORE. > (STORE_MAX_PIECES): Check TARGET_AVX256_STORE_BY_PIECES instead > of TARGET_AVX256_SPLIT_UNALIGNED_STORE. > * config/i386/x86-tune.def (X86_TUNE_AVX256_MOVE_BY_PIECES): New. > (X86_TUNE_AVX256_STORE_BY_PIECES): Likewise. > > gcc/testsuite/ > > PR target/101935 > * g++.target/i386/pr80566-1.C: Add > -mtune-ctrl=avx256_store_by_pieces. > * gcc.target/i386/pr100865-4a.c: Likewise. > * gcc.target/i386/pr100865-10a.c: Likewise. > * gcc.target/i386/pr90773-20.c: Likewise. > * gcc.target/i386/pr90773-21.c: Likewise. > * gcc.target/i386/pr90773-22.c: Likewise. > * gcc.target/i386/pr90773-23.c: Likewise. > * g++.target/i386/pr80566-2.C: Add > -mtune-ctrl=avx256_move_by_pieces. > * gcc.target/i386/eh_return-1.c: Likewise. > * gcc.target/i386/pr90773-26.c: Likewise. > * gcc.target/i386/pieces-memcpy-12.c: Replace -mtune=haswell > with -mtune-ctrl=avx256_move_by_pieces. > * gcc.target/i386/pieces-memcpy-15.c: Likewise. 
> * gcc.target/i386/pieces-memset-2.c: Replace -mtune=haswell > with -mtune-ctrl=avx256_store_by_pieces. > * gcc.target/i386/pieces-memset-5.c: Likewise. > * gcc.target/i386/pieces-memset-11.c: Likewise. > * gcc.target/i386/pieces-memset-14.c: Likewise. > * gcc.target/i386/pieces-memset-20.c: Likewise. > * gcc.target/i386/pieces-memset-23.c: Likewise. > * gcc.target/i386/pieces-memset-29.c: Likewise. > * gcc.target/i386/pieces-memset-30.c: Likewise. > * gcc.target/i386/pieces-memset-33.c: Likewise. > * gcc.target/i386/pieces-memset-34.c: Likewise. > * gcc.target/i386/pieces-memset-44.c: Likewise. > * gcc.target/i386/pieces-memset-37.c: Replace -mtune=generic > with -mtune-ctrl=avx256_store_by_pieces. > --- > gcc/config/i386/i386.h | 10 +++--- > gcc/config/i386/x86-tune.def | 11 +++ > gcc/testsuite/g++.target/i386/pr80566-1.C| 2 +- > gcc/testsuite/g++.target/i386/pr80566-2.C| 2 +- > gcc/testsuite/gcc.target/i386/eh_return-1.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memcpy-12.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memcpy-15.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-11.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-14.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-2.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-20.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-23.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-29.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-30.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-33.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-34.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-37.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-44.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-5.c | 2 +- > gcc/testsuite/gcc.target/i386/pr100865-10a.c | 2 +- > gcc/testsuite/gcc.target/i386/pr100865-4a.c | 2 +- > gcc/testsuite/gcc.target/i386/pr90773-20.c | 2 +- > gcc/testsuite/gcc.target/i386/pr90773-21.c | 2 +- > 
gcc/testsuite/gcc.target/i386/pr90773-22.c | 2 +- > gcc/testsuite/gcc.target/i386/pr90773-23.c | 2 +- > gcc/testsuite/gcc.target/i386/pr90773-26.c | 2 +- > 26 files changed, 42 insertions(+), 27 deletions(-) > > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h > index f671dae9236..9d45fe9611b 100644 > --- a/gcc/config/i386/i386.h > +++ b/gcc/config/i386/i386.h > @@ -403,6 +403,10 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST]; > ix86_tune_features[X86_TUNE_AVOID_LEA_FOR_ADDR] > #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \ > ix86_tune_features[X86_TUNE_SOFTWARE
Re: [PATCH] Fix SFmode subreg of DImode and TImode
On Wed, Sep 8, 2021 at 9:18 PM Segher Boessenkool wrote: > > On Wed, Sep 08, 2021 at 08:39:31PM +0200, Richard Biener wrote: > > On September 8, 2021 7:08:09 PM GMT+02:00, Segher Boessenkool > > wrote: > > >It is not a good idea to allow all those things. Most backends can > > >only support a few combinations of them, and everything else results in > > >*worse* machine code, best case, and more and more complicated (and more > > >buggy!) backend code. > > > > > >But that is a code quality issue. The current problem is that we have > > >at least PR102211 and PR102154 (as well as reports elsewhere of bugs on > > >other targets). Code that used to work before doesn't anymore, and we have no > > >clear way out, no recommendation how to fix this and a) keep the same > > >functionality without huge changes, and b) keep the same machine code > > >quality. > > > > > >I do not think something like that can be done. That is why I am asking > > >for the patch to be reverted until all of the groundwork for it has been > > >done. That includes making generic testcases that show how such subregs > > >behave, so that we can see in testresults what changes do to existing > > >targets. > > > > Heh, I understood your earlier reply that you supported the change in > > principle based on the fact that nested subregs are invalid. > > Ah. No. I meant to lament the fact that we use subregs for multiple > things, so that doing a bit_cast of a real subreg has to be expressed as > just one subreg, which is clearly sub-optimal for most backends. > > I say *has to* because a subreg of a subreg is not valid RTL; it has to > be written as just one subreg. Which makes things more confusing and > confused than this already non-trivial thing has to be. > > > Now, I don't think that validate_subreg is supposed to be the decision > > maker on what a target allows. > > Right, some target hook or macro or whatnot should. 
> > > For subregs of hardregs we seem to have a good way of validating, but > > what do we have for subregs of pseudos? Is it the passes generating > > the new unsupported subregs that should do different things? Should > > validate_subreg use a target hook to allow those special casings we > > removed which all were necessary just for specific targets but > > apparently did not do any harm for other targets? > > Currently, we can disallow things in predicates and/or the instruction > conditions. > > > Can you give advice as to how to address the needs of the HFmode subregs on > > x86 if not by adding another (generic) narrow exception in validate_subreg? > > If I knew what the problem was, perhaps? Is this explained in some mail > somewhere? > > > That said, I fail to see a good way forward after now two apparently > > failed attempts. > > > > Btw, I'm fine reverting the patch but then what's the solution here? > > I think we should (longer term) get rid of the overloaded meanings and > uses of subregs. One fairly simple thing is to make a new rtx code > "bit_cast" (or is there a nice short more traditional name for it?) But subreg _is_ bit_cast. What is odd to me is that a "disallowed" subreg like (subreg:SF (reg:TI ..) 0) magically becomes valid (in terms of validate_subreg) if you rewrite it as (subreg:SF (subreg:SI (reg:TI ..) 0) 0). Of course that's nested and invalid but just push the inner subreg to a new pseudo and the thing becomes valid. That was my original idea for dealing with the issue described in the thread including https://gcc.gnu.org/pipermail/gcc-patches/2021-August/578437.html where the original issue is that extract_integral_bit_field creates a subreg that's not valid according to validate_subreg. So I thought well, extracting bitfields should be done on integer modes (thus pun everything to ints and back), but that had even more(?) fallout. > But that is not the core problem we had here. 
The behaviour of the > generic parts of the compiler was changed, without testing if that > works on other targets but x86. That is an understandable mistake, it > takes some experience to know where the morasses are. But this change > should have been accompanied by testcases exercising the changed code. > We would have clearly seen there are issues then, simply by watching > gcc-testresults@ (and/or maintainers are on top of the test results > anyway). Also, if there were testcases for this, we could have some > confidence that a change in this area is robust. Well, that only works if some maintainers that are familiar enough with all this chime in ;) It's stage1 so it's understandable that some people (like me ...) are trying to help people make progress even if that involves trying to decipher 30 years of GCC history in this area (without much success in the end as we see) ;) Given validate_subreg's _intent_ is to disallow float subregs that change size, exceptions like /* ??? This should not be here. Temporarily continue to allow word_mode subregs of anything. The most common o
Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
On 2021/9/2 18:37, Richard Biener wrote: On Thu, 2 Sep 2021, Xionghu Luo wrote: On 2021/9/2 16:50, Richard Biener wrote: On Thu, 2 Sep 2021, Richard Biener wrote: On Thu, 2 Sep 2021, Xionghu Luo wrote: On 2021/9/1 17:58, Richard Biener wrote: This fixes the CFG walk order of fill_always_executed_in to use RPO order rather than the dominator based order computed by get_loop_body_in_dom_order. That fixes correctness issues with unordered dominator children. The RPO order computed by rev_post_order_and_mark_dfs_back_seme in its for-iteration mode is a good match for the algorithm. Xionghu, I've tried to only fix the CFG walk order issue and not change anything else with this so we have a more correct base to work against. The code still walks inner loop bodies up to loop depth times and thus is quadratic in the loop depth. Bootstrapped and tested on x86_64-unknown-linux-gnu, if you don't have any comments I plan to push this and then revisit what we were circling around. LGTM, thanks. I pushed it, thought again in the attempt to build a testcase and concluded I was wrong with the apparent mishandling of contains_call - get_loop_body_in_dom_order seems to be exactly correct for this specific case. So I reverted the commit again. And I figured what the /* In a loop that is always entered we may proceed anyway. But record that we entered it and stop once we leave it. */ comment was about. The code was present before the fix for PR78185 and it was supposed to catch the case where the entered inner loop is not finite. Just as the testcase from PR78185 shows the stopping was done too late when the exit block was already marked as to be always executed. A simpler fix for PR78185 would have been to move if (!flow_bb_inside_loop_p (inn_loop, bb)) break; before the setting of last = bb. In fact the installed fix was more pessimistic than that given it terminated already when entering a possibly infinite loop. 
So we can improve that by doing sth like which should also improve the situation for some of the cases you were looking at? What remains is that we continue to stop when entering a not always executed loop: if (bb->loop_father->header == bb) { if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) break; Yes. This will cause blocks after the inner loop to miss being checked for whether they are actually ALWAYS_EXECUTED. I am afraid O(N^2) is inevitable here... Yes. What we can try is pre-computing whether a loop has a call or an inner loop that might not terminate, and then when that's true for the loop to be entered continue to break; but when not, skip processing that loop's blocks (but we still fill the blocks array, and we do need to do this in the order for the loop we're processing ...). So what I was thinking was to somehow embed the dominator walk of get_loop_body_in_dom_order and instead of pre-recording the above info (call, infinite loop) for loops, pre-record it on the dominator tree so that we can ask "in any of our dominator children, is there a call or an infinite loop" and thus cut the dominator walk at loop header blocks that are not dominating the outer loop latch ... Of course the simplistic solution might be to simply do if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb) && ((loop_depth (bb->loop_father) - loop_depth (loop)) > param_max_lim_loop_depth_lookahead))) break; and thus limit the processing of conditionally executed inner loops by relative depth ... as you say the actual processing is unlikely to be the bottleneck for the degenerate cases of a very deep nest of conditionally executed loops. 
But still for this case get_loop_body_in_dom_order is doing quadratic processing so we can also say that another linear walk over the produced array does not increase complexity. > volatile int flag, bar; double foo (double *valp) { double sum = 0; for (int i = 0; i < 256; ++i) { for (int j = 0; j < 256; ++j) bar = flag; if (flag) sum += 1.; sum += *valp; } return sum; } The patch still fails to handle cases like this: struct X { int i; int j; int k;}; volatile int m; void bar (struct X *x, int n, int l, int k) { for (int i = 0; i < l; i++) { if (k) for (int j = 0; j < l; j++) { if (m) x->i = m; else x->i = 1 - m; int *r = &x->k; int tem2 = *r; x->k += tem2 * j; } x->i = m; } } x->i is still not marked ALWAYS_EXECUTED for the outer loop. Collected data when building gcc stage1 and bootstrap: there are still about 9% of bbs missed being marked ALWAYS_EXECUTED. Execution time of fill_always_executed_in_1 increases noticeably in stage1 but it is at the millisecond level, while the bootstrap build does