Re: [PATCH v2] rs6000: Add load density heuristic
Hi Segher,

Thanks for the comments!

on 2021/9/7 7:43 AM, Segher Boessenkool wrote:
> Hi!
>
> On Wed, Jul 28, 2021 at 10:59:50AM +0800, Kewen.Lin wrote:
+/* As a visitor function for each statement cost entry handled in
+   function add_stmt_cost, gather some information and update its
+   relevant fields in target cost accordingly.  */
>>>
>>> I got lost trying to parse that.. (could be just me :-)
>>>
>>> Possibly instead something like
>>> /* Helper function for add_stmt_cost; gather information and update
>>>    the target_cost fields accordingly.  */
>>
>> OK, will update.  I was thinking that for each entry handled in function
>> add_stmt_cost, this helper acts like a visitor, trying to visit each
>> entry and take some actions if some conditions are satisfied.
>
> It (thankfully!) has nothing to do with the "visitor pattern", so some
> other name might be better :-)
>
>>> Maybe clearer to read if you rearrange slightly and flatten it?  I
>>> defer to others on that..
>>>
>>>   if ((kind == vec_to_scalar
>>>        || kind == vec_perm
>>>        || kind == vec_promote_demote
>>>        || kind == vec_construct
>>>        || kind == scalar_to_vec)
>>>       || (kind == vector_stmt && where == vect_body)
>>
>> This hunk is factored out from function rs6000_add_stmt_cost; maybe I
>> can keep the original formatting?  The formatting tool isn't so smart,
>> and sometimes rearranges things to become unexpected (although it meets
>> the basic rule, not so elegant), sigh.
>
> It has too many parens, making grouping where there is none, that is the
> core issue.
>
>   if (kind == vec_to_scalar
>       || kind == vec_perm
>       || kind == vec_promote_demote
>       || kind == vec_construct
>       || kind == scalar_to_vec
>       || (kind == vector_stmt && where == vect_body))

Good catch, I've updated it in V4.

BR,
Kewen
PING^3 [PATCH v2] combine: Tweak the condition of last_set invalidation
Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572555.html

BR,
Kewen

on 2021/7/15 10:00 AM, Kewen.Lin via Gcc-patches wrote:
> Hi,
>
> Gentle ping this:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572555.html
>
> BR,
> Kewen
>
> on 2021/6/28 3:00 PM, Kewen.Lin via Gcc-patches wrote:
>> Hi!
>>
>> I'd like to gently ping this:
>>
>> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572555.html
>>
>> BR,
>> Kewen
>>
>> on 2021/6/11 9:16 PM, Kewen.Lin via Gcc-patches wrote:
>>> Hi Segher,
>>>
>>> Thanks for the review!
>>>
>>> on 2021/6/10 4:17 AM, Segher Boessenkool wrote:

Hi!

On Wed, Dec 16, 2020 at 04:49:49PM +0800, Kewen.Lin wrote:

> Currently we have the check:
>
>   if (!insn
>       || (value && rsp->last_set_table_tick >= label_tick_ebb_start))
>     rsp->last_set_invalid = 1;
>
> which means if we want to record some value for some reg and
> this reg got referred to before in a valid scope,

If we already know it is *set* in this same extended basic block.
Possibly by the same instruction btw.

> we invalidate the set of reg (last_set_invalid to 1).  It avoids
> finding the wrong set for one reg reference, such as the case like:
>
>   ... op regX  // this regX could find wrong last_set below
>   regX = ...   // if we think this set is valid
>   ... op regX

Yup, exactly.

> But because of retry's existence, the last_set_table_tick could
> be set by some later reference insns, but we see it's set due
> to retry on the set (for that reg) insn again, such as:
>
>   insn 1
>   insn 2
>
>   regX = ...   --> (a)
>   ... op regX  --> (b)
>
>   insn 3
>
>   // assume all in the same BB.
>
> Assuming we combine 1, 2 -> 3 successfully and replace them as two
> (3 insns -> 2 insns),

This will delete insn 1 and write the combined result to insns 2 and 3.

> retrying from insn1 or insn2 again:

Always 2, but your point remains valid.
> it will scan insn (a) again, the below condition holds for regX:
>
>   (value && rsp->last_set_table_tick >= label_tick_ebb_start)
>
> it will mark this set as an invalid set.  But actually the
> last_set_table_tick here is set by insn (b) before retrying, so it
> should be safe to be taken as a valid set.

Yup.

> This proposal is to check whether the recorded last_set_table_tick
> safely happens after the current set, and keep the set valid if so.
>
> Full SPEC2017 building shows this patch gets more successful combines,
> from 1902208 to 1902243 (trivial though).

Do you have some example, or maybe even a testcase?  :-)

>>> Sorry for the late reply, it took some time to get one reduced case.
>>>
>>> typedef struct SA *pa_t;
>>>
>>> struct SC {
>>>   int h;
>>>   pa_t elem[];
>>> };
>>>
>>> struct SD {
>>>   struct SC *e;
>>> };
>>>
>>> struct SA {
>>>   struct {
>>>     struct SD f[1];
>>>   } g;
>>> };
>>>
>>> void foo(pa_t *k, char **m) {
>>>   int l, i;
>>>   pa_t a;
>>>   l = (int)a->g.f[5].e;
>>>   i = 0;
>>>   for (; i < l; i++) {
>>>     k[i] = a->g.f[5].e->elem[i];
>>>     m[i] = "";
>>>   }
>>> }
>>>
>>> Baseline is r12-0 and the option is "-O3 -mcpu=power9
>>> -fno-strict-aliasing"; with this patch, the generated assembly can
>>> save two rlwinm s.

> + /* Record the luid of the insn whose expression involving register n.  */
> +
> + int last_set_table_luid;

"Record the luid of the insn for which last_set_table_tick was set",
right?

>>> But it can be updated later to one smaller luid; how about wording like:
>>>
>>> + /* Record the luid of the insn which uses register n, the insn should
>>> +    be the first one using register n in that block of the insn which
>>> +    last_set_table_tick was set for.  */

> -static void update_table_tick (rtx);
> +static void update_table_tick (rtx, int);

Please remove this declaration instead, the function is not used until
after its actual definition :-)

>>> Done.
>>>
> @@ -13243,7 +13247,21 @@ update_table_tick (rtx x)
>    for (r = regno; r < endregno; r++)
>      {
>        reg_stat_type *rsp = &reg_stat[r];
> -      rsp->last_set_table_tick = label_tick;
> +      if (rsp->last_set_table_tick >= label_tick_ebb_start)
> +        {
> +          /* Later references should not have lower ticks.  */
> +          gcc_assert (label_tick >= rsp->last_set_table_tick);

This should be obvious, but checking it won't hurt, okay.

> +          /* Should pick up the lowest luid if the references
> +             are in the same block.  */
> +          if (label_tick == rsp->last_set_table_tick
> +              && rsp->last_set_table_luid > insn_luid)
> +            rsp->last_
Re: [Committed] Fix fatal typo in gcc.dg/no_profile_instrument_function-attr-2.c
On 9/7/21 22:41, Hans-Peter Nilsson wrote:
> With r12-3379, the testsuite got such a fatal syntax error, causing
> the gcc test-run to abort at (e.g.):

Thank you for the fix!  I hadn't noticed the problem as I only grep for
'^FAIL' lines in the corresponding log files.

Cheers,
Martin
[PATCH] i386: Fix up @xorsign<mode>3_1 [PR102224]
Hi!

As the testcase shows, we miscompile @xorsign<mode>3_1 if both input
operands are in the same register, because the splitter overwrites op1
with op1 & mask before using op0.

For dest = xorsign op0, op0 we can actually simplify it from
dest = (op0 & mask) ^ op0 to dest = op0 & ~mask (aka abs).

The expander change is an optimization improvement: if we at expansion
time know it is xorsign op0, op0, we can emit abs right away and get
better code through that.

The @xorsign<mode>3_1 change is a fix for the case where xorsign wouldn't
be known to have the same operands during expansion, but during RTL
optimizations they would appear.  For non-AVX we need to use an
earlyclobber: we require dest and op1 to be the same, but op0 must be
different because we overwrite op1 first.  For AVX the constraints ensure
that at most 2 of the 3 operands may be the same register, and the new
alternative handles the case where both inputs are the same.  This case
can be easily tested with the xorsign<mode>3 expander change reverted.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

Thinking about it more this morning, while this patch fixes the problems
revealed in the testcase, the recent PR89984 change was buggy too, but
perhaps that can be fixed incrementally.  Because for AVX the new code
destructively modifies op1.
If that is different from dest, say on:

float
foo (float x, float y)
{
  return x * __builtin_copysignf (1.0f, y) + y;
}

then we get after RA:

(insn 8 7 9 2 (set (reg:SF 20 xmm0 [orig:82 _2 ] [82])
        (unspec:SF [
                (reg:SF 20 xmm0 [88])
                (reg:SF 21 xmm1 [89])
                (mem/u/c:V4SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0 S16 A128])
            ] UNSPEC_XORSIGN)) "hohoho.c":4:12 649 {xorsignsf3_1}
     (nil))
(insn 9 8 15 2 (set (reg:SF 20 xmm0 [87])
        (plus:SF (reg:SF 20 xmm0 [orig:82 _2 ] [82])
            (reg:SF 21 xmm1 [89]))) "hohoho.c":4:44 1021 {*fop_sf_comm}
     (nil))

but split the xorsign into:

        vandps  .LC0(%rip), %xmm1, %xmm1
        vxorps  %xmm0, %xmm1, %xmm0

and then the addition:

        vaddss  %xmm1, %xmm0, %xmm0

which means we miscompile it - instead of adding y in the end we add
__builtin_copysignf (0.0f, y).

So, wonder if we don't want instead in addition to the &Yv <- Yv, 0
alternative (enabled for both pre-AVX and AVX as in this patch) the
&Yv <- Yv, Yv where destination must be different from inputs and another
Yv <- Yv, Yv where it can be the same but then need a match_scratch
(with X for the other alternatives and =Yv for the last one).
That way we'd always have a safe register we can store the op1 & mask
value into: either the destination (in the first alternative known to be
equal to op1, which is needed for non-AVX but ok for AVX too), in the
second alternative known to be different from both inputs, and in the
third, which could be used for those

float bar (float x, float y) { return x * __builtin_copysignf (1.0f, y); }

cases where op1 is naturally xmm1 and dest == op0 naturally xmm0, we'd
use some other register like xmm2.

2021-09-07  Jakub Jelinek

	PR target/102224
	* config/i386/i386.md (xorsign<mode>3): If operands[1] is equal to
	operands[2], emit abs<mode>2 instead.
	(@xorsign<mode>3_1): Add early-clobbers for output operand, enable
	first alternative even for avx, add another alternative with
	=&Yv <- 0, Yv, Yvm constraints.
	* config/i386/i386-expand.c (ix86_split_xorsign): If op0 is equal
	to op1, emit vpandn instead.
	* gcc.dg/pr102224.c: New test.
	* gcc.target/i386/avx-pr102224.c: New test.

--- gcc/config/i386/i386.md.jj	2021-09-06 14:47:43.199049975 +0200
+++ gcc/config/i386/i386.md	2021-09-07 13:13:50.825603852 +0200
@@ -10803,21 +10803,27 @@ (define_expand "xorsign<mode>3"
    (match_operand:MODEF 1 "register_operand")
    (match_operand:MODEF 2 "register_operand")]
   "SSE_FLOAT_MODE_P (<MODE>mode) && TARGET_SSE_MATH"
-  "ix86_expand_xorsign (operands); DONE;")
+{
+  if (rtx_equal_p (operands[1], operands[2]))
+    emit_insn (gen_abs<mode>2 (operands[0], operands[1]));
+  else
+    ix86_expand_xorsign (operands);
+  DONE;
+})

 (define_insn_and_split "@xorsign<mode>3_1"
-  [(set (match_operand:MODEF 0 "register_operand" "=Yv,Yv")
+  [(set (match_operand:MODEF 0 "register_operand" "=&Yv,&Yv,&Yv")
	(unspec:MODEF
-	  [(match_operand:MODEF 1 "register_operand" "Yv,Yv")
-	   (match_operand:MODEF 2 "register_operand" "0,Yv")
-	   (match_operand:<ssevecmode> 3 "nonimmediate_operand" "Yvm,Yvm")]
+	  [(match_operand:MODEF 1 "register_operand" "Yv,0,Yv")
+	   (match_operand:MODEF 2 "register_operand" "0,Yv,Yv")
+	   (match_operand:<ssevecmode> 3 "nonimmediate_operand" "Yvm,Yvm,Yvm")]
	  UNSPEC_XORSIGN))]
   "SSE_FLOAT_MODE_P (<MODE>mode) && TARGET_SSE_MATH"
   "#"
   "&& reload_completed"
   [(const_int 0)]
   "ix86_split_xorsign (operands); DONE;"
-  [(set_attr "isa" "noavx,avx")])
+  [(set_attr "isa" "*,avx,avx")])

 ;; One complement instructions

--- gcc/config/i386/i386-expan
[PATCH v2] ipa-inline: Add target info into fn summary [PR102059]
Hi!

Power ISA 2.07 (Power8) introduces the transactional memory feature but
ISA 3.1 (Power10) removes it.  It exposes one troublesome issue, as
PR102059 shows: users define some function with target pragma
cpu=power10, which then calls one function with attribute always_inline
that inherits the command line option cpu=power8, which enables HTM
implicitly.  The current isa_flags check doesn't allow this inlining due
to "target specific option mismatch" and an error message is emitted.
Normally, the callee function isn't intended to exploit the HTM feature,
but the default flag setting makes it look like it does.

As Richi raised in the PR, we have the fp_expressions flag in the
function summary, which allows us to check whether the function actually
contains any floating point expressions, to avoid overkill.  So this
patch follows the similar idea but is more target specific: for this
rs6000 port specific requirement on the HTM feature check, we would like
to check rs6000 specific HTM built-in functions and inline assembly.  It
allows targets to do their own customized checks and updates.

It introduces two target hooks, need_ipa_fn_target_info and
update_ipa_fn_target_info.  The former allows a target to do some initial
checks and decide whether to collect target specific information for this
function or not.  For some special cases, it can predict the analysis
result and push it early without any scanning.  The latter allows
analyze_function_body to pass gimple stmts down, just like the
fp_expressions handling; the target can do its own tricks.  I initially
put them as one hook with one boolean to indicate whether it's the
initial time, but the code looked a bit ugly; separating them seems to
have better readability.

To make it simple, this patch uses HOST_WIDE_INT to record the flags,
just like what we use for isa_flags.  For rs6000's HTM need, one
HOST_WIDE_INT variable is quite enough, but it seems good to have one
auto_vec for scalability, as I noticed some targets have more than one
HOST_WIDE_INT flag.
Against v1 [1], this v2 addressed Richi's and Segher's review comments,
and mainly consists of:

  - Extend it to cover non always_inline.
  - Exclude the case for offload streaming.
  - Some function naming and formatting issues.
  - Adjust rs6000_can_inline_p.
  - Add new test cases.

The patch has been bootstrapped and regress-tested on
powerpc64le-linux-gnu Power9.

Any comments are highly appreciated!

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578555.html

BR,
Kewen
-----

gcc/ChangeLog:

	PR ipa/102059
	* config/rs6000/rs6000-call.c (rs6000_fn_has_any_of_these_mask_bits):
	New function.
	* config/rs6000/rs6000-internal.h (rs6000_fn_has_any_of_these_mask_bits):
	New declare.
	* config/rs6000/rs6000.c (TARGET_NEED_IPA_FN_TARGET_INFO): New
	macro.
	(TARGET_UPDATE_IPA_FN_TARGET_INFO): Likewise.
	(rs6000_need_ipa_fn_target_info): New function.
	(rs6000_update_ipa_fn_target_info): Likewise.
	(rs6000_can_inline_p): Adjust for ipa function summary target
	info.
	* ipa-fnsummary.c (ipa_dump_fn_summary): Adjust for ipa function
	summary target info.
	(analyze_function_body): Adjust for ipa function summary target
	info and call hook rs6000_need_ipa_fn_target_info and
	rs6000_update_ipa_fn_target_info.
	(ipa_merge_fn_summary_after_inlining): Adjust for ipa function
	summary target info.
	(inline_read_section): Likewise.
	(ipa_fn_summary_write): Likewise.
	* ipa-fnsummary.h (ipa_fn_summary::target_info): New member.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in (TARGET_UPDATE_IPA_FN_TARGET_INFO): Document new
	hook.
	(TARGET_NEED_IPA_FN_TARGET_INFO): Likewise.
	* target.def (update_ipa_fn_target_info): New hook.
	(need_ipa_fn_target_info): Likewise.
	* targhooks.c (default_need_ipa_fn_target_info): New function.
	(default_update_ipa_fn_target_info): Likewise.
	* targhooks.h (default_update_ipa_fn_target_info): New declare.
	(default_need_ipa_fn_target_info): Likewise.

gcc/testsuite/ChangeLog:

	PR ipa/102059
	* gcc.dg/lto/pr102059-1_0.c: New test.
	* gcc.dg/lto/pr102059-1_1.c: New test.
	* gcc.dg/lto/pr102059-1_2.c: New test.
	* gcc.dg/lto/pr102059-2_0.c: New test.
	* gcc.dg/lto/pr102059-2_1.c: New test.
	* gcc.dg/lto/pr102059-2_2.c: New test.
	* gcc.target/powerpc/pr102059-5.c: New test.
	* gcc.target/powerpc/pr102059-6.c: New test.
	* gcc.target/powerpc/pr102059-7.c: New test.

diff --git a/gcc/config/rs6000/rs6000-call.c b/gcc/config/rs6000/rs6000-call.c
index fd7f24da818..6af0d15ed87 100644
--- a/gcc/config/rs6000/rs6000-call.c
+++ b/gcc/config/rs6000/rs6000-call.c
@@ -13795,6 +13795,18 @@ rs6000_builtin_decl (unsigned code, bool initialize_p ATTRIBUTE_UNUSED)
   return rs6000_builtin_decls[code];
 }

+/* Return true if the builtin with CODE has any mask bits set
+   which are specified by
Re: [PATCH 04/13] arm: Add GENERAL_AND_VPR_REGS regclass
On 07/09/2021 15:35, Richard Earnshaw wrote:

On 07/09/2021 13:05, Christophe LYON wrote:

On 07/09/2021 11:42, Richard Earnshaw wrote:

On 07/09/2021 10:15, Christophe Lyon via Gcc-patches wrote:

At some point during the development of this patch series, it appeared
that in some cases the register allocator wants "VPR or general" rather
than "VPR or general or FP" (which is the same thing as ALL_REGS).
The series does not seem to require this anymore, but it seems to be a
good thing to do anyway, to give the register allocator more freedom.

2021-09-01  Christophe Lyon

gcc/
	* config/arm/arm.h (reg_class): Add GENERAL_AND_VPR_REGS.
	(REG_CLASS_NAMES): Likewise.
	(REG_CLASS_CONTENTS): Likewise.  Add VPR_REG to ALL_REGS.

diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index 015299c1534..fab39d05916 100644
--- a/gcc/config/arm/arm.h
+++ b/gcc/config/arm/arm.h
@@ -1286,6 +1286,7 @@ enum reg_class
   SFP_REG,
   AFP_REG,
   VPR_REG,
+  GENERAL_AND_VPR_REGS,
   ALL_REGS,
   LIM_REG_CLASSES
 };
@@ -1315,6 +1316,7 @@ enum reg_class
   "SFP_REG", \
   "AFP_REG", \
   "VPR_REG", \
+  "GENERAL_AND_VPR_REGS", \
   "ALL_REGS" \
 }
@@ -1343,7 +1345,8 @@ enum reg_class
   { 0x, 0x, 0x, 0x0040 }, /* SFP_REG */ \
   { 0x, 0x, 0x, 0x0080 }, /* AFP_REG */ \
   { 0x, 0x, 0x, 0x0400 }, /* VPR_REG. */ \
-  { 0x7FFF, 0x, 0x, 0x000F } /* ALL_REGS. */ \
+  { 0x5FFF, 0x, 0x, 0x0400 }, /* GENERAL_AND_VPR_REGS. */ \
+  { 0x7FFF, 0x, 0x, 0x040F } /* ALL_REGS. */ \
 }

You've changed the definition of ALL_REGS here (to include VPR_REG), but
not really explained why.  Is that the source of the underlying issue
with the 'appeared' you mention?

I first added VPR_REG to ALL_REGS, but Richard Sandiford suggested I
create a new GENERAL_AND_VPR_REGS that would be more restrictive.

I did not remove VPR_REG from ALL_REGS because I thought it was an
omission: shouldn't ALL_REGS contain all registers?

Surely that should be a separate patch then.

OK, I can remove that line from this patch and make a separate one-liner
for ALL_REGS.
Thanks,

Christophe

R.

R.

#define FP_SYSREGS \
Re: [PATCH v4] rs6000: Add load density heuristic
on 2021/9/8 2:57 PM, Kewen.Lin via Gcc-patches wrote:
> Hi Bill,
>
> Thanks for the review comments!
>
> on 2021/9/3 11:57 PM, Bill Schmidt wrote:
>> Hi Kewen,
>>
>> Sorry that we lost track of this patch!  The heuristic approach looks
>> good.  It is limited in scope and won't kick in often, and the case
>> you're trying to account for is important.
>>
>> At the time you submitted this, I think reliable P10 testing wasn't
>> possible.  Now that it is, could you please do a quick sniff test to
>> make sure there aren't any adjustments that need to be made for P10?
>> I doubt it, but worth checking.
>>
>
> Good point, thanks for the reminder!  I did one SPEC2017 full run on
> Power10 with Ofast unroll; this patch is neutral.  I also did one
> SPEC2017 run at O2 vectorization (cheap cost) to verify whether the
> bwaves_r degradation existed and whether it can be fixed by this
> patch.  The result shows the degradation did exist and got fixed by
> this patch; besides, it got an extra 3.93% speedup against O2, and
> another benchmark, 554.roms_r, got a 3.24% speedup.
>

hmm, sorry, this improvement on 554.roms_r looks not reliable: I just
ran it with 10 iterations both with and without the patch, both suffer
from jitter, and their best scores are close.  But note that the
bwaves_r scores are quite stable, so that result is reliable.

BR,
Kewen

> In short, the Power10 evaluation result shows this patch is positive.
>
>> Otherwise I have one comment below...
>>
>> On 7/28/21 12:22 AM, Kewen.Lin wrote:
>>> Hi,
>>>
>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
>>>
>>> This v3 addressed William's review comments in
>>> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576154.html
>>>
>>> It's mainly to deal with the bwaves_r degradation due to vector
>>> construction fed by strided loads.
>>>
>>> As Richi's comments [1] suggest, this follows the similar idea to
>>> over-price the vector construction fed by VMAT_ELEMENTWISE or
>>> VMAT_STRIDED_SLP.
>>> Instead of adding the extra cost on vector
>>> construction costing immediately, it firstly records how many loads
>>> and vectorized statements are in the given loop; later in
>>> rs6000_density_test (called by finish_cost) it computes the load
>>> density ratio against all vectorized stmts, checks it against the
>>> corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD and
>>> DENSITY_LOAD_PCT_THRESHOLD, and does the actual extra pricing if
>>> both thresholds are exceeded.
>>>
>>> Note that this new load density heuristic check is based on some
>>> fields in target cost which are updated as needed when scanning
>>> each add_stmt_cost entry; it's independent of the current function
>>> rs6000_density_test, which requires scanning non_vect stmts.  Since
>>> it's checking the load stmts count vs. all vectorized stmts, it's a
>>> kind of density, so I put it in function rs6000_density_test.  With
>>> the same reason to keep it independent, I didn't put it as an else
>>> arm of the current existing density threshold check hunk or before
>>> this hunk.
>>>
>>> In the investigation of the -1.04% degradation from 526.blender_r
>>> on Power8, I noticed that the extra penalized cost 320 on one
>>> single vector construction with type V16QI is much exaggerated,
>>> which makes the final body cost unreliable, so this patch adds one
>>> maximum bound for the extra penalized cost for each vector
>>> construction statement.
>>>
>>> Bootstrapped & regtested *again* on powerpc64le-linux-gnu P9.
>>>
>>> Full SPEC2017 performance evaluation on Power8/Power9 with option
>>> combinations (with v2, as v3 is NFC against v2):
>>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>>   * {-O3, -Ofast} {,-funroll-loops}
>>>
>>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>>> remarkable was observed.
>>>

> ...

>>> + /* Gather some information when we are costing the vectorized
>>> +    instruction for the statements located in a loop body.
>>> */
>>> + if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>>> +   {
>>> +     data->nstmts += orig_count;
>>> +
>>> +     if (kind == scalar_load || kind == vector_load
>>> +         || kind == unaligned_load || kind == vector_gather_load)
>>> +       data->nloads += orig_count;
>>> +
>>> +     /* If we have strided or elementwise loads into a vector, it's
>>> +        possible to be bounded by latency and execution resources for
>>> +        many scalar loads.  Try to account for this by scaling the
>>> +        construction cost by the number of elements involved, when
>>> +        handling each matching statement we record the possible extra
>>> +        penalized cost into target cost, in the end of costing for
>>> +        the whole loop, we do the actual penalization once some load
>>> +        density heuristics are satisfied.  */
>>
>> The above comment is quite hard to read.  Can you please break up the
>> last sentence into at least two sentences?
>>
>
> How about the below:
>
> + /* If we have strided or elementwise loads into a vector,
[PATCH] testsuite: Remove .exe suffix in prune_gcc_output
When running the testsuite under Windows, we noticed failures in
testcases which attempt to match compiler error messages containing the
name of the executable.  For instance, gcc.dg/analyzer/signal-4a.c tries
to match 'cc1:', which obviously fails when the executable is called
cc1.exe.

This patch removes the .exe suffix from various toolchain executables to
avoid this problem.

2021-09-08  Christophe Lyon
	    Torbjörn SVENSSON

gcc/testsuite/
	* lib/prune.exp (prune_gcc_output): Remove .exe suffix from
	toolchain executables names.
---
 gcc/testsuite/lib/prune.exp | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gcc/testsuite/lib/prune.exp b/gcc/testsuite/lib/prune.exp
index 91f165bec38..fac212ecf60 100644
--- a/gcc/testsuite/lib/prune.exp
+++ b/gcc/testsuite/lib/prune.exp
@@ -37,6 +37,9 @@ proc prune_gcc_output { text } {
     # Handle any freeform regexps.
     set text [handle-dg-regexps $text]

+    # Remove Windows .exe suffix
+    regsub -all "(as|cc1|cc1plus|collect2|f951|ld|lto-wrapper)\.exe?:" $text {\1:} text
+
     regsub -all "(^|\n)(\[^\n\]*: \[iI\]|I)n ((static member |lambda )?function|member|method|(copy )?constructor|destructor|instantiation|substitution|program|subroutine|block-data)\[^\n\]*" $text "" text
     regsub -all "(^|\n)\[^\n\]*(: )?At (top level|global scope):\[^\n\]*" $text "" text
     regsub -all "(^|\n)\[^\n\]*: (recursively )?required \[^\n\]*" $text "" text
-- 
2.25.1
Re: [PATCH] libgcc, i386: Export *hf* and *hc* from libgcc_s.so.1
On Wed, Sep 08, 2021 at 10:37:17AM +0800, Hongtao Liu wrote:
> Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> Ok for trunk?
>
> libgcc/ChangeLog:
>
> 	* config/i386/t-softfp: Compile __{mul,div}hc3 into
> 	libgcc_s.so.1.

I think this is ok, but not really useful until the *.ver change is
acked, because what this patch alone does is add extra non-exported
entrypoints to the shared library (aka wasted .text).

> libgcc/config/i386/t-softfp | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/libgcc/config/i386/t-softfp b/libgcc/config/i386/t-softfp
> index 2363ea17194..7620cc0cec5 100644
> --- a/libgcc/config/i386/t-softfp
> +++ b/libgcc/config/i386/t-softfp
> @@ -2,9 +2,8 @@ LIB2ADD += $(srcdir)/config/i386/sfp-exceptions.c
>
> # Replace _divhc3 and _mulhc3.
> libgcc2-hf-functions = _divhc3 _mulhc3
> -LIB2FUNCS_EXCLUDE += $(libgcc2-hf-functions)
> libgcc2-hf-extras = $(addsuffix .c, $(libgcc2-hf-functions))
> -LIB2ADD_ST += $(addprefix $(srcdir)/config/i386/, $(libgcc2-hf-extras))
> +LIB2ADD += $(addprefix $(srcdir)/config/i386/, $(libgcc2-hf-extras))
>
> softfp_extensions := hfsf hfdf hftf hfxf sfdf sftf dftf xftf
> softfp_truncations := tfhf xfhf dfhf sfhf tfsf dfsf tfdf tfxf

	Jakub
[PATCH] c++/102228 - make lookup_anon_field O(1)
For the testcase in PR101555, lookup_anon_field takes the majority of
parsing time, followed by get_class_binding_direct/fields_linear_search,
which is PR83309.  The situation with anon aggregates is particularly
dire when we need to build accesses to their members and the anon
aggregates are nested.  There, for each such access, we recursively
build sub-accesses to the anon aggregate FIELD_DECLs bottom-up, DFS
searching for them.  That's inefficient since, as I believe, there's a
1:1 relationship between anon aggregate types and the FIELD_DECL used
to place them.

The patch below does away with the search in lookup_anon_field and
instead records the single FIELD_DECL in the anon aggregate type's
lang-specific data, re-using the RTTI typeinfo_var field.  That speeds
up the compile of the testcase with -fsyntax-only from about 4.5s to
slightly less than 1s.

I tried to poke holes into the 1:1 relationship idea with my C++
knowledge but failed (which might not say much).  It also leaves a hole
for the case when the C++ FE itself duplicates such a type and places it
at a semantically different position.  I've tried to poke holes into it
with the duplication mechanism I understand (templates) but failed.

Bootstrapped and tested on x86_64-unknown-linux-gnu.  OK for trunk?

Thanks,
Richard.

2021-09-08  Richard Biener

	PR c++/102228
	* cp-tree.h (ANON_AGGR_TYPE_FIELD): New define.
	* decl.c (fixup_anonymous_aggr): Wipe RTTI info put in place
	on invalid code.
	* decl2.c (reset_type_linkage): Guard CLASSTYPE_TYPEINFO_VAR
	access.
	* module.cc (trees_in::read_class_def): Likewise.  Reconstruct
	ANON_AGGR_TYPE_FIELD.
	* semantics.c (finish_member_declaration): Populate
	ANON_AGGR_TYPE_FIELD for anon aggregate typed members.
	* typeck.c (lookup_anon_field): Remove DFS search and return
	ANON_AGGR_TYPE_FIELD directly.
---
 gcc/cp/cp-tree.h   |  4
 gcc/cp/decl.c      |  3 +++
 gcc/cp/decl2.c     | 17 +
 gcc/cp/module.cc   | 10 --
 gcc/cp/semantics.c |  8
 gcc/cp/typeck.c    | 32 +---
 6 files changed, 37 insertions(+), 37 deletions(-)

diff --git a/gcc/cp/cp-tree.h b/gcc/cp/cp-tree.h
index 435f32bf909..ceb53591926 100644
--- a/gcc/cp/cp-tree.h
+++ b/gcc/cp/cp-tree.h
@@ -4800,6 +4800,10 @@ more_aggr_init_expr_args_p (const aggr_init_expr_arg_iterator *iter)
 #define ANON_UNION_TYPE_P(NODE) \
   (TREE_CODE (NODE) == UNION_TYPE && ANON_AGGR_TYPE_P (NODE))

+/* For an ANON_AGGR_TYPE_P the single FIELD_DECL it is used with.  */
+#define ANON_AGGR_TYPE_FIELD(NODE) \
+  (LANG_TYPE_CLASS_CHECK (NODE)->typeinfo_var)
+
 /* Define fields and accessors for nodes representing declared names.  */

 /* True if TYPE is an unnamed structured type with a typedef for

diff --git a/gcc/cp/decl.c b/gcc/cp/decl.c
index 146979ba96b..bce62ad202a 100644
--- a/gcc/cp/decl.c
+++ b/gcc/cp/decl.c
@@ -5096,6 +5096,9 @@ fixup_anonymous_aggr (tree t)
       (*vec)[store++] = elt;
   vec_safe_truncate (vec, store);

+  /* Wipe RTTI info.  */
+  CLASSTYPE_TYPEINFO_VAR (t) = NULL_TREE;
+
   /* Anonymous aggregates cannot have fields with ctors, dtors or complex
      assignment operators (because they cannot have these methods themselves).
     For anonymous unions this is already checked because they are not allowed

diff --git a/gcc/cp/decl2.c b/gcc/cp/decl2.c
index 107edcaaccf..a79a70ba9c2 100644
--- a/gcc/cp/decl2.c
+++ b/gcc/cp/decl2.c
@@ -2977,14 +2977,15 @@ reset_type_linkage (tree type)
	  SET_DECL_ASSEMBLER_NAME (vt, name);
	  reset_decl_linkage (vt);
	}
-      if (tree ti = CLASSTYPE_TYPEINFO_VAR (type))
-	{
-	  tree name = mangle_typeinfo_for_type (type);
-	  DECL_NAME (ti) = name;
-	  SET_DECL_ASSEMBLER_NAME (ti, name);
-	  TREE_TYPE (name) = type;
-	  reset_decl_linkage (ti);
-	}
+      if (!ANON_AGGR_TYPE_P (type))
+	if (tree ti = CLASSTYPE_TYPEINFO_VAR (type))
+	  {
+	    tree name = mangle_typeinfo_for_type (type);
+	    DECL_NAME (ti) = name;
+	    SET_DECL_ASSEMBLER_NAME (ti, name);
+	    TREE_TYPE (name) = type;
+	    reset_decl_linkage (ti);
+	  }
       for (tree m = TYPE_FIELDS (type); m; m = DECL_CHAIN (m))
	{
	  tree mem = STRIP_TEMPLATE (m);

diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc
index 4b2ad6f3db8..71d0fab411f 100644
--- a/gcc/cp/module.cc
+++ b/gcc/cp/module.cc
@@ -11859,8 +11859,9 @@ trees_in::read_class_def (tree defn, tree maybe_template)
	{
	  CLASSTYPE_BEFRIENDING_CLASSES (type_dup)
	    = CLASSTYPE_BEFRIENDING_CLASSES (type);
-	  CLASSTYPE_TYPEINFO_VAR (type_dup)
-	    = CLASSTYPE_TYPEINFO_VAR (type);
+	  if (!ANON_AGGR_TYPE_P (type))
+	    CLASSTYPE_TYPEINFO_VAR (type_dup)
+	      = CLASSTYPE_TYPEINFO_VAR (type);
Re: [PATCH] i386: Fix up @xorsign<mode>3_1 [PR102224]
On Wed, Sep 8, 2021 at 3:43 PM Jakub Jelinek via Gcc-patches wrote:
>
> Hi!
>
> As the testcase shows, we miscompile @xorsign<mode>3_1 if both input
> operands are in the same register, because the splitter overwrites op1
> with op1 & mask before using op0.
>
> For dest = xorsign op0, op0 we can actually simplify it from
> dest = (op0 & mask) ^ op0 to dest = op0 & ~mask (aka abs).
>
> The expander change is an optimization improvement, if we at expansion
> time know it is xorsign op0, op0, we can emit abs right away and get better
> code through that.
>
> The @xorsign<mode>3_1 is a fix for the case where xorsign wouldn't be known
> to have same operands during expansion, but during RTL optimizations they
> would appear.  For non-AVX we need to use earlyclobber, we require
> dest and op1 to be the same but op0 must be different because we overwrite
> op1 first.  For AVX the constraints ensure that at most 2 of the 3 operands
> may be the same register and if both inputs are the same, handles that case.
> This case can be easily tested with the xorsign<mode>3 expander change
> reverted.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>

Patch LGTM.

PS: I'm curious why we need the post_reload splitter @xorsign<mode>3_1
for scalar modes; can't we just expand them into and/xor operations in
the expander, just like the vector modes do?  Let me do some experiments
to see whether it is ok to remove the splitter.

> Thinking about it more this morning, while this patch fixes the problems
> revealed in the testcase, the recent PR89984 change was buggy too, but
> perhaps that can be fixed incrementally.  Because for AVX the new code
> destructively modifies op1.
> If that is different from dest, say on:
> float
> foo (float x, float y)
> {
>   return x * __builtin_copysignf (1.0f, y) + y;
> }
> then we get after RA:
> (insn 8 7 9 2 (set (reg:SF 20 xmm0 [orig:82 _2 ] [82])
>         (unspec:SF [
>                 (reg:SF 20 xmm0 [88])
>                 (reg:SF 21 xmm1 [89])
>                 (mem/u/c:V4SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0 S16 A128])
>             ] UNSPEC_XORSIGN)) "hohoho.c":4:12 649 {xorsignsf3_1}
>      (nil))
> (insn 9 8 15 2 (set (reg:SF 20 xmm0 [87])
>         (plus:SF (reg:SF 20 xmm0 [orig:82 _2 ] [82])
>             (reg:SF 21 xmm1 [89]))) "hohoho.c":4:44 1021 {*fop_sf_comm}
>      (nil))
> but split the xorsign into:
>         vandps  .LC0(%rip), %xmm1, %xmm1
>         vxorps  %xmm0, %xmm1, %xmm0
> and then the addition:
>         vaddss  %xmm1, %xmm0, %xmm0
> which means we miscompile it - instead of adding y in the end we add
> __builtin_copysignf (0.0f, y).
> So, wonder if we don't want instead in addition to the &Yv <- Yv, 0
> alternative (enabled for both pre-AVX and AVX as in this patch) the
> &Yv <- Yv, Yv where destination must be different from inputs and another
> Yv <- Yv, Yv where it can be the same but then need a match_scratch
> (with X for the other alternatives and =Yv for the last one).
> That way we'd always have a safe register we can store the op1 & mask
> value into, either the destination (in the first alternative known to
> be equal to op1 which is needed for non-AVX but ok for AVX too), in the
> second alternative known to be different from both inputs and in the third
> which could be used for those
> float bar (float x, float y) { return x * __builtin_copysignf (1.0f, y); }
> cases where op1 is naturally xmm1 and dest == op0 naturally xmm0 we'd use
> some other register like xmm2.
>
> 2021-09-07  Jakub Jelinek
>
> 	PR target/102224
> 	* config/i386/i386.md (xorsign<mode>3): If operands[1] is equal to
> 	operands[2], emit abs<mode>2 instead.
> 	(@xorsign<mode>3_1): Add early-clobbers for output operand, enable
> 	first alternative even for avx, add another alternative with
> 	=&Yv <- 0, Yv, Yvm constraints.
> * config/i386/i386-expand.c (ix86_split_xorsign): If op0 is equal > to op1, emit vpandn instead. > > * gcc.dg/pr102224.c: New test. > * gcc.target/i386/avx-pr102224.c: New test. > > --- gcc/config/i386/i386.md.jj 2021-09-06 14:47:43.199049975 +0200 > +++ gcc/config/i386/i386.md 2021-09-07 13:13:50.825603852 +0200 > @@ -10803,21 +10803,27 @@ (define_expand "xorsign3" > (match_operand:MODEF 1 "register_operand") > (match_operand:MODEF 2 "register_operand")] >"SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH" > - "ix86_expand_xorsign (operands); DONE;") > +{ > + if (rtx_equal_p (operands[1], operands[2])) > +emit_insn (gen_abs2 (operands[0], operands[1])); > + else > +ix86_expand_xorsign (operands); > + DONE; > +}) > > (define_insn_and_split "@xorsign3_1" > - [(set (match_operand:MODEF 0 "register_operand" "=Yv,Yv") > + [(set (match_operand:MODEF 0 "register_operand" "=&Yv,&Yv,&Yv") > (unspec:MODEF > - [(match_operand:MODEF 1 "register_operand" "Yv,Yv") > - (match_operand:MODEF 2 "register_operand" "0,Yv") > - (match_operand: 3
Re: [PATCH] libgcc, i386: Export *hf* and *hc* from libgcc_s.so.1
On Wed, Sep 8, 2021 at 5:09 PM Jakub Jelinek wrote: > > On Wed, Sep 08, 2021 at 10:37:17AM +0800, Hongtao Liu wrote: > > Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. > > Ok for trunk? > > > > libgcc/ChangeLog: > > > > * config/i386/t-softfp: Compile __{mul,div}hc3 into > > libgcc_s.so.1. > > I think this is ok, but not really useful until the *.ver change is acked, > because what this patch alone does is add extra non-exported entrypoints to > the shared library (aka wasted .text). I think your patch is ok (with divhc3/mulhc3), we can check in the respective patches. :) > > > libgcc/config/i386/t-softfp | 3 +-- > > 1 file changed, 1 insertion(+), 2 deletions(-) > > > > diff --git a/libgcc/config/i386/t-softfp b/libgcc/config/i386/t-softfp > > index 2363ea17194..7620cc0cec5 100644 > > --- a/libgcc/config/i386/t-softfp > > +++ b/libgcc/config/i386/t-softfp > > @@ -2,9 +2,8 @@ LIB2ADD += $(srcdir)/config/i386/sfp-exceptions.c > > > > # Replace _divhc3 and _mulhc3. > > libgcc2-hf-functions = _divhc3 _mulhc3 > > -LIB2FUNCS_EXCLUDE += $(libgcc2-hf-functions) > > libgcc2-hf-extras = $(addsuffix .c, $(libgcc2-hf-functions)) > > -LIB2ADD_ST += $(addprefix $(srcdir)/config/i386/, $(libgcc2-hf-extras)) > > +LIB2ADD += $(addprefix $(srcdir)/config/i386/, $(libgcc2-hf-extras)) > > > > softfp_extensions := hfsf hfdf hftf hfxf sfdf sftf dftf xftf > > softfp_truncations := tfhf xfhf dfhf sfhf tfsf dfsf tfdf tfxf > > Jakub > -- BR, Hongtao
Re: [PATCH] i386: Fix up @xorsign<mode>3_1 [PR102224]
On Wed, Sep 08, 2021 at 05:23:40PM +0800, Hongtao Liu wrote: > > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? > > > Patch LGTM. Thanks, committed. > PS: > I'm curious why we need the post_reload splitter @xorsign3_1 > for scalar mode, can't we just expand them into and/xor operations in > the expander, just like vector modes did. > Let me do some experiments to see whether it is ok to remove the splitter. I bet it is the question how should the code look like before RA. stv is somewhat related, but as that replaces whole chains, it can do: (insn 14 5 6 2 (set (subreg:V2DI (reg:DI 92) 0) (vec_concat:V2DI (mem/c:DI (symbol_ref:SI ("c") [flags 0x2] ) [1 c+0 S8 A64]) (const_int 0 [0]))) "hohohou.c":6:9 -1 (nil)) on loads of memory. But it stv still does use paradoxical subregs: (insn 10 16 11 2 (set (subreg:V2DI (reg:DI 91) 0) (minus:V2DI (subreg:V2DI (reg:DI 87) 0) (subreg:V2DI (reg:DI 94) 0))) "hohohou.c":6:13 5003 {*subv2di3} (expr_list:REG_DEAD (reg:DI 87) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil (insn 11 10 0 2 (set (mem/c:DI (symbol_ref:SI ("a") [flags 0x2] ) [1 a+0 S8 A64]) (reg:DI 91)) "hohohou.c":6:5 76 {*movdi_internal} (expr_list:REG_DEAD (reg:DI 91) (nil))) so perhaps just using paradoxical subregs everywhere would be ok? Jakub
Re: [PATCH v2] c-family: Add __builtin_assoc_barrier
On Tuesday, 7 September 2021 19:36:22 CEST Jason Merrill wrote: > > case PAREN_EXPR: > > - RETURN (finish_parenthesized_expr (RECUR (TREE_OPERAND (t, 0; > > + if (REF_PARENTHESIZED_P (t)) > > + RETURN (finish_parenthesized_expr (RECUR (TREE_OPERAND (t, 0; > > + else > > + RETURN (RECUR (TREE_OPERAND (t, 0))); > > I think you need to build a new PAREN_EXPR in the assoc barrier case as > well, for it to have any effect in templates. My intent was to ignore __builtin_assoc_barrier in templates / constexpr evaluation since it's not affected by -fassociative-math anyway. Or do you mean something else? > Please also add a comment mentioning __builtin_assoc_barrier. I added a comment to that effect to both the cp/pt.c and cp/constexpr.c changes. New patch attached. -- ── Dr. Matthias Kretz https://mattkretz.github.io GSI Helmholtz Centre for Heavy Ion Research https://gsi.de stdₓ::simd ── diff --git a/gcc/c-family/c-common.c b/gcc/c-family/c-common.c index 681fcc972f4..c62a6398a47 100644 --- a/gcc/c-family/c-common.c +++ b/gcc/c-family/c-common.c @@ -384,6 +384,7 @@ const struct c_common_resword c_common_reswords[] = { "__builtin_convertvector", RID_BUILTIN_CONVERTVECTOR, 0 }, { "__builtin_has_attribute", RID_BUILTIN_HAS_ATTRIBUTE, 0 }, { "__builtin_launder", RID_BUILTIN_LAUNDER, D_CXXONLY }, + { "__builtin_assoc_barrier", RID_BUILTIN_ASSOC_BARRIER, 0 }, { "__builtin_shuffle", RID_BUILTIN_SHUFFLE, 0 }, { "__builtin_shufflevector", RID_BUILTIN_SHUFFLEVECTOR, 0 }, { "__builtin_tgmath", RID_BUILTIN_TGMATH, D_CONLY }, diff --git a/gcc/c-family/c-common.h b/gcc/c-family/c-common.h index 50ca8fb6ebd..f34dc47c2ba 100644 --- a/gcc/c-family/c-common.h +++ b/gcc/c-family/c-common.h @@ -108,7 +108,7 @@ enum rid RID_EXTENSION, RID_IMAGPART, RID_REALPART, RID_LABEL, RID_CHOOSE_EXPR, RID_TYPES_COMPATIBLE_P, RID_BUILTIN_COMPLEX, RID_BUILTIN_SHUFFLE, RID_BUILTIN_SHUFFLEVECTOR, RID_BUILTIN_CONVERTVECTOR, RID_BUILTIN_TGMATH, - RID_BUILTIN_HAS_ATTRIBUTE, + RID_BUILTIN_HAS_ATTRIBUTE, 
RID_BUILTIN_ASSOC_BARRIER, RID_DFLOAT32, RID_DFLOAT64, RID_DFLOAT128, /* TS 18661-3 keywords, in the same sequence as the TI_* values. */ diff --git a/gcc/c/c-decl.c b/gcc/c/c-decl.c index 983d65e930c..dcf4a2d7c32 100644 --- a/gcc/c/c-decl.c +++ b/gcc/c/c-decl.c @@ -10557,6 +10557,7 @@ names_builtin_p (const char *name) case RID_BUILTIN_HAS_ATTRIBUTE: case RID_BUILTIN_SHUFFLE: case RID_BUILTIN_SHUFFLEVECTOR: +case RID_BUILTIN_ASSOC_BARRIER: case RID_CHOOSE_EXPR: case RID_OFFSETOF: case RID_TYPES_COMPATIBLE_P: diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c index 9a56e0c04c6..fffd81f4e5b 100644 --- a/gcc/c/c-parser.c +++ b/gcc/c/c-parser.c @@ -8931,6 +8931,7 @@ c_parser_predefined_identifier (c_parser *parser) assignment-expression , assignment-expression, ) __builtin_convertvector ( assignment-expression , type-name ) + __builtin_assoc_barrier ( assignment-expression ) offsetof-member-designator: identifier @@ -10076,6 +10077,25 @@ c_parser_postfix_expression (c_parser *parser) } } break; + case RID_BUILTIN_ASSOC_BARRIER: + { + location_t start_loc = loc; + c_parser_consume_token (parser); + matching_parens parens; + if (!parens.require_open (parser)) + { + expr.set_error (); + break; + } + e1 = c_parser_expr_no_commas (parser, NULL); + mark_exp_read (e1.value); + location_t end_loc = c_parser_peek_token (parser)->get_finish (); + parens.skip_until_found_close (parser); + expr.value = build1_loc (loc, PAREN_EXPR, TREE_TYPE (e1.value), + e1.value); + set_c_expr_source_range (&expr, start_loc, end_loc); + } + break; case RID_AT_SELECTOR: { gcc_assert (c_dialect_objc ()); diff --git a/gcc/cp/constexpr.c b/gcc/cp/constexpr.c index 31fa5b66865..6e964837d24 100644 --- a/gcc/cp/constexpr.c +++ b/gcc/cp/constexpr.c @@ -6730,6 +6730,14 @@ cxx_eval_constant_expression (const constexpr_ctx *ctx, tree t, non_constant_p, overflow_p); break; +case PAREN_EXPR: + gcc_assert (!REF_PARENTHESIZED_P (t)); + /* A PAREN_EXPR resulting from __builtin_assoc_barrier has no effect in + 
constant expressions since it's unaffected by -fassociative-math. */ + r = cxx_eval_constant_expression (ctx, TREE_OPERAND (t, 0), lval, + non_constant_p, overflow_p); + break; + case NOP_EXPR: if (REINTERPRET_CAST_P (t)) { diff --git a/gcc/cp/cp-objcp-common.c b/gcc/cp/cp-objcp-common.c index ee255732d5a..04522a23eda 100644 --- a/gcc/cp/cp-objcp-common.c +++ b/gcc/cp/cp-objcp-common.c @@ -395,6 +395,7 @@ names_builtin_p (const char *name) case RID_BUILTIN_SHUFFLE: case
Re: [PATCH] i386: Fix up @xorsign<mode>3_1 [PR102224]
On Wed, Sep 8, 2021 at 5:33 PM Jakub Jelinek wrote: > > On Wed, Sep 08, 2021 at 05:23:40PM +0800, Hongtao Liu wrote: > > > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? > > > > > Patch LGTM. > > Thanks, committed. > > > PS: > > I'm curious why we need the post_reload splitter @xorsign3_1 > > for scalar mode, can't we just expand them into and/xor operations in > > the expander, just like vector modes did. > > Let me do some experiments to see whether it is ok to remove the splitter. > > I bet it is the question how should the code look like before RA. > stv is somewhat related, but as that replaces whole chains, it can do: > (insn 14 5 6 2 (set (subreg:V2DI (reg:DI 92) 0) > (vec_concat:V2DI (mem/c:DI (symbol_ref:SI ("c") [flags 0x2] > ) [1 c+0 S8 A64]) > (const_int 0 [0]))) "hohohou.c":6:9 -1 > (nil)) > on loads of memory. > But it stv still does use paradoxical subregs: > (insn 10 16 11 2 (set (subreg:V2DI (reg:DI 91) 0) > (minus:V2DI (subreg:V2DI (reg:DI 87) 0) > (subreg:V2DI (reg:DI 94) 0))) "hohohou.c":6:13 5003 {*subv2di3} > (expr_list:REG_DEAD (reg:DI 87) > (expr_list:REG_UNUSED (reg:CC 17 flags) > (nil > (insn 11 10 0 2 (set (mem/c:DI (symbol_ref:SI ("a") [flags 0x2] 0x7f65a131fc60 a>) [1 a+0 S8 A64]) > (reg:DI 91)) "hohohou.c":6:5 76 {*movdi_internal} > (expr_list:REG_DEAD (reg:DI 91) > (nil))) > so perhaps just using paradoxical subregs everywhere would be ok? Yes, I think so. And I find paradoxical subreg like (subreg:V4SF (reg:SF)) are not allowed by validate_subreg until r11-621. That's why post_reload splitter is needed here. > Jakub > -- BR, Hongtao
[PATCH] Optimize vec_extract for 256/512-bit vector when index exceeds the lower 128 bits.
Hi: As described in the PR, valign{d,q} can be used to extract one element from a vector. For elements located in the lower 128 bits, only one instruction is needed, so this patch only optimizes elements located above 128 bits. The optimization is like: - vextracti32x8 $0x1, %zmm0, %ymm0 - vmovd %xmm0, %eax + valignd $8, %zmm0, %zmm0, %zmm1 + vmovd %xmm1, %eax - vextracti32x8 $0x1, %zmm0, %ymm0 - vextracti128 $0x1, %ymm0, %xmm0 - vpextrd $3, %xmm0, %eax + valignd $15, %zmm0, %zmm0, %zmm1 + vmovd %xmm1, %eax - vextractf64x2 $0x1, %ymm0, %xmm0 + valignq $2, %ymm0, %ymm0, %ymm0 - vextractf64x4 $0x1, %zmm0, %ymm0 - vextractf64x2 $0x1, %ymm0, %xmm0 - vunpckhpd %xmm0, %xmm0, %xmm0 + valignq $7, %zmm0, %zmm0, %zmm0 Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. gcc/ChangeLog: PR target/91103 * config/i386/sse.md (*vec_extract_valign): New define_insn. gcc/testsuite/ChangeLog: PR target/91103 * gcc.target/i386/pr91103-1.c: New test. * gcc.target/i386/pr91103-2.c: New test. --- gcc/config/i386/sse.md| 32 + gcc/testsuite/gcc.target/i386/pr91103-1.c | 37 +++ gcc/testsuite/gcc.target/i386/pr91103-2.c | 81 +++ 3 files changed, 150 insertions(+) create mode 100644 gcc/testsuite/gcc.target/i386/pr91103-1.c create mode 100644 gcc/testsuite/gcc.target/i386/pr91103-2.c diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index 5785e73241c..57c736ff44a 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -232,6 +232,12 @@ (define_mode_iterator V48_AVX512VL V16SF (V8SF "TARGET_AVX512VL") (V4SF "TARGET_AVX512VL") V8DF (V4DF "TARGET_AVX512VL") (V2DF "TARGET_AVX512VL")]) +(define_mode_iterator V48_256_512_AVX512VL + [V16SI (V8SI "TARGET_AVX512VL") + V8DI (V4DI "TARGET_AVX512VL") + V16SF (V8SF "TARGET_AVX512VL") + V8DF (V4DF "TARGET_AVX512VL")]) + ;; 1,2 byte AVX-512{BW,VL} vector modes. Supposed TARGET_AVX512BW baseline. 
(define_mode_iterator VI12_AVX512VL [V64QI (V16QI "TARGET_AVX512VL") (V32QI "TARGET_AVX512VL") @@ -786,6 +792,15 @@ (define_mode_attr sseinsnmode (V4SF "V4SF") (V2DF "V2DF") (TI "TI")]) +(define_mode_attr sseintvecinsnmode + [(V64QI "XI") (V32HI "XI") (V16SI "XI") (V8DI "XI") (V4TI "XI") + (V32QI "OI") (V16HI "OI") (V8SI "OI") (V4DI "OI") (V2TI "OI") + (V16QI "TI") (V8HI "TI") (V4SI "TI") (V2DI "TI") (V1TI "TI") + (V16SF "XI") (V8DF "XI") + (V8SF "OI") (V4DF "OI") + (V4SF "TI") (V2DF "TI") + (TI "TI")]) + ;; SSE constant -1 constraint (define_mode_attr sseconstm1 [(V64QI "BC") (V32HI "BC") (V16SI "BC") (V8DI "BC") (V4TI "BC") @@ -10326,6 +10341,23 @@ (define_insn "_align" [(set_attr "prefix" "evex") (set_attr "mode" "")]) +(define_mode_attr vec_extract_imm_predicate + [(V16SF "const_0_to_15_operand") (V8SF "const_0_to_7_operand") + (V16SI "const_0_to_15_operand") (V8SI "const_0_to_7_operand") + (V8DF "const_0_to_7_operand") (V4DF "const_0_to_3_operand") + (V8DI "const_0_to_7_operand") (V4DI "const_0_to_3_operand")]) + +(define_insn "*vec_extract_valign" + [(set (match_operand: 0 "register_operand" "=v") + (vec_select: + (match_operand:V48_256_512_AVX512VL 1 "register_operand" "v") + (parallel [(match_operand 2 "")])))] + "TARGET_AVX512F + && INTVAL(operands[2]) >= 16 / GET_MODE_SIZE (mode)" + "valign\t{%2, %1, %1, %0|%0, %1, %1, %2}"; + [(set_attr "prefix" "evex") + (set_attr "mode" "")]) + (define_expand "avx512f_shufps512_mask" [(match_operand:V16SF 0 "register_operand") (match_operand:V16SF 1 "register_operand") diff --git a/gcc/testsuite/gcc.target/i386/pr91103-1.c b/gcc/testsuite/gcc.target/i386/pr91103-1.c new file mode 100644 index 000..11caaa8bd1b --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr91103-1.c @@ -0,0 +1,37 @@ +/* { dg-do compile } */ +/* { dg-options "-mavx512vl -O2" } */ +/* { dg-final { scan-assembler-times "valign\[dq\]" 16 } } */ + +typedef float v8sf __attribute__((vector_size(32))); +typedef float v16sf 
__attribute__((vector_size(64))); +typedef int v8si __attribute__((vector_size(32))); +typedef int v16si __attribute__((vector_size(64))); +typedef double v4df __attribute__((vector_size(32))); +typedef double v8df __attribute__((vector_size(64))); +typedef long long v4di __attribute__((vector_size(32))); +typedef long long v8di __attribute__((vector_size(64))); + +#define EXTRACT(V,S,IDX) \ + S\ + __attribute__((noipa)) \ + foo_##V##_##IDX (V v)\ + {\ +return v[IDX]; \ + }\ + +EXTRACT (v8sf, float, 4); +EXTRACT (v8sf, float, 7); +EXTRACT (v8si, int, 4); +EXTRACT (v8si, int
Re: [PATCH] i386: Fix up @xorsign<mode>3_1 [PR102224]
On Wed, Sep 08, 2021 at 06:00:50PM +0800, Hongtao Liu wrote: > Yes, I think so. > And I find paradoxical subreg like (subreg:V4SF (reg:SF)) are not > allowed by validate_subreg until r11-621. > That's why post_reload splitter is needed here. Following seems to work for all the testcases I've find (and in some generates better code than the post-reload splitter): 2021-09-08 Jakub Jelinek liuhongt PR target/89984 * config/i386/i386.md (@xorsign3_1): Remove. * config/i386/i386-expand.c (ix86_expand_xorsign): Expand right away into AND with mask and XOR, using paradoxical subregs. (ix86_split_xorsign): Remove. * gcc.target/i386/avx-pr102224.c: Fix up PR number. * gcc.dg/pr89984.c: New test. * gcc.target/i386/avx-pr89984.c: New test. --- gcc/config/i386/i386.md.jj 2021-09-08 11:40:55.826534981 +0200 +++ gcc/config/i386/i386.md 2021-09-08 11:44:08.394828674 +0200 @@ -10918,20 +10918,6 @@ (define_expand "xorsign3" DONE; }) -(define_insn_and_split "@xorsign3_1" - [(set (match_operand:MODEF 0 "register_operand" "=&Yv,&Yv,&Yv") - (unspec:MODEF - [(match_operand:MODEF 1 "register_operand" "Yv,0,Yv") - (match_operand:MODEF 2 "register_operand" "0,Yv,Yv") - (match_operand: 3 "nonimmediate_operand" "Yvm,Yvm,Yvm")] - UNSPEC_XORSIGN))] - "SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH" - "#" - "&& reload_completed" - [(const_int 0)] - "ix86_split_xorsign (operands); DONE;" - [(set_attr "isa" "*,avx,avx")]) - ;; One complement instructions (define_expand "one_cmpl2" --- gcc/config/i386/i386-expand.c.jj2021-09-08 11:40:55.824535010 +0200 +++ gcc/config/i386/i386-expand.c 2021-09-08 11:51:15.969819626 +0200 @@ -2270,7 +2270,7 @@ void ix86_expand_xorsign (rtx operands[]) { machine_mode mode, vmode; - rtx dest, op0, op1, mask; + rtx dest, op0, op1, mask, x, temp; dest = operands[0]; op0 = operands[1]; @@ -2285,60 +2285,15 @@ ix86_expand_xorsign (rtx operands[]) else gcc_unreachable (); + temp = gen_reg_rtx (vmode); mask = ix86_build_signbit_mask (vmode, 0, 0); - emit_insn (gen_xorsign3_1 
(mode, dest, op0, op1, mask)); -} + op1 = lowpart_subreg (vmode, op1, mode); + x = gen_rtx_AND (vmode, op1, mask); + emit_insn (gen_rtx_SET (temp, x)); -/* Deconstruct an xorsign operation into bit masks. */ - -void -ix86_split_xorsign (rtx operands[]) -{ - machine_mode mode, vmode; - rtx dest, op0, op1, mask, x; - - dest = operands[0]; - op0 = operands[1]; - op1 = operands[2]; - mask = operands[3]; - - mode = GET_MODE (dest); - vmode = GET_MODE (mask); - - /* The constraints ensure that for non-AVX dest == op1 is - different from op0, and for AVX that at most two of - dest, op0 and op1 are the same register but the third one - is different. */ - if (rtx_equal_p (op0, op1)) -{ - gcc_assert (TARGET_AVX && !rtx_equal_p (op0, dest)); - if (vmode == V4SFmode) - vmode = V4SImode; - else - { - gcc_assert (vmode == V2DFmode); - vmode = V2DImode; - } - mask = lowpart_subreg (vmode, mask, GET_MODE (mask)); - if (MEM_P (mask)) - { - rtx msk = lowpart_subreg (vmode, dest, mode); - emit_insn (gen_rtx_SET (msk, mask)); - mask = msk; - } - op0 = lowpart_subreg (vmode, op0, mode); - x = gen_rtx_AND (vmode, gen_rtx_NOT (vmode, mask), op0); -} - else -{ - op1 = lowpart_subreg (vmode, op1, mode); - x = gen_rtx_AND (vmode, op1, mask); - emit_insn (gen_rtx_SET (op1, x)); - - op0 = lowpart_subreg (vmode, op0, mode); - x = gen_rtx_XOR (vmode, op1, op0); -} + op0 = lowpart_subreg (vmode, op0, mode); + x = gen_rtx_XOR (vmode, temp, op0); dest = lowpart_subreg (vmode, dest, mode); emit_insn (gen_rtx_SET (dest, x)); --- gcc/testsuite/gcc.target/i386/avx-pr102224.c.jj 2021-09-08 11:40:55.826534981 +0200 +++ gcc/testsuite/gcc.target/i386/avx-pr102224.c2021-09-08 11:57:41.741386062 +0200 @@ -1,4 +1,4 @@ -/* PR tree-optimization/51581 */ +/* PR target/102224 */ /* { dg-do run } */ /* { dg-options "-O2 -mavx" } */ /* { dg-require-effective-target avx } */ --- gcc/testsuite/gcc.dg/pr89984.c.jj 2021-09-08 11:56:33.799343240 +0200 +++ gcc/testsuite/gcc.dg/pr89984.c 2021-09-08 11:54:36.070001821 
+0200 @@ -0,0 +1,20 @@ +/* PR target/89984 */ +/* { dg-do run } */ +/* { dg-options "-O2" } */ + +__attribute__((noipa)) float +foo (float x, float y) +{ + return x * __builtin_copysignf (1.0f, y) + y; +} + +int +main () +{ + if (foo (1.25f, 7.25f) != 1.25f + 7.25f + || foo (1.75f, -3.25f) != -1.75f + -3.25f + || foo (-2.25f, 7.5f) != -2.25f + 7.5f + || foo (-3.0f, -4.0f) != 3.0f + -4.0f) +__builtin_abort (); + return 0; +} --- gcc/testsuite/gcc.target/i386/avx-pr89984.c.jj 2021-09-08 11:57:12.297800869 +0200 +++ gcc/testsuite/gcc.target/i386/avx-pr89984.c 2021-09-08 11
Re: [PATCH] i386: Fix up @xorsign<mode>3_1 [PR102224]
On Wed, Sep 08, 2021 at 06:00:50PM +0800, Hongtao Liu wrote: > And I find paradoxical subreg like (subreg:V4SF (reg:SF)) are not > allowed by validate_subreg until r11-621. My reading of the r11-621 change is that it allowed (subreg:V4SF (reg:V2SF) 0) but that (subreg:V4SF (reg:SF) 0) has been valid even before that. In any case, the PR89984 fix was a missed-optimization fix, so we don't need to backport that and thus don't need to backport the follow-up patch either. Jakub
Re: [PATCH] i386: Fix up @xorsign<mode>3_1 [PR102224]
On Wed, Sep 8, 2021 at 6:02 PM Jakub Jelinek wrote: > > On Wed, Sep 08, 2021 at 06:00:50PM +0800, Hongtao Liu wrote: > > Yes, I think so. > > And I find paradoxical subreg like (subreg:V4SF (reg:SF)) are not > > allowed by validate_subreg until r11-621. > > That's why post_reload splitter is needed here. > > Following seems to work for all the testcases I've find (and in some > generates better code than the post-reload splitter): > > 2021-09-08 Jakub Jelinek > liuhongt > > PR target/89984 > * config/i386/i386.md (@xorsign3_1): Remove. > * config/i386/i386-expand.c (ix86_expand_xorsign): Expand right away > into AND with mask and XOR, using paradoxical subregs. > (ix86_split_xorsign): Remove. Also remove this from i386-protos.h. > > * gcc.target/i386/avx-pr102224.c: Fix up PR number. > * gcc.dg/pr89984.c: New test. > * gcc.target/i386/avx-pr89984.c: New test. > Other LGTM. > --- gcc/config/i386/i386.md.jj 2021-09-08 11:40:55.826534981 +0200 > +++ gcc/config/i386/i386.md 2021-09-08 11:44:08.394828674 +0200 > @@ -10918,20 +10918,6 @@ (define_expand "xorsign3" >DONE; > }) > > -(define_insn_and_split "@xorsign3_1" > - [(set (match_operand:MODEF 0 "register_operand" "=&Yv,&Yv,&Yv") > - (unspec:MODEF > - [(match_operand:MODEF 1 "register_operand" "Yv,0,Yv") > - (match_operand:MODEF 2 "register_operand" "0,Yv,Yv") > - (match_operand: 3 "nonimmediate_operand" > "Yvm,Yvm,Yvm")] > - UNSPEC_XORSIGN))] > - "SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH" > - "#" > - "&& reload_completed" > - [(const_int 0)] > - "ix86_split_xorsign (operands); DONE;" > - [(set_attr "isa" "*,avx,avx")]) > - > ;; One complement instructions > > (define_expand "one_cmpl2" > --- gcc/config/i386/i386-expand.c.jj2021-09-08 11:40:55.824535010 +0200 > +++ gcc/config/i386/i386-expand.c 2021-09-08 11:51:15.969819626 +0200 > @@ -2270,7 +2270,7 @@ void > ix86_expand_xorsign (rtx operands[]) > { >machine_mode mode, vmode; > - rtx dest, op0, op1, mask; > + rtx dest, op0, op1, mask, x, temp; > >dest = 
operands[0]; >op0 = operands[1]; > @@ -2285,60 +2285,15 @@ ix86_expand_xorsign (rtx operands[]) >else > gcc_unreachable (); > > + temp = gen_reg_rtx (vmode); >mask = ix86_build_signbit_mask (vmode, 0, 0); > > - emit_insn (gen_xorsign3_1 (mode, dest, op0, op1, mask)); > -} > + op1 = lowpart_subreg (vmode, op1, mode); > + x = gen_rtx_AND (vmode, op1, mask); > + emit_insn (gen_rtx_SET (temp, x)); > > -/* Deconstruct an xorsign operation into bit masks. */ > - > -void > -ix86_split_xorsign (rtx operands[]) > -{ > - machine_mode mode, vmode; > - rtx dest, op0, op1, mask, x; > - > - dest = operands[0]; > - op0 = operands[1]; > - op1 = operands[2]; > - mask = operands[3]; > - > - mode = GET_MODE (dest); > - vmode = GET_MODE (mask); > - > - /* The constraints ensure that for non-AVX dest == op1 is > - different from op0, and for AVX that at most two of > - dest, op0 and op1 are the same register but the third one > - is different. */ > - if (rtx_equal_p (op0, op1)) > -{ > - gcc_assert (TARGET_AVX && !rtx_equal_p (op0, dest)); > - if (vmode == V4SFmode) > - vmode = V4SImode; > - else > - { > - gcc_assert (vmode == V2DFmode); > - vmode = V2DImode; > - } > - mask = lowpart_subreg (vmode, mask, GET_MODE (mask)); > - if (MEM_P (mask)) > - { > - rtx msk = lowpart_subreg (vmode, dest, mode); > - emit_insn (gen_rtx_SET (msk, mask)); > - mask = msk; > - } > - op0 = lowpart_subreg (vmode, op0, mode); > - x = gen_rtx_AND (vmode, gen_rtx_NOT (vmode, mask), op0); > -} > - else > -{ > - op1 = lowpart_subreg (vmode, op1, mode); > - x = gen_rtx_AND (vmode, op1, mask); > - emit_insn (gen_rtx_SET (op1, x)); > - > - op0 = lowpart_subreg (vmode, op0, mode); > - x = gen_rtx_XOR (vmode, op1, op0); > -} > + op0 = lowpart_subreg (vmode, op0, mode); > + x = gen_rtx_XOR (vmode, temp, op0); > >dest = lowpart_subreg (vmode, dest, mode); >emit_insn (gen_rtx_SET (dest, x)); > --- gcc/testsuite/gcc.target/i386/avx-pr102224.c.jj 2021-09-08 > 11:40:55.826534981 +0200 > +++ 
gcc/testsuite/gcc.target/i386/avx-pr102224.c2021-09-08 > 11:57:41.741386062 +0200 > @@ -1,4 +1,4 @@ > -/* PR tree-optimization/51581 */ > +/* PR target/102224 */ > /* { dg-do run } */ > /* { dg-options "-O2 -mavx" } */ > /* { dg-require-effective-target avx } */ > --- gcc/testsuite/gcc.dg/pr89984.c.jj 2021-09-08 11:56:33.799343240 +0200 > +++ gcc/testsuite/gcc.dg/pr89984.c 2021-09-08 11:54:36.070001821 +0200 > @@ -0,0 +1,20 @@ > +/* PR target/89984 */ > +/* { dg-do run } */ > +/* { dg-options "-O2" } */ > + > +__attribute__((noipa)) float > +foo (float x, float y) > +{ > + return x * __builtin_copysignf (1.0f, y) + y; > +} > + > +int > +main ()
Re: [Committed] Fix fatal typo in gcc.dg/no_profile_instrument_function-attr-2.c
> From: Martin Liška > Date: Wed, 8 Sep 2021 09:29:52 +0200 > On 9/7/21 22:41, Hans-Peter Nilsson wrote: > > With r12-3379, the testsuite got such a fatal syntax error, > > causing the gcc test-run to abort at (e.g.): > > Thank you for the fix! I haven't noticed the problem as I only grep for > '^FAIL' > lines in the corresponding log files. You're welcome but *please* also add (conceptually) "grep -q ^ERROR: makelog && fail" to your workflow. brgds, H-P PS. yeah, but worth repeating
Re: [Committed] Fix fatal typo in gcc.dg/no_profile_instrument_function-attr-2.c
Hi Hans-Peter, >> From: Martin Liška >> Date: Wed, 8 Sep 2021 09:29:52 +0200 > >> On 9/7/21 22:41, Hans-Peter Nilsson wrote: >> > With r12-3379, the testsuite got such a fatal syntax error, >> > causing the gcc test-run to abort at (e.g.): >> >> Thank you for the fix! I haven't noticed the problem as I only grep for >> '^FAIL' >> lines in the corresponding log files. > > You're welcome but *please* also add (conceptually) > "grep -q ^ERROR: makelog && fail" to your workflow. even that is not enough, though: one also needs to look for UNRESOLVED and XPASS. An easy way of doing all this is running make mail-report.log after the build and comparing the output to an unmodified build. Saves everyone loads of trouble. Rainer -- - Rainer Orth, Center for Biotechnology, Bielefeld University
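The workflow advice above can be folded into a small wrapper. A minimal sketch, assuming a DejaGnu-style log or summary file; the function name and paths are placeholders, not part of any GCC script:

```shell
#!/bin/sh
# Scan a DejaGnu log/summary for the result classes that grepping
# only for '^FAIL' misses: ERROR, UNRESOLVED and XPASS.
check_log () {
  if grep -Eq '^(FAIL|ERROR|UNRESOLVED|XPASS):' "$1"; then
    echo "problems found in $1"
    return 1
  fi
  echo "clean: $1"
}

# Example usage with placeholder files:
printf 'PASS: t1\nERROR: tcl failure in foo.exp\n' > /tmp/gr_bad.log
printf 'PASS: t1\nPASS: t2\n' > /tmp/gr_good.log
check_log /tmp/gr_bad.log || echo "would fail the build here"
check_log /tmp/gr_good.log
```

As noted above, comparing `make mail-report.log` output against an unmodified build catches the same classes of problems with less scripting.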
Re: [PATCH 1/2]middle-end Teach CSE to be able to do vector extracts.
Hi Jeff & Richard, > If you can turn that example into a test, even if it's just in the > aarch64 directory, that would be helpful The second patch 2/2 has various tests for this as the cost model had to be made more accurate for it to work. > > As mentioned in the 2/2 thread, I think we should use subregs for > the case where they're canonical. It'd probably be worth adding a > simplify-rtx.c helper to extract one element from a vector, e.g.: > > rtx simplify_gen_vec_select (rtx op, unsigned int index); > > so that this is easier to do. > > Does making the loop above per-element mean that, for 128-bit Advanced > SIMD, the optimisation “only” kicks in for 64-bit element sizes? > Perhaps for other element sizes we could do “top” and “bottom” halves. > (There's obviously no need to do that as part of this work, was just > wondering.) > It should handle extraction of any element size, so it's able to use a value in any arbitrary location. CSE already handles low/hi re-use optimally. So e.g. #include <arm_neon.h> extern int16x8_t bar (int16x8_t, int16x8_t); int16x8_t foo () { int16_t s[4] = {1,2,3,4}; int16_t d[8] = {1,2,3,4,5,6,7,8}; int16x4_t r1 = vld1_s16 (s); int16x8_t r2 = vcombine_s16 (r1, r1); int16x8_t r3 = vld1q_s16 (d); return bar (r2, r3); } but our cost model is currently blocking it because we never costed vec_consts. Without the 2/2 patch we generate: foo: stp x29, x30, [sp, -48]! adrp x0, .LC0 mov x29, sp ldr q1, [x0, #:lo12:.LC0] adrp x0, .LC1 ldr q0, [x0, #:lo12:.LC1] adrp x0, .LC2 str q1, [sp, 32] ldr d2, [x0, #:lo12:.LC2] str d2, [sp, 24] bl bar ldp x29, x30, [sp], 48 ret .LC0: .hword 1 .hword 2 .hword 3 .hword 4 .hword 5 .hword 6 .hword 7 .hword 8 .LC1: .hword 1 .hword 2 .hword 3 .hword 4 .hword 1 .hword 2 .hword 3 .hword 4 but with the 2/2 patch: foo: stp x29, x30, [sp, -48]! 
adrp x0, .LC0 mov x29, sp ldr d2, [x0, #:lo12:.LC0] adrp x0, .LC1 ldr q1, [x0, #:lo12:.LC1] str d2, [sp, 24] dup d0, v2.d[0] str q1, [sp, 32] ins v0.d[1], v2.d[0] bl bar ldp x29, x30, [sp], 48 ret .LC1: .hword 1 .hword 2 .hword 3 .hword 4 .hword 5 .hword 6 .hword 7 .hword 8 It's not entirely optimal of course, but is a step forward. I think when we fix the vld's this should then become optimal as currently the MEMs are causing it to not re-use those values. > >else > > sets[n_sets++].rtl = x; > > } > > @@ -4513,7 +4533,14 @@ cse_insn (rtx_insn *insn) > >struct set *sets = (struct set *) 0; > > > >if (GET_CODE (x) == SET) > > -sets = XALLOCA (struct set); > > +{ > > + /* For CONST_VECTOR we wants to be able to CSE the vector itself > > along with > > +elements inside the vector if the target says it's cheap. */ > > + if (GET_CODE (SET_SRC (x)) == CONST_VECTOR) > > + sets = XALLOCAVEC (struct set, const_vector_encoded_nelts (SET_SRC (x)) > > + 1); > > + else > > + sets = XALLOCA (struct set); > > +} > >else if (GET_CODE (x) == PARALLEL) > > sets = XALLOCAVEC (struct set, XVECLEN (x, 0)); > > I think this would be easier if “sets” was first converted to an > auto_vec, say auto_vec. We then wouldn't need to > predict in advance how many elements are needed. > Done. 
> > + machine_mode const_mode = GET_MODE_INNER (GET_MODE (src)); > > + struct table_elt *related_elt > > + = lookup (const_rtx, HASH (const_rtx, const_mode), const_mode); > > + if (related_elt) > > + { > > + for (related_elt = related_elt->first_same_value; > > + related_elt; related_elt = related_elt->next_same_value) > > + if (REG_P (related_elt->exp)) > > + { > > + src_eqv_here > > + = gen_rtx_VEC_DUPLICATE (GET_MODE (src), > > +
RE: [PATCH]AArch64 Make use of FADDP in simple reductions.
Slightly updated patch correcting some modes that were giving warnings. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/ChangeLog: * config/aarch64/aarch64-simd.md (*aarch64_faddp_scalar, *aarch64_addp_scalarv2di, *aarch64_faddp_scalar2, *aarch64_addp_scalar2v2di): New. gcc/testsuite/ChangeLog: * gcc.target/aarch64/simd/scalar_faddp.c: New test. * gcc.target/aarch64/simd/scalar_faddp2.c: New test. * gcc.target/aarch64/simd/scalar_addp.c: New test. Co-authored-by: Tamar Christina --- inline copy of patch --- diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 371990fbe2cfb72d22f22ed582bb7ebdebb3edc0..408500f74c06b63368136a49087558d40972873e 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -3411,6 +3411,70 @@ (define_insn "aarch64_faddp" [(set_attr "type" "neon_fp_reduc_add_")] ) +;; For the case where both operands are a subreg we need to use a +;; match_dup since reload cannot enforce that the registers are +;; the same with a constraint in this case. 
+(define_insn "*aarch64_faddp_scalar2" + [(set (match_operand: 0 "register_operand" "=w") + (plus: + (vec_select: + (match_operator:VHSDF 1 "subreg_lowpart_operator" + [(match_operand:VHSDF 2 "register_operand" "w")]) + (parallel [(match_operand 3 "const_int_operand" "n")])) + (match_dup: 2)))] + "TARGET_SIMD + && ENDIAN_LANE_N (, INTVAL (operands[3])) == 1" + "faddp\t%0, %2.2" + [(set_attr "type" "neon_fp_reduc_add_")] +) + +(define_insn "*aarch64_faddp_scalar" + [(set (match_operand: 0 "register_operand" "=w") + (plus: + (vec_select: + (match_operand:VHSDF 1 "register_operand" "w") + (parallel [(match_operand 2 "const_int_operand" "n")])) + (match_operand: 3 "register_operand" "1")))] + "TARGET_SIMD + && ENDIAN_LANE_N (, INTVAL (operands[2])) == 1 + && SUBREG_P (operands[3]) && !SUBREG_P (operands[1]) + && subreg_lowpart_p (operands[3])" + "faddp\t%0, %1.2" + [(set_attr "type" "neon_fp_reduc_add_")] +) + +;; For the case where both operands are a subreg we need to use a +;; match_dup since reload cannot enforce that the registers are +;; the same with a constraint in this case. 
+(define_insn "*aarch64_addp_scalar2v2di"
+  [(set (match_operand:DI 0 "register_operand" "=w")
+	(plus:DI
+	  (vec_select:DI
+	    (match_operator:V2DI 1 "subreg_lowpart_operator"
+	      [(match_operand:V2DI 2 "register_operand" "w")])
+	    (parallel [(match_operand 3 "const_int_operand" "n")]))
+	  (match_dup:DI 2)))]
+  "TARGET_SIMD
+   && ENDIAN_LANE_N (2, INTVAL (operands[3])) == 1"
+  "addp\t%d0, %2.2d"
+  [(set_attr "type" "neon_reduc_add_long")]
+)
+
+(define_insn "*aarch64_addp_scalarv2di"
+  [(set (match_operand:DI 0 "register_operand" "=w")
+	(plus:DI
+	  (vec_select:DI
+	    (match_operand:V2DI 1 "register_operand" "w")
+	    (parallel [(match_operand 2 "const_int_operand" "n")]))
+	  (match_operand:DI 3 "register_operand" "1")))]
+  "TARGET_SIMD
+   && ENDIAN_LANE_N (2, INTVAL (operands[2])) == 1
+   && SUBREG_P (operands[3]) && !SUBREG_P (operands[1])
+   && subreg_lowpart_p (operands[3])"
+  "addp\t%d0, %1.2d"
+  [(set_attr "type" "neon_reduc_add_long")]
+)
+
 (define_insn "aarch64_reduc_plus_internal<mode>"
   [(set (match_operand:VDQV 0 "register_operand" "=w")
	(unspec:VDQV [(match_operand:VDQV 1 "register_operand" "w")]
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c
new file mode 100644
index ..ab904ca6a6392a3a068615f68e6b76c0716344ae
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c
@@ -0,0 +1,11 @@
+/* { dg-do assemble } */
+/* { dg-additional-options "-save-temps -O1 -std=c99" } */
+
+typedef long long v2di __attribute__((vector_size (16)));
+
+long long
+foo (v2di x)
+{
+  return x[1] + x[0];
+}
+/* { dg-final { scan-assembler-times {addp\td[0-9]+, v[0-9]+.2d} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c
new file mode 100644
index ..2c8a05b46d8b4f7a1634bc04cc61426ba7b9ef91
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c
@@ -0,0 +1,44 @@
+/* { dg-do assemble } */
+/* { dg-require-effective-target
arm_v8_2a_fp16_scalar_ok } */
+/* { dg-add-options arm_v8_2a_fp16_scalar } */
+/* { dg-additional-options "-save-temps -O1" } */
+/* { dg-final { scan-assembler-times "dup" 4 } } */
+
+
+typedef double v2df __attribute__((vector_size (16)));
+typedef float v4sf __attribute__((vector_size (16)));
+typedef __fp16 v8hf __attribute__((vector_size (16)));
+
+double
+foo (v2df x)
+{
+  return x[1] + x[0];
+}
+/* { dg-fina
Re: Sv: [PATCH 1/2 v2] jit : Generate debug info for variables
On Tue, 2021-09-07 at 16:18 +, Petter Tomner via Gcc-patches wrote: > I realized I still managed to mess up some WS. I have attached a > patch that is the same, except fixes the WS issue > underneath. > > Regards, Petter > > + FOR_EACH_VEC_ELT (m_globals, i, global) > + rest_of_decl_compilation (global, true, true); Thanks for the updated patch. Looks good to me, assuming it's been tested successfully. It should be combined with the testcase into a single commit when pushed. Do you have push rights to the GCC git repo? Dave
RE: [PATCH 2/2]AArch64: Add better costing for vector constants and operations
> -Original Message- > From: Richard Sandiford > Sent: Tuesday, August 31, 2021 7:38 PM > To: Tamar Christina > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw > ; Marcus Shawcroft > ; Kyrylo Tkachov > Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector constants > and operations > > Tamar Christina writes: > >> -Original Message- > >> From: Richard Sandiford > >> Sent: Tuesday, August 31, 2021 5:07 PM > >> To: Tamar Christina > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw > >> ; Marcus Shawcroft > >> ; Kyrylo Tkachov > > >> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector > >> constants and operations > >> > >> Tamar Christina writes: > >> >> -Original Message- > >> >> From: Richard Sandiford > >> >> Sent: Tuesday, August 31, 2021 4:14 PM > >> >> To: Tamar Christina > >> >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw > >> >> ; Marcus Shawcroft > >> >> ; Kyrylo Tkachov > >> > >> >> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector > >> >> constants and operations > >> >> > >> >> Tamar Christina writes: > >> >> > @@ -13936,8 +13937,65 @@ cost_plus: > >> >> >mode, MULT, 1, speed); > >> >> >return true; > >> >> > } > >> >> > + break; > >> >> > +case PARALLEL: > >> >> > + /* Fall through */ > >> >> > >> >> Which code paths lead to getting a PARALLEL here? > >> > > >> > Hi, > >> > > >> > Thanks for the review! > >> > > >> > I added it for completeness because CSE treats a parallel and > >> > CONST_VECTOR as equivalent when they each entry in the parallel > >> > defines > >> a constant. > >> > >> Could you test whether it ever triggers in practice though? > >> The code would be much simpler without it. > > > > Will check 😊 Looks like for AArch64 there's no real way for this to happen so I've removed this case. > > > >> > >> >> > +case CONST_VECTOR: > >> >> > + { > >> >> > + rtx gen_insn = aarch64_simd_make_constant (x, true); > >> >> > + /* Not a valid const vector. 
*/ > >> >> > + if (!gen_insn) > >> >> > + break; > >> >> > > >> >> > - /* Fall through. */ > >> >> > + switch (GET_CODE (gen_insn)) > >> >> > + { > >> >> > + case CONST_VECTOR: > >> >> > + /* Load using MOVI/MVNI. */ > >> >> > + if (aarch64_simd_valid_immediate (x, NULL)) > >> >> > + *cost += extra_cost->vect.movi; > >> >> > + else /* Load using constant pool. */ > >> >> > + *cost += extra_cost->ldst.load; > >> >> > + break; > >> >> > + /* Load using a DUP. */ > >> >> > + case VEC_DUPLICATE: > >> >> > + *cost += extra_cost->vect.dup; > >> >> > + break; > >> >> > >> >> Does this trigger in practice? The new check==true path (rightly) > >> >> stops the duplicated element from being forced into a register, > >> >> but then I would have > >> >> expected: > >> >> > >> >> rtx > >> >> gen_vec_duplicate (machine_mode mode, rtx x) { > >> >> if (valid_for_const_vector_p (mode, x)) > >> >> return gen_const_vec_duplicate (mode, x); > >> >> return gen_rtx_VEC_DUPLICATE (mode, x); } > >> >> > >> >> to generate the original CONST_VECTOR again. > >> > > >> > Yes, but CSE is trying to see whether using a DUP is cheaper than > >> > another > >> instruction. > >> > Normal code won't hit this but CSE is just costing all the > >> > different ways one can semantically construct a vector, which RTL > >> > actually comes out > >> of it depends on how it's folded as you say. > >> > >> But what I mean is, you call: > >> > >> rtx gen_insn = aarch64_simd_make_constant (x, true); > >> /* Not a valid const vector. */ > >> if (!gen_insn) > >>break; > >> > >> where aarch64_simd_make_constant does: > >> > >> if (CONST_VECTOR_P (vals)) > >> const_vec = vals; > >> else if (GET_CODE (vals) == PARALLEL) > >> { > >> /* A CONST_VECTOR must contain only CONST_INTs and > >> CONST_DOUBLEs, but CONSTANT_P allows more (e.g. SYMBOL_REF). > >> Only store valid constants in a CONST_VECTOR. 
*/ > >> int n_elts = XVECLEN (vals, 0); > >> for (i = 0; i < n_elts; ++i) > >>{ > >> rtx x = XVECEXP (vals, 0, i); > >> if (CONST_INT_P (x) || CONST_DOUBLE_P (x)) > >>n_const++; > >>} > >> if (n_const == n_elts) > >>const_vec = gen_rtx_CONST_VECTOR (mode, XVEC (vals, 0)); > >> } > >> else > >> gcc_unreachable (); > >> > >> if (const_vec != NULL_RTX > >> && aarch64_simd_valid_immediate (const_vec, NULL)) > >> /* Load using MOVI/MVNI. */ > >> return const_vec; > >> else if ((const_dup = aarch64_simd_dup_constant (vals, check)) != > >> NULL_RTX) > >> /* Loaded using DUP. */ > >> return const_dup; > >> > >> and aarch64_simd_dup_constant does: > >> > >> machine_mode mode = GET_MODE (vals); > >> machine_mode inner_mode = GET_MODE_INNER (mode); > >> rtx x; > >> > >> if
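The CONST_VECTOR shapes being costed in this thread can be sketched in plain C with the GNU vector extension. This is illustrative only; the typedef and function names below are mine, not part of the patch. A vector whose lanes are all equal can be materialized with a single `movi`/`dup` on AArch64, while a vector of distinct constants generally needs a literal-pool load — the two cases the costing code tries to price differently.

```c
#include <assert.h>

/* Illustrative only: names are mine, not from the patch.  */
typedef int v4si __attribute__ ((vector_size (16)));

v4si
splat_five (void)
{
  /* All lanes equal: a movi/dup candidate.  */
  return (v4si) { 5, 5, 5, 5 };
}

v4si
mixed_consts (void)
{
  /* Distinct lanes: typically loaded from the literal pool.  */
  return (v4si) { 1, 2, 3, 4 };
}
```

Compiling both at -O2 for aarch64 and comparing the generated bodies is a quick way to see the difference the cost model is meant to capture.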
Re: [PATCH v2] c-family: Add __builtin_assoc_barrier
On 9/8/21 5:37 AM, Matthias Kretz wrote:
> On Tuesday, 7 September 2021 19:36:22 CEST Jason Merrill wrote:
>>>     case PAREN_EXPR:
>>> -      RETURN (finish_parenthesized_expr (RECUR (TREE_OPERAND (t, 0))));
>>> +      if (REF_PARENTHESIZED_P (t))
>>> +	RETURN (finish_parenthesized_expr (RECUR (TREE_OPERAND (t, 0))));
>>> +      else
>>> +	RETURN (RECUR (TREE_OPERAND (t, 0)));
>>
>> I think you need to build a new PAREN_EXPR in the assoc barrier case as
>> well, for it to have any effect in templates.
>
> My intent was to ignore __builtin_assoc_barrier in templates / constexpr
> evaluation since it's not affected by -fassociative-math anyway. Or do
> you mean something else?

I agree about constexpr, but why wouldn't template instantiations be
affected by -fassociative-math like any other function?

Jason
Re: [PATCH] c++/102228 - make lookup_anon_field O(1)
On 9/8/21 5:13 AM, Richard Biener wrote: For the testcase in PR101555 lookup_anon_field takes the majority of parsing time followed by get_class_binding_direct/fields_linear_search which is PR83309. The situation with anon aggregates is particularly dire when we need to build accesses to their members and the anon aggregates are nested. There for each such access we recursively build sub-accesses to the anon aggregate FIELD_DECLs bottom-up, DFS searching for them. That's inefficient since as I believe there's a 1:1 relationship between anon aggregate types and the FIELD_DECL used to place them. The patch below does away with the search in lookup_anon_field and instead records the single FIELD_DECL in the anon aggregate types lang-specific data, re-using the RTTI typeinfo_var field. That speeds up the compile of the testcase with -fsyntax-only from about 4.5s to slightly less than 1s. I tried to poke holes into the 1:1 relationship idea with my C++ knowledge but failed (which might not say much). It also leaves a hole for the case when the C++ FE itself duplicates such type and places it at a semantically different position. I've tried to poke holes into it with the duplication mechanism I understand (templates) but failed. Bootstrapped and tested on x86_64-unknown-linux-gnu. OK for trunk? OK, thanks. Thanks, Richard. 2021-09-08 Richard Biener PR c++/102228 * cp-tree.h (ANON_AGGR_TYPE_FIELD): New define. * decl.c (fixup_anonymous_aggr): Wipe RTTI info put in place on invalid code. * decl2.c (reset_type_linkage): Guard CLASSTYPE_TYPEINFO_VAR access. * module.cc (trees_in::read_class_def): Likewise. Reconstruct ANON_AGGR_TYPE_FIELD. * semantics.c (finish_member_declaration): Populate ANON_AGGR_TYPE_FIELD for anon aggregate typed members. * typeck.c (lookup_anon_field): Remove DFS search and return ANON_AGGR_TYPE_FIELD directly. 
--- gcc/cp/cp-tree.h | 4 gcc/cp/decl.c | 3 +++ gcc/cp/decl2.c | 17 + gcc/cp/module.cc | 10 -- gcc/cp/semantics.c | 8 gcc/cp/typeck.c| 32 +--- 6 files changed, 37 insertions(+), 37 deletions(-) diff --git a/gcc/cp/cp-tree.h b/gcc/cp/cp-tree.h index 435f32bf909..ceb53591926 100644 --- a/gcc/cp/cp-tree.h +++ b/gcc/cp/cp-tree.h @@ -4800,6 +4800,10 @@ more_aggr_init_expr_args_p (const aggr_init_expr_arg_iterator *iter) #define ANON_UNION_TYPE_P(NODE) \ (TREE_CODE (NODE) == UNION_TYPE && ANON_AGGR_TYPE_P (NODE)) +/* For an ANON_AGGR_TYPE_P the single FIELD_DECL it is used with. */ +#define ANON_AGGR_TYPE_FIELD(NODE) \ + (LANG_TYPE_CLASS_CHECK (NODE)->typeinfo_var) + /* Define fields and accessors for nodes representing declared names. */ /* True if TYPE is an unnamed structured type with a typedef for diff --git a/gcc/cp/decl.c b/gcc/cp/decl.c index 146979ba96b..bce62ad202a 100644 --- a/gcc/cp/decl.c +++ b/gcc/cp/decl.c @@ -5096,6 +5096,9 @@ fixup_anonymous_aggr (tree t) (*vec)[store++] = elt; vec_safe_truncate (vec, store); + /* Wipe RTTI info. */ + CLASSTYPE_TYPEINFO_VAR (t) = NULL_TREE; + /* Anonymous aggregates cannot have fields with ctors, dtors or complex assignment operators (because they cannot have these methods themselves). 
For anonymous unions this is already checked because they are not allowed diff --git a/gcc/cp/decl2.c b/gcc/cp/decl2.c index 107edcaaccf..a79a70ba9c2 100644 --- a/gcc/cp/decl2.c +++ b/gcc/cp/decl2.c @@ -2977,14 +2977,15 @@ reset_type_linkage (tree type) SET_DECL_ASSEMBLER_NAME (vt, name); reset_decl_linkage (vt); } - if (tree ti = CLASSTYPE_TYPEINFO_VAR (type)) - { - tree name = mangle_typeinfo_for_type (type); - DECL_NAME (ti) = name; - SET_DECL_ASSEMBLER_NAME (ti, name); - TREE_TYPE (name) = type; - reset_decl_linkage (ti); - } + if (!ANON_AGGR_TYPE_P (type)) + if (tree ti = CLASSTYPE_TYPEINFO_VAR (type)) + { + tree name = mangle_typeinfo_for_type (type); + DECL_NAME (ti) = name; + SET_DECL_ASSEMBLER_NAME (ti, name); + TREE_TYPE (name) = type; + reset_decl_linkage (ti); + } for (tree m = TYPE_FIELDS (type); m; m = DECL_CHAIN (m)) { tree mem = STRIP_TEMPLATE (m); diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc index 4b2ad6f3db8..71d0fab411f 100644 --- a/gcc/cp/module.cc +++ b/gcc/cp/module.cc @@ -11859,8 +11859,9 @@ trees_in::read_class_def (tree defn, tree maybe_template) { CLASSTYPE_BEFRIENDING_CLASSES (type_dup) = CLASSTYPE_BEFRIENDING_CLASSES (type); - CLASSTYPE_TYPEINFO_VAR (type_dup) - = CLASSTYPE_TYPEINFO_VAR (type); + if (!ANON_AGGR_TYPE_P (type)) + CLASSTYPE_TYPEINFO_VAR (
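A minimal C++ sketch (mine, not from the patch) of the anonymous-aggregate member access the patch speeds up. The anonymous union inside `S` has no name of its own; the front end places it with a single FIELD_DECL, and an access such as `s.i` is internally built as `s.<anonymous-field>.i`. `lookup_anon_field` maps the anonymous aggregate type back to that FIELD_DECL — after the patch in O(1) via `ANON_AGGR_TYPE_FIELD` instead of a DFS over `TYPE_FIELDS`.

```cpp
#include <cassert>

// Illustrative only; this type and helper are mine, not from the patch.
struct S
{
  union          // anonymous union: placed by exactly one FIELD_DECL in S
  {
    int i;
    float f;
  };
  long tail;     // an ordinary named member, for contrast
};

// Reading through the anonymous member exercises the lookup path.
inline int
read_i (S &s)
{
  return s.i;    // built internally as s.<anon-field>.i
}
```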
Re: [PATCH v2] c-family: Add __builtin_assoc_barrier
On Wednesday, 8 September 2021 15:44:28 CEST Jason Merrill wrote:
> On 9/8/21 5:37 AM, Matthias Kretz wrote:
> > On Tuesday, 7 September 2021 19:36:22 CEST Jason Merrill wrote:
> >>>	  case PAREN_EXPR:
> >>> -	    RETURN (finish_parenthesized_expr (RECUR (TREE_OPERAND (t, 0))));
> >>> +	    if (REF_PARENTHESIZED_P (t))
> >>> +	      RETURN (finish_parenthesized_expr (RECUR (TREE_OPERAND (t, 0))));
> >>> +	    else
> >>> +	      RETURN (RECUR (TREE_OPERAND (t, 0)));
> >>
> >> I think you need to build a new PAREN_EXPR in the assoc barrier case as
> >> well, for it to have any effect in templates.
> >
> > My intent was to ignore __builtin_assoc_barrier in templates / constexpr
> > evaluation since it's not affected by -fassociative-math anyway. Or do you
> > mean something else?
>
> I agree about constexpr, but why wouldn't template instantiations be
> affected by -fassociative-math like any other function?

Oh, that seems like a major misunderstanding on my part. I assumed
tsubst_copy_and_build would evaluate the expressions in template arguments 🤦.
I'll expand the test and will fix.

-- 
──────────────────────────────────────────────────────────────────────────
 Dr. Matthias Kretz                           https://mattkretz.github.io
 GSI Helmholtz Centre for Heavy Ion Research               https://gsi.de
 stdₓ::simd
──────────────────────────────────────────────────────────────────────────
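Why reassociation matters here can be seen from a plain-C sketch; the function names are mine, and the `__builtin_assoc_barrier` use shown in the comment assumes the builtin as added by this patch series. The two functions really compute different values, which is exactly what `-fassociative-math` is allowed to change and what the barrier is meant to prevent.

```c
#include <assert.h>

/* Floating-point addition is not associative.  With the builtin from
   this patch one could write, e.g.,
       float t = __builtin_assoc_barrier (a + b) + c;
   to keep a + b from being reassociated with c even under -ffast-math.
   The two functions below, compiled without -fassociative-math, show
   that the association orders really differ.  */

float
sum_left (float a, float b, float c)
{
  return (a + b) + c;
}

float
sum_right (float a, float b, float c)
{
  return a + (b + c);
}
```

With a = 1e20f, b = -1e20f, c = 1.0f: 1.0f is far below one ulp of 1e20f, so it vanishes when added to -1e20f first, giving 0.0f from `sum_right` but 1.0f from `sum_left`.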
[PATCH] rs6000: Fix ELFv2 r12 use in epilogue
We cannot use r12 here, it is already in use as the GEP (for sibling calls). Committed to trunk. Will backport later, too. Segher 2021-09-08 Segher Boessenkool PR target/102107 * config/rs6000/rs6000-logue.c (rs6000_emit_epilogue): For ELFv2 use r11 instead of r12 for restoring CR. --- gcc/config/rs6000/rs6000-logue.c | 4 1 file changed, 4 insertions(+) diff --git a/gcc/config/rs6000/rs6000-logue.c b/gcc/config/rs6000/rs6000-logue.c index e363d56ecec0..9965a8aa6910 100644 --- a/gcc/config/rs6000/rs6000-logue.c +++ b/gcc/config/rs6000/rs6000-logue.c @@ -4815,6 +4815,10 @@ rs6000_emit_epilogue (enum epilogue_type epilogue_type) else if (REGNO (frame_reg_rtx) == 12) cr_save_regno = 11; + /* For ELFv2 r12 is already in use as the GEP. */ + if (DEFAULT_ABI == ABI_ELFv2) + cr_save_regno = 11; + cr_save_reg = load_cr_save (cr_save_regno, frame_reg_rtx, info->cr_save_offset + frame_off, exit_func); -- 1.8.3.1
testsuite: Allow .sdata in more cases in gcc.dg/array-quals-1.c
When testing for Nios II (gcc-testresults shows this for MIPS as well), failures of gcc.dg/array-quals-1.c appear where a symbol was found in .sdata rather than one of the expected sections. FAIL: gcc.dg/array-quals-1.c scan-assembler-symbol-section symbol ^_?a$ (found a) has section ^\\.(const|rodata|srodata)|\\[RO\\] (found .sdata) FAIL: gcc.dg/array-quals-1.c scan-assembler-symbol-section symbol ^_?b$ (found b) has section ^\\.(const|rodata|srodata)|\\[RO\\] (found .sdata) FAIL: gcc.dg/array-quals-1.c scan-assembler-symbol-section symbol ^_?c$ (found c) has section ^\\.(const|rodata|srodata)|\\[RO\\] (found .sdata) FAIL: gcc.dg/array-quals-1.c scan-assembler-symbol-section symbol ^_?d$ (found d) has section ^\\.(const|rodata|srodata)|\\[RO\\] (found .sdata) Jakub's commit 0b34dbc0a24864b1674bff7a92fa3cf0f1cbcea1 allowed .sdata for many variables in that test where use of .sdata caused a failure on powerpc-linux. I'm presuming the choice of which variables had .sdata allowed was based only on the code generated for powerpc-linux, not on any reason it would be wrong to allow it for the other variables; thus, this patch adjusts the test to allow .sdata for some more variables where that is needed on Nios II (and in one case where it's not needed on Nios II, but the test results on gcc-testresults suggest that it is needed on MIPS). Tested with no regressions with cross to nios2-elf. OK to commit? * gcc.dg/array-quals-1.c: Allow .sdata section in more cases. diff --git a/gcc/testsuite/gcc.dg/array-quals-1.c b/gcc/testsuite/gcc.dg/array-quals-1.c index 2c041649279..b9b55f774bc 100644 --- a/gcc/testsuite/gcc.dg/array-quals-1.c +++ b/gcc/testsuite/gcc.dg/array-quals-1.c @@ -7,26 +7,26 @@ /* { dg-additional-options "-fno-pie" { target pie } } */ /* The MMIX port always switches to the .data section at the end of a file. 
*/ /* { dg-final { scan-assembler-not "\\.data(?!\\.rel\\.ro)" { xfail powerpc*-*-aix* mmix-*-* x86_64-*-mingw* } } } */ -/* { dg-final { scan-assembler-symbol-section {^_?a$} {^\.(const|rodata|srodata)|\[RO\]} } } */ +/* { dg-final { scan-assembler-symbol-section {^_?a$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ static const int a[2] = { 1, 2 }; /* { dg-final { scan-assembler-symbol-section {^_?a1$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ const int a1[2] = { 1, 2 }; typedef const int ci; -/* { dg-final { scan-assembler-symbol-section {^_?b$} {^\.(const|rodata|srodata)|\[RO\]} } } */ +/* { dg-final { scan-assembler-symbol-section {^_?b$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ static ci b[2] = { 3, 4 }; /* { dg-final { scan-assembler-symbol-section {^_?b1$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ ci b1[2] = { 3, 4 }; typedef int ia[2]; -/* { dg-final { scan-assembler-symbol-section {^_?c$} {^\.(const|rodata|srodata)|\[RO\]} } } */ +/* { dg-final { scan-assembler-symbol-section {^_?c$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ static const ia c = { 5, 6 }; /* { dg-final { scan-assembler-symbol-section {^_?c1$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ const ia c1 = { 5, 6 }; typedef const int cia[2]; -/* { dg-final { scan-assembler-symbol-section {^_?d$} {^\.(const|rodata|srodata)|\[RO\]} } } */ +/* { dg-final { scan-assembler-symbol-section {^_?d$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ static cia d = { 7, 8 }; /* { dg-final { scan-assembler-symbol-section {^_?d1$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ cia d1 = { 7, 8 }; -/* { dg-final { scan-assembler-symbol-section {^_?e$} {^\.(const|rodata|srodata)|\[RO\]} } } */ +/* { dg-final { scan-assembler-symbol-section {^_?e$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ static cia e[2] = { { 1, 2 }, { 3, 4 } }; /* { dg-final { scan-assembler-symbol-section {^_?e1$} {^\.(const|rodata|srodata|sdata)|\[RO\]} } } */ cia e1[2] = { { 1, 2 }, { 3, 4 } }; -- Joseph 
S. Myers jos...@codesourcery.com
Re: testsuite: Allow .sdata in more cases in gcc.dg/array-quals-1.c
On Wed, Sep 08, 2021 at 03:25:24PM +, Joseph Myers wrote: > When testing for Nios II (gcc-testresults shows this for MIPS as > well), failures of gcc.dg/array-quals-1.c appear where a symbol was > found in .sdata rather than one of the expected sections. > > FAIL: gcc.dg/array-quals-1.c scan-assembler-symbol-section symbol ^_?a$ > (found a) has section ^\\.(const|rodata|srodata)|\\[RO\\] (found .sdata) > FAIL: gcc.dg/array-quals-1.c scan-assembler-symbol-section symbol ^_?b$ > (found b) has section ^\\.(const|rodata|srodata)|\\[RO\\] (found .sdata) > FAIL: gcc.dg/array-quals-1.c scan-assembler-symbol-section symbol ^_?c$ > (found c) has section ^\\.(const|rodata|srodata)|\\[RO\\] (found .sdata) > FAIL: gcc.dg/array-quals-1.c scan-assembler-symbol-section symbol ^_?d$ > (found d) has section ^\\.(const|rodata|srodata)|\\[RO\\] (found .sdata) > > Jakub's commit 0b34dbc0a24864b1674bff7a92fa3cf0f1cbcea1 allowed .sdata > for many variables in that test where use of .sdata caused a failure > on powerpc-linux. I'm presuming the choice of which variables had > .sdata allowed was based only on the code generated for powerpc-linux, > not on any reason it would be wrong to allow it for the other Yeah. > variables; thus, this patch adjusts the test to allow .sdata for some > more variables where that is needed on Nios II (and in one case where > it's not needed on Nios II, but the test results on gcc-testresults > suggest that it is needed on MIPS). > > Tested with no regressions with cross to nios2-elf. OK to commit? > > * gcc.dg/array-quals-1.c: Allow .sdata section in more cases. LGTM. Jakub
Re: [PATCH] Fix SFmode subreg of DImode and TImode
On Tue, Sep 07, 2021 at 06:07:30PM -0500, Segher Boessenkool wrote: > Hi! > > On Tue, Sep 07, 2021 at 03:12:36AM -0400, Michael Meissner wrote: > > [PATCH] Fix SFmode subreg of DImode and TImode > > > > This patch fixes the breakage in the PowerPC due to a recent change in > > SUBREG > > behavior. > > But what was that change? And was that intentional? If so, why wasn't > it documented, was the existing behaviour considered buggy? But the > documentation agrees with the previous behaviour afaics. I haven't investigated what the machine independent change was that suddenly started allowing: (subreg:SF (reg:TI ...)) This patch just allows the subreg's to exist instead of getting an insn not found message. An alternative approach would be to allow wider subregs in the insn matching and deal with getting the right physical register when splitting. > > While it is arguable that the patch that caused the breakage should > > be reverted, this patch should be a bandage to prevent these changes from > > happening again. > > NAK. This patch will likely cause us to generate worse code. If that > is not the case it will need a long, in-depth explanation of why not. Trying to replicate the problem, I get exactly the same code with an older compiler. I will attach the test program as an attachment. > Sorry. > > > I first noticed it in building the Spec 2017 wrf_r and blender_r > > benchmarks. Once I applied this patch, I also noticed several of the > > tests now pass. > > > > The core of the problem is we need to treat SUBREG's of SFmode and SImode > > specially on the PowerPC. This is due to the fact that SFmode values that > > are > > in the vector and floating point registers are represented as DFmode. When > > we > > want to do a direct move between the GPR registers and the vector > > registers, we > > have to convert the value from the DFmode representation to/from the SFmode > > representation. 
> > The core of the problem is that subreg of pseudos has three meanings: > -- Paradoxical subregs; > -- Actual subregs; > -- "bit_cast" thingies: treat the same bits as something else. Like > looking at the bits of a float as its memory image. > > Ignoring paradoxical subregs (as well as subregs of mem, which should > have disappeared by now), and subregs of hard registers as well (those > have *different* semantics after all), the other two kinds can be mixed, > and *have to* be mixed, because subregs of subregs are non-canonical. This isn't subregs of subregs. This is a subreg of a larger integer type, for example having a subreg representing an __int128_t or structure, and accessing the bottom/top element. Instead of doing an extract, the compiler generates a subreg. Before the machine independent part of the compiler would not generate these subregs, and now it does. > > Is there any reason why not to allow this kind of subreg? > > If we want to not allow mixing bit_cast with subregs, we should make it > its own RTL code. > > > + /* In case we are given a SUBREG for a larger type, reduce it to > > +SImode. */ > > + if (mode == SFmode && GET_MODE_SIZE (inner_mode) > 4) > > + { > > + rtx tmp = gen_reg_rtx (SImode); > > + emit_move_insn (tmp, gen_lowpart (SImode, source)); > > + emit_insn (gen_movsf_from_si (dest, tmp)); > > + return true; > > + } > > This makes it two separate insns. Is that always optimised to code that > is at least as good as before? It is likely even if it generates an extra move, it will be better, because the default would be to do the transfer via store on one register bank and load of the other. 
-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.ibm.com, phone: +1 (978) 899-4797

union u_128 {
  __int128_t i128;
  long l[2];
  int i[4];
  double d[2];
  float f[4];
};

union u_64 {
  long l;
  int i[2];
  double d;
  float f[2];
};

float
ret_u128_f0 (long a, long b, long x, long y)
{
  union u_128 u;
  u.l[0] = a ^ x;
  u.l[1] = b ^ y;
  return u.f[0];
}

float
ret_u128_f1 (long a, long b, long x, long y)
{
  union u_128 u;
  u.l[0] = a ^ x;
  u.l[1] = b ^ y;
  return u.f[1];
}

float
ret_u128_f2 (long a, long b, long x, long y)
{
  union u_128 u;
  u.l[0] = a ^ x;
  u.l[1] = b ^ x;
  return u.f[2];
}

float
ret_u128_f3 (long a, long b, long x, long y)
{
  union u_128 u;
  u.l[0] = a ^ x;
  u.l[1] = b ^ x;
  return u.f[3];
}

float
ret_u64_f0 (long a, long x)
{
  union u_64 u;
  u.l = a ^ x;
  return u.f[0];
}

float
ret_u64_f1 (long a, long x)
{
  union u_64 u;
  u.l = a ^ x;
  return u.f[1];
}
Re: [PATCH] Fix SFmode subreg of DImode and TImode
Hi! On Wed, Sep 08, 2021 at 08:42:44AM +0200, Richard Biener wrote: > On Wed, Sep 8, 2021 at 1:08 AM Segher Boessenkool > wrote: > > The core of the problem is that subreg of pseudos has three meanings: > > -- Paradoxical subregs; > > -- Actual subregs; > > -- "bit_cast" thingies: treat the same bits as something else. Like > > looking at the bits of a float as its memory image. > > > > Ignoring paradoxical subregs (as well as subregs of mem, which should > > have disappeared by now), and subregs of hard registers as well (those > > have *different* semantics after all), the other two kinds can be mixed, > > and *have to* be mixed, because subregs of subregs are non-canonical. > > > > Is there any reason why not to allow this kind of subreg? > > In fact the causing rev. in question > (d2874d905647a1d146dafa60199d440e837adc4d) > made all those subregs "valid" in terms of what validate_subreg is verifying > and thus now the few places using validate_subreg to check whether some > subreg is valid will now happily do float<->int converting subregs. Like in PR102154: error: unrecognizable insn: (insn 11 10 12 2 (set (reg:SF 118 [ <retval> ]) (subreg:SF (reg:DI 122) 0)) This generated valid code before, and now it is not anymore. If the patch was meant to allow *more*, it failed that goal: it allows *less*. The issue is that we do not allow SF subregs in all cases. We do not store SFmode quantities in IEEE SP format in registers normally, so it would require expensive extra code to allow such subregs. Whatever pass made these subregs is wrong, because they do not pass recog(). > I do agree that those subregs should be allowed and that the above rev. is > a strict improvement (given it removes a lot of "but allow special case X > because target Y wants it" cases by simply allowing all of them). But the > previous code seems to have papered over quite some backend issues. It is not a good idea to do allow all those things. 
Most backends can only support a few combinations of them, and everything else results in *worse* machine code, best case, and more and more complicated (and more buggy!) backend code. But that is a code quality issue. The current problem is that we have at least PR102211 and PR102154 (as well as reports elsewhere of bugs on other targets). Code that used to work before doesn't anymore, and we have no clear way out, no recommendation how to fix this and a) keep the same functionality without huge changes, and b) keep the same machine code quality. I do not think something like that can be done. That is why I am asking for the patch to be reverted until all of the groundwork for it has been done. That includes making generic testcases that show how such subregs behave, so that we can see in testresults what changes do to existing targets. Segher
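A C-level analogue (my sketch, not from the thread) of the "bit_cast" kind of subreg under discussion: `(subreg:SF (reg:DI 122) 0)` views the low 32 bits of a 64-bit pseudo as a single-precision float, which in source form is a `memcpy`-style reinterpretation of bits.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative only.  On a little-endian target the lowpart of a
   DImode value is its low 32 bits; viewing those bits as SFmode is
   what (subreg:SF (reg:DI) 0) expresses at the RTL level.  */
float
low_bits_as_float (uint64_t x)
{
  uint32_t lo = (uint32_t) x;	/* lowpart of the DImode value */
  float f;
  memcpy (&f, &lo, sizeof f);	/* the "bit_cast": same bits, new mode */
  return f;
}
```

This is the operation rs6000 cannot do for free, since SFmode values live in registers in DFmode format there — hence the extra conversion code the thread is arguing about.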
Re: [PATCH] Fix SFmode subreg of DImode and TImode
On Wed, Sep 8, 2021 at 1:10 PM Segher Boessenkool wrote: > > Hi! > > On Wed, Sep 08, 2021 at 08:42:44AM +0200, Richard Biener wrote: > > On Wed, Sep 8, 2021 at 1:08 AM Segher Boessenkool > > wrote: > > > The core of the problem is that subreg of pseudos has three meanings: > > > -- Paradoxical subregs; > > > -- Actual subregs; > > > -- "bit_cast" thingies: treat the same bits as something else. Like > > > looking at the bits of a float as its memory image. > > > > > > Ignoring paradoxical subregs (as well as subregs of mem, which should > > > have disappeared by now), and subregs of hard registers as well (those > > > have *different* semantics after all), the other two kinds can be mixed, > > > and *have to* be mixed, because subregs of subregs are non-canonical. > > > > > > Is there any reason why not to allow this kind of subreg? > > > > In fact the causing rev. in question > > (d2874d905647a1d146dafa60199d440e837adc4d) > > made all those subregs "valid" in terms of what validate_subreg is verifying > > and thus now the few places using validate_subreg to check whether some > > subreg is valid will now happily do float<->int converting subregs. > > Like in PR102154: > > error: unrecognizable insn: > (insn 11 10 12 2 (set (reg:SF 118 [ ]) > (subreg:SF (reg:DI 122) 0)) > > This generated valid code before, and now it is not anymore. If the > patch was meant to allow *more*, it failed that goal: it allows *less*. > > The issue is that we do not allow SF subregs in all cases. We do not > store SFmode quantities in IEEE SP format in registers normally, so it > would require expensive extra code to allow such subregs. > > Whatever pass made these subregs is wrong, because they do not pass > recog(). > > > I do agree that those subregs should be allowed and that the above rev. is > > a strict improvement (given it removes a lot of "but allow special case X > > because target Y wants it" cases by simply allowing all of them). 
> > But the previous code seems to have papered over quite some backend
> > issues.
>
> It is not a good idea to allow all those things.  Most backends can
> only support a few combinations of them, and everything else results in
> *worse* machine code, best case, and more and more complicated (and
> more buggy!) backend code.
>
> But that is a code quality issue.  The current problem is that we have
> at least PR102211 and PR102154 (as well as reports elsewhere of bugs on
> other targets).  Code that used to work before doesn't work anymore,
> and we have no clear way out, no recommendation how to fix this and
> a) keep the same functionality without huge changes, and b) keep the
> same machine code quality.
>
> I do not think something like that can be done.  That is why I am
> asking for the patch to be reverted until all of the groundwork for it
> has been done.  That includes making generic testcases that show how
> such subregs behave, so that we can see in testresults what changes do
> to existing targets.

Richi,

As was mentioned in an earlier reply, the patch removed comments in the
code itself that explained why the code was necessary:

  /* ??? This should not be here.  Temporarily continue to allow
     word_mode subregs of anything.  The most common offender is
     (subreg:SI (reg:DF)).  Generally, backends are doing something
     sketchy but it'll take time to fix them all.  */

  /* ??? Similarly, e.g. with (subreg:DF (reg:TI)).  Though
     store_bit_field is the culprit here, and not the backends.  */

The sources of the problem are issues in backends and issues in the
common part of the compiler.  Removal of this functionality should start
with a preemptive analysis of the compiler to determine how this ugly
behavior is being utilized, and a plan to fix all backends and the
common code generation paths in the compiler, not a sweeping change that
flushes out problems.

The GCC regression policy places the burden on the developer whose patch
introduced the problem.  Liuhong did not test any other targets although
the code comments clearly stated that the behavior was required by many
targets, and he is not stepping forward to resolve the problems.

This patch should be reverted until the problems exposed by it can be
addressed in a careful and deliberate manner.

Thanks, David
Re: [PATCH] Fix SFmode subreg of DImode and TImode
On September 8, 2021 7:08:09 PM GMT+02:00, Segher Boessenkool wrote: >Hi! > >On Wed, Sep 08, 2021 at 08:42:44AM +0200, Richard Biener wrote: >> On Wed, Sep 8, 2021 at 1:08 AM Segher Boessenkool >> wrote: >> > The core of the problem is that subreg of pseudos has three meanings: >> > -- Paradoxical subregs; >> > -- Actual subregs; >> > -- "bit_cast" thingies: treat the same bits as something else. Like >> > looking at the bits of a float as its memory image. >> > >> > Ignoring paradoxical subregs (as well as subregs of mem, which should >> > have disappeared by now), and subregs of hard registers as well (those >> > have *different* semantics after all), the other two kinds can be mixed, >> > and *have to* be mixed, because subregs of subregs are non-canonical. >> > >> > Is there any reason why not to allow this kind of subreg? >> >> In fact the causing rev. in question >> (d2874d905647a1d146dafa60199d440e837adc4d) >> made all those subregs "valid" in terms of what validate_subreg is verifying >> and thus now the few places using validate_subreg to check whether some >> subreg is valid will now happily do float<->int converting subregs. > >Like in PR102154: > >error: unrecognizable insn: >(insn 11 10 12 2 (set (reg:SF 118 [ ]) >(subreg:SF (reg:DI 122) 0)) > >This generated valid code before, and now it is not anymore. If the >patch was meant to allow *more*, it failed that goal: it allows *less*. > >The issue is that we do not allow SF subregs in all cases. We do not >store SFmode quantities in IEEE SP format in registers normally, so it >would require expensive extra code to allow such subregs. > >Whatever pass made these subregs is wrong, because they do not pass >recog(). > >> I do agree that those subregs should be allowed and that the above rev. is >> a strict improvement (given it removes a lot of "but allow special case X >> because target Y wants it" cases by simply allowing all of them). 
> > But the previous code seems to have papered over quite some backend
> > issues.
>
> It is not a good idea to allow all those things.  Most backends can
> only support a few combinations of them, and everything else results in
> *worse* machine code, best case, and more and more complicated (and
> more buggy!) backend code.
>
> But that is a code quality issue.  The current problem is that we have
> at least PR102211 and PR102154 (as well as reports elsewhere of bugs on
> other targets).  Code that used to work before doesn't work anymore,
> and we have no clear way out, no recommendation how to fix this and
> a) keep the same functionality without huge changes, and b) keep the
> same machine code quality.
>
> I do not think something like that can be done.  That is why I am
> asking for the patch to be reverted until all of the groundwork for it
> has been done.  That includes making generic testcases that show how
> such subregs behave, so that we can see in testresults what changes do
> to existing targets.

Heh, I understood your earlier reply as saying that you supported the
change in principle, based on the fact that nested subregs are invalid.

Now, I don't think that validate_subreg is supposed to be the decision
maker on what a target allows.  For subregs of hardregs we seem to have
a good way of validating, but what do we have for subregs of pseudos?
Is it the passes generating the new unsupported subregs that should do
different things?  Should validate_subreg use a target hook to allow
those special casings we removed, which were all necessary just for
specific targets but apparently did not do any harm for other targets?

Can you give advice as to how to address the needs of the HFmode subregs
on x86, if not by adding another (generic) narrow exception in
validate_subreg?

That said, I fail to see a good way forward after now two apparently
failed attempts.

Btw, I'm fine with reverting the patch, but then what's the solution
here?

Richard.

> Segher
[committed] analyzer: fix ICE when discarding result of realloc [PR102225]
Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
Pushed to trunk as r12-3422-ge66b9f6779f46433b0e2c093b58403604ed131cc.

gcc/analyzer/ChangeLog:
	PR analyzer/102225
	* analyzer.h (compat_types_p): New decl.
	* constraint-manager.cc
	(constraint_manager::get_or_add_equiv_class): Guard against NULL
	type when checking for pointer types.
	* region-model-impl-calls.cc (region_model::impl_call_realloc):
	Guard against NULL lhs type/region.  Guard against the size value
	not being of a compatible type for dynamic extents.
	* region-model.cc (compat_types_p): Make non-static.

gcc/testsuite/ChangeLog:
	PR analyzer/102225
	* gcc.dg/analyzer/realloc-1.c (test_10): New.
	* gcc.dg/analyzer/torture/pr102225.c: New test.
---
 gcc/analyzer/analyzer.h                   |  2 +
 gcc/analyzer/constraint-manager.cc        |  9 ++---
 gcc/analyzer/region-model-impl-calls.cc   | 42 ++++++++++++++---------
 gcc/analyzer/region-model.cc              |  2 +-
 gcc/testsuite/gcc.dg/analyzer/realloc-1.c |  5 +++
 .../gcc.dg/analyzer/torture/pr102225.c    |  6 +++
 6 files changed, 47 insertions(+), 19 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/analyzer/torture/pr102225.c

diff --git a/gcc/analyzer/analyzer.h b/gcc/analyzer/analyzer.h
index 7ad1081ca6c..3ba4e21b015 100644
--- a/gcc/analyzer/analyzer.h
+++ b/gcc/analyzer/analyzer.h
@@ -203,6 +203,8 @@ private:
 
 extern location_t get_stmt_location (const gimple *stmt, function *fun);
 
+extern bool compat_types_p (tree src_type, tree dst_type);
+
 /* Passed by pointer to PLUGIN_ANALYZER_INIT callbacks.  */
 
 class plugin_analyzer_init_iface
diff --git a/gcc/analyzer/constraint-manager.cc b/gcc/analyzer/constraint-manager.cc
index dc65c8dd92b..6df23fb477e 100644
--- a/gcc/analyzer/constraint-manager.cc
+++ b/gcc/analyzer/constraint-manager.cc
@@ -2088,10 +2088,11 @@ constraint_manager::get_or_add_equiv_class (const svalue *sval)
 
   /* Convert all NULL pointers to (void *) to avoid state explosions
      involving all of the various (foo *)NULL vs (bar *)NULL.  */
-  if (POINTER_TYPE_P (sval->get_type ()))
-    if (tree cst = sval->maybe_get_constant ())
-      if (zerop (cst))
-        sval = m_mgr->get_or_create_constant_svalue (null_pointer_node);
+  if (sval->get_type ())
+    if (POINTER_TYPE_P (sval->get_type ()))
+      if (tree cst = sval->maybe_get_constant ())
+        if (zerop (cst))
+          sval = m_mgr->get_or_create_constant_svalue (null_pointer_node);
 
   /* Try svalue match.  */
   if (get_equiv_class_by_svalue (sval, &result))
diff --git a/gcc/analyzer/region-model-impl-calls.cc b/gcc/analyzer/region-model-impl-calls.cc
index 875719f9989..ff2ae9ca77d 100644
--- a/gcc/analyzer/region-model-impl-calls.cc
+++ b/gcc/analyzer/region-model-impl-calls.cc
@@ -540,11 +540,14 @@ region_model::impl_call_realloc (const call_details &cd)
     {
       /* Return NULL; everything else is unchanged.  */
       const call_details cd (get_call_details (model, ctxt));
-      const svalue *zero
-        = model->m_mgr->get_or_create_int_cst (cd.get_lhs_type (), 0);
-      model->set_value (cd.get_lhs_region (),
-                        zero,
-                        cd.get_ctxt ());
+      if (cd.get_lhs_type ())
+        {
+          const svalue *zero
+            = model->m_mgr->get_or_create_int_cst (cd.get_lhs_type (), 0);
+          model->set_value (cd.get_lhs_region (),
+                            zero,
+                            cd.get_ctxt ());
+        }
       return true;
     }
   };
@@ -575,11 +578,17 @@ region_model::impl_call_realloc (const call_details &cd)
       const svalue *ptr_sval = cd.get_arg_svalue (0);
       const svalue *size_sval = cd.get_arg_svalue (1);
       if (const region *buffer_reg = ptr_sval->maybe_get_region ())
-        model->set_dynamic_extents (buffer_reg, size_sval);
-      model->set_value (cd.get_lhs_region (), ptr_sval, cd.get_ctxt ());
-      const svalue *zero
-        = model->m_mgr->get_or_create_int_cst (cd.get_lhs_type (), 0);
-      return model->add_constraint (ptr_sval, NE_EXPR, zero, cd.get_ctxt ());
+        if (compat_types_p (size_sval->get_type (), size_type_node))
+          model->set_dynamic_extents (buffer_reg, size_sval);
+      if (cd.get_lhs_region ())
+        {
+          model->set_value (cd.get_lhs_region (), ptr_sval, cd.get_ctxt ());
+          const svalue *zero
+            = model->m_mgr->get_or_create_int_cst (cd.get_lhs_type (), 0);
+          return model->add_constraint (ptr_sval, NE_EXPR, zero, cd.get_ctxt ());
+        }
+      else
+        return true;
     }
   };
@@ -643,10 +652,15 @@ region_model::impl_call_realloc (const call_details &cd)
          and the new_ptr_sval as "nonnull".  */
       model->on_realloc_with_move (cd, old_ptr_sval, new_ptr_sval);
-      const svalue *zero
-        = model->m_mgr->get_or_create_int_cst (cd.get_lhs_type (), 0);
-      return model->add_constr
Re: [committed] analyzer: support "bifurcation"; reimplement realloc [PR99260]
On Fri, 3 Sep 2021, Gerald Pfeifer wrote:
> On Mon, 30 Aug 2021, David Malcolm via Gcc-patches wrote:
>> gcc/analyzer/ChangeLog:
>> 	PR analyzer/99260
>> 	* analyzer.h (class custom_edge_info): New class, adapted from
>> 	exploded_edge::custom_info_t.  Make member functions const.
>> 	Make update_model return bool, converting edge param from
>> 	reference to a pointer, and adding a ctxt param.
>> 	(class path_context): New class.
>> 	* call-info.cc: New file.
>> 	* call-info.h: New file.
>> 	* engine.cc: Include "analyzer/call-info.h" and .
>
> I believe it is this change that is causing bootstrap to fail with
>
> In file included from
> /scratch/tmp/gerald/GCC-HEAD/gcc/analyzer/engine.cc:69:
> In file included from /usr/include/c++/v1/memory:653:
> /usr/include/c++/v1/typeinfo:346:5: error: no member named
> 'fancy_abort' in namespace 'std::__1'; did you mean simply
> 'fancy_abort'?
> _VSTD::abort();
> ^~~
> /usr/include/c++/v1/__config:782:15: note: expanded from macro '_VSTD'
> #define _VSTD std::_LIBCPP_ABI_NAMESPACE

I'm now 99% certain this patch is the trigger and have also filed a bug
report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102242

Good news is that I have a prototype patch in testing, so if you have
not started yet I suggest waiting; I should have an update by tomorrow
morning.

Gerald
Re: [PATCH] Fix SFmode subreg of DImode and TImode
On Wed, Sep 08, 2021 at 08:39:31PM +0200, Richard Biener wrote:
> On September 8, 2021 7:08:09 PM GMT+02:00, Segher Boessenkool wrote:
> > It is not a good idea to allow all those things.  Most backends can
> > only support a few combinations of them, and everything else results
> > in *worse* machine code, best case, and more and more complicated
> > (and more buggy!) backend code.
> >
> > But that is a code quality issue.  The current problem is that we
> > have at least PR102211 and PR102154 (as well as reports elsewhere of
> > bugs on other targets).  Code that used to work before doesn't work
> > anymore, and we have no clear way out, no recommendation how to fix
> > this and a) keep the same functionality without huge changes, and
> > b) keep the same machine code quality.
> >
> > I do not think something like that can be done.  That is why I am
> > asking for the patch to be reverted until all of the groundwork for
> > it has been done.  That includes making generic testcases that show
> > how such subregs behave, so that we can see in testresults what
> > changes do to existing targets.
>
> Heh, I understood your earlier reply that you supported the change in
> principle based on the fact that nested subregs are invalid.

Ah.  No.  I meant to lament the fact that we use subregs for multiple
things, so that doing a bit_cast of a real subreg has to be expressed as
just one subreg, which is clearly sub-optimal for most backends.

I say *has to* because a subreg of a subreg is not valid RTL; it has to
be written as just one subreg.  Which makes things more confusing and
confused than this already non-trivial thing has to be.

> Now, I don't think that validate_subreg is supposed to be the decision
> maker on what a target allows.

Right, some target hook or macro or whatnot should.

> For subregs of hardregs we seem to have a good way of validating, but
> what do we have for subregs of pseudos?  Is it the passes generating
> the new unsupported subregs that should do different things?  Should
> validate_subreg use a target hook to allow those special casings we
> removed, which were all necessary just for specific targets but
> apparently did not do any harm for other targets?

Currently, we can disallow things in predicates and/or the instruction
conditions.

> Can you give advice as to how to address the needs of the HFmode
> subregs on x86 if not by adding another (generic) narrow exception in
> validate_subreg?

If I knew what the problem was, perhaps?  Is this explained in some mail
somewhere?

> That said, I fail to see a good way forward after now two apparently
> failed attempts.
>
> Btw, I'm fine reverting the patch but then what's the solution here?

I think we should (longer term) get rid of the overloaded meanings and
uses of subregs.  One fairly simple thing is to make a new rtx code
"bit_cast" (or is there a nice short more traditional name for it?)

But that is not the core problem we had here.  The behaviour of the
generic parts of the compiler was changed, without testing if that works
on other targets than x86.  That is an understandable mistake, it takes
some experience to know where the morasses are.  But this change should
have been accompanied by testcases exercising the changed code.  We
would have clearly seen there are issues then, simply by watching
gcc-testresults@ (and/or maintainers are on top of the test results
anyway).  Also, if there were testcases for this, we could have some
confidence that a change in this area is robust.


Segher


p.s.  Very unrelated...  Should we have __builtin_bit_cast for C as
well?  Is there any reason this could not work?
Re: [PATCH v5] Fix for powerpc64 long double complex divide failure
Ping

Note: I don't have commit privileges for gcc.

On 8/27/2021 2:59 PM, Patrick McGehearty via Gcc-patches wrote:
This revision (v5) adds a test in libgcc/libgcc2.c for when
"__LIBGCC_TF_MANT_DIG__ == 106" to use __LIBGCC_DF_EPSILON__ instead
of __LIBGCC_TF_EPSILON__.  That is specific to IBM 128-bit format long
doubles where EPSILON is very, very small and 1/EPSILON overflows to
infinity.  This change avoids the overflow without affecting any other
platform.  Discussion in the patch is adjusted to reflect this
limitation.

It does not make any changes to .../rs6000/_divkc3.c, leaving it to use
__LIBGCC_KF__*.  That means the upstream gcc will not build in older
IBM environments that do not recognize the KF floating point mode
properly.  Environments that do not need IBM long double support do
build cleanly.

- - - -

This patch addresses the failure of powerpc64 long double complex
divide in native IBM long double format after the patch "Practical
improvement to libgcc complex divide".

The new code uses the following macros, which are intended to be mapped
to appropriate values according to the underlying hardware
representation.  See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101104

  RBIG      a value near the maximum representation
  RMIN      a value near the minimum representation
            (but not in the subnormal range)
  RMIN2     a value moderately less than 1
  RMINSCAL  the inverse of RMIN2
  RMAX2     RBIG * RMIN2 - a value to limit scaling to not overflow

When "long double" values were not using the IEEE 128-bit format but
the traditional IBM 128-bit, the previous code used the LDBL values,
which caused overflow for RMINSCAL.  The new code uses the DBL values.

  RBIG      LDBL_MAX = 0x1.f800p+1022    DBL_MAX = 0x1.f000p+1022
  RMIN      LDBL_MIN = 0x1.p-969         DBL_MIN = 0x1.p-1022
  RMIN2     LDBL_EPSILON = 0x0.1000p-1022 = 0x1.0p-1074
            DBL_EPSILON = 0x1.p-52
  RMINSCAL  1/LDBL_EPSILON = inf (1.0p+1074 does not fit in IBM 128-bit)
            1/DBL_EPSILON = 0x1.p+52
  RMAX2     RBIG * RMIN2 = 0x1.f800p-52
            RBIG * RMIN2 = 0x1.f000p+970

The MAX and MIN values have only modest changes since the maximum and
minimum values are about the same as for double precision.  The EPSILON
field is considerably different.  Due to how very small values can be
represented in the lower 64 bits of the IBM 128-bit floating point,
EPSILON is extremely small, so far beyond the desired value that
inversion of the value overflows, and even without the overflow, the
RMAX2 is so small as to eliminate most usage of the test.

The change has been tested on gcc135.fsffrance.org and gains the
expected improvements in accuracy for long double complex divide.

libgcc/
	PR target/101104
	* libgcc2.c (RMIN2, RMINSCAL, RMAX2): Use more correct values for
	native IBM 128-bit.
---
 libgcc/libgcc2.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
index 38f935e..bf45576 100644
--- a/libgcc/libgcc2.c
+++ b/libgcc/libgcc2.c
@@ -1906,8 +1906,13 @@ NAME (TYPE x, int m)
 # define NOTRUNC (!__LIBGCC_TF_EXCESS_PRECISION__)
 # define RBIG (__LIBGCC_TF_MAX__ / 2)
 # define RMIN (__LIBGCC_TF_MIN__)
-# define RMIN2 (__LIBGCC_TF_EPSILON__)
-# define RMINSCAL (1 / __LIBGCC_TF_EPSILON__)
+# if __LIBGCC_TF_MANT_DIG__ == 106
+#  define RMIN2 (__LIBGCC_DF_EPSILON__)
+#  define RMINSCAL (1 / __LIBGCC_DF_EPSILON__)
+# else
+#  define RMIN2 (__LIBGCC_TF_EPSILON__)
+#  define RMINSCAL (1 / __LIBGCC_TF_EPSILON__)
+# endif
 # define RMAX2 (RBIG * RMIN2)
 #else
 # error
Re: [PATCH] c++: Fix docs on assignment of virtual bases [PR60318]
Ping (and remember to CC a maintainer this time).

On 31/08/21 09:53 +0100, Jonathan Wakely wrote:
The description of behaviour is incorrect, the virtual base gets
assigned before entering the bodies of A::operator= and B::operator=,
not after.  The example is also ill-formed (passing a string literal
to char*) and undefined (missing return from Base::operator=).

Signed-off-by: Jonathan Wakely

gcc/ChangeLog:
	PR c++/60318
	* doc/trouble.texi (Copy Assignment): Fix description of
	behaviour and fix code in example.

OK for trunk and all active branches?

commit f0fa91971e35c1df7381ac289ece9bd5f07f8535
Author: Jonathan Wakely
Date:   Tue Aug 31 09:46:41 2021

    c++: Fix docs on assignment of virtual bases [PR60318]

    The description of behaviour is incorrect, the virtual base gets
    assigned before entering the bodies of A::operator= and
    B::operator=, not after.  The example is also ill-formed (passing
    a string literal to char*) and undefined (missing return from
    Base::operator=).

    Signed-off-by: Jonathan Wakely

    gcc/ChangeLog:
	PR c++/60318
	* doc/trouble.texi (Copy Assignment): Fix description of
	behaviour and fix code in example.

diff --git a/gcc/doc/trouble.texi b/gcc/doc/trouble.texi
index 40c51ae21cb..8b34be4aa63 100644
--- a/gcc/doc/trouble.texi
+++ b/gcc/doc/trouble.texi
@@ -865,10 +865,11 @@ objects behave unspecified when being assigned.  For example:
 @smallexample
 struct Base@{
   char *name;
-  Base(char *n) : name(strdup(n))@{@}
+  Base(const char *n) : name(strdup(n))@{@}
   Base& operator= (const Base& other)@{
     free (name);
     name = strdup (other.name);
+    return *this;
   @}
 @};
@@ -901,8 +902,8 @@ inside @samp{func} in the example).  G++ implements the ``intuitive''
 algorithm for copy-assignment: assign all direct bases, then assign all
 members.  In that algorithm, the virtual base subobject can be
 encountered more than once.  In the example, copying
-proceeds in the following order: @samp{val}, @samp{name} (via
-@code{strdup}), @samp{bval}, and @samp{name} again.
+proceeds in the following order: @samp{name} (via @code{strdup}),
+@samp{val}, @samp{name} again, and @samp{bval}.
 
 If application code relies on copy-assignment, a user-defined
 copy-assignment operator removes any uncertainties.  With such an
Re: [PATCH] c++: Fix docs on assignment of virtual bases [PR60318]
On 9/8/21 3:52 PM, Jonathan Wakely via Gcc-patches wrote:
Ping (and remember to CC a maintainer this time).

OK, thanks.

On 31/08/21 09:53 +0100, Jonathan Wakely wrote:
The description of behaviour is incorrect, the virtual base gets
assigned before entering the bodies of A::operator= and B::operator=,
not after.  The example is also ill-formed (passing a string literal
to char*) and undefined (missing return from Base::operator=).

Signed-off-by: Jonathan Wakely

gcc/ChangeLog:
	PR c++/60318
	* doc/trouble.texi (Copy Assignment): Fix description of
	behaviour and fix code in example.

OK for trunk and all active branches?

commit f0fa91971e35c1df7381ac289ece9bd5f07f8535
Author: Jonathan Wakely
Date:   Tue Aug 31 09:46:41 2021

    c++: Fix docs on assignment of virtual bases [PR60318]

    The description of behaviour is incorrect, the virtual base gets
    assigned before entering the bodies of A::operator= and
    B::operator=, not after.  The example is also ill-formed (passing
    a string literal to char*) and undefined (missing return from
    Base::operator=).

    Signed-off-by: Jonathan Wakely

    gcc/ChangeLog:
	PR c++/60318
	* doc/trouble.texi (Copy Assignment): Fix description of
	behaviour and fix code in example.

diff --git a/gcc/doc/trouble.texi b/gcc/doc/trouble.texi
index 40c51ae21cb..8b34be4aa63 100644
--- a/gcc/doc/trouble.texi
+++ b/gcc/doc/trouble.texi
@@ -865,10 +865,11 @@ objects behave unspecified when being assigned.  For example:
 @smallexample
 struct Base@{
   char *name;
-  Base(char *n) : name(strdup(n))@{@}
+  Base(const char *n) : name(strdup(n))@{@}
   Base& operator= (const Base& other)@{
     free (name);
     name = strdup (other.name);
+    return *this;
   @}
 @};
@@ -901,8 +902,8 @@ inside @samp{func} in the example).  G++ implements the ``intuitive''
 algorithm for copy-assignment: assign all direct bases, then assign all
 members.  In that algorithm, the virtual base subobject can be
 encountered more than once.  In the example, copying
-proceeds in the following order: @samp{val}, @samp{name} (via
-@code{strdup}), @samp{bval}, and @samp{name} again.
+proceeds in the following order: @samp{name} (via @code{strdup}),
+@samp{val}, @samp{name} again, and @samp{bval}.
 
 If application code relies on copy-assignment, a user-defined
 copy-assignment operator removes any uncertainties.  With such an
Re: [PATCH] warn for more impossible null pointer tests [PR102103]
On 9/2/21 7:53 PM, Martin Sebor wrote:

@@ -4622,28 +4622,94 @@ warn_for_null_address (location_t location, tree op, tsubst_flags_t complain)
   if (!warn_address
       || (complain & tf_warning) == 0
       || c_inhibit_evaluation_warnings != 0
-      || warning_suppressed_p (op, OPT_Waddress))
+      || warning_suppressed_p (op, OPT_Waddress)
+      || processing_template_decl != 0)

Completely suppressing this warning in templates seems like a
regression; I'd think we could recognize many relevant cases before
instantiation.  You just can't assume that ADDR_EXPR has the default
meaning if it has unknown type (i.e. because op0 is type-dependent).

Jason
[PING][PATCH] rs6000: Disable optimizing multiple xxsetaccz instructions into one xxsetaccz
I'd like to ping the following MMA patch:

https://gcc.gnu.org/pipermail/gcc-patches/2021-August/578288.html
Message-ID: <8393a33f-50ab-6720-0017-3f012803b...@linux.ibm.com>

Peter

Fwprop will happily optimize two xxsetaccz instructions into one
xxsetaccz by propagating the results of the first to the uses of the
second.  We really don't want that to happen given the late
priming/depriming of accumulators.  I fixed this by making the
xxsetaccz source operand an unspec volatile.  I also removed the
mma_xxsetaccz define_expand and define_insn_and_split and replaced them
with a simple define_insn.  The expand and splitter patterns were
leftovers from the pre-opaque-mode code, when the xxsetaccz code was
part of the movpxi pattern, and we don't need them now.

Rather than a new test case, I was able to just modify the current test
case to add another __builtin_mma_xxsetaccz call, which shows the bad
code gen with unpatched compilers.

This passed bootstrap on powerpc64le-linux with no regressions.  Ok for
trunk?  We'll need this for sure in GCC 11.  Ok there too after some
trunk burn-in time?  GCC 10 suffers from the same issue, but since the
code is different, I'll have to determine a different solution, which
I'll post as a separate patch.

gcc/
	* config/rs6000/mma.md (unspec): Delete UNSPEC_MMA_XXSETACCZ.
	(unspecv): Add UNSPECV_MMA_XXSETACCZ.
	(*mma_xxsetaccz): Delete.
	(mma_xxsetaccz): Change to define_insn.  Remove match_operand.
	Use UNSPECV_MMA_XXSETACCZ.
	* config/rs6000/rs6000.c (rs6000_rtx_costs): Use
	UNSPECV_MMA_XXSETACCZ.

gcc/testsuite/
	* gcc.target/powerpc/mma-builtin-6.c: Add second call to
	xxsetacc built-in.  Update instruction counts.
Re: [PATCH] Fix SFmode subreg of DImode and TImode
On Thu, Sep 9, 2021 at 3:17 AM Segher Boessenkool wrote:
>
> On Wed, Sep 08, 2021 at 08:39:31PM +0200, Richard Biener wrote:
> > On September 8, 2021 7:08:09 PM GMT+02:00, Segher Boessenkool wrote:
> > > It is not a good idea to allow all those things.  Most backends
> > > can only support a few combinations of them, and everything else
> > > results in *worse* machine code, best case, and more and more
> > > complicated (and more buggy!) backend code.
> > >
> > > But that is a code quality issue.  The current problem is that we
> > > have at least PR102211 and PR102154 (as well as reports elsewhere
> > > of bugs on other targets).  Code that used to work before doesn't
> > > work anymore, and we have no clear way out, no recommendation how
> > > to fix this and a) keep the same functionality without huge
> > > changes, and b) keep the same machine code quality.
> > >
> > > I do not think something like that can be done.  That is why I am
> > > asking for the patch to be reverted until all of the groundwork
> > > for it has been done.  That includes making generic testcases that
> > > show how such subregs behave, so that we can see in testresults
> > > what changes do to existing targets.
> >
> > Heh, I understood your earlier reply that you supported the change
> > in principle based on the fact that nested subregs are invalid.
>
> Ah.  No.  I meant to lament the fact that we use subregs for multiple
> things, so that doing a bit_cast of a real subreg has to be expressed
> as just one subreg, which is clearly sub-optimal for most backends.
>
> I say *has to* because a subreg of a subreg is not valid RTL; it has
> to be written as just one subreg.  Which makes things more confusing
> and confused than this already non-trivial thing has to be.
>
> > Now, I don't think that validate_subreg is supposed to be the
> > decision maker on what a target allows.
Since there is no good solution yet, I think we can add a target hook
(targetm.validate_subreg) to recover the original removed code as the
default behavior, and then x86 itself defines the hook as empty
(meaning that the backend allows all kinds of subregs).  Something
like:

@@ -922,6 +922,9 @@ validate_subreg (machine_mode omode, machine_mode imode,
 
   poly_uint64 regsize = REGMODE_NATURAL_SIZE (imode);
 
+  if (!targetm.validate_subreg (omode, imode, reg, offset))
+    return false;
+
   /* Paradoxical subregs must have offset zero.  */
   if (maybe_gt (osize, isize))
     return known_eq (offset, 0U);

> Right, some target hook or macro or whatnot should.
>
> > For subregs of hardregs we seem to have a good way of validating,
> > but what do we have for subregs of pseudos?  Is it the passes
> > generating the new unsupported subregs that should do different
> > things?  Should validate_subreg use a target hook to allow those
> > special casings we removed, which were all necessary just for
> > specific targets but apparently did not do any harm for other
> > targets?
>
> Currently, we can disallow things in predicates and/or the instruction
> conditions.
>
> > Can you give advice as to how to address the needs of the HFmode
> > subregs on x86 if not by adding another (generic) narrow exception
> > in validate_subreg?
>
> If I knew what the problem was, perhaps?  Is this explained in some
> mail somewhere?
>
> > That said, I fail to see a good way forward after now two apparently
> > failed attempts.
> >
> > Btw, I'm fine reverting the patch but then what's the solution here?
>
> I think we should (longer term) get rid of the overloaded meanings and
> uses of subregs.  One fairly simple thing is to make a new rtx code
> "bit_cast" (or is there a nice short more traditional name for it?)
>
> But that is not the core problem we had here.  The behaviour of the
> generic parts of the compiler was changed, without testing if that
> works on other targets than x86.
> That is an understandable mistake, it takes some experience to know
> where the morasses are.  But this change should have been accompanied
> by testcases exercising the changed code.  We would have clearly seen
> there are issues then, simply by watching gcc-testresults@ (and/or
> maintainers are on top of the test results anyway).  Also, if there
> were testcases for this, we could have some confidence that a change
> in this area is robust.
>
> Segher
>
> p.s.  Very unrelated...  Should we have __builtin_bit_cast for C as
> well?  Is there any reason this could not work?

--
BR,
Hongtao
Re: [PATCH] Fix SFmode subreg of DImode and TImode
On Thu, Sep 9, 2021 at 3:17 AM Segher Boessenkool wrote:
>
> On Wed, Sep 08, 2021 at 08:39:31PM +0200, Richard Biener wrote:
> > On September 8, 2021 7:08:09 PM GMT+02:00, Segher Boessenkool wrote:
> > > It is not a good idea to allow all those things.  Most backends
> > > can only support a few combinations of them, and everything else
> > > results in *worse* machine code, best case, and more and more
> > > complicated (and more buggy!) backend code.
> > >
> > > But that is a code quality issue.  The current problem is that we
> > > have at least PR102211 and PR102154 (as well as reports elsewhere
> > > of bugs on other targets).  Code that used to work before doesn't
> > > work anymore, and we have no clear way out, no recommendation how
> > > to fix this and a) keep the same functionality without huge
> > > changes, and b) keep the same machine code quality.
> > >
> > > I do not think something like that can be done.  That is why I am
> > > asking for the patch to be reverted until all of the groundwork
> > > for it has been done.  That includes making generic testcases that
> > > show how such subregs behave, so that we can see in testresults
> > > what changes do to existing targets.
> >
> > Heh, I understood your earlier reply that you supported the change
> > in principle based on the fact that nested subregs are invalid.
>
> Ah.  No.  I meant to lament the fact that we use subregs for multiple
> things, so that doing a bit_cast of a real subreg has to be expressed
> as just one subreg, which is clearly sub-optimal for most backends.
>
> I say *has to* because a subreg of a subreg is not valid RTL; it has
> to be written as just one subreg.  Which makes things more confusing
> and confused than this already non-trivial thing has to be.
>
> > Now, I don't think that validate_subreg is supposed to be the
> > decision maker on what a target allows.
>
> Right, some target hook or macro or whatnot should.
> > For subregs of hardregs we seem to have a good way of validating,
> > but what do we have for subregs of pseudos?  Is it the passes
> > generating the new unsupported subregs that should do different
> > things?  Should validate_subreg use a target hook to allow those
> > special casings we removed, which were all necessary just for
> > specific targets but apparently did not do any harm for other
> > targets?
>
> Currently, we can disallow things in predicates and/or the instruction
> conditions.
>
> > Can you give advice as to how to address the needs of the HFmode
> > subregs on x86 if not by adding another (generic) narrow exception
> > in validate_subreg?
>
> If I knew what the problem was, perhaps?  Is this explained in some
> mail somewhere?

It's at [1].  A target hook may not be the best solution: if other
targets support HFmode, they may encounter the same problem and have to
revisit it again.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576827.html

> > That said, I fail to see a good way forward after now two apparently
> > failed attempts.
> >
> > Btw, I'm fine reverting the patch but then what's the solution here?
>
> I think we should (longer term) get rid of the overloaded meanings and
> uses of subregs.  One fairly simple thing is to make a new rtx code
> "bit_cast" (or is there a nice short more traditional name for it?)
>
> But that is not the core problem we had here.  The behaviour of the
> generic parts of the compiler was changed, without testing if that
> works on other targets than x86.
> > > Segher > > > p.s. Very unrelated... Should we have __builtin_bit_cast for C as well? > Is there any reason this could not work? -- BR, Hongtao
Re: [RFC] Don't move cold code out of loop by checking bb count
On 2021/8/26 19:33, Richard Biener wrote: On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo wrote: Hi, On 2021/8/6 20:15, Richard Biener wrote: On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo wrote: There was a patch trying to avoid moving cold blocks out of loops: https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html Richard suggested to "never hoist anything from a bb with lower execution frequency to a bb with higher one in LIM invariantness_dom_walker before_dom_children". This patch does this profile count check in both gimple LIM move_computations_worker and RTL loop-invariant.c find_invariants_bb: if the loop bb is colder than the loop preheader, don't hoist it out of the loop. Also, the profile count in the loop split pass should be corrected to avoid lim2 and lim4 mismatch behavior; currently, the new loop preheader generated by loop_version is set to "[count: 0]:", then lim4 after the lsplt pass will move statements out of the loop unexpectedly when lim2 didn't move them. This change could fix the regression on 544.nab_r from -1.55% to +0.46%. SPEC2017 performance evaluation shows 1% performance improvement for intrate GEOMEAN and no obvious regression for others. Especially, 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is largely improved.), and 548.exchange2_r +1.98%, 526.blender_r +1.00% on P8LE. Regression and bootstrap tests pass on P8LE, any comments? Thanks. While I'm not familiar with the RTL invariant motion pass the patch there looks reasonable. Note that we should assess the profile quality somehow - I'm not sure how to do that, CCed Honza for that. Thanks. For the GIMPLE part the patch looks quite complicated - but note it probably has to be since LIM performs kind of a "CSE" on loads (and stores for store-motion), so when there are multiple stmts affected by a hoisting decision the biggest block count has to be accounted. 
Likewise when there are dependent stmts involved that might include conditional stmts (a "PHI"), but the overall cost should be looked at. Currently, the GIMPLE code checks two situations with the patch: 1) The statement or PHI's BB is *colder* than the preheader: don't move it out of the loop; 2) The statement or PHI's BB is *hotter* than the preheader, but some of its RHS operands couldn't be moved out of the loop: also don't move it out of the loop, to avoid a definition-does-not-dominate-use error. But part 2) is obviously already done. What I tried to say is your heuristic doesn't integrate nicely with the pass but I admitted that it might be a bit difficult to find a place to add this heuristic. There is lim_data->cost which we could bias negatively but then this is a cost that is independent of the hoisting distance. But doing this would work at least for the case where the immediately enclosing loop preheader is hotter than the stmt and with this it would be a patch that's similarly simple as the RTL one. Another possibility is to simply only adjust PHI processing in compute_invariantness, capping movement according to the hotness heuristic. The same could be done for regular stmts there but I'm not sure that will do good in the end since this function is supposed to compute "correctness" (well, it also has the cost stuff), and it's not the place to do overall cost considerations. Thanks. I found that adding a function find_coldest_out_loop and checking it in outermost_invariant_loop to find the coldest invariant loop between the outermost loop and itself could also achieve the purpose. Then the gimple code check is redundant and could be removed. Maybe I could collect the number of instructions not hoisted with the patch on regression tests and SPEC2017 to do an estimation of how "multiple stmts affected" and "overall cost" need to be considered? But it seems move_computations_worker couldn't roll back if we still want to hoist multiple stmts out during the iterations? 
Now - GIMPLE LIM "costing" is somewhat backward right now and it isn't set up to consider those multiple involved stmts. Plus the store-motion part does not have any cost part (but it depends on previously decided invariant motions). I think the way you implemented the check will cause no hoisting to be performed instead of, say, hoisting to a different loop level only. Possibly shown when you consider a loop nest like for (;;) if (unlikely_cond) for (;;) invariant; we want to hoist 'invariant' but only from the inner loop even if it is invariant also in the outer loop. For this case, theoretically I think master GCC will optimize it to: invariant; for (;;) if (unlikely_cond) for (;;) ; 'invariant' is moved out of the outer loop, but with the patch, it will get: for (;;) if (unlikely_cond) { invariant; for (;;) ; } 'invariant' is *cold* for the outer loop, but it is still *hot* for the inner loop, so hoisting it out of the inner loop is exactly what we want, right? Yes. I had doubts your p
[PATCH] x86: Add TARGET_AVX256_[MOVE|STORE]_BY_PIECES
1. Add TARGET_AVX256_MOVE_BY_PIECES to perform move by-pieces operation with 256-bit AVX instructions. 2. Add TARGET_AVX256_STORE_BY_PIECES to perform move and store by-pieces operations with 256-bit AVX instructions. They are enabled only for Intel Alder Lake and Intel processors with AVX512. gcc/ PR target/101935 * config/i386/i386.h (TARGET_AVX256_MOVE_BY_PIECES): New. (TARGET_AVX256_STORE_BY_PIECES): Likewise. (MOVE_MAX): Check TARGET_AVX256_MOVE_BY_PIECES and TARGET_AVX256_STORE_BY_PIECES instead of TARGET_AVX256_SPLIT_UNALIGNED_LOAD and TARGET_AVX256_SPLIT_UNALIGNED_STORE. (STORE_MAX_PIECES): Check TARGET_AVX256_STORE_BY_PIECES instead of TARGET_AVX256_SPLIT_UNALIGNED_STORE. * config/i386/x86-tune.def (X86_TUNE_AVX256_MOVE_BY_PIECES): New. (X86_TUNE_AVX256_STORE_BY_PIECES): Likewise. gcc/testsuite/ PR target/101935 * g++.target/i386/pr80566-1.C: Add -mtune-ctrl=avx256_store_by_pieces. * gcc.target/i386/pr100865-4a.c: Likewise. * gcc.target/i386/pr100865-10a.c: Likewise. * gcc.target/i386/pr90773-20.c: Likewise. * gcc.target/i386/pr90773-21.c: Likewise. * gcc.target/i386/pr90773-22.c: Likewise. * gcc.target/i386/pr90773-23.c: Likewise. * g++.target/i386/pr80566-2.C: Add -mtune-ctrl=avx256_move_by_pieces. * gcc.target/i386/eh_return-1.c: Likewise. * gcc.target/i386/pr90773-26.c: Likewise. * gcc.target/i386/pieces-memcpy-12.c: Replace -mtune=haswell with -mtune-ctrl=avx256_move_by_pieces. * gcc.target/i386/pieces-memcpy-15.c: Likewise. * gcc.target/i386/pieces-memset-2.c: Replace -mtune=haswell with -mtune-ctrl=avx256_store_by_pieces. * gcc.target/i386/pieces-memset-5.c: Likewise. * gcc.target/i386/pieces-memset-11.c: Likewise. * gcc.target/i386/pieces-memset-14.c: Likewise. * gcc.target/i386/pieces-memset-20.c: Likewise. * gcc.target/i386/pieces-memset-23.c: Likewise. * gcc.target/i386/pieces-memset-29.c: Likewise. * gcc.target/i386/pieces-memset-30.c: Likewise. * gcc.target/i386/pieces-memset-33.c: Likewise. * gcc.target/i386/pieces-memset-34.c: Likewise. 
* gcc.target/i386/pieces-memset-44.c: Likewise. * gcc.target/i386/pieces-memset-37.c: Replace -mtune=generic with -mtune-ctrl=avx256_store_by_pieces. --- gcc/config/i386/i386.h | 10 +++--- gcc/config/i386/x86-tune.def | 11 +++ gcc/testsuite/g++.target/i386/pr80566-1.C| 2 +- gcc/testsuite/g++.target/i386/pr80566-2.C| 2 +- gcc/testsuite/gcc.target/i386/eh_return-1.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memcpy-12.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memcpy-15.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-11.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-14.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-2.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-20.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-23.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-29.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-30.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-33.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-34.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-37.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-44.c | 2 +- gcc/testsuite/gcc.target/i386/pieces-memset-5.c | 2 +- gcc/testsuite/gcc.target/i386/pr100865-10a.c | 2 +- gcc/testsuite/gcc.target/i386/pr100865-4a.c | 2 +- gcc/testsuite/gcc.target/i386/pr90773-20.c | 2 +- gcc/testsuite/gcc.target/i386/pr90773-21.c | 2 +- gcc/testsuite/gcc.target/i386/pr90773-22.c | 2 +- gcc/testsuite/gcc.target/i386/pr90773-23.c | 2 +- gcc/testsuite/gcc.target/i386/pr90773-26.c | 2 +- 26 files changed, 42 insertions(+), 27 deletions(-) diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h index f671dae9236..9d45fe9611b 100644 --- a/gcc/config/i386/i386.h +++ b/gcc/config/i386/i386.h @@ -403,6 +403,10 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST]; ix86_tune_features[X86_TUNE_AVOID_LEA_FOR_ADDR] #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \ ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL] +#define TARGET_AVX256_MOVE_BY_PIECES \ + 
ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES] +#define TARGET_AVX256_STORE_BY_PIECES \ + ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES] #define TARGET_AVX256_SPLIT_REGS \ ix86_tune_features[X86_TUNE_AVX256_SPLIT_REGS] #define TARGET_GENERAL_REGS_SSE_SPILL \ @@ -1781,8 +1785,8 @@ typedef struct ix86_args { ? 64 \ : ((TARGET_AVX \ && !TARGET_PREFER_AVX128 \ - && !TARGET_AVX256_SPLIT_UNALIGNED_LOAD \ - && !TARGET_AVX256_SPLIT_
Re: [PATCH] x86: Add TARGET_AVX256_[MOVE|STORE]_BY_PIECES
On Thu, Sep 9, 2021 at 11:21 AM H.J. Lu via Gcc-patches wrote: > > 1. Add TARGET_AVX256_MOVE_BY_PIECES to perform move by-pieces operation > with 256-bit AVX instructions. > 2. Add TARGET_AVX256_STORE_BY_PIECES to perform move and store by-pieces > operations with 256-bit AVX instructions. > > They are enabled only for Intel Alder Lake and Intel processors with > AVX512. This patch fixes the regression on znver2 and kabylake. Although we don't know why moves with ymm would behave differently on non-avx512 (except for Alder Lake) and avx512 machines, the microarchitecture tuning seems like a reasonable choice. I'll install this patch if there are no objections in the next 72 hours. > > gcc/ > > PR target/101935 > * config/i386/i386.h (TARGET_AVX256_MOVE_BY_PIECES): New. > (TARGET_AVX256_STORE_BY_PIECES): Likewise. > (MOVE_MAX): Check TARGET_AVX256_MOVE_BY_PIECES and > TARGET_AVX256_STORE_BY_PIECES instead of > TARGET_AVX256_SPLIT_UNALIGNED_LOAD and > TARGET_AVX256_SPLIT_UNALIGNED_STORE. > (STORE_MAX_PIECES): Check TARGET_AVX256_STORE_BY_PIECES instead > of TARGET_AVX256_SPLIT_UNALIGNED_STORE. > * config/i386/x86-tune.def (X86_TUNE_AVX256_MOVE_BY_PIECES): New. > (X86_TUNE_AVX256_STORE_BY_PIECES): Likewise. > > gcc/testsuite/ > > PR target/101935 > * g++.target/i386/pr80566-1.C: Add > -mtune-ctrl=avx256_store_by_pieces. > * gcc.target/i386/pr100865-4a.c: Likewise. > * gcc.target/i386/pr100865-10a.c: Likewise. > * gcc.target/i386/pr90773-20.c: Likewise. > * gcc.target/i386/pr90773-21.c: Likewise. > * gcc.target/i386/pr90773-22.c: Likewise. > * gcc.target/i386/pr90773-23.c: Likewise. > * g++.target/i386/pr80566-2.C: Add > -mtune-ctrl=avx256_move_by_pieces. > * gcc.target/i386/eh_return-1.c: Likewise. > * gcc.target/i386/pr90773-26.c: Likewise. > * gcc.target/i386/pieces-memcpy-12.c: Replace -mtune=haswell > with -mtune-ctrl=avx256_move_by_pieces. > * gcc.target/i386/pieces-memcpy-15.c: Likewise. 
> * gcc.target/i386/pieces-memset-2.c: Replace -mtune=haswell > with -mtune-ctrl=avx256_store_by_pieces. > * gcc.target/i386/pieces-memset-5.c: Likewise. > * gcc.target/i386/pieces-memset-11.c: Likewise. > * gcc.target/i386/pieces-memset-14.c: Likewise. > * gcc.target/i386/pieces-memset-20.c: Likewise. > * gcc.target/i386/pieces-memset-23.c: Likewise. > * gcc.target/i386/pieces-memset-29.c: Likewise. > * gcc.target/i386/pieces-memset-30.c: Likewise. > * gcc.target/i386/pieces-memset-33.c: Likewise. > * gcc.target/i386/pieces-memset-34.c: Likewise. > * gcc.target/i386/pieces-memset-44.c: Likewise. > * gcc.target/i386/pieces-memset-37.c: Replace -mtune=generic > with -mtune-ctrl=avx256_store_by_pieces. > --- > gcc/config/i386/i386.h | 10 +++--- > gcc/config/i386/x86-tune.def | 11 +++ > gcc/testsuite/g++.target/i386/pr80566-1.C| 2 +- > gcc/testsuite/g++.target/i386/pr80566-2.C| 2 +- > gcc/testsuite/gcc.target/i386/eh_return-1.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memcpy-12.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memcpy-15.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-11.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-14.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-2.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-20.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-23.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-29.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-30.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-33.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-34.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-37.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-44.c | 2 +- > gcc/testsuite/gcc.target/i386/pieces-memset-5.c | 2 +- > gcc/testsuite/gcc.target/i386/pr100865-10a.c | 2 +- > gcc/testsuite/gcc.target/i386/pr100865-4a.c | 2 +- > gcc/testsuite/gcc.target/i386/pr90773-20.c | 2 +- > gcc/testsuite/gcc.target/i386/pr90773-21.c | 2 +- > 
gcc/testsuite/gcc.target/i386/pr90773-22.c | 2 +- > gcc/testsuite/gcc.target/i386/pr90773-23.c | 2 +- > gcc/testsuite/gcc.target/i386/pr90773-26.c | 2 +- > 26 files changed, 42 insertions(+), 27 deletions(-) > > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h > index f671dae9236..9d45fe9611b 100644 > --- a/gcc/config/i386/i386.h > +++ b/gcc/config/i386/i386.h > @@ -403,6 +403,10 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST]; > ix86_tune_features[X86_TUNE_AVOID_LEA_FOR_ADDR] > #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \ > ix86_tune_features[X86_TUNE_SOFTWARE
Re: [PATCH] Fix SFmode subreg of DImode and TImode
On Wed, Sep 8, 2021 at 9:18 PM Segher Boessenkool wrote: > > On Wed, Sep 08, 2021 at 08:39:31PM +0200, Richard Biener wrote: > > On September 8, 2021 7:08:09 PM GMT+02:00, Segher Boessenkool > > wrote: > > >It is not a good idea to allow all those things. Most backends can > > >only support a few combinations of them, and everything else results in > > >*worse* machine code, best case, and more and more complicated (and more > > >buggy!) backend code. > > > > > >But that is a code quality issue. The current problem is that we have > > >at least PR102211 and PR102154 (as well as reports elsewhere of bugs on > > >other targets). Code that used to work before doesn't anymore, and we have no > > >clear way out, no recommendation how to fix this and a) keep the same > > >functionality without huge changes, and b) keep the same machine code > > >quality. > > > > > >I do not think something like that can be done. That is why I am asking > > >for the patch to be reverted until all of the groundwork for it has been > > >done. That includes making generic testcases that show how such subregs > > >behave, so that we can see in testresults what changes do to existing > > >targets. > > > > Heh, I understood your earlier reply that you supported the change in > > principle based on the fact that nested subregs are invalid. > > Ah. No. I meant to lament the fact that we use subregs for multiple > things, so that doing a bit_cast of a real subreg has to be expressed as > just one subreg, which is clearly sub-optimal for most backends. > > I say *has to* because a subreg of a subreg is not valid RTL; it has to > be written as just one subreg. Which makes things more confusing and > confused than this already non-trivial thing has to be. > > > Now, I don't think that validate_subreg is supposed to be the decision > > maker on what a target allows. > > Right, some target hook or macro or whatnot should. 
> > > For subregs of hardregs we seem to have a good way of validating, but > > what do we have for subregs of pseudos? Is it the passes generating > > the new unsupported subregs that should do different things? Should > > validate_subreg use a target hook to allow those special casings we > > removed which all were necessary just for specific targets but > > apparently did not do any harm for other targets? > > Currently, we can disallow things in predicates and/or the instruction > conditions. > > > Can you give advice as to how to address the needs of the HFmode subregs on > > x86 if not by adding another (generic) narrow exception in validate_subreg? > > If I knew what the problem was, perhaps? Is this explained in some mail > somewhere? > > > That said, I fail to see a good way forward after now two apparently > > failed attempts. > > > > Btw, I'm fine reverting the patch but then what's the solution here? > > I think we should (longer term) get rid of the overloaded meanings and > uses of subregs. One fairly simple thing is to make a new rtx code > "bit_cast" (or is there a nice short more traditional name for it?) But subreg _is_ bit_cast. What is odd to me is that a "disallowed" subreg like (subreg:SF (reg:TI ..) 0) magically becomes valid (in terms of validate_subreg) if you rewrite it as (subreg:SF (subreg:SI (reg:TI ..) 0) 0). Of course that's nested and invalid but just push the inner subreg to a new pseudo and the thing becomes valid. That was my original idea for dealing with the issue described in the thread including https://gcc.gnu.org/pipermail/gcc-patches/2021-August/578437.html where the original issue is that extract_integral_bit_field creates a subreg that's not valid according to validate_subreg. So I thought well, extracting bitfields should be done on integer modes (thus pun everything to ints and back), but that had even more(?) fallout. > But that is not the core problem we had here. 
The behaviour of the > generic parts of the compiler was changed, without testing if that > works on other targets but x86. That is an understandable mistake, it > takes some experience to know where the morasses are. But this change > should have been accompanied by testcases exercising the changed code. > We would have clearly seen there are issues then, simply by watching > gcc-testresults@ (and/or maintainers are on top of the test results > anyway). Also, if there were testcases for this, we could have some > confidence that a change in this area is robust. Well, that only works if some maintainers that are familiar enough with all this chime in ;) It's stage1 so it's understandable that some people (like me ...) are trying to help people make progress even if that involves trying to decipher 30 years of GCC history in this area (without much success in the end as we see) ;) Given validate_subreg's _intent_ is to disallow float subregs that change size, exceptions like /* ??? This should not be here. Temporarily continue to allow word_mode subregs of anything. The most common o
Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
On 2021/9/2 18:37, Richard Biener wrote: On Thu, 2 Sep 2021, Xionghu Luo wrote: On 2021/9/2 16:50, Richard Biener wrote: On Thu, 2 Sep 2021, Richard Biener wrote: On Thu, 2 Sep 2021, Xionghu Luo wrote: On 2021/9/1 17:58, Richard Biener wrote: This fixes the CFG walk order of fill_always_executed_in to use RPO order rather than the dominator based order computed by get_loop_body_in_dom_order. That fixes correctness issues with unordered dominator children. The RPO order computed by rev_post_order_and_mark_dfs_back_seme in its for-iteration mode is a good match for the algorithm. Xionghu, I've tried to only fix the CFG walk order issue and not change anything else with this so we have a more correct base to work against. The code still walks inner loop bodies up to loop depth times and thus is quadratic in the loop depth. Bootstrapped and tested on x86_64-unknown-linux-gnu, if you don't have any comments I plan to push this and then revisit what we were circling around. LGTM, thanks. I pushed it, thought again in the attempt to build a testcase and concluded I was wrong with the apparent mishandling of contains_call - get_loop_body_in_dom_order seems to be exactly correct for this specific case. So I reverted the commit again. And I figured what the /* In a loop that is always entered we may proceed anyway. But record that we entered it and stop once we leave it. */ comment was about. The code was present before the fix for PR78185 and it was supposed to catch the case where the entered inner loop is not finite. Just as the testcase from PR78185 shows the stopping was done too late when the exit block was already marked as to be always executed. A simpler fix for PR78185 would have been to move if (!flow_bb_inside_loop_p (inn_loop, bb)) break; before the setting of last = bb. In fact the installed fix was more pessimistic than that given it terminated already when entering a possibly infinite loop. 
So we can improve that by doing sth like which should also improve the situation for some of the cases you were looking at? What remains is that we continue to stop when entering a not always executed loop: if (bb->loop_father->header == bb) { if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) break; Yes. This will cause blocks after the inner loop to miss being checked for whether they are actually ALWAYS_EXECUTED. I am afraid O(N^2) is inevitable here... Yes. What we can try is pre-computing whether a loop has a call or an inner loop that might not terminate, and then when that's true for the loop to be entered continue to break; but when not, skip processing that loop's blocks (but we still fill the blocks array, and we do need to do this in the order for the loop we're processing ...). So what I was thinking was to somehow embed the dominator walk of get_loop_body_in_dom_order and instead of pre-recording the above info (call, infinite loop) for loops, pre-record it on the dominator tree so that we can ask "in any of our dominator children, is there a call or an infinite loop" and thus cut the dominator walk at loop header blocks that are not dominating the outer loop latch ... Of course the simplistic solution might be to simply do if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb) && ((loop_depth (bb->loop_father) - loop_depth (loop)) > param_max_lim_loop_depth_lookahead))) break; and thus limit the processing of conditionally executed inner loops by relative depth ... as you say the actual processing is unlikely to be the bottleneck for the degenerate cases of a very deep nest of conditionally executed loops. 
But still for this case get_loop_body_in_dom_order is doing quadratic processing so we can also say that another linear walk over the produced array does not increase complexity. > volatile int flag, bar; double foo (double *valp) { double sum = 0; for (int i = 0; i < 256; ++i) { for (int j = 0; j < 256; ++j) bar = flag; if (flag) sum += 1.; sum += *valp; } return sum; } The patch still fails to handle cases like this: struct X { int i; int j; int k;}; volatile int m; void bar (struct X *x, int n, int l, int k) { for (int i = 0; i < l; i++) { if (k) for (int j = 0; j < l; j++) { if (m) x->i = m; else x->i = 1 - m; int *r = &x->k; int tem2 = *r; x->k += tem2 * j; } x->i = m; } } x->i is still not marked ALWAYS_EXECUTED for the outer loop. Collected data when building gcc stage1 and bootstrap: there are still about 9% of bbs missed being marked ALWAYS_EXECUTED. Execution time of fill_always_executed_in_1 increases noticeably in stage1 but it is at the millisecond level, while the bootstrap build does