Re: [PATCH] Fix crash with constant initializer caused by IPA

2025-05-30 Thread Jan Hubicka
> On Fri, May 30, 2025 at 11:30 AM Jan Hubicka wrote: > > > > Hi, > > > > > > > > Hi, > > > > > > > > the attached Ada testcase compiled with -O2 -gnatn makes the compiler > > > > crash in > > > > vect_ca

Re: [PATCH] Fix crash with constant initializer caused by IPA

2025-05-30 Thread Jan Hubicka
Hi, > > > > Hi, > > > > the attached Ada testcase compiled with -O2 -gnatn makes the compiler crash > > in > > vect_can_force_dr_alignment_p during SLP vectorization: > > > > if (decl_in_symtab_p (decl) > > && !symtab_node::get (decl)->can_increase_alignment_p ()) > > return false; > >

Re: [PATCH] ipa: When inlining, don't combine PT JFs changing signedness (PR120295)

2025-05-29 Thread Jan Hubicka
> Hi, > > in GCC 15 we allowed jump-function generation code to skip over a > type-cast converting one integer to another as long as the latter can > hold all the values of the former or has at least the same precision. > This works well for IPA-CP where we do then evaluate each jump > function as

Re: [AUTOFDO] Fix annotated profile for de-duplicated call

2025-05-29 Thread Jan Hubicka
> > However I do not quite follow the old or new logic here. > So if I have only one unknown edge out (or in) from BB and I know > its count, I can determine count of that edge by Kirchhoff's law. > > But then the old code computes number of edges out of the BB > and if it is only one it updates the
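
The flow-conservation argument in the quoted text can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in auto-profile.cc: `struct edge_count` and `infer_unknown_edge` are hypothetical names, and counts are plain integers rather than GCC's profile_count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a CFG edge: a count plus a flag saying
   whether that count is known yet.  */
struct edge_count
{
  uint64_t count;
  bool known;
};

/* If exactly one edge leaving (or entering) a block is unknown and the
   block count is known, flow conservation gives the missing count:
   bb_count minus the sum of the known edge counts.  Returns true and
   fills in the edge when the inference applies.  */
bool
infer_unknown_edge (uint64_t bb_count, struct edge_count *edges, size_t n)
{
  size_t unknown = n;
  uint64_t known_sum = 0;

  assert (edges != NULL || n == 0);

  for (size_t i = 0; i < n; i++)
    {
      if (edges[i].known)
        known_sum += edges[i].count;
      else if (unknown == n)
        unknown = i;
      else
        return false;  /* More than one unknown edge.  */
    }

  if (unknown == n)
    return false;  /* Nothing to infer.  */

  /* Sampled profiles may be inconsistent; clamp instead of wrapping.  */
  edges[unknown].count = bb_count > known_sum ? bb_count - known_sum : 0;
  edges[unknown].known = true;
  return true;
}
```

The clamp to zero is a choice made for this sketch only; how the real pass treats inconsistent sampled counts is not shown in the snippet above.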

Re: [AUTOFDO] Fix annotated profile for de-duplicated call

2025-05-29 Thread Jan Hubicka
> diff --git a/gcc/auto-profile.cc b/gcc/auto-profile.cc > index 7e0e8c66124..8a317d85277 100644 > --- a/gcc/auto-profile.cc > +++ b/gcc/auto-profile.cc > @@ -1129,6 +1129,26 @@ afdo_set_bb_count (basic_block bb, const stmt_set > &promoted) >gimple *stmt = gsi_stmt (gsi); >if (gimp

Re: [PATCH] [AUTOFDO] Enable autofdo tests for aarch64

2025-05-29 Thread Jan Hubicka
> I also noticed that some tests are only enabled for x86. I am also seeing: > ./gcc/testsuite/gcc/gcc.sum:UNSUPPORTED: gcc.dg/tree-prof/pr66295.c This is testing a former ifun bug which reproduced with -fprofile-use > ./gcc/testsuite/gcc/gcc.sum:UNSUPPORTED: gcc.dg/tree-prof/split-1.c This is test

Re: [PATCH] [AUTOFDO] Enable autofdo tests for aarch64

2025-05-29 Thread Jan Hubicka
> Hi, > autofdo tests are now running only for x86. This patch makes it > run for aarch64 too. Verified that perf and create_gcov are running > as expected. > > gcc/ChangeLog: > > * config/aarch64/gcc-auto-profile: Make script executable. > > gcc/testsuite/ChangeLog: > > * lib/t

Set znver5 addss cost to 2 again

2025-05-28 Thread Jan Hubicka
Hi, since uses of addss for other purposes than modelling FP addition/subtraction should be gone now, this patch sets addss cost back to 2. Bootstrapped/regtested x86_64-linux, committed. gcc/ChangeLog: PR target/119298 * config/i386/x86-tune-costs.h (struct processor_costs): Set

Do not drop AFDO profile if entry block has count of 0

2025-05-28 Thread Jan Hubicka
Hi, with normal profile feedback checking entry block count to be non-zero is quite reliable check for presence of non-0 profile in the body since the function body can only be executed if the entry block was executed. With autofdo this is not true, since the entry block may just execute too few t
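
The difference between the two kinds of profile can be illustrated with a small sketch. It is not the patch itself: the flat array of block counts, the convention that index 0 is the entry block, and both function names are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Instrumented feedback: the body can only run if the entry block ran,
   so a non-zero entry count is a good proxy for "profile is present".
   Index 0 is assumed to be the entry block.  */
bool
entry_block_executed (const uint64_t *bb_counts, size_t n)
{
  assert (n > 0);
  return bb_counts[0] != 0;
}

/* Sampled (auto-fdo) feedback: the entry block may simply have received
   no samples, so every block has to be inspected before concluding
   that the function has no profile.  */
bool
any_block_sampled (const uint64_t *bb_counts, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (bb_counts[i] != 0)
      return true;
  return false;
}
```

For a function whose hot loop was sampled but whose entry block was not, the first check reports no profile while the second still finds one.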

Do not erase static profile by 0 autofdo profile

2025-05-28 Thread Jan Hubicka
Hi, This patch makes auto-fdo more careful about keeping info we have from static profile prediction. If all counters in function are 0, we can keep original auto-fdo profile. Having all 0 profile is not very useful especially because 0 in autofdo is not very informative and the code still may hav

Re: [PATCH] i386: Use Shuffles instead of shifts for Reduction in AMD znver4/5

2025-05-28 Thread Jan Hubicka
> gcc/ChangeLog: > > * config/i386/i386-expand.cc (emit_reduc_half): Use shuffles to > generate reduc half for V4SI, similar modes. > * config/i386/i386.h (TARGET_SSE_REDUCTION_PREFER_PSHUF): New Macro. > * config/i386/x86-tune.def (X86_TUNE_SSE_REDUCTION_PREFER_PSHUF): >

Remove dead code in auto-profile.cc

2025-05-27 Thread Jan Hubicka
Hi, this code to track what locations were used when reading auto-fdo profile seems dead since the initial commit. Removed thus. Committed as obvious. Honza gcc/ChangeLog: * auto-profile.cc (function_instance::mark_annotated): Remove. (function_instance::total_annotated_count): Re

Re: [AUTOFDO] Merge profiles of clones before annotating

2025-05-26 Thread Jan Hubicka
> > > > On 26 May 2025, at 5:34 pm, Jan Hubicka wrote: > > > > > > Hi, > > also, please, can you add a testcase? We should have some coverage for > > auto-fdo specific is

Re: [AUTOFDO] Merge profiles of clones before annotating

2025-05-26 Thread Jan Hubicka
Hi, also, please, can you add a testcase? We should have some coverage for auto-fdo specific issues Honza Attachment: 0002-AUTOFDO-Merge-profiles-of-clones-before-annotating.patch

Re: [AUTOFDO] Merge profiles of clones before annotating

2025-05-26 Thread Jan Hubicka
Hi, > Ping? Sorry for the delay. I think I finally got auto-fdo running on my box and indeed I see that if function is cloned later, the profile is lost. There are .suffixes added before afdo pass (such as openmp offloading or nested functions) and there are .suffixes added after afdo (by ipa clonin

Re: [AUTOFDO] Enable ipa-split for auto-profile

2025-05-22 Thread Jan Hubicka
> > On 9 May 2025, at 11:55 am, Kugan Vivekanandarajah > > wrote: > > > > ipa-split is not now run for auto-profile. IMO this was an oversight. > > This patch enables it similar to PGO runs. > > > > gcc/ChangeLog: > > > >* ipa-split.cc pass_feedback_split_functions::clone (): New. > >

Re: [PATCH 3/5] ipa: Dump cgraph_node UID instead of order into ipa-clones dump file

2025-05-15 Thread Jan Hubicka
> Hi, > > starting with GCC 15 the order is not unique for any symtab_nodes but > m_uid is, I believe we ought to dump the latter in the ipa-clones dump, > if only so that people can reliably match entries about new clones to > those about removed nodes (if any). > > Bootstrapped and tested on x8

Re: [PATCH][x86] Fix regression from x86 multi-epilogue tuning

2025-05-14 Thread Jan Hubicka
> With the avx512_two_epilogues tuning enabled for zen4 and zen5 > the gcc.target/i386/vect-epilogues-5.c testcase below regresses > and ends up using AVX2 sized vectors for the masked epilogue > rather than AVX512 sized vectors. The following patch rectifies > this and adds coverage for the inten

Re: [PATCH v3] Consider frequency in cost estimation when converting scalar to vector.

2025-05-14 Thread Jan Hubicka
> Thanks for review. > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > Ok for trunk? > > In some benchmark, I notice stv failed due to cost unprofitable, but the igain > is inside the loop, but sse<->integer conversion is outside the loop, current > cost > model doesn't consider the

Re: [PATCH v3] Consider frequency in cost estimation when converting scalar to vector.

2025-05-12 Thread Jan Hubicka
> > gcc/ChangeLog: > > > > * config/i386/i386-features.cc > > (scalar_chain::mark_dual_mode_def): Weight > > n_integer_to_sse/n_sse_to_integer with bb frequency. > > (general_scalar_chain::compute_convert_gain): Ditto, and > > adjust function prototype to ret

Re: i386: Fix some problems in stv cost model

2025-05-12 Thread Jan Hubicka
> > Instructions with latency info are those really different. > > So the unconverted code has sum of latencies 4 and real latency 3. > > Converted code has sum of latencies 4 and real latency 3 > > (vmod+vpmaxsd+vmov). > > So I do not quite see it should be a win. > > Note this was historically d

i386: Fix some problems in stv cost model

2025-05-10 Thread Jan Hubicka
Hi, this patch fixes some of the problems with costing in scalar to vector pass. In particular 1) the pass uses optimize_insn_for_size which is intended to be used by expanders and splitters and requires the optimization pass to use set_rtl_profile (bb) for currently processed bb. This is n

i386: implement costs for float<->int conversions in ix86_vector_costs::add_stmt_cost

2025-05-07 Thread Jan Hubicka
Hi, This patch adds pattern matching for float<->int conversions both as normal statements and promote_demote. While updating promote_demote I noticed that in cleanups I turned "stmt_cost =" into "int stmt_cost = " which turned the existing FP costing to NOOP. I also added comment on how demotes a

Fix i386 bootstrap on non-Windows targets

2025-05-06 Thread Jan Hubicka
Hi, this patch adds ifdef so we don't get warning on ix86_tls_index being unused. Bootstrapped x86_64-linux, committed. * config/i386/i386.cc (ix86_tls_index): Add ifdef. diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index f28c92a9d3a..89f518c86b5 100644 --- a/gcc/config/

Improve maybe_hot handling in inliner heuristics

2025-05-03 Thread Jan Hubicka
Hi, Inliner currently applies different heuristics to hot and cold calls (the second are inlined only if the code size will shrink). It may happen that the call itself is hot, but the significant time is spent in callee and inlining makes it faster. For this reason we want to check if the anticip

Improve ix86 VEC_MERGE costs

2025-05-02 Thread Jan Hubicka
Hi, ix86_rtx_costs handles VEC_MERGE by special casing AVX512 mask operations and otherwise returning cost->sse_op completely ignoring costs of the operands. Since VEC_MERGE is also used to represent scalar variant of SSE/AVX operation, this means that many instructions (such as SSE conversions) are ofte

Re: Make ix86 cost of VEC_SELECT equivalent to SUBREG same as of SUBREG

2025-05-02 Thread Jan Hubicka
> target_insn_cost is used to prevent rpad optimization to be restored by > late_combine1, looks like it's not sufficient for size_cost. > > 21804static int > 21805ix86_insn_cost (rtx_insn *insn, bool speed) > 21806{ > 21807 int insn_cost = 0; > 21808 /* Add extra cost to avoid post_reload late

Re: [PATCH v2] Consider frequency in cost estimation when converting scalar to vector.

2025-04-29 Thread Jan Hubicka
> > so gain is the difference of runtime of integer variant compared to > > vector variant and cost are the extra int->sse and sse->int conversions > > needed? > > > > If you scale everything by a BB frequency, you will get a weird > > behaviour if chain happens to consist only of instructions in

Fix cs_interesting_for_ipcp_p wrt flag_profile_partial_training.

2025-04-29 Thread Jan Hubicka
Hi, as noticed by Martin Jambor, I introduced a bug while simplifying cs_interesting_for_ipcp_p and reversed condition for flag_profile_partial_training. Also I noticed that we probably want to consider calls with uninitialized counts for cloning so the pass does something with -fno-guess-branch-pr

Re: [PATCH v2] Consider frequency in cost estimation when converting scalar to vector.

2025-04-29 Thread Jan Hubicka
> > > I am generally trying to get rid of remaining uses of REG_FREQ since the > > > 1 based fixed point arithmetics is not always working that well. > > > > > > You can do the sums in profile_count type (doing something reasonable > > > when count is uninitialized) and then convert it to sreal for

Make ix86 cost of VEC_SELECT equivalent to SUBREG same as of SUBREG

2025-04-29 Thread Jan Hubicka
Hi, this patch (partly) solves problem in PR119900 where changing ix86_size_cost of cheap SSE instruction from 2 bytes to 4 bytes regresses imagemagick with PGO (119% on core and 54% on Zen) There is an interesting chain of problems 1) the train run of the SPEC2017 imagick is wrong and it does not

Re: [PATCH v2] Consider frequency in cost estimation when converting scalar to vector.

2025-04-29 Thread Jan Hubicka
> > I am generally trying to get rid of remaining uses of REG_FREQ since the > > 1 based fixed point arithmetics is not always working that well. > > > > You can do the sums in profile_count type (doing something reasonable > > when count is uninitialized) and then convert it to sreal for the final

Re: [PATCH] Accept allones or 0 operand for vcond_mask op1.

2025-04-24 Thread Jan Hubicka
> > And thus it may be more RTL friendly to represent it this way instead of > > current unspec called UNSPEC_IEEE_MAX... > > There's a patch proposed for that [1], and Jakub has some comments. > > Jakub Jelinek wrote on Fri, Nov 15, 2024 at 16:20: > > > > On Fri, Nov 15, 2024 at 04:04:55PM +0800, Hongyu Wan

Re: [PATCH] Accept allones or 0 operand for vcond_mask op1.

2025-04-24 Thread Jan Hubicka
> Note for blendv, it checks the significant bit of the mask, not simple > if_then_else > mask > if_true > if_false > > It should be > if_then_else >ashiftrt mask 31 >if_true >if_false I think canonical form (produced by combine) would be if_then_else ge mask 0 if_false
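
As a scalar illustration of the two forms discussed above, selecting on the sign bit of the mask and selecting on a compare against zero with the arms swapped pick the same value. This is a sketch with hypothetical function names; the quoted text is cut off after `if_false`, so the position of the `if_true` arm in the `ge` form is inferred rather than quoted, and real blendv applies the rule per vector element.

```c
#include <assert.h>
#include <stdint.h>

/* Select on the most significant bit of the mask.  The bit is
   extracted with an unsigned shift here; the RTL in the thread uses
   ashiftrt, which broadcasts that same bit.  */
int32_t
select_by_sign_bit (int32_t mask, int32_t if_true, int32_t if_false)
{
  return ((uint32_t) mask >> 31) ? if_true : if_false;
}

/* The same selection written as a compare against zero with the two
   arms swapped.  */
int32_t
select_by_ge_zero (int32_t mask, int32_t if_true, int32_t if_false)
{
  return mask >= 0 ? if_false : if_true;
}

/* Sanity helper: both forms must agree for any mask value.  */
void
check_forms_agree (int32_t mask, int32_t if_true, int32_t if_false)
{
  assert (select_by_sign_bit (mask, if_true, if_false)
          == select_by_ge_zero (mask, if_true, if_false));
}
```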

Re: [PATCH] Accept allones or 0 operand for vcond_mask op1.

2025-04-24 Thread Jan Hubicka
> On Thu, Apr 24, 2025 at 6:27 PM Jan Hubicka wrote: > > > > > Since ix86_expand_sse_movcc will simplify them into a simple vmov, vpand > > > or vpandn. > > > Current register_operand/vector_operand could lose some optimization > > > opportunity. >

Fix ICE building deepsjeng with -fprofile-use

2025-04-24 Thread Jan Hubicka
Hi, the problem here is division by zero, since adjusted 0 > precise 0. Fixed by using right test. gcc/ChangeLog: PR ipa/119924 * ipa-cp.cc (update_counts_for_self_gen_clones): Use nonzero_p. (update_profiling_info): Likewise. (update_specialized_profile): Likewise

Re: [PATCH] Accept allones or 0 operand for vcond_mask op1.

2025-04-24 Thread Jan Hubicka
> Since ix86_expand_sse_movcc will simplify them into a simple vmov, vpand > or vpandn. > Current register_operand/vector_operand could lose some optimization > opportunity. > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > Ok for trunk? > > gcc/ChangeLog: > > * config/i386/p

Re: Improve vectorizer costs of min, max, abs, absu and const_expr on x86

2025-04-24 Thread Jan Hubicka
Hi, > With this patch > https://gcc.gnu.org/pipermail/gcc-patches/2025-April/681503.html > scalar version can also be optimized to vcmpnltsd + vpandn this is nice. Would be nice if this was also caught by combiner... > > Can we also check if_true/if_false, if they're const0, or > > constm1(inte

Re: [PATCH] [x86] Generate 2 FMA instructions in ix86_expand_swdivsf.

2025-04-23 Thread Jan Hubicka
> From: "hongtao.liu" > > When FMA is available, N-R step can be rewritten with > > a / b = (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a > > which have 2 fma generated.[1] > > [1] https://bugs.llvm.org/show_bug.cgi?id=21385 > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > Ok fo
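
The rewritten Newton-Raphson step quoted above can be tried out in plain C. This is a sketch under assumptions, not the code in ix86_expand_swdivsf: `approx_rcp` is a hypothetical stand-in for the hardware reciprocal estimate (an exact reciprocal with a deliberate relative error of 2^-12), and the two multiply-add expressions are written in ordinary arithmetic, which a compiler targeting FMA hardware may contract into one fused multiply-add each.

```c
#include <assert.h>

/* Hypothetical stand-in for a hardware reciprocal estimate: the exact
   reciprocal perturbed by a relative error of 2^-12.  */
static float
approx_rcp (float b)
{
  return (1.0f / b) * (1.0f + 1.0f / 4096.0f);
}

/* a / b = (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a  */
float
div_nr_fma (float a, float b)
{
  assert (b != 0.0f);

  float x0 = approx_rcp (b);  /* rcp(b) */
  float q0 = a * x0;          /* rcp(b) * a, the first quotient estimate */
  float e = a - q0 * b;       /* residual; candidate for the first FMA */
  return e * x0 + q0;         /* correction; candidate for the second FMA */
}
```

With a reciprocal estimate of relative error d, the first quotient has relative error d and the corrected result roughly d squared, which is why a single step suffices to bring an estimate of about 12 bits close to single precision.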

Re: [PATCH] Consider frequency in cost estimation when converting scalar to vector.

2025-04-23 Thread Jan Hubicka
> In some benchmark, I notice stv failed due to cost unprofitable, but the igain > is inside the loop, but sse<->integer conversion is outside the loop, current > cost > model doesn't consider the frequency of those gain/cost. > The patch weights those cost with frequency just like LRA does. > >

Re: Improve vectorizer costs of min, max, abs, absu and const_expr on x86

2025-04-22 Thread Jan Hubicka
> > But vectorizer computes costs of vector load of off array, 4x moving vector > > to > > scalar and 4x stores. I wonder if generic code can match this better and > > avoid > > the vector load of addresses when open-coding gather/scatter? > > The vectorizer does not explicitly consider the low

Improve vectorizer costs of min, max, abs, absu and const_expr on x86

2025-04-21 Thread Jan Hubicka
Hi, this patch adds special cases for vectorizer costs in COND_EXPR, MIN_EXPR, MAX_EXPR, ABS_EXPR and ABSU_EXPR. We previously costed ABS_EXPR and ABSU_EXPR but it was only correct for FP variant (where it corresponds to andss clearing sign bit). Integer abs/absu is open coded as conditional move

Re: [PATCH v2] x86: Update memcpy/memset inline strategies for -mtune=generic

2025-04-21 Thread Jan Hubicka
> On Mon, Apr 21, 2025 at 7:24 AM H.J. Lu wrote: > > > > On Sun, Apr 20, 2025 at 6:31 PM Jan Hubicka wrote: > > > > > > > PR target/102294 > > > > PR target/119596 > > > > * config/i386/x86-tune-costs.h (generi

Re: [PATCH v2] x86: Update memcpy/memset inline strategies for -mtune=generic

2025-04-20 Thread Jan Hubicka
> PR target/102294 > PR target/119596 > * config/i386/x86-tune-costs.h (generic_memcpy): Updated. > (generic_memset): Likewise. > (generic_cost): Change CLEAR_RATIO to 17. > * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB): > Add m_GENERIC

Re: [PATCH v2] x86: Update memcpy/memset inline strategies for -mtune=generic

2025-04-20 Thread Jan Hubicka
> On Sun, Apr 20, 2025 at 4:19 AM Jan Hubicka wrote: > > > > > On Tue, Apr 8, 2025 at 3:52 AM H.J. Lu wrote: > > > > > > > > Simplify memcpy and memset inline strategies to avoid branches for > > > > -mtune=generic: > > > > >

Re: [PATCH v2] x86: Update memcpy/memset inline strategies for -mtune=generic

2025-04-19 Thread Jan Hubicka
> On Tue, Apr 8, 2025 at 3:52 AM H.J. Lu wrote: > > > > Simplify memcpy and memset inline strategies to avoid branches for > > -mtune=generic: > > > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector > >load and store for up to 16 * 16 (256) bytes when the data size is > >

Add sse_fp_cost into i386_rtx_costs

2025-04-18 Thread Jan Hubicka
Hi, Znver5 has addss cost of 2 while other common floating point SSE operations costs 3 cycles. We currently have only one entry in the costs tables which makes it impossible to model this. This patch adds sse_fp_op which is used for other common FP operations (basically conversions) and updates

Re: Add sse_fp_cost into i386_rtx_costs

2025-04-17 Thread Jan Hubicka
> On Thu, 17 Apr 2025, Jan Hubicka wrote: > > > Hi, > > Znver5 has addss cost of 2 while other common floating point SSE operations > > costs 3 cycles. We currently have only one entry in the costs tables which > > makes it impossible to model this. This patc

Put znver5 ADDSS cost back to 3

2025-04-16 Thread Jan Hubicka
Hi, Znver5 has latency of addss 2 in typical case while all earlier versions have latency 3. Unfortunately addss cost is used to cost many other SSE instructions than just addss and setting the cost to 2 makes us to vectorize 4 64bit stores into one 256bit store which in turn regresses imagemagick

Stream ipa_return_value_summary

2025-04-16 Thread Jan Hubicka
Hi, this patch adds streaming of return summaries from compile time to ltrans which are now needed for vrp to not output false errors on musttail. Bootstrapped/regtested x86_64-linux, committed. Co-authored-by: Jakub Jelinek gcc/ChangeLog: PR tree-optimization/119614 * ip

Re: [PATCH 2/4] cfgloopmanip: Add infrastructure for scaling of multi-exit loops [PR117790]

2025-04-15 Thread Jan Hubicka
Hi, > gcc/ChangeLog: > > PR tree-optimization/117790 > * tree-vect-loop.cc (scale_profile_for_vect_loop): Use > scale_loop_profile_hold_exit_counts instead of scale_loop_profile. Drop > the exit edge parameter, since the code now handles multiple exits. > Adjust the

Re: [PATCH 2/4] cfgloopmanip: Add infrastructure for scaling of multi-exit loops [PR117790]

2025-04-15 Thread Jan Hubicka
> Hi, > > gcc/ChangeLog: > > > > PR tree-optimization/117790 > > * cfgloopmanip.cc (can_flow_scale_loop_freqs_p): New. > > (flow_scale_loop_freqs): New. > > (scale_loop_freqs_with_exit_counts): New. > > (scale_loop_freqs_hold_exit_counts): New. > > (scale_loop_profile): Ref

Re: [PATCH 2/4] cfgloopmanip: Add infrastructure for scaling of multi-exit loops [PR117790]

2025-04-15 Thread Jan Hubicka
Hi, > gcc/ChangeLog: > > PR tree-optimization/117790 > * cfgloopmanip.cc (can_flow_scale_loop_freqs_p): New. > (flow_scale_loop_freqs): New. > (scale_loop_freqs_with_exit_counts): New. > (scale_loop_freqs_hold_exit_counts): New. > (scale_loop_profile): Refactor

Re: [PATCH] Locality cloning pass (was: Introduce -flto-partition=locality)

2025-04-13 Thread Jan Hubicka
> +@opindex fipa-reorder-for-locality > +@item -fipa-reorder-for-locality > +Group call chains close together in the binary layout to improve code > +locality. This option is incompatible with an explicit > +@option{-flto-partition=} option since it enforces a custom partitioning > +scheme.

Re: [PATCH 5/7] ipa-cp: Use the collected pass-through types to propagate constants (PR118097)

2025-04-10 Thread Jan Hubicka
> This patch revisits the fix for PR 118097 and instead of deducing the > necessary operation type it just uses the value collected and streamed > by an earlier patch. > > It is bigger than the ones for propagating value ranges and known bits > because we track constants both in parameters themsel

Re: [RFC PATCH 6/7] ipa: Remove type checks in arithmetic pass-through jfunc construction

2025-04-08 Thread Jan Hubicka
> After reviewing the code involving arithmetic pass-through jump > functions I found out that we actually do check that the type of the > LHS is compatible with the type of the first operand on the RHS. Now > that we stream the types of the LHS of these operations, this is no > longer necessary -

Re: [PATCH 4/7] ipa-cp: Use the stored and streamed pass-through types in ipa-vr (PR118785)

2025-04-08 Thread Jan Hubicka
> This patch revisits the fix for PR 118785 and instead of deducing the > necessary operation type it just uses the value collected and streamed > by an earlier patch. > > gcc/ChangeLog: > > Bootstrapped and tested and LTO bootstrapped on x86_64-linux. OK for > master? > > Thanks, > > Martin >

Re: [PATCH 3/7] ipa-cp: Make dumping of widest_ints even more sane

2025-04-08 Thread Jan Hubicka
> This patch just introduces a form of dumping of widest ints that only > have zeros in the lowest 128 bits so that instead of printing > thousands of f's the output looks like: > >Bits: value = 0x, mask = all ones folled by > 0x > > and then makes sur

Re: [PATCH 2/7] ipa-cp: Make propagation of bits in IPA-CP aware of type conversions (PR119318)

2025-04-08 Thread Jan Hubicka
> After the propagation of constants and value ranges, it turns out > that the propagation of known bits also needs to be made aware of any > intermediate types in which any arithmetic operations are made and > must limit its precision there. This implements just that, using the > newly collected

Re: [PATCH 1/7] ipa: Record and stream result types of arithemetic jump functions

2025-04-08 Thread Jan Hubicka
> In order to replace the use of somewhat unwieldy > expr_type_first_operand_type_p we need to record and stream the types > of results of operations recorded in arithmetic jump functions. This > is necessary so that we can then simulate them at the IPA stage with > the corresponding precision and

Re: [PATCH RFA] ipa: target clone and mangling alias [PR114992]

2025-04-04 Thread Jan Hubicka
> On Thu, 20 Mar 2025, Jason Merrill wrote: > > > Tested x86_64-pc-linux-gnu. OK for trunk and backports? > > > > -- 8< -- > > > > Since the mangling of the second lambda changed (previously we counted all > > lambdas, now we only count lambdas with the same signature), we > > generate_mangling

Make ipa-cp propagate over non-hot calls

2025-04-03 Thread Jan Hubicka
Hi, Currently enabling profile feedback regresses x264 and exchange. In both cases the root of the issue is that ipa-cp cost model thinks cloning is not relevant when feedback is available while it clones without feedback. Consider: __attribute__ ((used)) int a[1000]; __attribute__ ((noinline)

Re: [PATCH] sra: Avoid creating TBAA hazards (PR118924)

2025-04-03 Thread Jan Hubicka
> On Mon, 31 Mar 2025, Martin Jambor wrote: > > > Hi, > > > > the testcase in PR 118924, when compiled on Aarch64, contains a > > gimple aggregate assignment statement in between different types which > > are types_compatible_p but behave differently for the purposes of > > alias analysis. > >

Re: [PATCH] sra: Avoid creating TBAA hazards (PR118924)

2025-04-03 Thread Jan Hubicka
> > > So in WPA we can not assume that TYPE_CANONICAL (A) == TYPE_CANONICAL > > > (B) is forever. We also don't do any gimple transforms here, so this is > > > kind of safe, but ugly. > > > > Hmm. But we do > > > > /* alias_ptr_types_compatible_p relies on fact that during LTO > >

Re: [PATCH] sra: Avoid creating TBAA hazards (PR118924)

2025-04-03 Thread Jan Hubicka
> > So in WPA we can not assume that TYPE_CANONICAL (A) == TYPE_CANONICAL > > (B) is forever. We also don't do any gimple transforms here, so this is > > kind of safe, but ugly. > > Hmm. But we do > > /* alias_ptr_types_compatible_p relies on fact that during LTO > types do not g

Fix x86 -Os costs of loads and stores

2025-03-30 Thread Jan Hubicka
Hi, this patch fixes problem with size costs declaring all moves to have equal size (which was caught by the sanity check I tried in prologue move cost hook). Costs are relative to reg-reg move which is two. Coincidentally it is also size of the encoding, so the costs should represent typical size

Re: Mark const parameters passed by invisible reference as readonly in the function body

2025-03-30 Thread Jan Hubicka
Hi, I noticed that this patch got forgotten and I think it may be useful to solve this next stage 1. > > cp_apply_type_quals_to_decl drops 'const' if the type has mutable members. > Unfortunately TREE_READONLY on the PARM_DECL isn't helpful in the case of an > invisiref parameter. > > > > But ma

Re: [PATCH] libstdc++: Fix up string _M_constructor exports [PR103827]

2025-03-30 Thread Jan Hubicka
> On Thu, Mar 27, 2025 at 02:04:24PM +0100, Jan Hubicka wrote: > > > > Newline between functions please. > > > > > > > > OK with those two changes. > > > > > > Looking back through my inbox, this one doesn't seem to have been

Re: [PATCH 6/8] target/119010 - reservations for Zen4/Zen5 movhlps to memory

2025-03-29 Thread Jan Hubicka
> The following adds missing reservations for the store variant of > sselog reservations covering > > ;; 112--> b 0: i1499 [dx-0x10]=vec_select(xmm10,parallel):nothing > > Bootstrapped and tested on x86_64-unknown-linux-gnu, OK? > > PR target/119010 > * config/i386/zn4zn5.md

Re: [PATCH 8/8] target/119010 - add mode attribute to *vmovv16si_constm1_pternlog_false_dep

2025-03-29 Thread Jan Hubicka
> Like the other instances. This avoids > > ;; 1--> b 0: i6540 {xmm2=const_vector;unspec[xmm2] 38;}:nothing > > Bootstrapped and tested on x86_64-unknown-linux-gnu, OK? > > PR target/119010 > * config/i386/sse.md (*vmov_constm1_pternlog_false_dep): > Add mode attrib

Re: [PATCH] ipa-sra: Don't change return type to void if there are musttail calls [PR119484]

2025-03-29 Thread Jan Hubicka
> Hi! > > The following testcase is rejected, because IPA-SRA decides to > turn bar.constprop call into bar.constprop.isra which returns void. > While there is no explicit lhs on the call, as it is a musttail call > the tailc pass checks if IPA-VRP returns singleton from that function > and the fu

Re: [PATCH 3/8] target/119010 - add reservations for integer vector compares to zen4/zen5

2025-03-29 Thread Jan Hubicka
> The following handles TI, OI and XI mode in the respective EVEX > compare reservations that do not use memory (I've not yet run into > ones with). The znver automata has separate reservations for > integer compares (but only for zen1, for zen2 and zen3 there are > no compare reservations at all)

Re: [PATCH 2/8] target/119010 - missing reservations for Zen4/5 and SSE compares

2025-03-29 Thread Jan Hubicka
> There's the znver4_sse_test reservation which matches the memory-less > SSE compares but currently requires prefix_extra == 1. The old > znver automata in this case sometimes uses znver1-double instead of > znver1-direct, but it's quite a maze. The following simply drops prefix_extra is used to

Re: [PATCH 7/8] target/119010 - Zen4/Zen5 reservations for movlhps loads

2025-03-29 Thread Jan Hubicka
> The following fixes up the ssemov2 type introduction, amending > the znver4_sse_mov_fp_load reservation. This fixes > > ;; 14--> b 0: i1436 xmm6=vec_concat(xmm6,[ax+0x8]) :nothing > > Bootstrapped and tested on x86_64-unknown-linux-gnu, OK? > > PR target/119010 > *

Re: [PATCH 5/8] target/119010 - fixup Zen4/Zen5 fp<->int convert reservations

2025-03-29 Thread Jan Hubicka
> They were using ssecvt instead of sseicvt, I've also added handling > for sseicvt2 which was introduced without fixing up automata, and > the relevant instruction uses DFmode. IMO this is a quite messy > area that could need TLC in the machine description itself. > > Bootstrapped and tested on

Re: [PATCH 4/8] target/119010 - handle DFmode in SSE divide reservations for Zen4/Zen5

2025-03-29 Thread Jan Hubicka
> Like the other DFmode cases. > > Bootstrapped and tested on x86_64-unknown-linux-gnu, OK? > > PR target/119010 > * config/i386/zn4zn5.md (znver4_sse_div_pd, > znver4_sse_div_pd_load, znver5_sse_div_pd_load): Handle DFmode. OK, thanks! Honza > --- > gcc/config/i386/zn4zn5.md |

Re: [PATCH 1/8] target/119010 - fixup zn4zn5 reservation for move from const_vector

2025-03-29 Thread Jan Hubicka
> movv8si_internal uses sselog1 and V4SFmode for an instruction like > > (insn 363 2437 371 97 (set (reg:V8SI 46 xmm10 [1125]) > (const_vector:V8SI [ > (const_int 0 [0]) repeated x8 > ])) "ComputeNonbondedUtil.C":185:21 2402 {movv8si_internal} > > this wasn't c

Re: [PATCH] target/119010 - add znver{4,5}_insn_both to resolve missing reservations

2025-03-27 Thread Jan Hubicka
> I still was seeing > > ;;0--> b 0: i 101 {[sp-0x3c]=[sp-0x3c]+0x1;clobber flags;}:nothing > > so the following adds a standard alu insn reservation mimicking that > from the znver.md description allowing both load and store. > > Bootstrap and regtest running on x86_64-unknown-linux-gnu

Re: [libstdc++] Optimize string constructors

2025-03-27 Thread Jan Hubicka
> > Newline between functions please. > > > > OK with those two changes. > > Looking back through my inbox, this one doesn't seem to have been > pushed. Was it superseded by something else, or is it just waiting for > stage 1 now? Seems I missed the approval, sorry. I will push it - I think it w

Re: [PATCH] target/119474 - more DFmode handling in zn4zn5 reservations

2025-03-27 Thread Jan Hubicka
> The following adds DFmode where V1DFmode and SFmode were handled. > This resolves missing reservations for adds, subs [with memory] > and for FMAs for the testcase I'm looking at. Resolved cases are > > -;; 16--> b 0: i 237 xmm3=xmm3+[r9*0x8+si] :nothing > -;; 29-->

Re: [PATCH v2 1/2] PR118442: Don't instrument exit edges after musttail

2025-03-26 Thread Jan Hubicka
> > > > Hmm. I do wonder whether your earlier patch was more "correct" in the > > sense that a tail call does not return to the calling function but its > > caller. > > That means it should not have a fallthru edge, so our representation > > with feeding a return value to a function-local return

Re: [PATCH 2/2] Add prime path coverage to gcc/gcov

2025-03-26 Thread Jan Hubicka
Hello, I apologize for late reply here. I went thru the paper in greater detail. While I originally thought the usual path-profiling can be reasonably merged with the prime path profiling, so it is useful both for optimization and coverage testing, I think it is better to not do that - the require

Re: [PATCH 1/2] gcov: branch, conds, calls in function summaries

2025-03-26 Thread Jan Hubicka
> The gcov function summaries only output the covered lines, not the > branches and calls. Since the function summaries is an opt-in it > probably makes sense to also include branch coverage, calls, and > condition coverage. > > $ gcc --coverage -fpath-coverage hello.c -o hello > $ ./hello > > Be

Re: [PATCH v2 1/2] PR118442: Don't instrument exit edges after musttail

2025-03-26 Thread Jan Hubicka
> > Hmm. I do wonder whether your earlier patch was more "correct" in the > sense that a tail call does not return to the calling function but its caller. > That means it should not have a fallthru edge, so our representation > with feeding a return value to a function-local return stmt isn't a g

Re: [PATCH v2 1/2] PR118442: Don't instrument exit edges after musttail

2025-03-26 Thread Jan Hubicka
> The only question I have is flow_call_edges_add only called while > profiling or is it called some other time? So looking into who calls > flow_call_edges_add, it is only branch_prob (profile.cc) which is only > called from tree-profile.cc. So a cleanup (for GCC 16 is remove the > cfghook flow_ca

Re: [PATCH] target/119010 - add missing integer store reservations for znver4 and znver5

2025-03-25 Thread Jan Hubicka
> The imov and imovx classified stores miss reservations in the znver4/5 > pipeline description. The following adds them. > > Bootstrap and regtest pending on x86_64-unknown-linux-gnu. > > OK? > > PR target/119010 > * config/i386/zn4zn5.md (znver4_imov_double_store, > znver5_i

Re: [PATCH] target/119010 - add missing DF load/store reservations for znver4 and znver5

2025-03-25 Thread Jan Hubicka
> On Tue, 25 Mar 2025, Richard Biener wrote: > > > The following resolves missing reservations for DFmode *movdf_internal > > loads and stores, visible as 'nothing' in -fsched-verbose=2 dumps. > > > > Bootstrap and regtest running on x86_64-unknown-linux-gnu. > > The alternative for the larger s

Fix speculation_useful_p

2025-03-15 Thread Jan Hubicka
Hi, this patch fixes an issue with speculation and x264. With profile feedback we first introduce speculative calls to mc_chroma, which is called indirectly. Then we propagate constants across these calls (which is a useful transform), but then speculation_useful_p decides that these speculations are not

Fix invalid profile mismatch error

2025-03-13 Thread Jan Hubicka
Hi, this patch fixes a false inconsistent profile error message seen when building SPEC with -fprofile-use -fdump-ipa-profile. The problem is that with dumping enabled, tree_estimate_probability is run in dry-run mode to report success rates of heuristics. It however runs determine_unlikely_bbs which overwri

Re: [PATCH v3] ira: Add new hooks for callee-save vs spills [PR117477]

2025-03-07 Thread Jan Hubicka
> > This is OK. In general, I think we could also go with assert on > > mem_cost <= 2, since that is kind of bogus setting (I don't think we > > will ever need to support x86 CPU with memory stores being as cheap as > > reg-reg moves), but current form is good. > > Unless the loading/storing inte

Re: [PATCH v3] ira: Add new hooks for callee-save vs spills [PR117477]

2025-03-07 Thread Jan Hubicka
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index fb93a6fdd0a..be5e27fc391 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -20600,12 +20600,26 @@ ix86_class_likely_spilled_p (reg_class_t rclass) >return false; > } > > -/* Implement TARGET_IR

Break false dependency chains on zen5

2025-03-04 Thread Jan Hubicka
Hi, Zen5 on some variants has a false dependency on tzcnt, blsi, blsr and blsmsk instructions. Those can be tested by the following benchmark: jh@shroud:~> cat ee.c int main() { int a = 10; int b = 0; for (int i = 0; i < 10; i++) { #ifdef BREAK asm
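
Since the benchmark above is cut off in this listing, here is a rough, separate sketch of the same idea rather than a reconstruction of the original ee.c. The helper name run_tzcnt_loop, its parameters, the runtime flag (the original uses #ifdef BREAK) and the non-x86 fallback are all assumptions made for illustration. On x86 it optionally zeroes the destination register with xor before each tzcnt, so the tzcnt does not have to wait for the previous value of its output register.

```c
/* Hypothetical helper, not taken from the original message.
   Runs tzcnt on a fixed input `iterations` times and returns the last
   result.  When break_dependency is nonzero, the destination register
   is zeroed first so that tzcnt does not depend on its old value.  */
static int run_tzcnt_loop(long iterations, int break_dependency)
{
    int a = 10;   /* binary 1010: exactly one trailing zero bit */
    int b = 0;

    for (long i = 0; i < iterations; i++) {
#if defined(__x86_64__) || defined(__i386__)
        if (break_dependency)
            /* xor reg,reg is a zeroing idiom that cuts the dependency
               on whatever the register held before.  */
            __asm__ volatile ("xorl %0, %0" : "=r" (b) : : "cc");
        /* The "0" constraint ties the old value of b to the output
           register, so every iteration reuses the same register.  */
        __asm__ volatile ("tzcntl %2, %0"
                          : "=r" (b)
                          : "0" (b), "r" (a)
                          : "cc");
#else
        /* Assumed fallback for other targets (GCC/Clang builtin);
           it exercises no false dependency at all.  */
        (void) break_dependency;
        b = __builtin_ctz((unsigned) a);
#endif
    }
    return b;
}
```

Timing run_tzcnt_loop with a large iteration count, once with break_dependency set to 0 and once set to 1, would be one way to see whether a given CPU has the false dependency. The runtime branch adds some overhead of its own, so a compile-time switch as in the original is the cleaner way to measure.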

Make ix86_macro_fusion_pair_p and ix86_fuse_mov_alu_p match current CPUs better

2025-03-03 Thread Jan Hubicka
Hi, The current implementation of fusion predicates misses some common fusion cases on zen and more recent cores. I added knobs for individual conditionals we test. 1) I split checks for fusing ALU with conditional operands when the ALU has a memory operand. This seems to be supported by zen3+

Re: [PATCH] ipa-vr: Handle non-conversion unary ops separately from conversions (PR 118756)

2025-02-27 Thread Jan Hubicka
> gcc/ChangeLog: > > 2025-02-24 Martin Jambor > > PR ipa/118785 > > * ipa-cp.cc (ipa_vr_intersect_with_arith_jfunc): Handle non-conversion > unary operations separately before doing any conversions. Check > expr_type_first_operand_type_p for non-unary operations too.

Re: [PATCH] ipa-sra: Avoid clashes with ipa-cp when pulling accesses across calls (PR 118243)

2025-02-27 Thread Jan Hubicka
> gcc/ChangeLog: > > 2025-02-10 Martin Jambor > > PR ipa/118243 > * ipa-sra.cc (pull_accesses_from_callee): New parameters > caller_ipcp_ts and param_idx. Check that scalar pulled accesses would > not clash with a known IPA-CP aggregate constant. > (param_splitting

Re: [PATCH v2] ira: Add a target hook for callee-saved register cost scale

2025-02-20 Thread Jan Hubicka
> > Thanks for running these. I saw poor results for perlbench with my > initial aarch64 hooks because the hooks reduced the cost to zero for > the entry case: > > auto entry_cost = targetm.callee_save_cost > (spill_cost_type::SAVE, hard_regno, mode, saved_nregs, >

Re: [PATCH v2] ira: Add a target hook for callee-saved register cost scale

2025-02-20 Thread Jan Hubicka
> On Wed, Feb 19, 2025 at 9:06 PM Jan Hubicka wrote: > > > > Hi, > > this is a variant of a hook I benchmarked on cpu2016 with -Ofast -flto > > and -O2 -flto. For non -Os and no Windows ABI should be practically the > > same as your variant that was simply r

Re: [PATCH v2] ira: Add a target hook for callee-saved register cost scale

2025-02-19 Thread Jan Hubicka
Hi, this is a variant of a hook I benchmarked on cpu2016 with -Ofast -flto and -O2 -flto. For non -Os and no Windows ABI it should be practically the same as your variant that was simply returning mem_cost - 2. It seems mostly SPEC neutral. With -O2 -flto there is a small 4% improvement on povray (whic
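
For reference, the "simply returning mem_cost - 2" variant mentioned above amounts to something like the sketch below. The function name callee_save_cost_sketch and its single-parameter signature are assumptions made for illustration; the actual target hook takes further arguments that are not visible in this snippet.

```c
/* Hypothetical, simplified stand-in for the callee-save cost hook.
   mem_cost is the cost of a memory spill as seen by the allocator;
   the variant discussed in the thread just makes saving a
   callee-saved register slightly cheaper than that.  */
static int callee_save_cost_sketch(int mem_cost)
{
    return mem_cost - 2;
}
```

With this shape, a mem_cost of 2 or less yields a cost of zero or below.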

Re: [PATCH v2] ira: Add a target hook for callee-saved register cost scale

2025-02-18 Thread Jan Hubicka
> Jan Hubicka writes: > > Concerning x86 specifics, there is cost for allocating stack frame. So > > if the function has nothing on stack frame push/pop becomes a bit better > > candidate than a spill. The hook you added does not seem to be able to > > test this, since
