> On Fri, May 30, 2025 at 11:30 AM Jan Hubicka wrote:
> >
> > Hi,
> > > >
> > > > Hi,
> > > >
> > > > the attached Ada testcase compiled with -O2 -gnatn makes the compiler
> > > > crash in
> > > > vect_ca
Hi,
> >
> > Hi,
> >
> > the attached Ada testcase compiled with -O2 -gnatn makes the compiler crash
> > in
> > vect_can_force_dr_alignment_p during SLP vectorization:
> >
> > if (decl_in_symtab_p (decl)
> > && !symtab_node::get (decl)->can_increase_alignment_p ())
> > return false;
> >
> Hi,
>
> in GCC 15 we allowed the jump-function generation code to skip over a
> type-cast converting one integer to another as long as the latter can
> hold all the values of the former or has at least the same precision.
> This works well for IPA-CP where we do then evaluate each jump
> function as
>
> However I do not quite follow the old or new logic here.
> So if I have only one unknown edge out of (or into) a BB and I know
> its count, I can determine the count of that edge by Kirchhoff's law.
>
> But then the old code computes number of edges out of the BB
> and if it is only one it updates the
> diff --git a/gcc/auto-profile.cc b/gcc/auto-profile.cc
> index 7e0e8c66124..8a317d85277 100644
> --- a/gcc/auto-profile.cc
> +++ b/gcc/auto-profile.cc
> @@ -1129,6 +1129,26 @@ afdo_set_bb_count (basic_block bb, const stmt_set &promoted)
>   gimple *stmt = gsi_stmt (gsi);
>   if (gimp
> I also noticed that some tests are only enabled for x86. I am also seeing:
> ./gcc/testsuite/gcc/gcc.sum:UNSUPPORTED: gcc.dg/tree-prof/pr66295.c
This is testing a former ifunc bug which reproduced with -fprofile-use
> ./gcc/testsuite/gcc/gcc.sum:UNSUPPORTED: gcc.dg/tree-prof/split-1.c
This is test
> Hi,
> autofdo tests are now running only for x86. This patch makes them
> run for aarch64 too. Verified that perf and create_gcov run
> as expected.
>
> gcc/ChangeLog:
>
> * config/aarch64/gcc-auto-profile: Make script executable.
>
> gcc/testsuite/ChangeLog:
>
> * lib/t
Hi,
since uses of addss for purposes other than modelling FP addition/subtraction
should
be gone now, this patch sets addss cost back to 2.
Bootstrapped/regtested x86_64-linux, committed.
gcc/ChangeLog:
PR target/119298
* config/i386/x86-tune-costs.h (struct processor_costs): Set
Hi,
with normal profile feedback, checking that the entry block count is non-zero is
a quite reliable check for the presence of a non-0 profile in the body, since the
function body can only be executed if the entry block was executed. With autofdo this
is not true, since the entry block may just execute too few t
Hi,
This patch makes auto-fdo more careful about keeping the info we have
from static profile prediction.
If all counters in a function are 0, we can keep the original auto-fdo profile.
Having an all-0 profile is not very useful, especially because 0 in autofdo is not
very informative and the code still may hav
> gcc/ChangeLog:
>
> * config/i386/i386-expand.cc (emit_reduc_half): Use shuffles to
> generate reduc half for V4SI, similar modes.
> * config/i386/i386.h (TARGET_SSE_REDUCTION_PREFER_PSHUF): New Macro.
> * config/i386/x86-tune.def (X86_TUNE_SSE_REDUCTION_PREFER_PSHUF):
>
Hi,
this code to track what locations were used when reading auto-fdo profile
seems dead since the initial commit. Removed thus.
Committed as obvious.
Honza
gcc/ChangeLog:
* auto-profile.cc (function_instance::mark_annotated): Remove.
(function_instance::total_annotated_count): Re
>
>
> > On 26 May 2025, at 5:34 pm, Jan Hubicka wrote:
> >
> >
> > Hi,
> > also, please, can you add a testcase? We should have some coverage for
> > auto-fdo specific is
Hi,
also, please, can you add a testcase? We should have some coverage for
auto-fdo specific issues
Honza
0002-AUTOFDO-Merge-profiles-of-clones-before-annotating.patch
Description: 0002-AUTOFDO-Merge-profiles-of-clones-before-annotating.patch
Hi,
> Ping?
Sorry for the delay. I think I finally got auto-fdo running on my box
and indeed I see that if a function is cloned later, the profile is lost.
There are .suffixes added before the afdo pass (such as openmp offloading or
nested functions) and there are .suffixes added after afdo (by ipa
clonin
> > On 9 May 2025, at 11:55 am, Kugan Vivekanandarajah
> > wrote:
> >
> > ipa-split is currently not run for auto-profile. IMO this was an oversight.
> > This patch enables it similarly to PGO runs.
> >
> > gcc/ChangeLog:
> >
> >* ipa-split.cc (pass_feedback_split_functions::clone): New.
> >
> Hi,
>
> starting with GCC 15 the order is not unique for any symtab_nodes but
> m_uid is, I believe we ought to dump the latter in the ipa-clones dump,
> if only so that people can reliably match entries about new clones to
> those about removed nodes (if any).
>
> Bootstrapped and tested on x8
> With the avx512_two_epilogues tuning enabled for zen4 and zen5
> the gcc.target/i386/vect-epilogues-5.c testcase below regresses
> and ends up using AVX2 sized vectors for the masked epilogue
> rather than AVX512 sized vectors. The following patch rectifies
> this and adds coverage for the inten
> Thanks for the review.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> In some benchmarks, I noticed STV fail due to an unprofitable cost, but the igain
> is inside the loop while the sse<->integer conversion is outside the loop; the
> current cost
> model doesn't consider the
> > gcc/ChangeLog:
> >
> > * config/i386/i386-features.cc
> > (scalar_chain::mark_dual_mode_def): Weight
> > n_integer_to_sse/n_sse_to_integer with bb frequency.
> > (general_scalar_chain::compute_convert_gain): Ditto, and
> > adjust function prototype to ret
> > Only the instructions with latency info are really different.
> > So the unconverted code has a sum of latencies of 4 and real latency 3.
> > The converted code has a sum of latencies of 4 and real latency 3
> > (vmovd+vpmaxsd+vmov).
> > So I do not quite see why it should be a win.
>
> Note this was historically d
Hi,
this patch fixes some of the problems with costing in the scalar to vector pass.
In particular
1) the pass uses optimize_insn_for_size_p, which is intended to be used by
expanders and splitters and requires the optimization pass to use
set_rtl_profile (bb) for the currently processed bb.
This is n
Hi,
This patch adds pattern matching for float<->int conversions, both as normal
statements and as promote_demote. While updating promote_demote I noticed that
in cleanups I had turned "stmt_cost =" into "int stmt_cost =", which turned
the existing FP costing into a NOOP. I also added a comment on how demotes a
Hi,
this patch adds an ifdef so we don't get a warning about ix86_tls_index being
unused.
Bootstrapped x86_64-linux, committed.
* config/i386/i386.cc (ix86_tls_index): Add ifdef.
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index f28c92a9d3a..89f518c86b5 100644
--- a/gcc/config/
Hi,
The inliner currently applies different heuristics to hot and cold calls (the
latter are inlined only if the code size will shrink). It may happen that the
call itself is hot, but significant time is spent in the callee and inlining
makes it faster. For this reason we want to check if the anticip
Hi,
ix86_rtx_costs costs VEC_MERGE by special-casing AVX512 mask operations and otherwise
returning cost->sse_op, completely ignoring the costs of the operands. Since
VEC_MERGE is also used to represent the scalar variant of an SSE/AVX operation, this
means that many instructions (such as SSE conversions) are ofte
> target_insn_cost is used to prevent the rpad optimization from being undone by
> late_combine1; it looks like it's not sufficient for size_cost.
>
> 21804 static int
> 21805 ix86_insn_cost (rtx_insn *insn, bool speed)
> 21806 {
> 21807   int insn_cost = 0;
> 21808   /* Add extra cost to avoid post_reload late
> > so the gain is the difference in runtime of the integer variant compared to the
> > vector variant, and the costs are the extra int->sse and sse->int conversions
> > needed?
> >
> > If you scale everything by a BB frequency, you will get a weird
> > behaviour if the chain happens to consist only of instructions in
Hi,
as noticed by Martin Jambor, I introduced a bug while simplifying
cs_interesting_for_ipcp_p and reversed the condition for
flag_profile_partial_training. Also I noticed that we probably want to
consider calls with uninitialized counts for cloning so the pass does something
with -fno-guess-branch-pr
> > > I am generally trying to get rid of the remaining uses of REG_FREQ since the
> > > 1 based fixed point arithmetic is not always working that well.
> > >
> > > You can do the sums in profile_count type (doing something reasonable
> > > when count is uninitialized) and then convert it to sreal for
Hi,
this patch (partly) solves the problem in PR119900 where changing the ix86_size_cost
of cheap SSE instructions from 2 bytes to 4 bytes regresses imagemagick with PGO
(119% on core and 54% on Zen).
There is an interesting chain of problems
1) the train run of the SPEC2017 imagick is wrong and it does not
> > I am generally trying to get rid of the remaining uses of REG_FREQ since the
> > 1 based fixed point arithmetic is not always working that well.
> >
> > You can do the sums in profile_count type (doing something reasonable
> > when count is uninitialized) and then convert it to sreal for the final
> > And thus it may be more RTL friendly to represent it this way instead of
> > current unspec called UNSPEC_IEEE_MAX...
>
> There's a patch proposed for that [1], and Jakub has some comments.
>
> Jakub Jelinek wrote on Friday, November 15, 2024 at 16:20:
> >
> > On Fri, Nov 15, 2024 at 04:04:55PM +0800, Hongyu Wan
> Note for blendv, it checks the most significant bit of the mask, not a simple
> if_then_else
> mask
> if_true
> if_false
>
> It should be
> if_then_else
>ashiftrt mask 31
>if_true
>if_false
I think the canonical form (produced by combine) would be
if_then_else
ge mask 0
if_false
> On Thu, Apr 24, 2025 at 6:27 PM Jan Hubicka wrote:
> >
> > > Since ix86_expand_sse_movcc will simplify them into a simple vmov, vpand
> > > or vpandn.
> > > Current register_operand/vector_operand could lose some optimization
> > > opportunity.
>
Hi,
the problem here is division by zero, since adjusted 0 > precise 0. Fixed by
using the right test.
gcc/ChangeLog:
PR ipa/119924
* ipa-cp.cc (update_counts_for_self_gen_clones): Use nonzero_p.
(update_profiling_info): Likewise.
(update_specialized_profile): Likewise
> Since ix86_expand_sse_movcc will simplify them into a simple vmov, vpand
> or vpandn.
> Current register_operand/vector_operand could lose some optimization
> opportunity.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> * config/i386/p
Hi,
> With this patch
> https://gcc.gnu.org/pipermail/gcc-patches/2025-April/681503.html
> scalar version can also be optimized to vcmpnltsd + vpandn
this is nice. Would be nice if this was also caught by the combiner...
> > Can we also check if_true/if_false, if they're const0, or
> > constm1(inte
> From: "hongtao.liu"
>
> When FMA is available, N-R step can be rewritten with
>
> a / b = (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a
>
> which has two FMAs generated.[1]
>
> [1] https://bugs.llvm.org/show_bug.cgi?id=21385
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok fo
> In some benchmarks, I noticed STV fail due to an unprofitable cost, but the igain
> is inside the loop while the sse<->integer conversion is outside the loop; the
> current cost
> model doesn't consider the frequency of those gains/costs.
> The patch weights those costs with frequency just like LRA does.
>
>
> > But the vectorizer computes the costs of a vector load of the off array, 4x moving
> > vector to
> > scalar and 4x stores. I wonder if generic code can match this better and
> > avoid
> > the vector load of addresses when open-coding gather/scatter?
>
> The vectorizer does not explicitly consider the low
Hi,
this patch adds special cases for vectorizer costs of COND_EXPR, MIN_EXPR,
MAX_EXPR, ABS_EXPR and ABSU_EXPR. We previously costed ABS_EXPR and ABSU_EXPR,
but it was only correct for the FP variant (where it corresponds to andps clearing
the sign bit). Integer abs/absu is open coded as a conditional move
> On Mon, Apr 21, 2025 at 7:24 AM H.J. Lu wrote:
> >
> > On Sun, Apr 20, 2025 at 6:31 PM Jan Hubicka wrote:
> > >
> > > > PR target/102294
> > > > PR target/119596
> > > > * config/i386/x86-tune-costs.h (generi
> PR target/102294
> PR target/119596
> * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> (generic_memset): Likewise.
> (generic_cost): Change CLEAR_RATIO to 17.
> * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
> Add m_GENERIC
> On Sun, Apr 20, 2025 at 4:19 AM Jan Hubicka wrote:
> >
> > > On Tue, Apr 8, 2025 at 3:52 AM H.J. Lu wrote:
> > > >
> > > > Simplify memcpy and memset inline strategies to avoid branches for
> > > > -mtune=generic:
> > > >
>
> On Tue, Apr 8, 2025 at 3:52 AM H.J. Lu wrote:
> >
> > Simplify memcpy and memset inline strategies to avoid branches for
> > -mtune=generic:
> >
> > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> >load and store for up to 16 * 16 (256) bytes when the data size is
> >
Hi,
Znver5 has an addss cost of 2 while other common floating point SSE operations
cost 3 cycles. We currently have only one entry in the cost tables, which
makes it impossible to model this. This patch adds sse_fp_op which is used for
other common FP operations (basically conversions) and updates
> On Thu, 17 Apr 2025, Jan Hubicka wrote:
>
> > Hi,
> > Znver5 has an addss cost of 2 while other common floating point SSE operations
> > cost 3 cycles. We currently have only one entry in the cost tables which
> > makes it impossible to model this. This patc
Hi,
Znver5 has a latency of 2 for addss in the typical case while all earlier versions
have latency 3.
Unfortunately the addss cost is used to cost many more SSE instructions than just
addss, and
setting the cost to 2 makes us vectorize 4 64bit stores into one 256bit
store, which
in turn regresses imagemagick
Hi,
this patch adds streaming of return summaries from compile time to ltrans,
which is now needed for vrp to not output false errors on musttail.
Bootstrapped/regtested x86_64-linux, committed.
Co-authored-by: Jakub Jelinek
gcc/ChangeLog:
PR tree-optimization/119614
* ip
Hi,
> gcc/ChangeLog:
>
> PR tree-optimization/117790
> * tree-vect-loop.cc (scale_profile_for_vect_loop): Use
> scale_loop_profile_hold_exit_counts instead of scale_loop_profile. Drop
> the exit edge parameter, since the code now handles multiple exits.
> Adjust the
> Hi,
> > gcc/ChangeLog:
> >
> > PR tree-optimization/117790
> > * cfgloopmanip.cc (can_flow_scale_loop_freqs_p): New.
> > (flow_scale_loop_freqs): New.
> > (scale_loop_freqs_with_exit_counts): New.
> > (scale_loop_freqs_hold_exit_counts): New.
> > (scale_loop_profile): Ref
Hi,
> gcc/ChangeLog:
>
> PR tree-optimization/117790
> * cfgloopmanip.cc (can_flow_scale_loop_freqs_p): New.
> (flow_scale_loop_freqs): New.
> (scale_loop_freqs_with_exit_counts): New.
> (scale_loop_freqs_hold_exit_counts): New.
> (scale_loop_profile): Refactor
> +@opindex fipa-reorder-for-locality
> +@item -fipa-reorder-for-locality
> +Group call chains close together in the binary layout to improve code code
> +locality. This option is incompatible with an explicit
> +@option{-flto-partition=} option since it enforces a custom partitioning
> +scheme.
> This patch revisits the fix for PR 118097 and instead of deducing the
> necessary operation type it just uses the value collected and streamed
> by an earlier patch.
>
> It is bigger than the ones for propagating value ranges and known bits
> because we track constants both in parameters themsel
> After reviewing the code involving arithmetic pass-through jump
> functions I found out that we actually do check that the type of the
> LHS is compatible with the type of the first operand on the RHS. Now
> that we stream the types of the LHS of these operations, this is no
> longer necessary -
> This patch revisits the fix for PR 118785 and instead of deducing the
> necessary operation type it just uses the value collected and streamed
> by an earlier patch.
>
> gcc/ChangeLog:
>
> Bootstrapped and tested and LTO bootstrapped on x86_64-linux. OK for
> master?
>
> Thanks,
>
> Martin
>
> This patch just introduces a form of dumping of widest ints that only
> have zeros in the lowest 128 bits so that instead of printing
> thousands of f's the output looks like:
>
>    Bits: value = 0x, mask = all ones followed by
> 0x
>
> and then makes sur
> After the propagation of constants and value ranges, it turns out
> that the propagation of known bits also needs to be made aware of any
> intermediate types in which any arithmetic operations are made and
> must limit its precision there. This implements just that, using the
> newly collected
> In order to replace the use of the somewhat unwieldy
> expr_type_first_operand_type_p we need to record and stream the types
> of results of operations recorded in arithmetic jump functions. This
> is necessary so that we can then simulate them at the IPA stage with
> the corresponding precision and
> On Thu, 20 Mar 2025, Jason Merrill wrote:
>
> > Tested x86_64-pc-linux-gnu. OK for trunk and backports?
> >
> > -- 8< --
> >
> > Since the mangling of the second lambda changed (previously we counted all
> > lambdas, now we only count lambdas with the same signature), we
> > generate_mangling
Hi,
Currently enabling profile feedback regresses x264 and exchange. In both cases
the root of the issue is that the ipa-cp cost model thinks cloning is not
relevant when feedback is available, while it does clone without feedback.
Consider:
__attribute__ ((used))
int a[1000];
__attribute__ ((noinline)
> On Mon, 31 Mar 2025, Martin Jambor wrote:
>
> > Hi,
> >
> > the testcase in PR 118924, when compiled on Aarch64, contains a
> > gimple aggregate assignment statement between different types which
> > are types_compatible_p but behave differently for the purposes of
> > alias analysis.
> >
> > > So in WPA we can not assume that TYPE_CANONICAL (A) == TYPE_CANONICAL
> > > (B) is forever. We also don't do any gimple transforms here, so this is
> > > kind of safe, but ugly.
> >
> > Hmm. But we do
> >
> > /* alias_ptr_types_compatible_p relies on fact that during LTO
> >
> > So in WPA we can not assume that TYPE_CANONICAL (A) == TYPE_CANONICAL
> > (B) is forever. We also don't do any gimple transforms here, so this is
> > kind of safe, but ugly.
>
> Hmm. But we do
>
> /* alias_ptr_types_compatible_p relies on fact that during LTO
> types do not g
Hi,
this patch fixes a problem with size costs declaring all moves to have equal size
(which was caught by the sanity check I tried in the prologue move cost hook).
Costs are relative to a reg-reg move, which is two. Coincidentally that is also the
size of the encoding, so the costs should represent typical size
Hi,
I noticed that this patch got forgotten and I think it may be useful to
solve this in the next stage 1.
>
> cp_apply_type_quals_to_decl drops 'const' if the type has mutable members.
> Unfortunately TREE_READONLY on the PARM_DECL isn't helpful in the case of an
> invisiref parameter.
>
> > > But ma
> On Thu, Mar 27, 2025 at 02:04:24PM +0100, Jan Hubicka wrote:
> > > > Newline between functions please.
> > > >
> > > > OK with those two changes.
> > >
> > > Looking back through my inbox, this one doesn't seem to have been
> The following adds missing reservations for the store variant of
> sselog reservations covering
>
> ;; 112--> b 0: i1499 [dx-0x10]=vec_select(xmm10,parallel):nothing
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?
>
> PR target/119010
> * config/i386/zn4zn5.md
> Like the other instances. This avoids
>
> ;; 1--> b 0: i6540 {xmm2=const_vector;unspec[xmm2] 38;}:nothing
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?
>
> PR target/119010
> * config/i386/sse.md (*vmov_constm1_pternlog_false_dep):
> Add mode attrib
> Hi!
>
> The following testcase is rejected, because IPA-SRA decides to
> turn bar.constprop call into bar.constprop.isra which returns void.
> While there is no explicit lhs on the call, as it is a musttail call
> the tailc pass checks if IPA-VRP returns singleton from that function
> and the fu
> The following handles TI, OI and XI mode in the respective EVEX
> compare reservations that do not use memory (I've not yet run into
> ones with). The znver automata has separate reservations for
> integer compares (but only for zen1, for zen2 and zen3 there are
> no compare reservations at all)
> There's the znver4_sse_test reservation which matches the memory-less
> SSE compares but currently requires prefix_extra == 1. The old
> znver automata in this case sometimes uses znver1-double instead of
> znver1-direct, but it's quite a maze. The following simply drops
prefix_extra is used to
> The following fixes up the ssemov2 type introduction, amending
> the znver4_sse_mov_fp_load reservation. This fixes
>
> ;; 14--> b 0: i1436 xmm6=vec_concat(xmm6,[ax+0x8]) :nothing
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?
>
> PR target/119010
> *
> They were using ssecvt instead of sseicvt, I've also added handling
> for sseicvt2 which was introduced without fixing up automata, and
> the relevant instruction uses DFmode. IMO this is a quite messy
> area that could need TLC in the machine description itself.
>
> Bootstrapped and tested on
> Like the other DFmode cases.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?
>
> PR target/119010
> * config/i386/zn4zn5.md (znver4_sse_div_pd,
> znver4_sse_div_pd_load, znver5_sse_div_pd_load): Handle DFmode.
OK,
thanks!
Honza
> ---
> gcc/config/i386/zn4zn5.md |
> movv8si_internal uses sselog1 and V4SFmode for an instruction like
>
> (insn 363 2437 371 97 (set (reg:V8SI 46 xmm10 [1125])
> (const_vector:V8SI [
> (const_int 0 [0]) repeated x8
> ])) "ComputeNonbondedUtil.C":185:21 2402 {movv8si_internal}
>
> this wasn't c
> I still was seeing
>
> ;;0--> b 0: i 101 {[sp-0x3c]=[sp-0x3c]+0x1;clobber flags;}:nothing
>
> so the following adds a standard alu insn reservation mimicking that
> from the znver.md description allowing both load and store.
>
> Bootstrap and regtest running on x86_64-unknown-linux-gnu
> > Newline between functions please.
> >
> > OK with those two changes.
>
> Looking back through my inbox, this one doesn't seem to have been
> pushed. Was it superseded by something else, or is it just waiting for
> stage 1 now?
Seems I missed the approval, sorry. I will push it - I think it w
> The following adds DFmode where V1DFmode and SFmode were handled.
> This resolves missing reservations for adds, subs [with memory]
> and for FMAs for the testcase I'm looking at. Resolved cases are
>
> -;; 16--> b 0: i 237 xmm3=xmm3+[r9*0x8+si] :nothing
> -;; 29-->
> >
> > Hmm. I do wonder whether your earlier patch was more "correct" in the
> > sense that a tail call does not return to the calling function but its
> > caller.
> > That means it should not have a fallthru edge, so our representation
> > with feeding a return value to a function-local return
Hello,
I apologize for the late reply here. I went through the paper in greater
detail. While I originally thought the usual path profiling can be
reasonably merged with the prime path profiling, so that it is useful both
for optimization and coverage testing, I think it is better to not do
that - the require
> The gcov function summaries only output the covered lines, not the
> branches and calls. Since the function summaries are opt-in, it
> probably makes sense to also include branch coverage, calls, and
> condition coverage.
>
> $ gcc --coverage -fpath-coverage hello.c -o hello
> $ ./hello
>
> Be
>
> Hmm. I do wonder whether your earlier patch was more "correct" in the
> sense that a tail call does not return to the calling function but its caller.
> That means it should not have a fallthru edge, so our representation
> with feeding a return value to a function-local return stmt isn't a g
> The only question I have is whether flow_call_edges_add is only called while
> profiling or whether it is called at some other time. Looking into who calls
> flow_call_edges_add, it is only branch_prob (profile.cc), which is only
> called from tree-profile.cc. So a cleanup (for GCC 16) is to remove the
> cfghook flow_ca
> The imov and imovx classified stores miss reservations in the znver4/5
> pipeline description. The following adds them.
>
> Bootstrap and regtest pending on x86_64-unknown-linux-gnu.
>
> OK?
>
> PR target/119010
> * config/i386/zn4zn5.md (znver4_imov_double_store,
> znver5_i
> On Tue, 25 Mar 2025, Richard Biener wrote:
>
> > The following resolves missing reservations for DFmode *movdf_internal
> > loads and stores, visible as 'nothing' in -fsched-verbose=2 dumps.
> >
> > Bootstrap and regtest running on x86_64-unknown-linux-gnu.
>
> The alternative for the larger s
> The imov and imovx classified stores miss reservations in the znver4/5
> pipeline description. The following adds them.
>
> Bootstrap and regtest pending on x86_64-unknown-linux-gnu.
>
> OK?
>
> PR target/119010
> * config/i386/zn4zn5.md (znver4_imov_double_store,
> znver5_i
Hi,
this patch fixes an issue with speculation and x264. With profile feedback
we first introduce speculative calls to mc_chroma, which is called indirectly.
Then we propagate constants across these calls (which is a useful transform), but
then speculation_useful_p decides that these speculations are not
Hi,
this patch fixes a false inconsistent-profile error message seen when building
SPEC with
-fprofile-use -fdump-ipa-profile.
The problem is that with dumping, tree_estimate_probability is run in dry-run
mode to report success rates of the heuristics. It however runs
determine_unlikely_bbs
which ovewri
> > This is OK. In general, I think we could also go with assert on
> > mem_cost <= 2, since that is kind of bogus setting (I don't think we
> > will ever need to support x86 CPU with memory stores being as cheap as
> > reg-reg moves), but current form is good.
>
> Unless the loading/storing inte
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index fb93a6fdd0a..be5e27fc391 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -20600,12 +20600,26 @@ ix86_class_likely_spilled_p (reg_class_t rclass)
>return false;
> }
>
> -/* Implement TARGET_IR
Hi,
Zen5 on some variants has a false dependency on tzcnt, blsi, blsr and blsmsk
instructions. Those can be tested by the following benchmark
jh@shroud:~> cat ee.c
int
main()
{
int a = 10;
int b = 0;
for (int i = 0; i < 10; i++)
{
#ifdef BREAK
asm
Hi,
The current implementation of fusion predicates misses some common
fusion cases on zen and more recent cores. I added knobs for the
individual conditionals we test.
1) I split the checks for fusing ALU with conditional operands when the ALU
has a memory operand. This seems to be supported by zen3+
> gcc/ChangeLog:
>
> 2025-02-24 Martin Jambor
>
> PR ipa/118785
>
> * ipa-cp.cc (ipa_vr_intersect_with_arith_jfunc): Handle non-conversion
> unary operations separately before doing any conversions. Check
> expr_type_first_operand_type_p for non-unary operations too.
> gcc/ChangeLog:
>
> 2025-02-10 Martin Jambor
>
> PR ipa/118243
> * ipa-sra.cc (pull_accesses_from_callee): New parameters
> caller_ipcp_ts and param_idx. Check that scalar pulled accesses would
> not clash with a known IPA-CP aggregate constant.
> (param_splitting
>
> Thanks for running these. I saw poor results for perlbench with my
> initial aarch64 hooks because the hooks reduced the cost to zero for
> the entry case:
>
> auto entry_cost = targetm.callee_save_cost
> (spill_cost_type::SAVE, hard_regno, mode, saved_nregs,
>
> On Wed, Feb 19, 2025 at 9:06 PM Jan Hubicka wrote:
> >
> > Hi,
> > this is a variant of a hook I benchmarked on cpu2016 with -Ofast -flto
> > and -O2 -flto. For non -Os and no Windows ABI it should be practically the
> > same as your variant that was simply r
Hi,
this is a variant of a hook I benchmarked on cpu2016 with -Ofast -flto
and -O2 -flto. For non -Os and no Windows ABI it should be practically the
same as your variant that was simply returning mem_cost - 2.
It seems mostly SPEC neutral. With -O2 -flto there is a
small 4% improvement on povray (whic
> Jan Hubicka writes:
> > Concerning x86 specifics, there is a cost for allocating the stack frame. So
> > if the function has nothing in the stack frame, push/pop becomes a bit better
> > candidate than a spill. The hook you added does not seem to be able to
> > test this, since