Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/mmx.md (VHF_32_64): New mode iterator.
(3): New define_expand, merged from ..
(v4hf3): .. this and
(v2hf3): .. this.
(movd_v2hf_to_sse_reg): Ne
Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/i386.cc (ix86_build_const_vector): Handle V2HF
and V4HFmode.
(ix86_build_signbit_mask): Ditto.
* config/i386/mmx.md (mmxintvecmode): Ditto.
(2)
In the expander, the insn below is emitted:
rtx tmp = gen_rtx_VEC_CONCAT (V4SFmode, operands[2],
force_reg (V2SFmode, CONST1_RTX (V2SFmode)));
but *vec_concat only allows register_operand.
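A minimal sketch of the presumed fix (my reading of the description, not the actual patch hunk): force operands[2] into a register before building the VEC_CONCAT, since *vec_concat only accepts register_operand.

  /* Sketch only: make sure both VEC_CONCAT arguments satisfy
     register_operand before emitting the insn.  */
  rtx op2 = force_reg (V2SFmode, operands[2]);
  rtx tmp = gen_rtx_VEC_CONCAT (V4SFmode, op2,
                                force_reg (V2SFmode, CONST1_RTX (V2SFmode)));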
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/Ch
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/mmx.md (V2FI_32): New mode iterator
(movd_v2hf_to_sse): Rename to ..
(movd__to_sse): .. this.
(movd_v2hf_to_sse_reg): Rename to ..
(movd__to_sse_reg)
lrint/lround/lceil/lfloor are not vectorized due to a vectorization
restriction: when the input element size is different from the output element
size, vectorization relies on the old TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
hook instead of the modern standard pattern name. The patch only supports standard
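As an illustration (my own example, not the testcase from the patch), a loop of this shape needs the standard optab pattern because the float input and the long output have different element sizes:

  /* Illustrative only: lround from float (32-bit) to long (64-bit on LP64),
     so input and output element sizes differ.  */
  void
  foo (long *r, float *a, int n)
  {
    for (int i = 0; i < n; i++)
      r[i] = __builtin_lroundf (a[i]);
  }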
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/mmx.md (fma4): New expander.
(fms4): Ditto.
(fnma4): Ditto.
(fnms4): Ditto.
(vec_fmaddsubv4hf4): Ditto.
(vec_fmsubaddv4hf4): Ditto.
gcc/test
Also give up vectorization when niters_skip is negative, which will be
used for fully masked loops.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR tree-optimization/111820
PR tree-optimization/111833
* tree-vect-loop-manip.cc (vect
>So the bugs were not fixed without this hunk? IIRC in the audit
>trail we concluded the value is always positive ... (but of course
>a large unsigned value can appear negative if you test it this way?)
No, I added this in case there's a negative skip_niters in the future, as
you mentioned in the PR.
>So with pow being available this limit shouldn't be necessary any more and
>the testcase adjustment can be avoided?
I tried; compile time still hogs on mpz_powm(3, INT_MAX), so I'll just
keep this.
>and to avoid undefined behavior with too large shift just go the gmp
>way unconditionally.
Changed
While working on enabling more 32/64-bit vectorization for _Float16,
I noticed a redundant define_expand; the patch removes the expander.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
* config/i386/mmx.md (mmx_pinsrw): Removed.
---
gcc/co
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/103861
* config/i386/i386-expand.cc (ix86_expand_sse_movcc): Handle
V2HF/V2BF/V4HF/V4BFmode.
* config/i386/mmx.md (vec_cmpv4hfqi): New expander.
(vcondv4
This is the backport patch for the releases/gcc-13 branch; the original patch
for trunk is at [1].
The only difference between this backport and [1] is that GCC 13 doesn't
support auto_mpz, so this patch manually uses mpz_init/mpz_clear.
[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-October
>I think it's indeed on purpose that the result of v1 < v2 is a signed
>integer vector type.
>But build_vec_cmp should not use the truth type for the result but instead the
>truth type for the comparison, so
Changed build_vec_cmp in both C and C++; also note that for the jit part, it
already uses the type of comp
>vcond and vcondeq shouldn't be necessary if there's
>vcond_mask and vcmp support which is the "modern"
>way of handling vcond. Unless the ISA really can do
>compare and select with a single instruction.
The V2 patch removes vcond/vcondu from the initial version [1], but there are
many optimizations
When the 2 vectors are equal, the kmask is all-ones and kortest will set CF,
else CF will be cleared.
So the CF bit can be used to check the result of the comparison.
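A rough intrinsics illustration of the idea (my own sketch, not the patch itself; assumes AVX512BW/AVX512VL):

  #include <immintrin.h>

  /* The byte comparison produces a kmask; the mask is all-ones iff the two
     vectors are equal, which the compiler can test via kortest's CF.  */
  int
  vec_eq (__m256i a, __m256i b)
  {
    __mmask32 k = _mm256_cmpeq_epi8_mask (a, b);
    return k == 0xffffffffu;
  }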
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
Before:
vmovdqu (%rsi), %ymm0
vpxorq (%rdi), %ymm
-(define_split
-  [(set (match_operand:V2HI 0 "register_operand")
-	(eq:V2HI
-	  (eq:V2HI
-	    (us_minus:V2HI
-	      (match_operand:V2HI 1 "register_operand")
-	      (match_operand:V2HI 2 "register_operand"))
-	    (match_operand:V2HI 3 "const0_operand")
analyze_and_compute_bitop_with_inv_effect assumes the first operand is
loop invariant, which is not the case when it's an INTEGER_CST.
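A hypothetical reduced example of the kind of loop involved (my own guess at the shape, not the PR testcase):

  /* The final-value computation folds a bitop whose first operand is an
     INTEGER_CST rather than a loop-invariant SSA name.  */
  int
  foo (int a, int n)
  {
    for (int i = 0; i != n; i++)
      a = 0x4000 | a;
    return a;
  }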
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
Ok for trunk?
gcc/ChangeLog:
PR tree-optimization/105735
PR tree-optimization/111972
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/mmx.md (cmlav4hf4): New expander.
(cmla_conjv4hf4): Ditto.
(cmulv4hf3): Ditto.
(cmul_conjv4hf3): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/p
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/112393
* config/i386/i386-expand.cc (ix86_expand_vec_perm_vpermt2):
Avoid generating RTL code when d->testing_p.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr1
If alignb > ASAN_RED_ZONE_SIZE and offset[0] is not a multiple of
alignb, (base_align_bias - base_offset) may not be aligned to alignb, and
that caused a segmentation fault.
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
Ok for trunk and backport to GCC13?
gcc/ChangeLog:
PR sanitizer/110027
When we split
(insn 37 36 38 10 (set (reg:DI 104 [ _18 ])
(mem:DI (reg/f:SI 98 [ CallNative_nclosure.0_1 ]) [6 MEM[(struct
SQRefCounted *)CallNative_nclosure.0_1]._uiRef+0 S8 A32])) "test.C":22:42 84
{*movdi_internal}
(expr_list:REG_EH_REGION (const_int -11 [0xfff5])
int
It fixes an ICE due to an unrecognized logic-operation insn generated by the
lroundmn2 expanders.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/114334
* config/i386/i386.md (mode): Add new number V8BF,V16BF,V32BF.
(MOD
Commit r14-9459-g618e34d56cc38e only handles
general_scalar_chain::convert_op. The patch also handles
timode_scalar_chain::convert_op to avoid a potential similar bug.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk and backport to releases/gcc-13 branch?
gcc/ChangeLog:
Ok for trunk?
gcc/ChangeLog:
* doc/invoke.texi: Document -fexcess-precision=16.
---
gcc/doc/invoke.texi | 3 +++
1 file changed, 3 insertions(+)
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 85c938d4a14..673420fdd3e 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.te
gcc/ChangeLog:
* doc/invoke.texi: Document -fexcess-precision=16.
---
gcc/doc/invoke.texi | 3 +++
1 file changed, 3 insertions(+)
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 85c938d4a14..6bc1ebf9721 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -14930,6
wi::from_mpz doesn't take a sign argument; we want the value to be wrapped
instead of saturated, so pass utype and true to it, and that fixes the
bug.
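A one-line sketch of the described change (utype is from the description above; the variable name val is hypothetical):

  /* The final "true" asks wi::from_mpz for wrapping rather than
     saturating behaviour when the value doesn't fit utype.  */
  wide_int w = wi::from_mpz (utype, val, true);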
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk and backport to gcc13?
gcc/ChangeLog:
PR tree-optimization/114396
Also fixed a typo in the testcase.
Commit as an obvious fix.
gcc/testsuite/ChangeLog:
PR tree-optimization/114396
* gcc.target/i386/pr114396.c: Move to...
* gcc.c-torture/execute/pr114396.c: ...here.
---
.../{gcc.target/i386 => gcc.c-torture/execute}/pr114396.c | 6 +++
> > So, try to add some other variable with larger size and smaller alignment
> > to the frame (and make sure it isn't optimized away).
> >
> > alignb above is the alignment of the first partition's var, if
> > align_frame_offset really needs to depend on the var alignment, it probably
> > should b
---
htdocs/gcc-14/changes.html | 5 +
1 file changed, 5 insertions(+)
diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index 6d917535..a022357a 100644
--- a/htdocs/gcc-14/changes.html
+++ b/htdocs/gcc-14/changes.html
@@ -499,6 +499,11 @@ a work-in-progress.
-march=knm
target maybe_x32 doesn't check if platform has gnu/stubs-x32.h, but
it's included by stdint.h in the testcase.
Adjust testcase: remove stdint.h, use 'typedef long long int64_t'
instead.
Commit as an obvious patch.
gcc/testsuite/ChangeLog:
PR target/113711
* gcc.target/i386/apx-nd
After r14-2692-g1c6231c05bdcca, the option is defined as EnumSet and
-fcf-protection=branch won't unset any other bits since they're in
different groups. So to override -fcf-protection, an explicit
-fcf-protection=none needs to be added and then with
-fcf-protection=XXX
Bootstrapped and regtested
To override -fcf-protection, -fcf-protection=none needs to be added first,
and then -fcf-protection=xxx (for example, -fcf-protection=none -fcf-protection=branch).
---
htdocs/gcc-14/changes.html | 6 ++
1 file changed, 6 insertions(+)
diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index e3a68998..72b0d291 100644
--- a/htdocs/gcc-1
After r14-7124-g6686e16fda4190, the testcase can be optimized to
MAX_EXPR if the backend supports that. So I adjusted the testcase to
scan for MAX_EXPR, but it failed on many platforms which don't support
that.
As pinski mentioned, target vect_no_int_min_max is only available
under the vect directory, so fo
After vect_early_break is supported, more vectorization is enabled (3
COPYSIGN), so adjust the testcase for that.
Committed as an obvious fix.
gcc/testsuite/ChangeLog:
* gcc.target/i386/part-vect-copysignhf.c: Remove
-ftree-vectorize from dg-options.
---
gcc/testsuite/gcc.target/i386/part-
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/i386-options.cc (ix86_option_override_internal):
Enable -mlam=u57 by default when compiled with
-fsanitize=hwaddress.
---
gcc/config/i386/i386-options.cc | 9 +
1 file changed, 9 insertions(+)
diff --git a/gcc/con
There are 2 cases:
1. hwasan-poison-optimisation.c is supposed to scan for a call to
__hwasan_tag_mismatch4, and x86 has a different mnemonic (call) from
aarch64 (bl), so adjust the testcase to scan for either call or bl.
2. alloca-outside-caught.c/vararray-outside-caught.c are supposed to
scan mismatched tags and
Similarly for A < B ? B : A to MAX_EXPR.
There is code in the frontend to optimize such patterns, but it failed to
handle the testcase in the PR since the pattern is only exposed at the gimple
level when folding backend builtins.
pr95906 can now be optimized to MAX_EXPR, as the comment in the
testcase says.
// FIXME: this shoul
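For illustration (my own minimal example, not the PR testcase), both of these forms should end up as MAX_EXPR:

  int max1 (int a, int b) { return a >= b ? a : b; }
  int max2 (int a, int b) { return a < b ? b : a; }   /* A < B ? B : A */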
> I wonder if you can amend the existing patterns instead by iterating
> over cond/vec_cond. There are quite some (look for uses of
> minmax_from_comparison) that could be adapted to vectors.
>
> The ones matching the simple form you match are
>
> #if GIMPLE
> /* A >= B ? A : B -> max (A, B) and f
> Hmm, I would suggest you put reg_needed into the class and accumulate
> over all vec_construct, with your patch you pessimize a single v32qi
> over two separate v16qi for example. Also currently the whole block is
> gated with INTEGRAL_TYPE_P but register pressure would be also
> a concern for f
I.e. for cases like below:
a[0] = b1;
a[1] = b2;
..
a[n] = bn;
There are extra dependences when constructing the vector, but not for
the scalar stores. According to experiments, it's generally worse.
The patch adds a cut-off heuristic when the vec_stmt is just
a vec_construct and vector store. It imp
Like r14-5990-gb4a7c1c8c59d19, but this patch optimizes udot_prod.
(zero_extend) (unsigned char) -> int is equal
to (zero_extend) (unsigned char) -> short
+ (sign_extend) (short) -> int
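An illustrative source loop of the udot_prod shape (my own example, not from the patch):

  int
  udot (unsigned char *a, unsigned char *b, int n)
  {
    int sum = 0;
    for (int i = 0; i < n; i++)
      sum += a[i] * b[i];   /* both inputs zero-extended from unsigned char */
    return sum;
  }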
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
It should be safe to emu
If the function doesn't clobber any SSE registers, or only clobbers the
128-bit part, then vzeroupper isn't issued before the function exit, and
the status is not CLEAN but ANY after the function.
Also for a sibling_call it's safe to issue a vzeroupper. Also there
could be a missing vzeroupper since there's no mo
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/112904
* config/i386/mmx.md (*xop_pcmov_): New define_insn.
gcc/testsuite/ChangeLog:
* g++.target/i386/pr112904.C: New test.
---
gcc/config/i386/mmx.md
> since you are looking at TYPE_PRECISION below you want
> VECTOR_INTEGER_TYPE_P here as well? The alternative
> would be to compare TYPE_SIZE.
>
> Some of the checks feel redundant but are probably good for
> documentation purposes.
>
> OK with using VECTOR_INTEGER_TYPE_P
Actually, the data typ
x86 doesn't support horizontal reduction instructions; reduc_op_scal_m
is emulated with vec_extract_half + op (half vector length).
Take that into account when calculating the cost for vectorization.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
No big performance impact on SPEC2017 as measu
vpbroadcastd/vpbroadcastq are available under TARGET_AVX2, but the
vec_dup{v4di,v8si} patterns are available under AVX with a memory operand.
And it will cause LRA/reload to generate a spill and reload if we put the
constant in a register.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
analyze_and_compute_bitop_with_inv_effect assumes the first operand is
loop invariant, which is not the case when it's an INTEGER_CST.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR tree-optimization/105735
PR tree-optimization/111972
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
Will test and backport to the GCC13/GCC12 release branches.
gcc/ChangeLog:
PR target/112443
* config/i386/sse.md (*avx2_pcmp3_4): Fix swap condition
from LT to GE since there's not in the pattern.
While working on PR112443, I noticed some misoptimizations: after we
fold _mm{,256}_blendv_epi8/pd/ps into gimple, the backend fails to combine it
back to v{,p}blendv{v,ps,pd} since the pattern is too complicated, so I think
maybe we should handle it at the gimple level.
The dump is like
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/i386-expand.cc
(ix86_expand_vector_init_duplicate): Handle V4HF/V4BF and
V2HF/V2BF.
(ix86_expand_vector_init_one_nonzero): Ditto.
(ix86_expand_vector
While working on PR112443, I noticed some misoptimizations:
after we fold _mm{,256}_blendv_epi8/pd/ps into gimple, the backend
fails to combine it back to v{,p}blendv{v,ps,pd} since the pattern is
too complicated, so I think maybe we should handle it at the gimple
level.
The dump is like
  if (TREE_CODE (init_expr) == INTEGER_CST)
    init_expr = fold_convert (TREE_TYPE (vectype), init_expr);
  else
    gcc_assert (tree_nop_conversion_p (TREE_TYPE (vectype),
                                       TREE_TYPE (init_expr)));
and init_expr is a 24-bit integer type while vectype has 32bi
The newly added splitter will generate
(insn 58 56 59 2 (set (reg:V4HI 20 xmm0 [129])
(vec_duplicate:V4HI (reg:HI 22 xmm2 [123]))) "testcase.c":16:21 -1
But we only have
(define_insn "*vec_dupv4hi"
[(set (match_operand:V4HI 0 "register_operand" "=y,Yw")
(vec_duplicate:V4HI
Update in V2:
1) Add some comments before the pattern.
2) Remove ? from view_convert.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
While working on PR112443, I noticed some misoptimizations:
after we fold _mm{,256}_blendv_epi8/pd/ps into gimple, the backend
fa
The x86 backend supports reduc_{and,ior,xor}_scal_m for vector integer
modes.
Ok for trunk?
gcc/testsuite/ChangeLog:
* lib/target-supports.exp (vect_logical_reduc): Add i?86-*-*
and x86_64-*-*.
---
gcc/testsuite/lib/target-supports.exp | 3 ++-
1 file changed, 2 insertions(+), 1 dele
The BB vectorizer relies on backend support for
.REDUC_{PLUS,IOR,XOR,AND} to vectorize reductions.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/112325
* config/i386/sse.md (reduc__scal_): New expander.
(REDUC_ANY_LO
The missing cbranchv*{hi,qi}4 may be needed by early break vectorization.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/sse.md (cbranch4): Extend to Vector
HI/QImode.
---
gcc/config/i386/sse.md | 10 --
1 file
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/112325
* config/i386/i386-expand.cc (emit_reduc_half): Handle
V8QImode.
* config/i386/mmx.md (reduc__scal_): New expander.
(reduc__scal_v4qi): Ditto.
gc
From: "Zhang, Annita"
Avoid_fma_chain was enabled in m_SAPPHIRERAPIDS, m_ALDERLAKE and
m_CORE_HYBRID. It can also be enabled in m_GENERIC to improve the
performance of -march=x86-64-v3/v4 with -mtune=generic set by
default. One SPEC2017 benchmark 510.parest_r can improve greatly due
to it. From t
For vec_construct, the components must be live at the same time if
they're not loaded from memory; when the number of those components
exceeds the available registers, spilling happens. Try to account for that
with a rough estimation.
??? Ideally, we should have an overall estimation of register pressure
if we
Currently sdot_prodv*qi is available under TARGET_AVXVNNIINT8, but it
can be emulated by
vec_unpacks_lo_v32qi
vec_unpacks_lo_v32qi
vec_unpacks_hi_v32qi
vec_unpacks_hi_v32qi
sdot_prodv16hi
sdot_prodv16hi
add3v8si
which is faster than the original
vect_patt_39.11_48 = WIDEN_MULT_LO_EXPR ;
v
The loop vectorizer will use vec_perm to select the lower part of a vector;
there could be some redundancy when using subreg in
reduc__scal_m, because rtl cse can't figure out that vec_select of the
lower part is just a subreg.
I'm trying to canonicalize vec_select to subreg like aarch64 did, but
there are so many regre
> But rtx_cost invokes targetm.rtx_cost which allows to avoid that
> recursive processing at any level. You're dealing with MEM [addr]
> here, so why's rtx_cost (addr, Pmode, MEM, 0, speed) not always
> the best way to deal with this? Since this is the MEM [addr] case
> we know it's not LEA, no?
These versions of the min/max patterns implement exactly the operations
min = (op1 < op2 ? op1 : op2)
max = (!(op1 < op2) ? op1 : op2)
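A scalar illustration of exactly those two operations (my own sketch):

  float min_op (float op1, float op2) { return op1 < op2 ? op1 : op2; }
  /* Note !(op1 < op2) is deliberately not the same as op1 > op2:
     they differ when the comparison is unordered (NaN input).  */
  float max_op (float op1, float op2) { return !(op1 < op2) ? op1 : op2; }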
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md (*minmax3_1): New pre_reload
define_insn_and_split.
(*minmax3_2): Ditto.
O2
-march=x86-64 -O2
-march=sapphirerapids -O2
Didn't observe obvious performance change, mostly same binaries.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Any comments?
liuhongt (7):
[x86] Add more splitters to match (unspec [op1 op2 (gt op3
constm1_operand)] UNSPEC_BLE
These define_insn_and_split are needed after vcond{,u,eq} is obsolete.
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md
(*_blendv_gt): New
define_insn_and_split.
(*_blendv_gtint):
Ditto.
(*_blendv_not_gtint):
Ditto.
(*_pb
Try to optimize x < 0 ? -1 : 0 into (signed) x >> 31
and x < 0 ? 1 : 0 into (unsigned) x >> 31.
Add define_insn_and_split for the optimization done in
ix86_expand_int_vcond.
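A scalar analogue of the two transforms for 32-bit elements (my own illustration):

  int f1 (int x) { return x < 0 ? -1 : 0; }   /* -> x >> 31 (arithmetic shift) */
  int f2 (int x) { return x < 0 ?  1 : 0; }   /* -> (unsigned) x >> 31 (logical shift) */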
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md ("*ashr3_1"): New
define_insn_and_split.
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md
(*_movmsk_lt_avx512): New
define_insn_and_split.
(*_movmsk_ext_lt_avx512):
Ditto.
(*_pmovmskb_lt_avx512): Ditto.
(*_pmovmskb_zext_lt_avx512): Ditto.
(*sse2_pmovmskb_ext_lt_a
gcc/ChangeLog
PR target/115517
* config/i386/sse.md
(*_cvtmask2_not): New pre_reload
splitter.
(*_cvtmask2_not): Ditto.
(*avx2_pcmp3_6): Ditto.
(*avx2_pcmp3_7): Ditto.
---
gcc/config/i386/sse.md | 97 ++
> Richard suggests that we implement the "obvious" transforms like
> inversion in the middle-end but if for example unsigned compares
> are not supported the us_minus + eq + negative trick isn't on
> that list.
>
> The main reason to restrict vec_cmp would be to avoid
> a <= b ? c : d going with an
gcc/ChangeLog:
PR target/115517
* config/i386/mmx.md (vcondv2sf): Removed.
(vcond): Ditto.
(vcond): Ditto.
(vcondu): Ditto.
(vcondu): Ditto.
* config/i386/sse.md (vcond): Ditto.
(vcond): Ditto.
(vcond): Ditto.
(vcond):
For the testcase in PR115406, here is part of the dump.
char D.4882;
vector(1) _1;
vector(1) signed char _2;
char _5;
:
_1 = { -1 };
When assigning { -1 } to vector(1) {signed-boolean:8},
since TYPE_PRECISION (itype) <= BITS_PER_UNIT, it sets each bit of the dest
with each vector el
The testcases are supposed to scan for vpopcnt{b,w,d,q} operations
with a k mask, but the mask is defined as an uninitialized local variable which
will be set to 0 at the RTL expand phase.
And it's further simplified away by late_combine, which caused the scan
assembly failure.
Move the definition of mask outside to
hen do the real operation.
After enabling flate_combine, they're combined into embedded broadcast
operations.
Tested with SPEC2017, flate_combine reduces codesize by ~0.6%, which means
there are lots of small improvements.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok
Move pass_stv2 and pass_rpad after the pre_reload pass_late_combine, and also
define target_insn_cost to prevent the post_reload pass_late_combine from
reverting the optimization done in pass_rpad.
Adjust testcases since pass_late_combine generates better code but
breaks scan assembly.
I.e.
under a 32-bit target, gcc
late_combine will combine lshift + zero into *lshifrtsi3_1_zext, which
causes an extra mov between GPR and kmask; add ?k to the pattern.
gcc/ChangeLog:
PR target/115610
* config/i386/i386.md (<*insnsi3_zext): Add alternative ?k,
enable it only for lshiftrt and under avx512bw.
From: "H.J. Lu"
According to Intel® 64 and IA-32 Architectures Optimization Reference
Manual[1], Branch Hint is updated for Redwood Cove.
cut from [1]-
Starting with the Redwood Cove microarchitecture, if the predictor has
no stored information about a branch, the
The patch avoids a SIGILL on non-AVX512 machines due to kmovd being
generated in the dynamic check.
Committed as an obvious fix.
gcc/testsuite/ChangeLog:
PR target/115748
* gcc.target/i386/avx512-check.h: Move runtime check into a
separate function and guard it with target ("no-
From: "H.J. Lu"
>The above reads like it would be worth splitting branch_prediction_hints
>into branch_prediction_hints_taken and branch_prediction_hints_not_taken
>given not-taken is the default and thus will just increase code size?
>According to Intel® 64 and IA-32 Architectures Optimization Ref
>> Hmm, now all avx512 tests SIGILL when testing with -m32:
>>
>> Dump of assembler code for function __get_cpuid_count:
>> => 0x08049500 <+0>: kmovd %eax,%k2
>> 0x08049504 <+4>: kmovd %edx,%k1
>> 0x08049508 <+8>: pushf
>> 0x08049509 <+9>: pushf
>> 0x0804950a <+10>:
I have a build failure on NetBSD as the namespace pollution avoidance causes
a direct hit with the system /usr/include/math.h
===
In file included from /usr/src/local/gcc/obj/gcc/include/emmintrin.h:31,
from
/usr
>- _5 = __atomic_fetch_or_8 (&set_work_pending_p, 1, 0);
>- # DEBUG old => (long int) _5
>+ _6 = .ATOMIC_BIT_TEST_AND_SET (&set_work_pending_p, 0, 1, 0,
>__atomic_fetch_or_8);
>+ # DEBUG old => NULL
> # DEBUG BEGIN_STMT
>- # DEBUG D#2 => _5 & 1
>+ # DEBUG D#2 => NULL
>...
>- _10 = ~_5;
>-
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/115843
* config/i386/predicates.md (const0_or_m1_operand): New
predicate.
* config/i386/sse.md (*_store_mask_1): New
pre_reload define_insn_and_split.
> Also, in case the insn is deleted, do:
>
> emit_note (NOTE_INSN_DELETED);
>
> DONE;
>
> instead of leaving (const_int 0) in the stream.
>
> So, the above insn preparation statements should read:
>
> --cut here--
> if (constm1_operand (operands[2], mode))
> emit_move_insn (operands[0], operands[
Since there is no corresponding instruction, the shift operation for
vector int8 is implemented using the instructions for vector int16,
but for some special shift counts, it can be transformed into vpcmpgtb.
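One such special count (an assumption on my part, the message doesn't spell it out): an arithmetic right shift of a signed char vector by 7 yields 0 or -1 per element, which is exactly what vpcmpgtb against a zero vector computes.

  typedef signed char v16qi __attribute__ ((vector_size (16)));

  v16qi
  sra7 (v16qi a)
  {
    return a >> 7;   /* equivalent to (0 > a) element-wise, i.e. vpcmpgtb */
  }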
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/Chang
pshufb is available under TARGET_SSSE3, so
ix86_expand_vec_perm_const_1 must return true when TARGET_SSSE3.
w/o TARGET_SSSE3, if we set one_operand_p to true, ix86_expand_vec_perm_const_1
could return false.
With the patch under -march=x86-64-v2
v8qi
foo (v8qi a)
{
return a >> 5;
}
< pm
For vec_pack_truncv8si/v4si w/o AVX512,
(const_vector:v4si (const_int 0x) x4) is used as a mask to clear the
upper 16 bits, but vpblendw with a zero vector can also be used, and the
zero vector is cheaper than (const_vector:v4si (const_int 0x) x4).
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m3
When the mask is ((1 << (prec - imm)) - 1), which is used to clear the upper
bits of A, the AND can be simplified to LSHIFTRT.
I.e. simplify
(and:v8hi
  (ashiftrt:v8hi A 8)
  (const_vector 0xff x8))
to
(lshiftrt:v8hi A 8)
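A scalar check of the same identity on one 16-bit element (my own illustration):

  unsigned short
  f (short x)
  {
    /* (x >> 8) & 0xff keeps exactly the bits that the logical shift
       (unsigned short) x >> 8 would produce.  */
    return (x >> 8) & 0xff;
  }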
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
For CONST_VECTOR_DUPLICATE_P in the constant pool, it is just a broadcast or
one of the variants in ix86_vector_duplicate_simode_const.
Adjust the cost to COSTS_N_INSNS (2) + speed, which should be a little
bit larger than a broadcast.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/Chang
According to the IEEE standard, for conversions from floating point to
integer: when a NaN or infinite operand cannot be represented in the
destination format and this cannot otherwise be indicated, the invalid
operation exception shall be signaled. When a numeric operand would
convert to an integer ou
>> Hard to find a default value satisfying all testcases.
>> some require loop unroll with 7 insns increment, some don't want loop
>> unroll w/ 5 insn increment.
>> The original 2/3 reduction happened to meet all those testcases(or the
>> testcases are constructed based on the old 2/3).
>> Can we d
Update in V3:
> Since this was about vectorization can you instead add a testcase to
> gcc.dg/vect/ and check for
> vectorization to happen?
Move to vect/pr112325.c.
>
> I believe the if (unr_insn <= 0) check can go as well.
Removed.
> as said, you want to do
>
> curolli = false;
>
> aft
Committed as an obvious patch.
gcc/testsuite/ChangeLog:
PR target/114148
* gcc.target/i386/pr106010-7b.c: Refine testcase.
---
gcc/testsuite/gcc.target/i386/pr106010-7b.c | 10 +-
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/gcc/testsuite/gcc.target/i386/
Update in V2:
Guard constant folding for overflow value in
fold_convert_const_int_from_real with flag_trapping_math.
Add -fno-trapping-math to related testcases which warn for overflow
in conversion from floating point to integer.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for tr
For MEM, rtx_cost iterates over each subrtx and adds up the costs,
so MEM (reg) costs 5 and MEM (reg + 4) costs 9,
which is not accurate for x86. Ideally
address_cost should be used, but it reduces the cost too much.
So the current solution is to make a constant disp as cheap as possible.
B
When I applied Roger's patch [1], there was an ICE due to it.
This patch fixes the latent bug.
[1] https://gcc.gnu.org/pipermail/gcc-patches/2024-May/651365.html
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Pushed to trunk.
gcc/ChangeLog:
* config/i386/sse.md
(___mask): Ali
> IMO, there is no need for CONST_INT_P condition, we should also allow
> symbol_ref, label_ref and const (all allowed by
> x86_64_immediate_operand predicate), these all decay to an immediate
> value.
Changed.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk.
For MEM, rtx_
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/sse.md (vcond_mask_): New expander.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr114125.c: New test.
---
gcc/config/i386/sse.md | 20
Committed as an obvious patch.
gcc/ChangeLog:
* config/i386/emmintrin.h (__double_u): Rename from double_u.
(_mm_load_sd): Replace double_u with __double_u.
(_mm_store_sd): Ditto.
(_mm_loadh_pd): Ditto.
(_mm_loadl_pd): Ditto.
* config/i386/xmmintrin
W/o TARGET_SSE4_1, it takes 3 instructions (pand, pandn and por) for
movdfcc/movsfcc, and could possibly fail the cost comparison. Increasing
branch cost could hurt performance for other modes, so specifically add
some preference for floating point ifcvt.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-