[PATCH 1/2] [x86] Support smin/smax for V2HF/V4HF

2023-10-07 Thread liuhongt
Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}. Ready to push to trunk. gcc/ChangeLog: * config/i386/mmx.md (VHF_32_64): New mode iterator. (3): New define_expand, merged from .. (v4hf3): .. this and (v2hf3): .. this. (movd_v2hf_to_sse_reg): Ne

[PATCH 2/2] Support signbit/xorsign/copysign/abs/neg/and/xor/ior/andn for V2HF/V4HF.

2023-10-07 Thread liuhongt
Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}. Ready to push to trunk. gcc/ChangeLog: * config/i386/i386.cc (ix86_build_const_vector): Handle V2HF and V4HFmode. (ix86_build_signbit_mask): Ditto. * config/i386/mmx.md (mmxintvecmode): Ditto. (2)

[PATCH] [x86] Refine predicate of operands[2] in divv4hf3 with register_operand.

2023-10-10 Thread liuhongt
In the expander, it will emit the insn below. rtx tmp = gen_rtx_VEC_CONCAT (V4SFmode, operands[2], force_reg (V2SFmode, CONST1_RTX (V2SFmode))); but *vec_concat only allows register_operand. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/Ch

[PATCH 2/2] Support 32/64-bit vectorization for conversion between _Float16 and integer/float.

2023-10-11 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: * config/i386/mmx.md (V2FI_32): New mode iterator. (movd_v2hf_to_sse): Rename to .. (movd__to_sse): .. this. (movd_v2hf_to_sse_reg): Rename to .. (movd__to_sse_reg)

[PATCH 1/2] Enable vectorization for V2HF/V4HF rounding operations and sqrt.

2023-10-11 Thread liuhongt
lrint/lround/lceil/lfloor are not vectorized due to a vectorization restriction: when the input element size differs from the output element size, vectorization relies on the old TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION instead of the modern standard pattern name. The patch only supports standard

[PATCH] Support 32/64-bit vectorization for _Float16 fma related operations.

2023-10-16 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: * config/i386/mmx.md (fma4): New expander. (fms4): Ditto. (fnma4): Ditto. (fnms4): Ditto. (vec_fmaddsubv4hf4): Ditto. (vec_fmsubaddv4hf4): Ditto. gcc/test

[PATCH] Avoid compile time hog on vect_peel_nonlinear_iv_init for nonlinear induction vec_step_op_mul when iteration count is too big.

2023-10-18 Thread liuhongt
There's a loop in vect_peel_nonlinear_iv_init to get i... Also give up vectorization when niters_skip is negative, which will be used for fully masked loops. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR tree-optimization/111820 PR tree-optimization/111833 * tree-vect-loop-manip.cc (vect
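A minimal C sketch (assumed, not the PR testcase) of a vec_step_op_mul nonlinear induction whose peeling init computation can hog compile time:

  unsigned
  f (unsigned *a, unsigned n)
  {
    unsigned x = 1;
    /* x is a nonlinear IV (x_next = x * 3); computing its value after
       skipping k iterations needs pow (3, k), which is where compile
       time blew up for huge iteration counts.  */
    for (unsigned i = 0; i < n; i++)
      {
        a[i] = x;
        x *= 3;
      }
    return x;
  }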

[PATCH] Avoid compile time hog on vect_peel_nonlinear_iv_init for nonlinear induction vec_step_op_mul when iteration count is too big.

2023-10-18 Thread liuhongt
>So the bugs were not fixed without this hunk? IIRC in the audit >trail we concluded the value is always positive ... (but of course >a large unsigned value can appear negative if you test it this way?) No, I added this in case there's a negative skip_niters in the future, as you mentioned in the PR,

[PATCH] Avoid compile time hog on vect_peel_nonlinear_iv_init for nonlinear induction vec_step_op_mul when iteration count is too big.

2023-10-19 Thread liuhongt
>So with pow being available this limit shouldn't be necessary any more and >the testcase adjustment can be avoided? I tried; compile time still hogs on mpz_powm(3, INT_MAX), so I'll just keep this. >and to avoid undefined behavior with too large shift just go the gmp >way unconditionally. Changed

[PATCH] [x86] Remove unused mmx_pinsrw.

2023-10-19 Thread liuhongt
When I was working on enabling more 32/64-bit vectorization for _Float16, I noticed one redundant define_expand; the patch removes the expander. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: * config/i386/mmx.md (mmx_pinsrw): Removed. --- gcc/co

[PATCH] Support vec_cmpmn/vcondmn for v2hf/v4hf.

2023-10-23 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/103861 * config/i386/i386-expand.cc (ix86_expand_sse_movcc): Handle V2HF/V2BF/V4HF/V4BFmode. * config/i386/mmx.md (vec_cmpv4hfqi): New expander. (vcondv4

[PATCH GCC13 backport] Avoid compile time hog on vect_peel_nonlinear_iv_init for nonlinear induction vec_step_op_mul when iteration count is too big.

2023-10-24 Thread liuhongt
This is the backport patch for the releases/gcc-13 branch; the original patch for main trunk is at [1]. The only difference between this backport and [1] is that GCC 13 doesn't support auto_mpz, so this patch manually uses mpz_init/mpz_clear. [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-October

[PATCH V2 1/2] Pass type of comparison operands instead of comparison result to truth_type_for in build_vec_cmp.

2023-10-25 Thread liuhongt
>I think it's indeed on purpose that the result of v1 < v2 is a signed >integer vector type. >But build_vec_cmp should not use the truth type for the result but instead the >truth type for the comparison, so change build_vec_cmp in both C and C++; also note that for the jit part, it already uses the type of comp
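A minimal GNU C sketch of the distinction: the comparison result is a signed integer vector, so truth_type_for must be queried with the operand type, not the result type:

  typedef float v4sf __attribute__ ((vector_size (16)));
  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  cmp (v4sf a, v4sf b)
  {
    /* The result of a vector comparison is a signed integer vector
       (-1 for true, 0 for false) with the operands' shape.  */
    return a < b;
  }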

[PATCH V2 2/2] Support vec_cmpmn/vcondmn for v2hf/v4hf.

2023-10-25 Thread liuhongt
>vcond and vcondeq shouldn't be necessary if there's >vcond_mask and vcmp support which is the "modern" >way of handling vcond. Unless the ISA really can do >compare and select with a single instruction. The V2 patch removes vcond/vcondu from the initial version [1], but there are many optimizations

[PATCH] Improve memcmpeq for 512-bit vector with vpcmpeq + kortest.

2023-10-26 Thread liuhongt
When the two vectors are equal, the kmask is all ones and kortest sets CF; otherwise CF is cleared. So the CF bit can be used to check the result of the comparison. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? Before: vmovdqu (%rsi), %ymm0 vpxorq (%rdi), %ymm
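A minimal C sketch (not the patch's testcase) of the equality-only memcmp this targets; with 512-bit vectors the compare can become vpcmpeq + kortest:

  #include <string.h>

  int
  same_block (const char *a, const char *b)
  {
    /* Only the ==/!= result is used, so the compiler may expand this
       as one vector compare plus a kmask test instead of byte-wise
       subtraction.  */
    return memcmp (a, b, 64) == 0;
  }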

[PATCH] Fix wrong code due to incorrect define_split

2023-10-30 Thread liuhongt
-(define_split
-  [(set (match_operand:V2HI 0 "register_operand")
-    (eq:V2HI
-      (eq:V2HI
-        (us_minus:V2HI
-          (match_operand:V2HI 1 "register_operand")
-          (match_operand:V2HI 2 "register_operand"))
-        (match_operand:V2HI 3 "const0_operand")

[PATCH] Handle bitop with INTEGER_CST in analyze_and_compute_bitop_with_inv_effect.

2023-10-30 Thread liuhongt
analyze_and_compute_bitop_with_inv_effect assumes the first operand is loop invariant, which is not the case when it's an INTEGER_CST. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR tree-optimization/105735 PR tree-optimization/111972
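A minimal C sketch (assumed, not the PR testcase) of a final-value computation this function handles; here the invariant operand of the bitop is an INTEGER_CST:

  int
  f (int n)
  {
    int res = -1;
    /* AND with an invariant is idempotent, so if the loop runs at
       least once the final value is simply -1 & 5; no loop needed.  */
    for (int i = 0; i < n; i++)
      res &= 5;
    return res;
  }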

[PATCH] Support cmul{_conj}v4hf3/cmla{_conj}v4hf4 with AVX512FP16 instruction.

2023-11-01 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: * config/i386/mmx.md (cmlav4hf4): New expander. (cmla_conjv4hf4): Ditto. (cmulv4hf3): Ditto. (cmul_conjv4hf3): Ditto. gcc/testsuite/ChangeLog: * gcc.target/i386/p

[PATCH] Avoid generating RTL code when d->testing_p.

2023-11-06 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/112393 * config/i386/i386-expand.cc (ix86_expand_vec_perm_vpermt2): Avoid generating RTL code when d->testing_p. gcc/testsuite/ChangeLog: * gcc.target/i386/pr1

[PATCH] sanitizer: [PR110027] Align asan_vec[0] to MAX (alignb, ASAN_RED_ZONE_SIZE)

2024-03-12 Thread liuhongt
If alignb > ASAN_RED_ZONE_SIZE and offset[0] is not a multiple of alignb, (base_align_bias - base_offset) may not be aligned to alignb, which caused a segmentation fault. Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. Ok for trunk and backport to GCC13? gcc/ChangeLog: PR sanitizer/110027
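A minimal C sketch (assumed shape, not the PR110027 testcase) of a frame that can trip this: a local whose alignment exceeds ASAN_RED_ZONE_SIZE (32 bytes); compile with -fsanitize=address:

  int
  main (void)
  {
    /* 64-byte alignment is larger than the 32-byte red-zone size, so
       the first stack partition must be aligned to the variable's
       alignment, not just to ASAN_RED_ZONE_SIZE.  */
    char buf[64] __attribute__ ((aligned (64)));
    __builtin_memset (buf, 0, sizeof buf);
    return buf[0];
  }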

[PATCH] i386[stv]: Handle REG_EH_REGION note

2024-03-13 Thread liuhongt
When we split (insn 37 36 38 10 (set (reg:DI 104 [ _18 ]) (mem:DI (reg/f:SI 98 [ CallNative_nclosure.0_1 ]) [6 MEM[(struct SQRefCounted *)CallNative_nclosure.0_1]._uiRef+0 S8 A32])) "test.C":22:42 84 {*movdi_internal} (expr_list:REG_EH_REGION (const_int -11 [0xfff5]) int

[PATCH] Add missing hf/bf patterns.

2024-03-17 Thread liuhongt
It fixes an ICE on an unrecognized logic-operation insn generated by the lroundmn2 expanders. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/114334 * config/i386/i386.md (mode): Add new number V8BF,V16BF,V32BF. (MOD

[PATCH] i386 [stv]: Handle REG_EH_REGION note [pr111822].

2024-03-18 Thread liuhongt
Commit r14-9459-g618e34d56cc38e only handles general_scalar_chain::convert_op. The patch also handles timode_scalar_chain::convert_op to avoid a potential similar bug. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk and backport to releases/gcc-13 branch? gcc/ChangeLog:

[PATCH] Document -fexcess-precision=16.

2024-03-18 Thread liuhongt
Ok for trunk? gcc/ChangeLog: * doc/invoke.texi: Document -fexcess-precision=16. --- gcc/doc/invoke.texi | 3 +++ 1 file changed, 3 insertions(+) diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 85c938d4a14..673420fdd3e 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.te

[PATCH V2] Document -fexcess-precision=16.

2024-03-19 Thread liuhongt
gcc/ChangeLog: * doc/invoke.texi: Document -fexcess-precision=16. --- gcc/doc/invoke.texi | 3 +++ 1 file changed, 3 insertions(+) diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 85c938d4a14..6bc1ebf9721 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -14930,6

[PATCH] Fix runtime error for nonlinear iv vectorization(step_mult).

2024-03-21 Thread liuhongt
wi::from_mpz doesn't take a sign argument; we want the value to wrap instead of saturate, so pass utype and true to it, which fixes the bug. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk and backport to gcc13? gcc/ChangeLog: PR tree-optimization/114396

[PATCH] Move pr114396.c from gcc.target/i386 to gcc.c-torture/execute.

2024-03-21 Thread liuhongt
Also fixed a typo in the testcase. Commit as an obvious fix. gcc/testsuite/ChangeLog: PR tree-optimization/114396 * gcc.target/i386/pr114396.c: Move to... * gcc.c-torture/execute/pr114396.c: ...here. --- .../{gcc.target/i386 => gcc.c-torture/execute}/pr114396.c | 6 +++

[PATCH V2] sanitizer: [PR110027] Align asan_vec[0] to MAX (BIGGEST_ALIGNMENT / BITS_PER_UNIT, ASAN_RED_ZONE_SIZE)

2024-03-25 Thread liuhongt
> > So, try to add some other variable with larger size and smaller alignment > > to the frame (and make sure it isn't optimized away). > > > > alignb above is the alignment of the first partition's var, if > > align_frame_offset really needs to depend on the var alignment, it probably > > should b

[PATCH wwwdoc] Hardware-assisted AddressSanitizer now works for x86_64 with LAM_U57

2024-02-08 Thread liuhongt
--- htdocs/gcc-14/changes.html | 5 + 1 file changed, 5 insertions(+) diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html index 6d917535..a022357a 100644 --- a/htdocs/gcc-14/changes.html +++ b/htdocs/gcc-14/changes.html @@ -499,6 +499,11 @@ a work-in-progress. -march=knm

[PATCH] Fix testcase for platform without gnu/stubs-x32.h

2024-02-18 Thread liuhongt
The target check maybe_x32 doesn't verify that the platform has gnu/stubs-x32.h, but that header is pulled in by stdint.h in the testcase. Adjust the testcase: remove stdint.h and use 'typedef long long int64_t' instead. Committed as an obvious patch. gcc/testsuite/ChangeLog: PR target/113711 * gcc.target/i386/apx-nd

[PATCH] Update documents for fcf-protection=

2024-01-09 Thread liuhongt
After r14-2692-g1c6231c05bdcca, the option is defined as an EnumSet and -fcf-protection=branch won't unset any other bits since they're in different groups. So to override -fcf-protection, an explicit -fcf-protection=none needs to be added first, followed by -fcf-protection=XXX. Bootstrapped and regtested

[PATCH] Document refactoring of the option -fcf-protection=x.

2024-01-09 Thread liuhongt
To override -fcf-protection, -fcf-protection=none needs to be added first, followed by -fcf-protection=xxx. --- htdocs/gcc-14/changes.html | 6 ++ 1 file changed, 6 insertions(+) diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html index e3a68998..72b0d291 100644 --- a/htdocs/gcc-1

[PATCH] Fix testcase failure on many platforms which don't support vect_int_max.

2024-01-18 Thread liuhongt
After r14-7124-g6686e16fda4190, the testcase can be optimized to MAX_EXPR if the backends support that. So I adjusted the testcase to scan for MAX_EXPR, but it failed on many platforms which don't support that. As pinski mentioned, the target check vect_no_int_min_max is only available under the vect directory, so fo

[PATCH] Adjust testcase gcc.target/i386/part-vect-copysignhf.c.

2024-01-18 Thread liuhongt
After vect_early_break is supported, more vectorization is enabled (3 COPYSIGNs), so adjust the testcase for that. Committed as an obvious fix. gcc/testsuite/ChangeLog: * gcc.target/i386/part-vect-copysignhf.c: Remove -ftree-vectorize from dg-options. --- gcc/testsuite/gcc.target/i386/part-

[PATCH 2/2] [x86] Enable -mlam=u57 by default when compiled with -fsanitize=hwaddress.

2024-01-22 Thread liuhongt
Ready to push to trunk. gcc/ChangeLog: * config/i386/i386-options.cc (ix86_option_override_internal): Enable -mlam=u57 by default when compiled with -fsanitize=hwaddress. --- gcc/config/i386/i386-options.cc | 9 + 1 file changed, 9 insertions(+) diff --git a/gcc/con

[PATCH 1/2] Adjust hwasan testcase for x86 target.

2024-01-22 Thread liuhongt
There are two cases: 1. hwasan-poison-optimisation.c is supposed to scan for a call to __hwasan_tag_mismatch4, and x86 has a different mnemonic (call) from aarch64 (bl), so adjust the testcase to scan for either call or bl. 2. alloca-outside-caught.c/vararray-outside-caught.c are supposed to scan mismatched tags and

[PATCH] Optimize A < B ? A : B to MIN_EXPR.

2023-12-18 Thread liuhongt
Similarly for A < B ? B : A to MAX_EXPR. There's code in the frontend to optimize such patterns, but it fails to handle the testcase in the PR since the pattern is only exposed at the gimple level when folding backend builtins. pr95906 can now be optimized to MAX_EXPR, as noted in the testcase comment. // FIXME: this shoul
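A minimal C sketch of the pattern being folded:

  int
  min (int a, int b)
  {
    /* Recognized as MIN_EXPR at the gimple level instead of a branch;
       A < B ? B : A is MAX_EXPR likewise.  */
    return a < b ? a : b;
  }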

[PATCH] Optimize A < B ? A : B to MIN_EXPR.

2024-01-09 Thread liuhongt
> I wonder if you can amend the existing patterns instead by iterating > over cond/vec_cond.  There are quite some (look for uses of > minmax_from_comparison) that could be adapted to vectors. > > The ones matching the simple form you match are > > #if GIMPLE > /* A >= B ? A : B -> max (A, B) and f

[PATCH] Take register pressure into account for vec_construct/scalar_to_vec when the components are not loaded from memory.

2023-11-30 Thread liuhongt
> Hmm, I would suggest you put reg_needed into the class and accumulate > over all vec_construct, with your patch you pessimize a single v32qi > over two separate v16qi for example. Also currently the whole block is > gated with INTEGRAL_TYPE_P but register pressure would be also > a concern for f

[PATCH] Don't vectorize when vector stmts are only vec_construct and stores

2023-12-03 Thread liuhongt
I.e. for cases like a[0] = b1; a[1] = b2; .. a[n] = bn; there are extra dependences when constructing the vector, but not for scalar stores. According to experiments, it's generally worse. The patch adds a cut-off heuristic when the vec_stmts are just vec_construct and vector store. It imp
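A minimal C sketch (assumed, not from the patch) of the store pattern in question, where SLP only produces a vec_construct plus one vector store:

  void
  store4 (int *a, int b1, int b2, int b3, int b4)
  {
    /* Vectorizing replaces four cheap scalar stores with a
       vec_construct that keeps all four values live at once plus a
       single vector store, which is generally not a win.  */
    a[0] = b1;
    a[1] = b2;
    a[2] = b3;
    a[3] = b4;
  }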

[PATCH] Support udot_prodv*qi with emulation sdot_prodv*hi

2023-12-03 Thread liuhongt
Like r14-5990-gb4a7c1c8c59d19, but this patch optimizes udot_prod, since (zero_extend) (unsigned char) -> int is equal to (zero_extend) (unsigned char) -> short + (sign_extend) (short) -> int. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. It should be safe to emu
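A minimal C sketch (assumed, not the patch's testcase) of a dot-product loop udot_prod vectorizes; the identity above lets the qi variant be emulated via the hi dot-product pattern:

  int
  udot (const unsigned char *a, const unsigned char *b, int n)
  {
    int sum = 0;
    /* Zero-extending qi to si equals zero-extending qi to hi and then
       sign-extending hi to si, so unpacks plus sdot_prodv*hi give the
       same result.  */
    for (int i = 0; i < n; i++)
      sum += a[i] * b[i];
    return sum;
  }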

[PATCH] Don't assume it's AVX_U128_CLEAN after call_insn whose abi.mode_clobber(V4DImode) doesn't contain all SSE_REGS.

2023-12-07 Thread liuhongt
If the function doesn't clobber any SSE registers, or only clobbers the 128-bit parts, then vzeroupper isn't issued before the function exits; the status is not CLEAN but ANY after the function. Also for a sibling_call it's safe to issue a vzeroupper. Also there could be a missing vzeroupper since there's no mo

[PATCH] [ICE] Support vpcmov for V4HF/V4BF/V2HF/V2BF under TARGET_XOP.

2023-12-07 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/112904 * config/i386/mmx.md (*xop_pcmov_): New define_insn. gcc/testsuite/ChangeLog: * g++.target/i386/pr112904.C: New test. --- gcc/config/i386/mmx.md

[v3 PATCH] Simplify vector ((VCE (a cmp b ? -1 : 0)) < 0) ? c : d to just (VCE ((a cmp b) ? (VCE c) : (VCE d))).

2023-12-10 Thread liuhongt
> since you are looking at TYPE_PRECISION below you want > VECTOR_INTEGER_TYPE_P here as well? The alternative > would be to compare TYPE_SIZE. > > Some of the checks feel redundant but are probably good for > documentation purposes. > > OK with using VECTOR_INTEGER_TYPE_P Actually, the data typ

[PATCH] Adjust vectorized cost for reduction.

2023-12-11 Thread liuhongt
x86 doesn't support horizontal reduction instructions; reduc_op_scal_m is emulated with vec_extract_half + op (at half vector length). Take that into account when calculating the cost of vectorization. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. No big performance impact on SPEC2017 as measu

[PATCH] Force broadcast constant to mem for vec_dup{v4di, v8si, v4df, v8df} when TARGET_AVX2 is not available.

2023-12-12 Thread liuhongt
vpbroadcastd/vpbroadcastq are available under TARGET_AVX2, but the vec_dup{v4di,v8si} patterns are available under AVX with a memory operand, so LRA/reload will generate a spill and reload if we put the constant in a register. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk

[V2 PATCH] Handle bitop with INTEGER_CST in analyze_and_compute_bitop_with_inv_effect.

2023-11-06 Thread liuhongt
analyze_and_compute_bitop_with_inv_effect assumes the first operand is loop invariant, which is not the case when it's an INTEGER_CST. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR tree-optimization/105735 PR tree-optimization/111972

[PATCH] Fix wrong code due to vec_merge + pcmp to blendvb splitter.

2023-11-09 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. Will test and backport to the GCC13/GCC12 release branches. gcc/ChangeLog: PR target/112443 * config/i386/sse.md (*avx2_pcmp3_4): Fix swap condition from LT to GE since there's not in the pattern.

[PATCH] Simplify vector ((VCE?(a cmp b ? -1 : 0)) < 0) ? c : d to just (VCE:a cmp VCE:b) ? c : d.

2023-11-09 Thread liuhongt
While working on PR112443, I noticed some misoptimizations: after we fold _mm{,256}_blendv_epi8/pd/ps into gimple, the backend fails to combine it back to v{,p}blendv{v,ps,pd} since the pattern is too complicated, so I think maybe we should handle it at the gimple level. The dump is like
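A minimal C sketch (assumed, compile with -mavx2) of the intrinsic sequence whose gimple fold produces the pattern being simplified; the comparison already yields all-ones/zero lanes, so the sign-bit test in blendv is redundant:

  #include <immintrin.h>

  __m256i
  select_gt (__m256i a, __m256i b, __m256i c, __m256i d)
  {
    /* blendv picks c where the mask's sign bit is set; since the mask
       comes from a comparison, this is just (a > b) ? c : d.  */
    __m256i mask = _mm256_cmpgt_epi32 (a, b);
    return _mm256_blendv_epi8 (d, c, mask);
  }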

[PATCH] Support vec_set/vec_extract/vec_init for V4HF/V2HF.

2023-11-09 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: * config/i386/i386-expand.cc (ix86_expand_vector_init_duplicate): Handle V4HF/V4BF and V2HF/V2BF. (ix86_expand_vector_init_one_nonzero): Ditto. (ix86_expand_vector

[PATCH] Simplify vector ((VCE?(a cmp b ? -1 : 0)) < 0) ? c : d to just VCE:((a cmp b) ? (VCE c) : (VCE d)).

2023-11-09 Thread liuhongt
While working on PR112443, I noticed some misoptimizations: after we fold _mm{,256}_blendv_epi8/pd/ps into gimple, the backend fails to combine it back to v{,p}blendv{v,ps,pd} since the pattern is too complicated, so I think maybe we should handle it at the gimple level. The dump is like

[PATCH] Fix ICE in vectorizable_nonlinear_induction with bitfield.

2023-11-13 Thread liuhongt
  if (TREE_CODE (init_expr) == INTEGER_CST)
    init_expr = fold_convert (TREE_TYPE (vectype), init_expr);
  else
    gcc_assert (tree_nop_conversion_p (TREE_TYPE (vectype),
                                       TREE_TYPE (init_expr)));

and init_expr is a 24-bit integer type while vectype has 32bi

[PATCH] Fix ICE of unrecognizable insn.

2023-11-15 Thread liuhongt
The newly added splitter will generate

  (insn 58 56 59 2 (set (reg:V4HI 20 xmm0 [129])
          (vec_duplicate:V4HI (reg:HI 22 xmm2 [123]))) "testcase.c":16:21 -1

but we only have

  (define_insn "*vec_dupv4hi"
    [(set (match_operand:V4HI 0 "register_operand" "=y,Yw")
          (vec_duplicate:V4HI

[V2 PATCH] Simplify vector ((VCE (a cmp b ? -1 : 0)) < 0) ? c : d to just (VCE ((a cmp b) ? (VCE c) : (VCE d))).

2023-11-16 Thread liuhongt
Update in V2: 1) Add some comments before the pattern. 2) Remove ? from view_convert. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? While working on PR112443, I noticed some misoptimizations: after we fold _mm{,256}_blendv_epi8/pd/ps into gimple, the backend fa

[PATCH 2/2] Add i?86-*-* and x86_64-*-* to vect_logical_reduc

2023-11-16 Thread liuhongt
The x86 backend supports reduc_{and,ior,xor}_scal_m for vector integer modes. Ok for trunk? gcc/testsuite/ChangeLog: * lib/target-supports.exp (vect_logical_reduc): Add i?86-*-* and x86_64-*-*. --- gcc/testsuite/lib/target-supports.exp | 3 ++- 1 file changed, 2 insertions(+), 1 dele

[PATCH 1/2] Support reduc_{plus, xor, and, ior}_scal_m for vector integer mode.

2023-11-16 Thread liuhongt
The BB vectorizer relies on backend support for .REDUC_{PLUS,IOR,XOR,AND} to vectorize reductions. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/112325 * config/i386/sse.md (reduc__scal_): New expander. (REDUC_ANY_LO
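A minimal C sketch (assumed, not the patch's testcase) of a reduction the vectorizer can map to .REDUC_IOR once the expander exists:

  int
  reduc_ior (const int *a)
  {
    int r = 0;
    /* With a reduc_ior_scal pattern this becomes a vector load plus
       one horizontal OR reduction instead of a scalar chain.  */
    for (int i = 0; i < 8; i++)
      r |= a[i];
    return r;
  }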

[PATCH] Support cbranchm for Vector HI/QImode.

2023-11-16 Thread liuhongt
The missing cbranchv*{hi,qi}4 patterns may be needed by early-break vectorization. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: * config/i386/sse.md (cbranch4): Extend to Vector HI/QImode. --- gcc/config/i386/sse.md | 10 -- 1 file

[PATCH] [x86] Support reduc_{and, ior, xor}_scal_m for V4HI/V8QI/V4QImode

2023-11-19 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/112325 * config/i386/i386-expand.cc (emit_reduc_half): Handle V8QImode. * config/i386/mmx.md (reduc__scal_): New expander. (reduc__scal_v4qi): Ditto. gc

[PATCH] Set AVOID_256FMA_CHAINS TO m_GENERIC as it's generally good to new platforms

2023-11-21 Thread liuhongt
From: "Zhang, Annita" Avoid_fma_chain was enabled in m_SAPPHIRERAPIDS, m_ALDERLAKE and m_CORE_HYBRID. It can also be enabled in m_GENERIC to improve the performance of -march=x86-64-v3/v4 with -mtune=generic set by default. One SPEC2017 benchmark 510.parest_r can improve greatly due to it. From t

[PATCH] Take register pressure into account for vec_construct when the components are not loaded from memory.

2023-11-27 Thread liuhongt
For vec_construct, the components must all be live at the same time if they're not loaded from memory; when the number of those components exceeds the available registers, spills happen. Try to account for that with a rough estimate. ??? Ideally, we should have an overall estimation of register pressure if we

[PATCH] [x86] Support sdot_prodv*qi with emulation of sdot_prodv*hi.

2023-11-28 Thread liuhongt
Currently sdot_prodv*qi is available under TARGET_AVXVNNIINT8, but it can be emulated by vec_unpacks_lo_v32qi vec_unpacks_lo_v32qi vec_unpacks_hi_v32qi vec_unpacks_hi_v32qi sdot_prodv16hi sdot_prodv16hi add3v8si, which is faster than the original vect_patt_39.11_48 = WIDEN_MULT_LO_EXPR ; v

[PATCH] Use vec_extract_lo instead of subreg in reduc__scal_m.

2023-11-29 Thread liuhongt
The loop vectorizer will use vec_perm to select the lower part of a vector; there could be some redundancy when using subreg in reduc__scal_m, because RTL CSE can't figure out that a vec_select of the lower part is just a subreg. I'm trying to canonicalize vec_select to subreg like aarch64 did, but there are so many regre

[PATCH V2] Fix wrong cost of MEM when addr is a lea.

2024-06-26 Thread liuhongt
> But rtx_cost invokes targetm.rtx_cost which allows to avoid that > recursive processing at any level. You're dealing with MEM [addr] > here, so why's rtx_cost (addr, Pmode, MEM, 0, speed) not always > the best way to deal with this? Since this is the MEM [addr] case > we know it's not LEA, no?

[PATCH 3/7] [x86] Match IEEE min/max with UNSPEC_IEEE_{MIN,MAX}.

2024-06-27 Thread liuhongt
These versions of the min/max patterns implement exactly the operations min = (op1 < op2 ? op1 : op2) and max = (!(op1 < op2) ? op1 : op2). gcc/ChangeLog: PR target/115517 * config/i386/sse.md (*minmax3_1): New pre_reload define_insn_and_split. (*minmax3_2): Ditto.
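A minimal C sketch of those exact semantics; note the NaN asymmetry, which matches minps/maxps rather than fmin/fmax:

  float
  ieee_min (float op1, float op2)
  {
    /* Exactly op1 < op2 ? op1 : op2: when op1 is NaN the comparison
       is false and op2 is returned.  */
    return op1 < op2 ? op1 : op2;
  }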

[PATCH 0/7][x86] Remove vcond{,u,eq} expanders.

2024-06-27 Thread liuhongt
-O2 -march=x86-64 / -O2 -march=sapphirerapids / -O2: didn't observe obvious performance change, mostly the same binaries. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Any comments? liuhongt (7): [x86] Add more splitters to match (unspec [op1 op2 (gt op3 constm1_operand)] UNSPEC_BLE

[PATCH 1/7] [x86] Add more splitters to match (unspec [op1 op2 (gt op3 constm1_operand)] UNSPEC_BLENDV)

2024-06-27 Thread liuhongt
These define_insn_and_splits are needed after vcond{,u,eq} become obsolete. gcc/ChangeLog: PR target/115517 * config/i386/sse.md (*_blendv_gt): New define_insn_and_split. (*_blendv_gtint): Ditto. (*_blendv_not_gtint): Ditto. (*_pb

[PATCH 6/7] [x86] Optimize a < 0 ? -1 : 0 to (signed)a >> 31.

2024-06-27 Thread liuhongt
Try to optimize x < 0 ? -1 : 0 into (signed) x >> 31 and x < 0 ? 1 : 0 into (unsigned) x >> 31. Add define_insn_and_splits for the optimization previously done in ix86_expand_int_vcond. gcc/ChangeLog: PR target/115517 * config/i386/sse.md ("*ashr3_1"): New define_insn_and_split.
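A minimal C sketch of both transforms on 32-bit elements:

  int
  sign_mask (int x)
  {
    /* x < 0 ? -1 : 0 is an arithmetic shift of the sign bit.  */
    return x >> 31;
  }

  unsigned
  sign_bit (int x)
  {
    /* x < 0 ? 1 : 0 is a logical shift of the sign bit.  */
    return (unsigned) x >> 31;
  }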

[PATCH 4/7] Add more splitter for mskmov with avx512 comparison.

2024-06-27 Thread liuhongt
gcc/ChangeLog: PR target/115517 * config/i386/sse.md (*_movmsk_lt_avx512): New define_insn_and_split. (*_movmsk_ext_lt_avx512): Ditto. (*_pmovmskb_lt_avx512): Ditto. (*_pmovmskb_zext_lt_avx512): Ditto. (*sse2_pmovmskb_ext_lt_a

[PATCH 2/7] Lower AVX512 kmask comparison back to AVX2 comparison when op_{true, false} is vector -1/0.

2024-06-27 Thread liuhongt
gcc/ChangeLog: PR target/115517 * config/i386/sse.md (*_cvtmask2_not): New pre_reload splitter. (*_cvtmask2_not): Ditto. (*avx2_pcmp3_6): Ditto. (*avx2_pcmp3_7): Ditto. --- gcc/config/i386/sse.md | 97 ++

[PATCH 5/7] Adjust testcase for the regressed testcases after obsolete of vcond{, u, eq}.

2024-06-27 Thread liuhongt
> Richard suggests that we implement the "obvious" transforms like > inversion in the middle-end but if for example unsigned compares > are not supported the us_minus + eq + negative trick isn't on > that list. > > The main reason to restrict vec_cmp would be to avoid > a <= b ? c : d going with an

[PATCH 7/7] Remove vcond{, u, eq} expanders since they will be obsolete.

2024-06-27 Thread liuhongt
gcc/ChangeLog: PR target/115517 * config/i386/mmx.md (vcondv2sf): Removed. (vcond): Ditto. (vcond): Ditto. (vcondu): Ditto. (vcondu): Ditto. * config/i386/sse.md (vcond): Ditto. (vcond): Ditto. (vcond): Ditto. (vcond):

[PATCH] Fix native_encode_vector_part for itype when TYPE_PRECISION (itype) == BITS_PER_UNIT

2024-06-27 Thread liuhongt
For the testcase in PR115406, here is part of the dump. char D.4882; vector(1) _1; vector(1) signed char _2; char _5; : _1 = { -1 }; When assigning { -1 } to a vector(1) {signed-boolean:8}, since TYPE_PRECISION (itype) <= BITS_PER_UNIT, it sets each bit of the dest with each vector el

[PATCH 1/3] [avx512 testsuite] Define mask as extern instead of uninitialized local variables.

2024-06-27 Thread liuhongt
The testcases are supposed to scan for vpopcnt{b,w,d,q} operations with a k mask, but the mask is defined as an uninitialized local variable, which is set to 0 at the RTL expand phase and then further simplified away by late_combine, causing the scan-assembly failure. Move the definition of mask outside to

[PATCH 0/3][x86] Enable pass_late_combine for x86.

2024-06-27 Thread liuhongt
...then do the real operation. After enabling late_combine, they're combined into embedded broadcast operations. Tested with SPEC2017, late_combine reduces code size by ~0.6%, which means there are lots of small improvements. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok

[PATCH 3/3] [x86] Enable late-combine.

2024-06-27 Thread liuhongt
Move pass_stv2 and pass_rpad after the pre-reload pass_late_combine; also define target_insn_cost to prevent the post-reload pass_late_combine from reverting the optimization done in pass_rpad. Adjust testcases since pass_late_combine generates better code but breaks scan-assembly. I.e. under a 32-bit target, gcc

[PATCH 2/3] Extend lshifrtsi3_1_zext to ?k alternative.

2024-06-27 Thread liuhongt
late_combine will combine lshift + zero into *lshifrtsi3_1_zext, which causes an extra mov between gpr and kmask; add ?k to the pattern. gcc/ChangeLog: PR target/115610 * config/i386/i386.md (<*insnsi3_zext): Add alternative ?k, enable it only for lshiftrt and under avx512bw.

[PATCH] x86: Update branch hint for Redwood Cove.

2024-07-01 Thread liuhongt
From: "H.J. Lu" According to Intel® 64 and IA-32 Architectures Optimization Reference Manual[1], Branch Hint is updated for Redwood Cove. cut from [1]- Starting with the Redwood Cove microarchitecture, if the predictor has no stored information about a branch, the

[PATCH][committed] Move runtime check into a separate function and guard it with target ("no-avx")

2024-07-03 Thread liuhongt
The patch avoids a SIGILL on non-AVX512 machines caused by kmovd being generated in the dynamic check. Committed as an obvious fix. gcc/testsuite/ChangeLog: PR target/115748 * gcc.target/i386/avx512-check.h: Move runtime check into a separate function and guard it with target ("no-

[PATCH V2] x86: Update branch hint for Redwood Cove.

2024-07-03 Thread liuhongt
From: "H.J. Lu" >The above reads like it would be worth splitting branc_prediction_hits >into branch_prediction_hints_taken and branch_prediction_hints_not_taken >given not-taken is the default and thus will just increase code size? >According to Intel® 64 and IA-32 Architectures Optimization Ref

[PATCH] [committed] Use __builtin_cpu_support instead of __get_cpuid_count.

2024-07-03 Thread liuhongt
>> Hmm, now all avx512 tests SIGILL when testing with -m32:
>>
>> Dump of assembler code for function __get_cpuid_count:
>> => 0x08049500 <+0>:  kmovd  %eax,%k2
>>    0x08049504 <+4>:  kmovd  %edx,%k1
>>    0x08049508 <+8>:  pushf
>>    0x08049509 <+9>:  pushf
>>    0x0804950a <+10>:

[PATCH] Rename __{float, double}_u to __x86_{float, double}_u to avoid polluting the namespace.

2024-07-07 Thread liuhongt
I have a build failure on NetBSD, as the namespace pollution avoidance causes a direct hit with the system /usr/include/math.h: In file included from /usr/src/local/gcc/obj/gcc/include/emmintrin.h:31, from /usr

[PATCH] Fix SSA_NAME leak due to def_stmt is removed before use_stmt.

2024-07-11 Thread liuhongt
>- _5 = __atomic_fetch_or_8 (&set_work_pending_p, 1, 0);
>- # DEBUG old => (long int) _5
>+ _6 = .ATOMIC_BIT_TEST_AND_SET (&set_work_pending_p, 0, 1, 0, __atomic_fetch_or_8);
>+ # DEBUG old => NULL
>  # DEBUG BEGIN_STMT
>- # DEBUG D#2 => _5 & 1
>+ # DEBUG D#2 => NULL
>...
>- _10 = ~_5;
>-

[PATCH] [x86][avx512] Optimize maskstore when mask is 0 or -1 in UNSPEC_MASKMOV

2024-07-16 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/115843 * config/i386/predicates.md (const0_or_m1_operand): New predicate. * config/i386/sse.md (*_store_mask_1): New pre_reload define_insn_and_split.

[PATCH v2] [x86][avx512] Optimize maskstore when mask is 0 or -1 in UNSPEC_MASKMOV

2024-07-17 Thread liuhongt
> Also, in case the insn is deleted, do: > > emit_note (NOTE_INSN_DELETED); > > DONE; > > instead of leaving (const_int 0) in the stream. > > So, the above insn preparation statements should read: > > --cut here-- > if (constm1_operand (operands[2], mode)) > emit_move_insn (operands[0], operands[
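A minimal C sketch (assumed, not the patch's testcase) of a maskstore whose mask folds to all-ones, letting the masked store become a plain store; compile with -mavx2:

  #include <immintrin.h>

  void
  store_all (int *p, __m256i v)
  {
    /* A constant all-ones mask means every element is stored, so the
       UNSPEC_MASKMOV store can be rewritten as a normal vector move;
       a constant zero mask deletes the store entirely.  */
    _mm256_maskstore_epi32 (p, _mm256_set1_epi32 (-1), v);
  }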

[PATCH] [x86] Optimize ashift >> 7 to vpcmpgtb for vector int8.

2024-05-14 Thread liuhongt
Since there is no corresponding instruction, the shift operation for vector int8 is implemented using the instructions for vector int16, but for some special shift counts it can be transformed into vpcmpgtb. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/Chang
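A minimal GNU C sketch of the special case: an arithmetic right shift by 7 yields -1 or 0 per byte, exactly what vpcmpgtb against zero computes:

  typedef signed char v16qi __attribute__ ((vector_size (16)));

  v16qi
  sar7 (v16qi a)
  {
    /* a >> 7 broadcasts each byte's sign bit: -1 for negative bytes,
       0 otherwise, i.e. the result of 0 > a per element.  */
    return a >> 7;
  }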

[PATCH] [x86] Set d.one_operand_p to true when TARGET_SSSE3 in ix86_expand_vecop_qihi_partial.

2024-05-15 Thread liuhongt
pshufb is available under TARGET_SSSE3, so ix86_expand_vec_perm_const_1 must return true when TARGET_SSSE3 is set. Without TARGET_SSSE3, if we set one_operand_p to true, ix86_expand_vec_perm_const_1 could return false. With the patch, under -march=x86-64-v2: v8qi foo (v8qi a) { return a >> 5; } < pm

[PATCH] Use pblendw instead of pand to clear upper 16 bits.

2024-05-16 Thread liuhongt
For vec_pack_truncv8si/v4si without AVX512, (const_vector:v4si (const_int 0xffff) x4) is used as a mask to clear the upper 16 bits, but vpblendw with a zero vector can also be used, and the zero vector is cheaper than (const_vector:v4si (const_int 0xffff) x4). Bootstrapped and regtested on x86_64-pc-linux-gnu{-m3

[PATCH 1/2] Simplify (AND (ASHIFTRT A imm) mask) to (LSHIFTRT A imm) for vector mode.

2024-05-20 Thread liuhongt
When mask is (1 << (prec - imm)) - 1, which is used to clear the upper bits of A, it can be simplified to LSHIFTRT. I.e. simplify (and:v8hi (ashiftrt:v8hi A 8) (const_vector 0xff x8)) to (lshiftrt:v8hi A 8). Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog:
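A minimal GNU C sketch of the equivalence: the AND masks away exactly the bits the arithmetic shift smeared in, so a logical shift gives the same result:

  typedef short v8hi __attribute__ ((vector_size (16)));

  v8hi
  high_byte (v8hi a)
  {
    /* (and (ashiftrt a 8) 0xff) == (lshiftrt a 8): the 0xff mask
       clears exactly the sign-extension bits.  */
    return (a >> 8) & 0xff;
  }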

[PATCH 2/2] [x86] Adjust rtx_cost for MEM to enable more simplication

2024-05-20 Thread liuhongt
For CONST_VECTOR_DUPLICATE_P in the constant pool, it is just a broadcast or variants in ix86_vector_duplicate_simode_const. Adjust the cost to COSTS_N_INSNS (2) + speed, which should be a little bit larger than a broadcast. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/Chang

[PATCH] Don't simplify NAN/INF or out-of-range constant for FIX/UNSIGNED_FIX.

2024-05-21 Thread liuhongt
According to the IEEE standard, for conversions from floating point to integer: when a NaN or infinite operand cannot be represented in the destination format and this cannot otherwise be indicated, the invalid operation exception shall be signaled. When a numeric operand would convert to an integer ou
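A minimal C sketch of a conversion that must not be constant-folded when trapping math is honored:

  int
  to_int (double d)
  {
    /* For d = 1e30, infinity, or NaN the result can't be represented;
       IEEE says the invalid operation exception is signaled, so
       folding the conversion at compile time would lose the trap
       (hence the guard, and -fno-trapping-math in the testcases).  */
    return (int) d;
  }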

[V2 PATCH] Don't reduce estimated unrolled size for innermost loop at cunrolli.

2024-05-21 Thread liuhongt
>> Hard to find a default value satisfying all testcases. >> some require loop unroll with 7 insns increment, some don't want loop >> unroll w/ 5 insn increment. >> The original 2/3 reduction happened to meet all those testcases(or the >> testcases are constructed based on the old 2/3). >> Can we d

[V3 PATCH] Don't reduce estimated unrolled size for innermost loop.

2024-05-24 Thread liuhongt
Update in V3: > Since this was about vectorization can you instead add a testcase to > gcc.dg/vect/ and check for > vectorization to happen? Move to vect/pr112325.c. > > I believe the if (unr_insn <= 0) check can go as well. Removed. > as said, you want to do > > curolli = false; > > aft

[PATCH] Fix typo in the testcase.

2024-05-24 Thread liuhongt
Committed as an obvious patch. gcc/testsuite/ChangeLog: PR target/114148 * gcc.target/i386/pr106010-7b.c: Refine testcase. --- gcc/testsuite/gcc.target/i386/pr106010-7b.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/gcc/testsuite/gcc.target/i386/

[PATCH] Don't simplify NAN/INF or out-of-range constant for FIX/UNSIGNED_FIX.

2024-05-26 Thread liuhongt
Update in V2: Guard constant folding for overflowed values in fold_convert_const_int_from_real with flag_trapping_math. Add -fno-trapping-math to the related testcases that warn for overflow in the conversion from floating point to integer. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for tr

[PATCH] Reduce cost of MEM (A + imm).

2024-05-27 Thread liuhongt
For MEM, rtx_cost iterates over each subrtx and adds up the costs; so for MEM (reg) and MEM (reg + 4), the former costs 5 and the latter 9, which is not accurate for x86. Ideally address_cost should be used, but it reduces the cost too much. So the current solution is to make a constant displacement as cheap as possible. B

[PATCH][committed] [avx512] Fix predicate mismatch between vfcmaddcph's define_insn and define_expand.

2024-05-27 Thread liuhongt
When I applied Roger's patch [1], there was an ICE due to it. The patch fixes the latent bug. [1] https://gcc.gnu.org/pipermail/gcc-patches/2024-May/651365.html Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Pushed to trunk. gcc/ChangeLog: * config/i386/sse.md (___mask): Ali

[PATCH V2] Reduce cost of MEM (A + imm).

2024-05-28 Thread liuhongt
> IMO, there is no need for CONST_INT_P condition, we should also allow > symbol_ref, label_ref and const (all allowed by > x86_64_immediate_operand predicate), these all decay to an immediate > value. Changed. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk. For MEM, rtx_

[PATCH] [x86] Support vcond_mask_qiqi and friends.

2024-05-28 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: * config/i386/sse.md (vcond_mask_): New expander. gcc/testsuite/ChangeLog: * gcc.target/i386/pr114125.c: New test. --- gcc/config/i386/sse.md | 20

[committed] [x86] Rename double_u to __double_u to avoid polluting the namespace.

2024-05-30 Thread liuhongt
Committed as an obvious patch. gcc/ChangeLog: * config/i386/emmintrin.h (__double_u): Rename from double_u. (_mm_load_sd): Replace double_u with __double_u. (_mm_store_sd): Ditto. (_mm_loadh_pd): Ditto. (_mm_loadl_pd): Ditto. * config/i386/xmmintrin

[PATCH] [x86] Add some preference for floating point rtl ifcvt when sse4.1 is not available

2024-06-02 Thread liuhongt
Without TARGET_SSE4_1, it takes three instructions (pand, pandn and por) for movdfcc/movsfcc, which can fail the cost comparison. Increasing branch cost could hurt performance for other modes, so specially add some preference for floating-point ifcvt. Bootstrapped and regtested on x86_64-pc-linux-gnu{-
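A minimal C sketch of a floating-point conditional move rtl ifcvt would form:

  double
  fsel (double a, double b, double c, double d)
  {
    /* Without SSE4.1 blendvpd this becomes a pand/pandn/por mask
       sequence, three instructions, so the ifcvt cost model needs a
       nudge to still prefer it over a branch.  */
    return a < b ? c : d;
  }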
