[Bug target/52908] xop-mul-1:f9 miscompiled on bulldozer (-mxop)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52908 --- Comment #9 from vekumar at gcc dot gnu.org 2012-06-18 15:10:51 UTC --- Author: vekumar Date: Mon Jun 18 15:10:45 2012 New Revision: 188736 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=188736 Log: Back port Fix PR 52908 - xop-mul-1:f9 miscompiled on bulldozer (-mxop) to 4.7 branch Modified: branches/gcc-4_7-branch/gcc/ChangeLog branches/gcc-4_7-branch/gcc/config/i386/sse.md branches/gcc-4_7-branch/gcc/testsuite/ChangeLog branches/gcc-4_7-branch/gcc/testsuite/gcc.target/i386/xop-imul32widen-vector.c
[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494 --- Comment #8 from vekumar at gcc dot gnu.org --- I tested mdbx before and after the revision Richard pointed out. On my Ryzen box there is a ~4% regression. Although "vblendvps" is a fast-path instruction and can execute on pipe 0/1, it competes with the vcmpccsd, fma and mul instructions that also execute on pipes 0/1. The regression looks to be due to the added dependency and port pressure. We need to benchmark a large application such as SPEC and then decide whether to enable the X86_TUNE_SCALAR_FLOAT_BLENDV tuning for Ryzen or not. On BDVER4 no blendvps was generated and no regression was seen.
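For reference, a minimal hedged sketch (my own made-up example, not code from mdbx) of the kind of scalar floating-point select that the X86_TUNE_SCALAR_FLOAT_BLENDV tuning controls; with the tuning enabled GCC prefers a vblendv-style select over a compare plus logical-mask sequence for code like this:

/* Hypothetical reduced example: a scalar float select on an FP compare,
   which the RTL if-converter can implement either with a blendv
   instruction or with a cmp/and/andn/or sequence.  */
float select (float a, float b, float c, float d)
{
  return a < b ? c : d;
}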
[Bug tree-optimization/86144] New: GCC is not generating vector math calls to svml/acml functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86144 Bug ID: 86144 Summary: GCC is not generating vector math calls to svml/acml functions Product: gcc Version: 8.1.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vekumar at gcc dot gnu.org Target Milestone: ---

As per the GCC 8.1.0 manual:

---snip---
-mveclibabi=type Specifies the ABI type to use for vectorizing intrinsics using an external library. Supported values for type are ‘svml’ for the Intel short vector math library and ‘acml’ for the AMD math core library. To use this option, both -ftree-vectorize and -funsafe-math-optimizations have to be enabled, and an SVML or ACML ABI-compatible library must be specified at link time. GCC currently emits calls to vmldExp2, vmldLn2, vmldLog102, vmldLog102, vmldPow2, vmldTanh2, vmldTan2, vmldAtan2, vmldAtanh2, vmldCbrt2, vmldSinh2, vmldSin2, vmldAsinh2, vmldAsin2, vmldCosh2, vmldCos2, vmldAcosh2, vmldAcos2, vmlsExp4, vmlsLn4, vmlsLog104, vmlsLog104, vmlsPow4, vmlsTanh4, vmlsTan4, vmlsAtan4, vmlsAtanh4, vmlsCbrt4, vmlsSinh4, vmlsSin4, vmlsAsinh4, vmlsAsin4, vmlsCosh4, vmlsCos4, vmlsAcosh4 and vmlsAcos4 for corresponding function type when -mveclibabi=svml is used, and __vrd2_sin, __vrd2_cos, __vrd2_exp, __vrd2_log, __vrd2_log2, __vrd2_log10, __vrs4_sinf, __vrs4_cosf, __vrs4_expf, __vrs4_logf, __vrs4_log2f, __vrs4_log10f and __vrs4_powf for the corresponding function type when -mveclibabi=acml is used.
---snip---

#include <math.h>

double test_vect_exp (double* __restrict__ A, double* __restrict__ B, int size)
{
  int i;
  for (i = 0; i < size; i++)
    A[i] = exp(B[i]);
  return A[0];
}

gcc-5.4.0/bin/gcc -O3 -mveclibabi=acml -ffast-math exp.c -S generated vector math calls to amdlibm/Intel SVML:

---snip---
.L8:
        movapd  (%r12), %xmm0
        addl    $1, %r15d
        addq    $16, %r12
        addq    $16, %rbx
        call    __vrd2_exp
        movups  %xmm0, -16(%rbx)
        cmpl    %r15d, 4(%rsp)
        ja      .L8
        movl    12(%rsp), %eax
        addl    %eax, %ebp
        cmpl    %eax, 8(%rsp)
        je      .L10
---snip---

From gcc-6.0 onwards we don't generate calls to acml/svml by default. What we generate instead is a call to the glibc vector math function (libmvec):

---snip---
.L8:
        movapd  (%r12), %xmm0
        addl    $1, %r15d
        addq    $16, %r12
        addq    $16, %rbx
        call    _ZGVbN2v___exp_finite
        movups  %xmm0, -16(%rbx)
        cmpl    %r15d, 4(%rsp)
        ja      .L8
        movl    12(%rsp), %eax
        addl    %eax, %ebp
        cmpl    %eax, 8(%rsp)
        je      .L10
---snip---
[Bug tree-optimization/86144] GCC is not generating vector math calls to svml/acml functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86144 --- Comment #3 from vekumar at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
> Note a workaround would be to re-arrange the vectorizer calls to
> vectorizable_simd_clone_call and vectorizable_call.  Can you check if
> the following works?  It gives precedence to what the target hook
> (and thus -mveclibabi) provides.
>
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index 9f365e31e49..bdef56bf65e 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -9543,13 +9543,13 @@ vect_analyze_stmt (gimple *stmt, bool *need_to_vectorize, slp_tree node,
>    if (!bb_vinfo
>        && (STMT_VINFO_RELEVANT_P (stmt_info)
>            || STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def))
> -    ok = (vectorizable_simd_clone_call (stmt, NULL, NULL, node, cost_vec)
> +    ok = (vectorizable_call (stmt, NULL, NULL, node, cost_vec)
>           || vectorizable_conversion (stmt, NULL, NULL, node, cost_vec)
>           || vectorizable_shift (stmt, NULL, NULL, node, cost_vec)
>           || vectorizable_operation (stmt, NULL, NULL, node, cost_vec)
>           || vectorizable_assignment (stmt, NULL, NULL, node, cost_vec)
>           || vectorizable_load (stmt, NULL, NULL, node, node_instance, cost_vec)
> -         || vectorizable_call (stmt, NULL, NULL, node, cost_vec)
> +         || vectorizable_simd_clone_call (stmt, NULL, NULL, node, cost_vec)
>           || vectorizable_store (stmt, NULL, NULL, node, cost_vec)
>           || vectorizable_reduction (stmt, NULL, NULL, node, node_instance, cost_vec)
> @@ -9559,14 +9559,14 @@ vect_analyze_stmt (gimple *stmt, bool *need_to_vectorize, slp_tree node,
>    else
>      {
>        if (bb_vinfo)
> -       ok = (vectorizable_simd_clone_call (stmt, NULL, NULL, node, cost_vec)
> +       ok = (vectorizable_call (stmt, NULL, NULL, node, cost_vec)
>              || vectorizable_conversion (stmt, NULL, NULL, node, cost_vec)
>              || vectorizable_shift (stmt, NULL, NULL, node, cost_vec)
>              || vectorizable_operation (stmt, NULL, NULL, node, cost_vec)
>              || vectorizable_assignment (stmt, NULL, NULL, node, cost_vec)
>              || vectorizable_load (stmt, NULL, NULL, node, node_instance, cost_vec)
> -            || vectorizable_call (stmt, NULL, NULL, node, cost_vec)
> +            || vectorizable_simd_clone_call (stmt, NULL, NULL, node, cost_vec)
>              || vectorizable_store (stmt, NULL, NULL, node, cost_vec)
>              || vectorizable_condition (stmt, NULL, NULL, NULL, 0, node, cost_vec)

I checked the patch; it now gives preference to the -mveclibabi= option and generates the expected calls.
[Bug target/91719] gcc compiles seq_cst store on x86-64 differently from clang/icc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91719 --- Comment #9 from vekumar at gcc dot gnu.org --- (In reply to Jakub Jelinek from comment #8) > CCing AMD too. Sure, let me check if this tuning helps the AMD Zen architecture.
[Bug target/91719] gcc compiles seq_cst store on x86-64 differently from clang/icc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91719 --- Comment #10 from vekumar at gcc dot gnu.org --- xchg is faster than mov+mfence on AMD Zen. We can add m_ZNVER1 | m_ZNVER2 to the tuning.
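For context, a minimal hedged sketch (standard C11 atomics, not code from the bug report) of the construct being discussed: a sequentially consistent store, which on x86-64 GCC lowers either to mov+mfence or to a single implicitly locked xchg; the comment above says xchg is the faster choice on Zen.

#include <stdatomic.h>

atomic_int flag;

/* A seq_cst store needs a full barrier on x86-64; it can be emitted
   either as "mov; mfence" or as one "xchg".  */
void publish (int value)
{
  atomic_store_explicit (&flag, value, memory_order_seq_cst);
}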
[Bug target/87455] sse_packed_single_insn_optimal is suboptimal on Zen
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87455 --- Comment #2 from vekumar at gcc dot gnu.org --- This tuning was intended to generate movups instead of movupd, as movups is one byte shorter than movupd. Maybe we should remove the xorps generation part.
[Bug middle-end/68621] [6 Regression] FAIL: gcc.dg/tree-ssa/ifc-8.c scan-tree-dump-times ifcvt "Applying if-conversion" 1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68621 --- Comment #3 from vekumar at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
> You can change the testcase to
>
> __attribute__((aligned (32))) float array[LEN] = {};
>
> which makes it not require -fno-common either and it should work with -fpic
> then (double-check).

I had added the option "-fno-common" so that the condition checked via decl_binds_to_current_def_p is true:

  /* or the base is know to be not readonly.  */
  tree base_tree = get_base_address (DR_REF (a));
  if (DECL_P (base_tree)
      && decl_binds_to_current_def_p (base_tree)

Changing to __attribute__((aligned (32))) float array[LEN] = {} also tests that condition. Sure, I will send a patch to adjust the test case.
[Bug middle-end/68621] [6 Regression] FAIL: gcc.dg/tree-ssa/ifc-8.c scan-tree-dump-times ifcvt "Applying if-conversion" 1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68621 --- Comment #4 from vekumar at gcc dot gnu.org ---
Even after initializing the array, decl_binds_to_current_def_p (base_tree) returns false when I set -fpic.

(1)
---Snip---
bool
decl_binds_to_current_def_p (const_tree decl)
{
  gcc_assert (DECL_P (decl));
  if (!targetm.binds_local_p (decl))
    return false;
---snip---

(2)
---snip---
#if !TARGET_MACHO && !TARGET_DLLIMPORT_DECL_ATTRIBUTES
/* For i386, common symbol is local only for non-PIE binaries.  For
   x86-64, common symbol is local only for non-PIE binaries or linker
   supports copy reloc in PIE binaries.  */

static bool
ix86_binds_local_p (const_tree exp)
{
  return default_binds_local_p_3 (exp, flag_shlib != 0, true, true,
                                  (!flag_pic
                                   || (TARGET_64BIT
                                       && HAVE_LD_PIE_COPYRELOC != 0)));
}
#endif
---snip---

In default_binds_local_p_3, DECL_VISIBILITY (exp) is VISIBILITY_DEFAULT and shlib is set, so it returns false:

(3)
---snip---
  /* A symbol is local if the user has said explicitly that it will
     be, or if we have a definition for the symbol.  We cannot infer
     visibility for undefined symbols.  */
  if (DECL_VISIBILITY (exp) != VISIBILITY_DEFAULT
      && (TREE_CODE (exp) == FUNCTION_DECL
          || !extern_protected_data
          || DECL_VISIBILITY (exp) != VISIBILITY_PROTECTED)
      && (DECL_VISIBILITY_SPECIFIED (exp) || defined_locally))
    return true;

  /* If PIC, then assume that any global name can be overridden by
     symbols resolved from other modules.  */
  if (shlib)
    return false;
---snip---
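To make the binding issue concrete, here is a small hedged C illustration (made-up identifiers, not from the testsuite): under -fpic a global with default visibility may be interposed by another module, so decl_binds_to_current_def_p must return false for it, while a hidden-visibility definition is known to bind locally.

/* Compile with -O2 -fpic.  The first array has default visibility, so
   another module may override its definition and the compiler cannot
   assume the local definition is the one used.  The second is declared
   hidden, so the local definition is known to bind here.  */
__attribute__((aligned (32))) float interposable[4096];

__attribute__((visibility ("hidden"), aligned (32))) float local_only[4096] = {0};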
[Bug middle-end/68621] [6 Regression] FAIL: gcc.dg/tree-ssa/ifc-8.c scan-tree-dump-times ifcvt "Applying if-conversion" 1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68621 --- Comment #5 from vekumar at gcc dot gnu.org ---
Adding visibility to hidden helps.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ifc-8.c b/gcc/testsuite/gcc.dg/tree-ssa/ifc-8.c
index 89a3410..7519a61 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ifc-8.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ifc-8.c
@@ -1,9 +1,9 @@
 /* { dg-do compile } */
-/* { dg-options "-Ofast -fdump-tree-ifcvt-details -fno-common -ftree-loop-if-convert-stores" } */
+/* { dg-options "-Ofast -fdump-tree-ifcvt-details -ftree-loop-if-convert-stores" } */

 #define LEN 4096
- __attribute__((aligned (32))) float array[LEN];
+ __attribute__((visibility("hidden"), aligned (32))) float array[LEN] = {};

 void test ()
 {
[Bug middle-end/68621] [6 Regression] FAIL: gcc.dg/tree-ssa/ifc-8.c scan-tree-dump-times ifcvt "Applying if-conversion" 1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68621 --- Comment #6 from vekumar at gcc dot gnu.org --- Author: vekumar Date: Wed Mar 2 06:14:43 2016 New Revision: 233888 URL: https://gcc.gnu.org/viewcvs?rev=233888&root=gcc&view=rev Log: Adjust test case in PR68621 to compile with -fpic. 2016-03-02 Venkataramanan Kumar PR tree-optimization/68621 * gcc.dg/tree-ssa/ifc-8.c: Adjust test. Modified: trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/tree-ssa/ifc-8.c
[Bug tree-optimization/70102] New: Tree re-association prevents SLP vectorization at -Ofast.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70102 Bug ID: 70102 Summary: Tree re-association prevents SLP vectorization at -Ofast. Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vekumar at gcc dot gnu.org Target Milestone: ---

The following test case fails to vectorize with gcc -Ofast.

(---snip---)
      subroutine test (x,y,z)
      integer x,y,z
      real*8 a(5,x,y,z),b(5,x,y,z)
      real*8 c
      c = 0.0d0
      do k=1,z
         do j=1,y
            do i=1,x
               do l=1,5
                  c = c + a(l,i,j,k)*b(l,i,j,k)
               enddo
            enddo
         enddo
      enddo
      write(30,*)'c ==',c
      return
      end
(---snip---)

Vectorizer dump:

(---snip---)
test.f:9:0: note: original stmt _95 = _92 + _112;
test.f:9:0: note: Build SLP for _152 = _150 * _151;
test.f:9:0: note: Build SLP failed: different operation in stmt _152 = _150 * _151;
test.f:9:0: note: original stmt _95 = _92 + _112;
test.f:9:0: note: Build SLP for _55 = _53 * _54;
test.f:9:0: note: Build SLP failed: different operation in stmt _55 = _53 * _54;
test.f:9:0: note: original stmt _95 = _92 + _112;
test.f:1:0: note: vectorized 0 loops in function
(---snip---)

The re-association pass changes one of the tree expressions and that prevents SLP block vectorization.

Before:

(---snip---)
  # VUSE <.MEM_7>
  _90 = *A.18_37[_89];
  # VUSE <.MEM_7>
  _91 = *A.20_40[_89];
  _92 = _90 * _91;
  # VUSE <.MEM_7>
  c.21_93 = cD.3439;
  c.22_94 = _92 + c.21_93;
  _109 = _87 + 2;
  # VUSE <.MEM_7>
  _110 = *A.18_37[_109];
  # VUSE <.MEM_7>
  _111 = *A.20_40[_109];
  _112 = _110 * _111;
  c.22_114 = c.22_94 + _112;
  _129 = _87 + 3;
(---snip---)

After tree-reassoc:

(---snip---)
  # VUSE <.MEM_7>
  _90 = *A.18_37[_89];
  # VUSE <.MEM_7>
  _91 = *A.20_40[_89];
  _92 = _91 * _90;
  # VUSE <.MEM_7>
  c.21_93 = cD.3439;
  _109 = _87 + 2;
  # VUSE <.MEM_7>
  _110 = *A.18_37[_109];
  # VUSE <.MEM_7>
  _111 = *A.20_40[_109];
  _112 = _111 * _110;
  _31 = _112 + _92;    <== new statement
  _129 = _87 + 3;
(---snip---)
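A hedged C analogue of the Fortran kernel above (made-up names, for illustration only): the inner reduction is a short dot product whose add chain the reassociation pass regroups, after which the SLP build sees differing root operations and gives up.

/* Roughly the same reduction shape as the Fortran loop nest above:
   an inner dot product of fixed length 5 accumulated into c.  */
double dot (const double *a, const double *b, int n)
{
  double c = 0.0;
  for (int i = 0; i < n; i++)
    for (int l = 0; l < 5; l++)
      c = c + a[5 * i + l] * b[5 * i + l];
  return c;
}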
[Bug tree-optimization/70103] New: gcc reports bad dependence and bails out of vectorization for one of the bwaves loops.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70103 Bug ID: 70103 Summary: gcc reports bad dependence and bails out of vectorization for one of the bwaves loops. Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vekumar at gcc dot gnu.org Target Milestone: --- flux_lam.f:68:0: note: dependence distance = 0. flux_lam.f:68:0: note: dependence distance == 0 between MEM[(real(kind=8)D.18[0:D.3627] *)ev_197(D) clique 1 base 12][_244] and MEM[(real(kind=8)D.18[0:D.3627] *)ev_197(D) clique 1 base 12][_244] flux_lam.f:68:0: note: READ_WRITE dependence in interleaving. flux_lam.f:68:0: note: bad data dependence. Looking at vector dumps, if we have CSEd the load, then there is no dependency issue here. MEM[(real(kind=8)D.18[0:D.3627] *)ev_197(D) clique 1 base 12][_244] = _272 _323 = MEM[(real(kind=8)D.18[0:D.3627] *)ev_197(D) clique 1 base 12][_244]; ---snip--- MEM[(real(kind=8)D.18[0:D.3627] *)ev_197(D) clique 1 base 12][_244] = _272; # VUSE <.MEM_273> _274 = MEM[(real(kind=8)D.18[0:D.3605] *)u.105_58][_219]; # VUSE <.MEM_273> _275 = MEM[(real(kind=8)D.18[0:D.3605] *)u.105_58][_224]; _276 = _274 - _275; _277 = ((_276)); t1_278 = _277 / dy2_68; _279 = _195 + 3; # VUSE <.MEM_273> _280 = MEM[(real(kind=8)D.18[0:D.3605] *)u.105_58][_252]; # VUSE <.MEM_273> _281 = MEM[(real(kind=8)D.18[0:D.3605] *)u.105_58][_254]; _282 = _280 - _281; _283 = ((_282)); _284 = _283 / dy2_68; _285 = t1_278 + _284; _286 = ((_285)); _287 = _286 * 5.0e-1; # VUSE <.MEM_273> _288 = MEM[(real(kind=8)D.18[0:D.3601] *)v.107_60][_206]; # VUSE <.MEM_273> _289 = MEM[(real(kind=8)D.18[0:D.3601] *)v.107_60][_203]; _290 = _288 - _289; _291 = ((_290)); _292 = _291 / _64; _293 = _287 + _292; _294 = ((_293)); _295 = t0_210 * _294; # .MEM_296 = VDEF <.MEM_273> MEM[(real(kind=8)D.18[0:D.3627] *)ev_197(D) clique 1 base 12][_279] = _295; # VUSE <.MEM_296> _297 = MEM[(real(kind=8)D.18[0:D.3605] *)u.105_58][_233]; # VUSE <.MEM_296> _298 = MEM[(real(kind=8)D.18[0:D.3605] *)u.105_58][_239]; _299 = _297 - _298; _300 = ((_299)); t2_301 = _300 / dz2_71; _302 = _195 + 4; # VUSE <.MEM_296> _303 = MEM[(real(kind=8)D.18[0:D.3605] *)u.105_58][_261]; # VUSE <.MEM_296> _304 = MEM[(real(kind=8)D.18[0:D.3605] *)u.105_58][_263]; _305 = _303 - _304; _306 = ((_305)); _307 = _306 / dz2_71; _308 = t2_301 + _307; _309 = ((_308)); _310 = _309 * 5.0e-1; # VUSE <.MEM_296> _311 = MEM[(real(kind=8)D.18[0:D.3597] *)w.109_62][_206]; # VUSE <.MEM_296> _312 = MEM[(real(kind=8)D.18[0:D.3597] *)w.109_62][_203]; _313 = _311 - _312; _314 = ((_313)); _315 = _314 / _64; _316 = _310 + _315; _317 = ((_316)); _318 = t0_210 * _317; # .MEM_319 = VDEF <.MEM_296> MEM[(real(kind=8)D.18[0:D.3627] *)ev_197(D) clique 1 base 12][_302] = _318; _320 = _195 + 5; _321 = _246 + _247; _322 = ((_321)); # VUSE <.MEM_319> _323 = MEM[(real(kind=8)D.18[0:D.3627] *)ev_197(D) clique 1 base 12][_244]; ---snip---
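A hedged sketch (made-up names, much simplified from the bwaves kernel) of the store-then-reload pattern noted above; if the reload of ev[k] were CSEd to the value just stored, the READ_WRITE dependence the vectorizer complains about would disappear.

void kernel (double *ev, const double *u, int k)
{
  ev[k] = u[k] * 0.5;       /* store */
  double t = ev[k];         /* immediate reload of the same element;
                               could reuse the stored value instead */
  ev[k + 1] = t + u[k + 1];
}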
[Bug tree-optimization/70103] gcc reports bad dependence and bails out of vectorization for one of the bwaves loops.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70103 vekumar at gcc dot gnu.org changed: What|Removed |Added CC||matz at suse dot de, ||richard.guenther at gmail dot com Severity|normal |enhancement --- Comment #1 from vekumar at gcc dot gnu.org ---
After discussion with Richard it was concluded that even after we fix this we still won't be able to vectorize the loop.

(Snip)
flux_lam.f:68:0: note: not vectorized: relevant stmt not supported: _177 = _176 % _21;
flux_lam.f:68:0: note: bad operation or unsupported loop bound.
(Snip)

The reason is that we have % operations:

(Snip)
  :
  # i_2 = PHI <1(23), _181(28)>
  _175 = i_2 + _21;
  _176 = _175 + -2;
  _177 = _176 % _21;
  im1_178 = _177 + 1;
  _179 = i_2 % _21;
  ip1_180 = _179 + 1;
(Snip)

These make the indices "wrap" around, which is of course something that is hard to vectorize. One would need iteration-space splitting to ensure the wrapping doesn't occur in the vectorized iterations. Reporting this bug and marking it as an enhancement.
[Bug tree-optimization/70193] New: missed loop splitting support based on iteration space
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70193 Bug ID: 70193 Summary: missed loop splitting support based on iteration space Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vekumar at gcc dot gnu.org Target Milestone: --- Following the comments in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70103#c2 and discussion with Richard, filing this PR. This is inspired by the loop flux_lam.f:68:0 at bwaves which has % operation. int a[100],b[100]; void test(int x, int N1) { int i,im1; for (i=0;i
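Since the test case in the report is cut off, here is a hedged sketch (entirely made-up loop body) of the kind of iteration-space split being requested: peeling the single iteration where the modulo index wraps leaves a residual loop with no % that the vectorizer can handle.

int a[100], b[100];

/* Original form: the wrapping index im1 blocks vectorization.  */
void test_wrap (int N1)
{
  for (int i = 0; i < N1; i++)
    {
      int im1 = (i + N1 - 1) % N1;   /* wraps only when i == 0 */
      a[i] = b[i] + b[im1];
    }
}

/* After splitting the iteration space at the wrap point.  */
void test_split (int N1)
{
  if (N1 > 0)
    a[0] = b[0] + b[N1 - 1];         /* peeled wrapping iteration */
  for (int i = 1; i < N1; i++)
    a[i] = b[i] + b[i - 1];          /* no %, vectorizable */
}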
[Bug tree-optimization/70193] missed loop splitting support based on iteration space
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70193 vekumar at gcc dot gnu.org changed: What|Removed |Added Severity|normal |enhancement
[Bug tree-optimization/58135] [x86] Missed opportunities for partial SLP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58135 --- Comment #3 from vekumar at gcc dot gnu.org --- Author: vekumar Date: Mon May 23 09:48:54 2016 New Revision: 236582 URL: https://gcc.gnu.org/viewcvs?rev=236582&root=gcc&view=rev Log: Fix PR58135. 2016-05-23 Venkataramanan Kumar PR tree-optimization/58135 * tree-vect-slp.c: When group size is not multiple of vector size, allow splitting of store group at vector boundary. 2016-05-23 Venkataramanan Kumar * gcc.dg/vect/bb-slp-19.c: Remove XFAIL. * gcc.dg/vect/pr58135.c: Add new. * gfortran.dg/pr46519-1.f: Adjust test case. Added: trunk/gcc/testsuite/gcc.dg/vect/pr58135.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/vect/bb-slp-19.c trunk/gcc/testsuite/gfortran.dg/pr46519-1.f trunk/gcc/tree-vect-slp.c
[Bug tree-optimization/58135] [x86] Missed opportunities for partial SLP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58135 vekumar at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #4 from vekumar at gcc dot gnu.org --- Fixed the PR. ref: https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=236582 2016-05-23 Venkataramanan Kumar PR tree-optimization/58135 * tree-vect-slp.c: When group size is not multiple of vector size, allow splitting of store group at vector boundary. 2016-05-23 Venkataramanan Kumar * gcc.dg/vect/bb-slp-19.c: Remove XFAIL. * gcc.dg/vect/pr58135.c: Add new. * gfortran.dg/pr46519-1.f: Adjust test case
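As a hedged illustration of what r236582 enables (made-up example, not from the testsuite): with 2-lane double vectors the three adjacent stores below form a store group of 3; instead of rejecting the whole group because 3 is not a multiple of the vector size, the group is now split at the vector boundary into a vectorizable group of 2 plus a scalar remainder.

struct point { double x, y, z; };

void scale (struct point *p, double f)
{
  p->x *= f;   /* these two stores can become one two-lane store ... */
  p->y *= f;
  p->z *= f;   /* ... and this one is left as a scalar remainder.    */
}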
[Bug tree-optimization/71270] [7 Regression] fortran regression after fix SLP PR58135
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71270 --- Comment #2 from vekumar at gcc dot gnu.org ---
Looked at the x86_64 gimple code for intrinsic_pack_1.f90. After the SLP split we now vectorize at the place where we pass constant arguments via a parameter structure to the _gfortran_pack call.

Before:

  parm.20D.3555.dtypeD.3497 = 297;
  # .MEM_242 = VDEF <.MEM_241>
  parm.20D.3555.dimD.3502[0].lboundD.3499 = 1;
  # .MEM_243 = VDEF <.MEM_242>
  parm.20D.3555.dimD.3502[0].uboundD.3500 = 9;
  # .MEM_244 = VDEF <.MEM_243>
  parm.20D.3555.dimD.3502[0].strideD.3498 = 1;
  # .MEM_245 = VDEF <.MEM_244>
  parm.20D.3555.dataD.3495 = &d_ri4D.3433[0];
  # .MEM_246 = VDEF <.MEM_245>
  parm.20D.3555.offsetD.3496 = -1;

After:

  # .MEM_243 = VDEF <.MEM_1566>
  parm.20D.3555.dimD.3502[0].uboundD.3500 = 9;
  # .MEM_245 = VDEF <.MEM_243>
  parm.20D.3555.dataD.3495 = &d_ri4D.3433[0];
  # .MEM_992 = VDEF <.MEM_245>
  MEM[(integer(kind=8)D.9 *)&parm.20D.3555 + 8B] = vect_cst__993;
  # PT = anything
  # ALIGN = 16, MISALIGN = 8
  _984 = &parm.20D.3555.offsetD.3496 + 16;
  # .MEM_983 = VDEF <.MEM_992>
  MEM[(integer(kind=8)D.9 *)_984] = vect_cst__999;

where vect_cst__993 = {-1, 297} and vect_cst__999 = {1, 1}.

Other places look similar. This looks like correct gimple. I am verifying the gimple generated for the arm big-endian target.
[Bug tree-optimization/71270] [7 Regression] fortran regression after fix SLP PR58135
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71270 --- Comment #3 from vekumar at gcc dot gnu.org --- Built armeb-none-linux-gnueabihf -with-cpu=cortex-a9 --with-fpu=neon-fp16 --with-float=hard And compared gimple output from intrinsic_pack_1.f90.151t.slp1 before and after my patch. The difference is shown below and is similar to x86_64 dump. The gimple dump after SLP looks correct to me. I think something in backend is causing the issues. Any thoughts? Gimple SLP dumps. Before # .MEM_1450 = VDEF <.MEM_1492> d_i1D.3585[0].vD.3582 = 1; # .MEM_1454 = VDEF <.MEM_1450> d_i1D.3585[1].vD.3582 = -1; # .MEM_1458 = VDEF <.MEM_1454> d_i1D.3585[2].vD.3582 = 2; # .MEM_1468 = VDEF <.MEM_1458> d_i1D.3585[3].vD.3582 = -2; # .MEM_1472 = VDEF <.MEM_1468> d_i1D.3585[4].vD.3582 = 3; # .MEM_1476 = VDEF <.MEM_1472> d_i1D.3585[5].vD.3582 = -3; # .MEM_1486 = VDEF <.MEM_1476> d_i1D.3585[6].vD.3582 = 4; # .MEM_1490 = VDEF <.MEM_1486> d_i1D.3585[7].vD.3582 = -4; # .MEM_1494 = VDEF <.MEM_1490> d_i1D.3585[8].vD.3582 = 5; After vect_cst__817 = { 1, 0, 1, 0 }; vect_cst__873 = { 1, 0, 1, 0 }; vect_cst__1413 = { 1, -1, 2, -2 }; vect_cst__1461 = { 3, -3, 4, -4 }; # .MEM_910 = VDEF <.MEM_1492> MEM[(integer(kind=1)D.3 *)&d_i1D.3585] = vect_cst__1413; # PT = anything # ALIGN = 4, MISALIGN = 0 _918 = &d_i1D.3585[0].vD.3582 + 4; # .MEM_865 = VDEF <.MEM_910> MEM[(integer(kind=1)D.3 *)_918] = vect_cst__1461; # .MEM_1494 = VDEF <.MEM_865> d_i1D.3585[8].vD.3582 = 5; Before # .MEM_1388 = VDEF <.MEM_217> MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][0] = 1; # .MEM_1393 = VDEF <.MEM_1388> MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][1] = 0; # .MEM_1398 = VDEF <.MEM_1393> MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][2] = 1; # .MEM_1409 = VDEF <.MEM_1398> MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][3] = 0; # .MEM_1414 = VDEF <.MEM_1409> MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][4] = 1; # .MEM_1419 = VDEF <.MEM_1414> MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][5] = 0; # .MEM_1430 = VDEF <.MEM_1419> MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][6] = 1; # .MEM_1435 = VDEF <.MEM_1430> MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][7] = 0; # .MEM_1440 = VDEF <.MEM_1435> MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][8] = 1; After # .MEM_825 = VDEF <.MEM_217> MEM[(logical(kind=1)D.7 *)&A.8D.3679] = vect_cst__817; # PT = anything # ALIGN = 4, MISALIGN = 0 _769 = &MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][0] + 4; # .MEM_777 = VDEF <.MEM_825> MEM[(logical(kind=1)D.7 *)_769] = vect_cst__873; # .MEM_1440 = VDEF <.MEM_777> MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][8] = 1; Before # .MEM_1271 = VDEF <.MEM_264> MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][0] = 1; # .MEM_1276 = VDEF <.MEM_1271> MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][1] = 0; # .MEM_1281 = VDEF <.MEM_1276> MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][2] = 1; # .MEM_1292 = VDEF <.MEM_1281> MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][3] = 0; # .MEM_1297 = VDEF <.MEM_1292> MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][4] = 1; # .MEM_1302 = VDEF <.MEM_1297> MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][5] = 0; # .MEM_1313 = VDEF <.MEM_1302> MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][6] = 1; # .MEM_1318 = VDEF <.MEM_1313> MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][7] = 0; # .MEM_1323 = VDEF <.MEM_1318> MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][8] = 1; After vect_cst__729 = { 1, 0, 1, 0 }; vect_cst__721 = { 1, 0, 1, 0 }; # .MEM_673 = VDEF <.MEM_264> MEM[(logical(kind=1)D.7 *)&A.23D.3720] = vect_cst__729; # PT = anything # ALIGN = 4, MISALIGN = 0 _681 = &MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][0] + 4; # .MEM_942 = VDEF <.MEM_673> 
MEM[(logical(kind=1)D.7 *)_681] = vect_cst__721; # .MEM_1323 = VDEF <.MEM_942> MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][8] = 1;
[Bug target/71270] [7 Regression] fortran regression after fix SLP PR58135
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71270 --- Comment #5 from vekumar at gcc dot gnu.org --- The expand dump after SLP split ---snip-- ;; MEM[(logical(kind=1) *)&A.8] = { 1, 0, 1, 0 }; (insn 71 70 72 (set (reg:SI 308) (const_int 16777472 [0x1000100])) intrinsic_pack_1.f90:49 -1 (nil)) (insn 72 71 0 (set (mem/c:SI (plus:SI (reg/f:SI 105 virtual-stack-vars) (const_int -576 [0xfdc0])) [8 MEM[(logical(kind=1)D.7 *)&A.8D.3679]+0 S4 A64]) (reg:SI 308)) intrinsic_pack_1.f90:49 -1 (nil)) ;; MEM[(logical(kind=1) *)&A.8 + 4B] = { 1, 0, 1, 0 }; (insn 73 72 74 (set (reg:SI 309) (const_int 16777472 [0x1000100])) intrinsic_pack_1.f90:49 -1 (nil)) (insn 74 73 0 (set (mem/c:SI (plus:SI (reg/f:SI 105 virtual-stack-vars) (const_int -572 [0xfdc4])) [8 MEM[(logical(kind=1)D.7 *)&A.8D.3679 + 4B]+0 S4 A32]) (reg:SI 309)) intrinsic_pack_1.f90:49 -1 (nil)) ;; MEM[(logical(kind=1)[9] *)&A.8][8] = 1; (insn 75 74 76 (set (reg:SI 310) (const_int 1 [0x1])) intrinsic_pack_1.f90:49 -1 (nil)) (insn 76 75 77 (set (reg:QI 311) (subreg:QI (reg:SI 310) 3)) intrinsic_pack_1.f90:49 -1 (nil)) (insn 77 76 0 (set (mem/c:QI (plus:SI (reg/f:SI 105 virtual-stack-vars) (const_int -568 [0xfdc8])) [8 A.8D.3679+8 S1 A64]) (reg:QI 311)) intrinsic_pack_1.f90:49 -1 (nil)) --snip---
[Bug tree-optimization/64946] [AArch64] gcc.target/aarch64/vect-abs-compile.c - "abs" vectorization fails for char/short types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64946 vekumar at gcc dot gnu.org changed: What|Removed |Added Assignee|vekumar at gcc dot gnu.org |shiva0217 at gmail dot com --- Comment #15 from vekumar at gcc dot gnu.org --- I am not working on this now, so it has been assigned to Shiva Chen.
[Bug tree-optimization/64716] Missed vectorization in a hot code of SPEC2000 ammp
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64716 vekumar at gcc dot gnu.org changed: What|Removed |Added CC||vekumar at gcc dot gnu.org --- Comment #4 from vekumar at gcc dot gnu.org ---
Tried to see if there is an improvement when allowing splitting of the group stores at the VF boundary. A small improvement was noted with a slightly older trunk, gcc version 7.0.0 20160524 (experimental) (GCC).

rectmm.c:520:2: note: Basic block will be vectorized using SLP

(Snip)
 a1-> px = a1->x + lambda*a1->dx;
 a1-> py = a1->y + lambda*a1->dy;
 a1-> pz = a1->z + lambda*a1->dz;
(Snip)

---SLP dump---
rectmm.c:520:2: note: Detected interleaving load a1_944->xD.4701 and a1_944->yD.4702
rectmm.c:520:2: note: Detected interleaving load a1_944->xD.4701 and a1_944->zD.4703
rectmm.c:520:2: note: Detected interleaving load a1_944->xD.4701 and a1_944->dxD.4721
rectmm.c:520:2: note: Detected interleaving load a1_944->xD.4701 and a1_944->dyD.4722
rectmm.c:520:2: note: Detected interleaving load a1_944->xD.4701 and a1_944->dzD.4723
rectmm.c:520:2: note: Detected interleaving store a1_944->pxD.4728 and a1_944->pyD.4729
rectmm.c:520:2: note: Detected interleaving store a1_944->pxD.4728 and a1_944->pzD.4730
rectmm.c:520:2: note: Split group into 2 and 1
rectmm.c:520:2: note: Basic block will be vectorized using SLP
rectmm.c:520:2: note: SLPing BB part
rectmm.c:520:2: note: -->vectorizing SLP node starting from: # VUSE <.MEM_1752>
_672 = a1_944->dxD.4721;
---SLP dump---
[Bug sanitizer/65662] AddressSanitizer CHECK failed: ../../../../gcc/libsanitizer/sanitizer_common/sanitizer_allocator.h:835 "((res)) < ((kNumPossibleRegions))" (0x3ffb49, 0x80000)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65662 vekumar at gcc dot gnu.org changed: What|Removed |Added CC||vekumar at gcc dot gnu.org --- Comment #6 from vekumar at gcc dot gnu.org --- For a 42-bit VA, I have to change SANITIZER_MMAP_RANGE_SIZE to 1 << 42. Also, the compiler has to add the shadow offset instead of ORing it. I am planning to post a patch in LLVM. As Kostya said, we can discuss it in that thread.
[Bug sanitizer/65662] AddressSanitizer CHECK failed: ../../../../gcc/libsanitizer/sanitizer_common/sanitizer_allocator.h:835 "((res)) < ((kNumPossibleRegions))" (0x3ffb49, 0x80000)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65662 --- Comment #8 from vekumar at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #7)
> (In reply to vekumar from comment #6)
> > For 42 bit VA, I have to change the SANITIZER_MMAP_RANGE_SIZE to 1 <<42.
>
> Sure.
>
> > Also compiler has to add the shadow offset instead of Oring it.
>
> You don't, see my patch.
> As I said, the hard part is making sure all 3 layouts work with the same
> libasan library - the problem is that the library assumes some decisions
> (like whether to use 32-bit or 64-bit allocator) have to be done at library
> compile time, when for aarch64 they really have to be done at runtime.

Hi Jakub,

It was decided to make ASAN work for a 42-bit VA without changing the default allocator (32-bit) or the default shadow offset (1<<36). Please see the thread https://groups.google.com/forum/#!topic/address-sanitizer/YzYRJEvVimw.

On a 42-bit VA with the default settings, I found that some cases (LLVM ASAN tests) were failing because the compiler (LLVM) ORs in the shadow offset while the ASAN library code adds it. Both accesses resulted in valid memory, but we were poisoning the wrong shadow memory.

Now, your patch turns on the 64-bit allocator. I agree that to do this we need to detect the VA size dynamically at runtime. Can you please join the thread and post your comments there?
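A hedged sketch of why ADD versus OR matters for the shadow mapping (the offset value 1<<36 is taken from the comment above; everything else is an illustrative reduction, not the actual libasan code): the two computations only agree while the shifted address has no bits in common with the offset, which stops holding once the VA is wide enough.

#include <stdint.h>

#define K_SHADOW_OFFSET (1ULL << 36)   /* default shadow offset discussed above */

/* What the runtime does: shadow byte = (addr >> 3) + offset.  */
static inline uintptr_t shadow_add (uintptr_t addr)
{
  return (addr >> 3) + K_SHADOW_OFFSET;
}

/* What the instrumented code was emitting, per the comment above.  On a
   42-bit VA, (addr >> 3) can already have bit 36 set, so OR and ADD give
   different shadow addresses and the wrong shadow byte gets poisoned.  */
static inline uintptr_t shadow_or (uintptr_t addr)
{
  return (addr >> 3) | K_SHADOW_OFFSET;
}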
[Bug bootstrap/62077] --with-build-config=bootstrap-lto fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62077 vekumar at gcc dot gnu.org changed: What|Removed |Added CC||vekumar at gcc dot gnu.org --- Comment #58 from vekumar at gcc dot gnu.org --- Richard, so for the GCC 5.0 branch do we have to use --enable-stage1-checking=release as a workaround?
[Bug target/66049] New: Few AArch64 extend and add with shift tests generates sub optimal code with trunk gcc 6.0.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66049 Bug ID: 66049 Summary: Few AArch64 extend and add with shift tests generates sub optimal code with trunk gcc 6.0. Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: vekumar at gcc dot gnu.org Target Milestone: --- After preventing conversion of shift to mults in combiner https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=222874 few Aarch64 target tests generates suboptimal code. Tests that now fail, but worked before: --- gcc.target/aarch64/adds1.c scan-assembler adds\tw[0-9]+, w[0-9]+, w[0-9]+, lsl 3 gcc.target/aarch64/adds1.c scan-assembler adds\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 3 gcc.target/aarch64/adds3.c scan-assembler-times adds\tx[0-9]+, x[0-9]+, x[0-9]+, sxtw 2 gcc.target/aarch64/extend.c scan-assembler add\tw[0-9]+,.*uxth #?1 gcc.target/aarch64/extend.c scan-assembler add\tx[0-9]+,.*uxtw #?3 gcc.target/aarch64/extend.c scan-assembler sub\tw[0-9]+,.*uxth #?1 gcc.target/aarch64/extend.c scan-assembler sub\tx[0-9]+,.*uxth #?1 gcc.target/aarch64/extend.c scan-assembler sub\tx[0-9]+,.*uxtw #?3 gcc.target/aarch64/subs1.c scan-assembler subs\tw[0-9]+, w[0-9]+, w[0-9]+, lsl 3 gcc.target/aarch64/subs1.c scan-assembler subs\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 3 gcc.target/aarch64/subs3.c scan-assembler-times subs\tx[0-9]+, x[0-9]+, x[0-9]+, sxtw 2 Sample Test case unsigned long long adddi_uxtw (unsigned long long a, unsigned int i) { /* { dg-final { scan-assembler "add\tx\[0-9\]+,.*uxtw #?3" } } */ return a + ((unsigned long long)i << 3); } Before add x0, x0, x1, uxtw 3 Now ubfiz x1, x1, 3, 32 add x0, x1, x0
[Bug target/66049] Few AArch64 extend and add with shift tests generates sub optimal code with trunk gcc 6.0.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66049 --- Comment #1 from vekumar at gcc dot gnu.org --- We need patterns based on shifts to match with combiner generated. Below patch fixes them. diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index 1c2c5fb..c5a640d 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -1555,6 +1555,23 @@ [(set_attr "type" "alus_shift_imm")] ) +(define_insn "*adds_shift_imm_" + [(set (reg:CC_NZ CC_REGNUM) +(compare:CC_NZ + (plus:GPI (ASHIFT:GPI +(match_operand:GPI 1 "register_operand" "r") +(match_operand:QI 2 "aarch64_shift_imm_" "n")) + (match_operand:GPI 3 "register_operand" "r")) + (const_int 0))) + (set (match_operand:GPI 0 "register_operand" "=r") +(plus:GPI (ASHIFT:GPI (match_dup 1) (match_dup 2)) + (match_dup 3)))] + "" + "adds\\t%0, %3, %1, %2" + [(set_attr "type" "alus_shift_imm")] +) + + (define_insn "*subs_mul_imm_" [(set (reg:CC_NZ CC_REGNUM) (compare:CC_NZ @@ -1571,6 +1588,23 @@ [(set_attr "type" "alus_shift_imm")] ) +(define_insn "*subs_shift_imm_" + [(set (reg:CC_NZ CC_REGNUM) +(compare:CC_NZ + (minus:GPI (match_operand:GPI 1 "register_operand" "r") +(ASHIFT:GPI + (match_operand:GPI 2 "register_operand" "r") + (match_operand:QI 3 "aarch64_shift_imm_" "n"))) + (const_int 0))) + (set (match_operand:GPI 0 "register_operand" "=r") +(minus:GPI (match_dup 1) + (ASHIFT:GPI (match_dup 2) (match_dup 3] + "" + "subs\\t%0, %1, %2, %3" + [(set_attr "type" "alus_shift_imm")] +) + + (define_insn "*adds__" [(set (reg:CC_NZ CC_REGNUM) (compare:CC_NZ @@ -1599,6 +1633,41 @@ [(set_attr "type" "alus_ext")] ) +(define_insn "*adds__shft_" + [(set (reg:CC_NZ CC_REGNUM) +(compare:CC_NZ + (plus:GPI (ashift:GPI (ANY_EXTEND:GPI +(match_operand:ALLX 1 "register_operand" "r")) + (match_operand 2 "aarch64_imm3" "Ui3")) + (match_operand:GPI 3 "register_operand" "r")) +(const_int 0))) + (set (match_operand:GPI 0 "register_operand" "=rk") +(plus:GPI (ashift:GPI (ANY_EXTEND:GPI (match_dup 1)) + (match_dup 2)) + (match_dup 3)))] + "" + "adds\\t%0, %3, %1, xt %2" + [(set_attr "type" "alus_ext")] +) + +(define_insn "*subs__shft_" + [(set (reg:CC_NZ CC_REGNUM) +(compare:CC_NZ + (minus:GPI (match_operand:GPI 1 "register_operand" "r") +(ashift:GPI (ANY_EXTEND:GPI +(match_operand:ALLX 2 "register_operand" "r")) + (match_operand 3 "aarch64_imm3" "Ui3"))) +(const_int 0))) + (set (match_operand:GPI 0 "register_operand" "=rk") +(minus:GPI (match_dup 1) +(ashift:GPI (ANY_EXTEND:GPI (match_dup 2)) + (match_dup 3] + "" + "subs\\t%0, %1, %2, xt %3" + [(set_attr "type" "alus_ext")] +) + + (define_insn "*adds__multp2" [(set (reg:CC_NZ CC_REGNUM) (compare:CC_NZ @@ -1909,6 +1978,22 @@ [(set_attr "type" "alu_ext")] ) +(define_insn "*add_uxt_shift2" + [(set (match_operand:GPI 0 "register_operand" "=rk") +(plus:GPI (and:GPI + (ashift:GPI (match_operand:GPI 1 "register_operand" "r") + (match_operand 2 "aarch64_imm3" "Ui3")) + (match_operand 3 "const_int_operand" "n")) + (match_operand:GPI 4 "register_operand" "r")))] + "aarch64_uxt_size (INTVAL (operands[2]), INTVAL (operands[3])) != 0" + "* + operands[3] = GEN_INT (aarch64_uxt_size (INTVAL(operands[2]), + INTVAL (operands[3]))); + return \"add\t%0, %4, %1, uxt%e3 %2\";" + [(set_attr "type" "alu_ext")] +) + + ;; zero_extend version of above (define_insn "*add_uxtsi_multp2_uxtw" [(set (match_operand:DI 0 "register_operand" "=rk") @@ -2165,6 +2
[Bug target/66049] Few AArch64 extend and add with shift tests generates sub optimal code with trunk gcc 6.0.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66049 --- Comment #4 from vekumar at gcc dot gnu.org ---
(In reply to ktkachov from comment #3)
> Venkat, are you planning to submit this patch to gcc-patches?
> Also, does this mean we can remove the patterns that do arith+shift using
> MULT rtxes? (like *adds__multp2)

Hi Kyrill,

Yes, I am planning to submit the patch. But before that I need to test by adding some asserts and checking that *adds__multp2 and similar patterns are no longer used.
[Bug target/66049] [6 regression] Few AArch64 extend and add with shift tests generates sub optimal code with trunk gcc 6.0.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66049 --- Comment #6 from vekumar at gcc dot gnu.org ---
(In reply to Ramana Radhakrishnan from comment #5)
> (In reply to vekumar from comment #4)
> > (In reply to ktkachov from comment #3)
> > > Venkat, are you planning to submit this patch to gcc-patches?
> > > Also, does this mean we can remove the patterns that do arith+shift using
> > > MULT rtxes? (like *adds__multp2)
> >
> > Hi Kyrill,
> >
> > Yes, I am planning to submit the patch. But before that I need to test by
> > adding some asserts and checking that *adds__multp2 and similar
> > patterns are no longer used.
>
> So this is a regression on GCC 6. what's holding up pushing this patch onto
> gcc-patches@ ?

GCC bootstrap and regression testing have completed. I am doing a SPEC 2006 INT run just to make sure there are no surprises, and will post the patch in a day or two.
[Bug target/66049] [6 regression] Few AArch64 extend and add with shift tests generates sub optimal code with trunk gcc 6.0.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66049 --- Comment #7 from vekumar at gcc dot gnu.org --- (In reply to ktkachov from comment #3) > Venkat, are you planning to submit this patch to gcc-patches? > Also, does this mean we can remove the patterns that do arith+shift using > MULT rtxes? (like *adds__multp2) Hi Kyrill, I added shift based patterns for *adds__multp2 *subs__multp2 *add_uxt_multp2 *add_uxtsi_multp2_uxtw *sub_uxt_multp2 *sub_uxtsi_multp2_uxtw *adds_mul_imm_ *subs_mul_imm_ I added "gcc_unreachable" to these patterns and gcc boostrapped except add_uxt_multp2 pattern. The pattern "*add_uxtdi_multp2" can still be generated. /root/work/GCC_Team/vekumar/build-assert-check/./gcc/xgcc -B/root/work/GCC_Team/vekumar/build-assert-check/./gcc/ -B/root/work/GCC_Team/vekumar/install-assert-check/aarch64-unknown-linux-gnu/bin/ -B/root/work/GCC_Team/vekumar/install-assert-check/aarch64-unknown-linux-gnu/lib/ -isystem /root/work/GCC_Team/vekumar/install-assert-check/aarch64-unknown-linux-gnu/include -isystem /root/work/GCC_Team/vekumar/install-assert-check/aarch64-unknown-linux-gnu/sys-include -g -O2 -O2 -g -O2 -DIN_GCC-W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wno-format -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition -isystem ./include -fPIC -g -DIN_LIBGCC2 -fbuilding-libgcc -fno-stack-protector -fPIC -I. -I. -I../.././gcc -I../../../gcc-assert-check/libgcc -I../../../gcc-assert-check/libgcc/. -I../../../gcc-assert-check/libgcc/../gcc -I../../../gcc-assert-check/libgcc/../include -DHAVE_CC_TLS -o _gcov.o -MT _gcov.o -MD -MP -MF _gcov.dep -DL_gcov -c ../../../gcc-assert-check/libgcc/libgcov-driver.c insn 1325 1324 1326 137 (set (reg:DI 725 [ ix ]) (zero_extend:DI (reg/v:SI 197 [ ix ]))) ../../../gcc-assert-check/libgcc/libgcov-driver.c:103 73 {*zero_extendsidi2_aarch64} (nil)) (insn 1326 1325 1327 137 (set (reg:DI 726) (plus:DI (reg:DI 725 [ ix ]) (const_int 4 [0x4]))) ../../../gcc-assert-check/libgcc/libgcov-driver.c:103 87 {*adddi3_aarch64} (expr_list:REG_DEAD (reg:DI 725 [ ix ]) (nil))) (insn 1327 1326 3536 137 (set (reg/f:DI 727) (mem/f:DI (plus:DI (mult:DI (reg:DI 726) (const_int 8 [0x8])) (reg/v/f:DI 571 [ list ])) [2 MEM[(const struct gcov_info *)list_372].merge S8 A64])) ../../../gcc-assert-check/libgcc/libgcov-driver.c:103 40 {*movdi_aarch64} Successfully matched this instruction: (set (reg/f:DI 727) (plus:DI (and:DI (mult:DI (subreg:DI (reg/v:SI 197 [ ix ]) 0) (const_int 8 [0x8])) (const_int 34359738360 [0x7fff8])) (reg/v/f:DI 571 [ list ]))) (insn 1326 1325 1327 137 (set (reg:DI 726) (plus:DI (and:DI (mult:DI (subreg:DI (reg/v:SI 197 [ ix ]) 0) (const_int 8 [0x8])) (const_int 34359738360 [0x7fff8])) (reg/v/f:DI 571 [ list ]))) ../../../gcc-assert-check/libgcc/libgcov-driver.c:103 252 {*add_uxtdi_multp2} (nil)) (insn 1327 1326 3536 137 (set (reg/f:DI 727) (mem/f:DI (plus:DI (reg:DI 726) (const_int 32 [0x20])) [2 MEM[(const struct gcov_info *)list_372].merge S8 A64])) ../../../gcc-assert-check/libgcc/libgcov-driver.c:103 40 {*movdi_aarch64} I am going to first send out patch for adding new shift based patterns. Then separate patch test and remove mul patterns.
[Bug target/66049] [6 regression] Few AArch64 extend and add with shift tests generates sub optimal code with trunk gcc 6.0.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66049 --- Comment #9 from vekumar at gcc dot gnu.org --- Author: vekumar Date: Tue May 26 15:32:02 2015 New Revision: 223703 URL: https://gcc.gnu.org/viewcvs?rev=223703&root=gcc&view=rev Log: 2015-05-26 Venkataramanan Kumar PR target/66049 * config/aarch64/aarch64.md (*adds_shift_imm_): New pattern. (*subs_shift_imm_): Likewise. (*adds__shift_): Likewise. (*subs__shift_): Likewise. (*add_uxt_shift2): Likewise. (*add_uxtsi_shift2_uxtw): Likewise. (*sub_uxt_shift2): Likewise. (*sub_uxtsi_shift2_uxtw): Likewise. Modified: trunk/gcc/ChangeLog trunk/gcc/config/aarch64/aarch64.md
[Bug target/66049] [6 regression] Few AArch64 extend and add with shift tests generates sub optimal code with trunk gcc 6.0.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66049 vekumar at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #10 from vekumar at gcc dot gnu.org --- Fixed at r223703
[Bug target/63949] Aarch64 instruction combiner does not optimize subsi_sxth function as expected (gcc.target/aarch64/extend.c fails)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63949 vekumar at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #12 from vekumar at gcc dot gnu.org --- Fixed at r222874 https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=222874 2015-05-07 Venkataramanan Kumar * combine.c (make_compound_operation): Remove checks for PLUS/MINUS rtx type.
[Bug target/67717] New: [6.0 regression] ICE when compiling WRF benchmark from cpu2006 with -Ofast -march=bdver4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67717 Bug ID: 67717 Summary: [6.0 regression] ICE when compiling WRF benchmark from cpu2006 with -Ofast -march=bdver4 Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: vekumar at gcc dot gnu.org Target Milestone: --- gfortran -c -o module_cu_gd.fppized.o -I. -I./netcdf/include -march=bdver4 -Ofast -fno-second-underscore module_cu_gd.fppized.f90 module_cu_gd.fppized.f90:1302:0: (Snip) END SUBROUTINE CUP_enss ^ Error: insn does not satisfy its constraints: (insn 8179 8178 14983 589 (parallel [ (set (reg:V4SF 23 xmm2 [orig:2240 vect__1137.6099 ] [2240]) (unspec:V4SF [ (reg:V4SF 23 xmm2 [orig:2240 vect__1137.6099 ] [2240]) (mem:SF (unspec:DI [ (reg/f:DI 4 si [4841]) (reg:V2DI 21 xmm0 [orig:6879 vect__1136.6096 ] [6879]) (const_int 4 [0x4]) ] UNSPEC_VSIBADDR) [0 S4 A8]) (mem:BLK (scratch) [0 A8]) (reg:V4SF 23 xmm2 [orig:2240 vect__1137.6099 ] [2240]) ] UNSPEC_GATHER)) (clobber (reg:V4SF 23 xmm2 [orig:2240 vect__1137.6099 ] [2240])) ]) module_cu_gd.fppized.f90:1102 4603 {*avx2_gatherdiv4sf} (nil)) module_cu_gd.fppized.f90:1302:0: internal compiler error: in extract_constrain_insn, at recog.c:2200 0xaf0548 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*) ../../gcc-fsf-trunk/gcc/rtl-error.c:109 0xaf056f _fatal_insn_not_found(rtx_def const*, char const*, int, char const*) ../../gcc-fsf-trunk/gcc/rtl-error.c:120 0xabe03d extract_constrain_insn(rtx_insn*) ../../gcc-fsf-trunk/gcc/recog.c:2200 0xa9e185 reload_cse_simplify_operands ../../gcc-fsf-trunk/gcc/postreload.c:408 0xa9f245 reload_cse_simplify ../../gcc-fsf-trunk/gcc/postreload.c:194 0xa9f245 reload_cse_regs_1 ../../gcc-fsf-trunk/gcc/postreload.c:233 0xaa0b13 reload_cse_regs ../../gcc-fsf-trunk/gcc/postreload.c:81 0xaa0b13 execute ../../gcc-fsf-trunk/gcc/postreload.c:2350 Please submit a full bug report, with preprocessed source if appropriate. (Snip) I am trying to get a reduced test case. But the Bug seems to starts from r227382 commit 0af99ebfea26293fc900fe9050c5dd514005e4e5 2015-09-01 Vladimir Makarov PR target/61578 * lra-lives.c (process_bb_lives): Process move pseudos with the same value for copies and preferences * lra-constraints.c (match_reload): Create match reload pseudo with the same value from single dying input pseudo.
[Bug target/67717] [6.0 regression] ICE when compiling WRF benchmark from cpu2006 with -Ofast -march=bdver4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67717 --- Comment #2 from vekumar at gcc dot gnu.org --- Yes, reproducible with today's trunk: gcc version 6.0.0 20150925 (experimental) (GCC).
[Bug target/67717] [6.0 regression] ICE when compiling WRF benchmark from cpu2006 with -Ofast -march=bdver4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67717 --- Comment #3 from vekumar at gcc dot gnu.org --- (In reply to vekumar from comment #2) > yes reproducible with today's trunk. > gcc version 6.0.0 20150925 (experimental) (GCC) I meant the ICE still shows up on trunk.
[Bug target/66171] [6 Regression]: gcc.target/cris/biap.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66171 --- Comment #1 from vekumar at gcc dot gnu.org --- Yes, the canonical RTL is retained and is an LSHIFT here. Maybe we need to adjust the machine descriptions to be based on shifts.
[Bug target/63949] Aarch64 instruction combiner does not optimize subsi_sxth function as expected (gcc.target/aarch64/extend.c fails)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63949 --- Comment #6 from vekumar at gcc dot gnu.org ---
In the function make_compound_operation there is a check:

  /* See if we have operations between an ASHIFTRT and an ASHIFT.
     If so, try to merge the shifts into a SIGN_EXTEND.  We could
     also do this for some cases of SIGN_EXTRACT, but it doesn't
     seem worth the effort; the case checked for occurs on Alpha.  */

  if (!OBJECT_P (lhs)
      && ! (GET_CODE (lhs) == SUBREG
            && (OBJECT_P (SUBREG_REG (lhs))))
      && CONST_INT_P (rhs)
      && INTVAL (rhs) < HOST_BITS_PER_WIDE_INT
      && INTVAL (rhs) < mode_width
      && (new_rtx = extract_left_shift (lhs, INTVAL (rhs))) != 0)
    new_rtx = make_extraction (mode, make_compound_operation (new_rtx, next_code),
                               0, NULL_RTX, mode_width - INTVAL (rhs),
                               code == LSHIFTRT, 0, in_code == COMPARE);

  break;

Our input RTL actually matches this case. For (int)i << 1 we are getting the incoming RTX as

(ashiftrt:SI (ashift:SI (reg:SI 1 x1 [ i ])
        (const_int 16 [0x10]))
    (const_int 15 [0xf]))

The LHS is (ashift:SI (reg:SI 1 x1 [ i ]) (const_int 16 [0x10])), the outer code is an ASHIFTRT, and the RHS is 15. So basically we get (i<<16)>>15, and we can merge these shifts into a sign_extend. With extract_left_shift we get (ashift:SI (reg:SI 1 x1 [ i ]) (const_int 1 [0x1])), i.e. x1<<1. When we do make_extraction with this shift pattern we get

(ashift:SI (sign_extend:SI (reg:HI 1 x1 [ i ]))
    (const_int 1 [0x1]))

But instead of passing this shift RTX, we are actually passing a MULT RTX to make_extraction via another make_compound_operation call:

(gdb) p make_compound_operation(new_rtx, MEM)
$3 = (rtx_def *) 0x777fd420
(gdb) pr
(mult:SI (reg:SI 1 x1 [ i ])
    (const_int 2 [0x2]))

which results in

(subreg:SI (sign_extract:DI (mult:DI (reg:DI 1 x1 [ i ])
            (const_int 2 [0x2]))
        (const_int 17 [0x11])
        (const_int 0 [0])) 0)

When I changed the original check to

--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -7896,7 +7896,7 @@ make_compound_operation (rtx x, enum rtx_code in_code)
       && INTVAL (rhs) < HOST_BITS_PER_WIDE_INT
       && INTVAL (rhs) < mode_width
       && (new_rtx = extract_left_shift (lhs, INTVAL (rhs))) != 0)
-    new_rtx = make_extraction (mode, make_compound_operation (new_rtx, next_
+    new_rtx = make_extraction (mode, new_rtx,
                                0, NULL_RTX, mode_width - INTVAL (rhs),
                                code == LSHIFTRT, 0, in_code == COMPARE)

the combiner was able to match the pattern successfully:

Trying 8 -> 13:
Successfully matched this instruction:
(set (reg/i:SI 0 x0)
    (minus:SI (reg:SI 0 x0 [ a ])
        (ashift:SI (sign_extend:SI (reg:HI 1 x1 [ i ]))
            (const_int 1 [0x1]))))

Any comments about this change?
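To spell out the identity combine is exploiting here, a hedged standalone example (made-up functions; it assumes a 32-bit int and 16-bit short, and at the C level the left shift of a negative value is formally undefined, whereas the RTL in this bug is well defined):

/* (i << 16) >> 15 keeps only the low 16 bits of i, sign-extends them,
   and multiplies by 2 -- the same value as sign-extending the low half
   and shifting left by one.  */
int as_two_shifts (int i)
{
  return (i << 16) >> 15;        /* ashiftrt (ashift i 16) 15 */
}

int as_extend_and_shift (int i)
{
  return (int)(short) i << 1;    /* ashift (sign_extend i) 1 */
}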
[Bug target/63949] Aarch64 instruction combiner does not optimize subsi_sxth function as expected (gcc.target/aarch64/extend.c fails)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63949 --- Comment #7 from vekumar at gcc dot gnu.org ---
I ran the GCC tests against the patch and found one failure.

int adds_shift_ext ( long long a, int b, int c)
{
  long long d = (a + ((long long)b << 3));

  if (d == 0)
    return a + c;
  else
    return b + d + c;
}

The test expects an adds to be generated, and before my fix it was:

adds_shift_ext:
        adds    x3, x0, x1, sxtw 3  // 11   *adds_extvdi_multp2 [length = 4]
        beq     .L5                 // 12   *condjump [length = 4]

But with my patch I now generate sign extends with shifts instead of sign extends with mult:

adds_shift_ext:
        add     x3, x0, x1, sxtw 3  // 10   *add_extendsi_shft_di [length = 4]
        cbz     x3, .L5             // 12   *cbeqdi1 [length = 4]

We don't have an *adds_extendsi_shft_di pattern; we have patterns such as adds_extvdi_multp2 that match the extend over a mult. Adding one will help optimize this case. But my concern is: what if other targets hit the same issue?
[Bug target/63949] Aarch64 instruction combiner does not optimize subsi_sxth function as expected (gcc.target/aarch64/extend.c fails)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63949 --- Comment #8 from vekumar at gcc dot gnu.org --- This is complete patch for the first approach that I took (comment 6). This patch fixes issues I faced while testing. But I have added extra patterns to cater the sign extended operands with left shifts. This might impact other targets as well :( Now I am also exploring other possibilities instead of writing extra patterns. diff --git a/gcc/combine.c b/gcc/combine.c index ee7b3f9..80b345d 100644 --- a/gcc/combine.c +++ b/gcc/combine.c @@ -7896,7 +7896,7 @@ make_compound_operation (rtx x, enum rtx_code in_code) && INTVAL (rhs) < HOST_BITS_PER_WIDE_INT && INTVAL (rhs) < mode_width && (new_rtx = extract_left_shift (lhs, INTVAL (rhs))) != 0) - new_rtx = make_extraction (mode, make_compound_operation (new_rtx, next_ + new_rtx = make_extraction (mode, new_rtx, 0, NULL_RTX, mode_width - INTVAL (rhs), code == LSHIFTRT, 0, in_code == COMPARE); diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index 97d7009..f0b9240 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -1570,26 +1570,62 @@ [(set_attr "type" "alus_ext")] ) -(define_insn "*adds__multp2" +(define_insn "*adds__extend_ashift" [(set (reg:CC_NZ CC_REGNUM) (compare:CC_NZ -(plus:GPI (ANY_EXTRACT:GPI - (mult:GPI (match_operand:GPI 1 "register_operand" "r") - (match_operand 2 "aarch64_pwr_imm3" "Up3")) - (match_operand 3 "const_int_operand" "n") - (const_int 0)) - (match_operand:GPI 4 "register_operand" "r")) +(plus:GPI (match_operand:GPI 1 "register_operand" "r") + (ashift:GPI (ANY_EXTEND:GPI +(match_operand:ALLX 2 "register_operand" "r")) + (match_operand 3 "aarch64_imm3" "Ui3"))) (const_int 0))) (set (match_operand:GPI 0 "register_operand" "=r") - (plus:GPI (ANY_EXTRACT:GPI (mult:GPI (match_dup 1) (match_dup 2)) - (match_dup 3) - (const_int 0)) - (match_dup 4)))] + (plus:GPI (match_dup 1) + (ashift:GPI (ANY_EXTEND:GPI (match_dup 2)) + (match_dup 3] + "" + "adds\\t%0, %1, %2, xt %3" + [(set_attr "type" "alus_ext")] +) + +(define_insn "*subs__extend_ashift" + [(set (reg:CC_NZ CC_REGNUM) +(compare:CC_NZ + (minus:GPI (match_operand:GPI 1 "register_operand" "r") +(ashift:GPI (ANY_EXTEND:GPI + (match_operand:ALLX 2 "register_operand" "r") + (match_operand 3 "aarch64_imm3" "Ui3"))) +(const_int 0))) + (set (match_operand:GPI 0 "register_operand" "=r") +(minus:GPI (match_dup 1) + (ashift:GPI (ANY_EXTEND:GPI (match_dup 2)) +(match_dup 3] + "" + "subs\\t%0, %1, %2, xt %3" + [(set_attr "type" "alus_ext")] +) + + +(define_insn "*adds__multp2" + [(set (reg:CC_NZ CC_REGNUM) +(compare:CC_NZ + (plus:GPI (ANY_EXTRACT:GPI +(mult:GPI (match_operand:GPI 1 "register_operand" "r") + (match_operand 2 "aarch64_pwr_imm3" "Up3")) +(match_operand 3 "const_int_operand" "n") +(const_int 0)) + (match_operand:GPI 4 "register_operand" "r")) +(const_int 0))) + (set (match_operand:GPI 0 "register_operand" "=r") +(plus:GPI (ANY_EXTRACT:GPI (mult:GPI (match_dup 1) (match_dup 2)) + (match_dup 3) + (const_int 0)) + (match_dup 4)))] "aarch64_is_extend_from_extract (mode, operands[2], operands[3])" "adds\\t%0, %4, %1, xt%e3 %p2" [(set_attr "type" "alus_ext")] ) + (define_insn "*subs__multp2" [(set (reg:CC_NZ CC_REGNUM) (compare:CC_NZ
[Bug rtl-optimization/64537] New: Aarch64 redundant sxth instruction gets generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64537 Bug ID: 64537 Summary: Aarch64 redundant sxth instruction gets generated Product: gcc Version: 5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vekumar at gcc dot gnu.org

For the test case below a redundant sxth instruction gets generated.

int adds_shift_ext ( long long a, short b, int c)
{
  long long d = (a - ((long long)b << 3));

  if (d == 0)
    return a + c;
  else
    return b + d + c;
}

adds_shift_ext:
        sxth    w1, w1              // 3    *extendhisi2_aarch64/1 [length = 4]  <== 1
        subs    x3, x0, x1, sxth 3  // 11   *subs_extvdi_multp2 [length = 4]     <== 2
        beq     .L5                 // 12   *condjump [length = 4]
        add     w0, w1, w2          // 19   *addsi3_aarch64/2 [length = 4]
        add     w0, w0, w3          // 20   *addsi3_aarch64/2 [length = 4]
        ret                         // 57   simple_return [length = 4]
        .p2align 2
.L5:
        add     w0, w2, w0          // 14   *addsi3_aarch64/2 [length = 4]
        ret                         // 55   simple_return [length = 4]

<== 1 is not needed.
[Bug target/63949] Aarch64 instruction combiner does not optimize subsi_sxth function as expected (gcc.target/aarch64/extend.c fails)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63949 --- Comment #10 from vekumar at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #9)
> A MULT by a constant power of 2 is not canonical RTL (well, not what
> simplify_rtx would give you); combine shouldn't generate this.

In that case we think we need to fix this heuristic, which assumes a "MEM" operation when we encounter a MINUS RTX in make_compound_operation:

  /* Select the code to be used in recursive calls.  Once we are inside an
     address, we stay there.  If we have a comparison, set to COMPARE,
     but once inside, go back to our default of SET.  */

  next_code = (code == MEM ? MEM
               : ((code == PLUS || code == MINUS)
                  && SCALAR_INT_MODE_P (mode)) ? MEM
               : ((code == COMPARE || COMPARISON_P (x))
                  && XEXP (x, 1) == const0_rtx) ? COMPARE
               : in_code == COMPARE ? SET : in_code);

And later on make_compound_operation converts the shift pattern to a MULT:

    case ASHIFT:
      /* Convert shifts by constants into multiplications if inside
         an address.  */
      if (in_code == MEM && CONST_INT_P (XEXP (x, 1))
          && INTVAL (XEXP (x, 1)) < HOST_BITS_PER_WIDE_INT
          && INTVAL (XEXP (x, 1)) >= 0
          && SCALAR_INT_MODE_P (mode))
        {
[Bug sanitizer/63850] Building TSAN for Aarch64 results in assembler Error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63850 vekumar at gcc dot gnu.org changed: What|Removed |Added CC||vekumar at gcc dot gnu.org --- Comment #4 from vekumar at gcc dot gnu.org --- We made some changes in the Makefile and configure under libsanitizer to make it build for AArch64. As clyon said, the local experiments were done in a GCC tree. Just capturing the changes here. Ref: https://git.linaro.org/toolchain/gcc.git/commit/07178f9e98be4fc1efad5c5d7c4fed7627c17e1f
[Bug sanitizer/63850] Building TSAN for Aarch64 results in assembler Error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63850 --- Comment #7 from vekumar at gcc dot gnu.org ---
(In reply to clyon from comment #6)
> Venkat,
> Can you submit your GCC patch, in an accepable way? (no change to sanitizer
> libs code, and obviously do not activate tsan by default)

Okay, I will send out a patch that modifies libsanitizer/configure.ac and libsanitizer/tsan/Makefile.am so that tsan_rtl_amd64.S is not built for other targets.

--- a/libsanitizer/tsan/tsan_rtl.h
+++ b/libsanitizer/tsan/tsan_rtl.h
@@ -679,7 +679,7 @@ void AcquireReleaseImpl(ThreadState *thr, uptr pc, SyncClock *c);
 // The trick is that the call preserves all registers and the compiler
 // does not treat it as a call.
 // If it does not work for you, use normal call.
-#if TSAN_DEBUG == 0
+#if defined(__x86_64__) && TSAN_DEBUG == 0

Is this change also acceptable?
[Bug tree-optimization/64946] New: For Aarch64, vectorization with "abs" instruction is not hapenning with vector elements of char/short type.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64946 Bug ID: 64946 Summary: For Aarch64, vectorization with "abs" instruction is not happening with vector elements of char/short type. Product: gcc Version: 5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vekumar at gcc dot gnu.org For the below test case:
signed char a[100], b[100];
void absolute_s8 (void)
{
  int i;
  for (i = 0; i < 16; i++)
    a[i] = (b[i] > 0 ? b[i] : -b[i]);
}
gcc version 5.0.0 20150203 (experimental) (GCC) with -O3 -S on aarch64-none-linux-gnu generates the following assembly:
absolute_s8:
        adrp    x1, b
        adrp    x0, a
        add     x1, x1, :lo12:b
        add     x0, x0, :lo12:a
        ldr     q0, [x1]            <== loads vector of 16 char elements
        sshll   v1.8h, v0.8b, 0     <==
        sshll2  v0.8h, v0.16b, 0    <==
        sshll   v3.4s, v1.4h, 0     <==
        sshll   v2.4s, v0.4h, 0     <==
        sshll2  v1.4s, v1.8h, 0     <==
        sshll2  v0.4s, v0.8h, 0     <== promotes every element to "int"
        abs     v3.4s, v3.4s        <== performs abs as vector of ints
        abs     v2.4s, v2.4s
        abs     v1.4s, v1.4s
        abs     v0.4s, v0.4s
        xtn     v4.4h, v3.4s
        xtn2    v4.8h, v1.4s
        xtn     v1.4h, v2.4s
        xtn2    v1.8h, v0.4s
        xtn     v0.8b, v4.8h
        xtn2    v0.16b, v1.8h
        str     q0, [x0]
        ret
Vectorization is done in INT or SI mode although Aarch64 supports abs v0.16b, v0.16b.
Expected code:
absolute_s8:
        adrp    x1, b
        adrp    x0, a
        add     x1, x1, :lo12:b
        add     x0, x0, :lo12:a
        ldr     q0, [x1]            <== loads vector of 16 char elements
        abs     v0.16b, v0.16b      <== abs in vector of chars
        str     q0, [x0]
        ret
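For reference, a minimal companion test for the short case, analogous to the char test above; the array and function names here are my own and not taken from the testsuite:
/* Companion case for short elements; like the char version, this should
   ideally vectorize to a single "abs v0.8h, v0.8h" per 16-byte chunk.  */
short sa[100], sb[100];
void absolute_s16 (void)
{
  int i;
  for (i = 0; i < 16; i++)
    sa[i] = (sb[i] > 0 ? sb[i] : -sb[i]);
}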
[Bug tree-optimization/64946] For Aarch64, vectorization with "abs" instruction is not hapenning with vector elements of char/short type.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64946 --- Comment #1 from vekumar at gcc dot gnu.org --- The test case is taken from gcc.target/aarch64/vect-abs-compile.c.
[Bug tree-optimization/64946] [AArch64] gcc.target/aarch64/vect-abs-compile.c - "abs" vectorization fails for char/short types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64946 vekumar at gcc dot gnu.org changed: What|Removed |Added CC||rguenth at gcc dot gnu.org --- Comment #3 from vekumar at gcc dot gnu.org --- Richard, as per your suggestion, adding a pattern for type demotion in match.pd solves this.
(simplify
 (convert (abs (convert@1 @0)))
 (if (INTEGRAL_TYPE_P (type)
      /* We check for type compatibility between @0 and @1 below,
         so there's no need to check that @1/@3 are integral types.  */
      && INTEGRAL_TYPE_P (TREE_TYPE (@0))
      && INTEGRAL_TYPE_P (TREE_TYPE (@1))
      /* The precision of the type of each operand must match the
         precision of the mode of each operand, similarly for the
         result.  */
      && (TYPE_PRECISION (TREE_TYPE (@0))
          == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (@0))))
      && (TYPE_PRECISION (TREE_TYPE (@1))
          == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (@1))))
      && TYPE_PRECISION (type) == GET_MODE_PRECISION (TYPE_MODE (type))
      /* The inner conversion must be a widening conversion.  */
      && TYPE_PRECISION (TREE_TYPE (@1)) > TYPE_PRECISION (TREE_TYPE (@0))
      && ((GENERIC
           && (TYPE_MAIN_VARIANT (TREE_TYPE (@0))
               == TYPE_MAIN_VARIANT (type)))
          || (GIMPLE
              && types_compatible_p (TREE_TYPE (@0), type))))
  (abs @0)))
I have not yet tested it. Will it have implications on targets that do not support vectorization with short/char types?
[Bug tree-optimization/64946] [AArch64] gcc.target/aarch64/vect-abs-compile.c - "abs" vectorization fails for char/short types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64946 --- Comment #6 from vekumar at gcc dot gnu.org --- (In reply to Andrew Pinski from comment #5) > I think you should always use an unsigned type here so it will be defined in > the IR. This is mentioned in bug 22199#c3 . Andrew, I missed including something like this:
+ (if (TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
+  (convert (op @0 @1)))
as in https://gcc.gnu.org/viewcvs?rev=220695&root=gcc&view=rev
[Bug tree-optimization/64946] [AArch64] gcc.target/aarch64/vect-abs-compile.c - "abs" vectorization fails for char/short types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64946 --- Comment #9 from vekumar at gcc dot gnu.org --- This match.pd pattern vectorizes the PR but works only with -fwrapv.
(simplify
 (convert (abs (convert@1 @0)))
 (if (INTEGRAL_TYPE_P (type)
      /* We check for type compatibility between @0 and @1 below,
         so there's no need to check that @1/@3 are integral types.  */
      && INTEGRAL_TYPE_P (TREE_TYPE (@0))
      && INTEGRAL_TYPE_P (TREE_TYPE (@1))
      /* The precision of the type of each operand must match the
         precision of the mode of each operand, similarly for the
         result.  */
      && (TYPE_PRECISION (TREE_TYPE (@0))
          == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (@0))))
      && (TYPE_PRECISION (TREE_TYPE (@1))
          == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (@1))))
      && TYPE_PRECISION (type) == GET_MODE_PRECISION (TYPE_MODE (type))
      /* The inner conversion must be a widening conversion.  */
      && TYPE_PRECISION (TREE_TYPE (@1)) > TYPE_PRECISION (TREE_TYPE (@0))
      && ((GENERIC
           && (TYPE_MAIN_VARIANT (TREE_TYPE (@0))
               == TYPE_MAIN_VARIANT (type)))
          || (GIMPLE
              && types_compatible_p (TREE_TYPE (@0), type))))
  (if (TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
   (abs @0))))
For the default case (when -fwrapv is not given), doing ABS_EXPR on a short type invokes undefined behaviour when the value is -32768, and similarly for the signed char minimum. As per Richard's suggestion, we need to move to a new tree code ABSU_EXPR to do this type of folding optimization.
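To make the concern concrete, here is a small standalone sketch (my own, not from the PR) showing why doing the absolute value directly in the narrow type is not safe without wrapping semantics:
#include <limits.h>
#include <stdio.h>
/* abs() of SHRT_MIN does not fit in a short: |-32768| = 32768 > SHRT_MAX.
   Doing the absolute value in int first (as the widened IR does) is well
   defined; doing it directly on the short value would overflow, which is
   why an unsigned-result ABSU_EXPR (or -fwrapv) is needed before the
   operation can be narrowed safely.  */
int main (void)
{
  short s = SHRT_MIN;
  int wide = s < 0 ? -(int) s : (int) s;  /* 32768, well defined in int.  */
  short narrow = (short) wide;            /* converts back to -32768 on the
                                             usual two's-complement targets.  */
  printf ("%d %d\n", wide, (int) narrow);
  return 0;
}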
[Bug c/65287] Current trunk ICE in address_matters_p, at symtab.c:1908
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65287 vekumar at gcc dot gnu.org changed: What|Removed |Added CC||vekumar at gcc dot gnu.org --- Comment #3 from vekumar at gcc dot gnu.org --- Also faced this bug when I tried to cross-compile for aarch64-none-linux-gnu.
[Bug target/63949] New: Aarch64 instruction combiner does not optimize subsi_sxth function as expected
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63949 Bug ID: 63949 Summary: Aarch64 instruction combiner does not optimize subsi_sxth function as expected Product: gcc Version: 5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: vekumar at gcc dot gnu.org Reference: https://bugs.linaro.org/show_bug.cgi?id=863 Test case:
int subsi_sxth (int a, short i)
{
  /* { dg-final { scan-assembler "sub\tw\[0-9\]+,.*sxth #?1" } } */
  return a - ((int)i << 1);
}
Assembly generated with GCC 5.0.0 20141114:
subsi_sxth:
        sbfiz   w1, w1, 1, 16
        sub     w0, w0, w1
        ret
Expected:
        sub     w0, w0, w1, sxth 1
The combiner says it failed to match:
(set (reg/i:SI 0 x0)
     (minus:SI (reg:SI 0 x0 [ a ])
               (subreg:SI (sign_extract:DI (mult:DI (reg:DI 1 x1 [ i ])
                                                    (const_int 2 [0x2]))
                                           (const_int 17 [0x11])
                                           (const_int 0 [0])) 0)))
We have a pattern in the aarch64.md file that should match, but it is not recognized.
(define_insn "*sub__multp2"
  [(set (match_operand:GPI 0 "register_operand" "=rk")
        (minus:GPI (match_operand:GPI 4 "register_operand" "r")
                   (ANY_EXTRACT:GPI
                    (mult:GPI (match_operand:GPI 1 "register_operand" "r")
                              (match_operand 2 "aarch64_pwr_imm3" "Up3"))
                    (match_operand 3 "const_int_operand" "n")
                    (const_int 0))))]
  "aarch64_is_extend_from_extract (mode, operands[2], operands[3])"
  "sub\\t%0, %4, %1, xt%e3 %p2"
  [(set_attr "type" "alu_ext")]
)
[Bug bootstrap/61440] Bootstrap failure with --with-build-config=bootstrap-lto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61440 vekumar at gcc dot gnu.org changed: What|Removed |Added CC||vekumar at gcc dot gnu.org --- Comment #5 from vekumar at gcc dot gnu.org --- Yes, same issue. A workaround is to configure with --enable-stage1-checking=release, as mentioned in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62077#c54
[Bug target/63949] Aarch64 instruction combiner does not optimize subsi_sxth function as expected (gcc.target/aarch64/extend.c fails)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63949 --- Comment #4 from vekumar at gcc dot gnu.org --- (In reply to Richard Earnshaw from comment #3) > make_extraction is unable to generate bit-field extractions in more than one > mode. This causes the extractions that it does generate to be wrapped in > subregs when SImode results are wanted. > > Ideally, we should teach make_extraction to be more sensible, but I'm not > sure what the impact of that would be on other ports that really can only > support one mode for bit-field extracts. Yes, make_extraction converts the mult into a sign_extract RTL, from
(mult:SI (reg:SI 1 x1 [i]) (const_int 2 [0x2]))
to
(subreg:SI (sign_extract:DI (ashift:DI (reg:DI 1 x1 [i]) (const_int 1 [0x1])) (const_int 17 [0x11])) (const_int 0 [0x0]))
[Bug target/63949] Aarch64 instruction combiner does not optimize subsi_sxth function as expected (gcc.target/aarch64/extend.c fails)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63949 --- Comment #5 from vekumar at gcc dot gnu.org --- Richard, what is the function get_best_reg_extraction_insn supposed to do in make_extraction?
[Bug bootstrap/68667] New: GCC trunk build fails compiling graphite-isl-ast-to-gimple.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68667 Bug ID: 68667 Summary: GCC trunk build fails compiling graphite-isl-ast-to-gimple.c Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: bootstrap Assignee: unassigned at gcc dot gnu.org Reporter: vekumar at gcc dot gnu.org Target Milestone: --- Target: x86_64-*-* Build breaks while compiling a graphite-related file. Occurred with trunk r231212.
(Snip)
../../gcc-fsf-trunk/gcc/graphite-isl-ast-to-gimple.c: In member function
'tree_node* translate_isl_ast_to_gimple::binary_op_to_tree(tree, isl_ast_expr*, ivs_params&)':
../../gcc-fsf-trunk/gcc/graphite-isl-ast-to-gimple.c:591:10: error: 'isl_ast_op_zdiv_r' was not declared in this scope
     case isl_ast_op_zdiv_r:
          ^
../../gcc-fsf-trunk/gcc/graphite-isl-ast-to-gimple.c: In member function
'tree_node* translate_isl_ast_to_gimple::gcc_expression_from_isl_expr_op(tree, isl_ast_expr*, ivs_params&)':
../../gcc-fsf-trunk/gcc/graphite-isl-ast-to-gimple.c:762:10: error: 'isl_ast_op_zdiv_r' was not declared in this scope
     case isl_ast_op_zdiv_r:
          ^
Makefile:1085: recipe for target 'graphite-isl-ast-to-gimple.o' failed
make[2]: *** [graphite-isl-ast-to-gimple.o] Error 1
(Snip)
[Bug bootstrap/68667] GCC trunk build fails compiling graphite-isl-ast-to-gimple.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68667 --- Comment #1 from vekumar at gcc dot gnu.org --- (In reply to vekumar from comment #0)
> Build breaks while compiling a graphite-related file.
> Occurred with trunk r231212.
>
> (Snip)
> ../../gcc-fsf-trunk/gcc/graphite-isl-ast-to-gimple.c: In member function
> 'tree_node* translate_isl_ast_to_gimple::binary_op_to_tree(tree, isl_ast_expr*, ivs_params&)':
> ../../gcc-fsf-trunk/gcc/graphite-isl-ast-to-gimple.c:591:10: error:
> 'isl_ast_op_zdiv_r' was not declared in this scope
>      case isl_ast_op_zdiv_r:
>           ^
> ../../gcc-fsf-trunk/gcc/graphite-isl-ast-to-gimple.c: In member function
> 'tree_node* translate_isl_ast_to_gimple::gcc_expression_from_isl_expr_op(tree, isl_ast_expr*, ivs_params&)':
> ../../gcc-fsf-trunk/gcc/graphite-isl-ast-to-gimple.c:762:10: error:
> 'isl_ast_op_zdiv_r' was not declared in this scope
>      case isl_ast_op_zdiv_r:
>           ^
> Makefile:1085: recipe for target 'graphite-isl-ast-to-gimple.o' failed
> make[2]: *** [graphite-isl-ast-to-gimple.o] Error 1
> (Snip)
I ran ./contrib/download_prerequisites in the gcc folder and it downloaded isl-0.14.
[Bug tree-optimization/68417] [6 Regression] Missed vectorization opportunity when setting struct field
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68417 --- Comment #4 from vekumar at gcc dot gnu.org --- Older trunk (gcc version 6.0.0 20151202 (experimental) (GCC)) showed:
  iftmp.1_19 = p1_36->y;
  tree could trap...
Today's trunk (gcc version 6.0.0 20151209):
  Applying if-conversion
  new phi replacement stmt
  iftmp.1_6 = m1_16 >= m2_18 ? iftmp.1_19 : iftmp.1_20;
  Removing basic block 4
  basic block 4, loop depth 1
[Bug tree-optimization/68417] [6 Regression] Missed vectorization opportunity when setting struct field
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68417 --- Comment #5 from vekumar at gcc dot gnu.org --- Richard,
STMT: m1 = p1->x - m;
While hashing p1->x, which is a component ref, we hash the operand 0 part, i.e. TREE_OPERAND (ref, 0). This is unconditionally read and written.
STMT: p3->y = (m1 >= m2) ? p1->y : p2->y;
Now for p1->y the master DR is what we hashed previously for p1->x, since p1->y is also a component ref. The fix for PR 68583, https://gcc.gnu.org/viewcvs?rev=231444&root=gcc&view=rev, checks whether the candidate DR is unconditionally accessed elsewhere and whether it is a read:
  /* If a is unconditionally accessed then ... */
  if (DR_RW_UNCONDITIONALLY (*master_dr))
    {
      /* an unconditional read won't trap. */
      if (DR_IS_READ (a))
        return true;
This now holds true, hence if-conversion is applied. Since this is expected, can we close this PR as fixed?
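A minimal sketch (my own reconstruction, not the PR's original testcase) of the kind of loop these statements come from; the struct and field names simply follow the quoted statements:
/* p1->x is read unconditionally, so its master DR is marked as accessed
   unconditionally; the conditional reads of p1->y / p2->y can then be
   if-converted into a COND_EXPR that the vectorizer can handle.  */
struct pt { int x; int y; };
void
foo (struct pt *p1, struct pt *p2, struct pt *p3, int m, int n)
{
  for (int i = 0; i < n; i++)
    {
      int m1 = p1[i].x - m;
      int m2 = p2[i].x - m;
      p3[i].y = (m1 >= m2) ? p1[i].y : p2[i].y;
    }
}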
[Bug tree-optimization/58135] [x86] Missed opportunities for partial SLP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58135 vekumar at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2015-12-10 Ever confirmed|0 |1
[Bug middle-end/68621] [6 Regression] FAIL: gcc.dg/tree-ssa/ifc-8.c scan-tree-dump-times ifcvt "Applying if-conversion" 1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68621 vekumar at gcc dot gnu.org changed: What|Removed |Added CC||vekumar at gcc dot gnu.org --- Comment #1 from vekumar at gcc dot gnu.org --- We do this optimization under -fno-common. With -fpic that option has no effect, and the array definition is assumed to be one that can be overridden (interposed).
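A small illustration of the interposition concern (my own example, not from the test case): with -fpic the definition below may be replaced by one from another module, so the compiler cannot treat it the way it can under -fno-common non-PIC code.
/* a.c: built with -fpic into a shared library, this definition of "arr" is
   interposable, i.e. the executable or another DSO may provide its own
   "arr" and all references will bind to that one.  Optimizations that rely
   on this exact definition are therefore not valid unless something like
   -fno-semantic-interposition or hidden visibility guarantees otherwise.  */
int arr[16];
int
sum (void)
{
  int s = 0;
  for (int i = 0; i < 16; i++)
    s += arr[i];
  return s;
}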
[Bug target/65951] [AArch64] Will not vectorize 64bit integer multiplication
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65951 vekumar at gcc dot gnu.org changed: What|Removed |Added CC||vekumar at gcc dot gnu.org --- Comment #6 from vekumar at gcc dot gnu.org --- I found a similar pattern in the SPEC2006 hmmer benchmark when comparing x86_64 (-O3 -march=bdver3) vs. AArch64 (-O3 -mcpu=cortex-a57). x86_64 was able to vectorize 5 additional loops. Of those 5 loops, two were cost-model related: AArch64 rejects them because the vector cost comes out too high. The remaining three loops are of this pattern; one used a constant of 104, and the other two used multiplication by 4, which could be converted to vector shifts. I made a simple test case and wanted to open a PR; James pointed me to this PR, so I am posting it as a comment instead.
unsigned long int __attribute__ ((aligned (64))) arr[100];
int i;
void test_vector_shifts ()
{
  for (i = 0; i <= 99; i++)
    arr[i] = arr[i] << 2;
}
void test_vectorshift_via_mul ()
{
  for (i = 0; i <= 99; i++)
    arr[i] = arr[i] * 4;
}
Assembly:
        .cpu cortex-a57+fp+simd+crc
        .file   "test.c"
        .text
        .align  2
        .p2align 4,,15
        .global test_vector_shifts
        .type   test_vector_shifts, %function
test_vector_shifts:
        adrp    x0, arr
        add     x0, x0, :lo12:arr
        adrp    x1, arr+800
        add     x1, x1, :lo12:arr+800
        .p2align 2
.L2:
        ldr     q0, [x0]
        shl     v0.2d, v0.2d, 2     <== vector shifts
        str     q0, [x0], 16
        cmp     x0, x1
        bne     .L2
        adrp    x0, i
        mov     w1, 100
        str     w1, [x0, #:lo12:i]
        ret
        .size   test_vector_shifts, .-test_vector_shifts
        .align  2
        .p2align 4,,15
        .global test_vectorshift_via_mul
        .type   test_vectorshift_via_mul, %function
test_vectorshift_via_mul:
        adrp    x0, arr
        add     x0, x0, :lo12:arr
        adrp    x2, arr+800
        add     x2, x2, :lo12:arr+800
        .p2align 2
.L6:
        ldr     x1, [x0]
        lsl     x1, x1, 2
        str     x1, [x0], 8         <== scalar shifts
        cmp     x0, x2
        bne     .L6
        adrp    x0, i
        mov     w1, 100
        str     w1, [x0, #:lo12:i]
        ret
        .size   test_vectorshift_via_mul, .-test_vectorshift_via_mul
[Bug target/65952] [AArch64] Will not vectorize storing induction of pointer addresses for LP64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65952 vekumar at gcc dot gnu.org changed: What|Removed |Added CC||vekumar at gcc dot gnu.org --- Comment #9 from vekumar at gcc dot gnu.org --- As per Richard's suggestion I added a pattern in the vectorizer pattern recognizer. This seems to vectorize this PR. However, I need some help on the following: (1) How do I check the shift amount and also take care of type/signedness? There could be different shift amounts allowed in the target architecture when looking for power-of-2 constants. (2) Do I need to check whether the target architecture supports vectorized shifts before converting the pattern?
---Patch---
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index f034635..995c9b2 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -76,6 +76,10 @@
 static gimple vect_recog_vector_vector_shift_pattern (vec<gimple> *,
                                                       tree *, tree *);
 static gimple vect_recog_divmod_pattern (vec<gimple> *,
                                          tree *, tree *);
+
+static gimple vect_recog_multconst_pattern (vec<gimple> *,
+                                            tree *, tree *);
+
 static gimple vect_recog_mixed_size_cond_pattern (vec<gimple> *,
                                                   tree *, tree *);
 static gimple vect_recog_bool_pattern (vec<gimple> *, tree *, tree *);
@@ -90,6 +94,7 @@ static vect_recog_func_ptr vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
        vect_recog_rotate_pattern,
        vect_recog_vector_vector_shift_pattern,
        vect_recog_divmod_pattern,
+       vect_recog_multconst_pattern,
        vect_recog_mixed_size_cond_pattern,
        vect_recog_bool_pattern};
@@ -2147,6 +2152,90 @@ vect_recog_vector_vector_shift_pattern (vec<gimple> *stmts,
   return pattern_stmt;
 }

+static gimple
+vect_recog_multconst_pattern (vec<gimple> *stmts,
+                              tree *type_in, tree *type_out)
+{
+  gimple last_stmt = stmts->pop ();
+  tree oprnd0, oprnd1, vectype, itype, cond;
+  gimple pattern_stmt, def_stmt;
+  enum tree_code rhs_code;
+  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
+  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
+  bb_vec_info bb_vinfo = STMT_VINFO_BB_VINFO (stmt_vinfo);
+  optab optab;
+  tree q;
+  int dummy_int, prec;
+  stmt_vec_info def_stmt_vinfo;
+
+  if (!is_gimple_assign (last_stmt))
+    return NULL;
+
+  rhs_code = gimple_assign_rhs_code (last_stmt);
+  switch (rhs_code)
+    {
+    case MULT_EXPR:
+      break;
+    default:
+      return NULL;
+    }
+
+  if (STMT_VINFO_IN_PATTERN_P (stmt_vinfo))
+    return NULL;
+
+  oprnd0 = gimple_assign_rhs1 (last_stmt);
+  oprnd1 = gimple_assign_rhs2 (last_stmt);
+  itype = TREE_TYPE (oprnd0);
+  if (TREE_CODE (oprnd0) != SSA_NAME
+      || TREE_CODE (oprnd1) != INTEGER_CST
+      || TREE_CODE (itype) != INTEGER_TYPE
+      || TYPE_PRECISION (itype) != GET_MODE_PRECISION (TYPE_MODE (itype)))
+    return NULL;
+  vectype = get_vectype_for_scalar_type (itype);
+  if (vectype == NULL_TREE)
+    return NULL;
+
+  /* If the target can handle vectorized division or modulo natively,
+     don't attempt to optimize this.  */
+  optab = optab_for_tree_code (rhs_code, vectype, optab_default);
+  if (optab != unknown_optab)
+    {
+      machine_mode vec_mode = TYPE_MODE (vectype);
+      int icode = (int) optab_handler (optab, vec_mode);
+      if (icode != CODE_FOR_nothing)
+        return NULL;
+    }
+
+  prec = TYPE_PRECISION (itype);
+  if (integer_pow2p (oprnd1))
+    {
+      /*if (TYPE_UNSIGNED (itype) || tree_int_cst_sgn (oprnd1) != 1)
+        return NULL;
+      */
+
+      /* Pattern detected.  */
+      if (dump_enabled_p ())
+        dump_printf_loc (MSG_NOTE, vect_location,
+                         "vect_recog_multconst_pattern: detected:\n");
+
+      tree shift;
+
+      shift = build_int_cst (itype, tree_log2 (oprnd1));
+      pattern_stmt
+        = gimple_build_assign (vect_recog_temp_ssa_var (itype, NULL),
+                               LSHIFT_EXPR, oprnd0, shift);
+      if (dump_enabled_p ())
+        dump_gimple_stmt_loc (MSG_NOTE, vect_location, TDF_SLIM, pattern_stmt,
+                              0);
+
+      stmts->safe_push (last_stmt);
+
+      *type_in = vectype;
+      *type_out = vectype;
+      return pattern_stmt;
+    }
+  return NULL;
+}

 /* Detect a signed division by a constant that wouldn't be
    otherwise vectorized:
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 48c1f8d..833fe4b 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1131,7 +1131,7 @@ extern void vect_slp_transform_bb (basic_block);
    Additional pattern recognition functions can (and will) be added
    in the future.  */
 typedef gimple (* vect_recog_func_ptr) (vec<gimple> *, tree *, tree *);
-#defin
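For reference, a minimal sketch (not part of the patch) of the transformation the recognizer performs at the source level, assuming the target has vector shifts but no 64-bit vector multiply:
/* What vect_recog_multconst_pattern does conceptually: a multiplication by a
   power-of-2 constant is rewritten as a left shift by log2 of that constant,
   so the loop below can be vectorized with vector shift instructions even on
   targets without a 64-bit vector multiply.  */
unsigned long mc[100];
void
mul_by_8 (void)
{
  for (int i = 0; i < 100; i++)
    mc[i] = mc[i] * 8;   /* recognized as: mc[i] = mc[i] << 3;  */
}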
[Bug target/65952] [AArch64] Will not vectorize storing induction of pointer addresses for LP64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65952 --- Comment #10 from vekumar at gcc dot gnu.org --- With the patch I get:
loop:
        adrp    x0, array
        ldr     q1, .LC0
        ldr     q2, .LC1
        adrp    x1, ptrs
        add     x1, x1, :lo12:ptrs
        ldr     x0, [x0, #:lo12:array]
        dup     v0.2d, x0
        add     v1.2d, v0.2d, v1.2d  <== vectorized
        add     v0.2d, v0.2d, v2.2d  <== vectorized
        str     q1, [x1]
        str     q0, [x1, 16]
        ret
        .size   loop, .-loop
        .align  4
.LC0:
        .xword  0
        .xword  16
        .align  4
.LC1:
        .xword  32
        .xword  48
[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 Bug 53947 depends on bug 65952, which changed state. Bug 65952 Summary: [AArch64] Will not vectorize storing induction of pointer addresses for LP64 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65952 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug target/65952] [AArch64] Will not vectorize storing induction of pointer addresses for LP64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65952 vekumar at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #11 from vekumar at gcc dot gnu.org --- This is fixed by the patch https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=226675, which vectorizes mult expressions with power-of-2 constants via shifts for targets that have no vector multiplication support.
2015-08-06  Venkataramanan Kumar
        * tree-vect-patterns.c (vect_recog_mult_pattern): New function for
        vectorizing multiplication patterns.
        * tree-vectorizer.h: Adjust the number of patterns.
2015-08-06  Venkataramanan Kumar
        * gcc.dg/vect/vect-mult-pattern-1.c: New test.
        * gcc.dg/vect/vect-mult-pattern-2.c: New test.
[Bug tree-optimization/54803] Rotates are not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54803 vekumar at gcc dot gnu.org changed: What|Removed |Added CC||vekumar at gcc dot gnu.org --- Comment #5 from vekumar at gcc dot gnu.org --- On bdver4, when we enable -march=bdver4 and -mno-prefer-avx128, GCC vectorizes the rotate using YMM registers; otherwise it uses the vprotq instruction.
.L13:
        vmovdqa (%r8,%r9), %ymm0
        incq    %rax
        vpsrlq  $32, %ymm0, %ymm1
        vpsllq  $32, %ymm0, %ymm0
        vpor    %ymm0, %ymm1, %ymm0
        vmovdqa %ymm0, (%rdx,%r9)
        addq    $32, %r9
        cmpq    %rax, %r10
        ja      .L13
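The kind of source loop behind the sequence above (my own reconstruction, not the PR's test case) is a rotate of 64-bit elements by 32, open-coded with shifts and OR:
/* A rotate-by-32 of 64-bit elements; the vectorizer can turn this into
   vpsllq/vpsrlq/vpor as shown above, or a single vprotq on XOP targets.  */
unsigned long long rot_dst[1024], rot_src[1024];
void rotate32 (void)
{
  for (int i = 0; i < 1024; i++)
    rot_dst[i] = (rot_src[i] << 32) | (rot_src[i] >> 32);
}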
[Bug tree-optimization/54803] Rotates are not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54803 --- Comment #6 from vekumar at gcc dot gnu.org --- (In reply to vekumar from comment #5) > On bdver4 when we enable -march=bdver4 and -mno-prefer-avx128 vectorizes > using YMM > Otherwise uses vprotq instruction. > > .L13: > vmovdqa (%r8,%r9), %ymm0 > incq%rax > vpsrlq $32, %ymm0, %ymm1 > vpsllq $32, %ymm0, %ymm0 > vpor%ymm0, %ymm1, %ymm0 > vmovdqa %ymm0, (%rdx,%r9) > addq$32, %r9 > cmpq%rax, %r10 > ja .L13 This is with trunk gcc version 6.0.0 20150810 (experimental) (GCC)
[Bug tree-optimization/67326] [5/6 Regression] -ftree-loop-if-convert-stores does not vectorize conditional assignment (anymore)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67326 vekumar at gcc dot gnu.org changed: What|Removed |Added CC||vekumar at gcc dot gnu.org --- Comment #2 from vekumar at gcc dot gnu.org --- Hi Richard, as a first step I am trying to allow if-conversion under -ftree-loop-if-convert-stores for cases where we know the location is already read unconditionally once and the memory access is both read and written.
__attribute__((aligned(32))) float a[LEN];
void test()
{
  for (int i = 0; i < LEN; i++)
    {
      if (a[i] > (float)0.)   // <== already read here unconditionally
        {
          a[i] = 3;           // <== if we know the access is read and written,
                              //     we can allow if-conversion
        }
    }
}
As you said, for the cases in the PR we need to enhance the if-conversion pass to do bounds checking of the accesses to array "a" using the values of i.
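A minimal sketch (my own, reusing a[] and LEN from the snippet above) of the if-converted form this enables; once the store is unconditional, the body is straight-line code the vectorizer can handle:
/* If-converted equivalent of the loop above: the conditional store becomes
   an unconditional store of a selected value, which is safe here because
   a[i] is already read on every iteration.  */
void test_ifconverted (void)
{
  for (int i = 0; i < LEN; i++)
    a[i] = (a[i] > 0.0f) ? 3.0f : a[i];
}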
[Bug tree-optimization/71992] New: Missed BB SLP vectorization in GCC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71992 Bug ID: 71992 Summary: Missed BB SLP vectorization in GCC Product: gcc Version: tree-ssa Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vekumar at gcc dot gnu.org Target Milestone: --- The below test case fails to vectorize with gcc version 7.0.0 20160724 (experimental) (GCC):
gcc -Ofast -mavx -fvect-cost-model=unlimited slp.c -S -fdump-tree-slp-all
struct st
{
  double x;
  double y;
  double z;
  double p;
  double q;
} *obj;
double a, b, c;
void slp_test ()
{
  obj->x = a*a + 3.0;
  obj->y = b*b + c;
  obj->z = a + b*3.0;
  obj->p = a + b*3.0;
  obj->q = a + b + c;
}
LLVM is able to SLP vectorize; it looks like it creates the vectors [a, c] and [b*3.0, b*b] and does a vector add. GCC is not SLP vectorizing. Group splitting is also not working; I expected the group to get split and these statements to be vectorized:
obj->z = a + b*3.0;
obj->p = a + b*3.0;
Another case:
struct st
{
  double x;
  double y;
  double z;
  double p;
  double q;
} *obj;
double a, b, c;
void slp_test ()
{
  obj->x = a*b;
  obj->y = b + c;
  obj->z = a + b*3.0;
  obj->p = a + b*3.0;
  obj->q = a + b + c;
}
LLVM forms the vectors [b*3.0, a+b] and [a, c] and does a vector addition.
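To illustrate the grouping described for the first test case, here is a hand-vectorized sketch (my own, using GCC vector extensions) that pairs [b*3.0, b*b] with [a, c] so that obj->z/obj->p and obj->y come from one vector add:
/* Hand-vectorized version of the grouping above: one vector addition yields
   b*3.0 + a (stored to obj->z and obj->p) and b*b + c (stored to obj->y).  */
typedef double v2df __attribute__ ((vector_size (16)));
void slp_test_by_hand (void)
{
  v2df lhs = { b * 3.0, b * b };
  v2df rhs = { a, c };
  v2df sum = lhs + rhs;          /* one vector add */
  obj->x = a * a + 3.0;
  obj->z = sum[0];
  obj->p = sum[0];
  obj->y = sum[1];
  obj->q = a + b + c;
}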
[Bug target/77270] Flag -mprfchw is shared with 3dnow for -march=k8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77270 vekumar at gcc dot gnu.org changed: What|Removed |Added CC||vekumar at gcc dot gnu.org --- Comment #8 from vekumar at gcc dot gnu.org --- There are 2 issues.
Issue 1: -mprfchw should be enabled only for targets that support the 3DNow! prefetch (PREFETCHW) feature. On K8, 3DNow! prefetch is not available, so -march=k8 should not set this flag. I can see the behavior is now corrected with Uros' change, although I still have to verify the changes done for the other targets.
Issue 2: The prefetchw instruction is also available with 3DNow!. Generating prefetchw in the GCC backend is functionally correct if write prefetches are requested. Looking at the test case to see why write prefetches are requested:
void f ()
{
  extern int size;
  int i;
  float *fvec;
  float *fptr = (float *) get ();
  for (i = 0; i < size; ++i)
    fvec[i] = fptr[i];
  get ();
}
I have to keep one more call statement so that the definition of "fvec" is not killed. prefetchw is generated for the memory stores via fvec; those locations are only written.
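For context, a small sketch (mine, not from the PR) of how a write prefetch is requested explicitly; -fprefetch-loop-arrays inserts the equivalent internally when it decides a store stream is worth prefetching:
/* __builtin_prefetch's second argument selects read (0) or write (1)
   prefetch; with -mprfchw (3DNow! prefetch available) the write form can be
   emitted as the prefetchw instruction.  */
void
scale (float *dst, const float *src, int n)
{
  for (int i = 0; i < n; i++)
    {
      __builtin_prefetch (&dst[i + 16], 1, 3);  /* write prefetch */
      __builtin_prefetch (&src[i + 16], 0, 3);  /* read prefetch  */
      dst[i] = src[i] * 2.0f;
    }
}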
[Bug tree-optimization/118380] New: GCC is not optimizing computation and code with AVX intrinsics.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118380 Bug ID: 118380 Summary: GCC is not optimizing computation and code with AVX intrinsics. Product: gcc Version: 15.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vekumar at gcc dot gnu.org Target Milestone: --- For the test case in the given link https://godbolt.org/z/MP88MaTva LLVM is able to optimize away the loop and the computation completely; GCC is not able to do so. The arrays are defined locally, which may not be the case in a real-world application. Nevertheless, GCC could also optimize this case.
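The actual test case is only available behind the godbolt link, so as a stand-in, here is a hypothetical example of the shape being described: locally defined arrays combined with AVX intrinsics, where the whole computation is in principle foldable at compile time (this is my own illustration, not the reported code):
#include <immintrin.h>
/* Hypothetical stand-in: both inputs are local arrays with known contents,
   so the intrinsic loop and the horizontal sum could be evaluated entirely
   at compile time and the function reduced to returning a constant.  */
double
foldable_sum (void)
{
  double x[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
  double y[8] = { 8, 7, 6, 5, 4, 3, 2, 1 };
  double out[8];
  for (int i = 0; i < 8; i += 4)
    {
      __m256d va = _mm256_loadu_pd (&x[i]);
      __m256d vb = _mm256_loadu_pd (&y[i]);
      _mm256_storeu_pd (&out[i], _mm256_add_pd (va, vb));
    }
  double s = 0;
  for (int i = 0; i < 8; i++)
    s += out[i];
  return s;   /* could be folded to 72.0 */
}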