[Bug target/52908] xop-mul-1:f9 miscompiled on bulldozer (-mxop)

2012-06-18 Thread vekumar at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52908

--- Comment #9 from vekumar at gcc dot gnu.org 2012-06-18 15:10:51 UTC ---
Author: vekumar
Date: Mon Jun 18 15:10:45 2012
New Revision: 188736

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=188736
Log:
Backport the fix for PR 52908 (xop-mul-1:f9 miscompiled on bulldozer (-mxop)) to
the 4.7 branch

Modified:
branches/gcc-4_7-branch/gcc/ChangeLog
branches/gcc-4_7-branch/gcc/config/i386/sse.md
branches/gcc-4_7-branch/gcc/testsuite/ChangeLog
branches/gcc-4_7-branch/gcc/testsuite/gcc.target/i386/xop-imul32widen-vector.c


[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression

2019-02-01 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494

--- Comment #8 from vekumar at gcc dot gnu.org ---
I tested mdbx before and after the revision Richard pointed out.
On my Ryzen box there is a ~4% regression.

Although "vblendvps" is a fast-path instruction and can execute on pipe 0|1, it
competes with the vcmpccsd, fma and mul instructions that are also executing on
pipe 0|1. The regression looks to be due to the added dependency and port
pressure.

We need to benchmark with a large application suite like SPEC and then decide
whether we need to enable the X86_TUNE_SCALAR_FLOAT_BLENDV tuning for Ryzen. On
BDVER4 no vblendvps was generated and no regression was seen.
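
For reference, the kind of scalar select that gets if-converted to vblendvps
here looks like this (a minimal sketch, not taken from mdbx):

---snip---
/* A scalar float select; with the X86_TUNE_SCALAR_FLOAT_BLENDV tuning
   enabled, if-conversion can replace the branch with a vblendvps on the
   scalar values, which then competes for pipe 0|1 as described above.  */
float select_ge (float a, float b, float x, float y)
{
  return a >= b ? x : y;
}
---snip---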

[Bug tree-optimization/86144] New: GCC is not generating vector math calls to svml/acml functions

2018-06-14 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86144

Bug ID: 86144
   Summary: GCC is not generating vector math calls to svml/acml
functions
   Product: gcc
   Version: 8.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vekumar at gcc dot gnu.org
  Target Milestone: ---

As per the GCC 8.1.0 manual:

---snip---
-mveclibabi=type
Specifies the ABI type to use for vectorizing intrinsics using an external
library. Supported values for type are ‘svml’ for the Intel short vector math
library and ‘acml’ for the AMD math core library. To use this option, both
-ftree-vectorize and -funsafe-math-optimizations have to be enabled, and an
SVML or ACML ABI-compatible library must be specified at link time.

GCC currently emits calls to vmldExp2, vmldLn2, vmldLog102, vmldLog102,
vmldPow2, vmldTanh2, vmldTan2, vmldAtan2, vmldAtanh2, vmldCbrt2, vmldSinh2,
vmldSin2, vmldAsinh2, vmldAsin2, vmldCosh2, vmldCos2, vmldAcosh2, vmldAcos2,
vmlsExp4, vmlsLn4, vmlsLog104, vmlsLog104, vmlsPow4, vmlsTanh4, vmlsTan4,
vmlsAtan4, vmlsAtanh4, vmlsCbrt4, vmlsSinh4, vmlsSin4, vmlsAsinh4, vmlsAsin4,
vmlsCosh4, vmlsCos4, vmlsAcosh4 and vmlsAcos4 for corresponding function type
when -mveclibabi=svml is used, and __vrd2_sin, __vrd2_cos, __vrd2_exp,
__vrd2_log, __vrd2_log2, __vrd2_log10, __vrs4_sinf, __vrs4_cosf, __vrs4_expf,
__vrs4_logf, __vrs4_log2f, __vrs4_log10f and __vrs4_powf for the corresponding
function type when -mveclibabi=acml is used.
---snip---

#include <math.h>

double test_vect_exp (double* __restrict__ A, double* __restrict__ B, int size)
{
  int i;
  for (i = 0; i < size; i++)
    A[i] = exp (B[i]);
  return A[0];
}

Compiling with gcc-5.4.0/bin/gcc -O3 -mveclibabi=acml -ffast-math exp.c -S
generates vector math calls to amdlibm/Intel SVML.
---Snip---
.L8:
        movapd  (%r12), %xmm0
        addl    $1, %r15d
        addq    $16, %r12
        addq    $16, %rbx
        call    __vrd2_exp
        movups  %xmm0, -16(%rbx)
        cmpl    %r15d, 4(%rsp)
        ja      .L8
        movl    12(%rsp), %eax
        addl    %eax, %ebp
        cmpl    %eax, 8(%rsp)
        je      .L10
---Snip---

From gcc-6.0 onwards we don't generate calls to acml/svml by default;
what we generate instead is a call to the glibc vector math library (libmvec):
---Snip---
.L8:
        movapd  (%r12), %xmm0
        addl    $1, %r15d
        addq    $16, %r12
        addq    $16, %rbx
        call    _ZGVbN2v___exp_finite
        movups  %xmm0, -16(%rbx)
        cmpl    %r15d, 4(%rsp)
        ja      .L8
        movl    12(%rsp), %eax
        addl    %eax, %ebp
        cmpl    %eax, 8(%rsp)
        je      .L10
---Snip---

[Bug tree-optimization/86144] GCC is not generating vector math calls to svml/acml functions

2018-06-14 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86144

--- Comment #3 from vekumar at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
> Note a workaround would be to re-arrange the vectorizer calls to
> vectorizable_simd_clone_call and vectorizable_call.  Can you check if
> the following works?  It gives precedence to what the target hook
> (and thus -mveclibabi) provides.
> 
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index 9f365e31e49..bdef56bf65e 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -9543,13 +9543,13 @@ vect_analyze_stmt (gimple *stmt, bool
> *need_to_vectorize, slp_tree node,
>if (!bb_vinfo
>&& (STMT_VINFO_RELEVANT_P (stmt_info)
>   || STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def))
> -ok = (vectorizable_simd_clone_call (stmt, NULL, NULL, node, cost_vec)
> +ok = (vectorizable_call (stmt, NULL, NULL, node, cost_vec)
>   || vectorizable_conversion (stmt, NULL, NULL, node, cost_vec)
>   || vectorizable_shift (stmt, NULL, NULL, node, cost_vec)
>   || vectorizable_operation (stmt, NULL, NULL, node, cost_vec)
>   || vectorizable_assignment (stmt, NULL, NULL, node, cost_vec)
>   || vectorizable_load (stmt, NULL, NULL, node, node_instance,
> cost_vec)
> - || vectorizable_call (stmt, NULL, NULL, node, cost_vec)
> + || vectorizable_simd_clone_call (stmt, NULL, NULL, node, cost_vec)
>   || vectorizable_store (stmt, NULL, NULL, node, cost_vec)
>   || vectorizable_reduction (stmt, NULL, NULL, node, node_instance,
>  cost_vec)
> @@ -9559,14 +9559,14 @@ vect_analyze_stmt (gimple *stmt, bool
> *need_to_vectorize, slp_tree node,
>else
>  {
>if (bb_vinfo)
> -   ok = (vectorizable_simd_clone_call (stmt, NULL, NULL, node, cost_vec)
> +   ok = (vectorizable_call (stmt, NULL, NULL, node, cost_vec)
>   || vectorizable_conversion (stmt, NULL, NULL, node, cost_vec)
>   || vectorizable_shift (stmt, NULL, NULL, node, cost_vec)
>   || vectorizable_operation (stmt, NULL, NULL, node, cost_vec)
>   || vectorizable_assignment (stmt, NULL, NULL, node, cost_vec)
>   || vectorizable_load (stmt, NULL, NULL, node, node_instance,
> cost_vec)
> - || vectorizable_call (stmt, NULL, NULL, node, cost_vec)
> + || vectorizable_simd_clone_call (stmt, NULL, NULL, node,
> cost_vec)
>   || vectorizable_store (stmt, NULL, NULL, node, cost_vec)
>   || vectorizable_condition (stmt, NULL, NULL, NULL, 0, node,
>  cost_vec)

I checked the patch; it now gives preference to the -mveclibabi= option and
generates the expected calls.
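
A quick way to verify this (a sketch using the exp.c test from this report;
exact output will vary):

---snip---
$ gcc -O3 -ffast-math -mveclibabi=acml -S exp.c
$ grep vrd2_exp exp.s
        call    __vrd2_exp
---snip---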

[Bug target/91719] gcc compiles seq_cst store on x86-64 differently from clang/icc

2019-09-11 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91719

--- Comment #9 from vekumar at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #8)
> CCing AMD too.
Sure. Let me check whether this tuning helps the AMD Zen architecture.

[Bug target/91719] gcc compiles seq_cst store on x86-64 differently from clang/icc

2019-09-12 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91719

--- Comment #10 from vekumar at gcc dot gnu.org ---
xchg is faster than mov+mfence on AMD Zen. We can add m_ZNVER1 | m_ZNVER2 to
the tuning.
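
A minimal reproducer for the seq_cst store in question (a sketch; the PR's
test case may differ):

---snip---
#include <stdatomic.h>

atomic_int flag;

void set_flag (int v)
{
  /* A seq_cst store: GCC emits "mov" + "mfence" here, while clang/icc
     emit a single "xchg", which is the faster sequence on AMD Zen too.  */
  atomic_store_explicit (&flag, v, memory_order_seq_cst);
}
---snip---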

[Bug target/87455] sse_packed_single_insn_optimal is suboptimal on Zen

2018-09-28 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87455

--- Comment #2 from vekumar at gcc dot gnu.org ---
This tuning was intended to generate movups instead of movupd, as movups is one
byte shorter than movupd. Maybe we should remove the xorps generation part.
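
A minimal example of the kind of code the tuning affects (a sketch, assuming
SSE2 unaligned double-precision moves):

---snip---
#include <emmintrin.h>

void copy2 (double *dst, const double *src)
{
  /* With sse_packed_single_insn_optimal the unaligned moves may be emitted
     as movups instead of movupd, saving one byte of encoding each.  */
  __m128d v = _mm_loadu_pd (src);
  _mm_storeu_pd (dst, v);
}
---snip---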

[Bug middle-end/68621] [6 Regression] FAIL: gcc.dg/tree-ssa/ifc-8.c scan-tree-dump-times ifcvt "Applying if-conversion" 1

2016-03-01 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68621

--- Comment #3 from vekumar at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
> You can change the testcase to
> 
>  __attribute__((aligned (32))) float array[LEN] = {};
> 
> which makes it not require -fno-common either and it should work with -fpic
> then
> (double-check).

I added the option "-fno-common" so that the condition returned by
decl_binds_to_current_def_p is true.

  /* or the base is known to be not readonly.  */
  tree base_tree = get_base_address (DR_REF (a));
  if (DECL_P (base_tree)
  && decl_binds_to_current_def_p (base_tree)

Changing it to __attribute__((aligned (32))) float array[LEN] = {} also
exercises that condition.

Sure, I will send a patch to adjust the test case.

[Bug middle-end/68621] [6 Regression] FAIL: gcc.dg/tree-ssa/ifc-8.c scan-tree-dump-times ifcvt "Applying if-conversion" 1

2016-03-01 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68621

--- Comment #4 from vekumar at gcc dot gnu.org ---
Even after initializing the array,
decl_binds_to_current_def_p (base_tree) returns false when I set -fpic.

---Snip---

(1)
bool
decl_binds_to_current_def_p (const_tree decl)
{
  gcc_assert (DECL_P (decl));
  if (!targetm.binds_local_p (decl))
return false;

(2)
---snip---
#if !TARGET_MACHO && !TARGET_DLLIMPORT_DECL_ATTRIBUTES
/* For i386, common symbol is local only for non-PIE binaries.  For
   x86-64, common symbol is local only for non-PIE binaries or linker
   supports copy reloc in PIE binaries.   */

static bool
ix86_binds_local_p (const_tree exp)
{
  return default_binds_local_p_3 (exp, flag_shlib != 0, true, true,
  (!flag_pic
   || (TARGET_64BIT
   && HAVE_LD_PIE_COPYRELOC != 0)));
}
#endif 
---snip---
And in default_binds_local_p_3, DECL_VISIBILITY (exp) is VISIBILITY_DEFAULT and
shlib is set, so it returns false:

(3)
---snip---
 /* A symbol is local if the user has said explicitly that it will be,
 or if we have a definition for the symbol.  We cannot infer visibility
 for undefined symbols.  */
  if (DECL_VISIBILITY (exp) != VISIBILITY_DEFAULT
  && (TREE_CODE (exp) == FUNCTION_DECL
  || !extern_protected_data
  || DECL_VISIBILITY (exp) != VISIBILITY_PROTECTED)
  && (DECL_VISIBILITY_SPECIFIED (exp) || defined_locally))
return true;

  /* If PIC, then assume that any global name can be overridden by
 symbols resolved from other modules.  */
  if (shlib)
return false;
---snip---
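
A small illustration of that rule (a sketch, assuming the usual ELF
semantics):

---snip---
/* Compiled with -fpic: the default-visibility definition may be interposed
   by another module, so decl_binds_to_current_def_p returns false for it,
   while the hidden-visibility definition always binds locally.  */
int array_default[16];
__attribute__((visibility("hidden"))) int array_hidden[16];
---snip---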

[Bug middle-end/68621] [6 Regression] FAIL: gcc.dg/tree-ssa/ifc-8.c scan-tree-dump-times ifcvt "Applying if-conversion" 1

2016-03-01 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68621

--- Comment #5 from vekumar at gcc dot gnu.org ---
Setting the visibility to hidden helps.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ifc-8.c
b/gcc/testsuite/gcc.dg/tree-ssa/ifc-8.c
index 89a3410..7519a61 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ifc-8.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ifc-8.c
@@ -1,9 +1,9 @@

 /* { dg-do compile } */
-/* { dg-options "-Ofast -fdump-tree-ifcvt-details -fno-common -ftree-loop-if-convert-stores" } */
+/* { dg-options "-Ofast -fdump-tree-ifcvt-details -ftree-loop-if-convert-stores" } */

 #define LEN 4096
- __attribute__((aligned (32))) float array[LEN];
+ __attribute__((visibility("hidden"), aligned (32))) float array[LEN] = {};

 void test ()
 {

[Bug middle-end/68621] [6 Regression] FAIL: gcc.dg/tree-ssa/ifc-8.c scan-tree-dump-times ifcvt "Applying if-conversion" 1

2016-03-01 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68621

--- Comment #6 from vekumar at gcc dot gnu.org ---
Author: vekumar
Date: Wed Mar  2 06:14:43 2016
New Revision: 233888

URL: https://gcc.gnu.org/viewcvs?rev=233888&root=gcc&view=rev
Log:
Adjust test case in PR68621 to compile with -fpic.

2016-03-02  Venkataramanan Kumar  

PR tree-optimization/68621
* gcc.dg/tree-ssa/ifc-8.c: Adjust test.


Modified:
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.dg/tree-ssa/ifc-8.c

[Bug tree-optimization/70102] New: Tree re-association prevents SLP vectorization at -Ofast.

2016-03-05 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70102

Bug ID: 70102
   Summary: Tree re-association prevents SLP vectorization at
-Ofast.
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vekumar at gcc dot gnu.org
  Target Milestone: ---

The following test case fails to vectorize with gcc -Ofast.

(---snip---)
  subroutine test (x,y,z)
  integer x,y,z
  real*8 a(5,x,y,z),b(5,x,y,z)
  real*8 c

  c = 0.0d0
  do k=1,z
 do j=1,y
   do i=1,x
  do l=1,5
 c = c + a(l,i,j,k)*b(l,i,j,k)
  enddo
   enddo
 enddo
  enddo
  write(30,*)'c ==',c
  return
  end
(---snip---)

Vectorizer dump 
(---snip---)
test.f:9:0: note: original stmt _95 = _92 + _112;
test.f:9:0: note: Build SLP for _152 = _150 * _151;
test.f:9:0: note: Build SLP failed: different operation in stmt _152 = _150 *
_151;
test.f:9:0: note: original stmt _95 = _92 + _112;
test.f:9:0: note: Build SLP for _55 = _53 * _54;
test.f:9:0: note: Build SLP failed: different operation in stmt _55 = _53 *
_54;
test.f:9:0: note: original stmt _95 = _92 + _112;
test.f:1:0: note: vectorized 0 loops in function
(---snip---)

The re-association pass changes one of the tree expressions, and this prevents
SLP block vectorization.

Before 
(---snip---)
 # VUSE <.MEM_7>
  _90 = *A.18_37[_89];
  # VUSE <.MEM_7>
  _91 = *A.20_40[_89];
  _92 = _90 * _91;
  # VUSE <.MEM_7>
  c.21_93 = cD.3439;
  c.22_94 = _92 + c.21_93;
  _109 = _87 + 2;
  # VUSE <.MEM_7>
  _110 = *A.18_37[_109];
  # VUSE <.MEM_7>
  _111 = *A.20_40[_109];
  _112 = _110 * _111;
  c.22_114 = c.22_94 + _112;
  _129 = _87 + 3;
(---snip---)


After tree-reassoc
(---snip---)
 # VUSE <.MEM_7>
  _90 = *A.18_37[_89];
  # VUSE <.MEM_7>
  _91 = *A.20_40[_89];
  _92 = _91 * _90;
  # VUSE <.MEM_7>
  c.21_93 = cD.3439;
  _109 = _87 + 2;
  # VUSE <.MEM_7>
  _110 = *A.18_37[_109];
  # VUSE <.MEM_7>
  _111 = *A.20_40[_109];
  _112 = _111 * _110;
  _31 = _112 + _92; <== new statement 
  _129 = _87 + 3;
(---snip---)
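
A C analogue of what happens to the unrolled reduction (a sketch; the names
are made up):

---snip---
double dot5 (const double *a, const double *b, double c)
{
  double m0 = a[0]*b[0], m1 = a[1]*b[1], m2 = a[2]*b[2],
         m3 = a[3]*b[3], m4 = a[4]*b[4];
  /* Before reassoc the sum is a linear chain accumulating into c, so every
     SLP group member is a multiply.  After reassoc some products are added
     to each other first (like _31 = _112 + _92 above), so the SLP builder
     finds a '+' where the group expects a '*' and gives up with
     "different operation in stmt".  */
  return c + m0 + m1 + m2 + m3 + m4;
}
---snip---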

[Bug tree-optimization/70103] New: gcc reports bad dependence and bails out of vectorization for one of the bwaves loops.

2016-03-05 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70103

Bug ID: 70103
   Summary: gcc reports bad dependence and bails out of
vectorization for one of the bwaves loops.
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vekumar at gcc dot gnu.org
  Target Milestone: ---

flux_lam.f:68:0: note: dependence distance  = 0.
flux_lam.f:68:0: note: dependence distance == 0 between
MEM[(real(kind=8)D.18[0:D.3627] *)ev_197(D) clique 1 base 12][_244] and
MEM[(real(kind=8)D.18[0:D.3627] *)ev_197(D) clique 1 base 12][_244]
flux_lam.f:68:0: note: READ_WRITE dependence in interleaving.
flux_lam.f:68:0: note: bad data dependence.

Looking at the vectorizer dumps, if we had CSEd the load, there would be no
dependence issue here.

MEM[(real(kind=8)D.18[0:D.3627] *)ev_197(D) clique 1 base 12][_244] = _272

_323 = MEM[(real(kind=8)D.18[0:D.3627] *)ev_197(D) clique 1 base 12][_244];


---snip---
  MEM[(real(kind=8)D.18[0:D.3627] *)ev_197(D) clique 1 base 12][_244] = _272;
  # VUSE <.MEM_273>
  _274 = MEM[(real(kind=8)D.18[0:D.3605] *)u.105_58][_219];
  # VUSE <.MEM_273>
  _275 = MEM[(real(kind=8)D.18[0:D.3605] *)u.105_58][_224];
  _276 = _274 - _275;
  _277 = ((_276));
  t1_278 = _277 / dy2_68;
  _279 = _195 + 3;
  # VUSE <.MEM_273>
  _280 = MEM[(real(kind=8)D.18[0:D.3605] *)u.105_58][_252];
  # VUSE <.MEM_273>
  _281 = MEM[(real(kind=8)D.18[0:D.3605] *)u.105_58][_254];
  _282 = _280 - _281;
  _283 = ((_282));
  _284 = _283 / dy2_68;
  _285 = t1_278 + _284;
  _286 = ((_285));
  _287 = _286 * 5.0e-1;
  # VUSE <.MEM_273>
  _288 = MEM[(real(kind=8)D.18[0:D.3601] *)v.107_60][_206];
  # VUSE <.MEM_273>
  _289 = MEM[(real(kind=8)D.18[0:D.3601] *)v.107_60][_203];
  _290 = _288 - _289;
  _291 = ((_290));
  _292 = _291 / _64;
  _293 = _287 + _292;
  _294 = ((_293));
  _295 = t0_210 * _294;
  # .MEM_296 = VDEF <.MEM_273>
  MEM[(real(kind=8)D.18[0:D.3627] *)ev_197(D) clique 1 base 12][_279] = _295;
  # VUSE <.MEM_296>
  _297 = MEM[(real(kind=8)D.18[0:D.3605] *)u.105_58][_233];
  # VUSE <.MEM_296>
  _298 = MEM[(real(kind=8)D.18[0:D.3605] *)u.105_58][_239];
  _299 = _297 - _298;
  _300 = ((_299));
 t2_301 = _300 / dz2_71;
  _302 = _195 + 4;
  # VUSE <.MEM_296>
  _303 = MEM[(real(kind=8)D.18[0:D.3605] *)u.105_58][_261];
  # VUSE <.MEM_296>
  _304 = MEM[(real(kind=8)D.18[0:D.3605] *)u.105_58][_263];
  _305 = _303 - _304;
  _306 = ((_305));
  _307 = _306 / dz2_71;
  _308 = t2_301 + _307;
  _309 = ((_308));
  _310 = _309 * 5.0e-1;
  # VUSE <.MEM_296>
  _311 = MEM[(real(kind=8)D.18[0:D.3597] *)w.109_62][_206];
  # VUSE <.MEM_296>
  _312 = MEM[(real(kind=8)D.18[0:D.3597] *)w.109_62][_203];
  _313 = _311 - _312;
  _314 = ((_313));
  _315 = _314 / _64;
  _316 = _310 + _315;
  _317 = ((_316));
  _318 = t0_210 * _317;
  # .MEM_319 = VDEF <.MEM_296>
  MEM[(real(kind=8)D.18[0:D.3627] *)ev_197(D) clique 1 base 12][_302] = _318;
  _320 = _195 + 5;
  _321 = _246 + _247;
  _322 = ((_321));
  # VUSE <.MEM_319>
   _323 = MEM[(real(kind=8)D.18[0:D.3627] *)ev_197(D) clique 1 base 12][_244];
---snip---

[Bug tree-optimization/70103] gcc reports bad dependence and bails out of vectorization for one of the bwaves loops.

2016-03-05 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70103

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 CC||matz at suse dot de,
   ||richard.guenther at gmail dot com
   Severity|normal  |enhancement

--- Comment #1 from vekumar at gcc dot gnu.org ---
After discussion with Richard it was concluded that even after we fix this, we
still won't be able to vectorize the loop.

(Snip)
flux_lam.f:68:0: note: not vectorized: relevant stmt not supported:
_177 = _176 % _21;
flux_lam.f:68:0: note: bad operation or unsupported loop bound.
(Snip)

The reason is that we have % operations.

(Snip)
  <bb ...>:
  # i_2 = PHI <1(23), _181(28)>
  _175 = i_2 + _21;
  _176 = _175 + -2;
  _177 = _176 % _21;
  im1_178 = _177 + 1;
  _179 = i_2 % _21;
  ip1_180 = _179 + 1;
(Snip)

That makes the indices "wrap" around, which is of course something that is hard
to vectorize. One would need iteration-space splitting to ensure the wrapping
doesn't occur in the vectorized iterations.

Reporting this bug and marking it as an enhancement.

[Bug tree-optimization/70193] New: missed loop splitting support based on iteration space

2016-03-11 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70193

Bug ID: 70193
   Summary: missed loop splitting support based on iteration space
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vekumar at gcc dot gnu.org
  Target Milestone: ---

Following the comments in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70103#c2
and discussion with Richard, filing this PR.

This is inspired by the loop at flux_lam.f:68:0 in bwaves, which has a %
operation.

int a[100],b[100];
void test(int x, int N1)
{
int i,im1;
for (i=0;i
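
Since the test case above is cut off, the sketch below assumes an index of
the form (i + N1 - 1) % N1 to show what the requested splitting would look
like:

---snip---
int a2[100], b2[100];

/* Original: the modulo wraps only at i == 0 (assuming 0 <= i < N1).  */
void test_orig (int N1)
{
  for (int i = 0; i < N1; i++)
    a2[i] = b2[(i + N1 - 1) % N1];
}

/* After iteration-space splitting: peel the wrapping iteration, and the
   main loop becomes a plain, vectorizable copy.  */
void test_split (int N1)
{
  if (N1 > 0)
    a2[0] = b2[N1 - 1];
  for (int i = 1; i < N1; i++)
    a2[i] = b2[i - 1];
}
---snip---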

[Bug tree-optimization/70193] missed loop splitting support based on iteration space

2016-03-11 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70193

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

   Severity|normal  |enhancement

[Bug tree-optimization/58135] [x86] Missed opportunities for partial SLP

2016-05-23 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58135

--- Comment #3 from vekumar at gcc dot gnu.org ---
Author: vekumar
Date: Mon May 23 09:48:54 2016
New Revision: 236582

URL: https://gcc.gnu.org/viewcvs?rev=236582&root=gcc&view=rev
Log:
Fix PR58135.

2016-05-23  Venkataramanan Kumar  

PR tree-optimization/58135
* tree-vect-slp.c: When group size is not multiple
of vector size, allow splitting of store group at
vector boundary.

2016-05-23  Venkataramanan Kumar  

* gcc.dg/vect/bb-slp-19.c:  Remove XFAIL.
* gcc.dg/vect/pr58135.c:  Add new.
* gfortran.dg/pr46519-1.f: Adjust test case.


Added:
trunk/gcc/testsuite/gcc.dg/vect/pr58135.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.dg/vect/bb-slp-19.c
trunk/gcc/testsuite/gfortran.dg/pr46519-1.f
trunk/gcc/tree-vect-slp.c

[Bug tree-optimization/58135] [x86] Missed opportunities for partial SLP

2016-05-23 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58135

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from vekumar at gcc dot gnu.org ---
Fixed the PR.
ref: https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=236582

2016-05-23  Venkataramanan Kumar  

PR tree-optimization/58135
* tree-vect-slp.c: When group size is not multiple
of vector size, allow splitting of store group at
vector boundary.

2016-05-23  Venkataramanan Kumar  

* gcc.dg/vect/bb-slp-19.c:  Remove XFAIL.
* gcc.dg/vect/pr58135.c:  Add new.
* gfortran.dg/pr46519-1.f: Adjust test case

[Bug tree-optimization/71270] [7 Regression] fortran regression after fix SLP PR58135

2016-05-26 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71270

--- Comment #2 from vekumar at gcc dot gnu.org ---
Looked at x86_64 gimple code for intrinsic_pack_1.f90.

After the SLP split we now vectorize at the place where we pass constant
arguments via a parameter structure to the _gfortran_pack call.

Before
 parm.20D.3555.dtypeD.3497 = 297;
  # .MEM_242 = VDEF <.MEM_241>
  parm.20D.3555.dimD.3502[0].lboundD.3499 = 1;
  # .MEM_243 = VDEF <.MEM_242>
  parm.20D.3555.dimD.3502[0].uboundD.3500 = 9;
  # .MEM_244 = VDEF <.MEM_243>
  parm.20D.3555.dimD.3502[0].strideD.3498 = 1;
  # .MEM_245 = VDEF <.MEM_244>
  parm.20D.3555.dataD.3495 = &d_ri4D.3433[0];
  # .MEM_246 = VDEF <.MEM_245>
  parm.20D.3555.offsetD.3496 = -1;

After 
# .MEM_243 = VDEF <.MEM_1566>
parm.20D.3555.dimD.3502[0].uboundD.3500 = 9;
# .MEM_245 = VDEF <.MEM_243>
parm.20D.3555.dataD.3495 = &d_ri4D.3433[0];
# .MEM_992 = VDEF <.MEM_245>
MEM[(integer(kind=8)D.9 *)&parm.20D.3555 + 8B] = vect_cst__993;
# PT = anything
# ALIGN = 16, MISALIGN = 8
_984 = &parm.20D.3555.offsetD.3496 + 16;
# .MEM_983 = VDEF <.MEM_992>
MEM[(integer(kind=8)D.9 *)_984] = vect_cst__999;

vect_cst__993={-1,297}
vect_cst__999={1,1}

Other places look similar. This looks like correct gimple. I am verifying the
gimple generated for the arm big-endian target.

[Bug tree-optimization/71270] [7 Regression] fortran regression after fix SLP PR58135

2016-05-26 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71270

--- Comment #3 from vekumar at gcc dot gnu.org ---
Built armeb-none-linux-gnueabihf --with-cpu=cortex-a9 --with-fpu=neon-fp16
--with-float=hard

and compared the gimple output from intrinsic_pack_1.f90.151t.slp1 before and
after my patch.

The difference is shown below and is similar to the x86_64 dump. The gimple
dump after SLP looks correct to me. I think something in the backend is causing
the issue.

Any thoughts?

Gimple SLP dumps.

Before

 # .MEM_1450 = VDEF <.MEM_1492>
  d_i1D.3585[0].vD.3582 = 1;
  # .MEM_1454 = VDEF <.MEM_1450>
  d_i1D.3585[1].vD.3582 = -1;
  # .MEM_1458 = VDEF <.MEM_1454>
  d_i1D.3585[2].vD.3582 = 2;
  # .MEM_1468 = VDEF <.MEM_1458>
  d_i1D.3585[3].vD.3582 = -2;
  # .MEM_1472 = VDEF <.MEM_1468>
  d_i1D.3585[4].vD.3582 = 3;
  # .MEM_1476 = VDEF <.MEM_1472>
  d_i1D.3585[5].vD.3582 = -3;
  # .MEM_1486 = VDEF <.MEM_1476>
  d_i1D.3585[6].vD.3582 = 4;
  # .MEM_1490 = VDEF <.MEM_1486>
  d_i1D.3585[7].vD.3582 = -4;
  # .MEM_1494 = VDEF <.MEM_1490>
  d_i1D.3585[8].vD.3582 = 5;


After 

  vect_cst__817 = { 1, 0, 1, 0 };
  vect_cst__873 = { 1, 0, 1, 0 };
  vect_cst__1413 = { 1, -1, 2, -2 };
  vect_cst__1461 = { 3, -3, 4, -4 };

  # .MEM_910 = VDEF <.MEM_1492>
  MEM[(integer(kind=1)D.3 *)&d_i1D.3585] = vect_cst__1413;
  # PT = anything
  # ALIGN = 4, MISALIGN = 0
  _918 = &d_i1D.3585[0].vD.3582 + 4;
  # .MEM_865 = VDEF <.MEM_910>
  MEM[(integer(kind=1)D.3 *)_918] = vect_cst__1461;
  # .MEM_1494 = VDEF <.MEM_865>
  d_i1D.3585[8].vD.3582 = 5;

Before 

 # .MEM_1388 = VDEF <.MEM_217>
  MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][0] = 1;
  # .MEM_1393 = VDEF <.MEM_1388>
  MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][1] = 0;
  # .MEM_1398 = VDEF <.MEM_1393>
  MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][2] = 1;
  # .MEM_1409 = VDEF <.MEM_1398>
  MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][3] = 0;
  # .MEM_1414 = VDEF <.MEM_1409>
  MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][4] = 1;
  # .MEM_1419 = VDEF <.MEM_1414>
  MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][5] = 0;
  # .MEM_1430 = VDEF <.MEM_1419>
  MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][6] = 1;
  # .MEM_1435 = VDEF <.MEM_1430>
  MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][7] = 0;
  # .MEM_1440 = VDEF <.MEM_1435>
  MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][8] = 1;

After 

  # .MEM_825 = VDEF <.MEM_217>
  MEM[(logical(kind=1)D.7 *)&A.8D.3679] = vect_cst__817;
  # PT = anything
  # ALIGN = 4, MISALIGN = 0
  _769 = &MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][0] + 4;
  # .MEM_777 = VDEF <.MEM_825>
  MEM[(logical(kind=1)D.7 *)_769] = vect_cst__873;
  # .MEM_1440 = VDEF <.MEM_777>
  MEM[(logical(kind=1)D.7[9] *)&A.8D.3679][8] = 1;

Before 

  # .MEM_1271 = VDEF <.MEM_264>
  MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][0] = 1;
  # .MEM_1276 = VDEF <.MEM_1271>
  MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][1] = 0;
  # .MEM_1281 = VDEF <.MEM_1276>
  MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][2] = 1;
  # .MEM_1292 = VDEF <.MEM_1281>
  MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][3] = 0;
  # .MEM_1297 = VDEF <.MEM_1292>
  MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][4] = 1;
  # .MEM_1302 = VDEF <.MEM_1297>
  MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][5] = 0;
  # .MEM_1313 = VDEF <.MEM_1302>
  MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][6] = 1;
  # .MEM_1318 = VDEF <.MEM_1313>
  MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][7] = 0;
  # .MEM_1323 = VDEF <.MEM_1318>
  MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][8] = 1;

After 

 vect_cst__729 = { 1, 0, 1, 0 };
  vect_cst__721 = { 1, 0, 1, 0 };

  # .MEM_673 = VDEF <.MEM_264>
  MEM[(logical(kind=1)D.7 *)&A.23D.3720] = vect_cst__729;
  # PT = anything
  # ALIGN = 4, MISALIGN = 0
  _681 = &MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][0] + 4;
  # .MEM_942 = VDEF <.MEM_673>
  MEM[(logical(kind=1)D.7 *)_681] = vect_cst__721;
  # .MEM_1323 = VDEF <.MEM_942>
  MEM[(logical(kind=1)D.7[9] *)&A.23D.3720][8] = 1;

[Bug target/71270] [7 Regression] fortran regression after fix SLP PR58135

2016-05-27 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71270

--- Comment #5 from vekumar at gcc dot gnu.org ---
The expand dump after the SLP split:

---snip---
;; MEM[(logical(kind=1) *)&A.8] = { 1, 0, 1, 0 };

(insn 71 70 72 (set (reg:SI 308)
(const_int 16777472 [0x1000100])) intrinsic_pack_1.f90:49 -1
 (nil))

(insn 72 71 0 (set (mem/c:SI (plus:SI (reg/f:SI 105 virtual-stack-vars)
(const_int -576 [0xfdc0])) [8
MEM[(logical(kind=1)D.7 *)&A.8D.3679]+0 S4 A64])
(reg:SI 308)) intrinsic_pack_1.f90:49 -1
 (nil))

;; MEM[(logical(kind=1) *)&A.8 + 4B] = { 1, 0, 1, 0 };

(insn 73 72 74 (set (reg:SI 309)
(const_int 16777472 [0x1000100])) intrinsic_pack_1.f90:49 -1
 (nil))

(insn 74 73 0 (set (mem/c:SI (plus:SI (reg/f:SI 105 virtual-stack-vars)
(const_int -572 [0xfdc4])) [8
MEM[(logical(kind=1)D.7 *)&A.8D.3679 + 4B]+0 S4 A32])
(reg:SI 309)) intrinsic_pack_1.f90:49 -1
 (nil))

;; MEM[(logical(kind=1)[9] *)&A.8][8] = 1;

(insn 75 74 76 (set (reg:SI 310)
(const_int 1 [0x1])) intrinsic_pack_1.f90:49 -1
 (nil))

(insn 76 75 77 (set (reg:QI 311)
(subreg:QI (reg:SI 310) 3)) intrinsic_pack_1.f90:49 -1
 (nil))

(insn 77 76 0 (set (mem/c:QI (plus:SI (reg/f:SI 105 virtual-stack-vars)
(const_int -568 [0xfdc8])) [8 A.8D.3679+8 S1 A64])
(reg:QI 311)) intrinsic_pack_1.f90:49 -1
 (nil))
---snip---

[Bug tree-optimization/64946] [AArch64] gcc.target/aarch64/vect-abs-compile.c - "abs" vectorization fails for char/short types

2016-06-03 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64946

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

   Assignee|vekumar at gcc dot gnu.org |shiva0217 at gmail dot com

--- Comment #15 from vekumar at gcc dot gnu.org ---
I am not working on this now, so I have assigned it to Shiva Chen.

[Bug tree-optimization/64716] Missed vectorization in a hot code of SPEC2000 ammp

2016-06-10 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64716

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 CC||vekumar at gcc dot gnu.org

--- Comment #4 from vekumar at gcc dot gnu.org ---
Tried to see if there is an improvement when allowing splitting of group
stores at the VF boundary.

A small improvement was noted with a slightly older trunk,
gcc version 7.0.0 20160524 (experimental) (GCC):

rectmm.c:520:2: note: Basic block will be vectorized using SLP


(Snip)
a1-> px = a1->x + lambda*a1->dx;
a1-> py = a1->y + lambda*a1->dy;
a1-> pz = a1->z + lambda*a1->dz;
(Snip)

---SLP dump---
rectmm.c:520:2: note: Detected interleaving load a1_944->xD.4701 and
a1_944->yD.4702
rectmm.c:520:2: note: Detected interleaving load a1_944->xD.4701 and
a1_944->zD.4703
rectmm.c:520:2: note: Detected interleaving load a1_944->xD.4701 and
a1_944->dxD.4721
rectmm.c:520:2: note: Detected interleaving load a1_944->xD.4701 and
a1_944->dyD.4722
rectmm.c:520:2: note: Detected interleaving load a1_944->xD.4701 and
a1_944->dzD.4723
rectmm.c:520:2: note: Detected interleaving store a1_944->pxD.4728 and
a1_944->pyD.4729
rectmm.c:520:2: note: Detected interleaving store a1_944->pxD.4728 and
a1_944->pzD.4730

rectmm.c:520:2: note: Split group into 2 and 1

rectmm.c:520:2: note: Basic block will be vectorized using SLP
rectmm.c:520:2: note: SLPing BB part
rectmm.c:520:2: note: -->vectorizing SLP node starting from: # VUSE
<.MEM_1752>
_672 = a1_944->dxD.4721;
---SLP dump---
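
For reference, a struct layout consistent with the fields in the dump (a
hypothetical sketch; the real ATOM struct in ammp has more members):

---snip---
typedef struct atom
{
  double x, y, z;      /* interleaved loads                   */
  double dx, dy, dz;   /* interleaved loads                   */
  double px, py, pz;   /* store group of 3, split into 2 + 1  */
  /* ... further fields in the real source ... */
} ATOM;

void update (ATOM *a1, double lambda)
{
  a1->px = a1->x + lambda * a1->dx;
  a1->py = a1->y + lambda * a1->dy;
  a1->pz = a1->z + lambda * a1->dz;
}
---snip---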

[Bug sanitizer/65662] AddressSanitizer CHECK failed: ../../../../gcc/libsanitizer/sanitizer_common/sanitizer_allocator.h:835 "((res)) < ((kNumPossibleRegions))" (0x3ffb49, 0x80000)

2015-04-05 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65662

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 CC||vekumar at gcc dot gnu.org

--- Comment #6 from vekumar at gcc dot gnu.org ---
For a 42-bit VA, I have to change SANITIZER_MMAP_RANGE_SIZE to 1 << 42.
Also, the compiler has to add the shadow offset instead of ORing it.

I am planning to post a patch in LLVM.

As Kostya said, we can discuss it in that thread.


[Bug sanitizer/65662] AddressSanitizer CHECK failed: ../../../../gcc/libsanitizer/sanitizer_common/sanitizer_allocator.h:835 "((res)) < ((kNumPossibleRegions))" (0x3ffb49, 0x80000)

2015-04-05 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65662

--- Comment #8 from vekumar at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #7)
> (In reply to vekumar from comment #6)
> > For 42 bit VA,  I have to change the SANITIZER_MMAP_RANGE_SIZE to  1 <<42.
> 
> Sure.
> 
> > Also compiler has to add the shadow offset instead of Oring it.
> 
> You don't, see my patch.
> As I said, the hard part is making sure all 3 layouts work with the same
> libasan library - the problem is that the library assumes some decisions
> (like whether to use 32-bit or 64-bit allocator) have to be done at library
> compile time, when for aarch64 they really have to be done at runtime.

Hi Jakub, 

It was decided to make ASAN work for 42 bit VA without changing the default
allocator (32bit) and the default shadow offset (1<<36). 

Please see thread
https://groups.google.com/forum/#!topic/address-sanitizer/YzYRJEvVimw.

On a 42-bit VA with the default settings, I found that some cases (LLVM ASAN
tests) were failing because the compiler (LLVM) ORs in the shadow offset while
the ASAN library code adds it. Both accesses resulted in valid memory, but we
were poisoning the wrong shadow memory.
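
The mismatch in a nutshell (a sketch; constants as discussed in the thread):

---snip---
/* Shadow address computation for a shadow offset of 1 << 36:  */
unsigned long long shadow_or (unsigned long long addr)
{
  return (addr >> 3) | (1ULL << 36);   /* what the compiler emitted    */
}
unsigned long long shadow_add (unsigned long long addr)
{
  return (addr >> 3) + (1ULL << 36);   /* what the runtime library does */
}
/* With a 42-bit VA, addr >> 3 can reach 2^39 and thus carries bits at and
   above bit 36, so OR and ADD disagree: both results are valid memory,
   but the wrong shadow bytes get poisoned.  */
---snip---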

Now your patch turns on the 64-bit allocator. I agree that to do this we need
to detect the VA size dynamically at runtime.

Can you please join the thread and post your comments there?


[Bug bootstrap/62077] --with-build-config=bootstrap-lto fails

2015-04-15 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62077

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 CC||vekumar at gcc dot gnu.org

--- Comment #58 from vekumar at gcc dot gnu.org ---
Richard, 

so for the GCC 5.0 branch we have to use --enable-stage1-checking=release as a
workaround?


[Bug target/66049] New: Few AArch64 extend and add with shift tests generates sub optimal code with trunk gcc 6.0.

2015-05-07 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66049

Bug ID: 66049
   Summary: Few AArch64 extend and add with shift tests generates
sub optimal code with trunk gcc 6.0.
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vekumar at gcc dot gnu.org
  Target Milestone: ---

After preventing the conversion of shifts to mults in the combiner
(https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=222874),
a few AArch64 target tests generate suboptimal code.

Tests that now fail, but worked before:
---
gcc.target/aarch64/adds1.c scan-assembler adds\tw[0-9]+, w[0-9]+, w[0-9]+, lsl
3
gcc.target/aarch64/adds1.c scan-assembler adds\tx[0-9]+, x[0-9]+, x[0-9]+, lsl
3
gcc.target/aarch64/adds3.c scan-assembler-times adds\tx[0-9]+, x[0-9]+,
x[0-9]+,
 sxtw 2
gcc.target/aarch64/extend.c scan-assembler add\tw[0-9]+,.*uxth #?1
gcc.target/aarch64/extend.c scan-assembler add\tx[0-9]+,.*uxtw #?3
gcc.target/aarch64/extend.c scan-assembler sub\tw[0-9]+,.*uxth #?1
gcc.target/aarch64/extend.c scan-assembler sub\tx[0-9]+,.*uxth #?1
gcc.target/aarch64/extend.c scan-assembler sub\tx[0-9]+,.*uxtw #?3
gcc.target/aarch64/subs1.c scan-assembler subs\tw[0-9]+, w[0-9]+, w[0-9]+, lsl
3
gcc.target/aarch64/subs1.c scan-assembler subs\tx[0-9]+, x[0-9]+, x[0-9]+, lsl
3
gcc.target/aarch64/subs3.c scan-assembler-times subs\tx[0-9]+, x[0-9]+,
x[0-9]+,
 sxtw 2

Sample test case:

unsigned long long
adddi_uxtw (unsigned long long a, unsigned int i)
{
  /* { dg-final { scan-assembler "add\tx\[0-9\]+,.*uxtw #?3" } } */
  return a + ((unsigned long long)i << 3);
}

Before 

 add x0, x0, x1, uxtw 3

Now 

ubfiz   x1, x1, 3, 32
add x0, x1, x0


[Bug target/66049] Few AArch64 extend and add with shift tests generates sub optimal code with trunk gcc 6.0.

2015-05-07 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66049

--- Comment #1 from vekumar at gcc dot gnu.org ---
We need patterns based on shifts to match what the combiner now generates.

The patch below fixes them.

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 1c2c5fb..c5a640d 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1555,6 +1555,23 @@
   [(set_attr "type" "alus_shift_imm")]
 )

+(define_insn "*adds_shift_imm_<mode>"
+  [(set (reg:CC_NZ CC_REGNUM)
+(compare:CC_NZ
+ (plus:GPI (ASHIFT:GPI
+(match_operand:GPI 1 "register_operand" "r")
+(match_operand:QI 2 "aarch64_shift_imm_<mode>" "n"))
+   (match_operand:GPI 3 "register_operand" "r"))
+ (const_int 0)))
+   (set (match_operand:GPI 0 "register_operand" "=r")
+(plus:GPI (ASHIFT:GPI (match_dup 1) (match_dup 2))
+  (match_dup 3)))]
+  ""
+  "adds\\t%<w>0, %<w>3, %<w>1, <shift> %2"
+  [(set_attr "type" "alus_shift_imm")]
+)
+
+
 (define_insn "*subs_mul_imm_<mode>"
   [(set (reg:CC_NZ CC_REGNUM)
(compare:CC_NZ
@@ -1571,6 +1588,23 @@
   [(set_attr "type" "alus_shift_imm")]
 )

+(define_insn "*subs_shift_imm_<mode>"
+  [(set (reg:CC_NZ CC_REGNUM)
+(compare:CC_NZ
+ (minus:GPI (match_operand:GPI 1 "register_operand" "r")
+(ASHIFT:GPI
+ (match_operand:GPI 2 "register_operand" "r")
+ (match_operand:QI 3 "aarch64_shift_imm_<mode>" "n")))
+ (const_int 0)))
+   (set (match_operand:GPI 0 "register_operand" "=r")
+(minus:GPI (match_dup 1)
+   (ASHIFT:GPI (match_dup 2) (match_dup 3))))]
+  ""
+  "subs\\t%<w>0, %<w>1, %<w>2, <shift> %3"
+  [(set_attr "type" "alus_shift_imm")]
+)
+
+
 (define_insn "*adds_<optab><ALLX:mode>_<GPI:mode>"
   [(set (reg:CC_NZ CC_REGNUM)
(compare:CC_NZ
@@ -1599,6 +1633,41 @@
   [(set_attr "type" "alus_ext")]
 )

+(define_insn "*adds_<optab><ALLX:mode>_shft_<GPI:mode>"
+  [(set (reg:CC_NZ CC_REGNUM)
+(compare:CC_NZ
+ (plus:GPI (ashift:GPI (ANY_EXTEND:GPI
+(match_operand:ALLX 1 "register_operand" "r"))
+   (match_operand 2 "aarch64_imm3" "Ui3"))
+   (match_operand:GPI 3 "register_operand" "r"))
+(const_int 0)))
+   (set (match_operand:GPI 0 "register_operand" "=rk")
+(plus:GPI (ashift:GPI (ANY_EXTEND:GPI (match_dup 1))
+  (match_dup 2))
+  (match_dup 3)))]
+  ""
+  "adds\\t%<GPI:w>0, %<GPI:w>3, %<GPI:w>1, <su>xt<ALLX:size> %2"
+  [(set_attr "type" "alus_ext")]
+)
+
+(define_insn "*subs_<optab><ALLX:mode>_shft_<GPI:mode>"
+  [(set (reg:CC_NZ CC_REGNUM)
+(compare:CC_NZ
+ (minus:GPI (match_operand:GPI 1 "register_operand" "r")
+(ashift:GPI (ANY_EXTEND:GPI
+(match_operand:ALLX 2 "register_operand" "r"))
+   (match_operand 3 "aarch64_imm3" "Ui3")))
+(const_int 0)))
+   (set (match_operand:GPI 0 "register_operand" "=rk")
+(minus:GPI (match_dup 1)
+(ashift:GPI (ANY_EXTEND:GPI (match_dup 2))
+  (match_dup 3))))]
+  ""
+  "subs\\t%<GPI:w>0, %<GPI:w>1, %<GPI:w>2, <su>xt<ALLX:size> %3"
+  [(set_attr "type" "alus_ext")]
+)
+
+
 (define_insn "*adds_<optab><mode>_multp2"
   [(set (reg:CC_NZ CC_REGNUM)
(compare:CC_NZ
@@ -1909,6 +1978,22 @@
   [(set_attr "type" "alu_ext")]
 )

+(define_insn "*add_uxt<mode>_shift2"
+  [(set (match_operand:GPI 0 "register_operand" "=rk")
+(plus:GPI (and:GPI
+   (ashift:GPI (match_operand:GPI 1 "register_operand" "r")
+ (match_operand 2 "aarch64_imm3" "Ui3"))
+   (match_operand 3 "const_int_operand" "n"))
+  (match_operand:GPI 4 "register_operand" "r")))]
+  "aarch64_uxt_size (INTVAL (operands[2]), INTVAL (operands[3])) != 0"
+  "*
+  operands[3] = GEN_INT (aarch64_uxt_size (INTVAL (operands[2]),
+   INTVAL (operands[3])));
+  return \"add\t%<w>0, %<w>4, %<w>1, uxt%e3 %2\";"
+  [(set_attr "type" "alu_ext")]
+)
+
+
 ;; zero_extend version of above
 (define_insn "*add_uxtsi_multp2_uxtw"
   [(set (match_operand:DI 0 "register_operand" "=rk")
@@ -2165,6 +2

[Bug target/66049] Few AArch64 extend and add with shift tests generates sub optimal code with trunk gcc 6.0.

2015-05-12 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66049

--- Comment #4 from vekumar at gcc dot gnu.org ---
(In reply to ktkachov from comment #3)
> Venkat, are you planning to submit this patch to gcc-patches?
> Also, does this mean we can remove the patterns that do arith+shift using
> MULT rtxes? (like *adds__multp2)

Hi Kyrill, 

Yes, I am planning to submit the patch. But before that I need to test by
putting in some asserts and checking that *adds_<optab><mode>_multp2 and
similar patterns are not used anymore.


[Bug target/66049] [6 regression] Few AArch64 extend and add with shift tests generates sub optimal code with trunk gcc 6.0.

2015-05-15 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66049

--- Comment #6 from vekumar at gcc dot gnu.org ---
(In reply to Ramana Radhakrishnan from comment #5)
> (In reply to vekumar from comment #4)
> > (In reply to ktkachov from comment #3)
> > > Venkat, are you planning to submit this patch to gcc-patches?
> > > Also, does this mean we can remove the patterns that do arith+shift using
> > > MULT rtxes? (like *adds__multp2)
> > 
> > Hi Kyrill, 
> > 
> > Yes I am planing to submit the patch. But before that I need to test by
> > putting some assert and check if *adds__multp2 and similar
> > patterns are not used anymore.
> 
> So this is a regression on GCC 6. what's holding up pushing this patch onto
> gcc-patches@ ?

GCC bootstrap and regression testing completed. I am doing a SPEC 2006 INT run
just to make sure there are no surprises. Will post it in a day or two.


[Bug target/66049] [6 regression] Few AArch64 extend and add with shift tests generates sub optimal code with trunk gcc 6.0.

2015-05-18 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66049

--- Comment #7 from vekumar at gcc dot gnu.org ---
(In reply to ktkachov from comment #3)
> Venkat, are you planning to submit this patch to gcc-patches?
> Also, does this mean we can remove the patterns that do arith+shift using
> MULT rtxes? (like *adds__multp2)

Hi Kyrill, 

I added shift-based patterns for

*adds_<optab><mode>_multp2
*subs_<optab><mode>_multp2
*add_uxt<mode>_multp2
*add_uxtsi_multp2_uxtw
*sub_uxt<mode>_multp2
*sub_uxtsi_multp2_uxtw
*adds_mul_imm_<mode>
*subs_mul_imm_<mode>

I added "gcc_unreachable" to these patterns and gcc bootstrapped, except for
the *add_uxt<mode>_multp2 pattern.


The pattern "*add_uxtdi_multp2" can still be generated. 

/root/work/GCC_Team/vekumar/build-assert-check/./gcc/xgcc
-B/root/work/GCC_Team/vekumar/build-assert-check/./gcc/
-B/root/work/GCC_Team/vekumar/install-assert-check/aarch64-unknown-linux-gnu/bin/
-B/root/work/GCC_Team/vekumar/install-assert-check/aarch64-unknown-linux-gnu/lib/
-isystem
/root/work/GCC_Team/vekumar/install-assert-check/aarch64-unknown-linux-gnu/include
-isystem
/root/work/GCC_Team/vekumar/install-assert-check/aarch64-unknown-linux-gnu/sys-include
   -g -O2 -O2  -g -O2 -DIN_GCC -W -Wall -Wno-narrowing -Wwrite-strings
-Wcast-qual -Wno-format -Wstrict-prototypes -Wmissing-prototypes
-Wold-style-definition  -isystem ./include   -fPIC -g -DIN_LIBGCC2
-fbuilding-libgcc -fno-stack-protector   -fPIC -I. -I. -I../.././gcc
-I../../../gcc-assert-check/libgcc -I../../../gcc-assert-check/libgcc/.
-I../../../gcc-assert-check/libgcc/../gcc
-I../../../gcc-assert-check/libgcc/../include  -DHAVE_CC_TLS  -o _gcov.o -MT
_gcov.o -MD -MP -MF _gcov.dep -DL_gcov -c
../../../gcc-assert-check/libgcc/libgcov-driver.c


(insn 1325 1324 1326 137 (set (reg:DI 725 [ ix ])
(zero_extend:DI (reg/v:SI 197 [ ix ])))
../../../gcc-assert-check/libgcc/libgcov-driver.c:103 73
{*zero_extendsidi2_aarch64}
 (nil))
(insn 1326 1325 1327 137 (set (reg:DI 726)
(plus:DI (reg:DI 725 [ ix ])
(const_int 4 [0x4])))
../../../gcc-assert-check/libgcc/libgcov-driver.c:103 87 {*adddi3_aarch64}
 (expr_list:REG_DEAD (reg:DI 725 [ ix ])
(nil)))
(insn 1327 1326 3536 137 (set (reg/f:DI 727)
(mem/f:DI (plus:DI (mult:DI (reg:DI 726)
(const_int 8 [0x8]))
(reg/v/f:DI 571 [ list ])) [2 MEM[(const struct gcov_info
*)list_372].merge S8 A64]))
../../../gcc-assert-check/libgcc/libgcov-driver.c:103 40 {*movdi_aarch64}


Successfully matched this instruction:
(set (reg/f:DI 727)
(plus:DI (and:DI (mult:DI (subreg:DI (reg/v:SI 197 [ ix ]) 0)
(const_int 8 [0x8]))
(const_int 34359738360 [0x7fff8]))
(reg/v/f:DI 571 [ list ])))

(insn 1326 1325 1327 137 (set (reg:DI 726)
(plus:DI (and:DI (mult:DI (subreg:DI (reg/v:SI 197 [ ix ]) 0)
(const_int 8 [0x8]))
(const_int 34359738360 [0x7fff8]))
(reg/v/f:DI 571 [ list ])))
../../../gcc-assert-check/libgcc/libgcov-driver.c:103 252 {*add_uxtdi_multp2}
 (nil))
(insn 1327 1326 3536 137 (set (reg/f:DI 727)
(mem/f:DI (plus:DI (reg:DI 726)
(const_int 32 [0x20])) [2 MEM[(const struct gcov_info
*)list_372].merge S8 A64]))
../../../gcc-assert-check/libgcc/libgcov-driver.c:103 40 {*movdi_aarch64}

I am going to first send out a patch adding the new shift-based patterns,
then a separate patch to test and remove the mul patterns.


[Bug target/66049] [6 regression] Few AArch64 extend and add with shift tests generates sub optimal code with trunk gcc 6.0.

2015-05-26 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66049

--- Comment #9 from vekumar at gcc dot gnu.org ---
Author: vekumar
Date: Tue May 26 15:32:02 2015
New Revision: 223703

URL: https://gcc.gnu.org/viewcvs?rev=223703&root=gcc&view=rev
Log:
2015-05-26  Venkataramanan Kumar  

PR target/66049
* config/aarch64/aarch64.md
(*adds_shift_imm_<mode>): New pattern.
(*subs_shift_imm_<mode>): Likewise.
(*adds_<optab><ALLX:mode>_shift_<GPI:mode>): Likewise.
(*subs_<optab><ALLX:mode>_shift_<GPI:mode>): Likewise.
(*add_uxt<mode>_shift2): Likewise.
(*add_uxtsi_shift2_uxtw): Likewise.
(*sub_uxt<mode>_shift2): Likewise.
(*sub_uxtsi_shift2_uxtw): Likewise.


Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/aarch64/aarch64.md


[Bug target/66049] [6 regression] Few AArch64 extend and add with shift tests generates sub optimal code with trunk gcc 6.0.

2015-05-26 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66049

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #10 from vekumar at gcc dot gnu.org ---
Fixed at r223703


[Bug target/63949] Aarch64 instruction combiner does not optimize subsi_sxth function as expected (gcc.target/aarch64/extend.c fails)

2015-05-26 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63949

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #12 from vekumar at gcc dot gnu.org ---
Fixed at r222874

 https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=222874

2015-05-07  Venkataramanan Kumar  

* combine.c (make_compound_operation): Remove checks for PLUS/MINUS
rtx type.


[Bug target/67717] New: [6.0 regression] ICE when compiling WRF benchmark from cpu2006 with -Ofast -march=bdver4

2015-09-25 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67717

Bug ID: 67717
   Summary: [6.0 regression] ICE when compiling WRF benchmark from
cpu2006 with -Ofast -march=bdver4
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vekumar at gcc dot gnu.org
  Target Milestone: ---

gfortran -c -o module_cu_gd.fppized.o -I. -I./netcdf/include -march=bdver4
-Ofast -fno-second-underscore module_cu_gd.fppized.f90
module_cu_gd.fppized.f90:1302:0:

(Snip)
END SUBROUTINE CUP_enss
^
Error: insn does not satisfy its constraints:
(insn 8179 8178 14983 589 (parallel [
(set (reg:V4SF 23 xmm2 [orig:2240 vect__1137.6099 ] [2240])
(unspec:V4SF [
(reg:V4SF 23 xmm2 [orig:2240 vect__1137.6099 ] [2240])
(mem:SF (unspec:DI [
(reg/f:DI 4 si [4841])
(reg:V2DI 21 xmm0 [orig:6879
vect__1136.6096 ] [6879])
(const_int 4 [0x4])
] UNSPEC_VSIBADDR) [0  S4 A8])
(mem:BLK (scratch) [0  A8])
(reg:V4SF 23 xmm2 [orig:2240 vect__1137.6099 ] [2240])
] UNSPEC_GATHER))
(clobber (reg:V4SF 23 xmm2 [orig:2240 vect__1137.6099 ] [2240]))
]) module_cu_gd.fppized.f90:1102 4603 {*avx2_gatherdiv4sf}
 (nil))
module_cu_gd.fppized.f90:1302:0: internal compiler error: in
extract_constrain_insn, at recog.c:2200
0xaf0548 _fatal_insn(char const*, rtx_def const*, char const*, int, char
const*)
../../gcc-fsf-trunk/gcc/rtl-error.c:109
0xaf056f _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
../../gcc-fsf-trunk/gcc/rtl-error.c:120
0xabe03d extract_constrain_insn(rtx_insn*)
../../gcc-fsf-trunk/gcc/recog.c:2200
0xa9e185 reload_cse_simplify_operands
../../gcc-fsf-trunk/gcc/postreload.c:408
0xa9f245 reload_cse_simplify
../../gcc-fsf-trunk/gcc/postreload.c:194
0xa9f245 reload_cse_regs_1
../../gcc-fsf-trunk/gcc/postreload.c:233
0xaa0b13 reload_cse_regs
../../gcc-fsf-trunk/gcc/postreload.c:81
0xaa0b13 execute
../../gcc-fsf-trunk/gcc/postreload.c:2350
Please submit a full bug report,
with preprocessed source if appropriate.
(Snip)

I am trying to get a reduced test case.

But the bug seems to start from r227382:

commit 0af99ebfea26293fc900fe9050c5dd514005e4e5
2015-09-01  Vladimir Makarov  

PR target/61578
* lra-lives.c (process_bb_lives): Process move pseudos with the
same value for copies and preferences
* lra-constraints.c (match_reload): Create match reload pseudo
with the same value from single dying input pseudo.


[Bug target/67717] [6.0 regression] ICE when compiling WRF benchmark from cpu2006 with -Ofast -march=bdver4

2015-09-25 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67717

--- Comment #2 from vekumar at gcc dot gnu.org ---
yes reproducible with today's trunk.
gcc version 6.0.0 20150925 (experimental) (GCC)


[Bug target/67717] [6.0 regression] ICE when compiling WRF benchmark from cpu2006 with -Ofast -march=bdver4

2015-09-25 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67717

--- Comment #3 from vekumar at gcc dot gnu.org ---
(In reply to vekumar from comment #2)
> yes reproducible with today's trunk.
> gcc version 6.0.0 20150925 (experimental) (GCC)

I meant the ICE still shows up on trunk.


[Bug target/66171] [6 Regression]: gcc.target/cris/biap.c

2015-10-16 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66171

--- Comment #1 from vekumar at gcc dot gnu.org ---
Yes, the canonical RTL is retained and is LSHIFT here.
Maybe we need to adjust the machine descriptions to be based on shift.


[Bug target/63949] Aarch64 instruction combiner does not optimize subsi_sxth function as expected (gcc.target/aarch64/extend.c fails)

2015-01-01 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63949

--- Comment #6 from vekumar at gcc dot gnu.org ---
In the function make_compound_operation, there is a check:

  /* See if we have operations between an ASHIFTRT and an ASHIFT.
 If so, try to merge the shifts into a SIGN_EXTEND.  We could
 also do this for some cases of SIGN_EXTRACT, but it doesn't
 seem worth the effort; the case checked for occurs on Alpha.  
*/

if (!OBJECT_P (lhs)
  && ! (GET_CODE (lhs) == SUBREG
&& (OBJECT_P (SUBREG_REG (lhs))))
  && CONST_INT_P (rhs)
  && INTVAL (rhs) < HOST_BITS_PER_WIDE_INT
  && INTVAL (rhs) < mode_width
  && (new_rtx = extract_left_shift (lhs, INTVAL (rhs))) != 0)
new_rtx = make_extraction (mode, make_compound_operation (new_rtx,
next_code),
   0, NULL_RTX, mode_width - INTVAL (rhs),
   code == LSHIFTRT, 0, in_code == COMPARE);

  break;



Our input RTL actually matches this case. 
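
For reference, the function being compiled is presumably along the lines of
subsi_sxth from gcc.target/aarch64/extend.c (a sketch; the dg-final directive
is omitted):

---snip---
int
subsi_sxth (int a, short i)
{
  /* Expected to compile to: sub w0, w0, w1, sxth #1  */
  return a - ((int)i << 1);
}
---snip---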

For (int)i << 1 we are getting the incoming RTX as

(ashiftrt:SI (ashift:SI (reg:SI 1 x1 [ i ])
(const_int 16 [0x10]))
(const_int 15 [0xf]))


LHS is ashift:SI (reg:SI 1 x1 [ i ])
(const_int 16 [0x10]) 

RHS is ashiftrt with a value of 15.

So basically we get (i<<16)>>15, and we can merge these shifts into a
sign_extend.

With extract_left_shift we get 

(ashift:SI (reg:SI 1 x1 [ i ])
(const_int 1 [0x1]))

or x1<<1

When we do make_extraction with this shift pattern we get 

(ashift:SI (sign_extend:SI (reg:HI 1 x1 [ i ]))
(const_int 1 [0x1])))


But instead of this shift RTX, we are actually passing the MULT RTX to
make_extraction via another make_compound_operation.

p make_compound_operation(new_rtx,MEM)
$3 = (rtx_def *) 0x777fd420
(gdb) pr
(mult:SI (reg:SI 1 x1 [ i ])
(const_int 2 [0x2]))

Which results in 

 (subreg:SI (sign_extract:DI (mult:DI (reg:DI 1 x1 [ i ])
(const_int 2 [0x2]))
(const_int 17 [0x11])
(const_int 0 [0])) 0)

When I changed the original check to

--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -7896,7 +7896,7 @@ make_compound_operation (rtx x, enum rtx_code in_code)
  && INTVAL (rhs) < HOST_BITS_PER_WIDE_INT
  && INTVAL (rhs) < mode_width
  && (new_rtx = extract_left_shift (lhs, INTVAL (rhs))) != 0)
-   new_rtx = make_extraction (mode, make_compound_operation (new_rtx, next_code),
+   new_rtx = make_extraction (mode, new_rtx,
   0, NULL_RTX, mode_width - INTVAL (rhs),
   code == LSHIFTRT, 0, in_code == COMPARE)

Combiner was able to match the pattern successfully.

Trying 8 -> 13:
Successfully matched this instruction:
(set (reg/i:SI 0 x0)
    (minus:SI (reg:SI 0 x0 [ a ])
        (ashift:SI (sign_extend:SI (reg:HI 1 x1 [ i ]))
            (const_int 1 [0x1]))))

Any comments about this change?


[Bug target/63949] Aarch64 instruction combiner does not optimize subsi_sxth function as expected (gcc.target/aarch64/extend.c fails)

2015-01-02 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63949

--- Comment #7 from vekumar at gcc dot gnu.org ---
I ran the GCC tests against the patch and found one failure.

int
adds_shift_ext ( long long a, int b, int c)
{
 long long  d = (a + ((long long)b << 3));

  if (d == 0)
return a + c;
  else
return b + d + c;
}


The test expects adds generation, and before my fix it was generated:

adds_shift_ext:
        adds    x3, x0, x1, sxtw 3  // 11   *adds_extvdi_multp2 [length = 4]
        beq     .L5                 // 12   *condjump           [length = 4]

But now with my patch I am generating sign-extends with shifts instead of
sign-extends with mul:

adds_shift_ext:
        add     x3, x0, x1, sxtw 3  // 10   *add_extendsi_shft_di [length = 4]
        cbz     x3, .L5             // 12   *cbeqdi1              [length = 4]

We don't have an *adds_extendsi_shft_di pattern. We have patterns like
adds_extvdi_multp2 that extend an operation over a mult.

Adding one will help optimize this case, but my concern is: what if other
targets hit the same issue?


[Bug target/63949] Aarch64 instruction combiner does not optimize subsi_sxth function as expected (gcc.target/aarch64/extend.c fails)

2015-01-06 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63949

--- Comment #8 from vekumar at gcc dot gnu.org ---
This is the complete patch for the first approach I took (comment 6). It fixes
the issues I faced while testing, but I have added extra patterns to cater for
sign-extended operands with left shifts. This might impact other targets as
well :(

I am now also exploring other possibilities instead of writing extra patterns.


diff --git a/gcc/combine.c b/gcc/combine.c
index ee7b3f9..80b345d 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -7896,7 +7896,7 @@ make_compound_operation (rtx x, enum rtx_code in_code)
  && INTVAL (rhs) < HOST_BITS_PER_WIDE_INT
  && INTVAL (rhs) < mode_width
  && (new_rtx = extract_left_shift (lhs, INTVAL (rhs))) != 0)
-   new_rtx = make_extraction (mode, make_compound_operation (new_rtx, next_code),
+   new_rtx = make_extraction (mode, new_rtx,
   0, NULL_RTX, mode_width - INTVAL (rhs),
   code == LSHIFTRT, 0, in_code == COMPARE);

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 97d7009..f0b9240 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1570,26 +1570,62 @@
   [(set_attr "type" "alus_ext")]
 )

-(define_insn "*adds__multp2"
+(define_insn "*adds__extend_ashift"
   [(set (reg:CC_NZ CC_REGNUM)
(compare:CC_NZ
-(plus:GPI (ANY_EXTRACT:GPI
-   (mult:GPI (match_operand:GPI 1 "register_operand" "r")
- (match_operand 2 "aarch64_pwr_imm3" "Up3"))
-   (match_operand 3 "const_int_operand" "n")
-   (const_int 0))
-  (match_operand:GPI 4 "register_operand" "r"))
+(plus:GPI (match_operand:GPI 1 "register_operand" "r")
+  (ashift:GPI (ANY_EXTEND:GPI
+(match_operand:ALLX 2 "register_operand" "r"))
+   (match_operand 3 "aarch64_imm3" "Ui3")))
(const_int 0)))
(set (match_operand:GPI 0 "register_operand" "=r")
-   (plus:GPI (ANY_EXTRACT:GPI (mult:GPI (match_dup 1) (match_dup 2))
-  (match_dup 3)
-  (const_int 0))
- (match_dup 4)))]
+   (plus:GPI  (match_dup 1)
+   (ashift:GPI (ANY_EXTEND:GPI (match_dup 2))
+   (match_dup 3))))]
+  ""
+  "adds\\t%<GPI:w>0, %<GPI:w>1, %<GPI:w>2, <su>xt<ALLX:size> %3"
+  [(set_attr "type" "alus_ext")]
+)
+
+(define_insn "*subs__extend_ashift"
+  [(set (reg:CC_NZ CC_REGNUM)
+(compare:CC_NZ
+ (minus:GPI (match_operand:GPI 1 "register_operand" "r")
+(ashift:GPI (ANY_EXTEND:GPI
+  (match_operand:ALLX 2 "register_operand" "r"))
+ (match_operand 3 "aarch64_imm3" "Ui3")))
+(const_int 0)))
+   (set (match_operand:GPI 0 "register_operand" "=r")
+(minus:GPI (match_dup 1)
+   (ashift:GPI (ANY_EXTEND:GPI (match_dup 2))
+(match_dup 3))))]
+  ""
+  "subs\\t%<GPI:w>0, %<GPI:w>1, %<GPI:w>2, <su>xt<ALLX:size> %3"
+  [(set_attr "type" "alus_ext")]
+)
+
+
+(define_insn "*adds__multp2"
+  [(set (reg:CC_NZ CC_REGNUM)
+(compare:CC_NZ
+ (plus:GPI (ANY_EXTRACT:GPI
+(mult:GPI (match_operand:GPI 1 "register_operand" "r")
+  (match_operand 2 "aarch64_pwr_imm3" "Up3"))
+(match_operand 3 "const_int_operand" "n")
+(const_int 0))
+   (match_operand:GPI 4 "register_operand" "r"))
+(const_int 0)))
+   (set (match_operand:GPI 0 "register_operand" "=r")
+(plus:GPI (ANY_EXTRACT:GPI (mult:GPI (match_dup 1) (match_dup 2))
+   (match_dup 3)
+   (const_int 0))
+  (match_dup 4)))]
   "aarch64_is_extend_from_extract (<MODE>mode, operands[2], operands[3])"
   "adds\\t%<w>0, %<w>4, %<w>1, <su>xt%e3 %p2"
   [(set_attr "type" "alus_ext")]
 )

+
 (define_insn "*subs__multp2"
   [(set (reg:CC_NZ CC_REGNUM)
(compare:CC_NZ


[Bug rtl-optimization/64537] New: Aarch64 redundant sxth instruction gets generated

2015-01-08 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64537

Bug ID: 64537
   Summary: Aarch64 redundant sxth instruction gets generated
   Product: gcc
   Version: 5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vekumar at gcc dot gnu.org

For the test case below, a redundant sxth instruction gets generated.

int
adds_shift_ext ( long long a, short b, int c)
{
 long long  d = (a - ((long long)b << 3));

  if (d == 0)
return a + c;
  else
return b + d + c;
}


adds_shift_ext:
        sxth    w1, w1              // 3  *extendhisi2_aarch64/1  [length = 4] <== 1
        subs    x3, x0, x1, sxth 3  // 11 *subs_extvdi_multp2     [length = 4] <== 2
        beq     .L5                 // 12 *condjump               [length = 4]
        add     w0, w1, w2          // 19 *addsi3_aarch64/2       [length = 4]
        add     w0, w0, w3          // 20 *addsi3_aarch64/2       [length = 4]
        ret                         // 57 simple_return           [length = 4]
        .p2align 2
.L5:
        add     w0, w2, w0          // 14 *addsi3_aarch64/2       [length = 4]
        ret                         // 55 simple_return           [length = 4]

The instruction marked <== 1 is not needed.


[Bug target/63949] Aarch64 instruction combiner does not optimize subsi_sxth function as expected (gcc.target/aarch64/extend.c fails)

2015-01-09 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63949

--- Comment #10 from vekumar at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #9)
> A MULT by a constant power of 2 is not canonical RTL (well, not what
> simplify_rtx would give you); combine shouldn't generate this.


In that case, we think we need to fix this heuristic in
"make_compound_operation", which assumes a "MEM" context when we encounter
a MINUS RTX.


/* Select the code to be used in recursive calls.  Once we are inside an
  address, we stay there.  If we have a comparison, set to COMPARE,
  but once inside, go back to our default of SET.  */

   next_code = (code == MEM ? MEM
: ((code == PLUS || code == MINUS)
   && SCALAR_INT_MODE_P (mode)) ? MEM
: ((code == COMPARE || COMPARISON_P (x))
   && XEXP (x, 1) == const0_rtx) ? COMPARE
: in_code == COMPARE ? SET : in_code);


 And later on, make_compound_operation converts the shift pattern to a MULT:

 case ASHIFT:
   /* Convert shifts by constants into multiplications if inside
      an address.  */
   if (in_code == MEM && CONST_INT_P (XEXP (x, 1))
       && INTVAL (XEXP (x, 1)) < HOST_BITS_PER_WIDE_INT
       && INTVAL (XEXP (x, 1)) >= 0
       && SCALAR_INT_MODE_P (mode))
     {
       ...
       new_rtx = gen_rtx_MULT (mode, new_rtx,
                               gen_int_mode (multval, mode));
     }


[Bug sanitizer/63850] Building TSAN for Aarch64 results in assembler Error

2015-01-18 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63850

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 CC||vekumar at gcc dot gnu.org

--- Comment #4 from vekumar at gcc dot gnu.org ---

We did some changes in the Makefile and configure under libsanitizer to make
it build for AArch64.

As clyon said, the local experiments were done in a GCC tree. Just capturing
the changes here.

Ref:
https://git.linaro.org/toolchain/gcc.git/commit/07178f9e98be4fc1efad5c5d7c4fed7627c17e1f


[Bug sanitizer/63850] Building TSAN for Aarch64 results in assembler Error

2015-01-19 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63850

--- Comment #7 from vekumar at gcc dot gnu.org ---
(In reply to clyon from comment #6)
> Venkat,
> Can you submit your GCC patch, in an accepable way? (no change to sanitizer
> libs code, and obviously do not activate tsan by default)

Okay, I will send out a patch that modifies libsanitizer/configure.ac and
libsanitizer/tsan/Makefile.am so that tsan_rtl_amd64.S is separated out from
getting built for other targets.


--- a/libsanitizer/tsan/tsan_rtl.h
+++ b/libsanitizer/tsan/tsan_rtl.h
@@ -679,7 +679,7 @@ void AcquireReleaseImpl(ThreadState *thr, uptr pc,
SyncClock *c);
 // The trick is that the call preserves all registers and the compiler
 // does not treat it as a call.
 // If it does not work for you, use normal call.
-#if TSAN_DEBUG == 0
+#if defined(__x86_64__) && TSAN_DEBUG == 0

Is this change also acceptable?


[Bug tree-optimization/64946] New: For Aarch64, vectorization with "abs" instruction is not happening with vector elements of char/short type.

2015-02-05 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64946

Bug ID: 64946
   Summary: For Aarch64,  vectorization with "abs" instruction is
not happening with vector elements of char/short type.
   Product: gcc
   Version: 5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vekumar at gcc dot gnu.org

For the below test case.

signed char a[100], b[100];
void absolute_s8 (void)
{
  int i;
  for (i = 0; i < 16; i++)
    a[i] = (b[i] > 0 ? b[i] : -b[i]);
}

gcc version 5.0.0 20150203 (experimental) (GCC) with  -O3 -S on
aarch64-none-linux-gnu generates the following assembly 

absolute_s8:
        adrp    x1, b
        adrp    x0, a
        add     x1, x1, :lo12:b
        add     x0, x0, :lo12:a
        ldr     q0, [x1]          <== loads vector of 16 char elements
        sshll   v1.8h, v0.8b, 0   <==
        sshll2  v0.8h, v0.16b, 0  <==
        sshll   v3.4s, v1.4h, 0   <==
        sshll   v2.4s, v0.4h, 0   <==
        sshll2  v1.4s, v1.8h, 0   <==
        sshll2  v0.4s, v0.8h, 0   <== promotes every element to "int"
        abs     v3.4s, v3.4s      <== performs abs as a vector of ints
        abs     v2.4s, v2.4s
        abs     v1.4s, v1.4s
        abs     v0.4s, v0.4s
        xtn     v4.4h, v3.4s
        xtn2    v4.8h, v1.4s
        xtn     v1.4h, v2.4s
        xtn2    v1.8h, v0.4s
        xtn     v0.8b, v4.8h
        xtn2    v0.16b, v1.8h
        str     q0, [x0]
        ret

Vectorization is done in int (SI) mode, although AArch64 supports
abs v0.16b, v0.16b.

Expected code 

absolute_s8:
        adrp    x1, b
        adrp    x0, a
        add     x1, x1, :lo12:b
        add     x0, x0, :lo12:a
        ldr     q0, [x1]        <== loads vector of 16 char elements
        abs     v0.16b, v0.16b  <== abs on a vector of chars
        str     q0, [x0]
        ret


[Bug tree-optimization/64946] For Aarch64, vectorization with "abs" instruction is not hapenning with vector elements of char/short type.

2015-02-05 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64946

--- Comment #1 from vekumar at gcc dot gnu.org ---
The test case is from gcc.target/aarch64/vect-abs-compile.c


[Bug tree-optimization/64946] [AArch64] gcc.target/aarch64/vect-abs-compile.c - "abs" vectorization fails for char/short types

2015-02-25 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64946

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #3 from vekumar at gcc dot gnu.org ---
Richard, 

As per your suggestion, adding a pattern for type demotion in match.pd solves
this.

(simplify
 (convert (abs (convert@1 @0)))
 (if (INTEGRAL_TYPE_P (type)
      /* We check for type compatibility between @0 and @1 below,
	 so there's no need to check that @1/@3 are integral types.  */
      && INTEGRAL_TYPE_P (TREE_TYPE (@0))
      && INTEGRAL_TYPE_P (TREE_TYPE (@1))
      /* The precision of the type of each operand must match the
	 precision of the mode of each operand, similarly for the
	 result.  */
      && (TYPE_PRECISION (TREE_TYPE (@0))
	  == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (@0))))
      && (TYPE_PRECISION (TREE_TYPE (@1))
	  == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (@1))))
      && TYPE_PRECISION (type) == GET_MODE_PRECISION (TYPE_MODE (type))
      /* The inner conversion must be a widening conversion.  */
      && TYPE_PRECISION (TREE_TYPE (@1)) > TYPE_PRECISION (TREE_TYPE (@0))
      && ((GENERIC
	   && (TYPE_MAIN_VARIANT (TREE_TYPE (@0))
	       == TYPE_MAIN_VARIANT (type)))
	  || (GIMPLE
	      && types_compatible_p (TREE_TYPE (@0), type))))
  (abs @0)))


I have not yet tested it. Will it have implications for targets that do not
support vectorization with short/char types?


[Bug tree-optimization/64946] [AArch64] gcc.target/aarch64/vect-abs-compile.c - "abs" vectorization fails for char/short types

2015-02-25 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64946

--- Comment #6 from vekumar at gcc dot gnu.org ---
(In reply to Andrew Pinski from comment #5)
> I think you should always use an unsigned type here so it will be defined in
> the IR.  This is mentioned in bug 22199#c3 .

Andrew, I missed including something like this

+  (if (TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
+   (convert (op @0 @1)))

as in  https://gcc.gnu.org/viewcvs?rev=220695&root=gcc&view=rev


[Bug tree-optimization/64946] [AArch64] gcc.target/aarch64/vect-abs-compile.c - "abs" vectorization fails for char/short types

2015-02-26 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64946

--- Comment #9 from vekumar at gcc dot gnu.org ---
This match.pd pattern vectorizes the PR but works only with -fwrapv.

(simplify
 (convert (abs (convert@1 @0)))
 (if (INTEGRAL_TYPE_P (type)
      /* We check for type compatibility between @0 and @1 below,
	 so there's no need to check that @1/@3 are integral types.  */
      && INTEGRAL_TYPE_P (TREE_TYPE (@0))
      && INTEGRAL_TYPE_P (TREE_TYPE (@1))
      /* The precision of the type of each operand must match the
	 precision of the mode of each operand, similarly for the
	 result.  */
      && (TYPE_PRECISION (TREE_TYPE (@0))
	  == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (@0))))
      && (TYPE_PRECISION (TREE_TYPE (@1))
	  == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (@1))))
      && TYPE_PRECISION (type) == GET_MODE_PRECISION (TYPE_MODE (type))
      /* The inner conversion must be a widening conversion.  */
      && TYPE_PRECISION (TREE_TYPE (@1)) > TYPE_PRECISION (TREE_TYPE (@0))
      && ((GENERIC
	   && (TYPE_MAIN_VARIANT (TREE_TYPE (@0))
	       == TYPE_MAIN_VARIANT (type)))
	  || (GIMPLE
	      && types_compatible_p (TREE_TYPE (@0), type))))
  (if (TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
   (abs @0))))

For the default case (when no -fwrapv is given), doing ABS_EXPR on a short
type would invoke undefined behaviour when the value is -32768, and
similarly for the signed char minimum.

As per Richard's suggestion, we need to move to a new tree code ABSU_EXPR
to do this type of folding optimization.
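
To make the hazard concrete, here is a minimal illustration (my own
example, not from the PR): computing abs in the wider type is well
defined, while an abs done directly in the narrow type would overflow
for the most negative value.

#include <stdlib.h>

/* abs is computed in int, so abs (-32768) == 32768 is representable.
   An ABS_EXPR performed directly on short would overflow for
   SHRT_MIN, which is why the fold needs wrapping semantics (-fwrapv)
   or a dedicated ABSU_EXPR.  */
short narrow_abs (short s)
{
  return (short) abs ((int) s);
}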


[Bug c/65287] Current trunk ICE in address_matters_p, at symtab.c:1908

2015-03-03 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65287

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 CC||vekumar at gcc dot gnu.org

--- Comment #3 from vekumar at gcc dot gnu.org ---
Also faced this bug when I tried to cross-compile for aarch64-none-linux-gnu.

[Bug target/63949] New: Aarch64 instruction combiner does not optimize subsi_sxth function as expected

2014-11-19 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63949

Bug ID: 63949
   Summary: Aarch64 instruction combiner does not optimize
subsi_sxth function as expected
   Product: gcc
   Version: 5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vekumar at gcc dot gnu.org

Reference: https://bugs.linaro.org/show_bug.cgi?id=863

Test case 

int   subsi_sxth (int a, short  i)
{
  /* { dg-final { scan-assembler "sub\tw\[0-9\]+,.*sxth #?1" } } */
  return a - ((int)i << 1);
}

Assembly generated with GCC 5.0.0 20141114

subsi_sxth:
sbfiz   w1, w1, 1, 16
sub w0, w0, w1
ret

Expected 
   sub     w0, w0, w1, sxth 1


The combiner says it failed to match:

(set (reg/i:SI 0 x0)
     (minus:SI (reg:SI 0 x0 [ a ])
               (subreg:SI (sign_extract:DI (mult:DI (reg:DI 1 x1 [ i ])
                                                    (const_int 2 [0x2]))
                                           (const_int 17 [0x11])
                                           (const_int 0 [0])) 0)))

We have a pattern in the aarch64.md file that should match, but it is not
recognized.

(define_insn "*sub__multp2"
   [(set (match_operand:GPI 0 "register_operand" "=rk")
 (minus:GPI (match_operand:GPI 4 "register_operand" "r")
(ANY_EXTRACT:GPI
 (mult:GPI (match_operand:GPI 1 "register_operand" "r")
   (match_operand 2 "aarch64_pwr_imm3" "Up3"))
 (match_operand 3 "const_int_operand" "n")
 (const_int 0]
   "aarch64_is_extend_from_extract (mode, operands[2], operands[3])"
   "sub\\t%0, %4, %1, xt%e3 %p2"
   [(set_attr "type" "alu_ext")]
 )


[Bug bootstrap/61440] Bootstrap failure with --with-build-config=bootstrap-lto

2014-12-01 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61440

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 CC||vekumar at gcc dot gnu.org

--- Comment #5 from vekumar at gcc dot gnu.org ---
Yes, same issue. A workaround would be to configure with
--enable-stage1-checking=release, as mentioned in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62077#c54



[Bug target/63949] Aarch64 instruction combiner does not optimize subsi_sxth function as expected (gcc.target/aarch64/extend.c fails)

2014-12-19 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63949

--- Comment #4 from vekumar at gcc dot gnu.org ---
(In reply to Richard Earnshaw from comment #3)
> make_extraction is unable to generate bit-field extractions in more than one
> mode.  This causes the extractions that it does generate to be wrapped in
> subregs when SImode results are wanted.
> 
> Ideally, we should teach make_extraction to be more sensible, but I'm not
> sure what the impact of that would be on other ports that really can only
> support one mode for bit-field extracts.

Yes, make_extraction converts the mult to a sign_extract RTL:

from  (mult:SI (reg:SI 1 x1 [ i ])
               (const_int 2 [0x2]))

to    (subreg:SI (sign_extract:DI (ashift:DI (reg:DI 1 x1 [ i ])
                                             (const_int 1 [0x1]))
                                  (const_int 17 [0x11])
                                  (const_int 0 [0x0])) 0)


[Bug target/63949] Aarch64 instruction combiner does not optimize subsi_sxth function as expected (gcc.target/aarch64/extend.c fails)

2014-12-19 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63949

--- Comment #5 from vekumar at gcc dot gnu.org ---
Richard, what is the function get_best_reg_extraction_insn supposed to do in
make_extraction?


[Bug bootstrap/68667] New: GCC trunk build fails compiling graphite-isl-ast-to-gimple.c

2015-12-02 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68667

Bug ID: 68667
   Summary: GCC trunk build fails compiling
graphite-isl-ast-to-gimple.c
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: bootstrap
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vekumar at gcc dot gnu.org
  Target Milestone: ---
Target: x86_64-*-*

Build breaks while compiling graphite related file.
Occurred with trunk r231212.

(Snip)
../../gcc-fsf-trunk/gcc/graphite-isl-ast-to-gimple.c: In member function
'tree_node* translate_isl_ast_to_gimple::binary_op_to_tree(tree,
isl_ast_expr*, ivs_params&)':
../../gcc-fsf-trunk/gcc/graphite-isl-ast-to-gimple.c:591:10: error:
'isl_ast_op_zdiv_r' was not declared in this scope
 case isl_ast_op_zdiv_r:
      ^
../../gcc-fsf-trunk/gcc/graphite-isl-ast-to-gimple.c: In member function
'tree_node* translate_isl_ast_to_gimple::gcc_expression_from_isl_expr_op(tree,
isl_ast_expr*, ivs_params&)':
../../gcc-fsf-trunk/gcc/graphite-isl-ast-to-gimple.c:762:10: error:
'isl_ast_op_zdiv_r' was not declared in this scope
 case isl_ast_op_zdiv_r:
      ^
Makefile:1085: recipe for target 'graphite-isl-ast-to-gimple.o' failed
make[2]: *** [graphite-isl-ast-to-gimple.o] Error 1
(Snip)

[Bug bootstrap/68667] GCC trunk build fails compiling graphite-isl-ast-to-gimple.c

2015-12-02 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68667

--- Comment #1 from vekumar at gcc dot gnu.org ---
(In reply to vekumar from comment #0)
> Build breaks while compiling graphite related file.
> Occurred with trunk r231212.
> 
> (Snip) 
> ../../gcc-fsf-trunk/gcc/graphite-isl-ast-to-gimple.c: In member function
> 'tree_node* translate_isl_ast_to_gimple::binary_op_to_tree(tree,
> isl_ast_expr*, ivs_params&)':
> ../../gcc-fsf-trunk/gcc/graphite-isl-ast-to-gimple.c:591:10: error:
> 'isl_ast_op_zdiv_r' was not declared in this scope
>  case isl_ast_op_zdiv_r:
>       ^
> ../../gcc-fsf-trunk/gcc/graphite-isl-ast-to-gimple.c: In member function
> 'tree_node*
> translate_isl_ast_to_gimple::gcc_expression_from_isl_expr_op(tree,
> isl_ast_expr*, ivs_params&)':
> ../../gcc-fsf-trunk/gcc/graphite-isl-ast-to-gimple.c:762:10: error:
> 'isl_ast_op_zdiv_r' was not declared in this scope
>  case isl_ast_op_zdiv_r:
>       ^
> Makefile:1085: recipe for target 'graphite-isl-ast-to-gimple.o' failed
> make[2]: *** [graphite-isl-ast-to-gimple.o] Error 1
> (Snip)

I ran ./contrib/download_prerequisites in the gcc folder and it downloaded
isl-0.14.

[Bug tree-optimization/68417] [6 Regression] Missed vectorization opportunity when setting struct field

2015-12-09 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68417

--- Comment #4 from vekumar at gcc dot gnu.org ---
Older trunk (gcc version 6.0.0 20151202 (experimental) (GCC)) showed:
  iftmp.1_19 = p1_36->y;
  tree could trap...

Today's trunk (gcc version 6.0.0 20151209)
Applying if-conversion
new phi replacement stmt
iftmp.1_6 = m1_16 >= m2_18 ? iftmp.1_19 : iftmp.1_20;
Removing basic block 4
basic block 4, loop depth 1

[Bug tree-optimization/68417] [6 Regression] Missed vectorization opportunity when setting struct field

2015-12-09 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68417

--- Comment #5 from vekumar at gcc dot gnu.org ---
Richard,

STMT: m1 = p1->x - m;

While hashing p1->x, which is a component ref, we hash the operand 0 part,
i.e. TREE_OPERAND (ref, 0). This is unconditionally read and written.

STMT: p3->y = (m1 >= m2) ? p1->y : p2->y;
Now for p1->y, the master DR is what we hashed previously for p1->x, since
p1->y is also a component ref.

Fix for PR 68583
https://gcc.gnu.org/viewcvs?rev=231444&root=gcc&view=rev

It checks whether the candidate DR is unconditionally accessed outside and
is also a read-only access.

 /* If a is unconditionally accessed then ... */
  if (DR_RW_UNCONDITIONALLY (*master_dr))
{
  /* an unconditional read won't trap.  */
  if (DR_IS_READ (a))
return true;

This holds true now, and hence if-conversion is applied.

Since this is expected, can we close this PR as fixed?

[Bug tree-optimization/58135] [x86] Missed opportunities for partial SLP

2015-12-10 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58135

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2015-12-10
 Ever confirmed|0   |1

[Bug middle-end/68621] [6 Regression] FAIL: gcc.dg/tree-ssa/ifc-8.c scan-tree-dump-times ifcvt "Applying if-conversion" 1

2016-01-13 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68621

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 CC||vekumar at gcc dot gnu.org

--- Comment #1 from vekumar at gcc dot gnu.org ---
We do this optimization under -fno-common. In the case of -fpic this option
does not have any effect, and the array declaration is assumed to be
overridable.

[Bug target/65951] [AArch64] Will not vectorize 64bit integer multiplication

2015-07-09 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65951

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 CC||vekumar at gcc dot gnu.org

--- Comment #6 from vekumar at gcc dot gnu.org ---
I found a similar pattern in the SPEC2006 hmmer benchmark when comparing
x86_64 (-O3 -march=bdver3) vs. AArch64 (-O3 -mcpu=cortex-a57). x86_64 was
able to vectorize 5 additional loops. Out of the 5 loops, two were
cost-model related: AArch64 rejects them because the vector cost runs too
high.

The remaining three loops are of this pattern. One used the constant 104;
the other two used multiplication by 4, which could be converted to vector
shifts.

I made a simple test case and wanted to open a PR, but James pointed me to
this PR, so I am posting it as comments.


unsigned long int __attribute__ ((aligned (64))) arr[100];
int i;

void test_vector_shifts ()
{
  for (i = 0; i <= 99; i++)
    arr[i] = arr[i] << 2;
}

void test_vectorshift_via_mul ()
{
  for (i = 0; i <= 99; i++)
    arr[i] = arr[i] * 4;
}

Assembly

        .cpu cortex-a57+fp+simd+crc
        .file   "test.c"
        .text
        .align  2
        .p2align 4,,15
        .global test_vector_shifts
        .type   test_vector_shifts, %function
test_vector_shifts:
        adrp    x0, arr
        add     x0, x0, :lo12:arr
        adrp    x1, arr+800
        add     x1, x1, :lo12:arr+800
        .p2align 2
.L2:
        ldr     q0, [x0]
        shl     v0.2d, v0.2d, 2  <== vector shifts
        str     q0, [x0], 16
        cmp     x0, x1
        bne     .L2
        adrp    x0, i
        mov     w1, 100
        str     w1, [x0, #:lo12:i]
        ret
        .size   test_vector_shifts, .-test_vector_shifts
        .align  2
        .p2align 4,,15
        .global test_vectorshift_via_mul
        .type   test_vectorshift_via_mul, %function
test_vectorshift_via_mul:
        adrp    x0, arr
        add     x0, x0, :lo12:arr
        adrp    x2, arr+800
        add     x2, x2, :lo12:arr+800
        .p2align 2
.L6:
        ldr     x1, [x0]
        lsl     x1, x1, 2
        str     x1, [x0], 8  <== scalar shifts
        cmp     x0, x2
        bne     .L6
        adrp    x0, i
        mov     w1, 100
        str     w1, [x0, #:lo12:i]
        ret
        .size   test_vectorshift_via_mul, .-test_vectorshift_via_mul


[Bug target/65952] [AArch64] Will not vectorize storing induction of pointer addresses for LP64

2015-07-20 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65952

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 CC||vekumar at gcc dot gnu.org

--- Comment #9 from vekumar at gcc dot gnu.org ---
As per Richards, suggestion I added a pattern in vector recog.
This seems to vectorize the this PR. 

However I need some help on the following  

(1)How do I check the shift amount and also care about type/signedness.  There
could be different shift amounts allowed in the target architecture when
looking for power 2 constants.

(2)Should I need to check if target architecture supports vectorized shifts
before converting the pattern?

---Patch---
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index f034635..995c9b2 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -76,6 +76,10 @@ static gimple vect_recog_vector_vector_shift_pattern (vec<gimple> *,
                                                       tree *, tree *);
 static gimple vect_recog_divmod_pattern (vec<gimple> *,
                                          tree *, tree *);
+
+static gimple vect_recog_multconst_pattern (vec<gimple> *,
+                                            tree *, tree *);
+
 static gimple vect_recog_mixed_size_cond_pattern (vec<gimple> *,
                                                   tree *, tree *);
 static gimple vect_recog_bool_pattern (vec<gimple> *, tree *, tree *);
@@ -90,6 +94,7 @@ static vect_recog_func_ptr vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
        vect_recog_rotate_pattern,
        vect_recog_vector_vector_shift_pattern,
        vect_recog_divmod_pattern,
+       vect_recog_multconst_pattern,
        vect_recog_mixed_size_cond_pattern,
        vect_recog_bool_pattern};

@@ -2147,6 +2152,90 @@ vect_recog_vector_vector_shift_pattern (vec<gimple> *stmts,
   return pattern_stmt;
 }

+static gimple
+vect_recog_multconst_pattern (vec<gimple> *stmts,
+                              tree *type_in, tree *type_out)
+{
+  gimple last_stmt = stmts->pop ();
+  tree oprnd0, oprnd1, vectype, itype, cond;
+  gimple pattern_stmt, def_stmt;
+  enum tree_code rhs_code;
+  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
+  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
+  bb_vec_info bb_vinfo = STMT_VINFO_BB_VINFO (stmt_vinfo);
+  optab optab;
+  tree q;
+  int dummy_int, prec;
+  stmt_vec_info def_stmt_vinfo;
+
+  if (!is_gimple_assign (last_stmt))
+    return NULL;
+
+  rhs_code = gimple_assign_rhs_code (last_stmt);
+  switch (rhs_code)
+    {
+    case MULT_EXPR:
+      break;
+    default:
+      return NULL;
+    }
+
+  if (STMT_VINFO_IN_PATTERN_P (stmt_vinfo))
+    return NULL;
+
+  oprnd0 = gimple_assign_rhs1 (last_stmt);
+  oprnd1 = gimple_assign_rhs2 (last_stmt);
+  itype = TREE_TYPE (oprnd0);
+  if (TREE_CODE (oprnd0) != SSA_NAME
+      || TREE_CODE (oprnd1) != INTEGER_CST
+      || TREE_CODE (itype) != INTEGER_TYPE
+      || TYPE_PRECISION (itype) != GET_MODE_PRECISION (TYPE_MODE (itype)))
+    return NULL;
+
+  vectype = get_vectype_for_scalar_type (itype);
+  if (vectype == NULL_TREE)
+    return NULL;
+
+  /* If the target can handle the vectorized multiplication natively,
+     don't attempt to optimize this.  */
+  optab = optab_for_tree_code (rhs_code, vectype, optab_default);
+  if (optab != unknown_optab)
+    {
+      machine_mode vec_mode = TYPE_MODE (vectype);
+      int icode = (int) optab_handler (optab, vec_mode);
+      if (icode != CODE_FOR_nothing)
+        return NULL;
+    }
+
+  prec = TYPE_PRECISION (itype);
+  if (integer_pow2p (oprnd1))
+    {
+      /* if (TYPE_UNSIGNED (itype) || tree_int_cst_sgn (oprnd1) != 1)
+           return NULL;  */
+
+      /* Pattern detected.  */
+      if (dump_enabled_p ())
+        dump_printf_loc (MSG_NOTE, vect_location,
+                         "vect_recog_multconst_pattern: detected:\n");
+
+      tree shift;
+
+      shift = build_int_cst (itype, tree_log2 (oprnd1));
+      pattern_stmt
+        = gimple_build_assign (vect_recog_temp_ssa_var (itype, NULL),
+                               LSHIFT_EXPR, oprnd0, shift);
+      if (dump_enabled_p ())
+        dump_gimple_stmt_loc (MSG_NOTE, vect_location, TDF_SLIM, pattern_stmt,
+                              0);
+
+      stmts->safe_push (last_stmt);
+
+      *type_in = vectype;
+      *type_out = vectype;
+      return pattern_stmt;
+    }
+  return NULL;
+}
 /* Detect a signed division by a constant that wouldn't be
otherwise vectorized:

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 48c1f8d..833fe4b 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1131,7 +1131,7 @@ extern void vect_slp_transform_bb (basic_block);
Additional pattern recognition functions can (and will) be added
in the future.  */
 typedef gimple (* vect_recog_func_ptr) (vec<gimple> *, tree *, tree *);
-#defin

[Bug target/65952] [AArch64] Will not vectorize storing induction of pointer addresses for LP64

2015-07-20 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65952

--- Comment #10 from vekumar at gcc dot gnu.org ---
With the patch I get:

loop:
        adrp    x0, array
        ldr     q1, .LC0
        ldr     q2, .LC1
        adrp    x1, ptrs
        add     x1, x1, :lo12:ptrs
        ldr     x0, [x0, #:lo12:array]
        dup     v0.2d, x0
        add     v1.2d, v0.2d, v1.2d <== vectorized
        add     v0.2d, v0.2d, v2.2d <== vectorized
        str     q1, [x1]
        str     q0, [x1, 16]
        ret
        .size   loop, .-loop
        .align  4
.LC0:
        .xword  0
        .xword  16
        .align  4
.LC1:
        .xword  32
        .xword  48


[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2015-08-10 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 65952, which changed state.

Bug 65952 Summary: [AArch64] Will not vectorize storing induction of pointer 
addresses for LP64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65952

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED


[Bug target/65952] [AArch64] Will not vectorize storing induction of pointer addresses for LP64

2015-08-10 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65952

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #11 from vekumar at gcc dot gnu.org ---

This is fixed by the patch
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=226675

Vectorize mult expressions with power-of-2 constants via shift, for targets
that have no vector multiplication support.

2015-08-06  Venkataramanan Kumar  

* tree-vect-patterns.c (vect_recog_mult_pattern): New function
for vectorizing multiplication patterns.
* tree-vectorizer.h: Adjust the number of patterns.

2015-08-06  Venkataramanan Kumar  

* gcc.dg/vect/vect-mult-pattern-1.c: New test.
* gcc.dg/vect/vect-mult-pattern-2.c: New test.
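
As an illustration of what the committed pattern does (my own sketch, not
one of the new tests), a loop like the following is now vectorized via
shifts on targets that lack a vector multiply for the element type:

unsigned long a[128];

/* a[i] * 8 is recognized by vect_recog_mult_pattern as a[i] << 3,
   so the loop can be vectorized with vector shifts.  */
void scale_by_8 (void)
{
  int i;
  for (i = 0; i < 128; i++)
    a[i] = a[i] * 8;
}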


[Bug tree-optimization/54803] Rotates are not vectorized

2015-08-13 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54803

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 CC||vekumar at gcc dot gnu.org

--- Comment #5 from vekumar at gcc dot gnu.org ---
On bdver4, when we enable -march=bdver4 and -mno-prefer-avx128, GCC
vectorizes using YMM registers; otherwise it uses the vprotq instruction.

.L13:
        vmovdqa (%r8,%r9), %ymm0
        incq    %rax
        vpsrlq  $32, %ymm0, %ymm1
        vpsllq  $32, %ymm0, %ymm0
        vpor    %ymm0, %ymm1, %ymm0
        vmovdqa %ymm0, (%rdx,%r9)
        addq    $32, %r9
        cmpq    %rax, %r10
        ja      .L13
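
For reference, a loop of this shape (my own sketch, not taken from the PR)
is the kind of rotate that produces the vpsllq/vpsrlq/vpor sequence above
when no rotate instruction is used:

unsigned long long r[256];

/* 64-bit rotate by 32; the vectorizer expands it into a left shift,
   a right shift and an or unless a vector rotate is available.  */
void rotate_all (void)
{
  int i;
  for (i = 0; i < 256; i++)
    r[i] = (r[i] << 32) | (r[i] >> 32);
}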


[Bug tree-optimization/54803] Rotates are not vectorized

2015-08-13 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54803

--- Comment #6 from vekumar at gcc dot gnu.org ---
(In reply to vekumar from comment #5)
> On bdver4, when we enable -march=bdver4 and -mno-prefer-avx128, GCC
> vectorizes using YMM registers; otherwise it uses the vprotq instruction.
> 
> .L13:
>         vmovdqa (%r8,%r9), %ymm0
>         incq    %rax
>         vpsrlq  $32, %ymm0, %ymm1
>         vpsllq  $32, %ymm0, %ymm0
>         vpor    %ymm0, %ymm1, %ymm0
>         vmovdqa %ymm0, (%rdx,%r9)
>         addq    $32, %r9
>         cmpq    %rax, %r10
>         ja      .L13


This is with trunk gcc version 6.0.0 20150810 (experimental) (GCC)


[Bug tree-optimization/67326] [5/6 Regression] -ftree-loop-if-convert-stores does not vectorize conditional assignment (anymore)

2015-08-25 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67326

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 CC||vekumar at gcc dot gnu.org

--- Comment #2 from vekumar at gcc dot gnu.org ---
Hi Richard,

As a first step, I am trying to allow if-conversion under
-ftree-loop-if-convert-stores for cases where we know the location is
already accessed (read) unconditionally once outside the condition, and the
memory access is both read and written:

__attribute__((aligned(32))) float a[LEN];
void test ()
{
  for (int i = 0; i < LEN; i++)
    {
      if (a[i] > (float)0.)  /* <== already read here unconditionally */
        a[i] = 3;            /* <== if we know it is a read-and-write memory
                                    access, we can allow if-conversion */
    }
}

As you said, for the cases in this PR we need to enhance the if-conversion
pass to do bounds checking of the accesses to array "a" using the values
of i.


[Bug tree-optimization/71992] New: Missed BB SLP vectorization in GCC

2016-07-25 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71992

Bug ID: 71992
   Summary: Missed BB SLP vectorization in GCC
   Product: gcc
   Version: tree-ssa
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vekumar at gcc dot gnu.org
  Target Milestone: ---

The below test case fails to vectorize.
gcc version 7.0.0 20160724 (experimental) (GCC)

gcc -Ofast -mavx -fvect-cost-model=unlimited slp.c -S -fdump-tree-slp-all

struct st
{
double x;
double y;
double z;
double p;
double q;
}*obj;

double a,b,c;

void slp_test()
{

obj->x = a*a+3.0;
obj->y= b*b+c;
obj->z= a+b*3.0;
obj->p= a+b*3.0;
obj->q =a+b+c;

}

LLVM is able to SLP vectorize; it looks like it is creating the vectors
[a, c] and [b*3.0, b*b] and doing a vector add.

GCC is not SLP vectorizing, and group splitting is also not working. I
expected the group to get split so that these statements are vectorized:

  obj->z = a+b*3.0;
  obj->p = a+b*3.0;
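
For illustration, a hand-vectorized sketch of those two identical
statements (my own example using SSE2 intrinsics; this is not LLVM's or
GCC's actual output):

#include <emmintrin.h>

struct st2 { double x, y, z, p, q; };   /* same layout as struct st */

/* z and p are adjacent in memory, so the pair
     obj->z = a + b*3.0;
     obj->p = a + b*3.0;
   can be covered by one 128-bit add and one unaligned store.  */
void slp_pair (struct st2 *obj, double a, double b)
{
  __m128d va = _mm_set1_pd (a);
  __m128d vb = _mm_set1_pd (b * 3.0);
  _mm_storeu_pd (&obj->z, _mm_add_pd (va, vb));
}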

Another case 

struct st
{
double x;
double y;
double z;
double p;
double q;
}*obj;

double a,b,c;

void slp_test()
{

obj->x = a*b;
obj->y= b+c;
obj->z= a+b*3.0;
obj->p= a+b*3.0;
obj->q =a+b+c;

}


LLVM forms the vectors [b*3.0, a+b] and [a, c] and does a vector addition.

[Bug target/77270] Flag -mprftchw is shared with 3dnow for -march=k8

2016-08-21 Thread vekumar at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77270

vekumar at gcc dot gnu.org changed:

   What|Removed |Added

 CC||vekumar at gcc dot gnu.org

--- Comment #8 from vekumar at gcc dot gnu.org ---
There are 2 issues.

#issue1
-mprfchw should be enabled only for targets that support 3DNowprefetch.
On K8, 3DNowprefetch is not available, and -march=k8 should not set this
flag.

I can see the behavior is now corrected with Uros' fix, although I still
have to verify the changes done for the other targets.

#issue2
The prefetchw ISA is also available in 3DNow!. Generating prefetchw in the
GCC backend is functionally correct if write prefetches are requested.

Looking at the test case to see why write prefetches are requested:

void f () {
  extern int size;
  int i;
  float *fvec;
  float *fptr = (float *) get ();
  for (i = 0; i < size; ++i)
    fvec[i] = fptr[i];
  get ();
}

I have to keep one more call statement so that the definition of "fvec" is
not killed. prefetchw is generated for the memory stores via fvec; they are
write-only accesses.
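
For comparison, write prefetches can also be requested explicitly (a sketch
of mine, not from the bug): __builtin_prefetch with rw == 1 asks for a
write prefetch, which targets with PRFCHW or 3DNow! can implement as
prefetchw.

/* Explicitly request a write prefetch; on targets with PRFCHW
   (or 3DNow!) GCC can emit prefetchw for this access.  */
void warm_for_write (float *p)
{
  __builtin_prefetch (p, 1, 3);   /* rw = 1 (write), locality = 3 */
  p[0] = 0.0f;
}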

[Bug tree-optimization/118380] New: GCC is not optimizing computation and code with avx intrinsics.

2025-01-08 Thread vekumar at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118380

Bug ID: 118380
   Summary: GCC is not optimizing computation and code with avx
intrinsics.
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vekumar at gcc dot gnu.org
  Target Milestone: ---

For the test case in the given link 
https://godbolt.org/z/MP88MaTva

LLVM is able to optimize away the loop and the computations completely;
GCC is not able to do so.

The arrays are defined locally, which may not be the case in a real-world
application. Nevertheless, GCC could also optimize this case.