[Bug tree-optimization/63302] [4.9 Regression] Code with 64-bit long long constants is miscompiled on 32-bit host

2014-10-12 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63302

--- Comment #19 from Zhenqiang Chen  ---
(In reply to John David Anglin from comment #18)
> Hi Zhenqiang,
> 
> Do you plan to submit patch to gcc-patches soon?

Yes. It is in the internal review process. I hope to send it out this week.


[Bug rtl-optimization/63210] ira does not select the best register compared with gcc 4.8 for ARM THUMB1

2014-10-28 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63210

--- Comment #4 from Zhenqiang Chen  ---
(In reply to Ramana Radhakrishnan from comment #3)
> Fixed is it? And does it fail in GCC 4.9 ?

Fixed on trunk. The same failure is present in GCC 4.9.

It is a performance issue. Do you think it is OK for 4.9?


[Bug target/61578] Code size increase for ARM thumb compared to 4.8.x when compiling with -Os

2014-11-02 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61578

--- Comment #11 from Zhenqiang Chen  ---
> Added updated CSiBE benchmark for GCC 4.8, GCC 4.9 and trunk.
> It's obvious that excluding ip gives shorted code.
> Then there is something on trunk that makes some project become very large,
> which should be investigated perhaps.

When compiling CSiBE with trunk, please add option "-std=gnu89".


[Bug tree-optimization/63743] New: Thumb1: big regression for float operators by r216728

2014-11-05 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63743

Bug ID: 63743
   Summary: Thumb1: big regression for float operators by r216728
   Product: gcc
   Version: 5.0
Status: UNCONFIRMED
  Severity: critical
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: zhenqiang.chen at arm dot com

Created attachment 33887
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33887&action=edit
test case

Root cause: fold_stmt swaps the operands, which leads to a register shuffle.

commit f619ecaed41d1487091098a0f4fdf4d6ed1fa379
Author: rguenth 
Date:   Mon Oct 27 11:30:23 2014 +

2014-10-27  Richard Biener  

* tree-ssa-forwprop.c: Include tree-cfgcleanup.h and tree-into-ssa.h.
(lattice): New global.
(fwprop_ssa_val): New function.
(fold_all_stmts): Likewise.
(pass_forwprop::execute): Finally fold all stmts.

* gcc.dg/tree-ssa/forwprop-6.c: Scan ccp1 dump instead.
* gcc.dg/strlenopt-8.c: Adjust and XFAIL for non_strict_align
target due to memcpy inline-expansion.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@216728
138bc75d-0d04-0410-961f-82ee72b054a4

A simplified case is attached.

Options: -mthumb -Os -mcpu=cortex-m0

Before the patch, the tree code looks like:

_20 = _14 + _19;
_21 = _20 * x_13;

After the patch, the tree code looks like:

_20 = _14 + _19;
_21 = x_13 * _20;

Without hard-FPU support, all floating-point operators are lowered to library
calls. The assembly code changes like this:

Before the patch,
bl  __aeabi_dadd
ldr r2, [sp]
ldr r3, [sp, #4]
/* r0, r1 are reused from the return values of the previous call. */
bl  __aeabi_dmul

After the patch,
bl  __aeabi_dadd
mov r2, r0
mov r3, r1
ldr r0, [sp]
ldr r1, [sp, #4]
bl  __aeabi_dmul


[Bug tree-optimization/63743] Thumb1: big regression for float operators by r216728

2014-11-05 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63743

--- Comment #2 from Zhenqiang Chen  ---
(In reply to Jakub Jelinek from comment #1)
> Were we swapping operands before?  I mean, if you rewrite the testcase to
> swap the * arguments in the source, did you get the same more efficient code
> in the past?

Yes. I tried the test case:

double
test1 (double x, double y)
{
  return x * (x + y);
}
double
test2 (double x, double y)
{
  return (x + y) * x;
}

Without r216728, I got efficient code for both functions. But with r216728, I
got inefficient code for both functions.


[Bug tree-optimization/63743] Thumb1: big regression for float operators by r216728

2014-11-05 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63743

--- Comment #4 from Zhenqiang Chen  ---
(In reply to Jakub Jelinek from comment #3)
> (In reply to Zhenqiang Chen from comment #2)
> > (In reply to Jakub Jelinek from comment #1)
> > > Were we swapping operands before?  I mean, if you rewrite the testcase to
> > > swap the * arguments in the source, did you get the same more efficient 
> > > code
> > > in the past?
> > 
> > Yes. I tried the test case:
> > 
> > double
> > test1 (double x, double y)
> > {
> >   return x * (x + y);
> > }
> > double
> > test2 (double x, double y)
> > {
> >   return (x + y) * x;
> > }
> > 
> > Without r216728, I got efficient codes for both functions. But with r216728,
> > I got inefficient codes for both functions.
> 
> What about
> double
> test3 (double x, double y)
> {
>   return (x + y) * (x - y);
> }
> ?  At least from quick looking at ppc -msoft-float -O2 -m32, I see the same
> issue there, add called first, sub called second, and result of second
> returned in the same registers as used for the first argument. So something
> to handle at expansion or RA rather than in GIMPLE anyway IMHO.

The same issue occurs for this case on Thumb1.


[Bug rtl-optimization/63917] [5 Regression] r217646 caused many failures

2014-11-17 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63917

--- Comment #2 from Zhenqiang Chen  ---
Thanks for the report. I can reproduce the FAIL on IA32. Since the patch has no
impact at -O0, it seems some library functions are wrong. I will investigate
it.


[Bug rtl-optimization/63917] [5 Regression] r217646 caused many failures

2014-11-18 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63917

--- Comment #3 from Zhenqiang Chen  ---
Root cause: r217646 enhances ifcvt to handle the cbranchcc4 instruction. But
ifcvt does not check whether an insn clobbers CC when moving it before the
cbranchcc4 instruction.

I will work out a patch to fix it.
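The hazard can be sketched in plain C (a hypothetical illustration of the
failure class, not one of the bug's test cases):

```c
#include <assert.h>

/* Hypothetical illustration: after a compare sets CC, if-conversion may
   try to hoist the arithmetic for one arm above the conditional branch.
   On targets like IA32, the "add" itself clobbers the flags, so placing
   it between the compare and the branch corrupts CC and changes which
   arm is taken.  The C semantics are unaffected; the bug is purely in
   where the RTL insn is moved. */
int select_sum (int a, int b)
{
  if (a < b)        /* compare sets CC; the branch reads it */
    return a + b;   /* hoisting this add above the branch is unsafe */
  return b;
}
```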


[Bug middle-end/61225] [5 Regression] Several new failures after r210458 on x86_64-*-* with -m32

2014-11-21 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61225

--- Comment #15 from Zhenqiang Chen  ---
Thank you for the reminder. I will rework the patch.


[Bug target/64015] [5.0 Regression] AArch64 ICE due to conditional compare

2014-11-23 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64015

--- Comment #2 from Zhenqiang Chen  ---
Do you force it into a register? In fact, I tend not to force it into a
register in gen_ccmp_next, since that introduces more overhead for ccmp, and
the performance may be worse.

My patch to fix the issue is at:
https://gcc.gnu.org/ml/gcc-patches/2014-11/msg02966.html

For CCMP, we still miss two optimizations:
1) Change the order of the compares. In this case, if you change the condition
to

  b > 252 && a > 10

you don't need "mov w0, 252":

uxtb w1, w1
uxtb w0, w0
cmp  w1, 252
ccmp w0, 10, 0, hi
cset w0, hi
ret

2) How do we justify that ccmp is worthwhile (i.e. that its overhead is
acceptable) when generating it?
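A sketch of the reordering in point 1 (the original operand order and the
unsigned-char types are assumptions inferred from the uxtb instructions above):
AArch64 ccmp encodes only a 5-bit immediate, so 252 cannot live in ccmp's
immediate field, while 10 can; comparing b first lets the plain cmp carry the
252. Both orderings are semantically identical for && without side effects.

```c
#include <assert.h>

/* Assumed original order: the 252 ends up in ccmp, which cannot encode
   it as an immediate, forcing a "mov w0, 252". */
int cond_original (unsigned char a, unsigned char b)
{
  return a > 10 && b > 252;
}

/* Reordered: cmp takes the 252 (12-bit immediate), ccmp takes the 10
   (fits in ccmp's 5-bit immediate), so no extra mov is needed. */
int cond_reordered (unsigned char a, unsigned char b)
{
  return b > 252 && a > 10;
}
```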


[Bug target/64015] [5.0 Regression] AArch64 ICE due to conditional compare

2014-11-23 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64015

--- Comment #5 from Zhenqiang Chen  ---
It seems you always win with ccmp. Please go ahead with your patch and make
sure the following case works.

int
test (unsigned short a, unsigned char b)
{
  return a > 0xfff2 && b > 252;
}

Thanks!
-Zhenqiang


[Bug target/64015] [5.0 Regression] AArch64 ICE due to conditional compare

2014-11-26 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64015

--- Comment #7 from Zhenqiang Chen  ---
Sorry for blocking your benchmark tests. I have reverted the ccmp patch.

I will rework the patch based on Richard Henderson's comments:
https://gcc.gnu.org/ml/gcc-patches/2014-11/msg03100.html


[Bug middle-end/61225] [5 Regression] Several new failures after r210458 on x86_64-*-* with -m32

2014-12-07 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61225

--- Comment #16 from Zhenqiang Chen  ---
Two threads about combine and compare-elim are still under discussion:

[PATCH] Fix PR 61225
https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00558.html
https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00577.html
https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00578.html
https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00579.html
https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00612.html

Compare-elim pass (was: Re: [PATCH] Fix PR 61225)
https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00581.html


[Bug rtl-optimization/59535] [4.9 regression] -Os code size regressions for Thumb1/Thumb2 with LRA

2014-09-02 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59535

Zhenqiang Chen  changed:

   What|Removed |Added

 CC||zhenqiang.chen at arm dot com

--- Comment #20 from Zhenqiang Chen  ---
Here is a small case showing that LRA introduces one more register copy (tested
with trunk and 4.9).

int isascii (int c)
{
  return c >= 0 && c < 128;
}
With options: -Os -mthumb -mcpu=cortex-m0, I got

isascii:
mov r3, #0
mov r2, #127
mov r1, r3   //???
cmp r2, r0
adc r1, r1, r3
mov r0, r1
bx  lr

With options: -Os -mthumb -mcpu=cortex-m0 -mno-lra, I got

isascii:
mov r2, #127
mov r3, #0
cmp r2, r0
adc r3, r3, r3
mov r0, r3
bx  lr
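As a sketch of what both sequences compute (my reading of the asm, not stated
in the report): the signed double test c >= 0 && c < 128 folds to a single
unsigned comparison, which Thumb1 materializes as cmp (setting the carry flag)
followed by adc into a zeroed register; the extra register move under LRA is
the redundant copy at issue.

```c
#include <assert.h>

/* Both ranges tests collapse to one unsigned compare: a negative c
   becomes a large unsigned value, so (unsigned) c <= 127 covers both
   c >= 0 and c < 128 in a single test. */
int isascii_folded (int c)
{
  return (unsigned) c <= 127u;
}
```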


[Bug rtl-optimization/63210] New: ira does not select the best register compared with gcc 4.8 for ARM THUMB1

2014-09-08 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63210

Bug ID: 63210
   Summary: ira does not select the best register compared with
gcc 4.8 for ARM THUMB1
   Product: gcc
   Version: 5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: zhenqiang.chen at arm dot com

Here is a case showing that IRA does not select the best register, compared
with gcc 4.8, for ARM Cortex-M0 with options:

-Os -mthumb -mcpu=cortex-m0

int foo1 (int c);
int foo2 (int c);

int test (int c)
{
  return (foo1 (c) || foo2 (c));
}

Its RTL looks like:

2: r115:SI=r0:SI
7: r0:SI=r115:SI
8: r0:SI=call [`foo1'] argc:0
9: r111:SI=r0:SI
4: r110:SI=0x1
   10: pc={(r111:SI!=0)?L17:pc}
   12: r0:SI=r115:SI
   13: r0:SI=call [`foo2'] argc:0
   14: r112:SI=r0:SI
   16: {r110:SI=r112:SI!=0;clobber r118:SI;}
   17: L17:
   23: r0:SI=r110:SI

For gcc 4.8, r115 is assigned first, which gets "r4" since

  Allocno a3r115 of GENERAL_REGS(9) has 4 avail. regs  4-7, ...

Then r110 is assigned to "r0". "r0:SI=r110:SI" can be optimized.

But for trunk/4.9, r110 is assigned first. r110 conflicts with r115, and the
conflict cost of "r0" is high since "r0" is not in "avail. regs  4-7" for r115.
So r110 is not assigned "r0".


[Bug rtl-optimization/63210] ira does not select the best register compared with gcc 4.8 for ARM THUMB1

2014-09-08 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63210

--- Comment #1 from Zhenqiang Chen  ---
Here is a workaround patch to show the point.

diff --git a/gcc/ira-color.c b/gcc/ira-color.c
index e2ea359..1573fb5 100644
--- a/gcc/ira-color.c
+++ b/gcc/ira-color.c
@@ -1709,6 +1709,8 @@ assign_hard_reg (ira_allocno_t a, bool retry_p)
 {
   ira_allocno_t conflict_a = OBJECT_ALLOCNO (conflict_obj);
   enum reg_class conflict_aclass;
+  HARD_REG_SET prof_regs;
+  prof_regs = ALLOCNO_COLOR_DATA (conflict_a)->profitable_hard_regs;

   /* Reload can give another class so we need to check all
  allocnos.  */
@@ -1780,7 +1782,7 @@ assign_hard_reg (ira_allocno_t a, bool retry_p)
 hard_regno = ira_class_hard_regs[aclass][j];
 ira_assert (hard_regno >= 0);
 k = ira_class_hard_reg_index[conflict_aclass][hard_regno];
-if (k < 0)
+if (k < 0 || !TEST_HARD_REG_BIT (prof_regs, hard_regno))
   continue;
 full_costs[j] -= conflict_costs[k];
   }

For this case, "r0" is not available for r115, so the conflict for r110 on "r0"
may be meaningless.


[Bug tree-optimization/63302] [4.9 Regression] Code with 64-bit long long constants is miscompiled on 32-bit host

2014-09-27 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63302

--- Comment #6 from Zhenqiang Chen  ---
I double-checked the function optimize_range_tests_diff. Overall, I think it
does the right thing. x86 and ARM work correctly. The ldil.c.169t.optimized dump is

  :
  x_2 = ival_1(D) & -2147481601;
  _8 = x_2 + 2147483648;
  _9 = _8 & -2147483649;
  _10 = _9 == 0;
  _6 = (int) _10;
  return _6;

If we cannot fix the wide-int issue, I will create a patch to work around it
for 4.9, since I never expected optimize_range_tests_diff to work between a
positive value and a negative value.
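The fold under discussion can be sanity-checked in plain C (a sketch of the
predicates, not the GCC implementation; the constants are taken from the dumps
in this report):

```c
#include <assert.h>
#include <stdint.h>

/* The range test excludes both 0 and INT_MIN from a 32-bit value.
   The direct test, the mask form seen in the 4.9 dump, and the
   add-and-mask form seen on trunk are all equivalent as predicates;
   the 4.9 miscompile was in how the constants were computed on a
   32-bit host, not in the identity itself. */
int excl_direct (int32_t x)  { return x != 0 && x != INT32_MIN; }
int excl_mask (int32_t x)    { return (x & 0x7fffffff) != 0; }
int excl_addmask (int32_t x)
{
  int64_t t = (int64_t) x + 2147483648LL;   /* x + 0x80000000 */
  return (t & -2147483649LL) != 0;          /* mask is ~0x80000000 in 64 bits */
}
```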


[Bug tree-optimization/63302] [4.9 Regression] Code with 64-bit long long constants is miscompiled on 32-bit host

2014-09-28 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63302

--- Comment #10 from Zhenqiang Chen  ---
(In reply to dave.anglin from comment #8)
> On 28-Sep-14, at 10:34 AM, dave.anglin at bell dot net wrote:
> 
> > This is what I see on the trunk, but 4.9 is wrong.  Possibly, there is
> > a transformation
> > after optimize_range_tests_diff where things go wrong on 4.9.
> 
> It seems the difference actually occurs in reassoc1.
> 
> 4.9:
> 
>:
>x_2 = ival_1(D) & -2147481601;
>_3 = x_2 == 0;
>_8 = x_2 & 2147483647;
>_9 = _8 == 0;
>_4 = x_2 == -2147483648;
>_5 = _9;
>_6 = (int) _5;
>return _6;

This is not optimized by my patch: optimize_range_tests_diff

> trunk:
> 
>:
>x_2 = ival_1(D) & -2147481601;
>_3 = x_2 == 0;
>_8 = x_2 + 2147483648;
>_9 = _8 & -2147483649;
>_10 = _9 == 0;
>_4 = x_2 == -2147483648;
>_5 = _10;
>_6 = (int) _5;
>return _6;

This is optimized by the patch. optimize_range_tests_diff is the only change I
added in the patch; the other changes were just reorganization.

Can you show more detailed dumps with -fdump-tree-reassoc1-details?

Thanks!
-Zhenqiang


[Bug tree-optimization/63302] [4.9 Regression] Code with 64-bit long long constants is miscompiled on 32-bit host

2014-09-28 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63302

--- Comment #13 from Zhenqiang Chen  ---
For 4.9, some function optimizes the code as:

Optimizing range tests x_2 -[-2147483648, -2147483648] and -[0, 0]
 into (x_2 & 2147483647) != 0

For trunk, optimize_range_tests_diff optimizes the code as:

Optimizing range tests x_2 -[-2147483648, -2147483648] and -[0, 0]
 into (x_2 + 2147483648 & -2147483649) != 0

I tried to remove "need_64bit_hwint=yes" from config.gcc to build i686. But it
does not build. So I still need your help to identify the root cause.

Can you try removing this call in tree-ssa-reassoc.c?

  any_changes |= optimize_range_tests_1 (opcode, first, length, true,
 ops, ranges);
And add a printf at that point:

  printf ("changes: %d\n", any_changes);


[Bug tree-optimization/63302] [4.9 Regression] Code with 64-bit long long constants is miscompiled on 32-bit host

2014-09-28 Thread zhenqiang.chen at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63302

--- Comment #14 from Zhenqiang Chen  ---
Created attachment 33608
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33608&action=edit
patch

After investigation, I found that I misused tree_log2.

Please try the attached patch. If it works, I will run all tests and send it
for community review.

Thanks!