ipa vrp implementation in gcc

2016-01-10 Thread Kugan
Hi All,

I am looking at implementing an ipa vrp pass. Jan Hubicka also talked
about this at the 2013 GNU Cauldron as one of the optimizations he would
like to see in gcc. So my question is: is anyone implementing it? If
not, we would like to do that.

I also looked at the ipa-cp implementation to see how this can be done.
Going by this, one way to implement it is (skipping all the
details):

- Have an early tree-vrp so that we can have value ranges for parameters
at call sites.

- Create jump functions that capture the value ranges at call sites and
propagate these value ranges.

- In his 2013 talk, Jan Hubicka talks about modifying ipa-prop.[h|c] to
handle this, but wouldn't it be easier to have a separate and much
simpler implementation?

- Once we have the value ranges for parameters/return values, we could
rely on tree-vrp to use them and do the optimizations.
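
To make the goal concrete, here is a small hand-written example (mine,
not from the original mail) of what the propagated ranges would enable:

/* Hypothetical example: both call sites pass values in [0, 9], so IPA
   VRP would annotate use()'s parameter with that range and tree-vrp
   in the callee could then fold the bounds check away.  */
static int use (int i)
{
  if (i < 0 || i > 9)  /* provably false given the range [0, 9] */
    return -1;
  return i * 2;
}

int caller (void)
{
  return use (3) + use (7);
}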


Does this make any sense? Any thoughts/suggestions to work on this is
highly appreciated.

Thanks,
Kugan


Re: ipa vrp implementation in gcc

2016-01-17 Thread Kugan

> Hello, I am Vivek Pandya. I am actually working on a GSoC 2016 proposal
> for this work, and it is very similar to extending the ipa-cp pass. I am
> also in touch with Jan Hubicka.

Hi Vivek,

Glad to know that you are planning to work on this. Could you please put
your plan somewhere accessible (or post it here) so that we know what
you are planning? That way we can work on what you are not working on,
and also possibly contribute in other ways (like testing and
reviewing).

Thanks,
Kugan


Re: ipa vrp implementation in gcc

2016-01-17 Thread Kugan
Hi,

> Another potential use of value ranges is the profile estimation. 
> http://www.lighterra.com/papers/valuerangeprop/Patterson1995-ValueRangeProp.pdf
> It seems to me that we may want to have something that can feed sane loop
> bounds for profile estimation as well and we can easily store the known
> value ranges to SSA name annotations.
> So I think a separate local pass to compute value ranges (perhaps with less
> accuracy than full-blown VRP) is desirable.

Thanks for the reference. I am looking at implementing a local pass for
VRP. The value range computation in tree-vrp is based on the above
reference and uses ASSERT_EXPR insertion (I understand that you posted
the reference above for profile estimation). As Richard mentioned in his
reply, the local pass should not rely on ASSERT_EXPR insertion.
Therefore, do you have any specific algorithm in mind (i.e., any
published paper or book reference)? Of course we can tweak the
algorithm from the reference above, but I would like to understand what
your intentions are.


> I think the ipa-prop.c probably won't need any significant changes.  The 
> code basically analyzes what values are passed through the function and
> this works for constants as well as for intervals. In fact ipa-cp already
> uses the same ipa-prop analysis for 
>  1) constant propagation
>  2) alignment propagation
>  3) propagation of known polymorphic call contexts.
> 
> So replacing 1) by value range propagation should be easily doable. 
> I would also like to replace alignment propagation by bitwise constant
> propagation (i.e. propagating what bits are known to be zero and what
> bits are known to be one). We already do have bitwise CCP, so we could
> deal with this basically in the same way as we deal with value ranges.
> 
> ipa-prop could use a bit of cleaning up and modularizing that I hope will
> be done next stage1 :)

We (Prathamesh and I) are interested in working on LTO
improvements. Let us have a look at this.

>>
>>> - Once we have the value ranges for parameter/return values, we could
>>> rely on tree-vrp to use this and do the optimizations
>>
>> Yep.  IPA transform phase should annotate parameter default defs with
>> computed ranges.
> 
> Yep, in addition we will end up with known value ranges stored in aggregates;
> for that we need a better separate representation.
> 
> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68930
>>

Thanks,
Kugan


Re: ipa vrp implementation in gcc

2016-02-09 Thread kugan



On 19/01/16 04:10, Jan Hubicka wrote:

In general, given that we have existing VRP implementation I would suggest
first implementing the IPA propagation and profile estimation bits using
existing VRP pass and then try to compare the simple dominator based approach
with the VRP we have and see what are the compile time/code quality effects
of both. Based on that we can decide how complex VRP we really want.

It will probably also be more fun to implement it this way :)
I plan to collect some data on early VRP and firefox today or tomorrow.



Thanks. I started experimenting with it; a prototype patch is attached. I 
haven't tested it in any detailed way yet. This is just to understand 
the LTO side and see how we can implement it.



I wanted to set the value range of a parameter based on ipa-vrp. For 
example:


extern void foo (int);

void bar (unsigned long l)
{
  foo(l == 0);
}

void bar2 (unsigned long l)
{
  foo(l & 0x2);
}


unsigned long x;

int main()
{
  x = 0;
  bar (x);
  x = 1;
  bar (x);
  x = 3;
  bar2 (x);
  x = 5;
  bar2 (x);
}


In the above case, I wanted the value range of foo's parameter SSA_NAME
to be set to [0, 2] (bar passes l == 0, i.e. [0, 1]; bar2 passes
l & 0x2, i.e. [0, 2]; their meet is [0, 2]). As can be seen from the
ipa-cp dump (attached), this is now happening. Any comments? I also have
some questions:



1. I think even if we are not going to use tree-vrp for 
intra-procedural value range propagation, we can factor out some of its 
routines and share them. Any thoughts on this?



2. Is the DOM-based intra-procedural prototype Richard Biener 
implemented available anywhere? Can you please point me to it?



Thanks,
Kugan



IPA structures before propagation:

Function parameters:
  function  foo/6 parameter descriptors:
param #0 used undescribed_use
  function  main/3 parameter descriptors:
  function  bar2/1 parameter descriptors:
param #0 used undescribed_use
  function  bar/0 parameter descriptors:
param #0 used undescribed_use

Jump functions:
  Jump functions of caller  __builtin_puts/7:
  Jump functions of caller  foo/6:
callsite  foo/6 -> __builtin_puts/7 : 
   param 0: CONST: &"test"[0]
 Alignment: 1, misalignment: 0
  Jump functions of caller  main/3:
callsite  main/3 -> foo/6 : 
   param 0: CONST: 0
 Unknown alignment
callsite  main/3 -> foo/6 : 
   param 0: CONST: 2
 Unknown alignment
callsite  main/3 -> foo/6 : 
   param 0: CONST: 0
 Unknown alignment
callsite  main/3 -> foo/6 : 
   param 0: CONST: 1
 Unknown alignment
  Jump functions of caller  bar2/1:
callsite  bar2/1 -> foo/6 : 
   param 0: UNKNOWN
 Unknown alignment
  Jump functions of caller  bar/0:
callsite  bar/0 -> foo/6 : 
   param 0: UNKNOWN
 Unknown alignment

 Propagating constants:

Not considering foo for cloning; -fipa-cp-clone disabled.
Marking all lattices of foo/6 as BOTTOM
Not considering main for cloning; -fipa-cp-clone disabled.
Marking all lattices of main/3 as BOTTOM
Not considering bar2 for cloning; -fipa-cp-clone disabled.
Marking all lattices of bar2/1 as BOTTOM
Not considering bar for cloning; -fipa-cp-clone disabled.
Marking all lattices of bar/0 as BOTTOM

overall_size: 34, max_new_size: 11001

Estimating effects for bar2/1, base_time: 14.

Estimating effects for bar/0, base_time: 14.
Meeting
  [0, 2]
and
  [0, 1]
to
  [0, 2]

Estimating effects for foo/6, base_time: 6.

IPA lattices after all propagation:

Lattices:
  Node: foo/6:
param [0]: BOTTOM
 ctxs: BOTTOM
 Alignment unusable (BOTTOM)
 [0, 2]
 AGGS BOTTOM
  Node: main/3:
  Node: bar2/1:
param [0]: BOTTOM
 ctxs: BOTTOM
 Alignment unusable (BOTTOM)
 UNDEFINED
 AGGS BOTTOM
  Node: bar/0:
param [0]: BOTTOM
 ctxs: BOTTOM
 Alignment unusable (BOTTOM)
 UNDEFINED
 AGGS BOTTOM

IPA decision stage:


Evaluating opportunities for bar2/1.

Evaluating opportunities for bar/0.

Evaluating opportunities for foo/6.

IPA constant propagation end

Reclaiming functions:
Reclaiming variables:
Clearing address taken flags:
Symbol table:

puts/7 (__builtin_puts) @0x7ffa9ff50730
  Type: function
  Visibility: external public
  References: 
  Referring: 
  Availability: not_available
  First run: 0
  Function flags:
  Called by: foo/6 (0.19 per call) 
  Calls: 
foo/6 (foo) @0x7ffa9ff505c0
  Type: function definition analyzed
  Visibility: externally_visible public
  References: 
  Referring: 
  Read from file: t1.o
  Availability: available
  First run: 0
  Function flags:
  Called by: bar/0 (1.00 per call) bar2/1 (1.00 per call) main/3 (1.00 per 
call) main/3 (1.00 per call) main/3 (1.00 per call) main/3 (1.00 per call) 
  Calls: puts/7 (0.19 per call) 
x/2 (x) @0x7ffa9ff51000
  Type: variable definition analyzed
  Visibility: externally_visible public common
  References: 
  Referring: main/3 (write)main/3 (write)main/3 (write)main/3 (write)
  Read from file: t2.o
  Availability: overwritable
  

Re: ipa vrp implementation in gcc

2016-03-19 Thread kugan

On 18/01/16 20:42, Richard Biener wrote:

I have (very incomplete) prototype patches to do a dominator-based
approach instead (what is referred to downthread as the non-iterating
approach). That's cheaper and is what I'd like to provide as a "utility
style" interface to things like niter analysis which need range info based
on a specific dominator (the loop header for example).


I am not sure if this is still of interest for GSoC. In the meantime, I 
was looking at intra-procedural early VRP as suggested.


If I understand this correctly, we have to traverse the dominator tree, 
forming subregions (or scopes) where a variable will have a certain 
range. We would have to record the ranges in a subregion (scope) context 
and use them to derive more ranges (for any operation whose operands 
have known ranges). We will have to keep the contexts on a stack. We 
also have to handle loop index variables. A sketch of such a walk 
follows the example below.


For example,
void bar1 (int, int);
void bar2 (int, int);
void bar3 (int, int);
void bar4 (int, int);

void foo (int a, int b)
{
  int t = 0;

  //region 1
  if (a < 10)
{
  //region 2
  if (b > 10)
{
  //region 3
  bar1 (a, b);
}
  else
{
  //region 4
  bar2 (a, b);
}
}
  else
{
  //region 5
  bar3 (a, b);
}

  bar4 (a, b);
}
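
A rough hand-written sketch of such a dominator walk (value_range here
and the helpers record_ranges_from_cond/process_stmts are stand-ins I
made up, not the real GCC API):

/* Walk BB and its dominated blocks, keeping the ranges valid in the
   current scope on STACK; pop them when leaving the scope.  */
static void
vrp_dom_walk (basic_block bb, vec<value_range> *stack)
{
  size_t mark = stack->length ();

  /* Push ranges implied by the condition controlling entry to BB,
     e.g. "a < 10" when entering region 2 above.  */
  record_ranges_from_cond (bb, stack);

  /* Use everything currently on the stack to derive ranges for the
     results of statements in BB.  */
  process_stmts (bb, stack);

  for (basic_block son = first_dom_son (CDI_DOMINATORS, bb);
       son;
       son = next_dom_son (CDI_DOMINATORS, son))
    vrp_dom_walk (son, stack);

  /* Leaving the scope: drop the ranges recorded for BB.  */
  stack->truncate (mark);
}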


I am also wondering whether we should split live ranges to get 
better value ranges (for the example shown above).


Thanks,
Kugan


fstrict-enums and value ranges in VRP

2016-06-01 Thread kugan

Hi All,

When I compile the following code with g++ using -fstrict-enums and -O2

enum v
{
  OK = 0,
  NOK = 1,
};

int foo0 (enum v a)
{
  if (a > NOK)
return 0;
  return 1;
}

vrp1 dump looks like:
Value ranges after VRP:

a.0_1: VARYING
_2: [0, 1]
a_3(D): VARYING


int foo0(v) (v a)
{
  int a.0_1;
  int _2;

  <bb 2>:
  a.0_1 = (int) a_3(D);
  if (a.0_1 > 1)
    goto <bb 4>;
  else
    goto <bb 3>;

  <bb 3>:

  <bb 4>:
  # _2 = PHI <0(2), 1(3)>
  return _2;

}

Should we infer value ranges for the enum in this case, since 
-fstrict-enums is given, and optimize accordingly?


@item -fstrict-enums
@opindex fstrict-enums
Allow the compiler to optimize using the assumption that a value of
enumerated type can only be one of the values of the enumeration (as
defined in the C++ standard; basically, a value that can be
represented in the minimum number of bits needed to represent all the
enumerators).  This assumption may not be valid if the program uses a
cast to convert an arbitrary integer value to the enumerated type.
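
If so, the range of a_3(D) would be [0, 1] rather than VARYING, the
comparison would fold to false, and foo0 would reduce to the following
(my expectation, not current gcc behaviour):

int foo0 (enum v a)
{
  return 1;  /* a > NOK is provably false under -fstrict-enums */
}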


Thanks,
Kugan


Re: anti-ranges of signed variables

2016-11-13 Thread kugan

Hi,

On 12/11/16 06:19, Jakub Jelinek wrote:

On Fri, Nov 11, 2016 at 11:51:34AM -0700, Martin Sebor wrote:

On 11/11/2016 10:53 AM, Richard Biener wrote:

On November 11, 2016 6:34:37 PM GMT+01:00, Martin Sebor wrote:

I noticed that variables of signed integer types that are constrained
to a specific subrange of values of the type like so:

[-TYPE_MAX + N, N]

are reported by get_range_info as the anti-range

[-TYPE_MAX, TYPE_MIN - 1]

for all positive N of the type regardless of the variable's actual
range.  Basically, such variables are treated the same as variables
of the same type that have no range info associated with them at all
(such as function arguments or global variables).

For example, while a signed char variable between -1 and 126 is
represented by

VR_ANTI_RANGE [127, -2]


? I'd expect [-1, 126].  And certainly never range-min > range-max


Okay.  With this code:

  void f (void *d, const void *s, signed char i)
  {
if (i < -1 || 126 < i) i = -1;
__builtin_memcpy (d, s, i);
  }

I see the following in the output of -fdump-tree-vrp:

  prephitmp_11: ~[127, 18446744073709551614]
  ...
  # prephitmp_11 = PHI <_12(3), 18446744073709551615(2)>
  __builtin_memcpy (d_8(D), s_9(D), prephitmp_11);


At some point get_range_info for anti-ranges has been represented
by using min larger than max, but later on some extra bit on SSA_NAME has
been added.  Dunno if the code has been adjusted at that point.


The following commit changed that and removed it:
commit 0c20fe492bc5b8c9259d21dd2dab03ff5155facb
Author: rsandifo 
Date:   Thu Nov 28 16:32:44 2013 +

wide-int version of SSA_NAME_ANTI_ALIAS_P patch.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/wide-int@205491 
138bc75d-0d04-0410-961f-82ee72b054a4



But looking closely:

enum value_range_type { VR_UNDEFINED, VR_RANGE,
VR_ANTI_RANGE, VR_VARYING, VR_LAST };
in set_range_info, we have:
  SSA_NAME_ANTI_RANGE_P (name) = (range_type == VR_ANTI_RANGE);

in get_range_info, we have:
  return SSA_NAME_RANGE_TYPE (name);

I think we should change the get_range_info to:

diff --git a/gcc/tree-ssanames.c b/gcc/tree-ssanames.c
index 913d142..f33b9c0 100644
--- a/gcc/tree-ssanames.c
+++ b/gcc/tree-ssanames.c
@@ -371,7 +371,7 @@ get_range_info (const_tree name, wide_int *min, wide_int *max)


   *min = ri->get_min ();
   *max = ri->get_max ();
-  return SSA_NAME_RANGE_TYPE (name);
+  return SSA_NAME_RANGE_TYPE (name) ? VR_ANTI_RANGE : VR_RANGE;
 }

Is this OK after testing ?

Thanks,
Kugan


Re: anti-ranges of signed variables

2016-11-13 Thread kugan




I think we should change the get_range_info to:

diff --git a/gcc/tree-ssanames.c b/gcc/tree-ssanames.c
index 913d142..f33b9c0 100644
--- a/gcc/tree-ssanames.c
+++ b/gcc/tree-ssanames.c
@@ -371,7 +371,7 @@ get_range_info (const_tree name, wide_int *min, wide_int *max)

*min = ri->get_min ();
*max = ri->get_max ();
-  return SSA_NAME_RANGE_TYPE (name);
+  return SSA_NAME_RANGE_TYPE (name) ? VR_ANTI_RANGE : VR_RANGE;
  }


OK, this is what SSA_NAME_RANGE_TYPE in tree.h is doing.
#define SSA_NAME_RANGE_TYPE(N) \
(SSA_NAME_ANTI_RANGE_P (N) ? VR_ANTI_RANGE : VR_RANGE)

So, we shouldn't do it again. Sorry about the noise.

Kugan



Using particular register class (like floating point registers) as spill register class

2014-05-16 Thread Kugan
I would like to know if there is any way we can use registers from a
particular register class just as spill registers (in places where the
register allocator would normally spill to the stack and nothing more), when
it can be useful.

In AArch64, in some cases, compiling with -mgeneral-regs-only produces
better performance compared to not using it. The difference here is that
when -mgeneral-regs-only is not used, floating point registers are also
used in register allocation. Then IRA/LRA has to move them to core
registers before performing operations, as shown below.

.
fmov    s1, w8 <--
mov     w21, 49622
movk    w21, 0xca62, lsl 16
add     w21, w16, w21
add     w21, w21, w2
eor     w10, w0, w10
add     w10, w21, w10
ror     w8, w7, 27
add     w7, w10, w8
ror     w7, w7, 27
fmov    w0, s1 <--
add     w7, w0, w7
add     w13, w13, w7
fmov    w0, s4 <--
add     w0, w0, w20
fmov    s4, w0 <--
ror     w18, w18, 2
fmov    w0, s2 <--
add     w0, w0, w18
fmov    s2, w0 <--
add     w12, w12, w27
add     w14, w14, w15
mov     w15, w24
fmov    x0, d3 <--
subs    x0, x0, #1
fmov    d3, x0 <--
bne     .L2
fmov    x0, d0 <--

 .

In this case, the costs for allocnos calculated by IRA based on the cost
model supplied by the back-end are like:
a0(r667,l0) costs: GENERAL_REGS:0,0 FP_LO_REGS:3960,3960
FP_REGS:3960,3960 ALL_REGS:3960,3960 MEM:3960,3960

Thus, changing the cost of the floating point register class is not going to
help. If I increase it further, the register allocator will just spill these
live ranges to memory and ignore floating point registers in this case.

Is there any other back-end in gcc that does anything to improve cases
like this, that I can refer to?

Thanks in advance,
Kugan


Re: Using particular register class (like floating point registers) as spill register class

2014-05-16 Thread Kugan


On 16/05/14 20:40, pins...@gmail.com wrote:
> 
> 
>> On May 16, 2014, at 3:23 AM, Kugan  wrote:
>>
>> I would like to know if there is anyway we can use registers from
>> particular register class just as spill registers (in places where
>> register allocator would normally spill to stack and nothing more), when
>> it can be useful.
>>
>> In AArch64, in some cases, compiling with -mgeneral-regs-only produces
>> better performance compared not using it. The difference here is that
>> when -mgeneral-regs-only is not used, floating point register are also
>> used in register allocation. Then IRA/LRA has to move them to core
>> registers before performing operations as shown below.
> 
> Can you show the code with fp register disabled?  Does it use the stack to 
> spill?  Normally this is due to register to register class costs compared to 
> register to memory move cost.  Also I think it depends on the processor 
> rather the target.  For thunder, using the fp registers might actually be 
> better than using the stack depending if the stack was in L1. 
Not all the LDR/STR combinations map to fmov. In the test case I have,

aarch64-none-linux-gnu-gcc sha_dgst.c -O2  -S  -mgeneral-regs-only
grep -c "ldr" sha_dgst.s
50
grep -c "str" sha_dgst.s
42
grep -c "fmov" sha_dgst.s
0

aarch64-none-linux-gnu-gcc sha_dgst.c -O2  -S
grep -c "ldr" sha_dgst.s
42
grep -c "str" sha_dgst.s
31
grep -c "fmov" sha_dgst.s
105

I am not saying that we shouldn't use floating point registers here. But
from the above, it seems like the register allocator is using them more
like core registers (even though the cost model gives them a higher cost)
and then moving the values to core registers before operations. If that
is the case, my question is: how do we make this just a spill register
class, so that we replace ldr/str with an equal number of fmov
instructions when possible?

Thanks,
Kugan


Zero/Sign extension elimination using value ranges

2014-05-19 Thread Kugan

This is based on my earlier patch
https://gcc.gnu.org/ml/gcc-patches/2013-10/msg00452.html. Before I post
the new set of patches, I would like to make sure that I understood the
review comments and that my idea makes sense and is acceptable. Please
let me know if I am missing anything or my assumptions are wrong.

To recap the basic idea: when GIMPLE_ASSIGN stmts are expanded to RTL,
if we can prove that the zero/sign extension to fit the type is redundant,
we can generate RTL without it. For example, when an expression is
evaluated and its value is assigned to a variable of type short, the
generated RTL currently looks similar to (set (reg:SI 110)
(zero_extend:SI (subreg:HI (reg:SI 117) 0))). Using value ranges, if we
can show that the value of the expression which is present in register
117 is within the limits of short and there is no sign conversion, we do
not need to perform the zero_extend.

Cases to handle here are :

1. Handling NOP_EXPR or CONVERT_EXPR that are in the IL because they
are required for type correctness. We have two cases here:

A) Mode is smaller than word_mode. This is usually where the
zero/sign extensions in the final assembly come from.
For example:
int = (int) short
which usually expands to
 (set (reg:SI )
  (sext:SI (subreg:HI (reg:SI 
We can expand this as
 (set (reg:SI ) (((reg:SI 

If the following is true:
1. The values stored in RHS and LHS are of the same signedness.
2. The type can hold the value, i.e., in cases like char = (char) short,
we check that the value in the short is representable in char type
(i.e. look at the value range of the RHS SSA_NAME and see if it can be
represented in the type of the LHS without overflowing).

Subreg here is not a paradoxical subreg. We are removing the subreg and
zero/sign extend here.

I am assuming here that QI/HI registers are represented in SImode
(basically word_mode) and a zero/sign extend is used, as in
(zero_extend:SI (subreg:HI (reg:SI 117)).

B) Mode is larger than word_mode
 long = (long) int

which usually expands to
   (set:DI (sext:DI (reg:SI)))
We would have to expand this as a paradoxical subreg
   (set:DI (subreg:DI (reg:SI)))

I am not sure these cases result in actual zero/sign extensions
being generated; therefore I think we should skip this case altogether.

2. Second are promotions required by the target (PROMOTE_MODE) that do
arithmetic on wider registers, like:

char = char + char

In this case we will have the value ranges of the RHS char1 and char2. We
will have to compute the value range of (char1 + char2) in the promoted
mode (from the value ranges stored in the char1 and char2 SSA_NAMEs) and
see if that value range can be represented in the LHS type.

Once again, if the following is true, we can remove the subreg and
zero/sign extension in the assignment:
1. The values stored in RHS and LHS are of the same signedness.
2. The type can hold the value.

And also, when the LHS is promoted and thus the target is (subreg:XX N),
the RHS has been expanded in XXmode. Depending on the value range, if
mode XX is bigger than word_mode, set this to a paradoxical subreg of
the expanded result. However, since we are only interested in XXmode
smaller than word_mode (that is where most of the final zero/sign
extension asm comes from), we don't have to consider paradoxical
subregs here.
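
A hand-written illustration of case 1 (mine, not from the thread):

/* After the guard, t is known to be in [0, 100], which fits in signed
   char with matching signedness, so the extension normally emitted
   for the narrowing conversion is redundant and the subreg alone
   suffices.  */
signed char narrow (int t)
{
  if (t < 0 || t > 100)
    t = 0;  /* value range of t is now [0, 100] */
  return (signed char) t;
}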

Does this make sense?

Thanks,

Kugan


Re: Zero/Sign extension elimination using value ranges

2014-05-20 Thread Kugan

On 20/05/14 16:52, Jakub Jelinek wrote:
> On Tue, May 20, 2014 at 12:27:31PM +1000, Kugan wrote:
>> 1.  Handling NOP_EXPR or CONVERT_EXPR that are in the IL because they
>> are required for type correctness. We have two cases here:
>>
>> A) Mode is smaller than word_mode. This is usually from where the
>> zero/sign extensions are showing up in final assembly.
>> For example :
>> int = (int) short
>> which usually expands to
>>  (set (reg:SI )
>>   (sext:SI (subreg:HI (reg:SI 
>> We can expand  this
>>  (set (reg:SI ) (((reg:SI 
>>
>> If following is true:
>> 1. Value stored in RHS and LHS are of the same signedness
>> 2. Type can hold the value. i.e., In cases like char = (char) short, we
>> check that the value in short is representable char type. (i.e. look at
>> the value range in RHS SSA_NAME and see if that can be represented in
>> types of LHS without overflowing)
>>
>> Subreg here is not a paradoxical subreg. We are removing the subreg and
>> zero/sign extend here.
>>
>> I am assuming here that QI/HI registers are represented in SImode
>> (basically word_mode) with zero/sign extend is used as in
>> (zero_extend:SI (subreg:HI (reg:SI 117)).
> 
> Wouldn't it be better to just set proper flags on the SUBREG based on value
> range info (SUBREG_PROMOTED_VAR_P and SUBREG_PROMOTED_UNSIGNED_P)?
> Then not only the optimizers could eliminate in zext/sext when possible, but
> all other optimizations could benefit from that.

Thanks for the comments. Here is an attempt (attached) that sets
SUBREG_PROMOTED_VAR_P based on value range info. Is this a good place
to do this?

Thanks,
Kugan
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index b7f6360..d23ae76 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -3120,6 +3120,60 @@ expand_return (tree retval)
 }
 }
 
+
+static bool
+is_assign_promotion_redundant (struct separate_ops *ops)
+{
+  double_int type_min, type_max;
+  double_int min, max;
+  bool uns = TYPE_UNSIGNED (ops->type);
+  double_int msb;
+
+  /* We remove extension for integral stmts.  */
+  if (!INTEGRAL_TYPE_P (ops->type))
+return false;
+
+  if (TREE_CODE_CLASS (ops->code) == tcc_unary)
+{
+  switch (ops->code)
+   {
+   case CONVERT_EXPR:
+   case NOP_EXPR:
+
+ /* Get the value range.  */
+ if (TREE_CODE (ops->op0) != SSA_NAME
+ || POINTER_TYPE_P (TREE_TYPE (ops->op0))
+ || get_range_info (ops->op0, &min, &max) != VR_RANGE)
+   return false;
+
+ msb = double_int_one.rshift (TYPE_PRECISION (TREE_TYPE (ops->op0)));
+ if (!uns && min.cmp (msb, uns) == 1
+ && max.cmp (msb, uns) == 1)
+   {
+ min = min.sext (TYPE_PRECISION (TREE_TYPE (ops->op0)));
+ max = max.sext (TYPE_PRECISION (TREE_TYPE (ops->op0)));
+   }
+
+ /* Signedness of LHS and RHS should match or value range of RHS
+should be all positive values to make zero/sign extension 
redundant.  */
+ if ((uns != TYPE_UNSIGNED (TREE_TYPE (ops->op0)))
+  && (min.cmp (double_int_zero, TYPE_UNSIGNED (TREE_TYPE 
(ops->op0))) == -1))
+   return false;
+
+ type_max = tree_to_double_int (TYPE_MAX_VALUE (ops->type));
+ type_min = tree_to_double_int (TYPE_MIN_VALUE (ops->type));
+
+ /* If rhs value range fits lhs type, zero/sign extension is
+   redundant.  */
+ if (max.cmp (type_max, uns) != 1
+ && (type_min.cmp (min, uns)) != 1)
+   return true;
+   }
+}
+
+  return false;
+}
+
 /* A subroutine of expand_gimple_stmt, expanding one gimple statement
STMT that doesn't require special handling for outgoing edges.  That
is no tailcalls and no GIMPLE_COND.  */
@@ -3240,6 +3294,12 @@ expand_gimple_stmt_1 (gimple stmt)
  }
ops.location = gimple_location (stmt);
 
+   if (promoted && is_assign_promotion_redundant (&ops))
+ {
+   promoted = false;
+   SUBREG_PROMOTED_VAR_P (target) = 0;
+ }
+
/* If we want to use a nontemporal store, force the value to
   register first.  If we store into a promoted register,
   don't directly expand to target.  */


Re: Zero/Sign extension elimination using value ranges

2014-05-22 Thread Kugan
On 21/05/14 17:05, Jakub Jelinek wrote:
> On Wed, May 21, 2014 at 12:53:47PM +1000, Kugan wrote:
>> On 20/05/14 16:52, Jakub Jelinek wrote:
>>> On Tue, May 20, 2014 at 12:27:31PM +1000, Kugan wrote:
>>>> 1.  Handling NOP_EXPR or CONVERT_EXPR that are in the IL because they
>>>> are required for type correctness. We have two cases here:
>>>>
>>>> A) Mode is smaller than word_mode. This is usually from where the
>>>> zero/sign extensions are showing up in final assembly.
>>>> For example :
>>>> int = (int) short
>>>> which usually expands to
>>>>  (set (reg:SI )
>>>>   (sext:SI (subreg:HI (reg:SI 
>>>> We can expand  this
>>>>  (set (reg:SI ) (((reg:SI 
>>>>
>>>> If following is true:
>>>> 1. Value stored in RHS and LHS are of the same signedness
>>>> 2. Type can hold the value. i.e., In cases like char = (char) short, we
>>>> check that the value in short is representable char type. (i.e. look at
>>>> the value range in RHS SSA_NAME and see if that can be represented in
>>>> types of LHS without overflowing)
>>>>
>>>> Subreg here is not a paradoxical subreg. We are removing the subreg and
>>>> zero/sign extend here.
>>>>
>>>> I am assuming here that QI/HI registers are represented in SImode
>>>> (basically word_mode) with zero/sign extend is used as in
>>>> (zero_extend:SI (subreg:HI (reg:SI 117)).
>>>
>>> Wouldn't it be better to just set proper flags on the SUBREG based on value
>>> range info (SUBREG_PROMOTED_VAR_P and SUBREG_PROMOTED_UNSIGNED_P)?
>>> Then not only the optimizers could eliminate in zext/sext when possible, but
>>> all other optimizations could benefit from that.
>>
>> Thanks for the comments. Here is an attempt (attached) that sets
>> SUBREG_PROMOTED_VAR_P based on value range into. Is this the good place
>> to do this ?
> 
> But you aren't setting it in your patch in any way, you are just resetting
> it instead.  The thing is, start with a testcase where you get that
> (subreg:HI (reg:SI)) as the RTL of some SSA_NAME (is that the case on ARM?,
> I believe on e.g. i?86/x86_64 you'd just get (reg:HI) instead and thus you
> can't take advantage of that), and at that point where it is created check
> the range info and if it is properly sign or zero extended, set
> SUBREG_PROMOTED_VAR_P and SUBREG_PROMOTED_UNSIGNED_SET.

Here is another attempt (a quick hack patch is attached). Is this
a reasonable direction? I think I will have to look for other places
where SUBREG_PROMOTED_UNSIGNED_P is used for possible optimisations.
Before that, I want to make sure I am on the right track.

> Note that right now we use 2 bits for the latter, which encode values
> -1 (weirdo pointer extension), 0 (sign extension), 1 (zero extension).
> Perhaps it would be nice to allow encoding value 2 (zero and sign extension)
> for cases where the range info tells you that the value is both zero and
> sign extended (i.e. minimum of range is >= 0 and maximum is <= signed type
> maximum).

Do you suggest changing rtx_def as follows to be
able to store 2 in SUBREG_PROMOTED_UNSIGNED_SET? Probably not.
-  unsigned int unchanging : 1;
+  unsigned int unchanging : 2;


Thanks,
Kugan



diff --git a/gcc/expr.c b/gcc/expr.c
index 2868d9d..15183fa 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -328,7 +328,8 @@ convert_move (rtx to, rtx from, int unsignedp)
   if (GET_CODE (from) == SUBREG && SUBREG_PROMOTED_VAR_P (from)
   && (GET_MODE_PRECISION (GET_MODE (SUBREG_REG (from)))
  >= GET_MODE_PRECISION (to_mode))
-  && SUBREG_PROMOTED_UNSIGNED_P (from) == unsignedp)
+  && (SUBREG_PROMOTED_UNSIGNED_P (from) == 2
+ || SUBREG_PROMOTED_UNSIGNED_P (from) == unsignedp))
 from = gen_lowpart (to_mode, from), from_mode = to_mode;
 
   gcc_assert (GET_CODE (to) != SUBREG || !SUBREG_PROMOTED_VAR_P (to));
@@ -9195,6 +9196,51 @@ expand_expr_real_2 (sepops ops, rtx target, enum 
machine_mode tmode,
 }
 #undef REDUCE_BIT_FIELD
 
+static bool
+is_value_extended (tree lhs, enum machine_mode rhs_mode, bool rhs_uns)
+{
+  wide_int type_min, type_max;
+  wide_int min, max;
+  unsigned int prec;
+  tree lhs_type;
+  bool lhs_uns;
+
+  if (TREE_CODE (lhs) != SSA_NAME)
+return false;
+
+  lhs_type = lang_hooks.types.type_for_mode (rhs_mode, rhs_uns);
+  lhs_uns = TYPE_UNSIGNED (TREE_TYPE (lhs));
+
+  /* We remove extension for integrals.  */
+  if (!INTEGRAL_TYPE_P (TREE_TYPE (lhs)))
+return false;
+
+  /* Get the value range.  */
+  if (POINTER_TYPE_P (TREE_TYPE (lhs))
+  

Re: Zero/Sign extension elimination using value ranges

2014-06-02 Thread Kugan
ED:   \
  _rtx->volatil = 1;\
  _rtx->unchanging = 1; \
  break;\
   }\
 } while (0)


#define SUBREG_PROMOTED_GET(RTX)\
  (2 * ((RTL_FLAG_CHECK1 ("SUBREG_PROMOTED_GET", (RTX), SUBREG))->volatil)\
   + (RTX)->unchanging - 1)

#define SUBREG_PROMOTED_SIGNED_P(RTX)   \
  ((((RTL_FLAG_CHECK1 ("SUBREG_PROMOTED_SIGNED_P", (RTX), SUBREG)->volatil)\
     + (RTX)->unchanging) == 0) ? 0 : ((RTX)->unchanging == 1))

#define SUBREG_PROMOTED_UNSIGNED_P(RTX)\
  ((((RTL_FLAG_CHECK1 ("SUBREG_PROMOTED_UNSIGNED_P", (RTX), SUBREG)->volatil)\
     + (RTX)->unchanging) == 0) ? -1 : ((RTX)->volatil == 1))

#define SUBREG_CHECK_PROMOTED_SIGN(RTX, SIGN) \
 ((SIGN) ? SUBREG_PROMOTED_UNSIGNED_P ((RTX))\
  : SUBREG_PROMOTED_SIGNED_P ((RTX)))

Does this look reasonable?

Thanks,
Kugan
diff --git a/gcc/calls.c b/gcc/calls.c
index 78fe7d8..a1e7468 100644
--- a/gcc/calls.c
+++ b/gcc/calls.c
@@ -1484,8 +1484,11 @@ precompute_arguments (int num_actuals, struct arg_data 
*args)
  args[i].initial_value
= gen_lowpart_SUBREG (mode, args[i].value);
  SUBREG_PROMOTED_VAR_P (args[i].initial_value) = 1;
- SUBREG_PROMOTED_UNSIGNED_SET (args[i].initial_value,
-   args[i].unsignedp);
+
+ if (is_promoted_for_type (args[i].tree_value, mode, 
!args[i].unsignedp))
+   SUBREG_PROMOTED_SET (args[i].initial_value, 
SRP_SIGNED_AND_UNSIGNED);
+ else
+   SUBREG_PROMOTED_SET (args[i].initial_value, args[i].unsignedp);
}
}
 }
@@ -3365,7 +3368,8 @@ expand_call (tree exp, rtx target, int ignore)
 
  target = gen_rtx_SUBREG (TYPE_MODE (type), target, offset);
  SUBREG_PROMOTED_VAR_P (target) = 1;
- SUBREG_PROMOTED_UNSIGNED_SET (target, unsignedp);
+ SUBREG_PROMOTED_SET (target, unsignedp);
+
}
 
   /* If size of args is variable or this was a constructor call for a stack
diff --git a/gcc/expr.c b/gcc/expr.c
index d99bc1e..7a1a2b9 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -328,7 +328,7 @@ convert_move (rtx to, rtx from, int unsignedp)
   if (GET_CODE (from) == SUBREG && SUBREG_PROMOTED_VAR_P (from)
   && (GET_MODE_PRECISION (GET_MODE (SUBREG_REG (from)))
  >= GET_MODE_PRECISION (to_mode))
-  && SUBREG_PROMOTED_UNSIGNED_P (from) == unsignedp)
+  && (SUBREG_CHECK_PROMOTED_SIGN (from, unsignedp)))
 from = gen_lowpart (to_mode, from), from_mode = to_mode;
 
   gcc_assert (GET_CODE (to) != SUBREG || !SUBREG_PROMOTED_VAR_P (to));
@@ -702,7 +702,7 @@ convert_modes (enum machine_mode mode, enum machine_mode 
oldmode, rtx x, int uns
 
   if (GET_CODE (x) == SUBREG && SUBREG_PROMOTED_VAR_P (x)
   && GET_MODE_SIZE (GET_MODE (SUBREG_REG (x))) >= GET_MODE_SIZE (mode)
-  && SUBREG_PROMOTED_UNSIGNED_P (x) == unsignedp)
+  && (SUBREG_CHECK_PROMOTED_SIGN (x, unsignedp)))
 x = gen_lowpart (mode, SUBREG_REG (x));
 
   if (GET_MODE (x) != VOIDmode)
@@ -4375,6 +4375,7 @@ emit_push_insn (rtx x, enum machine_mode mode, tree type, 
rtx size,
 {
   /* Handle calls that pass values in multiple non-contiguous locations.
 The Irix 6 ABI has examples of this.  */
+
   if (GET_CODE (reg) == PARALLEL)
emit_group_load (reg, x, type, -1);
   else
@@ -5201,8 +5202,7 @@ store_expr (tree exp, rtx target, int call_param_p, bool 
nontemporal)
  && GET_MODE_PRECISION (GET_MODE (target))
 == TYPE_PRECISION (TREE_TYPE (exp)))
{
- if (TYPE_UNSIGNED (TREE_TYPE (exp))
- != SUBREG_PROMOTED_UNSIGNED_P (target))
+ if (!(SUBREG_CHECK_PROMOTED_SIGN (target, TYPE_UNSIGNED (TREE_TYPE 
(exp)
{
  /* Some types, e.g. Fortran's logical*4, won't have a signed
 version, so use the mode instead.  */
@@ -9209,6 +9209,52 @@ expand_expr_real_2 (sepops ops, rtx target, enum 
machine_mode tmode,
 }
 #undef REDUCE_BIT_FIELD
 
+/* Return TRUE if value in RHS is already zero/sign extended for lhs type
+   (type here is the combination of LHS_MODE and LHS_UNS) using value range
+   information stored in RHS. Return FALSE otherwise. */
+bool
+is_promoted_for_type (tree rhs, enum machine_mode lhs_mode, bool lhs_uns)
+{
+  wide_int type_min, type_max;
+  wide_int min, max;
+  unsigned int prec;
+  tree lhs_type;
+  bool rhs_uns;
+
+  if (flag_wrapv
+  || (rhs == NULL_TREE)
+  || (TREE_CODE (rhs) != SSA_NAME)
+  || !INTEGRAL_TYPE_P (TREE_TYPE (rhs))
+  || POINTER_TYPE_P (TREE_TYPE (rhs))
+  || (get_range_info (rhs, &min, &max) !

Question about LRA in aarch64_be-none-elf

2014-06-07 Thread Kugan
Hi All,

I am looking at a regression (in aarch64_be-none-elf-gcc with -Og;
test case attached) where a TImode register is assigned two DImode
values and then passed to __multf3 as an argument. When the
intermediate pseudo (TImode) is assigned an FP_REG to hold this value,
the regression shows up. The difference in asm between the working and
non-working versions is below.

fmov    d1, x20
fmov    v1.d[1], x19
+   str     q0, [x29, 64]
+   str     x22, [x29, 64]
fmov    d0, x21
-   fmov    v0.d[1], x22
bl      __multf3


When LRA assigns one of the DImode values to the TImode register, it
spills the TImode register into memory, appends the DImode value, and
then reloads (as shown below in the dump). However, it is not storing it
into the right part of the TImode value, and due to that one of the
moves becomes dead and is removed by dce. If I compile the test case
with -fno-dce, I get the following asm:

fmov    d1, x3
fmov    v1.d[1], x19
str     q0, [x29, 64]
str     x19, [x29, 64]
ldr     q0, [x29, 64]
fmov    d0, x3
bl      __addtf3

What is causing the LRA to generate moves like this?

Thanks,
Kugan



t.c.214r.reload
---
(insn 88 87 133 3 (clobber (reg:TI 32 v0 [orig:108 d+-8 ] [108])) t.c:11 -1
 (nil))

---89, 134 and 91 stores
(insn 133 88 89 3 (set (mem/c:TI (plus:DI (reg/f:DI 29 x29)
(const_int 64 [0x40])) [0 %sfp+-16 S16 A128])
(reg:TI 32 v0 [orig:108 d+-8 ] [108])) t.c:11 37 {*movti_aarch64}
 (nil))
(insn 89 133 134 3 (set (mem/c:DI (plus:DI (reg/f:DI 29 x29)
(const_int 64 [0x40])) [0 %sfp+-16 S8 A128])
(reg:DI 19 x19 [orig:102 d ] [102])) t.c:11 34 {*movdi_aarch64}
 (nil))
(insn 134 89 90 3 (set (reg:TI 32 v0 [orig:108 d+-8 ] [108])
(mem/c:TI (plus:DI (reg/f:DI 29 x29)
(const_int 64 [0x40])) [0 %sfp+-16 S16 A128])) t.c:11 37
{*movti_aarch64}
 (nil))
(insn 90 134 91 3 (set (reg:DI 32 v0 [orig:108 d ] [108])
(reg:DI 20 x20 [orig:105 d+8 ] [105])) t.c:11 34 {*movdi_aarch64}
 (nil))
(insn 91 90 15 3 (set (reg:TF 32 v0)
(reg:TF 32 v0 [orig:108 d+-8 ] [108])) t.c:11 40 {*movtf_aarch64}
 (nil))
(call_insn/u 15 91 129 3 (parallel [
(set (reg:TF 32 v0)
(call (mem:DI (symbol_ref:DI ("__addtf3") [flags 0x41])
[0  S8 A8])
(const_int 0 [0])))
(use (const_int 0 [0]))
(clobber (reg:DI 30 x30))
]) t.c:11 28 {*call_value_symbol}
 (expr_list:REG_EH_REGION (const_int -2147483648 [0x8000])
(nil))
(expr_list (use (reg:TF 33 v1))
(expr_list (use (reg:TF 32 v0))
(nil



t.c.228r.cprop_hardreg
--
(insn 88 174 133 3 (clobber (reg:TI 32 v0 [orig:108 d+-8 ] [108])) t.c:11 -1
 (nil))
(insn 133 88 89 3 (set (mem/c:TI (plus:DI (reg/f:DI 29 x29)
(const_int 64 [0x40])) [0 %sfp+-16 S16 A128])
(reg:TI 32 v0 [orig:108 d+-8 ] [108])) t.c:11 37 {*movti_aarch64}
 (expr_list:REG_DEAD (reg:TI 32 v0 [orig:108 d+-8 ] [108])
(nil)))
(insn 89 133 134 3 (set (mem/c:DI (plus:DI (reg/f:DI 29 x29)
(const_int 64 [0x40])) [0 %sfp+-16 S8 A128])
(reg:DI 19 x19 [orig:102 d ] [102])) t.c:11 34 {*movdi_aarch64}
 (nil))
(insn 134 89 90 3 (set (reg:TI 32 v0 [orig:108 d+-8 ] [108])
(mem/c:TI (plus:DI (reg/f:DI 29 x29)
(const_int 64 [0x40])) [0 %sfp+-16 S16 A128])) t.c:11 37
{*movti_aarch64}
 (expr_list:REG_UNUSED (reg:TI 32 v0 [orig:108 d+-8 ] [108])
(nil)))
(insn 90 134 15 3 (set (reg:DI 32 v0 [orig:108 d ] [108])
(reg:DI 3 x3 [orig:105 d+8 ] [105])) t.c:11 34 {*movdi_aarch64}
 (nil))
(call_insn/u 15 90 175 3 (parallel [
(set (reg:TF 32 v0)
(call (mem:DI (symbol_ref:DI ("__addtf3") [flags 0x41])
[0  S8 A8])
(const_int 0 [0])))
(use (const_int 0 [0]))
(clobber (reg:DI 30 x30))
]) t.c:11 28 {*call_value_symbol}
 (expr_list:REG_DEAD (reg:TF 33 v1)
(expr_list:REG_EH_REGION (const_int -2147483648
[0x8000])
(nil)))
(expr_list (use (reg:TF 33 v1))
(expr_list (use (reg:TF 32 v0))
(nil



t.c.229r.rtl_dce
---
(insn 88 174 133 3 (clobber (reg:TI 32 v0 [orig:108 d+-8 ] [108])) t.c:11 -1
 (nil))
(insn 133 88 89 3 (set (mem/c:TI (plus:DI (reg/f:DI 29 x29)
(const_int 64 [0x40])) [0 %sfp+-16 S16 A128])
(reg:TI 32 v0 [orig:108 d+-8 ] [108])) t.c:11 37 {*movti_aarch64}
 (expr_list:REG_DEAD (reg:TI 32 v0 [orig:108 d+-8 ] [108])
(nil)))
(insn 89 133 90 3 (set (mem/c:DI (plus:DI (reg/f:DI 29 x29)
(const_int 64 [0x40])) [0 %sfp+-16 S8 A128])
(reg:DI 19 x19 [orig:102 d ] [102])) t.c:11 34 {*movdi_aarch64}
 (nil))
(insn 90 89 15 3 (set (reg:DI 32 v

reg_equiv_mem and reg_equiv_address are NULL for true_regnum == -1

2014-10-01 Thread Kugan
Hi All,

I am looking at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62254.

Here, in arm_reload_in, for a REG_P (ref) which has true_regnum (ref) ==
-1, both reg_equiv_mem and reg_equiv_address for the register are NULL.
Can this happen?

Thanks,
Kugan


Re: A Question About LRA/reload

2014-12-09 Thread Kugan
On 09/12/14 20:37, lin zuojian wrote:
> Hi,
> I have read the ira/lra code for a while, but I still fail to understand
> their relationship. The main question is why does ira do coloring so
> early? The lra pass will do the assignment anyway. Sorry if I mix up
> coloring and hard register assignment, but I think it's better to get
> the job done after lra elimination, inheritance, ...


IRA does the register allocation and LRA matches insn constraints.
Therefore IRA has to do the coloring. LRA, in the process of matching
constraints, may change some of these assignments. Please look at the
following links for more info.

https://ols.fedoraproject.org/GCC/Reprints-2007/makarov-reprint.pdf
https://gcc.gnu.org/wiki/cauldron2012?action=AttachFile&do=get&target=Local_Register_Allocator_Project_Detail.pdf


Thanks,
Kugan



Re: A Question About LRA/reload

2014-12-09 Thread Kugan
On 09/12/14 21:14, lin zuojian wrote:
> Hi Kugan,
> I have read these pdfs. My question is: LRA will change the insns, so
> why bother doing the coloring so early? Changing the insns can
> generate new pseudo registers, so they need to be re-assigned. Is that
> correct?

Hi,

IRA's job here is register allocation and LRA's job is matching the
constraints. For example, LRA might have to reload a value into a
different register class to match a constraint. To do that, LRA will
need a free register from a certain register class. In order to get that
free register, LRA might have to change IRA's allocation decision.

LRA needs the register allocation (that is, the coloring info) and the
spilled pseudo information to see if the constraints can be matched. It
iteratively has to change the insns till all the constraints are
matched. To get all the details you have to look at the code.

Thanks,
Kugan




Re: issue with placing includes in gcc-plugin.h

2015-01-14 Thread Kugan
On 14/01/15 21:24, Prathamesh Kulkarni wrote:
> On 14 January 2015 at 14:37, Richard Biener  wrote:
>> On Wed, 14 Jan 2015, Prathamesh Kulkarni wrote:
>>
>>> Hi,
>>> I am having an issue with placing includes of expr.h in gcc-plugin.h.
>>> rtl.h is required to be included before expr.h, so I put it in gcc-plugin.h.
>>> However the front-ends then fail to build because rtl.h is not allowed
>>> in front-ends,
>>> and the front-ends include gcc-plugin.h (via plugin.h).
>>>
>>> For instance ada/gcc-interface/misc.c failed to build with following error:
>>> In file included from ../../gcc/gcc/gcc-plugin.h:64:0,
>>>  from ../../gcc/gcc/plugin.h:23,
>>>  from ../../gcc/gcc/ada/gcc-interface/misc.c:53:
>>> ../../gcc/gcc/rtl.h:20:9: error: attempt to use poisoned "GCC_RTL_H"
>>>
>>> However rtl.h is required to be included before expr.h, so we cannot skip
>>> including rtl.h in gcc-plugin.h. How do we get around this ?
>>> As a temporary hack, could we #undef IN_GCC_FRONTEND in gcc-plugin.h ?
>>> java/builtins.c does this to include expr.h.
>>
>> Err - obviously nothing in GCC itself should include gcc-plugin.h,
>> only plugins should.  Do we tell plugins that they should include
>> plugin.h?!  Why is the include in there?
>>
>> I'd simply remove it
> That doesn't work.
> For instance removing plugin.h include from c/c-decl.h resulted in
> following build errors:
> ../../gcc/gcc/c/c-decl.c: In function 'void finish_decl(tree,
> location_t, tree, tree, tree)':
> ../../gcc/gcc/c/c-decl.c:4990:27: error:
> 'PLUGIN_FINISH_DECL' was not declared in this scope
> ../../gcc/gcc/c/c-decl.c:4990:51: error:
> 'invoke_plugin_callbacks' was not declared in this scope
> ../../gcc/gcc/c/c-decl.c: In function 'void finish_function()':
> ../../gcc/gcc/c/c-decl.c:9009:29: error:
> 'PLUGIN_PRE_GENERICIZE' was not declared in this scope
> ../../gcc/gcc/c/c-decl.c:9009:58: error:
> 'invoke_plugin_callbacks' was not declared in this scope
> make[3]: *** [c/c-decl.o] Error 1
> make[2]: *** [all-stage1-gcc] Error 2
> make[1]: *** [stage1-bubble] Error 2
> make: *** [all] Error 2
> 
> Why do the front-ends need to include plugin.h?


The C/C++ front end has callbacks to process declarations. Please
look at https://gcc.gnu.org/ml/gcc-patches/2010-04/msg00780.html, which
added the callback PLUGIN_FINISH_DECL.
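
For reference, this is the kind of consumer that needs those symbols: a
minimal hand-written plugin sketch using the PLUGIN_FINISH_DECL callback
introduced by that patch (version checks and error handling omitted):

#include "gcc-plugin.h"
#include "plugin.h"

int plugin_is_GPL_compatible;

/* Invoked by the C/C++ front end for every finished declaration;
   gcc_data is the declaration tree.  */
static void
on_finish_decl (void *gcc_data, void *user_data)
{
}

int
plugin_init (struct plugin_name_args *info,
             struct plugin_gcc_version *version)
{
  register_callback (info->base_name, PLUGIN_FINISH_DECL,
                     on_finish_decl, NULL);
  return 0;
}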

Thanks,
Kugan



loop_latch_edge is NULL during jump threading

2015-03-01 Thread Kugan
In the linaro-4.9 branch, with the following (reduced) test case, I run
into a situation where loop_latch_edge is NULL during jump threading. I
am wondering whether this is possible during jump threading or the error
lies somewhere else? I can't reproduce it with trunk.

int a;
fn1() {
  enum { UQSTRING, SQSTRING, QSTRING } b = UQSTRING;
  while (1)
switch (a) {
case '\'':
  b = QSTRING;
default:
  switch (b)
  case UQSTRING:
  return;
  b = SQSTRING;
}
}

x.c:2:1: internal compiler error: Segmentation fault
 fn1() {
 ^
0x83694f crash_signal

/home/kugan.vivekanandarajah/work/sources/gcc-fsf/linaro/gcc/toplev.c:337
0x96d8a8 thread_block_1

/home/kugan.vivekanandarajah/work/sources/gcc-fsf/linaro/gcc/tree-ssa-threadupdate.c:797
0x96da3e thread_block

/home/kugan.vivekanandarajah/work/sources/gcc-fsf/linaro/gcc/tree-ssa-threadupdate.c:941
0x96e59c thread_through_all_blocks(bool)

/home/kugan.vivekanandarajah/work/sources/gcc-fsf/linaro/gcc/tree-ssa-threadupdate.c:1866
0x9d77e9 finalize_jump_threads

/home/kugan.vivekanandarajah/work/sources/gcc-fsf/linaro/gcc/tree-vrp.c:9709
0x9d77e9 execute_vrp

/home/kugan.vivekanandarajah/work/sources/gcc-fsf/linaro/gcc/tree-vrp.c:9864
0x9d77e9 execute

/home/kugan.vivekanandarajah/work/sources/gcc-fsf/linaro/gcc/tree-vrp.c:9938
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.

If I apply the following patch, the segfault goes away. Is this the right approach?

diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index d1b289f..0bcef35 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -794,6 +794,8 @@ thread_block_1 (basic_block bb, bool noloop_only,
bool joiners)
   if (loop->header == bb)
 {
   e = loop_latch_edge (loop);
+  if (!e)
+   return false;
   vec<jump_thread_edge *> *path = THREAD_PATH (e);

   if (path
@@ -1114,6 +1116,8 @@ thread_through_loop_header (struct loop *loop,
bool may_peel_loop_headers)
   basic_block tgt_bb, atgt_bb;
   enum bb_dom_status domst;

+  if (!latch)
+return false;
   /* We have already threaded through headers to exits, so all the
threading
  requests now are to the inside of the loop.  We need to avoid creating
  irreducible regions (i.e., loops with more than one entry block), and


Thanks,
Kugan


Re: loop_latch_edge is NULL during jump threading

2015-03-01 Thread Kugan
On 02/03/15 15:29, Jeff Law wrote:
> On 03/01/15 16:32, Kugan wrote:
>> In linaro-4.9-branch, with the following (reduced) test case, I run into
>> a situation where loop_latch_edge is NULL during jump threading. I am
>> wondering if this  a possible during jump threading or the error lies
>> some where else? I can't reproduce it with the trunk.
> There's really no way to tell without a lot more information. If you
> can't reproduce on the 4.9 branch or the trunk, then you're likely going
> to have to do the real digging.
> 
> THe first thing I tend to do with these things is to draw the CFG and
> annotate it with all the jump threading paths.  Then I look at how the
> jump threading paths interact with each other and the loop structure,
> then reconcile that with the constraints placed on threading in
> tree-ssa-threadupdate.c.

Thanks Jeff. I will do the same.

Kugan



Re: Combine changes ASHIFT into mult for non-MEM rtx

2015-04-02 Thread Kugan
On 02/04/15 20:39, Bin.Cheng wrote:
> Hi,
> In function make_compound_operation, the code/comment says:
> 
> case ASHIFT:
>   /* Convert shifts by constants into multiplications if inside
>  an address.  */
>   if (in_code == MEM && CONST_INT_P (XEXP (x, 1))
>   && INTVAL (XEXP (x, 1)) < HOST_BITS_PER_WIDE_INT
>   && INTVAL (XEXP (x, 1)) >= 0
>   && SCALAR_INT_MODE_P (mode))
> {
> 
> 
> Right now, it changes ASHIFT in any SET into mult because of below code:
> 
>   /* Select the code to be used in recursive calls.  Once we are inside an
>  address, we stay there.  If we have a comparison, set to COMPARE,
>  but once inside, go back to our default of SET.  */
> 
>   next_code = (code == MEM ? MEM
>: ((code == PLUS || code == MINUS)
>   && SCALAR_INT_MODE_P (mode)) ? MEM // <bogus?
>: ((code == COMPARE || COMPARISON_P (x))
>   && XEXP (x, 1) == const0_rtx) ? COMPARE
>: in_code == COMPARE ? SET : in_code);
> 
> This seems an oversight to me.  The effect is that all targets have to
> support the generated expression in the corresponding pattern.  Sometimes
> the generated expression is just too stupid and gets missed.  For
> example, the below insn is tried by combine:
> (set (reg:SI 79 [ D.2709 ])
> (plus:SI (subreg:SI (sign_extract:DI (mult:DI (reg:DI 1 x1 [ i ])
> (const_int 2 [0x2]))
> (const_int 17 [0x11])
> (const_int 0 [0])) 0)
> (reg:SI 0 x0 [ a ])))
> 
> 
> It actually equals to
> (set (reg/i:SI 0 x0)
> (plus:SI (ashift:SI (sign_extend:SI (reg:HI 1 x1 [ i ]))
> (const_int 1 [0x1]))
> (reg:SI 0 x0 [ a ])))
> 
> equals to below instruction on AARCH64:
> addw0, w0, w1, sxth 1
> 
> 
> Because of the existing comment, also because it will make backend
> easier (I suppose), is it reasonable to fix this behavior in
> combine.c?  Another question is, if we are going to make the change,
> how many targets might be affected?
> 

I think https://gcc.gnu.org/ml/gcc-patches/2015-01/msg01020.html is
related to this.


Thanks,
kugan


Re: optimization question

2015-05-18 Thread Kugan

On 19/05/15 12:58, mark maule wrote:
> Thank you for taking a look Martin.  I will attempt to pare this down,
> provide a sample with typedefs/macros expanded, etc.  and repost to
> gcc-help.  To address a couple of your points:


If you haven’t already, you can have a look at
https://gcc.gnu.org/wiki/A_guide_to_testcase_reduction

There are some examples and techniques there for creating a reduced
test case that reproduces the problem.


Thanks,
Kugan


Re: LTO crashes with fortran code in SPEC CPU 2006

2017-01-15 Thread kugan



On 15/01/17 15:57, Andrew Pinski wrote:

This is just an FYI until I reduce the testcases, but 5 benchmarks
in SPEC CPU 2006 with fortran code are causing an ICE on
aarch64-linux-gnu with -Ofast -flto -mcpu=thunderx2t99
-fno-aggressive-loop-optimizations -funroll-loops:
lto1: internal compiler error: in ipa_get_type, at ipa-prop.h:448
0x107c58f ipa_get_type
../../gcc/gcc/ipa-prop.h:448
0x107c58f propagate_constants_across_call
../../gcc/gcc/ipa-cp.c:2259
0x1080f4f propagate_constants_topo
../../gcc/gcc/ipa-cp.c:3170
0x1080f4f ipcp_propagate_stage
../../gcc/gcc/ipa-cp.c:3267
0x1081fcb ipcp_driver
../../gcc/gcc/ipa-cp.c:4997
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.
lto-wrapper: fatal error: gfortran returned 1 exit status

I don't know when this started as I am just starting to run SPEC CPU
2006 fp side with my spec cpu 2006 config.


I am seeing this too for aarch64 with -O3 -flto. It did work a few weeks 
back. This must be a new bug.


Thanks,
Kugan




Thanks,
Andrew



[PR43721] Failure to optimise (a/b) and (a%b) into single call

2013-06-16 Thread Kugan

Hi,

I am attempting to fix Bug 43721 - Failure to optimise (a/b) and (a%b) 
into a single __aeabi_idivmod call 
(http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43721).
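
For reference, the motivating pattern (my paraphrase of the PR):

/* On ARM, both operations below map to the same __aeabi_idivmod
   libcall, which returns quotient and remainder together; today two
   calls are emitted, and the goal is to CSE them into one.  */
void div_and_mod (int a, int b, int *q, int *r)
{
  *q = a / b;
  *r = a % b;
}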


The execute_cse_sincos tree-level pass does similar CSE, so I attempted 
to use a similar approach here. Div/mod CSE is not really using built-in 
functions at this level, though.


For the case of div and mod operations, after CSE is performed, there 
isn't a way to represent the resulting statement in gimple. We will end 
up with divmod taking two arguments and returning double the size of one 
argument in the three-address format (divmod returns remainder and 
quotient, so the return type is double the size of the argument type).


Since GIMPLE_ASSIGN would result in a type checking failure in this case, 
I attempted to use built-in functions (GIMPLE_CALL) to represent the 
runtime library call. The name of the function here is target specific 
and can be obtained from sdivmod_optab, so the builtin function name 
defined at tree level is not used. I am not entirely sure this is the 
right approach, so I am attaching a first cut of the patch to get your 
feedback and understand the right approach to this problem.




Thank you,
Kugan

diff --git a/gcc/builtin-types.def b/gcc/builtin-types.def
index 2634ecc..21c483a 100644
--- a/gcc/builtin-types.def
+++ b/gcc/builtin-types.def
@@ -250,6 +250,10 @@ DEF_FUNCTION_TYPE_2 (BT_FN_INT_CONST_STRING_FILEPTR,
 		 BT_INT, BT_CONST_STRING, BT_FILEPTR)
 DEF_FUNCTION_TYPE_2 (BT_FN_INT_INT_FILEPTR,
 		 BT_INT, BT_INT, BT_FILEPTR)
+DEF_FUNCTION_TYPE_2 (BT_FN_LONGLONG_INT_INT,
+		 BT_LONGLONG, BT_INT, BT_INT)
+DEF_FUNCTION_TYPE_2 (BT_FN_ULONGLONG_UINT_UINT,
+		 BT_ULONGLONG, BT_UINT, BT_UINT)
 DEF_FUNCTION_TYPE_2 (BT_FN_VOID_PTRMODE_PTR,
 		 BT_VOID, BT_PTRMODE, BT_PTR)
 DEF_FUNCTION_TYPE_2 (BT_FN_VOID_PTR_PTRMODE,
diff --git a/gcc/builtins.c b/gcc/builtins.c
index 402bb1f..1cae2bb 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -1876,7 +1876,9 @@ mathfn_built_in_1 (tree type, enum built_in_function fn, bool implicit_p)
   CASE_MATHFN (BUILT_IN_Y0)
   CASE_MATHFN (BUILT_IN_Y1)
   CASE_MATHFN (BUILT_IN_YN)
-
+  case BUILT_IN_DIVMOD:
+  case BUILT_IN_UDIVMOD:
+return builtin_decl_explicit (fn);
   default:
 	return NULL_TREE;
   }
@@ -2449,6 +2451,57 @@ expand_builtin_interclass_mathfn (tree exp, rtx target)
   return NULL_RTX;
 }
 
+/* Expand a call to the builtin divmod function to
+   library call.  */
+static rtx
+expand_builtin_divmod (tree exp, rtx target)
+{
+  rtx op0, op1;
+  enum machine_mode mode;
+  tree arg0, arg1;
+  rtx libval;
+  rtx libfunc;
+  rtx insns;
+  bool is_unsigned;
+
+  arg0 = CALL_EXPR_ARG (exp, 0);
+  arg1 = CALL_EXPR_ARG (exp, 1);
+
+  mode = TYPE_MODE (TREE_TYPE (arg0));
+  is_unsigned = TYPE_UNSIGNED (TREE_TYPE (arg0));
+
+  /* Get the libcall.  */
+  libfunc = optab_libfunc (is_unsigned ? udivmod_optab : sdivmod_optab, mode);
+  gcc_assert (libfunc);
+
+  op0 = expand_normal (arg0);
+  op1 = expand_normal (arg1);
+
+  if (MEM_P (op0))
+op0 = force_reg (mode, op0);
+  if (MEM_P (op1))
+op1 = force_reg (mode, op1);
+
+  /* The value returned by the library function will have twice as
+ many bits as the nominal MODE.  */
+  machine_mode libval_mode
+= smallest_mode_for_size (2 * GET_MODE_BITSIZE (mode),
+  MODE_INT);
+  start_sequence ();
+  libval = emit_library_call_value (libfunc, NULL_RTX, LCT_CONST,
+libval_mode, 2,
+op0, mode,
+op1, mode);
+  insns = get_insns ();
+  end_sequence ();
+  /* Move into the desired location.  */
+  if (target != const0_rtx)
+emit_libcall_block (insns, target, libval,
+gen_rtx_fmt_ee (is_unsigned ?  UMOD : MOD, mode, op0, op1));
+
+  return target;
+}
+
 /* Expand a call to the builtin sincos math function.
Return NULL_RTX if a normal call should be emitted rather than expanding the
function in-line.  EXP is the expression that is a call to the builtin
@@ -5977,6 +6030,13 @@ expand_builtin (tree exp, rtx target, rtx subtarget, enum machine_mode mode,
 	return target;
   break;
 
+case BUILT_IN_DIVMOD:
+case BUILT_IN_UDIVMOD:
+  target = expand_builtin_divmod (exp, target);
+  if (target)
+return target;
+  break;
+
 CASE_FLT_FN (BUILT_IN_SINCOS):
   if (! flag_unsafe_math_optimizations)
 	break;
diff --git a/gcc/builtins.def b/gcc/builtins.def
index 91879a6..7664700 100644
--- a/gcc/builtins.def
+++ b/gcc/builtins.def
@@ -599,6 +599,8 @@ DEF_C99_BUILTIN(BUILT_IN_VSCANF, "vscanf", BT_FN_INT_CONST_STRING_VALIST
 DEF_C99_BUILTIN(BUILT_IN_VSNPRINTF, "vsnprintf", BT_FN_INT_STRING_SIZE_CONST_STRING_VALIST_ARG, ATTR_FORMAT_PRINTF_NOTHROW_3_0)
 DEF_LIB_BUILTIN(BUILT_IN_VSPRINTF, "vsprintf", BT_FN_INT_STRING_CONST_STRING_VALIST_ARG, 

Re: [PR43721] Failure to optimise (a/b) and (a%b) into single call

2013-06-20 Thread Kugan

On 17/06/13 19:07, Richard Biener wrote:

On Mon, 17 Jun 2013, Kugan wrote:


Hi,

I am attempting to fix Bug 43721 - Failure to optimise (a/b) and (a%b) into
single __aeabi_idivmod call
(http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43721)

execute_cse_sincos tree level pass does similar cse so I attempted to use
similar approach here. Div/mod cse is not really using built-in functions
though at this level.


The issue with performing the transform at the same time as we
transform SINCOS is that the vectorizer will now no longer be able
to vectorize these loops.  It would need to be taught how to
handle the builtin calls (basically undo the transformation, I don't
know of any ISA that can do vectorized combined div/mod).  Which
means it should rather be done at the point we CSE reciprocals
(which also replaces computes with builtin target function calls).

Thanks Richard. Since execute_cse_reciprocals is handling reciprocals 
only, I added another pass to do divmod. Is that OK?



For the case of div and mod operations, after CSE is performed, there isn't a
way to represent the resulting statement in gimple. We will end up with divmod
taking two arguments and returning double the size of one argument in the
three-address format (divmod returns remainder and quotient, so the return
type is double the size of the argument type).

Since GIMPLE_ASSIGN would result in a type checking failure in this case, I
attempted to use built-in functions (GIMPLE_CALL to represent the runtime
library call). The name of the function here is target specific and can be
obtained from sdivmod_optab, so the builtin function name defined at tree
level is not used. I am not entirely sure this is the right approach, so I am
attaching the first cut of the patch to get your feedback and understand the
right approach to this problem.


If we don't want to expose new builtins to the user (I'm not sure we
want that), then using "internal functions" is an easier way to
avoid these issues (see gimple.h and internal-fn.(def|h)).


I have now changed it to use internal functions. Thanks for that.


Generally the transform looks useful to me as it moves forward with
the general idea of moving pattern recognition done during RTL expansion
to an earlier place.

For the use of a larger integer type and shifts to represent
the modulo and division result I don't think that's the very best
idea.  Instead resorting to a complex integer type as return
value looks more appealing (similar to sincos using cexpi here).
That way you also avoid the ugly hard-coding of bit-sizes.



I have changed it to use complex integers now.
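
To illustrate the representation (a hand-written sketch, not actual pass
output; I am assuming the quotient ends up in the real part and the
remainder in the imaginary part of the complex result):

  _1 = DIVMOD (a_2(D), b_3(D));   /* complex int result */
  q_4 = REALPART_EXPR <_1>;
  r_5 = IMAGPART_EXPR <_1>;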


+  if (HAVE_divsi3
+   || (GET_MODE_BITSIZE (TYPE_MODE (type)) != 32)

watch out for types whose TYPE_PRECISION is not the bitsize
of their mode.  Also it should be GET_MODE_PRECISION here.

+   || !optab_libfunc (TYPE_UNSIGNED (type) ? udivmod_optab : sdivmod_optab,
+		      TYPE_MODE (type)))

targets that use a libfunc should also get this optimization, as
it always removes computations.  I think the proper test is
for whether the target can do division and/or modulus without
using a libfunc, not whether there is a divmod optab/libfunc.

I guess the best way to do this is by defining a target hook and letting
the target define the required behaviour. Is that what you had in mind?
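
For a target like ARM, the hook definition could look roughly like this
(a sketch only: the hook name is from the patch, but the body is purely
illustrative and would need tuning per target):

static bool
arm_combine_divmod (enum machine_mode mode)
{
  /* Combine div/mod when we would be using the __aeabi_idivmod family
     of libcalls for MODE anyway.  */
  return mode == SImode && !TARGET_IDIV;
}

#undef TARGET_COMBINE_DIVMOD
#define TARGET_COMBINE_DIVMOD arm_combine_divmod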


I have attached a modified patch with these changes.


Others knowing this piece of the compiler better may want to comment
here, of course.

Thanks,
Richard.





Thanks,
Kugan
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index f030b56..3fae80e 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11375,3 +11375,8 @@ It returns true if the target supports GNU indirect functions.
 The support includes the assembler, linker and dynamic linker.
 The default value of this hook is based on target's libc.
 @end deftypefn
+
+@deftypefn {Target Hook} bool TARGET_COMBINE_DIVMOD (enum machine_mode @var{mode})
+This target hook returns @code{true} if the target provides divmod libcall operation for the machine mode @var{mode} and must be used to combine integer division and modulus operations. Return @code{false} otherwise.
+@end deftypefn
+
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index cc25fec..12974b1 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -11198,3 +11198,6 @@ memory model bits are allowed.
 @hook TARGET_ATOMIC_TEST_AND_SET_TRUEVAL
 
 @hook TARGET_HAS_IFUNC_P
+
+@hook TARGET_COMBINE_DIVMOD
+
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index b841abd..0db06f1 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -61,6 +61,62 @@ get_multi_vector_move (tree array_type, convert_optab optab)
   return icode;
 }
 
+/* Expand DIVMOD call STMT.  */
+static void
+expand_DIVMOD (gimple stmt)
+{
+  tree type, lhs, arg0, arg1;
+  rtx op0, op1, res0, res1, target;
+  enum machine_mode mode, compute_mode;
+  rtx libval;
+  rtx libfunc = NULL_RTX;
+  bool is_unsigned;
+
+  lhs = gimple_ca

Re: On-Demand range technology [1/5] - Executive Summary

2019-06-19 Thread Kugan Vivekanandarajah
Hi Andrew,

Thanks for working on this.

The "Enable elimination of zext/sext with VRP" patch had to be reverted
(https://gcc.gnu.org/ml/gcc-patches/2014-09/msg00672.html) due to the
need for value ranges in PROMOTED_MODE precision for at least one test
case on alpha.

Playing with the ranger suggests that it is not possible to get value
ranges in PROMOTED_MODE precision on demand. Or is there any way we
can use the on-demand ranger here?

Thanks,
Kugan

On Thu, 23 May 2019 at 11:28, Andrew MacLeod  wrote:
>
> Now that stage 1 has reopened, I’d like to reopen a discussion about the
> technology and experiences we have from the Ranger project I brought up
> last year. https://gcc.gnu.org/ml/gcc/2018-05/msg00288.html .  (The
> original wiki pages are now out of date, and I will work on updating
> them soon.)
>
> The Ranger is designed to evaluate ranges on-demand rather than through
> a top-down approach. This means you can ask for a range from anywhere,
> and it walks back thru the IL satisfying any preconditions and doing the
> required calculations.  It utilizes a cache to avoid re-doing work. If
> ranges are processed in a forward dominator order, it’s not much
> different than what we do today. Due to its nature, the order you
> process things in has minimal impact on the overall time… You can do it
> in reverse dominator order and get similar times.
>
> It requires no outside preconditions (such as dominators) to work, and
> has a very simple API… Simply query the range of an ssa_name at any
> point in the IL and all the details are taken care of.
>
> We have spent much of the past 6 months refining the prototype (branch
> “ssa-range”) and adjusting it to share as much code with VRP as
> possible. They are currently using a common code base for extracting
> ranges from statements, as well as simplifying statements.
>
> The Ranger deals with just ranges. The other aspects of VRP are
> intended to be follow on work that integrates tightly with it, but are
> also independent and would be available for other passes to use.  These
> include:
> - Equivalency tracking
> - Relational processing
> - Bitmask tracking
>
> We have implemented a VRP pass that duplicates the functionality of EVRP
> (other than the bits mentioned above), as well as converted a few other
> passes to use the interface.. I do not anticipate those missing bits
> having a significant impact on the results.
>
> The prototype branch is quite stable and can successfully build and test
> an entire Fedora distribution (9174 packages). There is an issue with
> switches I will discuss later whereby the constant range of a switch
> edge is not readily available and is exponentially expensive to
> calculate. We have a design to address that problem, and in the common
> case we are about 20% faster than EVRP is.
>
> When utilized in passes which only require ranges for a small number of
> ssa-names we see significant improvements.  The sprintf warning pass for
> instance allows us to remove the calculations of dominators and the
> resulting forced walk order. We see a 95% speedup (yes, 1/20th of the
> overall time!).  This is primarily due to no additional overhead and
> only calculating the few things that are actually needed.  The walloca
> and wrestrict passes are a similar model, but as they have not been
> converted to use EVRP ranges yet, we don’t see similar speedups there.
>
> That is the executive summary.  I will go into more details of each
> major thing mentioned in follow on notes so that comments and
> discussions can focus on one thing at a time.
>
> We think this approach is very solid and has many significant benefits
> to GCC. We’d like to address any concerns you may have, and work towards
> finding a way to integrate this model with the code base during this
> stage 1.
>
> Comments and feedback always welcome!
> Thanks
> Andrew


Re: On-Demand range technology [1/5] - Executive Summary

2019-06-21 Thread Kugan Vivekanandarajah
Hi Andrew,

Thanks for looking into it and my apologies for not being clear.

My proposal was to use value ranges when expanding gimple to RTL and
eliminate redundant zero/sign extensions. I.e., if we know from our VR
analysis that the value generated by some gimple operation is already in
the (zero/sign) extended form, we could skip the SUBREG and
ZERO/SIGN_EXTEND (or set SRP_SIGNED_AND_UNSIGNED and the like).
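
A small illustration (hypothetical, not taken from the failing test):

unsigned int
f (unsigned char c)
{
  unsigned char t = c >> 1;  /* VR analysis proves t is [0, 127].  */
  return t;  /* On a PROMOTE_MODE target the promoted register already
		holds t zero-extended, so the zero_extend emitted at
		expansion time is redundant.  */
}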

However, the problem is that RTL operations are done in PROMOTE_MODE
precision while the gimple value range is in the natural type. This can
be a problem when the type wraps (and shows up mainly for targets where
PROMOTE_MODE is DImode, like alpha).

For example, as Uros pointed out with the reverted patch, for alpha-linux we had

FAIL: libgomp.fortran/simd7.f90   -O2  execution test
FAIL: libgomp.fortran/simd7.f90   -Os  execution test

The reason being that values wrap, and in the VR calculation we only
record the type precision (which is what matters for gimple), but in
order to eliminate the zero/sign extension we need the full precision
of the PROMOTE_MODE.

Extract from the testcase failing:
_343 = ivtmp.179_52 + 2147483645; [0x80000004, 0x80000043]
_344 = _343 * 2; [0x8, 0x86]
_345 = (integer(kind=4)) _344; [0x8, 0x86]

With the above VR of [0x8, 0x86] (which in promoted precision is
[0x100000008, 0x100000086]), my patch was setting
SRP_SIGNED_AND_UNSIGNED, which was wrong and caused the error
(eliminating an extension which was not redundant). If we had the VR
in promoted precision, the patch would be correct and could be used to
eliminate redundant zero/sign extensions.

Please let me know if my explanation is not clear and I will show it
with more examples.

Thanks,
Kugan


On Fri, 21 Jun 2019 at 23:27, Andrew MacLeod  wrote:
>
> On 6/19/19 11:04 PM, Kugan Vivekanandarajah wrote:
>
> Hi Andrew,
>
> Thanks for working on this.
>
> The "Enable elimination of zext/sext with VRP" patch had to be reverted
> (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg00672.html) due to the
> need for value ranges in PROMOTED_MODE precision for at least one test
> case on alpha.
>
> Playing with the ranger suggests that it is not possible to get value
> ranges in PROMOTED_MODE precision on demand. Or is there any way we
> can use the on-demand ranger here?
>
> Thanks,
> Kugan
>
>
>
> I took a look at the thread, but I think I'm still missing some context.
>
> Let's go back to the beginning.  What is an example of the case we aren't
> getting that you want to get?
>
> I'm going to guess to start :-)
>
> short foo(unsigned char c)
> {
>   c = c & (unsigned char)0x0F;
>   if( c > 7 )
>     return((short)(c - 5));
>   else
>     return(( short )c);
> }
>
>
>
> A run of this thru the ranger shows me code that looks like (on x86 anyway):
>
> === BB 2 ===
> c_4(D)  [0, 255] unsigned char
> <bb 2> :
> c_5 = c_4(D) & 15;
> _9 = c_4(D) & 8;
> if (_9 != 0)
>   goto <bb 3>; [INV]
> else
>   goto <bb 4>; [INV]
>
> c_5 : [0, 15] unsigned char
> _9 : [0, 0][8, 8] unsigned char
> 2->3  (T)*c_5 : [0, 15] unsigned char
> 2->3  (T) _9 :  [8, 8] unsigned char
> 2->4  (F)*c_5 : [0, 15] unsigned char
> 2->4  (F) _9 :  [0, 0] unsigned char
>
> === BB 3 ===
> c_5 [0, 15] unsigned char
> <bb 3> :
> _1 = (unsigned short) c_5;
> _2 = _1 + 65531;
> _7 = (short int) _2;
> // predicted unlikely by early return (on trees) predictor.
> goto <bb 5>; [INV]
>
> _1 : [0, 15] unsigned short
> _2 : [0, 10][65531, 65535] unsigned short
> _7 : [-5, 10] short int
>
> === BB 4 ===
> c_5 [0, 15] unsigned char
> <bb 4> :
> _6 = (short int) c_5;
> // predicted unlikely by early return (on trees) predictor.
>
>
> I think I see.  We aren't adjusting the range of c_5 going into blocks 3 and
> 4.  It's obvious from the original source where the code says > 7, but once
> it's been "bitmasked" that info becomes obfuscated.
>
> If you were to see a range in bb3 of c_5 = [8,15], and a range in bb4 of c_5
> = [0,7], would that solve your problem?
>
> So in bb3, we'd then see ranges that look like:
>
> _1 : [8, 15] unsigned short
> _2 : [3, 10] unsigned short
> _7 : [3, 10] short int
>
> and then later on we'd see there is no negative/wrap value, and you
> could just drop the extension then?
>
> SO.
>
> Yes, this is fixable, but is it easy? :-)
>
> We're in the process of replacing the range extraction code with the 
> range-ops/gori-computes components from the ranger.  This is the part which
> figures ranges out from individual statements on exit to a block.
>
> We have implemented mostly the same func

Re: Duplicating loops and virtual phis

2017-05-21 Thread Kugan Vivekanandarajah
Hi Bin and Steve,

On 17 May 2017 at 19:41, Bin.Cheng  wrote:
> On Mon, May 15, 2017 at 7:32 PM, Richard Biener
>  wrote:
>> On May 15, 2017 6:56:53 PM GMT+02:00, Steve Ellcey <sell...@cavium.com> wrote:
>>>On Sat, 2017-05-13 at 08:18 +0200, Richard Biener wrote:
>>>> On May 12, 2017 10:42:34 PM GMT+02:00, Steve Ellcey <sell...@cavium.com> wrote:
>>>> >
>>>> > (Short version of this email, is there a way to recalculate/rebuild
>>>> > virtual
>>>> > phi nodes after modifying the CFG.)
>>>> >
>>>> > I have a question about duplicating loops and virtual phi nodes.
>>>> > I am trying to implement the following optimization as a pass:
>>>> >
>>>> > Transform:
>>>> >
>>>> >   for (i = 0; i < n; i++) {
>>>> >A[i] = A[i] + B[i];
>>>> >C[i] = C[i-1] + D[i];
>>>> >   }
>>>> >
>>>> > Into:
>>>> >
>>>> >   if (noalias between A&B, A&C, A&D)
>>>> >for (i = 0; i < 100; i++)
>>>> >A[i] = A[i] + B[i];
>>>> >for (i = 0; i < 100; i++)
>>>> >C[i] = C[i-1] + D[i];
>>>> >   else
>>>> >for (i = 0; i < 100; i++) {
>>>> >A[i] = A[i] + B[i];
>>>> >C[i] = C[i-1] + D[i];
>>>> >}
>>>> >
>>>> > Right now the vectorizer sees that 'C[i] = C[i-1] + D[i];' cannot be
>>>> > vectorized so it gives up and does not vectorize the loop.  If we split
>>>> > up the loop into two loops then the vector add with A[i] could be
>>>> > vectorized even if the one with C[i] could not.
>>>> Loop distribution does this transform but it doesn't know about
>>>> versioning for unknown dependences.
>>>>
>>>
>>>Yes, I looked at loop distribution.  But it only works with global
>>>arrays and not with pointer arguments where it doesn't know the size of
>>>the array being pointed at.  I would like to be able to have it work
>>>with pointer arguments.  If I call a function with 2 or
>>>more integer pointers, and I have a loop that accesses them with
>>>offsets between 0 and N where N is loop invariant then I should have
>>>enough information (at runtime) to determine if there are overlapping
>>>memory accesses through the pointers and determine whether or not I can
>>>distribute the loop.
>>
>> Not sure where you got that from. Loop distribution works with our data 
>> reference / dependence analysis.  The cost model might be more restricted 
>> but that can be fixed.
>>
>>>The loop splitting code seemed like a better template since it already
>>>knows how to split a loop based on a runtime determined condition. That
>>>part seems to be working for me, it is when I try to
>>>distribute/duplicate one of those loops (under the unaliased condition)
>>>that I am running into the problem with virtual PHIs.
>>
>> There's mark_virtual*for_renaming (sp?).
>>
>> But as said you are performing loop distribution so please enhance the 
>> existing pass rather than writing a new one.
> I happen to be working on loop distribution now (if I guess correctly,
> to get hmmer fixed).  So far my idea is to fuse the finest distributed
> loop in two passes, in the first pass, we merge all SCCs due to "true"
> data dependence; in the second one we identify all SCCs and break
> them on dependent edges due to possible alias.  Breaking SCCs with
> a minimal edge set can be modeled as the feedback arc set problem, which is
> NP-hard. Fortunately the problem is small in our case and there are
> approximation algorithms.  OTOH, we should also improve loop
> distribution/fusion to maximize parallelism / minimize
> synchronization, as well as maximize data locality, but I think this
> is not needed to get hmmer vectorized.

I am also looking into vectorizing the hmmer loop. Glad to know you are
also looking at this problem, and I am looking forward to seeing the patches.

I have some experimental patches where I add the data references that
need runtime checking to a list:

static int
pg_add_dependence_edges (struct graph *rdg, vec<loop_p> loops, int dir,
                         vec<data_reference_p> drs1,
-                        vec<data_reference_p> drs2)
+                        vec<data_reference_p> drs2,
+                        vec<ddr_p> &ddrs,
+                        bool runtime_alias_check)

Then I am versioning the main loop based on the condition generated
from the runtime check. I have borrowed the logic from the vectorizer
(like pruning the list and generating the condition). I have neither
verified nor benchmarked it enough yet.
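
The versioning condition boils down to a pairwise overlap test, much
like the vectorizer's runtime alias check; in source form it is roughly
(a sketch for one pair of data references a and c, each accessing n
elements; the helper names are placeholders):

  if ((uintptr_t) (a + n) <= (uintptr_t) c
      || (uintptr_t) (c + n) <= (uintptr_t) a)
    run_distributed_loops ();  /* no overlap: vectorizable versions */
  else
    run_original_loop ();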

As I understand it, we should also have some form of cost model where we
can look at the data access patterns and decide whether the distributed
loops can be vectorized.

The cost model in similar_memory_accesses also needs to be relaxed based
on the ability to vectorize the distributed loops.

Thanks,
Kugan


> Thanks,
> bin
>>
>> Richard.
>>
>>>Steve Ellcey
>>>sell...@cavium.com
>>


Loop reversal

2017-07-12 Thread Kugan Vivekanandarajah
I am looking into reversing loops for increased efficiency. There is
already PR22041 for this and an old patch
https://gcc.gnu.org/ml/gcc-patches/2006-01/msg01851.html by Zdenek
which never made it to mainline.

For a constant loop count, the ivcanon pass adds a reverse IV, but it is
not selected by ivopts.

For example:

void copy (unsigned int N, double *a, double *c)
{
  for (int i = 0; i < 800; ++i)
    c[i] = a[i];
}
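
The reversed form I would like to end up with is (illustrative only):

void copy (unsigned int N, double *a, double *c)
{
  for (int i = 799; i >= 0; --i)
    c[i] = a[i];
}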

The ivcanon pass reports "Added canonical iv to loop 1, 799 iterations":

ivtmp_14 = ivtmp_15 - 1;

In ivopts, it selects candidate 10:

Candidate 10:
Var befor: ivtmp.11
Var after: ivtmp.11
Incr POS: before exit test
IV struct:
Type: sizetype
Base: 0
Step: 8
Biv: N

If we look at the groups:

Group 0:
Type: ADDRESS
Use 0.0:
At stmt: _5 = *_3;
At pos: *_3
IV struct:
Type: double *
Base: a_9(D)
Step: 8
Object: (void *) a_9(D)
Biv: N
Overflowness wrto loop niter: Overflow

Group 1:
Type: ADDRESS
Use 1.0:
At stmt: *_4 = _5;
At pos: *_4
IV struct:
Type: double *
Base: c_10(D)
Step: 8
Object: (void *) c_10(D)
Biv: N
Overflowness wrto loop niter: Overflow

Group 2:
Type: COMPARE
Use 2.0:
At stmt: if (ivtmp_14 != 0)
At pos: ivtmp_14
IV struct:
Type: unsigned int
Base: 799
Step: 4294967295
Biv: Y
Overflowness wrto loop niter: Overflow

The ivopts cost model assumes that groups 0 and 1 will have infinite cost
for the IV added by the ivcanon pass, because of its lower precision.

If I change the example to:

void copy (unsigned int N, double *a, double *c)
{
  for (long i = 0; i < 800; ++i)
    c[i] = a[i];
}

It still has a higher cost for groups 0 and 1 due to the negative step. I
think this can be improved. My questions are:

1. For the case where the loop count is not constant, can we make
ivcanon add a reverse IV with the current implementation? Can ivopts
be taught to select the reverse IV?

2. Or is the patch by Zdenek a better option? I am re-basing it for trunk.

Thanks,
Kugan


Re: [RFC] type promotion pass

2017-09-17 Thread Kugan Vivekanandarajah
Hi Richard,

On 16 September 2017 at 06:12, Richard Biener
 wrote:
> On September 15, 2017 6:56:04 PM GMT+02:00, Jeff Law  wrote:
>>On 09/15/2017 10:19 AM, Segher Boessenkool wrote:
>>> On Fri, Sep 15, 2017 at 09:18:23AM -0600, Jeff Law wrote:
>>>> WORD_REGISTER_OPERATIONS works with PROMOTE_MODE.  The reason you can't
>>>> define WORD_REGISTER_OPERATIONS on aarch64 is that the implicit
>>>> promotion is sometimes to 32 bits and sometimes to 64 bits.
>>>> WORD_REGISTER_OPERATIONS can't really describe that.
>>>
>>> WORD_REGISTER_OPERATIONS isn't well-defined.
>>>
>>> """
>>> @defmac WORD_REGISTER_OPERATIONS
>>> Define this macro to 1 if operations between registers with integral
>>mode
>>> smaller than a word are always performed on the entire register.
>>> Most RISC machines have this property and most CISC machines do not.
>>> @end defmac
>>> """
>>>
>>> Exactly what operations?  For almost all targets it isn't true for *all*
>>> operations.  Or no targets even, if you include rotate, etc.
>>>
>>> For targets that have both 32-bit and 64-bit operations it is never true
>>> either.
>>>
>>>> And I'm also keen on doing something with type promotion -- Kai did some
>>>> work in this space years ago which I found interesting, even if the work
>>>> didn't go forward.  It showed a real weakness.  So I'm certainly
>>>> interested in looking at Prathamesh's work -- with the caveat that if it
>>>> stumbles across the same issues as Kai's work it likely wouldn't be
>>>> acceptable in its current form.
>>>
>>> Doing type promotion too aggressively reduces code quality.  "Just" find
>>> a sweet spot :-)
>>>
>>> Example: on Power, an AND of QImode with 0xc3 is just one insn, which
>>> actually does a SImode AND with 0xffffffc3.  This is what we do
>>> currently.  A SImode AND with 0x000000c3 is two insns, or one if we
>>> allow it to write to CR0 as well ("andi."); same for DImode, except
>>> there isn't a way to do an AND with 0xffffffffffffffc3 in one insn
>>> at all.
>>>
>>> unsigned char a;
>>> void f(void) { a &= 0xc3; };
>>Yes, these are some of the things we kicked around.  One of the most
>>interesting conclusions was that for these target issues we'd really
>>like a target.pd file to handle this class of transformations just
>>prior
>>to rtl expansion.
>>
>>Essentially early type promotion/demotion would be concerned with cases
>>where we can eliminate operations in a target independent manner and
>>narrow operands as much as possible.  Late promotion/demotion would deal
>>with stuff like the target's desire to work on specific sized hunks in
>>specific contexts.
>>
>>I'm greatly oversimplifying here.  Type promotion/demotion is fairly
>>complex to get right.
>
> I always thought we should start with those promotions that are done by RTL 
> expansion according to PROMOTE_MODE and friends. The complication is that 
> those promotions also apply to function calls and arguments and those are 
> difficult to break apart from other ABI specific details.
>
> IIRC the last time we went over this patch I concluded a better first step 
> would be to expose call ABI details on GIMPLE much earlier. But I may 
> misremember here.

I think this would be very useful. Some of the regressions in type
promotion come from parameters/return values. The ABI in some cases
guarantees that they are properly extended, but during type promotion
we promote (or extend) them again, leading to additional extensions.
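
For illustration (hypothetical, assuming a target whose calling
convention already extends narrow arguments):

short g (short x);

int
f (short x)
{
  /* The ABI delivers x properly extended, but after type promotion we
     emit another sign extension of the incoming parameter.  */
  return g (x);
}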

We might also need some way of having gimple statements that can
convert (or promote to the type without extensions) just to keep the
gimple type system happy.

Thanks,
Kugan

>
> Basically we couldn't really apply all promotions RTL expansion applies. One 
> of my ideas with doing them early also was to simplify RTL expansion and 
> especially promotion issues during SSA coalescing.
>
> Richard.
>
>>jeff
>


Re: [RFC] type promotion pass

2017-09-18 Thread Kugan Vivekanandarajah
Hi Steve,

On 19 September 2017 at 05:45, Steve Ellcey  wrote:
> On Mon, 2017-09-18 at 23:29 +0530, Prathamesh Kulkarni wrote:
>>
>> Hi Steve,
>> The patch is currently based on r249469. I will rebase it on ToT and
>> look into the build failure.
>> Thanks for pointing it out.
>>
>> Regards,
>> Prathamesh
>
> OK, I applied it to that version successfully.  The thing I wanted to
> check was to see if this helped with PR target/77729.  It does not,
> so I think even with this patch we would need my patch to address the
> issue of having GCC recognize that ldrb/ldhb zero out the top of a
> register and thus we do not need to mask it out later.
>
> https://gcc.gnu.org/ml/gcc-patches/2017-09/msg00929.html

I tried the testcases you have in the patch with type promotion. It looks
like forwprop is reversing the promotion there. I haven't looked at it in
detail yet, but -fno-tree-forwprop seems to remove 6 "and"s from the test
case. I have a slightly different version to what Prathamesh has posted
and hope that there isn't any difference here.

Thanks,
Kugan


Handling prefetcher tag collisions while allocating registers

2017-10-23 Thread Kugan Vivekanandarajah
Hi All,

I am wondering if there is any way we can prefer certain registers in
register allocation. That is, I want to have some way of recording
register allocation decisions (for loads in a loop that are accessed in
steps) and use this to influence the register allocation of other loads
(again, loads that are accessed in steps).

This is for architectures (like Falkor AArch64) that use hardware
prefetchers which use signatures of the loads to lock onto streams and
tune prefetching parameters. Ideally, if the loads are from the same
stream, they should have the same signature, and if they are from
different streams, they should have different signatures. The
destination, base register and offset are used in the signature;
therefore, selecting a different register can influence this.
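
As a concrete picture (illustrative only; the function is hypothetical
and the real tag computation is more involved):

int
sum_two_streams (const int *A, const int *B, int n)
{
  int s = 0;
  /* With destination/base/offset feeding the prefetcher tag, we want
     the allocator to pick destination registers for the A[i] and B[i]
     loads whose tag bits differ, since they are different streams.  */
  for (int i = 0; i < n; i++)
    s += A[i] + B[i];
  return s;
}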

In LLVM, this is implemented as a machine-specific pass that runs
after register allocation. It inserts mov instructions with
scratch registers to manage this. We could do a machine reorg pass in
gcc, but detecting strided loads at that stage is not easy.

I am trying to implement this in gcc and wondering what is the
preferred and acceptable way to implement this. Any thoughts?

Thanks,
Kugan


Re: Handling prefetcher tag collisions while allocating registers

2017-10-24 Thread Kugan Vivekanandarajah
Hi Bin,

On 24 October 2017 at 18:29, Bin.Cheng  wrote:
> On Tue, Oct 24, 2017 at 12:44 AM, Kugan Vivekanandarajah
>  wrote:
>> Hi All,
>>
>> I am wondering if there is any way we can prefer certain registers in
>> register allocation. That is, I want to have some way of recording
>> register allocation decisions (for loads in a loop that are accessed in
>> steps) and use this to influence the register allocation of other loads
>> (again, loads that are accessed in steps).
>>
>> This is for architectures (like Falkor AArch64) that use hardware
>> prefetchers which use signatures of the loads to lock onto streams and
>> tune prefetching parameters. Ideally, if the loads are from the same
>> stream, they should have the same signature, and if they are from
>> different streams, they should have different signatures. The
>> destination, base register and offset are used in the signature;
>> therefore, selecting a different register can influence this.
> I wonder why the destination register is used in the signature.  In an
> extreme case, a load in a loop can be unrolled and then allocated to
> different dest registers.  Forcing the same dest register could be too
> restrictive.

My description is very simplified. The signature is based on part of the
register number; thus, two registers can have the same signature. What we
don't want is collisions between loads from two different memory streams.
So this is not an issue.

Thanks,
Kugan

>
> Thanks,
> bin
>
>>
>> In LLVM, this is implemented as a machine-specific pass that runs
>> after register allocation. It inserts mov instructions with
>> scratch registers to manage this. We could do a machine reorg pass in
>> gcc, but detecting strided loads at that stage is not easy.
>>
>> I am trying to implement this in gcc and wondering what is the
>> preferred and acceptable way to implement this. Any thoughts?
>>
>> Thanks,
>> Kugan


Re: Global analysis of RTL

2017-10-25 Thread Kugan Vivekanandarajah
Hi,

On 26 October 2017 at 14:13, R0b0t1  wrote:
> On Thu, Oct 19, 2017 at 8:46 AM, Geoff Wozniak  wrote:
>> R0b0t1  writes:
>>>
>>> When I first looked at the GCC codebase, it seemed to me that most
>>> operations should be done on the GIMPLE representation as it contains the
>>> most information. Is there any reason you gravitated towards RTL?
>>
>>
>> Naiveté, really.
>>
>> My team and I didn’t know much about the code base when we started looking
>> at the problem, although we knew a little about the intermediate formats.
>> GIMPLE makes the analysis more complicated, although not impossible, and it
>> can make the cost model difficult to pin down. Raw assembly/machine code is
>> ideal, but then we have to deal with different platforms and would likely
>> have to do all the work in the linker. RTL is sufficiently low-level enough
>> (as far as we know) to start counting instructions, and platform independent
>> enough that we don’t have to parse machine code.
>>
>> Essentially, working with RTL makes the implementation a little easier but
>> we didn’t know that the pass infrastructure wasn’t in our favour.
>>
>> It’s likely we’ll turn our attention to GIMPLE and assembler/machine code,
>> unless we can come up with something (or anyone has a suggestion).
>>
>
> Admittedly I do not know much about compiler design, but your response
> has put some of what I read about analysis of RTL into context. It is
> hard to be sure, but I think analysis of RTL has fallen out of
> favor and has been replaced with the analysis of intermediate
> languages. For example, compare clang and llvm's operation.

It is not really being replaced (at least I am not aware of it). It is
true that more and more of the high-level optimisations are moved to
gimple. When we move from a high-level intermediate format to a
lower-level one, we tend to lose some information and get closer to the
machine representation. An obvious example is that sign is not
represented in RTL. Moreover, after reload, RTL has a one-to-one mapping
to actual machine instructions (i.e. it is even closer to asm). In
short, gcc goes from GENERIC to GIMPLE to RTL as statements are lowered
from high-level languages to asm.
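
For instance, at the RTL level sign lives in the operator rather than in
the mode (an illustrative pair of insns, not from a real dump):

(set (reg:SI 100) (div:SI (reg:SI 101) (reg:SI 102)))   ;; signed /
(set (reg:SI 100) (udiv:SI (reg:SI 101) (reg:SI 102)))  ;; unsigned /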

Thanks,
Kugan

>
> The missing link is that you seem to be right about cost calculation.
> Cost calculation is difficult for high level operations. Would online
> analysis of the produced machine code be sufficient? That seems to be
> a popular solution from what I have read.
>
> Thanks for the response, and best of luck to you.
>
> Cheers,
>  R0b0t1.


Re: Problems in IPA passes

2017-10-29 Thread Kugan Vivekanandarajah
Hi Jeff,

On 28 October 2017 at 18:28, Jeff Law  wrote:
>
> Jan,
>
> What's the purpose behind calling vrp_meet and
> extract_range_from_unary_expr from within the IPA passes?

This is used when we have an argument to a function for which we know
the VR, and this in turn is passed as a parameter to another function.
For example:

void foo (int i)
{
  ...
  bar (unary_op (i));
  ...
}

This is mainly to share what is done in tree-vrp.
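
Concretely (an illustration of the kind of propagation, not output from
the pass):

extern void bar (int);

void foo (int i)
{
  if (i > 0)
    bar (-i);  /* VR of i on this path is [1, INT_MAX], so
		  extract_range_from_unary_expr gives the argument of
		  bar the range [-INT_MAX, -1] at this call site.  */
}
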
>
> AFAICT that is not safe to do.  Various paths through those routines
> will access static objects within tree-vrp.c which may not be
> initialized when IPA runs (vrp_equiv_obstack, vr_value).

IPA-VRP does not track equivalence and vr_value is not used.

Thanks,
Kugan

>
> While this seems to be working today, it's a failure waiting to happen.
>
> Is there any way you can avoid using those routines?  I can't believe
> you really need all the complexity of those routines, particularly
> extract_range_from_unary_expr.  Plus it's just downright fugly from a
> modularity standpoint.
>
>
> ?
>
> Jeff


Re: Problems in IPA passes

2017-10-31 Thread Kugan Vivekanandarajah
Hi Jeff,

On 31 October 2017 at 14:47, Jeff Law  wrote:
> On 10/29/2017 03:54 PM, Kugan Vivekanandarajah wrote:
>> Hi Jeff,
>>
>> On 28 October 2017 at 18:28, Jeff Law  wrote:
>>>
>>> Jan,
>>>
>>> What's the purpose behind calling vrp_meet and
>>> extract_range_from_unary_expr from within the IPA passes?
>>
>> This is used when we have an argument to a function for which we know
>> the VR, and this in turn is passed as a parameter to another function.
>> For example:
>>
>> void foo (int i)
>> {
>>   ...
>>   bar (unary_op (i));
>>   ...
>> }
>>
>> This is mainly to share what is done in tree-vrp.
> Presumably you never have equivalences or anything like that, which
> probably helps with not touching vrp_bitmap_obstack which isn't
> initialized when you run the IPA bits.
>
>>>
>>> AFAICT that is not safe to do.  Various paths through those routines
>>> will access static objects within tree-vrp.c which may not be
>>> initialized when IPA runs (vrp_equiv_obstack, vr_value).
>>
>> IPA-VRP does not track equivalence and vr_value is not used.
> But there's no enforcement and I'd be hard pressed to believe that all
> the paths through the routines you use in tree-vrp aren't going to touch
> vr_value, or vrp_bitmap_obstack.  vrp_bitmap_obstack turns out to be
> incredibly tangled into the implementations within tree-vrp.c :(
>

I looked into the usage and it does not seem to be using vr_value,
unless I am missing something. There are two overloaded functions
here:

extract_range_from_unary_expr (value_range *vr,
			       enum tree_code code, tree type,
			       value_range *vr0_, tree op0_type)

is safe, as it works with value_range and does not use
get_value_range to access vr_value.

extract_range_from_unary_expr (value_range *vr, enum tree_code code,
			       tree type, tree op0)

is not safe, as it takes a tree as an argument and gets the
value_range by calling get_value_range. Maybe we should change the
names to reflect this.

Thanks,
Kugan


Re: poly_uint64 / TYPE_VECTOR_SUBPARTS question

2018-02-08 Thread Kugan Vivekanandarajah
Hi,

On 9 February 2018 at 09:08, Steve Ellcey  wrote:
> I have a question about the poly_uint64 type and the TYPE_VECTOR_SUBPARTS
> macro.  I am trying to copy some code from i386.c into my aarch64 build
> that is basically:
>
> int n;
> n = TYPE_VECTOR_SUBPARTS (type_out);
>
> And it is not compiling for me, I get:
>
> /home/sellcey/gcc-vectmath/src/gcc/gcc/config/aarch64/aarch64-builtins.c:1504:37:
>  error: cannot convert ‘poly_uint64’ {aka ‘poly_int<2, long unsigned int>’} 
> to ‘int’ in assignment
>n = TYPE_VECTOR_SUBPARTS (type_out);

AFAIK, you could use to_constant () if it is known to be a compile-time constant.
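
A minimal sketch, assuming the number of subparts is known at compile
time (true for fixed-length vector modes):

  poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (type_out);
  int n = nunits.to_constant ();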

Thanks,
Kugan
>
> My first thought was that I was missing a header file but I put
> all the header includes that are in i386.c into aarch64-builtins.c
> and it still does not compile.  It works on the i386 side.  It looks
> like poly-int.h and poly-int-types.h are included by coretypes.h
> and I include that header file so I don't understand why this isn't
> compiling and what I am missing.  Any help?
>
> Steve Ellcey
> sell...@cavium.com


Generating gimple assign stmt that changes sign

2018-05-21 Thread Kugan Vivekanandarajah
Hi,

I am looking to introduce ABSU_EXPR, which would create:

unsigned short res = ABSU_EXPR (short);

Note that the argument is signed and the result is unsigned. As per the
review, I have a match.pd entry to generate this as:

(simplify (abs (convert @0))
 (if (ANY_INTEGRAL_TYPE_P (TREE_TYPE (@0)))
  (convert (absu @0))))


Now when gimplifying the converted tree, how do we tell that ABSU_EXPR
takes a signed argument and returns unsigned? I will have other match.pd
entries, so this will be generated in gimple passes too. Should I
add new functions in gimple.[h|c] for this?

Are there any examples I can refer to? Conversion expressions seem to
be the only place where the sign can change in a gimple assignment, but
they are very specific.
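
For reference, the kind of verifier check I have in mind is roughly this
(a sketch; the exact location and predicates are still to be confirmed):

    case ABSU_EXPR:
      /* The result must be unsigned integral, the operand signed
         integral, and both must have the same precision.  */
      if (!ANY_INTEGRAL_TYPE_P (rhs1_type)
          || !TYPE_UNSIGNED (lhs_type)
          || TYPE_PRECISION (lhs_type) != TYPE_PRECISION (rhs1_type))
        {
          error ("invalid types for ABSU_EXPR");
          return true;
        }
      break;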

Thanks,
Kugan


Re: Generating gimple assign stmt that changes sign

2018-05-21 Thread Kugan Vivekanandarajah
Hi Jeff,

Thanks for the prompt reply.

On 22 May 2018 at 09:10, Jeff Law  wrote:
> On 05/21/2018 04:50 PM, Kugan Vivekanandarajah wrote:
>> Hi,
>>
>> I am looking to introduce ABSU_EXPR, which would create:
>>
>> unsigned short res = ABSU_EXPR (short);
>>
>> Note that the argument is signed and the result is unsigned. As per the
>> review, I have a match.pd entry to generate this as:
>>
>> (simplify (abs (convert @0))
>>  (if (ANY_INTEGRAL_TYPE_P (TREE_TYPE (@0)))
>>   (convert (absu @0))))
>>
>> Now when gimplifying the converted tree, how do we tell that ABSU_EXPR
>> takes a signed argument and returns unsigned? I will have other match.pd
>> entries, so this will be generated in gimple passes too. Should I
>> add new functions in gimple.[h|c] for this?
>>
>> Are there any examples I can refer to? Conversion expressions seem to
>> be the only place where the sign can change in a gimple assignment, but
>> they are very specific.
> What's the value in representing ABSU vs a standard ABS followed by a
> conversion?

It is based on PR https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64946.
Specifically, comment 13.

>
> You'll certainly want to do verification of the type signedness in the
> gimple verifier.
I am doing it and it is failing now.

>
> In general the source and destination types have to be the same.
> Conversions are the obvious exception.  There's a few other nodes that
> have more complex type rules (MEM_REF, COND_EXPR and a few others).  But
> I don't think they're necessarily going to be helpful.


Thanks,
Kugan
>
> jeff


Sched1 stability issue

2018-07-03 Thread Kugan Vivekanandarajah
Hi,

We noticed a difference in the code generated for aarch64 by gcc 7.2
hosted on Linux vs mingw. AFAIK, we are supposed to produce the same
output.

For the testcase we have (quite large, and I am trying to reduce it), the
difference comes from the sched1 pass. If I disable sched1, the difference
goes away.

Is this a known issue? Attached is the sched1 dump snippet showing the
difference.

Thanks,
Kugan


 verify found no changes in insn with uid = 41.
 starting the processing of deferred insns
 ending the processing of deferred insns
 df_analyze called

 Pass 0 for finding pseudo/allocno costs


   r84 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:2
FP_REGS:2 ALL_REGS:2 MEM:8000
   r83 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:2
FP_REGS:2 ALL_REGS:2 MEM:8000
   r80 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:1
FP_REGS:1 ALL_REGS:1 MEM:8000
   r79 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:4000
FP_REGS:4000 ALL_REGS:1 MEM:8000
   r78 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:4000
FP_REGS:4000 ALL_REGS:1 MEM:8000
   r77 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:9000
FP_REGS:9000 ALL_REGS:1 MEM:8000


 Pass 1 for finding pseudo/allocno costs

 r86: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r85: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r84: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r83: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r82: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r81: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r80: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r79: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r78: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r77: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r76: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r75: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r74: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r73: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r72: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r71: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r70: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r69: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r68: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r67: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS

   r84 costs: GENERAL_REGS:0 FP_LO_REGS:2 FP_REGS:2
ALL_REGS:2 MEM:8000
   r83 costs: GENERAL_REGS:0 FP_LO_REGS:2 FP_REGS:2
ALL_REGS:2 MEM:8000
   r80 costs: GENERAL_REGS:0 FP_LO_REGS:1 FP_REGS:1
ALL_REGS:1 MEM:8000
   r79 costs: GENERAL_REGS:0 FP_LO_REGS:1 FP_REGS:1
ALL_REGS:1 MEM:8000
   r78 costs: GENERAL_REGS:0 FP_LO_REGS:1 FP_REGS:1
ALL_REGS:1 MEM:8000
   r77 costs: GENERAL_REGS:0 FP_LO_REGS:1 FP_REGS:1
ALL_REGS:1 MEM:8000

 ;;   ==
 ;;   -- basic block 2 from 3 to 48 -- before reload
 ;;   ==

 ;;  0--> b  0: i  24 r77=ap-0x40
:cortex_a53_slot_any:GENERAL_REGS+1(1)FP_REGS+0(0)
 ;;  0--> b  0: i  26 r78=0xffc8
:cortex_a53_slot_any:GENERAL_REGS+1(1)FP_REGS+0(0)
 ;;  1--> b  0: i  25 [sfp-0x10]=r77
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:@GENERAL_REGS+0(-1)@FP_REGS+0(0)
--
-;;  1--> b  0: i   9 [ap-0x8]=x7
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:@GENERAL_REGS+0(-1)@FP_REGS+0(0)
--
-;;  2--> b  0: i  22 [sfp-0x20]=ap
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:GENERAL_REGS+0(0)FP_REGS+0(0)
+;;  1--> b  0: i  22 [sfp-0x20]=ap
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:@GENERAL_REGS+0(0)@FP_REGS+0(0)
 ;;  2--> b  0: i  23 [sfp-0x18]=ap
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:GENERAL_REGS+0(0)FP_REGS+0(0)
-;;  3--> b  0: i  27 [sfp-0x8]=r78
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:GENERAL_REGS+0(-1)FP_REGS+0(0)
+;;  2--> b  0: i  27 [sfp-0x8]=r78
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:GENERAL_REGS+0(-1)FP_REGS+0(0)
 ;;  3--> b  0: i  28 r79=0xff80
:cortex_a53_slot_any:GENERAL_REGS+1(1)FP_REGS+0(0)
-;;  4--> b  0: i  10 [ap-0xc0]=v0
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:@GENERAL_REGS+0(0)@FP_REGS+0(-1)
+;;  3--> b  0: i  10 [ap-0xc0]=v0
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:@GENERAL_REGS+0(0)@FP_REGS+0(-1)
 ;;